abstract:d0730adc5590978d.tex

1: \begin{abstract}

2: Off-policy learning is powerful for reinforcement learning.

3: However, the high variance of off-policy evaluation is a critical challenge, which causes off-policy learning falls into an uncontrolled instability.

4: In this paper, for reducing the variance, we introduce control variate technique to $\mathtt{Expected}$ $\mathtt{Sarsa}$($\lambda$) and propose a tabular $\mathtt{ES}$($\lambda$)-$\mathtt{CV}$ algorithm.

5: We prove that if a proper estimator of value function reaches, the proposed $\mathtt{ES}$($\lambda$)-$\mathtt{CV}$ enjoys a lower variance than $\mathtt{Expected}$ $\mathtt{Sarsa}$($\lambda$).

6: Furthermore, to extend $\mathtt{ES}$($\lambda$)-$\mathtt{CV}$ to be a convergent algorithm with linear function approximation,

7: we propose the $\mathtt{GES}$($\lambda$) algorithm under the convex-concave saddle-point formulation.

8: We prove that the convergence rate of $\mathtt{GES}$($\lambda$) achieves $\mathcal{O}(1/T)$, which matches or outperforms lots of state-of-art gradient-based algorithms, but we use a more relaxed condition.

9: Numerical experiments show that the proposed algorithm performs better with lower variance than several state-of-art gradient-based TD learning algorithms: $\mathtt{GQ}$($\lambda$), $\mathtt{GTB}$($\lambda$) and $\mathtt{ABQ}$($\zeta$).

10: \end{abstract}

11: