1: \begin{abstract}
2: We consider emphatic temporal-difference learning algorithms for policy evaluation in discounted Markov decision processes with finite spaces.
3: Such algorithms were recently proposed by \citet*{SuMW14} as an improved solution to the problem of divergence of off-policy temporal-difference learning with linear function approximation. In this paper we present the first convergence proofs for two emphatic algorithms, ETD($\lambda$) and ELSTD($\lambda$). Under general off-policy conditions, we prove convergence in $L^1$ of the ELSTD($\lambda$) iterates and almost sure convergence of the approximate value functions calculated by both algorithms using a single infinitely long trajectory. Our analysis involves new techniques with applications beyond emphatic algorithms, leading, for example, to the first proof that standard TD($\lambda$) also converges under off-policy training for $\lambda$ sufficiently large.
4: \end{abstract}
5: