\begin{abstract}
Learning the value function of a given policy from data samples is an important problem in Reinforcement Learning.
TD($\lambda$) is a popular class of algorithms for solving this problem.
However, the weights assigned to different $n$-step returns in TD($\lambda$), controlled by the parameter $\lambda$, decrease exponentially with increasing $n$.
In this paper, we present a $\lambda$-schedule procedure that generalizes the TD($\lambda$) algorithm to the case where the parameter $\lambda$ can vary with the time step. This allows flexibility in weight assignment, i.e., the user can specify the weights assigned to different $n$-step returns by choosing a sequence $\{\lambda_t\}_{t \geq 1}$.
Based on this procedure, we propose an on-policy algorithm, TD($\lambda$)-schedule, and two off-policy algorithms, GTD($\lambda$)-schedule and TDC($\lambda$)-schedule.
We provide proofs of almost sure convergence for all three algorithms under a general Markov noise framework.
\end{abstract}