
\begin{abstract}
Learning the value function of a given policy from data samples is an important problem in Reinforcement Learning.
TD($\lambda$) is a popular class of algorithms for solving this problem.
However, the weights assigned to different $n$-step returns in TD($\lambda$), controlled by the parameter $\lambda$, decrease exponentially with increasing $n$.
In this paper, we present a $\lambda$-schedule procedure that generalizes the TD($\lambda$) algorithm to the case where the parameter $\lambda$ can vary with the time step.
This allows flexibility in weight assignment, i.e., the user can specify the weights assigned to different $n$-step returns by choosing a sequence $\{\lambda_t\}_{t \geq 1}$.
Based on this procedure, we propose an on-policy algorithm, TD($\lambda$)-schedule, and two off-policy algorithms, GTD($\lambda$)-schedule and TDC($\lambda$)-schedule.
We provide proofs of almost sure convergence for all three algorithms under a general Markov noise framework.
\end{abstract}
