\begin{abstract}
We propose and analyze an alternative approach to off-policy multi-step
temporal difference learning, in which off-policy returns are
corrected with the current Q-function in terms of rewards, rather than
with the target policy in terms of transition probabilities. We prove
that such approximate corrections are sufficient for off-policy
convergence in both policy evaluation and control, provided certain
conditions hold. These conditions relate the distance between the target
and behavior policies, the eligibility trace parameter, and the discount
factor, and formalize an underlying tradeoff in off-policy
TD($\lambda$). We illustrate this theoretical relationship empirically
on a continuous-state control task.
\end{abstract}