\begin{abstract}
We propose and analyze an alternative approach to off-policy multi-step
temporal difference learning, in which off-policy returns are
corrected with the current Q-function in terms of rewards, rather than
with the target policy in terms of transition
probabilities. We prove that such approximate
corrections are sufficient for off-policy convergence in both policy
evaluation and control, provided certain conditions hold. These conditions
relate the distance between the target and behavior policies, the eligibility
trace parameter, and the discount factor, and formalize an underlying
tradeoff in off-policy TD($\lambda$).
We illustrate this theoretical relationship
empirically on a
continuous-state control task.
\end{abstract}