\begin{abstract}
    \noindent
    Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, particularly in the experience replay setting now commonly used with deep neural networks.
    Classically, off-policy estimation bias is corrected in a \textit{per-decision} manner:
    past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio (via eligibility traces) after each action.
    Many important off-policy algorithms such as Tree Backup and Retrace rely on this mechanism along with differing protocols for truncating (``cutting'') the ratios (``traces'') to counteract the excessive variance of the IS estimator.
    Unfortunately, cutting traces on a per-decision basis is not necessarily efficient;
    once a trace has been cut according to local information, the effect cannot be reversed later, potentially resulting in the premature truncation of estimated returns and slower learning.
    In the interest of motivating efficient off-policy algorithms, we propose a multistep operator that permits arbitrary past-dependent traces.
    We prove that our operator is convergent for policy evaluation, and for optimal control when targeting greedy-in-the-limit policies.
    Our theorems establish the first convergence guarantees for many existing algorithms including Truncated IS, Non-Markov Retrace, and history-dependent TD($\lambda$).
    Our theoretical results also provide guidance for the development of new algorithms that jointly consider multiple past decisions for better credit assignment and faster learning.
\end{abstract}
