\begin{abstract}
    \noindent
    Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, particularly in the experience replay setting now commonly used with deep neural networks.
    Classically, off-policy estimation bias is corrected in a \textit{per-decision} manner:
    past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio (via eligibility traces) after each action.
    Many important off-policy algorithms such as Tree Backup and Retrace rely on this mechanism along with differing protocols for truncating (``cutting'') the ratios (``traces'') to counteract the excessive variance of the IS estimator.
    Unfortunately, cutting traces on a per-decision basis is not necessarily efficient;
    once a trace has been cut according to local information, the effect cannot be reversed later, potentially resulting in the premature truncation of estimated returns and slower learning.
    In the interest of motivating efficient off-policy algorithms, we propose a multistep operator that permits arbitrary past-dependent traces.
    We prove that our operator is convergent for policy evaluation, and for optimal control when targeting greedy-in-the-limit policies.
    Our theorems establish the first convergence guarantees for many existing algorithms including Truncated IS, Non-Markov Retrace, and history-dependent TD($\lambda$).
    Our theoretical results also provide guidance for the development of new algorithms that jointly consider multiple past decisions for better credit assignment and faster learning.
\end{abstract}