\begin{abstract}
\noindent
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, particularly in the experience replay setting now commonly used with deep neural networks.
Classically, off-policy estimation bias is corrected in a \textit{per-decision} manner:
past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio (via eligibility traces) after each action.
Many important off-policy algorithms such as Tree Backup and Retrace rely on this mechanism along with differing protocols for truncating (``cutting'') the ratios (``traces'') to counteract the excessive variance of the IS estimator.
Unfortunately, cutting traces on a per-decision basis is not necessarily efficient;
once a trace has been cut according to local information, the effect cannot be reversed later, potentially resulting in the premature truncation of estimated returns and slower learning.
In the interest of motivating efficient off-policy algorithms, we propose a multistep operator that permits arbitrary past-dependent traces.
We prove that our operator is convergent for policy evaluation, and for optimal control when targeting greedy-in-the-limit policies.
Our theorems establish the first convergence guarantees for many existing algorithms including Truncated IS, Non-Markov Retrace, and history-dependent TD($\lambda$).
Our theoretical results also provide guidance for the development of new algorithms that jointly consider multiple past decisions for better credit assignment and faster learning.
\end{abstract}
