\begin{abstract}
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging.
Classically, off-policy bias is corrected in a \emph{per-decision} manner:
past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces.
Many off-policy algorithms rely on this mechanism, along with differing protocols for \emph{cutting} the IS ratios to combat the variance of the IS estimator.
Unfortunately, once a trace has been fully cut, the effect cannot be reversed.
This has led to the development of credit-assignment strategies that account for multiple past experiences at a time.
These \emph{trajectory-aware} methods have not been extensively analyzed, and their theoretical justification remains uncertain.
In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods.
We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones.
Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across $\lambda$-values in several off-policy control tasks.
\end{abstract}
