\begin{abstract}
    Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging.
    Classically, off-policy bias is corrected in a \emph{per-decision} manner:
    past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces.
    Many off-policy algorithms rely on this mechanism, along with differing protocols for \emph{cutting} the IS ratios to combat the variance of the IS estimator.
    Unfortunately, once a trace has been fully cut, the effect cannot be reversed.
    This has led to the development of credit-assignment strategies that account for multiple past experiences at a time.
    These \emph{trajectory-aware} methods have not been extensively analyzed, and their theoretical justification remains uncertain.
    In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods.
    We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones.
    Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across $\lambda$-values in several off-policy control tasks.
\end{abstract}
