\begin{abstract}

Learning from multi-step off-policy data collected by a set of policies is a core problem in reinforcement learning (RL).
Approaches based on importance sampling (IS) often suffer from high variance due to products of IS ratios.
Typical IS-free methods, such as $n$-step Q-learning, look ahead for $n$ time steps along the trajectory of actions (where $n$ is called the lookahead depth) and utilize off-policy data directly without any additional adjustment.
They work well when $n$ is chosen appropriately.
We show, however, that such IS-free methods underestimate the optimal value function (VF), especially for large $n$, restricting their capacity to efficiently utilize information from distant future time steps.
To overcome this problem, we introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF.
At its core lies a simple but non-trivial \emph{highway gate}, which controls the information flow from the distant future by comparing it to a threshold.
The highway gate guarantees convergence to the optimal VF for arbitrary $n$ and arbitrary behavioral policies.
It gives rise to a novel family of off-policy RL algorithms that safely learn even when $n$ is very large, facilitating rapid credit assignment from the far future to the past.
On tasks with greatly delayed rewards, including video games where the reward is given only at the end of the game, our new methods outperform many existing multi-step off-policy algorithms.

\end{abstract}