
\begin{abstract}


Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL).

Approaches based on importance sampling (IS) often suffer from high variance due to products of IS ratios.

Typical IS-free methods, such as $n$-step Q-learning, look ahead for $n$ time steps along the trajectory of actions (where $n$ is called the lookahead depth) and utilize off-policy data directly without any additional adjustment.

They work well when $n$ is chosen properly.

We show, however, that such IS-free methods underestimate the optimal value function (VF), especially for large $n$, restricting their capacity to efficiently utilize information from distant future time steps.

To overcome this problem, we introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF.

At its core lies a simple but non-trivial \emph{highway gate}, which controls the information flow from the distant future by comparing it to a threshold.

The highway gate guarantees convergence to the optimal VF for arbitrary $n$ and arbitrary behavioral policies. It gives rise to a novel family of off-policy RL algorithms that safely learn even when $n$ is very large, facilitating rapid credit assignment from the far future to the past.

On tasks with greatly delayed rewards, including video games where the reward is given only at the end of the game, our new methods outperform many existing multi-step off-policy algorithms.


\end{abstract}
