\begin{abstract}

Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL).
Approaches based on importance sampling (IS) often suffer from high variance due to products of IS ratios.
Typical IS-free methods, such as $n$-step Q-learning, look ahead for $n$ time steps along the trajectory of actions (where $n$ is called the lookahead depth) and use off-policy data directly, without any additional adjustment.
They work well when $n$ is chosen appropriately.
We show, however, that such IS-free methods underestimate the optimal value function (VF), especially for large $n$, which restricts their capacity to efficiently use information from distant future time steps.
To overcome this problem, we introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF.
At its core lies a simple but non-trivial \emph{highway gate},
which controls the information flow from the distant future by comparing it to a threshold.
The highway gate guarantees convergence to the optimal VF for arbitrary $n$ and arbitrary behavioral policies.
It gives rise to a novel family of off-policy RL algorithms that learn safely even when $n$ is very large, facilitating rapid credit assignment from the far future to the past.
On tasks with greatly delayed rewards, including video games where the reward is given only at the end of the game, our new methods outperform many existing multi-step off-policy algorithms.

\end{abstract}