\begin{abstract}
In this work, we take a fresh look at some old and new algorithms for off-policy, return-based
reinforcement learning. Expressing these in a common form, we derive a novel algorithm,
Retrace($\lambda$), with three desired properties: (1) it has {\em low variance}; (2)
it {\em safely} uses samples collected from any behaviour policy, whatever its degree of ``off-policyness''; and
(3) it is {\em efficient} as it makes the best use of samples collected from near on-policy behaviour policies.
We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based
off-policy control algorithm converging a.s.~to $Q^*$ without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q($\lambda$), which had been an open problem since 1989.
We illustrate the benefits of Retrace($\lambda$) on a standard suite of Atari 2600 games.
\end{abstract}