1: \begin{abstract}
2:
3: The Zap~Q-learning algorithm introduced in this paper improves on Watkins' original algorithm and recent competitors in several respects. It is a matrix-gain algorithm designed so that its asymptotic variance is optimal. Moreover, an ODE analysis suggests that the transient behavior is a close match to a deterministic Newton-Raphson implementation. This is made possible by a two-time-scale update equation for the matrix gain sequence.
4:
5: The analysis suggests that the approach will lead to stable and efficient computation even for non-ideal parameterized settings. Numerical experiments confirm the quick convergence, even in such non-ideal cases. The comparison plot on this first page, taken from \Fig{6stateBEPlot} of this paper, illustrates the acceleration in convergence obtained with the new algorithm.
6:
7: A secondary goal of this paper is tutorial. The first half of the paper contains a survey of reinforcement learning algorithms, with a focus on minimum-variance algorithms.
8:
9:
10: \medskip
11:
12: {\small
13: \noindent
14: \textbf{Keywords:}
15: Reinforcement learning,
16: Q-learning,
17: Stochastic optimal control}
18: \smallskip
19:
20: {\small
21: \noindent
22: \textbf{2000 AMS Subject Classification:}
23: 93E20, % Optimal stochastic control
24: 93E35 % Stochastic learning and adaptive control
25: %60J20 %Applications of Markov chains and discrete-time Markov processes on general state spaces (social mobility, learning theory, industrial processes, etc.) [See also 90B30, 91D10, 91D35, 91E40]
26: %60J22 %Computational methods in Markov chains [See also 65C40]
27:
28: % 60J10, % chains with discrete parameter
29: %60J25, % Markov processes with continuous parameter
30: %37A30, % Ergodic theorems, spectral theory, Markov operators
31: % 60F10, % Large deviations
32: %47H99. % nonlinear operators
33: }
34:
35:
36:
37: \vfill
38:
39: \Ebox{1}{6State_BEPlot_Beta08099.pdf}
40: \vfill
41:
42: \end{abstract}
43: