\begin{abstract}
Greedy-GQ is an off-policy two-timescale algorithm for optimal control in reinforcement learning \cite{maei2010toward}. This paper develops the first finite-sample analysis of the Greedy-GQ algorithm with linear function approximation under Markovian noise. Our analysis provides theoretical justification for the choice of stepsizes in this two-timescale algorithm to achieve faster convergence in practice, and it suggests a trade-off between the convergence rate and the quality of the obtained policy. Our paper extends the finite-sample analysis of two-timescale reinforcement learning algorithms from policy evaluation to optimal control, which is of greater practical interest. Specifically, in contrast to existing finite-sample analyses of two-timescale methods, e.g., GTD, GTD2 and TDC, whose objective functions are convex, the objective function of the Greedy-GQ algorithm is non-convex. Moreover, Greedy-GQ is not a linear two-timescale stochastic approximation algorithm. Our techniques provide a general framework for the finite-sample analysis of non-convex value-based reinforcement learning algorithms for optimal control.

% the first temporal-difference learning algorithm for off-policy control and linear function approximation. Even the algorithm is important, there are very limited studies about its convergence rate. We provide a non-asymptotic convergence analysis for two timescale Greedy-GQ under a non-i.i.d. Markov sample path and linear function approximation.   \zou{can we simplify this part to get the order? what we are summing over in the above expression?},

% We also showed when constant stepsize is used, the fast rate of convergence is obtained when we choose $\beta_t=\frac{1}{T^{\frac{2}{3}}}$, and $\alpha_t=\frac{1}{T^a}$ with any  $\frac{2}{3} < a < 1 $, in that case the rate would be $\mathcal{O}(\frac{\log T}{T^{\frac{1}{3}}})$.
\end{abstract}