9c8a066496e87e0e.tex
1: \begin{abstract}
2:     We study Whittle index learning algorithms for restless multi-armed bandits (RMAB). We first present the Q-learning algorithm and its variants: speedy Q-learning (SQL), generalized speedy Q-learning (GSQL), and phase Q-learning (PhaseQL). We also discuss two exploration policies, $\epsilon$-greedy and upper confidence bound (UCB), and extend the study of Q-learning and its variants to the UCB policy. Using a numerical example, we illustrate that Q-learning with the UCB exploration policy converges faster than with $\epsilon$-greedy, and that PhaseQL with UCB has the fastest convergence rate. We then extend the study of these Q-learning variants to index learning for RMAB. The index learning algorithm is a two-timescale variant of stochastic approximation: on the slower timescale we update the index, and on the faster timescale we update the Q-learning iterates assuming a fixed index value. We study this two-timescale stochastic approximation algorithm with constant stepsizes.
3: 
4:     %Further, We present study on index learning with deep Q-network (DQN) learning and linear function approximation with state-aggregation method. 
5: 
6:     We describe the performance of our algorithms using a numerical example. It illustrates that index learning based on Q-learning with UCB converges faster than with $\epsilon$-greedy. Further, PhaseQL (with either UCB or $\epsilon$-greedy) converges faster than the other Q-learning algorithms.
7: 
8:    % \rahul{Need to rewrite}
9: \end{abstract}
10: