\begin{abstract}
We propose a stochastic approximation based method with randomisation of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our method yields an $O(d)$ reduction in computational complexity compared with regular LSTD, where $d$ is the dimension of the data. We provide convergence rate results for the proposed method, both in high probability and in expectation. Moreover, we establish that using our scheme in place of LSTD does not affect the rate of convergence of the approximate value function to the true value function.
This result, coupled with the low complexity of our method, makes it attractive for implementation in {\em big data} settings, where $d$ is large.
Further, we analyse a similar low-complexity alternative for least squares regression and provide finite-time bounds for it.
We demonstrate the practicality of our method for LSTD empirically by combining it with the least squares policy iteration (LSPI) algorithm in a traffic signal control application.
% We see that with a step-size of $c/n$ ($n$ is the number of iterations), our algorithms converge at the rate of $O\left(n^{-1/2}\right)$ for the optimal choice of $c$. Further, we show that with iterate averaging, this dependency on the choice of $c$ is removed. Our algorithms have a low complexity of the order $O(dn)$, where $d$ is the dimension of the feature vector. In comparison, the Sherman-Morrison lemma based approach has complexity of the order $O(d^2 n)$.
\end{abstract}
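
As a rough illustration of the kind of update the abstract describes, one stochastic approximation step on a uniformly sampled transition might be written as below. The notation is an assumption introduced here for illustration and does not appear in the abstract: $\theta_n$ is a $d$-dimensional iterate, $\gamma_n$ a step-size, $\beta \in (0,1)$ a discount factor, $\phi(\cdot)$ a $d$-dimensional feature map, and $(s_i, r_i, s'_i)$, $i = 1,\dots,T$, a batch of observed transitions.
\[
\theta_n = \theta_{n-1} + \gamma_n \bigl( r_{i_n} + \beta\, \theta_{n-1}^{\top} \phi(s'_{i_n}) - \theta_{n-1}^{\top} \phi(s_{i_n}) \bigr)\, \phi(s_{i_n}),
\qquad i_n \sim \mathrm{Uniform}\{1,\dots,T\}.
\]
Under these assumptions each step needs only inner products and vector additions in $d$ dimensions, so it costs $O(d)$ per iteration, whereas a scheme that maintains a $d \times d$ matrix costs $O(d^2)$ per iteration.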