\begin{abstract}
We revisit the standard formulation of the tabular actor-critic algorithm as a two time-scale stochastic approximation, with the value function computed on the faster time scale and the policy computed on the slower time scale. This emulates policy iteration. We observe that reversing the time scales in fact emulates value iteration and yields a legitimate algorithm. We provide a proof of convergence and compare the two algorithms empirically, both with and without function approximation (using linear as well as nonlinear function approximators). We observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.

\medskip

\noindent
{\bf Keywords:} Reinforcement learning; approximate dynamic programming; critic-actor algorithm; two time-scale stochastic approximation.

\end{abstract}