\begin{abstract}
Learning policies in an asynchronous parallel way is essential to the numerous successes of reinforcement learning (RL) in solving large-scale problems. However, the convergence performance of such methods has not yet been rigorously evaluated. To this end, we adopt the asynchronous parallel zero-order policy gradient (AZOPG) method to solve the continuous-time linear quadratic regulation problem. Specifically, multiple workers independently perform system rollouts to estimate zero-order policy gradients, which are then aggregated at a master for policy updates. As in the celebrated A3C algorithm, each worker is allowed to interact with the master {\em asynchronously}. By quantifying the convergence rate of AZOPG, we show its linear speedup property, both in theory and in simulation, which reveals the advantage of using asynchronous parallel workers to learn policies.
\end{abstract}