f50df52affed1348.tex
1: \begin{abstract}  
2: This paper proposes and analyzes two new policy learning methods, regularized policy gradient (RPG) and iterative policy optimization (IPO), for a class of discounted linear-quadratic control (LQC) problems over an infinite time horizon with entropy regularization. Assuming access to exact policy evaluation, both approaches are
3: proved to converge linearly to the optimal policy of the regularized LQC problem.
4: Moreover, the IPO method can achieve a super-linear convergence rate once it enters a local region around the optimal policy. 
5: Finally, when the optimal policy for a reinforcement learning (RL) problem with a known environment is appropriately transferred as the initial policy to an RL problem with an unknown environment, the IPO method is shown to achieve a super-linear convergence rate if the two environments are sufficiently close.
6: The performance of the proposed algorithms is illustrated by numerical examples.
7: \end{abstract}
8: