\begin{abstract}
We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms.
While most existing works on actor-critic employ bi-level or two-timescale updates,
%where the  critic solves a policy evaluation sub-problem with the actor  held fixed, or has a much larger stepsize than that of the actor.
% In contrast,
we focus on the more practical single-timescale setting, where the actor and critic are updated simultaneously.
Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once, while the actor is updated in the policy gradient direction computed using the critic.
Moreover, we consider two function approximation settings, in which both the actor and critic are represented by either linear functions or deep neural networks.
For both cases, we prove that the actor sequence converges to a globally optimal policy at a sublinear $O(K^{-1/2})$ rate, where $K$ is the number of iterations.
To the best of our knowledge, this is the first result establishing the rate of convergence and global optimality of single-timescale actor-critic with linear function approximation.
Moreover, under the broader scope of policy optimization with nonlinear function approximation, we prove for the first time that actor-critic with deep neural networks finds the globally optimal policy at a sublinear rate.
%The core of our analysis is a ``double contraction'' phenomenon, that is,
%in addition to the contraction of objective value due to actor updates,
%the contraction induced by the Bellman operators used in critic updates gradually shrinks the bias in policy gradient estimates, which leads to convergence to the globally optimal policy.
%where the discount factor $\gamma$ drives the contraction of the combined update of the actor and critic towards their global optima.
\end{abstract}
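
As a rough illustration of the single-timescale scheme summarized in the abstract, one iteration can be sketched as follows. The notation is generic and assumed here for illustration only: $Q_k$ denotes the critic, $\theta_k$ the actor parameter, $\pi_{\theta_k}$ the corresponding policy, $\mathcal{T}^{\pi}$ the Bellman evaluation operator, $\eta$ the actor stepsize, and $r$, $P$, $\gamma$ the reward, transition kernel, and discount factor. It is not necessarily the notation or the exact update rule analyzed in the paper, and the choice of which policy iterate the critic evaluates is likewise an assumption.
\[
(\mathcal{T}^{\pi} Q)(s,a) = r(s,a) + \gamma \, \mathrm{E}_{s' \sim P(\cdot \mid s,a),\; a' \sim \pi(\cdot \mid s')} \bigl[ Q(s',a') \bigr],
\]
\[
Q_{k+1} \approx \mathcal{T}^{\pi_{\theta_k}} Q_k,
\qquad
\theta_{k+1} = \theta_k + \eta \, \widehat{\nabla}_{\theta} J(\theta_k; Q_k).
\]
Here $\widehat{\nabla}_{\theta} J(\theta_k; Q_k)$ stands for a policy gradient estimate computed with the current critic $Q_k$ in place of the true action-value function, and ``$\approx$'' indicates that the critic update is carried out within the chosen function class. The critic applies $\mathcal{T}^{\pi}$ only once per iteration instead of solving the policy evaluation problem to convergence, which is what separates this scheme from bi-level updates.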