\begin{abstract}
We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms.
While
most existing works on actor-critic employ bi-level or two-timescale updates,
%where the critic solves a policy evaluation sub-problem with the actor held fixed, or has a much larger stepsize than that of the actor.
% In contrast,
we focus on the more practical single-timescale setting, where the actor and critic are updated simultaneously.
Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once, while the actor is updated in the policy gradient direction computed using the critic.
Moreover, we consider two function approximation settings, where both the actor and critic are represented by either linear functions or deep neural networks.
In both cases,
we prove that the actor sequence converges to a globally optimal policy at a sublinear $O(K^{-1/2})$ rate, where $K$ is the number of iterations.
To the best of our knowledge, we establish the rate of convergence and global optimality of
single-timescale actor-critic with linear function approximation for the first time.
Moreover, under the broader scope of policy optimization with nonlinear function approximation, we prove for the first time that actor-critic with deep neural networks finds the globally optimal policy at a sublinear rate.
%The core of our analysis is a ``double contraction'' phenomenon, that is,
%in addition to the contraction of objective value due to actor updates,
%the contraction induced by the Bellman operators used in critic updates gradually shrinks the bias in policy gradient estimates, which leads to convergence to the globally optimal policy.
%where the discount factor $\gamma$ drives the contraction of the combined update of the actor and critic towards their global optima.
\end{abstract}
