abstract:7a7dba9daea9a4e6.tex

1: \begin{abstract}

2: We study the global convergence and global  optimality of actor-critic, one of the most popular families of reinforcement learning algorithms.

3: While

4:  most existing works on actor-critic employ bi-level or two-timescale  updates,

5:  %where the  critic solves a policy evaluation sub-problem with the actor  held fixed, or has a much larger stepsize than that of the actor.

6: % In contrast,

7:  we focus on the more practical single-timescale setting, where the actor and  critic are updated simultaneously.

8: Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once, while the actor is updated in the policy gradient direction computed using the critic.

9: Moreover, we consider two function approximation settings, in which both the actor and critic are represented by either linear functions or deep neural networks.

10: For both cases,

11: we prove that the actor sequence  converges to a globally optimal policy at a  sublinear $O(K^{-1/2})$ rate, where $K$ is the number of iterations.

12: To the best of our knowledge, we establish the rate of convergence and global optimality of

13: single-timescale  actor-critic  with linear  function approximation   for the first time.

14: Moreover, under the broader scope of policy optimization with nonlinear function approximation, we prove for the first time that actor-critic with deep neural networks finds the globally optimal policy at a sublinear rate.

15: %The core of our analysis is a ``double contraction'' phenomenon, that is,

16: %in addition to the contraction of objective value due to actor updates,

17: %the contraction induced by the Bellman operators used in critic updates gradually shrinks the bias in policy gradient estimates, which leads to convergence to the globally optimal policy.

18: %where the discount factor $\gamma$ drives the contraction of the combined update of the actor and critic towards their global optima.

19: \end{abstract}

20: