\begin{abstract}

Actor-critic (AC) algorithms, empowered by neural networks, have achieved significant empirical success in recent years.
However, most existing theoretical support for AC algorithms focuses on linear function approximation, or on linearized neural networks, where the feature representation is fixed throughout training.
Such a setting fails to capture representation learning, a key aspect of neural AC that is pivotal in practical problems.
In this work, we take a mean-field perspective on the evolution and convergence of feature-based neural AC.
Specifically, we consider a version of AC where
the actor and critic are represented by overparameterized two-layer neural networks and are updated with two-timescale learning rates.
The critic is updated by temporal-difference (TD) learning with a larger stepsize, while the actor is updated via proximal policy optimization (PPO) with a smaller stepsize.
In the continuous-time and infinite-width limiting regime, when the timescales are properly separated,
we prove that neural AC finds the globally optimal policy at a sublinear rate.
Additionally, we prove that the feature representation induced by the critic network may evolve during training within a neighborhood of its initialization.
%{\red Restarting mechanism.}
\end{abstract}