abstract:2bc580741f4378f1.tex

\begin{abstract}
	Actor-critic (AC) algorithms, empowered by neural networks, have had significant empirical success in recent years.
	However, most existing theoretical support for AC algorithms focuses on linear function approximation or linearized neural networks, where the feature representation is fixed throughout training.
	Such a limitation fails to capture the key aspect of representation learning in neural AC, which is pivotal in practical problems.
	In this work, we take a mean-field perspective on the evolution and convergence of feature-based neural AC.
	Specifically, we consider a version of AC where the actor and critic are represented by overparameterized two-layer neural networks and are updated with two-timescale learning rates.
	The critic is updated by temporal-difference (TD) learning with a larger stepsize, while the actor is updated via proximal policy optimization (PPO) with a smaller stepsize.
	In the continuous-time and infinite-width limiting regime, when the timescales are properly separated, we prove that neural AC finds the globally optimal policy at a sublinear rate.
	Additionally, we prove that the feature representation induced by the critic network is allowed to evolve within a neighborhood of the initial one.
	%{\red Restarting mechanism.}
\end{abstract}