2bc580741f4378f1.tex
1: \begin{abstract}
2: 	
3: 	Actor-critic (AC) algorithms, empowered by neural networks, have achieved significant empirical success in recent years. 
4: 	However, most of the existing theoretical support for AC algorithms focuses on linear function approximation, or on linearized neural networks, where the feature representation is fixed throughout training. 
5: 	Such a setting fails to capture the key aspect of representation learning in neural AC, which is pivotal in practical problems.
6: 	In this work, we take a mean-field perspective on the evolution and convergence of feature-based neural AC. 
7: 	Specifically, we consider a version of AC where 
8: 	the actor and critic are represented by overparameterized two-layer neural networks and are updated with two-timescale learning rates. 
9: 	The critic is updated by temporal-difference (TD) learning with a larger stepsize, while the actor is updated via proximal policy optimization (PPO) with a smaller stepsize. 
10: 	In the continuous-time and infinite-width limiting regime, when the timescales are properly separated,
11: 	we prove that neural AC finds the globally optimal policy at a sublinear rate. 
12: 	Additionally, we prove that the feature representation induced by the critic network is allowed to evolve within a neighborhood of the initial one. 
13: 	%{\red Restarting mechanism.}
14: \end{abstract}