\begin{abstract}
Under commonly used assumptions, we prove the convergence of
actor-critic reinforcement learning algorithms, which
simultaneously learn a policy function, the actor,
and a value function, the critic.
Both functions can be deep neural networks of arbitrary complexity.
Our framework allows showing convergence of
the well-known Proximal Policy Optimization (PPO)
and of the recently introduced RUDDER.
For the convergence proof,
we employ recently introduced
techniques from two time-scale stochastic approximation theory.
Our results are valid for
actor-critic methods that use episodic samples
and whose policy becomes more greedy during learning.
Previous convergence proofs
assume linear function approximation,
cannot treat episodic samples, or
do not consider that policies become greedy.
The latter is relevant since optimal policies
are typically deterministic.
\end{abstract}