\begin{abstract}
Under commonly used assumptions, we prove the convergence of
actor-critic reinforcement learning algorithms, which
simultaneously learn a policy function, the actor,
and a value function, the critic.
Both functions can be deep neural networks of arbitrary complexity.
Our framework allows us to show convergence of
the well-known Proximal Policy Optimization (PPO)
and of the recently introduced RUDDER.
For the convergence proof,
we employ recently introduced
techniques from two time-scale stochastic approximation theory.
Our results are valid for
actor-critic methods that use episodic samples
and whose policy becomes more greedy during learning.
Previous convergence proofs
assume linear function approximation,
cannot treat episodic samples, or
do not consider that policies become greedy.
The latter is relevant since optimal policies
are typically deterministic.
\end{abstract}