\begin{abstract}
We prove, under commonly used assumptions, the convergence of
actor-critic reinforcement learning algorithms, which
simultaneously learn a policy function (the actor)
and a value function (the critic).
Both functions can be deep neural networks of arbitrary complexity.
Our framework allows us to show convergence of
the well-known Proximal Policy Optimization (PPO)
and of the recently introduced RUDDER.
For the convergence proof,
we employ recently introduced
techniques from two time-scale stochastic approximation theory.
Our results are valid for
actor-critic methods that use episodic samples
and whose policy becomes more greedy during learning.
Previous convergence proofs
assume linear function approximation,
cannot treat episodic samples, or
do not consider that policies become greedy.
The last point is relevant since optimal policies
are typically deterministic.
\end{abstract}