
\begin{abstract}
  We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs).
  Our approach is based on extending the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization
  functions. Our key result is showing that using the conditional entropy of the joint state-action distributions as regularization yields a
  dual optimization problem closely resembling the Bellman optimality equations. This result enables us to formalize a number of
  state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and thus
  to argue about the convergence properties of these methods. In particular, we show that the exact version of the TRPO algorithm of
  \citet{schulman2015trust} actually converges to the optimal policy, while the entropy-regularized policy gradient methods of
  \citet{mnih2016asynchronous} may fail to converge to a fixed point. Finally, we illustrate empirically the effects of using various
  regularization techniques on learning performance in a simple reinforcement learning setup.
\end{abstract}