\begin{abstract}
  We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs).
  Our approach extends the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization
functions. Our key result shows that using the conditional entropy of the joint state-action distributions as regularization yields a
dual optimization problem closely resembling the Bellman optimality equations. This result enables us to formalize a number of
state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and thus
to reason about the convergence properties of these methods. In particular, we show that the exact version of the TRPO algorithm of
\citet{schulman2015trust} actually converges to the optimal policy, while the entropy-regularized policy gradient methods of
\citet{mnih2016asynchronous} may fail to converge to a fixed point. Finally, we illustrate empirically the effects of various
regularization techniques on learning performance in a simple reinforcement learning setup.
\end{abstract}