\begin{abstract}
We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs).
Our approach extends the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization
functions. Our key result shows that using the conditional entropy of the joint state-action distributions as regularization yields a
dual optimization problem closely resembling the Bellman optimality equations. This result enables us to formalize a number of
state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and thus
to reason about the convergence properties of these methods. In particular, we show that the exact version of the TRPO algorithm of
\citet{schulman2015trust} actually converges to the optimal policy, while the entropy-regularized policy gradient methods of
\citet{mnih2016asynchronous} may fail to converge to a fixed point. Finally, we empirically illustrate the effects of various
regularization techniques on learning performance in a simple reinforcement learning setup.
\end{abstract}
