\begin{abstract}
% Natural policy gradient ($\NPG$) is a common policy optimization algorithm that can be viewed as mirror ascent in the space of probabilities. We propose a novel policy gradient method, referred to as $\Alg$, that corresponds to mirror ascent in the dual space of logits and does not require explicit normalization across actions. For tabular MDPs, $\Alg$ with a constant step-size matches the linear convergence of $\NPG$ and achieves a better convergence rate than both non-accelerated and accelerated constant step-size softmax policy gradient ($\SPG$). To handle large state-action spaces, we extend $\Alg$ to use a set of state-action features and a log-linear policy parameterization. Unlike for $\NPG$, generalizing $\Alg$ to the linear function approximation (FA) setting does not require compatible function approximation. Unlike $\MDPO$, a more practical generalization of $\NPG$, $\Alg$ with linear FA only requires solving a convex softmax classification problem in each iteration, which enables us to prove theoretical guarantees for the resulting algorithm. Finally, we extend $\Alg$ to handle general non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks. Our experimental results demonstrate that $\Alg$ consistently outperforms $\SPG$ while achieving similar or better performance compared to $\MDPO$, $\PPO$, and $\TRPO$.
% \end{abstract}