
\begin{abstract}

Natural policy gradient ($\NPG$) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently,~\citet{vaswani2021general} introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm, removing its need for a normalization across actions, and analyze the resulting method (referred to as $\Alg$). For tabular MDPs, we prove that $\Alg$ with a constant step-size matches the linear convergence of $\NPG$ and converges faster than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend $\Alg$ to use a log-linear policy parameterization. Unlike the corresponding generalization of $\NPG$, generalizing $\Alg$ to the linear function approximation (FA) setting does not require compatible function approximation. Unlike $\MDPO$, a practical generalization of $\NPG$, $\Alg$ with linear FA only requires solving convex softmax classification problems. We prove that $\Alg$ achieves linear convergence to a neighbourhood of the optimal value function. We extend $\Alg$ to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks. Our results demonstrate that $\Alg$ consistently achieves similar or better performance compared to $\MDPO$, $\PPO$ and $\TRPO$.

\end{abstract}
