abstract:c26abed58b15a30b.tex

1: \begin{abstract}

2: Due to the high variance of policy gradients, on-policy optimization algorithms are plagued with low sample efficiency. In this work, we propose Augment-Reinforce-Merge (ARM) policy gradient estimator as an unbiased low-variance alternative to previous baseline estimators on tasks with binary action space, inspired by the recent ARM gradient estimator for discrete random variable models \citep{yin2018arm}. We show that the ARM policy gradient estimator achieves variance reduction with theoretical guarantees, and leads to significantly more stable and faster convergence of policies parameterized by neural networks.

3: \end{abstract}

4: