\begin{abstract}
For on-policy reinforcement learning, discretizing the action space for continuous control makes it easy to express multimodal policies and is straightforward to optimize.
However, when the inherent ordering between the discrete atomic actions is ignored, the explosion in the number of discrete actions can have undesirable properties and induce higher variance in the policy gradient estimator.
In this paper, we introduce a straightforward architecture that addresses this issue by constraining the discrete policy to be unimodal using Poisson probability distributions.
By making the unimodality explicit, this architecture better exploits the continuity of the underlying continuous action space.
We conduct extensive experiments showing that the discrete policy with the unimodal probability distribution provides significantly faster convergence and higher performance for on-policy reinforcement learning algorithms in challenging control tasks, especially in highly complex tasks such as Humanoid.
We also provide a theoretical analysis of the variance of the policy gradient estimator, which suggests that our carefully designed unimodal discrete policy retains lower variance and yields a stable learning process.
\end{abstract}
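
To illustrate how a Poisson distribution can constrain a discrete policy to be unimodal, the following is a minimal sketch under stated assumptions, not necessarily the exact parameterization used in this work. We assume $K$ ordered bins per action dimension, indexed $k = 0, \dots, K-1$, and a hypothetical positive network output $\lambda_\theta(s)$; both symbols are introduced here for illustration only. Normalizing the Poisson mass function over the $K$ bins gives
\[
\pi_\theta(a = k \mid s) = \frac{\lambda_\theta(s)^{k} / k!}{\sum_{j=0}^{K-1} \lambda_\theta(s)^{j} / j!},
\qquad k = 0, \dots, K-1 .
\]
The ratio of successive probabilities is
\[
\frac{\pi_\theta(a = k+1 \mid s)}{\pi_\theta(a = k \mid s)} = \frac{\lambda_\theta(s)}{k+1},
\]
which is strictly decreasing in $k$. The probabilities therefore increase while $k + 1 < \lambda_\theta(s)$ and decrease afterwards, so under these assumptions the distribution has a single mode at $\min\bigl(\lfloor \lambda_\theta(s) \rfloor,\, K-1\bigr)$, with a tie between adjacent bins $\lambda_\theta(s) - 1$ and $\lambda_\theta(s)$ when $\lambda_\theta(s)$ is an integer no larger than $K-1$.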