abstract:6ada93653396b726.tex

1: \begin{abstract}

2: For on-policy reinforcement learning, discretizing the action space for continuous control yields a policy that can easily express multiple modes and is straightforward to optimize.

3: However, when the inherent ordering between the discrete atomic actions is ignored, the explosion in the number of discrete actions can introduce undesired properties and induce a higher variance in the policy gradient estimator.

4: In this paper, we introduce a straightforward architecture that addresses this issue by constraining the discrete policy to be unimodal using Poisson probability distributions.

5: Because the distribution is explicitly unimodal, this architecture can better leverage the continuity of the underlying continuous action space.

6: We conduct extensive experiments showing that a discrete policy with a unimodal probability distribution provides significantly faster convergence and higher performance for on-policy reinforcement learning algorithms on challenging control tasks, especially highly complex ones such as Humanoid.

7: We also provide a theoretical analysis of the variance of the policy gradient estimator, which suggests that our carefully designed unimodal discrete policy can retain a lower variance and yield a stable learning process.

8: \end{abstract}

9: