\begin{abstract}
Trust-region methods have yielded state-of-the-art results in policy
search. A common approach is to use the KL divergence to bound the trust
region, which results in a natural gradient policy update. We show that
the natural gradient and trust-region optimization are equivalent if
we use the natural parameterization of a standard exponential policy
distribution in combination with compatible value function
approximation. Moreover, we show that standard natural gradient
updates may reduce the entropy of the policy according to a wrong
schedule, leading to premature convergence. To control entropy
reduction, we introduce a new policy search method called compatible
policy search (COPOS), which bounds entropy loss. The experimental
results show that COPOS yields state-of-the-art results in challenging
continuous control tasks and in discrete partially observable tasks.
\end{abstract}
