\begin{abstract}
Trust-region methods have yielded state-of-the-art results in policy
search. A common approach is to use the KL divergence to bound the trust
region, which results in a natural gradient policy update. We show that
the natural gradient and trust-region optimization are equivalent if
we use the natural parameterization of a standard exponential policy
distribution in combination with compatible value function
approximation. Moreover, we show that standard natural gradient
updates may reduce the entropy of the policy according to a wrong
schedule, leading to premature convergence. To control entropy
reduction, we introduce a new policy search method called compatible
policy search (COPOS), which bounds entropy loss. The experimental
results show that COPOS yields state-of-the-art results in challenging
continuous control tasks and in discrete partially observable tasks.
\end{abstract}