b270e518bd2c4b5d.tex
1: \begin{abstract}
2: 
3: We study a reinforcement learning setting, where the state transition function is a convex combination of a stochastic continuous function and a deterministic function. Such a setting generalizes the widely-studied stochastic state transition setting, namely the setting of deterministic policy gradient (DPG).
4: 
5: We firstly give a simple example to illustrate that the deterministic policy gradient may be infinite under deterministic state transitions, and introduce a theoretical technique to prove the existence of the policy gradient in this generalized setting. Using this technique,
6: we prove that the deterministic policy gradient indeed exists for a certain set of discount factors, and further prove two conditions that guarantee the existence for all discount factors.
7: We then derive a closed form of the policy gradient whenever exists. Furthermore, to overcome the challenge of high sample complexity of DPG in this setting, we propose the {\em Generalized Deterministic Policy Gradient} (GDPG) algorithm. The main innovation of the algorithm is a new method of applying model-based techniques to the model-free algorithm, the deep deterministic policy gradient algorithm (DDPG). GDPG optimize the long-term rewards of the model-based augmented MDP subject to a constraint that the long-rewards of the MDP is less than the original one.
8: 
9: We finally conduct extensive experiments comparing GDPG with state-of-the-art methods and the direct model-based extension method of DDPG on several standard continuous control benchmarks.
10: Results demonstrate that GDPG substantially outperforms DDPG, the model-based extension of DDPG and other baselines in terms of both convergence and long-term rewards in most environments.
11: \end{abstract}
12: