\begin{abstract}
Model-based reinforcement learning (RL) has achieved remarkable success on a range of continuous control tasks owing to its high sample efficiency.
To avoid the computational cost of planning online, recent methods distill optimized action sequences into an RL policy during training.
Although such distillation can combine the foresight of planning with the exploration ability of RL policies, the theoretical understanding of these methods remains unclear.
In this paper, we extend the policy improvement step of Soft Actor-Critic (SAC) by developing an approach that distills model-based planning into the policy.
We then show that this policy improvement step is theoretically guaranteed to improve monotonically and to converge to the maximum value defined in SAC.
We discuss effective design choices and implement our theory as a practical algorithm, \textit{\textbf{M}odel-based \textbf{P}lanning \textbf{D}istilled to \textbf{P}olicy (MPDP)}, that updates the policy jointly over multiple future time steps.
Extensive experiments on six continuous control benchmark tasks in MuJoCo show that MPDP achieves higher sample efficiency and better asymptotic performance than both model-free and model-based planning algorithms.
\end{abstract}