\begin{abstract}
% This paper addresses the slow convergence and low data efficiency of data-driven reinforcement learning (RL) algorithms. Traditional data-driven RL needs massive amounts of training data because policy gradients estimated from measured data have high variance, which makes policy search inefficient. Model-driven RL, by contrast, is highly data-efficient, but uncertainty in the model degrades the control accuracy of the resulting policy. There is therefore an urgent need for an RL method that finds the optimal policy both efficiently and accurately. In this paper, we develop a mixed RL framework that combines an empirical model of the environment dynamics with measured data, with the aim of ensuring both data efficiency and the accuracy of the learned policy. The proposed method is built on two ideas: 1) iterative Bayesian estimation is embedded in the policy iteration architecture to estimate the additive uncertainty of the model; and 2) policy improvement is driven by the continuously updated model of the environment dynamics, which provides global characteristics of the system. The developed algorithm can potentially be applied to autopilot systems, robot control, and similar tasks. In future work, more general environment dynamics and non-Gaussian uncertainties will be considered.
% \end{abstract}