1: \begin{abstract}
2: A dialogue policy module is an essential part of task-completion dialogue systems. Recently, interest in reinforcement learning (RL)-based dialogue policies has grown. Their performance and the quality of their action decisions rely on accurate estimation of action values. Overestimation is a widely known problem in
3: RL: the estimate of the maximum action value tends to be larger than the ground truth, which results in an unstable learning process and a suboptimal policy. This problem is detrimental to RL-based dialogue policy learning.
4: To mitigate this problem, this paper proposes a dynamic partial average estimator (\modelname) of the ground truth maximum action value. \modelname~calculates the partial average between the predicted \textit{maximum} action value and \textit{minimum} action value, where the weights are dynamically adaptive and problem-dependent. We incorporate \modelname~into a deep Q-network as the dialogue policy. Our method achieves better or comparable results relative to top baselines on three dialogue datasets from different domains \textit{with a lower computational load}. We also theoretically prove convergence and derive upper and lower bounds on the bias, which we compare with those of other methods.
5: \end{abstract}
6: