\begin{abstract}
Deep reinforcement learning (RL) methods have significant potential for dialogue policy optimisation.
However, they suffer from poor performance in the early stages of learning, which is especially problematic for on-line learning with real users.
Two approaches are introduced to tackle this problem. Firstly, to speed up the learning process, two sample-efficient neural network algorithms are presented: trust region actor-critic with experience replay (TRACER) and episodic natural actor-critic with experience replay (eNACER).
For TRACER, the trust region helps to control the learning step size and avoid catastrophic model changes.
For eNACER, the natural gradient identifies the steepest ascent direction in policy space, which speeds up convergence.
Both models employ off-policy learning with experience replay to improve sample efficiency.
Secondly, to mitigate the cold start issue, a corpus of demonstration data is utilised to pre-train the models prior to
on-line reinforcement learning.
Combining these two approaches, we present a practical method for learning deep RL-based dialogue policies and
demonstrate its effectiveness in a task-oriented information-seeking domain.

\end{abstract}