\begin{abstract}
Deep reinforcement learning (RL) methods have significant potential for dialogue policy optimisation.
However, they suffer from poor performance in the early stages of learning, which is especially problematic for on-line learning with real users.
Two approaches are introduced to tackle this problem. Firstly, to speed up the learning process, two sample-efficient neural network algorithms are presented: trust region actor-critic with experience replay (TRACER) and episodic natural actor-critic with experience replay (eNACER).
For TRACER, the trust region helps to control the learning step size and avoid catastrophic model changes.
For eNACER, the natural gradient identifies the steepest ascent direction in policy space to speed up convergence.
Both models employ off-policy learning with experience replay to improve sample efficiency.
Secondly, to mitigate the cold-start issue, a corpus of demonstration data is utilised to pre-train the models prior to
on-line reinforcement learning.
Combining these two approaches, we present a practical way to learn deep RL-based dialogue policies and
demonstrate their effectiveness in a task-oriented information-seeking domain.
\end{abstract}
