\begin{abstract}
Deep reinforcement learning (RL) methods have significant potential for dialogue policy optimisation. However, they suffer from poor performance in the early stages of learning. This is especially problematic for on-line learning with real users.
Two approaches are introduced to tackle this problem. Firstly, to speed up the learning process, two sample-efficient neural network algorithms are presented: trust region actor-critic with experience replay (TRACER) and episodic natural actor-critic with experience replay (eNACER). For TRACER, the trust region helps to control the learning step size and avoid catastrophic model changes. For eNACER, the natural gradient identifies the steepest ascent direction in policy space to speed up convergence. Both models employ off-policy learning with experience replay to improve sample efficiency.
Secondly, to mitigate the cold start issue, a corpus of demonstration data is utilised to pre-train the models prior to on-line reinforcement learning.
Combining these two approaches, we demonstrate a practical way to learn deep RL-based dialogue policies and show their effectiveness in a task-oriented information-seeking domain.
\end{abstract}
