
\begin{abstract}

% For neural sequence model training, maximum likelihood (ML) has been commonly adopted to optimize model parameters with respect to corresponding objectives.

In sequence prediction tasks such as neural machine translation (NMT), training with the cross-entropy loss often yields models that overgeneralize and become trapped in local optima.

In this paper, we propose an extended loss function called \emph{dual skew divergence} (DSD), which combines two symmetric KL divergence terms through a balanced weight.

We find empirically that this balanced weight plays a crucial role when applying the proposed DSD loss to deep models.

We therefore develop a controllable DSD loss for general-purpose scenarios.

Our experiments indicate that switching to the DSD loss after maximum likelihood (ML) training has converged helps models escape local optima and yields stable performance improvements.

Our evaluations on the WMT 2014 English-German and English-French translation tasks demonstrate that the proposed loss, as a general and convenient means for NMT training, improves performance over strong baselines. % and improves the diversity of the text as well

\end{abstract}
