caec4e3ee5ffed6d.tex
1: \begin{abstract}
2: % For neural sequence model training, maximum likelihood (ML) has been commonly adopted to optimize model parameters with respect to corresponding objectives.
3: In sequence prediction tasks such as neural machine translation (NMT), training with the cross-entropy loss often leads to models that overgeneralize and become trapped in local optima.
4: In this paper, we propose an extended loss function called \emph{dual skew divergence} (DSD) that combines two symmetric KL-divergence terms through a balancing weight.
5: We find empirically that this balancing weight plays a crucial role when applying the proposed DSD loss to deep models.
6: We therefore develop a controllable DSD loss for general-purpose scenarios.
7: Our experiments indicate that switching to the DSD loss after maximum likelihood (ML) training has converged helps models escape local optima and yields stable performance improvements.
8: Our evaluations on the WMT 2014 English-German and English-French translation tasks demonstrate that the proposed loss, as a general and convenient means for NMT training, indeed improves performance over strong baselines. % and improves the diversity of the text as well
9: \end{abstract}
10: