\begin{abstract}
  We propose \emph{BlackOut}, an approximation algorithm to efficiently
  train massive recurrent neural network language models (RNNLMs) with
  million-word vocabularies. BlackOut is motivated by using a
  discriminative loss, and we describe a weighted sampling strategy which
  significantly reduces computation while improving stability, sample
  efficiency, and rate of convergence. One way to understand BlackOut
  is to view it as an extension of the DropOut strategy to the output
  layer, wherein we use a discriminative training loss and a weighted
  sampling scheme. We also establish close connections between BlackOut,
  importance sampling, and noise contrastive estimation (NCE). Our
  experiments on the recently released one-billion-word language
  modeling benchmark demonstrate the scalability and accuracy of BlackOut;
  we outperform the state of the art and achieve the lowest perplexity
  scores on this dataset. Moreover, unlike other established methods,
  which typically require GPUs or CPU clusters, we show that a carefully
  implemented version of BlackOut requires only 1--10 days on a single
  machine to train an RNNLM with a million-word vocabulary and billions
  of parameters on one billion words. Although we describe BlackOut in
  the context of RNNLM training, it can be applied to any network with a
  large softmax output layer.
\end{abstract}