\begin{abstract}
We propose \emph{BlackOut}, an approximation algorithm to efficiently
train massive recurrent neural network language models (RNNLMs) with
million-word vocabularies. BlackOut is motivated by using a
discriminative loss, and we describe a weighted sampling strategy that
significantly reduces computation while improving stability, sample
efficiency, and rate of convergence. One way to understand BlackOut
is to view it as an extension of the DropOut strategy to the output
layer, wherein we use a discriminative training loss and a weighted
sampling scheme. We also establish close connections between BlackOut,
importance sampling, and noise contrastive estimation (NCE). Our
experiments on the recently released one billion word language
modeling benchmark demonstrate the scalability and accuracy of BlackOut;
we outperform the state of the art and achieve the lowest perplexity
scores on this dataset. Moreover, unlike other established methods,
which typically require GPUs or CPU clusters, we show that a carefully
implemented version of BlackOut requires only 1--10 days on a single
machine to train an RNNLM with a million-word vocabulary and billions
of parameters on one billion words. Although we describe BlackOut in
the context of RNNLM training, it can be applied to any network with a
large softmax output layer.
\end{abstract}
