\begin{abstract}
  We propose \emph{BlackOut}, an approximation algorithm to efficiently
  train massive recurrent neural network language models (RNNLMs) with
  million-word vocabularies. BlackOut is motivated by using a
  discriminative loss, and we describe a weighted sampling strategy which
  significantly reduces computation while improving stability, sample
  efficiency, and rate of convergence. One way to understand BlackOut
  is to view it as an extension of the DropOut strategy to the output
  layer, wherein we use a discriminative training loss and a weighted
  sampling scheme. We also establish close connections between BlackOut,
  importance sampling, and noise contrastive estimation (NCE). Our
  experiments, on the recently released one billion word language
  modeling benchmark, demonstrate the scalability and accuracy of BlackOut;
  we outperform the state of the art and achieve the lowest perplexity
  scores on this dataset. Moreover, unlike other established methods,
  which typically require GPUs or CPU clusters, we show that a carefully
  implemented version of BlackOut requires only 1--10 days on a single
  machine to train an RNNLM with a million-word vocabulary and billions
  of parameters on one billion words. Although we describe BlackOut in
  the context of RNNLM training, it can be applied to any network with a
  large softmax output layer.
\end{abstract}
