\begin{abstract}
We describe the neural-network training framework used in the Kaldi speech
recognition toolkit, which is geared towards training DNNs on large amounts
of training data using multiple GPU-equipped or multi-core machines.  To be
as hardware-agnostic as possible, we needed a way to use multiple
machines without generating excessive network traffic.  Our method is to
average the neural network parameters periodically (typically every minute or
two) and redistribute the averaged parameters to the machines for further
training.  Each machine sees different data.  By itself, this method does not
work very well.  However, we have another method, an approximate and efficient
implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD),
which seems to allow our periodic-averaging method to work well and also
substantially improves the convergence of SGD on a single machine.
\end{abstract}
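
As a minimal sketch of the periodic-averaging step summarized in the abstract,
suppose there are $N$ machines and that machine $j$ holds parameters $\theta_j$
after training on its own data for one averaging period.  The symbols $N$,
$\theta_j$ and $\bar{\theta}$ are notation assumed here for illustration only,
and the unweighted mean is likewise an assumption:
\[
  \bar{\theta} = \frac{1}{N} \sum_{j=1}^{N} \theta_j ,
  \qquad
  \theta_j \leftarrow \bar{\theta} \quad \mbox{for } j = 1, \ldots, N .
\]
Under this reading, each machine sends and receives one copy of the parameters
per averaging period, rather than exchanging data on every minibatch.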