
\begin{abstract}
We describe the neural-network training framework used in the Kaldi speech
recognition toolkit, which is geared towards training DNNs on large amounts
of training data using multiple GPU-equipped or multi-core machines.  To be
as hardware-agnostic as possible, we needed a way to use multiple machines
without generating excessive network traffic.  Our method is to average the
neural network parameters periodically (typically every minute or two) and
redistribute the averaged parameters to the machines for further training.
Each machine sees different data.  By itself, this method does not work very
well.  However, we also use an approximate and efficient implementation of
Natural Gradient for Stochastic Gradient Descent (NG-SGD), which appears to
allow our periodic-averaging method to work well, and which also
substantially improves the convergence of SGD on a single machine.
\end{abstract}
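
% Illustrative sketch only: the notation below is assumed for exposition
% and is not taken from the abstract.
As a rough sketch of the periodic averaging step, suppose there are $N$
machines and that machine $n$ holds parameters $\theta_n$ at the end of an
averaging period.  The symbols $N$, $\theta_n$ and $\bar{\theta}$ are
notation assumed here for illustration only.  The averaging and
redistribution could then be written as
\[
  \bar{\theta} = \frac{1}{N} \sum_{n=1}^{N} \theta_n ,
  \qquad
  \theta_n \leftarrow \bar{\theta} \quad \text{for } n = 1, \dots, N ,
\]
after which each machine continues SGD from $\bar{\theta}$ on its own data.
In its generic textbook form, natural-gradient SGD replaces the plain update
$\theta \leftarrow \theta - \eta\, g$ with
\[
  % \hat{F} denotes some estimate of the Fisher information matrix (assumed notation)
  \theta \leftarrow \theta - \eta\, \hat{F}^{-1} g ,
\]
where $\eta$ is the learning rate, $g$ is the stochastic gradient and
$\hat{F}$ is an estimate of the Fisher information matrix.  This generic form
is given only for orientation; it says nothing about how the approximation
used in NG-SGD is constructed.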
