\begin{abstract}\label{abstract}
Nonlinear conjugate gradient (NLCG) based optimizers have shown superior loss convergence properties compared
to gradient descent based optimizers for traditional optimization problems. However, in
Deep Neural Network (DNN) training, the dominant optimization algorithm is still
Stochastic Gradient Descent (SGD) and its variants. In this work, we propose and evaluate a stochastic
preconditioned nonlinear conjugate gradient algorithm for large-scale DNN training tasks. We show that the
NLCG algorithm improves the convergence speed of DNN training, especially
in the large mini-batch scenario, which is essential for scaling synchronous distributed
DNN training to a large number of workers. We also show how to use second-order information efficiently in the NLCG
preconditioner to improve DNN training convergence. For the ImageNet classification task,
at extremely large mini-batch sizes of greater than 65k, the NLCG optimizer improves top-1 accuracy by
more than 10 percentage points for standard training of the ResNet-50 model for 90 epochs.
For the CIFAR-100 classification task, at extremely large mini-batch sizes of greater than 16k, the NLCG optimizer
improves top-1 accuracy by more than 15 percentage points for standard training of the ResNet-32 model
for 200 epochs.
\end{abstract}