
\begin{abstract}\label{abstract}
Nonlinear conjugate gradient (NLCG) based optimizers have shown superior loss convergence properties compared
to gradient descent based optimizers for traditional optimization problems. However, in
Deep Neural Network (DNN) training, the dominant optimization algorithm of choice is still
Stochastic Gradient Descent (SGD) and its variants. In this work, we propose and evaluate a stochastic
preconditioned nonlinear conjugate gradient algorithm for large-scale DNN training tasks. We show that a
nonlinear conjugate gradient algorithm improves the convergence speed of DNN training, especially
in the large mini-batch scenario, which is essential for scaling synchronous distributed
DNN training to a large number of workers. We show how to efficiently use second-order information in the NLCG
preconditioner to improve DNN training convergence. For the ImageNet classification task,
at extremely large mini-batch sizes of greater than 65k, the NLCG optimizer improves top-1 accuracy by
more than 10 percentage points for standard training of the ResNet-50 model for 90 epochs.
For the CIFAR-100 classification task, at extremely large mini-batch sizes of greater than 16k, the NLCG optimizer
improves top-1 accuracy by more than 15 percentage points for standard training of the ResNet-32 model
for 200 epochs.
\end{abstract}
