
\begin{abstract}

Quasi-Newton methods still face significant challenges in training large-scale neural networks because of the additional compute cost of Hessian-related computations and instability in stochastic training.

L-BFGS, a well-known method that efficiently approximates the Hessian from the history of parameter and gradient changes, suffers from convergence instability in stochastic training.

So far, attempts to adapt L-BFGS to large-scale stochastic training have incurred considerable extra overhead, which offsets their convergence benefits in wall-clock time.

In this paper, we propose \method{}, a lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization.

\method{} introduces a nearly cost-free momentum scheme into the L-BFGS update, which greatly reduces stochastic noise in the Hessian approximation and thereby stabilizes convergence during stochastic optimization.

For model training at a large scale, \method{} approximates a block-wise Hessian, which allows compute and memory costs to be distributed across all computing nodes.

We provide a supporting convergence analysis for \method{} in stochastic settings.

To investigate \method{}'s potential in large-scale DNN training, we train benchmark neural models using \method{} and compare its performance against baselines (SGD, Adam, and other quasi-Newton methods).

Results show that \method{} achieves noticeable speedups both iteration-wise and in wall-clock time.

\end{abstract}
