\begin{abstract}
With the growing importance of large network models and enormous training
datasets, GPUs have become increasingly necessary to train neural networks.
This is largely because conventional optimization algorithms rely on
stochastic gradient methods that do not scale well to large numbers of cores in
a cluster setting. Furthermore, the convergence of all gradient methods,
including batch methods, suffers from common problems like saturation
effects, poor conditioning, and saddle points. This paper explores an
unconventional training method that uses alternating direction methods and
Bregman iteration to train networks without gradient descent steps. The
proposed method reduces the network training problem to a sequence of
minimization sub-steps that can each be solved {\em globally} in closed form.
This approach is advantageous because it avoids many of the caveats that
make gradient methods slow on highly non-convex problems. The
method exhibits strong scaling in the distributed setting, yielding linear
speedups even when split over thousands of cores.
\end{abstract}