\begin{abstract}
With the growing importance of large network models and enormous training
datasets, GPUs have become increasingly necessary to train neural networks.
This is largely because conventional optimization algorithms rely on
stochastic gradient methods that do not scale well to large numbers of cores in
a cluster setting. Furthermore, the convergence of all gradient methods,
including batch methods, suffers from common problems such as saturation
effects, poor conditioning, and saddle points. This paper explores an
unconventional training method that uses alternating direction methods and
Bregman iteration to train networks without gradient descent steps. The
proposed method reduces the network training problem to a sequence of
minimization sub-steps, each of which can be solved {\em globally} in closed
form. It is advantageous because it avoids many of the pitfalls that
make gradient methods slow on highly non-convex problems. The
method exhibits strong scaling in the distributed setting, yielding linear
speedups even when split over thousands of cores.
\end{abstract}