\begin{abstract}
With the growing importance of large network models and enormous training
datasets, GPUs have become increasingly necessary to train neural networks.
This is largely because conventional optimization algorithms rely on
stochastic gradient methods that do not scale well to large numbers of cores in
a cluster setting. Furthermore, the convergence of all gradient methods,
including batch methods, suffers from common problems such as saturation
effects, poor conditioning, and saddle points. This paper explores an
unconventional training method that uses alternating direction methods and
Bregman iteration to train networks without gradient descent steps. The
proposed method reduces the network training problem to a sequence of
minimization sub-steps that can each be solved {\em globally} in closed form.
It is advantageous because it avoids many of the pitfalls that
make gradient methods slow on highly non-convex problems. The
method exhibits strong scaling in the distributed setting, yielding linear
speedups even when split over thousands of cores.
\end{abstract}