\begin{abstract}
Deep neural networks are typically trained by optimizing a loss function with an
SGD variant, in conjunction with a decaying learning rate, until convergence. We show that
simple averaging of multiple points along the trajectory of SGD, with a cyclical
or constant learning rate, leads to better generalization than conventional training.
We also show that this \emph{Stochastic Weight Averaging} (SWA)
procedure finds much flatter solutions than SGD, and approximates the
recent \emph{Fast Geometric Ensembling} (FGE) approach with a single model.
Using SWA, we achieve notable improvements in test accuracy over
conventional SGD training on a range of state-of-the-art residual networks,
PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-$10$,
CIFAR-$100$, and ImageNet. In short, SWA is extremely easy to implement,
improves generalization, and has almost no computational overhead.

\end{abstract}
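As a minimal sketch of the averaging step, assume that $w_1, \dots, w_n$ denote the weights visited along the SGD trajectory at the chosen averaging points (this notation is introduced here for illustration and is not fixed by the abstract). Under that assumption, the SWA solution is their simple average, which can be maintained as a running mean so that only one extra copy of the weights needs to be stored:
\[
  w_{\mathrm{SWA}} = \frac{1}{n} \sum_{i=1}^{n} w_i,
  \qquad
  \bar{w}_k = \bar{w}_{k-1} + \frac{1}{k}\,\bigl(w_k - \bar{w}_{k-1}\bigr),
  \quad \bar{w}_1 = w_1,
  \quad \bar{w}_n = w_{\mathrm{SWA}}.
\]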
