\begin{abstract}

Deep neural networks are typically trained by optimizing a loss function with an
SGD variant, in conjunction with a decaying learning rate, until convergence. We show that
simple averaging of multiple points along the trajectory of SGD, with a cyclical
or constant learning rate, leads to better generalization than conventional training.
We also show that this \emph{Stochastic Weight Averaging} (SWA)
procedure finds much flatter solutions than SGD, and approximates the
recent \emph{Fast Geometric Ensembling} (FGE) approach with a single model.
Using SWA we achieve notable improvement in test accuracy over
conventional SGD training on a range of state-of-the-art residual networks,
PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-$10$,
CIFAR-$100$, and ImageNet. In short, SWA is extremely easy to implement,
improves generalization, and has almost no computational overhead.

\end{abstract}
