\begin{abstract}

Deep neural networks are typically trained by optimizing a loss function with an
SGD variant, in conjunction with a decaying learning rate, until convergence.  We show that
simple averaging of multiple points along the trajectory of SGD, with a cyclical
or constant learning rate, leads to better generalization than conventional training.
We also show that this \emph{Stochastic Weight Averaging} (SWA)
procedure finds much flatter solutions than SGD, and approximates the
recent \emph{Fast Geometric Ensembling} (FGE) approach with a single model.
Using SWA, we achieve notable improvements in test accuracy over
conventional SGD training on a range of state-of-the-art residual networks,
PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-$10$,
CIFAR-$100$, and ImageNet.  In short, SWA is extremely easy to implement,
improves generalization, and has almost no computational overhead.

\end{abstract}
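
% Illustrative sketch only; the notation below is assumed for exposition
% and is not taken from the abstract.
The averaging step summarized in the abstract can be sketched as follows.
Assume, purely for illustration, that $w_1, \dots, w_n$ denote the weights
collected along the SGD trajectory (for example, one per learning-rate cycle
or per epoch). Simple averaging then gives
\[
  w_{\mathrm{SWA}} \;=\; \frac{1}{n} \sum_{i=1}^{n} w_i ,
\]
which can equivalently be maintained as a running average,
\[
  \bar{w}_{k+1} \;=\; \frac{k\,\bar{w}_{k} + w_{k+1}}{k+1},
  \qquad \bar{w}_{1} = w_{1},
\]
so that $\bar{w}_{n} = w_{\mathrm{SWA}}$ and only one additional copy of the
weights needs to be kept during training.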