\begin{abstract}
% Weight averaging, which aims to ensemble multiple models in the weight space to create high-performing deep neural networks (DNNs), has recently achieved much attention in the literature.
Despite their simplicity, stochastic gradient descent (SGD)-like algorithms are successful in training deep neural networks (DNNs).
Among various attempts to improve SGD, weight averaging (WA), which averages the weights of multiple models, has recently received much attention in the literature.
Broadly, WA falls into two categories: 1) online WA, which averages the weights of multiple models trained in parallel and is designed to reduce the gradient communication overhead of parallel mini-batch SGD, and 2) offline WA, which averages the weights of one model at different checkpoints and is typically used to improve the generalization ability of DNNs.
Although online and offline WA are similar in form, they are seldom associated with each other. {Moreover, existing methods typically perform either offline or online parameter averaging, but not both.} In this work, we make a first attempt to incorporate online and offline WA into a general training framework termed Hierarchical Weight Averaging (HWA).
By leveraging both online and offline averaging, HWA achieves both faster convergence and superior generalization performance without any fancy learning rate adjustment. {We also empirically analyze the issues faced by existing WA methods and how HWA addresses them.}
Finally, extensive experiments verify that HWA significantly outperforms state-of-the-art methods.
\end{abstract}