1: \begin{abstract}
2: % Weight averaging, which aims to ensemble multiple models in the weight space to create high-performing deep neural networks (DNNs), has recently achieved much attention in the literature.
3: Despite their simplicity, stochastic gradient descent (SGD)-like algorithms are successful in training deep neural networks (DNNs).
4: Among various attempts to improve SGD, weight averaging (WA), which averages the weights of multiple models, has recently received much attention in the literature.
5: Broadly, WA falls into two categories: 1) online WA, which averages the weights of multiple models trained in parallel and is designed to reduce the gradient communication overhead of parallel mini-batch SGD, and 2) offline WA, which averages the weights of one model at different checkpoints and is typically used to improve the generalization ability of DNNs.
6: Though online and offline WA are similar in form, they are seldom associated with each other, and existing methods typically perform one or the other but not both. In this work, we make a first attempt to incorporate online and offline WA into a general training framework termed Hierarchical Weight Averaging (HWA).
7: By leveraging both online and offline averaging, HWA achieves both faster convergence and superior generalization performance without any fancy learning rate adjustment. We also empirically analyze the issues faced by existing WA methods and how HWA addresses them.
8: Finally, extensive experiments verify that HWA significantly outperforms state-of-the-art methods.
9:
10:
11:
12:
13:
14:
15: \end{abstract}
16: