abstract:c0aeef9c66fb6870.tex

1: \begin{abstract}

2: % Weight averaging, which aims to ensemble multiple models in the weight space to create high-performing deep neural networks (DNNs), has recently achieved much attention in the literature.

3: Despite their simplicity, stochastic gradient descent (SGD)-like algorithms are successful in training deep neural networks (DNNs).

4: Among various attempts to improve SGD, weight averaging (WA), which averages the weights of multiple models, has recently received much attention in the literature.

5: Broadly, WA falls into two categories: 1) online WA, which averages the weights of multiple models trained in parallel, is designed to reduce the gradient communication overhead of parallel mini-batch SGD, and 2) offline WA, which averages the weights of one model at different checkpoints, is typically used to improve the generalization ability of DNNs.

6: Although online and offline WA are similar in form, they are seldom associated with each other: existing methods typically perform either online or offline averaging, but not both. In this work, we make a first attempt to incorporate online and offline WA into a general training framework termed Hierarchical Weight Averaging (HWA).

7: By leveraging both online and offline averaging, HWA achieves both faster convergence and superior generalization performance without any fancy learning rate adjustment. We also empirically analyze the issues faced by existing WA methods and how HWA addresses them.

8: Finally, extensive experiments verify that HWA significantly outperforms state-of-the-art methods.

9:

10:

11:

12:

13:

14:

15: \end{abstract}

16: