\begin{abstract}

    It is generally accepted that starting neural network training with large learning rates (LRs) improves generalization.

    Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting focusing on two questions: 1) how large an initial LR is required for obtaining optimal quality, and 2) what are the key differences between models trained with different LRs?

    We discover that only a narrow range of initial LRs slightly above the convergence threshold leads to optimal results after fine-tuning with a small LR or weight averaging.

    By studying the local geometry of reached minima, we observe that using LRs from this optimal range allows the optimization to locate a basin that contains only high-quality minima.

    Additionally, we show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task.

    In contrast, starting training with LRs that are too small leads to unstable minima and to an attempt to learn all features simultaneously, resulting in poor generalization.

    Conversely, using initial LRs that are too large fails both to locate a basin with good solutions and to extract meaningful patterns from the data.

\end{abstract}
