1: \begin{abstract} 
2:     It is generally accepted that starting neural network training with large learning rates (LRs) improves generalization.
3:     Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting focusing on two questions: 1) how large an initial LR is required for obtaining optimal quality, and 2) what are the key differences between models trained with different LRs? 
4:     We discover that only a narrow range of initial LRs, slightly above the convergence threshold, leads to optimal results after fine-tuning with a small LR or weight averaging. 
5:     By studying the local geometry of the reached minima, we observe that using LRs from this optimal range allows the optimization to locate a basin that contains only high-quality minima. 
6:     Additionally, we show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task.
7:     In contrast, starting training with LRs that are too small leads to unstable minima and to an attempt to learn all features simultaneously, resulting in poor generalization. 
8:     Conversely, using initial LRs that are too large fails both to detect a basin with good solutions and to extract meaningful patterns from the data.
9: \end{abstract}
10: