\begin{abstract}
It is generally accepted that starting neural network training with large learning rates (LRs) improves generalization.
Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting focusing on two questions: 1) how large an initial LR is required for obtaining optimal quality, and 2) what are the key differences between models trained with different LRs?
We discover that only a narrow range of initial LRs, slightly above the convergence threshold, leads to optimal results after fine-tuning with a small LR or weight averaging.
By studying the local geometry of the reached minima, we observe that using LRs from this optimal range allows the optimization to locate a basin that contains only high-quality minima.
Additionally, we show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task.
In contrast, starting training with LRs that are too small leads to unstable minima and to an attempt to learn all features simultaneously, resulting in poor generalization.
Conversely, using initial LRs that are too large fails to detect a basin with good solutions and to extract meaningful patterns from the data.
\end{abstract}