abstract:10dce2a31934ffea.tex

1: \begin{abstract}

2:     Recent findings by \cite{cohen_gradient_2021} demonstrate that when neural networks are trained with full-batch gradient descent with step size $\eta$, the sharpness (defined as the largest eigenvalue of the full-batch Hessian) consistently stabilizes at $2/\eta$. These results have significant implications for convergence and generalization. However, this behavior was observed not to hold for mini-batch stochastic gradient descent (SGD), which limits the broader applicability of these findings.

3:     We show that SGD trains in a different regime, which we call Edge of Stochastic Stability. In this regime, the quantity that hovers at $2/\eta$ is instead the average, over mini-batches, of the largest eigenvalue of the Hessian of the mini-batch loss (\textsc{MiniBS}), which is always larger than the sharpness.
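
    As a sketch of why the two quantities are ordered, using notation assumed here for illustration ($L$ for the full-batch loss, $L_B$ for the loss on a uniformly sampled mini-batch $B$, and $\lambda_{\max}$ for the largest eigenvalue), one can write
    \[
        \textsc{MiniBS}
        = \mathbb{E}_B\!\left[\lambda_{\max}\!\left(\nabla^2 L_B\right)\right]
        \;\ge\; \lambda_{\max}\!\left(\mathbb{E}_B\!\left[\nabla^2 L_B\right]\right)
        = \lambda_{\max}\!\left(\nabla^2 L\right),
    \]
    where the inequality follows from the convexity of $\lambda_{\max}$ on symmetric matrices (Jensen's inequality), and the final equality assumes that the mini-batch losses average to the full-batch loss. Equality requires the mini-batch Hessians to share a common top eigenvector, so any misalignment between them makes the gap strict.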

4:     This implies that the sharpness is generally lower when training with smaller batches or larger learning rates, providing a basis for the observed implicit regularization effect of SGD towards flatter minima and for a number of well-established empirical phenomena.

5:     Additionally, we quantify the gap between \textsc{MiniBS} and the sharpness, further characterizing this distinct training regime.

6:     % The idea is that the average highest eigenvalue of the mini-batch hessian is in practice much higher than the one of the full batch hessian because they are misaligned. And the answer is that the average highest eigenvalue of the mini-batch hessian is the thing that grows to 2/eta and it trains at the edge of stability.

7:     % This is also an explanation for why SGD gets to solution with smaller hessian as keskar observed I think, or why catapult happen (when you sample one that is completely misaligned).

8:     % Also it’s a good way to show that SGD noise is structured and how because when you do gd plus Gaussian noise it doesn’t show this behavior

9: \end{abstract}

10: