17416a20362c1f00.tex
1: \begin{abstract}
2: % \red{Pier proposal:}
3: We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize of about $2/\textrm{sharpness}$. For even larger stepsizes, GD still converges, but possibly very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution.
4: % Moreover, the norm and the sharpness decrease as the step size is increased.
5: Our analysis reveals a trade-off between the speed of convergence and the magnitude of implicit regularization.
6: % Precisely, the slower the convergence the flatter will be the solution and the higher the implied regularization, the slower the convergence.
7: This sheds light on the benefits of training at the ``Edge of Stability'', which induces additional regularization by delaying convergence, and may have implications for training more complex models.
8: % \red{Blake first version:}
9: % We study the gradient descent (GD) dynamics of a depth-2 linear neural net with a single input and output. We show that with a big stepsize, GD converges at an explicit linear rate to a global minimum of the training loss. For larger stepsizes, convergence is still assured, but may be very slow. We also prove that GD implicitly regularizes the sharpness of the solution it reaches, and it reaches a strictly flatter minimum than gradient flow. Our analysis reveals a trade-off between the speed of convergence and the magnitude of regularization which is related to the Edge of Stability phenomenon and has potential implications for training more complex models.
10: \end{abstract}
11: