\begin{abstract}
We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems.
We show that when SAM is applied to a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between the two sides of the minimum in the direction with the largest curvature, and we provide bounds on the rate of convergence.

In the non-quadratic case, we show that such oscillations effectively perform gradient descent, with a smaller step size, on the spectral norm of the Hessian.
In such cases, SAM's update may be regarded as a third derivative (the derivative of the Hessian in the leading eigenvector direction) that encourages drift toward wider minima.
\end{abstract}
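
% Illustrative sketch only: the symbols \eta, \rho, \lambda, \ell and the
% one-dimensional setting are assumptions introduced here, not notation
% fixed by the abstract.
The oscillation described in the abstract can be illustrated with a minimal one-dimensional sketch.
Assume the commonly used normalized form of the SAM update with step size $\eta > 0$ and perturbation radius $\rho > 0$, applied to the quadratic $\ell(w) = \tfrac{1}{2}\lambda w^2$ with curvature $\lambda > 0$; these symbols are introduced here for illustration only.
For $w_t \neq 0$ the normalized gradient is $\mathrm{sign}(w_t)$, so the update reads
\[
  w_{t+1}
  \;=\; w_t - \eta\, \ell'\bigl(w_t + \rho\, \mathrm{sign}(w_t)\bigr)
  \;=\; (1 - \eta\lambda)\, w_t - \eta\rho\lambda\, \mathrm{sign}(w_t).
\]
Looking for a two-point cycle $w_{t+1} = -w_t$ with $w_t = a > 0$ gives $-a = (1 - \eta\lambda)\, a - \eta\rho\lambda$, that is,
\[
  a \;=\; \frac{\eta\rho\lambda}{2 - \eta\lambda},
  \qquad \text{assuming } \eta\lambda < 2 .
\]
On any step where the iterate changes sign, the same algebra yields
\[
  |w_{t+1}| - a \;=\; -(1 - \eta\lambda)\,\bigl(|w_t| - a\bigr),
\]
so in this simplified setting the distance to the cycle shrinks by a factor $|1 - \eta\lambda|$ per sign-alternating step.
This is a sketch under the stated assumptions rather than a statement of the general result.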