\begin{abstract}
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\eta$, SGD tracks Gradient Descent (GD) until it gets close to such a manifold, where the gradient noise prevents further convergence. In such a regime, \citet{blanc2020implicit} proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of the loss, $\tr[\nabla^2 L]$.
The current paper gives a general framework for such analysis by adapting ideas from~\cite{katzenberger1991solutions}. It allows, in principle, a complete characterization of the regularization effect of SGD around such a manifold (i.e., the ``implicit bias'') using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a \emph{global} analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of \citet{blanc2020implicit}, which is only valid for $\eta^{-1.6}$ steps,
and (2) an analysis that allows \emph{arbitrary} noise covariance.
As an application, we show that, with arbitrarily large initialization, label noise SGD can always escape the kernel regime and requires only $O(\kappa\ln d)$ samples for learning a $\kappa$-sparse overparametrized linear model in $\mathbb{R}^d$~\citep{woodworth2020kernel}, while GD initialized in the kernel regime requires $\Omega(d)$ samples. This upper bound is minimax optimal and improves on the previous $\widetilde{O}(\kappa^2)$ upper bound~\citep{haochen2020shape}.
\end{abstract}