abstract:66d203c8e9c602d3.tex

1: \begin{abstract}

2: Compared to gradient descent, the Gauss-Newton method (GN) and its variants are known to converge faster to local optima, at the expense of a higher computational cost per iteration.

3: Still, GN is not widely used for optimizing deep neural networks, despite sustained efforts to reduce its higher computational cost.

4: In this work, we propose to take a step back and re-think the properties of GN in light of recent advances in understanding the dynamics of gradient flows of over-parameterized models and the implicit bias they induce.

5: We first prove a fast global convergence result for the continuous-time limit of the generalized GN in the over-parameterized regime.

6: We then show empirically that GN exhibits both a \emph{kernel regime}, where it generalizes as well as gradient flows, and a \emph{feature learning regime}, where GN induces an implicit bias toward selecting global solutions that systematically under-perform those found by a gradient flow.

7: Importantly, we observe this phenomenon even with enough computational budget to perform exact GN steps over the total training objective.

8: This study suggests the need to go beyond reducing the computational cost of GN for over-parameterized models, towards designing new methods that can trade off optimization speed against the quality of their implicit bias.

9: \end{abstract}

10: