\begin{abstract}
  In this paper, we show that although the minimizers of cross-entropy and
  related classification losses lie at infinity, network weights learned by
  gradient flow converge \emph{in direction}, with an immediate corollary that
  network predictions, training errors, and the margin distribution also
  converge.
  This result holds for deep homogeneous networks (a broad class of networks
  allowing for ReLU, max-pooling, linear, and convolutional layers), and we
  additionally provide empirical support both for networks close to the theory
  (e.g., AlexNet) and for non-homogeneous networks (e.g., DenseNet).
  If the network further has locally Lipschitz gradients, we show that these
  gradients also converge in direction and asymptotically \emph{align} with the
  gradient flow path, with consequences for margin maximization, convergence of
  saliency maps, and a few other settings.
  Our analysis complements and is distinct from the well-known neural tangent
  and mean-field theories, and in particular makes no requirements on network
  width and initialization, instead merely requiring perfect classification
  accuracy.
  The proof proceeds by developing a theory of unbounded nonsmooth
  Kurdyka--{\L}ojasiewicz inequalities for functions definable in an o-minimal
  structure; this theory is also applicable outside deep learning.
\end{abstract}