4378c8daa3ffcbf2.tex
1: \begin{abstract}
2:   In this paper, we show that although the minimizers of cross-entropy and
3:   related classification losses are off at infinity, network weights learned by
4:   gradient flow converge \emph{in direction}, with an immediate corollary that
5:   network predictions, training errors, and the margin distribution also
6:   converge.
7:   This result holds for deep homogeneous networks
8:   ---
9:   a broad class of networks allowing for ReLU, max-pooling, linear, and
10:   convolutional layers
11:   ---
12:   and we additionally provide empirical support not only on networks close to
13:   the theory (e.g., AlexNet), but also on non-homogeneous networks (e.g.,
14:   DenseNet).
15:   If the network further has locally Lipschitz gradients, we show that these
16:   gradients also converge in direction, and asymptotically \emph{align} with the
17:   gradient flow path, with consequences for margin maximization, convergence of saliency maps,
18:   and a few other settings.
19:   Our analysis complements and is distinct from the well-known neural tangent
20:   and mean-field theories, and in particular makes no requirements on network
21:   width and initialization, instead merely requiring perfect classification
22:   accuracy.
23:   The proof proceeds by developing a theory of unbounded nonsmooth
24:   Kurdyka--{\L}ojasiewicz inequalities for functions definable in an o-minimal
25:   structure, and is also applicable outside deep learning.
26: \end{abstract}
27: