\begin{abstract}
In this paper, we show that although the minimizers of cross-entropy and
related classification losses lie at infinity, network weights learned by
gradient flow converge \emph{in direction}; as an immediate corollary,
network predictions, training errors, and the margin distribution also
converge.
The proof holds for deep homogeneous networks
(a broad class of networks allowing for ReLU, max-pooling, linear, and
convolutional layers),
and we additionally provide empirical support not only on networks close to
the theory (e.g., AlexNet), but also on non-homogeneous networks (e.g.,
DenseNet).
If the network further has locally Lipschitz gradients, we show that these
gradients also converge in direction, and asymptotically \emph{align} with the
gradient flow path, with consequences for margin maximization, convergence of
saliency maps, and a few other settings.
Our analysis complements and is distinct from the well-known neural tangent
and mean-field theories, and in particular makes no requirements on network
width or initialization, instead merely requiring perfect classification
accuracy.
The proof proceeds by developing a theory of unbounded nonsmooth
Kurdyka--{\L}ojasiewicz inequalities for functions definable in an o-minimal
structure, and is also applicable outside deep learning.
\end{abstract}