\begin{abstract}
This paper establishes risk convergence and
asymptotic weight matrix alignment
---
a form of implicit regularization
---
of gradient flow and gradient descent when applied to deep linear networks
on linearly separable data.
In more detail, for gradient flow applied to strictly decreasing
loss functions (with similar results for gradient descent with
particular decreasing step sizes):
(i) the risk converges to $0$;
(ii) the normalized $i^\text{th}$ weight matrix asymptotically equals its
rank-$1$ approximation $u_iv_i^\top$;
(iii) these rank-$1$ matrices are
aligned across layers, meaning $|v_{i+1}^\top u_i|\to1$.
In the case of the logistic loss (binary cross entropy), more
can be said: the linear function induced by the network ---
the product of its weight matrices ---
converges to the same direction as the maximum margin solution.
This last property was identified in prior work,
but only under assumptions on gradient descent which
here are implied by the alignment phenomenon.
\end{abstract}