\begin{abstract}
	We study the dynamics of gradient descent on objective
	functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar
	parameters $w_1,\ldots,w_k$), which arise in the context of
	training depth-$k$ linear neural networks. We prove that for standard
	random initializations, and under mild assumptions on $f$, the number of
	iterations required for convergence scales exponentially with the depth
	$k$. We also show empirically that this phenomenon can occur in higher
	dimensions, where each $w_i$ is a matrix. This highlights a potential
	obstacle in understanding the convergence of gradient-based methods for
	deep linear neural networks, where $k$ is large.
\end{abstract}
