\begin{abstract}
	We study the dynamics of gradient descent on objective
	functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar
	parameters $w_1,\ldots,w_k$), which arise in the context of
	training depth-$k$ linear neural networks. We prove that for standard
	random initializations, and under mild assumptions on $f$, the number of
	iterations required for convergence scales exponentially with the depth
	$k$. We also show empirically that this phenomenon can occur in higher
	dimensions, where each $w_i$ is a matrix. This highlights a potential
	obstacle in understanding the convergence of gradient-based methods for
	deep linear neural networks, where $k$ is large.
\end{abstract}