1: \begin{abstract}
2: Deep neural networks have been used in various machine learning applications and have achieved tremendous empirical success.
3: However, training deep neural networks is a challenging task.
4: Many alternatives to end-to-end back-propagation have been proposed.
5: Layer-wise training is one such alternative: it trains a single layer at a time rather than all layers simultaneously.
6: In this paper, we study layer-wise training using block coordinate gradient descent (BCGD) for deep linear networks.
7: We establish a general convergence analysis of BCGD and find the optimal learning rate, which yields the fastest decrease in the loss.
8: % More importantly, the optimal learning rate can directly be applied in practice, as it does not require any prior knowledge.
9: We identify the effects of depth, width, and initialization.
10: % in the training process.
11: When an orthogonal-like initialization is employed, we show that
12: the width of intermediate layers plays no role in gradient-based training
13: beyond a certain threshold.
14: % Also, a very deep network finds the global optimum after updating each weight matrix only once.
15: Moreover, we find that using deep networks can drastically accelerate convergence compared with a depth-1 network, even when the computational cost is taken into account.
16: Numerical examples are provided to justify our theoretical findings and demonstrate the performance of layer-wise training by BCGD.
17: \end{abstract}
18: