\begin{abstract}
  Deep neural networks have been used in a wide range of machine learning applications and have achieved tremendous empirical success.
  However, training deep neural networks is a challenging task.
  Many alternatives to end-to-end back-propagation have been proposed.
  Layer-wise training is one of them: it trains a single layer at a time rather than training all layers simultaneously.
  In this paper, we study layer-wise training using block coordinate gradient descent (BCGD) for deep linear networks.
  We establish a general convergence analysis of BCGD and find the optimal learning rate, which results in the fastest decrease in the loss.
%  More importantly, the optimal learning rate can directly be applied in practice, as it does not require any prior knowledge.
  We identify the effects of depth, width, and initialization.
%   in the training process.
  When an orthogonal-like initialization is employed, we show that
  the width of the intermediate layers plays no role in gradient-based training
  once it exceeds a certain threshold.
%  Also, a very deep network finds the global optimum after updating each weight matrix only once.
  In addition, we find that using deep networks can drastically accelerate convergence compared with a depth-1 network, even when the computational cost is taken into account.
  Numerical examples are provided to support our theoretical findings and to demonstrate the performance of layer-wise training by BCGD.
\end{abstract}