\begin{abstract}
  Deep neural networks have been used in various machine learning applications and have achieved tremendous empirical success.
  However, training deep neural networks is a challenging task.
  Many alternatives to end-to-end back-propagation have been proposed.
  Layer-wise training is one of them: it trains a single layer at a time, rather than all layers simultaneously.
  In this paper, we study layer-wise training using block coordinate gradient descent (BCGD) for deep linear networks.
  We establish a general convergence analysis of BCGD and find the optimal learning rate, which results in the fastest decrease in the loss.
%  More importantly, the optimal learning rate can directly be applied in practice, as it does not require any prior knowledge.
  We identify the effects of depth, width, and initialization.
%   in the training process.
  When the orthogonal-like initialization is employed, we show that
  the width of the intermediate layers plays no role in gradient-based training
  once it exceeds a certain threshold.
%  Also, a very deep network finds the global optimum after updating each weight matrix only once.
  In addition, we find that using deep networks can drastically accelerate convergence compared to a depth-1 network, even when the computational cost is taken into account.
  Numerical examples are provided to justify our theoretical findings and demonstrate the performance of layer-wise training by BCGD.
\end{abstract}
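
% Illustrative sketch only: the notation and the squared loss below are assumptions
% introduced for orientation and are not taken from the abstract.
To fix ideas, the layer-wise scheme studied here can be sketched as follows, under assumed notation.
Suppose a depth-$L$ linear network has weight matrices $W_1,\dots,W_L$, input data $X$, and targets $Y$, and suppose (as one possible choice) that it is trained on the squared loss
\[
  \mathcal{L}(W_1,\dots,W_L) \;=\; \tfrac{1}{2}\,\bigl\| W_L W_{L-1}\cdots W_1 X - Y \bigr\|_F^2 .
\]
A single BCGD step would then update one block $W_\ell$ while all other layers are held fixed:
\[
  W_\ell \;\leftarrow\; W_\ell - \eta_\ell\, \nabla_{W_\ell}\mathcal{L},
  \qquad
  \nabla_{W_\ell}\mathcal{L}
  \;=\; \bigl(W_L\cdots W_{\ell+1}\bigr)^{\top}
        \bigl(W_L\cdots W_1 X - Y\bigr)
        \bigl(W_{\ell-1}\cdots W_1 X\bigr)^{\top},
\]
where an empty product is read as the identity, so the left factor is $I$ for $\ell = L$ and the right factor is $X^{\top}$ for $\ell = 1$.
Here $\eta_\ell$ denotes a hypothetical per-layer learning rate. Because the loss is quadratic in $W_\ell$ when the other layers are frozen, it is plausible that a step size giving the largest one-step decrease can be characterized, which is the kind of quantity the convergence analysis addresses.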