\begin{abstract}
For deep learning models to be deployed in production, they need to be compact enough to meet latency and memory constraints. A compact model is usually both deep and thin. In this paper, we propose an efficient method for training a very deep and thin network with a theoretical guarantee. Our method is motivated by model compression and consists of three stages. In the first stage, we sufficiently widen the deep thin network and train it until convergence. In the second stage, we use this well-trained deep wide network to warm up (or initialize) the original deep thin network. This is achieved by letting the thin network imitate the intermediate outputs of the wide network from layer to layer. In the last stage, we further fine-tune this well warmed-up deep thin network. The theoretical guarantee is established via mean-field analysis, which shows that the proposed method is provably more efficient than training deep thin networks from scratch by backpropagation. We also conduct large-scale empirical experiments to validate our approach. When trained with our method, ResNet50 can outperform ResNet101, and $\text{BERT}_\text{BASE}$ can be comparable with $\text{BERT}_\text{LARGE}$, where ResNet101 and $\text{BERT}_\text{LARGE}$ are trained with the standard procedures from the literature.
\end{abstract}