5e0e8df19e30ee62.tex
1: \begin{abstract}
2: The success of deep architectures is at least in part attributed to the layer-by-layer unsupervised pre-training that initializes the network.
3: Various papers have reported extensive empirical analysis focusing on the design and implementation of good pre-training procedures.
4: However, an understanding pertaining to the consistency of parameter estimates, the convergence of learning procedures and the sample size estimates is still unavailable in the literature.
5: In this work, we study pre-training in classical and distributed denoising autoencoders with these goals in mind. 
6: We show that the gradient converges at the rate of $\frac{1}{\sqrt{N}}$ and has a sub-linear dependence on the size of the autoencoder network.
7: In a distributed setting where disjoint sections of the whole network are pre-trained synchronously, we show that the convergence improves by at least $\tau^{3/4}$, where $\tau$ corresponds to the size of the sections. 
8: We provide a broad set of experiments to empirically evaluate the suggested behavior.
9: \end{abstract}
10: