\begin{abstract}
The success of deep architectures is attributed, at least in part, to the layer-by-layer unsupervised pre-training that initializes the network.
Various papers have reported extensive empirical analyses focusing on the design and implementation of good pre-training procedures.
However, an understanding of the consistency of parameter estimates, the convergence of learning procedures, and sample size estimates is still unavailable in the literature.
In this work, we study pre-training in classical and distributed denoising autoencoders with these goals in mind.
We show that the gradient converges at the rate of $\frac{1}{\sqrt{N}}$ and has a sub-linear dependence on the size of the autoencoder network.
In a distributed setting where disjoint sections of the whole network are pre-trained synchronously, we show that the convergence improves by at least $\tau^{3/4}$, where $\tau$ corresponds to the size of the sections.
We provide a broad set of experiments to empirically evaluate the predicted behavior.
\end{abstract}
