
\begin{abstract}

Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties.

Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation.

Recently, Pennington \emph{et al.}~used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning.

We show that the simplest architectural change, gating each residual connection with a single zero-initialized parameter, satisfies initial dynamical isometry and outperforms more complex approaches.
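
As a sketch in assumed notation (the symbols $x_i$, $F$, and $\alpha_i$ are introduced here for illustration only), a layer with input $x_i$, transformation $F$, and scalar gate $\alpha_i$ might be written as
\[
  x_{i+1} = x_i + \alpha_i F(x_i), \qquad \alpha_i = 0 \ \mbox{at initialization},
\]
so that every layer starts as the identity map and the input-output Jacobian initially has all singular values equal to one.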

Although much simpler than its predecessors, this gate enables training fully connected networks with thousands of layers with fast convergence, and yields better test performance for ResNets trained on CIFAR-10.

We apply this technique to language modeling and find that we can easily train 120-layer Transformers.

When applied to 12-layer Transformers, the technique yields 56\% faster convergence on enwiki8.

\end{abstract}
