abstract:415406dff9da085b.tex

1: \begin{abstract}

2: Residual learning has become widespread in deep and scalable neural nets.

3: However, the fundamental principles behind the success of residual learning remain elusive, which hinders the effective training of plain nets that scale in depth.

4: In this paper, we peek behind the curtains of residual learning by uncovering the ``dissipating inputs'' phenomenon that leads to convergence failure in plain neural nets: non-linearities gradually degrade the input as it passes through plain layers, making it difficult to learn feature representations.

5: We theoretically demonstrate how plain neural nets degenerate the input to random noise, and we show that a residual connection addresses this by maintaining a better lower bound on the number of surviving neurons.

6: Building on these theoretical findings, we propose ``The Plain Neural Net Hypothesis'' (PNNH), which identifies the internal path across non-linear layers as the most critical part of residual learning and establishes a paradigm for training deep plain neural nets without residual connections.

7: We thoroughly evaluate PNNH-enabled CNN architectures and Transformers on popular vision benchmarks, showing on-par accuracy, up to $0.3\times$ higher training throughput, and $2\times$ better parameter efficiency compared to ResNets and vision Transformers.

8: \end{abstract}

9: