
\begin{abstract}

Architectural improvements are one of the core pillars of efficient deep learning. A prominent example is the residual (skip) connection, which has led to significantly better model convergence and quality. Since its introduction, the residual connection has become ubiquitous not only in convolutional neural networks but also in transformer-based architectures, the backbone of LLMs.


In this paper we introduce the \emph{Learned Augmented Residual Layer} (\laurel), a novel generalization of the canonical residual connection, designed to be an in-situ replacement for the latter while outperforming it on both model quality and footprint metrics.

Our experiments show that \laurel can improve performance for both vision and language models.

%When pre-training a 3B parameter LLM with \laurel over two weeks with 1024 Cloud TPUv5e chips, we improve its performance on a variety of common LLM tasks and dataset combinations, using only $+0.012\%$ extra parameters and without significant latency changes.

%

%\laurel also outperforms naive model scaling both in terms of model quality and footprint metrics.

For example, on the ResNet-50, ImageNet 1K task, \laurel achieves $60\%$ of the gains from adding an extra layer while adding only $0.003\%$ more parameters, and matches those gains while adding $2.6\times$ fewer parameters.

\end{abstract}
