abstract:1386b31662c39e06.tex

1: \begin{abstract}

2:

3: Scaling model size to improve performance has worked remarkably well in the current large language model paradigm.

4: Empirical findings from various scaling studies have led to novel scaling results and laws that guide subsequent research.

5: The high cost of training at contemporary data and model scales has left a gap in understanding how to tune and arrive at such training setups.

6: % However, prohibitively high training costs at contemporary scales of data and models result in a lack of thorough understanding of how to tune and arrive at such training setups efficiently.

7: One direction to ameliorate the cost of pretraining large models is to \textit{warmstart} the large-scale training from smaller models that are cheaper to tune.

8: In this work, we study whether the behavior of optimal hyperparameters can be retained when warmstarting is used for scaling.

9: We explore simple operations that permit theoretically motivated zero-shot transfer of optimal hyperparameters using $\mut{}$.

10: We investigate the factors that contribute to faster convergence and to the preservation of stable training dynamics under warmstarting with $\mut{}$.

11: We find that shrinking the smaller model's weights, zero-padding them, and perturbing the resulting larger model with the scaled initialization from $\mup{}$ together enable effective warmstarting with $\mut{}$.

12: \end{abstract}

13: