\begin{abstract}
Scaling model size to improve performance has worked remarkably well in the current large language model paradigm.
Empirical findings from various scaling studies have led to novel scaling results and laws that guide subsequent research.
However, the high cost of training at contemporary data and model scales has prevented a thorough understanding of how to tune and arrive at such training setups.
% However, prohibitively high training costs at contemporary scales of data and models result in a lack of thorough understanding of how to tune and arrive at such training setups efficiently.
One direction for reducing the cost of pretraining large models is to \textit{warmstart} large-scale training from smaller models that are cheaper to tune.
In this work, we attempt to understand whether the behavior of optimal hyperparameters can be retained under warmstarting for scaling.
We explore simple operations that allow the application of theoretically motivated methods for zero-shot transfer of optimal hyperparameters using $\mut{}$.
We investigate the aspects that contribute to the speedup in convergence and to the preservation of stable training dynamics under warmstarting with $\mut{}$.
We find that shrinking the smaller model's weights, zero-padding them, and perturbing the resulting larger model with scaled initialization from $\mup{}$ enable effective warmstarting with $\mut{}$.
\end{abstract}