\begin{abstract}


Training LLMs is expensive, and recent evidence indicates that training all the way to convergence is inefficient. In this paper, we investigate whether a simple idea, checkpoint averaging along the trajectory of a training run, can improve the quality of models before they have converged. This approach incurs no extra cost during training or inference. Specifically, we analyze the training trajectories of Pythia LLMs with 1 to 12 billion parameters and demonstrate that, particularly during the early to middle stages of training, checkpoint averaging accelerates convergence and improves both test and zero-shot generalization. Loss spikes are a well-recognized problem in LLM training; we encountered two such spikes in the underlying trajectories, and our averaging mitigated both.


For a 6.9B parameter LLM, for example, our early weight averaging recipe can save up to 4200 hours of GPU time, which corresponds to significant savings in cloud compute costs.

\end{abstract}
