\begin{abstract}
We study empirical scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.
Other architectural details such as network width or depth have minimal effects within a wide range.
Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size.
These relationships allow us to determine the optimal allocation of a fixed compute budget.
Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
\end{abstract}