
\begin{abstract}

We study empirical scaling laws for language model performance on the cross-entropy loss.

The loss scales as a power law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.

Other architectural details such as network width or depth have minimal effects within a wide range.

Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size.

These relationships allow us to determine the optimal allocation of a fixed compute budget.

Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

\end{abstract}
