abstract:cd1a2ef93e5c1a1d.tex

1: \begin{abstract}

2: Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress.

3: % As model sizes, compute resources and amount of datasets continue to grow larger, new challenges arise in terms of training efficiency, model performance, and steadiness.

4: In this paper, we empirically investigate how a critical hyper-parameter, the global batch size, influences the LLM training process.

5: We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law with respect to model size and the amount of training data.

6: % We train GPT-style language models from 125 million to 2.6 billion parameters on the Huawei Ascend chips and MindSpore framework using up to 300 billion high-quality tokens.

7: We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models.

8: Our analysis yields batch size scaling laws for two cases: a fixed compute budget and a fixed amount of training data. Extrapolation experiments on models of increasing size validate the predicted laws, which provide guidance for optimizing LLM training strategies under specific resource constraints.

9:

10: % Our goals were: 1) Reproduce previous GPT-3 benchmarks to validate our setup, 2) Explore how large batch sizes and learning rates affect the convergence and generalization for these models, and 3) Extend LLMs' scaling law into the large batch regime and use empirical findings to provide guidance on batch size selection for LLM training, facilitating better model outcomes under given resources in practice.

11: \end{abstract}

12: