\begin{abstract}
    Pipeline parallelism enables efficient training of Large Language Models (LLMs) on large-scale distributed accelerator clusters. Yet, pipeline \textit{bubbles} during startup and tear-down reduce the utilization of accelerators.
    Although efficient pipeline schemes with micro-batching and bidirectional pipelines have been proposed to maximize utilization, a significant number of bubbles cannot be filled using synchronous forward and backward passes.
    To address this problem, we suggest that \textit{extra work} be assigned to the bubbles to gain \textit{auxiliary benefits} in LLM training.
    As an example in this direction, we propose \textit{PipeFisher}, which assigns the work of K-FAC, a second-order optimization method based on the Fisher information matrix, to the bubbles to \textit{accelerate convergence}.
    In Phase 1 pretraining of BERT-Base and -Large models, PipeFisher reduces the (simulated) training time to 50--75\% of that of training with a first-order optimizer by greatly improving accelerator utilization and benefiting from the improved convergence of K-FAC.
\end{abstract}