\begin{abstract}
    Pipeline parallelism enables efficient training of Large Language Models (LLMs) on large-scale distributed accelerator clusters. Yet, pipeline \textit{bubbles} during startup and tear-down reduce the utilization of accelerators.
    Although efficient pipeline schemes with micro-batching and bidirectional pipelines have been proposed to maximize utilization, a significant number of bubbles cannot be filled using synchronous forward and backward passes.
    To address this problem, we suggest that \textit{extra work} be assigned to the bubbles to gain \textit{auxiliary benefits} in LLM training.
    As an example in this direction, we propose \textit{PipeFisher}, which assigns the work of K-FAC, a second-order optimization method based on the Fisher information matrix, to the bubbles to \textit{accelerate convergence}.
    In Phase 1 pretraining of BERT-Base and -Large models, PipeFisher reduces the (simulated) training time to 50--75\% of that of training with a first-order optimizer by greatly improving accelerator utilization and benefiting from the improved convergence of K-FAC.
\end{abstract}