\begin{abstract}

Running out of GPU memory has become a major bottleneck for large-scale DNN training.

Reducing the memory footprint during training has received intensive research attention.

We find that existing gradient accumulation reduces activation memory but is incompatible with gradient memory reduction, because accumulation requires preserving gradients whereas gradient memory reduction requires releasing them.

To address this issue, we propose a novel optimizer accumulation method for Adam, named Adam Accumulation (AdamA), which reduces both activation and gradient memory.

Specifically, AdamA directly integrates gradients into optimizer states and accumulates optimizer states over micro-batches, so that gradients can be released immediately after use.

We demonstrate mathematically and experimentally that AdamA yields the same convergence properties as Adam.

Evaluated on transformer-based models, AdamA achieves up to 23\% memory reduction compared to gradient accumulation, with less than 2\% degradation in training throughput.

Notably, AdamA can be combined with memory reduction methods for optimizer states to fit 1.26$\times$ to 3.14$\times$ larger models than the PyTorch and DeepSpeed baselines on GPUs with different memory capacities.

\end{abstract}
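
The accumulation idea summarized in the abstract can be illustrated with a minimal sketch. The notation is assumed here for illustration and is not taken from the abstract: $N$ micro-batches per step, $g_{t,i}$ the gradient of micro-batch $i$ at step $t$ (pre-scaled so that $g_t = \sum_{i=1}^{N} g_{t,i}$ is the mini-batch gradient), and $m_t$, $v_t$, $\beta_1$, $\beta_2$ the usual Adam moment estimates and decay rates. Because Adam's first-moment update is linear in the gradient, it can be split across micro-batches without changing the result:
\[
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t
    = \beta_1 m_{t-1} + \sum_{i=1}^{N} (1-\beta_1)\, g_{t,i},
\]
so each $g_{t,i}$ can be added into the optimizer state and released before the next micro-batch is processed. The second-moment update $v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$ is not linear in $g_{t,i}$, so an accumulated variant must substitute a per-micro-batch quantity. One plausible choice, assumed here rather than confirmed by the abstract, is
\[
v_t = \beta_2 v_{t-1} + (1-\beta_2) \sum_{i=1}^{N} g_{t,i}^2,
\]
which omits the element-wise cross terms $2\sum_{i<j} g_{t,i}\, g_{t,j}$ contained in $g_t^2$.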
