\begin{abstract}
Running out of GPU memory has become a major bottleneck for large-scale DNN training.
Reducing the memory footprint during training has therefore received intensive research attention.
We find that conventional gradient accumulation reduces activation memory but is incompatible with gradient memory reduction, because accumulation must preserve gradients across micro-batches whereas gradient memory reduction requires releasing them.
To address this issue, we propose a novel optimizer accumulation method for Adam, named Adam Accumulation (AdamA), which reduces both activation and gradient memory.
Specifically, AdamA integrates gradients directly into the optimizer states and accumulates the optimizer states over micro-batches, so that gradients can be released immediately after use.
We demonstrate mathematically and experimentally that AdamA yields the same convergence properties as Adam.
Evaluated on transformer-based models, AdamA achieves up to 23\% memory reduction compared to gradient accumulation, with less than 2\% degradation in training throughput.
Notably, AdamA can be combined with memory reduction methods for optimizer states to fit 1.26$\times$--3.14$\times$ larger models than the PyTorch and DeepSpeed baselines on GPUs with different memory capacities.
\end{abstract}

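As a hedged sketch of the accumulation described in the abstract (the notation is assumed here for illustration and is not taken from the abstract itself): let $g_{t,i}$ denote the gradient of micro-batch $i \in \{1,\dots,N\}$ at step $t$, and let $m$ and $v$ be Adam's first- and second-moment states with decay rates $\beta_1$ and $\beta_2$. One plausible form decays the states once per step,
\[
  m \leftarrow \beta_1\, m_{t-1}, \qquad v \leftarrow \beta_2\, v_{t-1},
\]
and then folds in each micro-batch gradient as soon as it is computed,
\[
  m \leftarrow m + (1-\beta_1)\, g_{t,i}, \qquad
  v \leftarrow v + (1-\beta_2)\, g_{t,i}^{2}, \qquad i = 1,\dots,N,
\]
so that $g_{t,i}$ can be freed before the next micro-batch is processed. After the last micro-batch, $m_t = m$ and $v_t = v$. Under this assumed form, the first moment coincides with Adam's update on the summed gradient $\sum_i g_{t,i}$, whereas the second moment accumulates $\sum_i g_{t,i}^{2}$ rather than $\bigl(\sum_i g_{t,i}\bigr)^{2}$. Any normalization by $N$ is omitted for brevity.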