
\begin{abstract}

Training large neural network (NN) models requires extensive memory, and Activation Compressed Training (ACT) is a promising approach to reducing the training memory footprint.

This paper presents \method, an ACT framework that supports a broad range of machine learning tasks for generic NN architectures while requiring only limited domain knowledge.

By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of \method without prior knowledge of the operator type or model architecture.

To make training stable, we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time.

We implement \method as a PyTorch library that readily applies to any NN architecture.

\method reduces the activation memory for convolutional NNs, transformers, and graph NNs by up to 8.1$\times$, enabling training with a 4.2$\times$ to 24.7$\times$ larger batch size, with negligible accuracy loss. The code is available at \url{https://github.com/LiuXiaoxuanPKU/GACT-ICML}.



\end{abstract}
