1: \begin{abstract}
2: Training large neural network (NN) models requires extensive memory, and Activation Compressed Training (ACT) is a promising approach to reducing the training memory footprint.
3: This paper presents \method, an ACT framework that supports a broad range of machine learning tasks for generic NN architectures with limited domain knowledge.
4: By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of \method without prior knowledge of operator type or model architecture.
5: To make training stable, 
6: we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time.
7: We implement \method as a PyTorch library that readily applies to any NN architecture.
8: \method reduces the activation memory for convolutional NNs, transformers, and graph NNs by up to 8.1$\times$, enabling training with a 4.2$\times$ to 24.7$\times$ larger batch size, with negligible accuracy loss. The library is available at \url{https://github.com/LiuXiaoxuanPKU/GACT-ICML}.
9: 
10: 
11: %Evaluation on computer vision, natural language processing, and graph classification tasks shows that \method reduces the memory footprint of the activation by X, and it enables training with a X to X larger batch size.
12: 
13: %compresses the activation to 4 bits per dimensional, with negligible accuracy loss. 
14: %\method 
15: \end{abstract}
16: