\begin{abstract}
% The abstract should briefly summarize the contents of the paper in
% 150--250 words.

	The ever-growing scale of deep neural networks (DNNs) has led to an equally
	rapid growth in computational resource requirements. Many recent
	architectures, most prominently Large Language Models, have to be trained
	on supercomputers with thousands of accelerators, such as GPUs or TPUs.
	In addition to the vast number of floating-point operations, the memory
	footprint of DNNs is also growing rapidly. In contrast, GPU architectures
	are notoriously short on memory. Even comparatively small architectures
	like some \emph{EfficientNet} variants cannot be trained on a single
	consumer-grade GPU at reasonable mini-batch sizes. During training,
	intermediate input activations have to be stored until backpropagation for
	gradient calculation. These make up the vast majority of the memory
	footprint. In this work, we therefore consider compressing activation maps
	for the backward pass using pooling, which can reduce both the memory
	footprint and the amount of data movement. The forward computation remains
	uncompressed. We empirically show convergence and study the effects on
	feature detection using the common vision architecture \emph{ResNet} as an
	example. With this approach, we reduce the peak memory consumption by 29\%
	at the cost of a longer training schedule, while maintaining prediction
	accuracy compared to an uncompressed baseline.

\keywords{Compression, Deep Neural Networks, Training, Backpropagation}
\end{abstract}

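\paragraph{Illustrative sketch.}
The idea summarized in the abstract can be written down for a single linear
layer. The notation here is an assumption made for illustration and is not
taken from the paper: $x$ is the (flattened) input activation map, $W$ the
weight matrix, $y = Wx$ the layer output, $\mathcal{L}$ the loss, $P_k$ a
pooling operator with non-overlapping $k \times k$ windows, and $U_k$ a
hypothetical upsampling operator that restores the original shape. The exact
weight gradient requires the full activation,
\begin{equation}
	\frac{\partial \mathcal{L}}{\partial W}
	= \frac{\partial \mathcal{L}}{\partial y}\, x^{\top},
\end{equation}
whereas a compressed backward pass would store only $\tilde{x} = P_k(x)$ and
use
\begin{equation}
	\frac{\partial \mathcal{L}}{\partial W}
	\approx \frac{\partial \mathcal{L}}{\partial y}\,
	\bigl(U_k(\tilde{x})\bigr)^{\top}.
\end{equation}
The forward pass still computes $y = Wx$ from the uncompressed $x$. For a
two-dimensional activation map whose height and width are divisible by $k$,
$\tilde{x}$ holds $1/k^{2}$ as many elements as $x$, which is where a saving
in stored activations would come from under these assumptions.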