1: \begin{abstract}
4:
5: The ever-growing scale of deep neural networks (DNNs) has led to an equally
6: rapid growth in computational resource requirements. Many recent
7: architectures, most prominently Large Language Models, have to be trained
8: using supercomputers with thousands of accelerators, such as GPUs or TPUs.
9: Besides the vast number of floating-point operations, the memory footprint of
10: DNNs is also exploding. In contrast, GPU architectures are notoriously short
11: on memory. Even comparatively small architectures like some
12: \emph{EfficientNet} variants cannot be trained on a single consumer-grade GPU
13: at reasonable mini-batch sizes. During training, intermediate input
14: activations have to be stored until backpropagation for gradient calculation.
15: These activations make up the vast majority of the memory footprint. In this work, we
16: therefore consider compressing activation maps for the backward pass using
17: pooling, which can reduce both the memory footprint and the amount of data
18: movement. The forward computation remains uncompressed. We empirically show
19: convergence and study the effects on feature detection using the
20: common vision architecture \emph{ResNet} as an example. With this approach, we
21: reduce the peak memory consumption by 29\% at the cost of a longer training
22: schedule, while maintaining prediction accuracy compared to an uncompressed
23: baseline.
24:
25: \keywords{Compression, Deep Neural Networks, Training, Backpropagation}
26: \end{abstract}
27: