\begin{abstract}
Communication compression is a crucial
technique for modern distributed learning
systems to alleviate communication
bottlenecks over slow networks.
Despite intensive recent study
of gradient compression for data-parallel
training, compressing the \textit{activations}
of models trained with
pipeline parallelism remains an
open problem. In this paper, we propose
\algname, a novel activation compression
algorithm for communication-efficient
pipeline-parallel training
over slow networks. Unlike previous
efforts in activation compression,
\algname does not compress activation values directly; instead, it compresses the \textit{changes of the
activations}. This allows us to show,
for the first time to the best of our knowledge,
that one can still achieve an
$O(1/\sqrt{T})$ convergence rate for
non-convex objectives
under
activation compression, without making
assumptions on gradient unbiasedness
that do not hold for deep learning models with non-linear activation functions.
We then show that \algname can be optimized
and implemented efficiently, without
additional end-to-end runtime overhead.
We evaluate \algname by fine-tuning
language models with up to 1.5 billion parameters,
compressing activations to 2--4 bits.
\algname provides up to a $4.3\times$ end-to-end speed-up on slower networks, without
sacrificing model quality.
Moreover, we show that \algname
can be combined with state-of-the-art
gradient compression algorithms to enable
``end-to-end communication compression'': \textit{all communications between machines, including model gradients, forward activations,
and backward gradients, are compressed into lower precision}.
This provides up to a $4.9\times$ end-to-end speed-up, without
sacrificing model quality.
\end{abstract}
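
% Illustrative sketch only: the symbols $a$, $m$, $Q$, $\xi$, and $\delta$ are
% hypothetical notation introduced here and are not fixed by the abstract.
As a schematic illustration of compressing activation changes rather than
activation values, the following sketch may help; it is one plausible reading
under stated assumptions, not necessarily the exact update used by \algname.
Assume that the sender and the receiver each keep a buffer $m(\xi)$ holding the
last reconstructed activation for training sample $\xi$, and that $Q$ is some
lossy quantizer. When $\xi$ is visited again and produces a fresh activation
$a(\xi)$, the communication step could take the form
\[
  \delta = Q\bigl(a(\xi) - m(\xi)\bigr),
  \qquad
  m(\xi) \leftarrow m(\xi) + \delta ,
\]
so that only the low-precision message $\delta$ crosses the network and both
sides apply the same buffer update. Under these assumptions, the quantizer acts
on the difference $a(\xi) - m(\xi)$, which can be expected to shrink as training
stabilizes, rather than on the full activation $a(\xi)$.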