\begin{abstract}
Communication compression is a crucial
technique for modern distributed learning
systems to alleviate their communication
bottlenecks over slower networks.
Despite recent intensive studies
of gradient compression for data-parallel
training, compressing the \textit{activations}
for models trained with
pipeline parallelism is still an
open problem. In this paper, we propose
\algname, a novel activation compression
algorithm for communication-efficient
pipeline-parallel training
over slow networks. Unlike previous
efforts in activation compression,
\algname compresses the \textit{changes of the
activations} rather than the activation values themselves. This allows us to show,
for the first time to the best of our knowledge,
that one can still achieve
an $O(1/\sqrt{T})$ convergence rate for
non-convex objectives
under
activation compression, without making
assumptions on gradient unbiasedness
that do not hold for deep learning models with non-linear activation functions.
We then show that \algname can be optimized
and implemented efficiently, without
additional end-to-end runtime overhead.
We evaluate \algname by fine-tuning
language models with up to 1.5 billion parameters,
compressing activations to 2--4 bits.
\algname provides up to $4.3\times$ end-to-end speed-up over slower networks, without
sacrificing model quality.
Moreover, we show that \algname
can be combined with state-of-the-art
gradient compression algorithms to enable
``end-to-end communication compression'': \textit{all communications between machines, including model gradients, forward activations,
and backward gradients, are compressed into lower precision}.
This provides up to $4.9\times$ end-to-end speed-up, without
sacrificing model quality.
\end{abstract}
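
To make the idea of compressing activation changes concrete, the following is a minimal sketch in notation of our own choosing; the symbols are assumptions made for illustration and are not defined in the abstract. Let $a_t$ denote the activation that a pipeline stage would send at step $t$, let $m_t$ denote a buffer that sender and receiver both maintain, and let $Q$ denote a low-bit quantizer. One possible reading of ``compressing the changes'' is
\[
  \delta_t = Q\bigl(a_t - m_{t-1}\bigr), \qquad m_t = m_{t-1} + \delta_t ,
\]
where only $\delta_t$ is communicated and the receiver uses $m_t$ in place of $a_t$. Intuitively, if the activations stabilize as training progresses, the difference $a_t - m_{t-1}$ becomes small, so the error introduced by quantizing it with a fixed number of bits may shrink as well, whereas directly quantizing $a_t$ would incur an error that does not vanish.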