\begin{abstract}
Communication compression is a crucial technique for alleviating the
communication bottlenecks of modern distributed learning systems over slow networks.
Despite intensive recent study of gradient compression for data-parallel
training, compressing the \textit{activations} of models trained with
pipeline parallelism remains an open problem.
In this paper, we propose \algname, a novel activation compression
algorithm for communication-efficient pipeline-parallel training
over slow networks.
Unlike previous work on activation compression, which compresses
activation values directly, \algname compresses the
\textit{changes of the activations}.
This allows us to show, to the best of our knowledge for the first time,
that one can still achieve an $O(1/\sqrt{T})$ convergence rate for
non-convex objectives under activation compression, without making
assumptions on gradient unbiasedness that do not hold for deep learning
models with non-linear activation functions.
We then show that \algname can be optimized and implemented efficiently,
without additional end-to-end runtime overhead.
We evaluate \algname by fine-tuning language models with up to
1.5 billion parameters, compressing activations to 2--4 bits.
\algname provides up to $4.3\times$ end-to-end speed-up over slow networks,
without sacrificing model quality.
We also show that \algname can be combined with state-of-the-art
gradient compression algorithms to enable
``end-to-end communication compression'': \textit{all communications between
machines, including model gradients, forward activations, and backward
gradients, are compressed into lower precision}.
This combination provides up to $4.9\times$ end-to-end speed-up,
without sacrificing model quality.
\end{abstract}