1: \begin{abstract}
2: %\vspace{-1em}
3:
4: Optimizing distributed learning systems is an art
5: of balancing between computation and communication.
6: There have been two lines of research that try to
7: deal with slower networks: {\em communication
8: compression} for
9: low-bandwidth networks, and {\em decentralization} for
10: high-latency networks. In this paper, we explore
11: a natural question: {\em can the combination
12: of both techniques lead to
13: a system that is robust to both bandwidth
14: and latency?}
15:
16: Although the system implication of such a combination
17: is trivial, the underlying theoretical principles and
18: algorithm design are challenging: unlike in centralized algorithms, simply compressing
19: {\rc the exchanged information,
20: even in an unbiased stochastic way,
21: within a decentralized network would accumulate error and cause the algorithm to fail to converge.}
22: In this paper, we develop
23: a framework of compressed, decentralized training and
24: propose two different strategies, which we call
25: {\em extrapolation compression} and {\em difference compression}.
26: We analyze both algorithms and prove that
27: both converge at the rate of $O(1/\sqrt{nT})$,
28: where $n$ is the number of workers and $T$ is the
29: number of iterations, matching the convergence rate for
30: full-precision, centralized training. We validate
31: our algorithms and find that they significantly outperform
32: the best of the merely decentralized and merely quantized
33: algorithms on networks with {\em both}
34: high latency and low bandwidth.
35: \end{abstract}
36: