b0882c2fd2ef6c6a.tex
1: \begin{abstract}
2: %\vspace{-1em}
3: 
4: Optimizing distributed learning systems is an art
5: of balancing computation and communication.
6: Two lines of research try to
7: deal with slower networks: {\em communication 
8: compression} for
9: low-bandwidth networks, and {\em decentralization} for
10: high-latency networks. In this paper, we explore
11: a natural question: {\em can the combination
12: of both techniques lead to
13: a system that is robust to both low bandwidth
14: and high latency?}
15: 
16: Although the system implications of such a combination
17: are trivial, the underlying theoretical principles and
18: algorithm design are challenging: unlike in centralized algorithms, simply compressing
19: {\rc the exchanged information,
20: even in an unbiased stochastic way, 
21: within the decentralized network would accumulate error and cause the algorithm to fail to converge.} 
22: In this paper, we develop
23: a framework for compressed, decentralized training and
24: propose two different strategies, which we call
25: {\em extrapolation compression} and {\em difference compression}.
26: We analyze both algorithms and prove that
27: both converge at a rate of $O(1/\sqrt{nT})$,
28: where $n$ is the number of workers and $T$ is the
29: number of iterations, matching the convergence rate for
30: full-precision, centralized training. We validate 
31: our algorithms and find that they significantly outperform
32: the best of the merely decentralized and merely quantized
33: algorithms on networks with {\em both} 
34: high latency and low bandwidth.
35: \end{abstract}
36: