\begin{abstract}

  When training a machine learning model using multiple workers, each of
  which collects data from its own data source, the workers are most useful
  when the data they collect are {\em unique} and {\em different}.
  Ironically, recent analysis of decentralized parallel stochastic gradient
  descent (D-PSGD) relies on the assumption that the data hosted on different
  workers are {\em not too different}. In this paper, we ask the question:
  {\em Can we design a decentralized parallel stochastic gradient descent
    algorithm that is less sensitive to the data variance across workers?}

  We present D$^2$, a novel decentralized parallel stochastic gradient
  descent algorithm designed for large data variance \xr{among workers}
  (imprecisely, ``decentralized'' data). The core of D$^2$ is a
  variance-reduction extension of the standard D-PSGD algorithm, which
  improves the convergence rate from
  $O\left(\frac{\sigma}{\sqrt{nT}} + \frac{(n\zeta^2)^{1/3}}{T^{2/3}}\right)$
  to $O\left(\frac{\sigma}{\sqrt{nT}}\right)$, where $\zeta^{2}$ denotes the
  variance among data on different workers. As a result, D$^2$ is robust to
  data variance among workers. We empirically evaluate D$^2$ on image
  classification tasks in which each worker has access to the data of only a
  limited set of labels, and find that D$^2$ significantly outperforms
  D-PSGD.

\end{abstract}