\begin{abstract}

When training a machine learning model using multiple workers, each of
which collects data from its own data source, it is most useful
when the data collected by different workers are {\em unique} and {\em
different}. Ironically, recent analysis of decentralized parallel
stochastic gradient descent (D-PSGD) relies on the assumption that the data
hosted on different workers are {\em not too different}. We therefore ask
the question: {\em Can we design a decentralized parallel stochastic gradient
descent algorithm that is less sensitive to the data variance across
workers?}

In this paper, we present D$^2$, a novel decentralized parallel stochastic
gradient descent algorithm designed for large data variance \xr{among workers}
(imprecisely, ``decentralized'' data). The core of D$^2$ is a variance
reduction extension of the standard D-PSGD algorithm, which improves the
convergence rate from $O\left(\frac{\sigma}{\sqrt{nT}} +
\frac{(n\zeta^2)^{1/3}}{T^{2/3}}\right)$ to
$O\left(\frac{\sigma}{\sqrt{nT}}\right)$, where $\zeta^{2}$ denotes the
variance among data on different workers. As a result, D$^2$ is robust to
data variance among workers. We empirically evaluate D$^2$ on image
classification tasks where each worker has access to only the data of a
limited set of labels, and find that D$^2$ significantly outperforms
D-PSGD.

\end{abstract}