\begin{abstract}
  When training a machine learning model using multiple workers, each of
  which collects data from its own data source, it is most useful when the
  data collected by different workers are {\em unique} and {\em different}.
  Ironically, recent analysis of decentralized parallel stochastic gradient
  descent (D-PSGD) relies on the assumption that the data hosted on different
  workers are {\em not too different}. We therefore ask the question:
  {\em Can we design a decentralized parallel stochastic gradient descent
  algorithm that is less sensitive to the data variance across workers?}

  In this paper, we present D$^2$, a novel decentralized parallel stochastic
  gradient descent algorithm designed for large data variance \xr{among workers}
  (imprecisely, ``decentralized'' data). The core of D$^2$ is a variance
  reduction extension of the standard D-PSGD algorithm, which improves the
  convergence rate from
  $O\left(\frac{\sigma}{\sqrt{nT}} + \frac{(n\zeta^2)^{1/3}}{T^{2/3}}\right)$
  to $O\left(\frac{\sigma}{\sqrt{nT}}\right)$, where $\zeta^{2}$ denotes the
  variance among data on different workers. As a result, D$^2$ is robust to
  data variance among workers. We empirically evaluate D$^2$ on image
  classification tasks where each worker has access to only the data of a
  limited set of labels, and find that D$^2$ significantly outperforms D-PSGD.
\end{abstract}