abstract:7b5e1a25d5b0d150.tex

1: \begin{abstract}

2:   We study distributed stochastic gradient (D-SG) method and its accelerated variant (D-ASG) for solving decentralized strongly convex stochastic optimization problems where the objective function is distributed over several computational units, lying on a fixed but arbitrary connected communication graph, subject to local communication constraints where noisy estimates of the gradients are available. We develop a framework which allows to choose the stepsize and the momentum parameters of these algorithms in a way to optimize performance by systematically trading off the bias, variance and dependence to network effects. When gradients do not contain noise, we also prove that D-ASG can \emph{achieve acceleration}, in the sense that it requires $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ gradient evaluations and $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ communications to converge to the same fixed point with the non-accelerated variant where $\kappa$ is the condition number and $\varepsilon$ is the target accuracy. %Our results lead to

3:   %To our knowledge, this is the first acceleration result where the bias term decays exponentially with the square root of the condition number in the context of \emph{primal} distributed stochastic gradient methods.

4:   For quadratic functions, we also provide finer performance bounds that are tight with respect to bias and variance terms. Finally, we study a multistage version of D-ASG with parameters carefully varied over stages to ensure exact convergence to the optimal solution. It achieves optimal and accelerated $\bigO(-k/\sqrt{\kappa})$ linear decay in the bias term as well as optimal $\bigO(\sigma^2/k)$ in the variance term. We illustrate through numerical experiments that our approach results in accelerated practical algorithms that are robust to gradient noise and that can outperform existing methods.

5: \end{abstract}

6: