\begin{abstract}
  We analyze the convergence of gradient-based optimization algorithms
  that base their updates on delayed stochastic gradient
  information. The main application of our results is to the
  development of gradient-based distributed optimization algorithms
  where a master node performs parameter updates while worker nodes
  compute stochastic gradients based on local information in parallel,
  which may give rise to delays due to asynchrony. We take motivation
  from statistical problems where the size of the data is so large
  that it cannot fit on one computer; with the advent of huge datasets
  in biology, astronomy, and the internet, such problems are now
  common. Our main contribution is to show that for smooth stochastic
  problems, the delays are asymptotically negligible and we can
  achieve order-optimal convergence results. In application to
  distributed optimization, we develop procedures that overcome
  communication bottlenecks and synchronization requirements. We show
  $n$-node architectures whose optimization error in stochastic
  problems---in spite of asynchronous delays---scales asymptotically
  as $\order(1 / \sqrt{nT})$ after $T$ iterations. This rate is known
  to be optimal for a distributed system with $n$ nodes even in the
  absence of delays. We additionally complement our theoretical
  results with numerical experiments on a statistical machine learning
  task.
\end{abstract}
