\begin{abstract}
We analyze the convergence of gradient-based optimization algorithms
that base their updates on delayed stochastic gradient
information. The main application of our results is to the
development of gradient-based distributed optimization algorithms
where a master node performs parameter updates while worker nodes
compute stochastic gradients based on local information in parallel,
which may give rise to delays due to asynchrony. We take motivation
from statistical problems where the size of the data is so large
that it cannot fit on one computer; with the advent of huge datasets
in biology, astronomy, and the internet, such problems are now
common. Our main contribution is to show that for smooth stochastic
problems, the delays are asymptotically negligible and we can
achieve order-optimal convergence results. In application to
distributed optimization, we develop procedures that overcome
communication bottlenecks and synchronization requirements. We show
$n$-node architectures whose optimization error in stochastic
problems---in spite of asynchronous delays---scales asymptotically
as $\order(1 / \sqrt{nT})$ after $T$ iterations. This rate is known
to be optimal for a distributed system with $n$ nodes even in the
absence of delays. We additionally complement our theoretical
results with numerical experiments on a statistical machine learning
task.
\end{abstract}