\begin{abstract}
One of the most widely used methods for solving large-scale stochastic optimization problems is \ac{DASGD}, a family of algorithms that result from parallelizing stochastic gradient descent on distributed computing architectures (possibly) asynchronously.
However, a key obstacle in the efficient implementation of \ac{DASGD} is the issue of \emph{delays}:
when a computing node contributes a gradient update, the global model parameter may have already been updated by other nodes several times over, thereby rendering this gradient information stale.
These delays can quickly add up if the computational throughput of a node is saturated, so the convergence of \ac{DASGD} may be compromised in the presence of large delays.
Our first contribution is to show that, by carefully tuning the algorithm's step-size, convergence to the critical set is still achieved in mean square, even if the delays grow unbounded at a polynomial rate.
We also establish finer results in a broad class of structured optimization problems (called variationally coherent), where we show that \ac{DASGD} converges to a global optimum with probability $1$ under the same delay assumptions.
Together, these results contribute to the broad landscape of large-scale non-convex stochastic optimization by offering state-of-the-art theoretical guarantees and providing insights for algorithm design.
\end{abstract}