1758679ebd0b8c96.tex
1: \begin{abstract}
2: 
3: Asynchronous distributed stochastic gradient descent methods have trouble converging because of stale gradients.
4: A gradient update sent to a parameter server by a client is stale if the parameters used to calculate that gradient have since been updated on the server.
5: Approaches have been proposed to circumvent this problem that quantify staleness in terms of the number of elapsed updates.
6: In this work, we propose a novel method that quantifies staleness in terms of moving averages of gradient statistics.
7: We show that this method outperforms previous methods with respect to convergence speed and scalability to many clients.
8: We also discuss how an extension to this method can be used to dramatically reduce bandwidth costs
9: in a distributed training context.
10: In particular, our method allows reduction of total bandwidth usage by a factor of 5 with little impact on
11: cost convergence.
12: We also describe (and link to) a software library that we have used to simulate these algorithms
13: deterministically on a single machine.
14: 
15: \end{abstract}
16: