abstract:1758679ebd0b8c96.tex

1: \begin{abstract}

2:

3: Asynchronous distributed stochastic gradient descent methods have trouble converging because of stale gradients.

4: A gradient update sent to a parameter server by a client is stale if the parameters used to calculate that gradient have since been updated on the server.

5: Approaches have been proposed to circumvent this problem that quantify staleness in terms of the number of elapsed updates.

6: In this work, we propose a novel method that quantifies staleness in terms of moving averages of gradient statistics.

7: We show that this method outperforms previous methods with respect to convergence speed and scalability to many clients.

8: We also discuss how an extension to this method can be used to dramatically reduce bandwidth costs

9: in a distributed training context.

10: In particular, our method allows reduction of total bandwidth usage by a factor of 5 with little impact on

11: cost convergence.

12: We also describe (and link to) a software library that we have used to simulate these algorithms

13: deterministically on a single machine.

14:

15: \end{abstract}

16: