\begin{abstract}
Asynchronous methods are widely used in deep learning, but have limited
theoretical justification when applied to non-convex problems.
We show that running stochastic
gradient descent (SGD) in an asynchronous manner can be viewed as
adding a momentum-like term to the SGD iteration. Our result does not
assume convexity of the objective function, so it is applicable to
deep learning systems. We observe that a standard queuing model of
asynchrony results in a form of momentum that is commonly used by deep
learning practitioners. This forges a link between queuing theory and
asynchrony in deep learning systems, which could be useful for systems
builders.
For convolutional neural networks, we experimentally
validate that the degree of asynchrony directly correlates with the
momentum, confirming our main result.
An important implication is that the momentum parameter should be retuned whenever the level of asynchrony changes.
We assert that properly tuned momentum reduces the number of steps required for convergence.
Finally, our theory suggests new ways of counteracting the adverse effects of asynchrony: a simple mechanism like using negative algorithmic momentum can improve performance under high asynchrony.
Since asynchronous methods have better
hardware efficiency, this result may shed light on when asynchronous
execution is more efficient for deep learning systems.
\end{abstract}
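
As a point of reference for the momentum-like term mentioned in the abstract, the standard momentum SGD iteration can be written as follows. This is only a sketch in assumed notation (none of these symbols appear in the abstract): $w_t$ denotes the parameters at step $t$, $\eta > 0$ the step size, $\mu$ the momentum coefficient, and $\nabla f(w_t)$ a stochastic gradient of the objective at $w_t$.
\[
  w_{t+1} - w_t \;=\; \mu \, (w_t - w_{t-1}) \;-\; \eta \, \nabla f(w_t).
\]
Setting $\mu = 0$ recovers plain SGD, and a negative value of $\mu$ corresponds to the negative algorithmic momentum mentioned in the abstract.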
