\begin{abstract}
Asynchronous methods are widely used in deep learning, but have limited theoretical justification when applied to
non-convex problems.
We show that running stochastic
gradient descent (SGD) in an asynchronous manner can be viewed as
adding a momentum-like term to the SGD iteration. Our result does not
assume convexity of the objective function, so it is applicable to
deep learning systems. We observe that a standard queuing model of
asynchrony results in a form of momentum that is commonly used by deep
learning practitioners. This forges a link between queuing theory and
asynchrony in deep learning systems, which could be useful for systems
builders.
For convolutional neural networks, we experimentally
validate that the degree of asynchrony directly correlates with the
momentum, confirming our main result.
A key implication is that the momentum parameter should be retuned whenever the level of asynchrony changes.
We assert that properly tuned momentum reduces the number of steps required for convergence.
Finally, our theory suggests new ways of counteracting the adverse effects of asynchrony: for example, a simple mechanism such as negative algorithmic momentum can improve performance under high asynchrony.
Since asynchronous methods have better
hardware efficiency, this result may shed light on when asynchronous
execution is more efficient for deep learning systems.
\end{abstract}