\begin{abstract}
Stochastic gradient MCMC (SG-MCMC) has played an important role in large-scale
Bayesian learning, with well-developed theoretical convergence properties.
In such applications of SG-MCMC, it is becoming increasingly popular to employ
distributed systems, where stochastic gradients are computed based on some
outdated parameters, yielding what are termed {\em stale gradients}. While
stale gradients could be directly used in SG-MCMC, their impact on convergence
properties has not been well studied. In this paper we develop theory to show
that while the bias and mean squared error (MSE) of an SG-MCMC
algorithm depend on the staleness of stochastic gradients, its estimation variance
(relative to the {\em expected} estimate, based on a prescribed number of samples) is
independent of it. In a simple Bayesian distributed system with SG-MCMC, where stale
gradients are computed asynchronously by a set of workers, our theory indicates a
linear speedup in the decrease of estimation variance with respect to the number of workers.
Experiments on synthetic data and deep neural
networks validate our theory, demonstrating the effectiveness and scalability of SG-MCMC
with stale gradients.
\end{abstract}
