abstract:8b63cafc41b702e3.tex

\begin{abstract}
Stochastic gradient MCMC (SG-MCMC) has played an important role in large-scale
Bayesian learning, with well-developed theoretical convergence properties.
In such applications of SG-MCMC, it is becoming increasingly popular to employ
distributed systems, where stochastic gradients are computed based on some
outdated parameters, yielding what are termed {\em stale gradients}. While
stale gradients can be used directly in SG-MCMC, their impact on convergence
properties has not been well studied. In this paper, we develop theory to show
that while the bias and mean squared error (MSE) of an SG-MCMC
algorithm depend on the staleness of stochastic gradients, its estimation variance
(relative to the {\em expected} estimate, based on a prescribed number of samples) is
independent of it. In a simple Bayesian distributed system with SG-MCMC, where stale
gradients are computed asynchronously by a set of workers, our theory indicates a
linear speedup in the decrease of estimation variance w.r.t.\ the number of workers.
Experiments on synthetic data and deep neural
networks validate our theory, demonstrating the effectiveness and scalability of SG-MCMC
with stale gradients.
\end{abstract}