\begin{abstract}
  Stochastic Gradient Descent (SGD) is an important algorithm in
  machine learning.  With a constant learning rate, it is a stochastic
  process that, after an initial phase of convergence, generates
  samples from a stationary distribution.  We show that SGD with a
  constant learning rate can be used effectively as an approximate
  posterior inference algorithm for probabilistic modeling.
  Specifically, we show how to adjust the tuning parameters of SGD so
  that its stationary distribution matches the posterior as closely as
  possible.  This analysis rests on interpreting SGD as a
  continuous-time stochastic process and then minimizing the
  Kullback-Leibler divergence between its stationary distribution and
  the target posterior, in the spirit of variational inference.  In
  more detail, we model SGD as a multivariate Ornstein-Uhlenbeck
  process and then use properties of this process to derive the
  optimal parameters.  This theoretical framework also connects SGD to
  modern scalable inference algorithms; we analyze the recently
  proposed stochastic gradient Fisher scoring from this perspective.
  Finally, we demonstrate that SGD with properly chosen constant
  learning rates gives a new way to optimize hyperparameters in
  probabilistic models.
\end{abstract}
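
As a rough illustration of the Ornstein-Uhlenbeck view summarized in
the abstract, consider the following sketch.  The notation is assumed
here for illustration and is not fixed by the abstract: $\theta(t)$ is
the iterate measured relative to a local optimum, $\epsilon$ is the
constant learning rate, $S$ is the minibatch size, $A$ is a positive
definite matrix standing in for the Hessian of the loss at the optimum,
$B B^\top$ is the covariance of the gradient noise, and $W(t)$ is a
standard Wiener process.  Under these assumptions, one plausible
continuous-time model of constant-rate SGD is
\begin{equation}
  d\theta(t) = -\epsilon A\,\theta(t)\,dt
               + \frac{\epsilon}{\sqrt{S}}\,B\,dW(t).
\end{equation}
For a process of this form, the stationary distribution is a zero-mean
Gaussian whose covariance $\Sigma$ solves the Lyapunov equation
\begin{equation}
  A\Sigma + \Sigma A^\top = \frac{\epsilon}{S}\,B B^\top.
\end{equation}
In the scalar case $A = a > 0$, $B = b$, this reduces to
$\Sigma = \epsilon b^2 / (2 a S)$: a larger learning rate or a smaller
minibatch widens the stationary distribution.  This dependence is what
makes it possible, at least in this simplified picture, to tune
$\epsilon$ so that the stationary distribution approximates a target
posterior.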