\begin{abstract}
Stochastic gradient descent (SGD) is an important algorithm in
machine learning. With constant learning rates, it is a stochastic
process that, after an initial phase of convergence, generates
samples from a stationary distribution. We show that SGD with
constant learning rates can be used effectively as an approximate
posterior inference algorithm for probabilistic modeling. Specifically,
we show how to adjust the tuning parameters of SGD so that its
stationary distribution matches the posterior as closely as possible.
This analysis rests on interpreting SGD as a continuous-time stochastic
process and then minimizing the Kullback--Leibler divergence between its
stationary distribution and the target posterior, in the spirit of
variational inference. In more detail, we model SGD as a
multivariate Ornstein--Uhlenbeck process and use properties of
this process to derive the optimal parameters. This theoretical
framework also connects SGD to modern scalable inference algorithms;
we analyze the recently proposed stochastic gradient Fisher scoring
from this perspective. We demonstrate that SGD with properly chosen
constant learning rates gives a new way to optimize hyperparameters in
probabilistic models.
\end{abstract}
