\begin{abstract}
Modern \gls{VI} uses stochastic gradients to avoid intractable
expectations, enabling large-scale probabilistic inference in
complex models. \gls{VI} posits a family of approximating
distributions $q$ and then finds the member of that family that is
closest to the exact posterior $p$. Traditionally, \gls{VI}
algorithms minimize the ``exclusive \gls{KL}'' $\KL{q}{p}$, often
for computational convenience. Recent research, however, has also
focused on the ``inclusive \gls{KL}'' $\KL{p}{q}$, which has good
statistical properties that make it more appropriate for certain
inference problems. This paper develops a simple algorithm for
reliably minimizing the inclusive \gls{KL} using stochastic gradients with vanishing bias. % Consider a valid \gls{MCMC} method, a Markov chain whose stationary distribution is $p$. The algorithm we develop iteratively samples the chain $\latent[k]$, and then uses those samples to follow the score function of the variational approximation, $\nabla \log q(\latent[k])$ with a Robbins-Monro step-size schedule.
This method, which we call \gls{MSC}, converges to a local optimum of the inclusive \gls{KL}. It does not suffer from the systematic errors inherent in existing methods, such as Reweighted Wake-Sleep and Neural Adaptive Sequential Monte Carlo, which lead to bias in their final
estimates.
% In a variant that ties the variational approximation
%directly to the Markov chain, \gls{MSC} further provides a new
%algorithm that melds \gls{VI} and \gls{MCMC}.
We illustrate convergence on a toy
model and demonstrate the utility of \gls{MSC} on Bayesian probit
regression for classification
% %, deep Markov models to learn the dynamics of simulated spiking neurons,
as well as a stochastic volatility model for financial data.
\end{abstract}