\begin{abstract}
  Modern \gls{VI} uses stochastic gradients to avoid intractable
  expectations, enabling large-scale probabilistic inference in
  complex models.  \gls{VI} posits a family of approximating
  distributions $q$ and then finds the member of that family that is
  closest to the exact posterior $p$. Traditionally, \gls{VI}
  algorithms minimize the ``exclusive \gls{KL}'' $\KL{q}{p}$, often
  for computational convenience. Recent research, however, has also
  focused on the ``inclusive \gls{KL}'' $\KL{p}{q}$, which has good
  statistical properties that make it more appropriate for certain
  inference problems.  This paper develops a simple algorithm for
  reliably minimizing the inclusive \gls{KL} using stochastic gradients with vanishing bias. % Consider a valid \gls{MCMC} method, a Markov chain whose stationary distribution is $p$. The algorithm we develop iteratively samples the chain $\latent[k]$, and then uses those samples to follow the score function of the variational approximation, $\nabla \log q(\latent[k])$ with a  Robbins-Monro step-size schedule.
  This method, which we call \gls{MSC}, converges to a local optimum of the inclusive \gls{KL}. It does not suffer from the systematic errors inherent in existing methods, such as Reweighted Wake-Sleep and Neural Adaptive Sequential Monte Carlo, which lead to bias in their final
  estimates.
  % In a variant that ties the variational approximation
  %directly to the Markov chain, \gls{MSC} further provides a new
  %algorithm that melds \gls{VI} and \gls{MCMC}.  
  We illustrate convergence on a toy
  model and demonstrate the utility of \gls{MSC} on Bayesian probit
  regression for classification
%  %, deep Markov models to learn the dynamics of simulated spiking neurons, 
  as well as a stochastic volatility model for financial data.
\end{abstract}
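% Illustrative sketch, not the paper's formal statement: the symbols
% $z$, $\lambda$, and $\varepsilon_k$ are assumed notation.
As a brief sketch of why score-function updates driven by samples can target the inclusive \gls{KL}, suppose the approximation is written $q(z;\lambda)$, where $z$ denotes the latent variables and $\lambda$ the variational parameters (both symbols are assumptions made here for illustration). Because $p$ does not depend on $\lambda$, the gradient of the inclusive \gls{KL} reduces to an expected score under $p$:
\[
  \nabla_\lambda \KL{p}{q}
  = \nabla_\lambda \, \mathrm{E}_{p(z)}\bigl[\log p(z) - \log q(z;\lambda)\bigr]
  = -\,\mathrm{E}_{p(z)}\bigl[\nabla_\lambda \log q(z;\lambda)\bigr].
\]
Under these assumptions, a stochastic-approximation update that replaces the expectation with the current state $z[k]$ of a Markov chain whose stationary distribution is $p$ could take the form
\[
  \lambda_k = \lambda_{k-1} + \varepsilon_k \, \nabla_\lambda \log q(z[k];\lambda_{k-1}),
  \qquad
  \sum_k \varepsilon_k = \infty,
  \quad
  \sum_k \varepsilon_k^2 < \infty,
\]
where the two conditions on the assumed step sizes $\varepsilon_k$ are the standard Robbins-Monro requirements. The chain's samples are not exact draws from $p$ at any finite $k$, so each gradient estimate is biased; the bias shrinks as the chain approaches its stationary distribution.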