1: \documentclass[11pt]{article}
2: \usepackage{graphicx}
3: \usepackage{latexsym}
4: \usepackage{amssymb}
5:
6: %\topmargin -0.75in
7: %\textheight 9.18in
8: \topmargin -0.5in
9: \textheight 8.8in
10: \oddsidemargin -0.1in
11: \evensidemargin -0.1in
12: \textwidth 6.7in
13: \tolerance=1600 %allow some tolerance in extending after line.
14: \parskip=6pt
15: \overfullrule=0pt %no dark lines if overfull
16: \setlength{\parindent}{12pt}
17: \setlength{\partopsep}{0pt}
18: \setlength{\topsep}{0pt}
19: \renewcommand{\topfraction}{0.9}
20: \renewcommand{\textfraction}{0.1}
21: \setcounter{bottomnumber}{1}
22: \renewcommand{\bottomfraction}{0.5}
23:
24: \def\beq{\begin{eqnarray}}
25: \def\eeq{\end{eqnarray}}
26: \def\I{\mbox{\bf I}}
27: \def\mod{\mbox{mod}\ }
28: \def\geo{^{\mbox{\tiny geo}}}
29: \def\opt{^{\mbox{\tiny opt}}}
30: \def\rhat{\hat r}
31: \def\rhatrev{\underline{\hat r}}
32: \def\rhatSIS{\hat r_{\mbox{\tiny SIS}}}
33: \def\rhatbridge{\hat r_{\mbox{\tiny bridge}}}
34: \def\rhatAIS{\hat r_{\mbox{\tiny AIS}}}
35: \def\rhatLIS{\hat r_{\mbox{\tiny LIS}}}
36: \def\rhatLISave{\hat r_{\mbox{\tiny LIS-ave}}}
37: \def\rhatLISrev{\underline{\hat r}_{\,\mbox{\tiny LIS}}}
38: \def\rhatLISbridged{\hat r_{\mbox{\tiny LIS-bridged}}}
39: \def\Var{\mbox{Var}}
40: \def\Cor{\mbox{Cor}}
41:
42: \begin{document}
43:
44: \fontsize{11}{16pt}\selectfont
45:
46: \begin{center}
47:
48: {\small Technical Report No.\ 0511,
49: Department of Statistics, University of Toronto}
50:
51: \vspace*{0.45in}
52:
53: {\LARGE \bf Estimating Ratios of Normalizing Constants Using \\[6pt]
54: Linked Importance Sampling}
55:
56: \vspace*{9pt}
57:
58: {\large Radford M. Neal}\\[4pt]
59: Department of Statistics and Department of Computer Science \\
60: University of Toronto, Toronto, Ontario, Canada \\
61: \texttt{http://www.cs.utoronto.ca/$\sim$radford/} \\
62: \texttt{radford@stat.utoronto.ca}\\[6pt]
63:
64: 8 November 2005
65:
66: \end{center}
67:
68: \vspace*{8pt}
69:
70:
71: \noindent \textbf{Abstract.}\ \ Ratios of normalizing constants for
two distributions are needed both in Bayesian statistics, where they
73: are used to compare models, and in statistical physics, where they
74: correspond to differences in free energy. Two approaches have long
75: been used to estimate ratios of normalizing constants. The `simple
76: importance sampling' (SIS) or `free energy perturbation' method uses a
77: sample drawn from just one of the two distributions. The `bridge
78: sampling' or `acceptance ratio' estimate can be viewed as the ratio of
79: two SIS estimates involving a bridge distribution. For both methods,
80: difficult problems must be handled by introducing a sequence of
81: intermediate distributions linking the two distributions of interest,
82: with the final ratio of normalizing constants being estimated by the
83: product of estimates of ratios for adjacent distributions in this
84: sequence. Recently, work by Jarzynski, and independently by Neal, has
85: shown how one can view such a product of estimates, each based on
86: simple importance sampling using a single point, as an SIS estimate on
87: an extended state space. This `Annealed Importance Sampling' (AIS)
88: method produces an exactly unbiased estimate for the ratio of
89: normalizing constants even when the Markov transitions used do not
90: reach equilibrium. In this paper, I show how a corresponding `Linked
91: Importance Sampling' (LIS) method can be constructed in which the
92: estimates for individual ratios are similar to bridge sampling
93: estimates. As a further elaboration, bridge sampling rather than
94: simple importance sampling can be employed at the top level for both
95: AIS and LIS, which sometimes produces further improvement. I show
96: empirically that for some problems, LIS estimates are much more
97: accurate than AIS estimates found using the same computation time,
98: although for other problems the two methods have similar performance.
99: Like AIS, LIS can also produce estimates for expectations, even when
100: the distribution contains multiple isolated modes. AIS is related to
101: the `tempered transition' method for handling isolated modes, and to a
102: method for `dragging' fast variables. Linked sampling methods similar
103: to LIS can be constructed that are analogous to tempered transitions
104: and to this method for dragging fast variables, which may sometimes
105: work better than those analogous to AIS.
106:
107: \newpage
108:
109:
110: \section{\hspace*{-7pt}Introduction}\label{sec-intro}\vspace*{-10pt}
111:
112: Consider two distributions on the same space, with probability mass or
113: density functions $\pi_0(x) = p_0(x)/Z_0$ and $\pi_1(x) =
114: p_1(x)/Z_1$. Suppose that we are not able to directly compute $\pi_0$
115: and $\pi_1$, but only $p_0$ and $p_1$, since we do not know the
116: normalizing constants, $Z_0$ and $Z_1$. We wish to find a Monte Carlo
117: estimate for the ratio of these normalizing constants, $Z_1/Z_0$,
118: which we sometimes denote by $r$, using samples of values drawn (at
119: least approximately) from $\pi_0$ and from $\pi_1$. Sometimes, we may
120: know $Z_0$, in which case we can arrange for it to be one, so that
121: estimation of this ratio will give the numerical value of $Z_1$.
122: Other times, we will be able to obtain only the ratio of normalizing
123: constants, but this may be sufficient for our purposes.
124:
125: In statistical physics, $x$ represents the state of some physical
126: system, and the distributions are typically `canonical' distributions
127: having the following form (for $j=0,1$):
128: \beq
129: p_j(x) & = & \exp(-\beta_j U(x,\lambda_j))
130: \label{eq-canonical}
131: \eeq
132: where $U(x,\lambda_j)$ is an `energy' function, which may depend on the
133: parameter $\lambda_j$, and $\beta_j$ is the inverse temperature of
134: system $j$. Many interesting properties of the systems are related
135: to the `free energy', defined as $-\log(Z_j)\,/\,\beta_j$. Often, only
136: the difference in free energy between systems $0$ and $1$ is relevant,
137: and this is determined by the ratio $Z_1/Z_0$.
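
As a small worked illustration, suppose that the two systems share a
common inverse temperature, $\beta_0=\beta_1=\beta$ (an assumption made
here only for simplicity), and write $F_j = -\log(Z_j)\,/\,\beta$ for the
free energy of system $j$ (the symbol $F_j$ is introduced just for this
illustration). The free energy difference is then a simple function of
the ratio of normalizing constants:
\begin{eqnarray*}
  F_1 - F_0 & = & -\,{\log(Z_1) \over \beta}\ +\ {\log(Z_0) \over \beta}
  \ \ =\ \ -\,{1 \over \beta}\, \log\!\left( {Z_1 \over Z_0} \right)
\end{eqnarray*}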
138:
139: In Bayesian statistics, $x$ comprises the parameters and latent
140: variables for some statistical model, $\pi_0$ is the prior
141: distribution for these quantities (for which the normalizing constant
142: is usually known), and $\pi_1$ is the posterior distribution given the
143: observed data. We can compute $p_1(x)$ as the product of the prior
144: density for $x$ and the probability of the data given $x$, but the
145: normalizing constant, $Z_1$, is difficult to compute. We can
146: interpret $Z_1$ as the `marginal likelihood' --- the probability of
147: the observed data under this model, integrating over possible values
148: of the model's parameters and latent variables. The marginal
149: likelihood for a model indicates how well it is supported
150: by the data.
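
To make this concrete, let $D$ denote the observed data and
$P(D\,|\,x)$ the probability of the data given $x$ (both symbols are
notation assumed for this illustration), and suppose that the prior is
normalized, so that $p_0=\pi_0$ and $Z_0=1$.  Then, with the integral
replaced by a sum if $x$ is discrete,
\begin{eqnarray*}
  p_1(x) & = & \pi_0(x)\, P(D\,|\,x) \\[4pt]
  Z_1 & = & \int \pi_0(x)\, P(D\,|\,x)\, dx \ \ =\ \ P(D)
\end{eqnarray*}
so that in this setting the ratio $Z_1/Z_0$ is the marginal likelihood itself.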
151:
152: Although I will use simple distributions as illustrations in this
153: paper, in real applications, $x$ is usually high dimensional, and at
154: least one of $\pi_0$ and $\pi_1$ is usually quite complex.
155: Accordingly, sampling from these distributions generally requires use
156: of Markov chain methods, such as the venerable Metropolis algorithm
157: (Metropolis, \textit{et al} 1953). See (Neal 1993) for a review of
158: Markov chain sampling methods. Sometimes, however, $\pi_0$ will be
159: relatively simple, and independent points drawn from it can be
160: generated efficiently, as would often be the case with the prior
161: distribution for a Bayesian model, or for a physical system at
162: infinite temperature ($\beta_0=0$).
163:
164: Many methods for estimating ratios of normalizing constants from Monte
165: Carlo data have been investigated in the physics literature (for a
166: review, see (Neal 1993, Section 6.2)), and later rediscovered in the
167: statistics literature (Gelman and Meng 1998). A logical method to
168: start with is `simple importance sampling' (SIS), also called `free energy
169: perturbation', based on the following identity, which can
170: easily be proved on the assumption that no region having zero probability
171: under $\pi_0$ has non-zero probability under $\pi_1$:
172: \beq
173: {Z_1 \over Z_0} & = & E_{\pi_0}\! \left[ {p_1(X) \over p_0(X)} \right]
174: \ \ \approx \ \ {1 \over N} \sum_{i=1}^N {p_1(x^{(i)}) \over p_0(x^{(i)})}
175: \ \ =\ \ {1 \over N} \sum_{i=1}^N \rhatSIS^{(i)}
176: \ \ =\ \ \rhatSIS
177: \label{eq-simple}
178: \eeq
179: In the above equation, $E_{\pi_0}$ denotes an expectation with
180: respect to the distribution
181: $\pi_0$, which is estimated by a Monte Carlo average over points
$x^{(1)},\ldots,x^{(N)}$ drawn from $\pi_0$ (either independently, or using a
183: Markov chain sampler).
184: Here and later, $\hat r_{\mbox{\tiny M}}$ will denote an estimate of
185: $r=Z_1/Z_0$, found by method M. If this estimate is an average of
186: unbiased estimates based on a number of samples, these individual
187: estimates will be denoted by $\hat r_{\mbox{\tiny M}}^{(i)}$.
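
The identity in equation~(\ref{eq-simple}) can be checked in a single
line.  The following sketch is written for densities (sums replace
integrals for probability mass functions), and relies on the assumption
stated above, that no region with zero probability under $\pi_0$ has
non-zero probability under $\pi_1$, so that restricting the integrals to
the region where $p_0(x)>0$ loses nothing:
\begin{eqnarray*}
  E_{\pi_0}\! \left[ {p_1(X) \over p_0(X)} \right]
  & = & \int_{\{x\,:\,p_0(x)>0\}} {p_0(x) \over Z_0}\ {p_1(x) \over p_0(x)}\ dx
  \ \ =\ \ {1 \over Z_0} \int_{\{x\,:\,p_0(x)>0\}} p_1(x)\, dx
  \ \ =\ \ {Z_1 \over Z_0}
\end{eqnarray*}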
188:
189: The simple importance sampling estimate, $\rhatSIS$, will be poor
190: if $\pi_0$ and $\pi_1$ are not close enough --- in particular, if any
191: region with non-negligible probability under $\pi_1$ has very small
192: probability under $\pi_0$. Such a region would have an important
193: effect on the value of $r$, but very little information about it would
194: be contained in the sample from $\pi_0$. In such a situation, it may
195: be possible to obtain a good estimate by introducing intermediate
196: distributions. Parameterizing these distributions in some way using
197: $\eta$, we can define a sequence of distributions,
198: $\pi_{\eta_0},\ldots,\pi_{\eta_n}$, with $\eta_0=0$ and $\eta_n=1$ so
199: that the first and last distributions in the sequence are $\pi_0$ and
200: $\pi_1$, with the intermediate distributions interpolating between
201: them. We can then write
202: \beq
203: {Z_1 \over Z_0} & = & \prod_{j=0}^{n-1} {Z_{\eta_{j+1}} \over Z_{\eta_j}}
204: \label{eq-intermed}
205: \eeq
206: Provided that $\pi_{\eta_{j+1}}$ and $\pi_{\eta_j}$ are close enough,
207: we can estimate each of the factors
208: $Z_{\eta_{j+1}}/Z_{\eta_j}$ using simple
209: importance sampling, and from these estimates obtain an estimate for $Z_1/Z_0$.
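
Written out, such an estimate might take the following form, in which
$N_j$ and $x_j^{(i)}$ are notation assumed for this illustration, with
$x_j^{(1)},\ldots,x_j^{(N_j)}$ being the points drawn (at least
approximately) from $\pi_{\eta_j}$:
\begin{eqnarray*}
  {Z_1 \over Z_0} & \approx & \prod_{j=0}^{n-1}\, \left[\,
    {1 \over N_j} \sum_{i=1}^{N_j}
    {p_{\eta_{j+1}}(x_j^{(i)}) \over p_{\eta_j}(x_j^{(i)})} \,\right]
\end{eqnarray*}
Note that in this form, points are needed only from
$\pi_{\eta_0},\ldots,\pi_{\eta_{n-1}}$, and not from $\pi_1$ itself.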
210:
211: We can obtain good estimates in a wider range of situations, or using
212: fewer intermediate distributions (sometimes none), by applying a
213: technique introduced by Bennett (1976), who called it the `acceptance
214: ratio' method. This method was later rediscovered by Meng and Wong
215: (1996), who called it `bridge sampling'. Lu, Singh, and Kofke (2003)
216: provide a recent review and assessment. One way of viewing this
217: method is that it replaces the simple importance sampling estimate for
218: $Z_1/Z_0$ by a ratio of estimates for $Z_*/Z_0$ and $Z_*/Z_1$, where
219: $Z_*$ is the normalizing constant for a `bridge distribution',
220: $\pi_*(x) = p_*(x)/Z_*$, which is chosen so that it is overlapped by
221: both $\pi_0$ and $\pi_1$. Using simple importance sampling estimates
222: for $Z_*/Z_0$ and $Z_*/Z_1$, we can obtain the estimate
223: \beq
224: {Z_1 \over Z_0} & = &
225: E_{\pi_0}\! \left[ {p_*(X) \over p_0(X)} \right] \, \Big/\,
226: E_{\pi_1}\! \left[ {p_*(X) \over p_1(X)} \right]
227: \ \ \approx \ \
228: {1 \over N_0} \sum_{k=1}^{N_0} {p_*(x_{0,k}) \over p_0(x_{0,k})} \ \Big/\
229: {1 \over N_1} \sum_{k=1}^{N_1} {p_*(x_{1,k}) \over p_1(x_{1,k})}
230: \ \ =\ \ \rhatbridge\ \ \ \ \
231: \label{eq-bridge}
232: \eeq
233: where $x_{0,1},\ldots,x_{0,N_0}$ are drawn from $\pi_0$ and
234: $x_{1,1},\ldots,x_{1,N_1}$ are drawn from $\pi_1$.
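
The identity underlying equation~(\ref{eq-bridge}) follows by applying
the identity of equation~(\ref{eq-simple}) twice, once with $\pi_0$ and
once with $\pi_1$ as the sampling distribution.  As a sketch, assuming
that $Z_*>0$ and that $p_*(x)$ is zero wherever $p_0(x)$ or $p_1(x)$ is
zero:
\begin{eqnarray*}
  E_{\pi_0}\! \left[ {p_*(X) \over p_0(X)} \right] \, \Big/\,
  E_{\pi_1}\! \left[ {p_*(X) \over p_1(X)} \right]
  & = & {Z_* \over Z_0} \, \Big/\, {Z_* \over Z_1}
  \ \ =\ \ {Z_1 \over Z_0}
\end{eqnarray*}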
235:
236: One simple choice for the bridge distribution is the `geometric' bridge:
237: \beq
238: p\geo_*(x) & = & \sqrt{p_0(x)p_1(x)}
239: \label{eq-geo-bridge}
240: \eeq
241: which is in a sense half-way between $\pi_0$ and $\pi_1$.
242: As discussed by Bennett (1976) and by Meng and Wong (1996), the asymptotically
243: optimal choice of bridge distribution is
244: \beq
245: p\opt_*(x) & = & { p_0(x)p_1(x) \over r (N_0/N_1) p_0(x)\, +\, p_1(x)}
246: \label{eq-opt-bridge}
247: \eeq
248: where $r=Z_1/Z_0$. Of course, we cannot use this bridge distribution
249: in practice, since we do not know $r$. We can use a preliminary guess
250: at $r$ to define an initial bridge distribution, however, which will
251: give us a bridge sampling estimate for $Z_1/Z_0$. Using this estimate
252: as the new value of $r$, we can refine our bridge distribution, iterating
253: this process as many times as desired. The result of this iteration can
254: also be viewed as a maximum likelihood estimate for $r$, as discussed by
Shirts, \textit{et~al} (2003), who argue on this basis that it is
256: asymptotically as good as any estimate for $r$. I have found that
257: estimates with $r$ set iteratively are often better than those found
258: with the true value of $r$ (which does not contradict optimality of the true
259: value for a fixed choice of bridge distribution).
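
Substituting the bridge of equation~(\ref{eq-opt-bridge}) into
equation~(\ref{eq-bridge}) gives one way of writing this iteration
explicitly.  In the sketch below, $\hat r^{[t]}$ (the guess at $r$ after
$t$ iterations) and $s=N_0/N_1$ are notation assumed for this
illustration:
\begin{eqnarray*}
  \hat r^{[t+1]} & = &
  {1 \over N_0} \sum_{k=1}^{N_0}
    {p_1(x_{0,k}) \over \hat r^{[t]} s\, p_0(x_{0,k})\, +\, p_1(x_{0,k})}
  \ \ \Big/\ \
  {1 \over N_1} \sum_{k=1}^{N_1}
    {p_0(x_{1,k}) \over \hat r^{[t]} s\, p_0(x_{1,k})\, +\, p_1(x_{1,k})}
\end{eqnarray*}
The numerator uses $p\opt_*/p_0 = p_1/(r s\, p_0 + p_1)$ and the
denominator uses $p\opt_*/p_1 = p_0/(r s\, p_0 + p_1)$, with $r$ replaced
by the current guess.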
260:
261: If $\pi_0$ and $\pi_1$ do not overlap sufficiently, no bridge
262: distribution will produce good estimates, and we will have to
263: introduce intermediate distributions as in
264: equation~(\ref{eq-intermed}). Note, however, that the bridge sampling
265: estimate with either of the above bridge distributions converges
to the correct ratio asymptotically as long as there is some region that
267: has non-zero probability under both $\pi_0$ and $\pi_1$, a much weaker
268: requirement than that for simple importance sampling.
269:
270: This advantage of bridge sampling over SIS can be seen in a simple
271: example involving distributions that are uniform over an interval of the
272: reals. Let $p_0(x) = I_{(0,3)}(x)$ and $p_1(x)=I_{(2,4)}(x)$, so that
273: $Z_0=3$ and $Z_1=2$. The simple importance sampling estimate of
274: equation~(\ref{eq-simple}) does not work, as it converges to $1/3$
275: rather than $2/3$. However, using a bridge distribution with
276: $p_*(x)=I_{(2,3)}$, which is effectively what both $p_*\opt$ and
277: $p_*\geo$ will be in this example, the bridge sampling estimate of
278: equation~(\ref{eq-bridge}) converges to the correct value, since the
279: numerator converges to $1/3$ and the denominator to $1/2$.
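
As a worked check of the values quoted in this example, the three
expectations involved can be written out directly, using
$\pi_0(x)=1/3$ on $(0,3)$ and $\pi_1(x)=1/2$ on $(2,4)$:
\begin{eqnarray*}
  E_{\pi_0}\! \left[ {p_1(X) \over p_0(X)} \right]
  & = & \int_0^3 {1 \over 3}\, I_{(2,4)}(x)\, dx \ \ =\ \ {1 \over 3} \\[4pt]
  E_{\pi_0}\! \left[ {p_*(X) \over p_0(X)} \right]
  & = & \int_0^3 {1 \over 3}\, I_{(2,3)}(x)\, dx \ \ =\ \ {1 \over 3} \\[4pt]
  E_{\pi_1}\! \left[ {p_*(X) \over p_1(X)} \right]
  & = & \int_2^4 {1 \over 2}\, I_{(2,3)}(x)\, dx \ \ =\ \ {1 \over 2}
\end{eqnarray*}
The ratio of the last two is $(1/3)\,/\,(1/2) = 2/3 = Z_1/Z_0$.  The
geometric bridge here is exactly $I_{(2,3)}$, whereas the bridge of
equation~(\ref{eq-opt-bridge}) equals $I_{(2,3)}(x)\,/\,(r N_0/N_1 + 1)$,
a constant multiple of $I_{(2,3)}$ that cancels between the numerator
and denominator of the bridge sampling estimate.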
280:
281: Although both simple importance sampling and bridge sampling have been
282: successfully used in many applications, they have some deficiencies.
283: One issue is that although the SIS estimate of
284: equation~(\ref{eq-simple}) is unbiased for $Z_1/Z_0$, the bridge
285: sampling estimate of equation~(\ref{eq-bridge}) is not, and the same
286: would appear to be the case for an estimate using intermediate
287: distributions (via equation~(\ref{eq-intermed})). This is of no
288: direct importance, particularly since we are often more interested in
289: $\log(Z_1/Z_0)$ than in $Z_1/Z_0$ itself. However, it does preclude
290: averaging independent replications of the bridge sampling estimate to
291: obtain a better estimate, since the bias would prevent convergence to
292: the correct value as the number of replications increases. A more
293: vexing difficulty is that, except sometimes for $\pi_0$, sampling from
294: the distributions $\pi_{\eta}$ must usually be done by Markov chain
295: methods, which approach the desired distribution only asymptotically.
296: To speed convergence, the Markov chain for sampling $\pi_{\eta_j}$ is
297: often started from the last state sampled for $\pi_{\eta_{j-1}}$, but
298: it is unclear how many iterations should then be discarded before an
299: adequate approximation to the correct distribution is reached.
300:
301: Surprisingly, these difficulties can be completely overcome when using
302: simple importance sampling with a single point. As shown by Jarzynski
303: (1997, 2001), and later independently by myself (Neal 2001), an estimate for
304: $Z_1/Z_0$ using intermediate distributions as in
305: equation~(\ref{eq-intermed}) will be exactly unbiased if each of the
306: ratios $Z_{\eta_{j+1}}/Z_{\eta_j}$ is estimated using the simple
307: importance sampling estimate of equation~(\ref{eq-simple}) with $N=1$,
308: sampling each distribution with a Markov chain update starting with the
309: point for the previous distribution.
310: Averaging the estimates obtained from $M$ independent replications of this
311: process (called `runs') produces the following estimate:
312: \beq
313: {Z_1 \over Z_0} & \approx &
314: {1 \over M}\, \sum_{i=1}^M\, \prod_{j=0}^{n-1}\,
315: {p_{\eta_{j+1}}(x^{(i)}_j) \over p_{\eta_j}(x^{(i)}_j)}
316: \ \ =\ \ {1 \over M} \sum_{i=1}^M \rhatAIS^{(i)}
317: \ \ =\ \ \rhatAIS
318: \label{eq-ais-est}
319: \eeq
320: Here, $x^{(1)}_0,\ldots,x^{(M)}_0$ are drawn independently from $\pi_0$,
and each $x^{(i)}_j$ for $j>0$ is generated from $x^{(i)}_{j-1}$ by applying
a Markov chain transition that leaves $\pi_{\eta_j}$ invariant.  This
323: single Markov transition (which could, however, consist of several Metropolis
324: or other updates if we so choose), will usually not be enough to reach
325: equilibrium, but the estimate $\rhatAIS$ is nevertheless exactly unbiased, and
326: will converge to the true value as $M$ increases, provided that no region
327: having zero probability under $\pi_{\eta_j}$ has non-zero probability
328: under $\pi_{\eta_{j+1}}$. This can be proved by showing how the
329: estimate above can be seen as a simple importance sampling estimate on an
330: extended state space that includes the values sampled for the intermediate
331: distributions.
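
For the special case of a single intermediate distribution ($n=2$),
unbiasedness can also be checked directly, without the extended state
space argument.  In the following sketch, $T_{\eta_1}(x,x')$ is notation
assumed here for the density of the transition that leaves $\pi_{\eta_1}$
invariant, and the support conditions stated above are taken to hold:
\begin{eqnarray*}
  E\!\left[\, \rhatAIS^{(i)} \,\right]
  & = & \int\!\!\int \pi_0(x_0)\, T_{\eta_1}(x_0,x_1)\
        {p_{\eta_1}(x_0) \over p_0(x_0)}\
        {p_1(x_1) \over p_{\eta_1}(x_1)}\ dx_0\, dx_1 \\[4pt]
  & = & {Z_{\eta_1} \over Z_0} \int \left[ \int \pi_{\eta_1}(x_0)\,
        T_{\eta_1}(x_0,x_1)\, dx_0 \right]
        {p_1(x_1) \over p_{\eta_1}(x_1)}\ dx_1 \\[4pt]
  & = & {Z_{\eta_1} \over Z_0} \int \pi_{\eta_1}(x_1)\
        {p_1(x_1) \over p_{\eta_1}(x_1)}\ dx_1
  \ \ =\ \ {Z_{\eta_1} \over Z_0}\, \cdot\, {Z_1 \over Z_{\eta_1}}
  \ \ =\ \ {Z_1 \over Z_0}
\end{eqnarray*}
The second line uses $\pi_0(x_0)\,p_{\eta_1}(x_0)/p_0(x_0) =
(Z_{\eta_1}/Z_0)\,\pi_{\eta_1}(x_0)$, and the third uses only the
invariance of $\pi_{\eta_1}$ under $T_{\eta_1}$, not convergence of the
chain to equilibrium.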
332:
333: I call this method `Annealed Importance Sampling' (AIS), since the
334: sequence of distributions used often corresponds to an `annealing'
335: procedure, in which the temperature is gradually decreased. As I
336: discuss in (Neal 2001), this allows the procedure to sample different
337: isolated modes of the distribution on different runs, properly
338: weighting the points obtained from each of these runs to produce the
339: correct probability for each mode. AIS is related to an earlier
340: method for moving between isolated modes that I call `tempered
341: transitions' (Neal 1996). In a recent paper (Neal 2004), I show how
342: tempered transitions can be modified to produce a method for efficient
343: Markov chain sampling when some of the state variables are `fast' ---
344: ie, when it is possible to more quickly recompute the probability of a
345: state when only these fast variables change than when the other `slow'
346: variables change as well. In this method, the fast variables are
347: `dragged' through intermediate distributions in order to produce more
348: appropriate values to go with a proposed change to the slow variables.
349: Deciding whether to accept the final proposal involves what is in
350: effect an estimate of the ratio of normalizing constants for the
351: conditional distributions of the fast variables.
352:
353: In this paper, I show how the ideas behind Annealed Importance
354: Sampling and bridge sampling can be combined. I call the resulting
355: method `Linked Importance Sampling' (LIS), since the two samples
356: needed for bridge sampling are linked by a single state that is used
357: in both. Intermediate distributions can be used, with each
358: distribution being linked by a single state to the next distribution.
359: In contrast to bridge sampling, LIS estimates are unbiased, and as is
360: the case for AIS, they remain exactly unbiased even when intermediate
361: distributions are used, and when sampling is done using Markov chain
362: transitions that have not converged to their equilibrium
363: distributions.
364:
365: Crooks (2000) mentions a different way of combining AIS with bridge
366: sampling --- since AIS estimates are simple importance sampling
367: estimates on an extended state space, we can combine `forward' and
368: `reverse' estimates to produce a bridge sampling estimate that may be
369: superior. I will call this method `bridged AIS'. Similarly,
370: such a top-level application of bridge sampling can be combined with
371: the low-level application of bridge sampling in LIS, giving what I
372: call `bridged LIS'.
373:
374: Using tests on sequences of one-dimensional distributions, I
375: demonstrate that for some problems LIS is much more efficient than AIS
376: --- a result that should be expected, since in extreme cases, such as
377: for the uniform distributions discussed above, the simple importance
378: sampling estimates underlying AIS do not converge to the correct
379: answer even asymptotically, whereas bridge sampling estimates do. For
380: some other problems, however, AIS and LIS perform about equally well.
381: The bridged version of AIS sometimes performs much better than the
382: unbridged version, but still performs less well than LIS and its
383: bridged version on some problems. I also analyse the asymptotic
384: properties of AIS and LIS for some types of distribution, providing
385: additional insight into their behaviour.
386:
387: Variants of tempered transitions and of my method for dragging fast
388: variables can be constructed that are analogous to LIS rather than to
389: AIS. I discuss the `linked' variant of tempered transitions briefly,
390: and include a more detailed description of a linked version of
391: dragging, which may sometimes be better than the version related to
392: AIS. I conclude by discussing some possibilities for future research.
393:
394:
395: \section{\hspace*{-7pt}The Linked Importance Sampling
396: procedure}\label{sec-lis}\vspace*{-10pt}
397:
398: Assume that we can evaluate the unnormalized probability or density
399: functions $p_{\eta}(x)$, for any value of the parameter $\eta$, with
400: the normalized form of such a distribution being denoted by
401: $\pi_{\eta}$. The values $\eta=0$ and $\eta=1$ define the two
402: distributions we are interested in, for which the normalizing
403: constants are $Z_0$ and $Z_1$. A sequence of $n\!-\!1$ intermediate
values for $\eta$ defines distributions that will assist in estimating
405: the ratio of these normalizing constants, $r=Z_1/Z_0$. We denote the
406: values of $\eta$ for the distributions used by $\eta_0,\ldots,\eta_n$,
407: with $\eta_0=0$ and $\eta_n=1$. Typically, $\eta_j<\eta_{j+1}$ for
408: all $j$.
409:
410: For problems in statistical physics, $\eta$ might be proportional to
411: the inverse temperature, $\beta$, of equation~(\ref{eq-canonical}), or
412: might map to a value for $\lambda$. For a Bayesian inference
413: problem, $\eta$ might be a power that the likelihood is raised to, so
414: that $\eta=0$ causes the data to be ignored, and $\eta=1$ gives full
415: weight to the data; the ratio $Z_1/Z_0$ will then be the marginal
416: likelihood. In both of these examples, progressing in small steps
417: from $\eta=0$ to $\eta=1$ is not only useful in estimating $Z_1/Z_0$,
418: but also often has an `annealing' effect, which helps avoid being
419: trapped in a local mode of the distribution.
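
As an illustration of the Bayesian case, one possible way of writing
such a family is sketched below, where $D$ (the observed data) and
$P(D\,|\,x)$ (the likelihood) are notation assumed for this illustration,
and the prior, $\pi_0$, is taken to be normalized:
\begin{eqnarray*}
  p_{\eta}(x) & = & \pi_0(x)\, P(D\,|\,x)^{\eta} \\[4pt]
  {p_{\eta_{j+1}}(x) \over p_{\eta_j}(x)} & = &
  P(D\,|\,x)^{\,\eta_{j+1}-\eta_j}
\end{eqnarray*}
Setting $\eta=0$ recovers the prior (with $Z_0=1$), and setting $\eta=1$
gives the unnormalized posterior, $p_1(x) = \pi_0(x)\,P(D\,|\,x)$.  The
second line holds wherever $p_{\eta_j}(x)>0$.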
420:
421: \subsection{\hspace*{-4pt}Details of the LIS procedure}\vspace*{-4pt}
422:
423: For each distribution, $\pi_{\eta}$, assume we have a pair of Markov chain
424: transition probability (or density) functions, denoted by $T_{\eta}(x,x')$
425: and $\underline{T}_{\eta}(x,x')$, satisfying $\int T_{\eta}(x,x') dx' = 1$
426: and $\int \underline{T}_{\eta}(x,x') dx' = 1$, for which the following mutual
427: reversibility relationship holds:
428: \beq
429: \pi_{\eta}(x)\,T_{\eta}(x,x') & = &
430: \pi_{\eta}(x')\,\underline{T}_{\eta}(x',x),\ \ \ \
431: \mbox{for all $x$ and $x'$}
432: \label{eq-rev}
433: \eeq
434: From this relationship, one can easily show that both $T_{\eta}$ and
435: $\underline{T}_{\eta}$ leave $\pi_{\eta}$ invariant --- ie, that
436: $\int \pi_{\eta}(x)
437: T_{\eta}(x,x') dx = \pi_{\eta}(x')$, and the same for $\underline{T}_{\eta}$.
438: If $T_{\eta}$ is reversible (ie, satisfies `detailed balance'), then
439: $\underline{T}_{\eta}$ will be the same as $T_{\eta}$. Non-reversible
440: transitions often arise when components of state are updated in some
441: predetermined order, in which case the reverse transition simply updates
442: components in the opposite order. As a special case, $T_{\eta}$ might
443: draw the next state from $\pi_{\eta}$ independently of the current state.
444: Such independent sampling may often be possible for $T_0$.
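
The invariance claim can be verified in one line from
equation~(\ref{eq-rev}), together with the fact that
$\underline{T}_{\eta}(x',\cdot)$ integrates to one.  As a sketch, written
for densities:
\begin{eqnarray*}
  \int \pi_{\eta}(x)\, T_{\eta}(x,x')\, dx
  & = & \int \pi_{\eta}(x')\, \underline{T}_{\eta}(x',x)\, dx
  \ \ =\ \ \pi_{\eta}(x') \int \underline{T}_{\eta}(x',x)\, dx
  \ \ =\ \ \pi_{\eta}(x')
\end{eqnarray*}
The corresponding result for $\underline{T}_{\eta}$ follows in the same
way, applying equation~(\ref{eq-rev}) with the roles of $x$ and $x'$
exchanged and using $\int T_{\eta}(x',x)\, dx = 1$.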
445:
446: These Markov chain transitions are used to obtain samples that are
447: approximately drawn from each of the $n\!+\!1$ distributions,
448: $\pi_{\eta_0},\ldots,\pi_{\eta_n}$. We assume that we can begin
449: sampling from $\pi_0$ by drawing a single point independently from
450: $\pi_0$. For $j>0$, we begin sampling from $\pi_{\eta_j}$ by
451: selecting a link state, $x_{j-1*j}$, from the sample associated with
452: $\pi_{\eta_{j-1}}$. For all $j$, we produce a sample of $K_j\!+\!1$
453: states from this starting point by applying a total of $K_j$ forward
454: ($T_{\eta_j}$) or reversed ($\underline T_{\eta_j}$) Markov
455: transitions. Link states are selected using bridge distributions,
456: $p_{j*j+1}$, which are defined in terms of $p_{\eta_j}$ and
457: $p_{\eta_{j+1}}$, perhaps using the form of
458: equation~(\ref{eq-geo-bridge}) or~(\ref{eq-opt-bridge}), with $p_0$
459: replaced by $p_{\eta_j}$ and $p_1$ by $p_{\eta_{j+1}}$.
460:
461: In detail, the Linked Importance Sampling procedure produces $M$ estimates,
462: $\rhatLIS^{(1)},\ldots,\rhatLIS^{(M)}$, that are averaged to produce
463: the final estimate, $\rhatLIS$. Each $\rhatLIS^{(i)}$ is
464: obtained by performing the following:\vspace*{5pt}
465:
466: \begin{center}\bf The LIS Procedure\end{center}\vspace*{-5pt}
467:
468: \begin{enumerate}
469: \item[1)] Pick an integer $\nu_0$ uniformly at random from $\{0,\ldots,K_0\}$,
470: and then set $x_{0,\nu_0}$ to a value drawn from $\pi_{\eta_0}$.
471: \item[2)] For $j\,=\,0,\ldots,n$, sample $K_j\!+\!1$ states drawn (at
472: least approximately) from $\pi_{\eta_j}$ as follows:
473: \begin{enumerate}
474: \item[a)] If $j>0$:\ \ Pick an integer $\nu_j$ uniformly at random from
475: $\{0,\ldots,K_j\}$, and then set $x_{j,\nu_j}$ to $x_{j-1*j}$.
476: \item[b)] For $k\,=\,\nu_j+1,\ldots,K_j$, draw $x_{j,k}$ according to the
477: forward Markov chain transition probabilities
478: $T_{\eta_j}(x_{j,k-1},x_{j,k})$. (If $\nu_j=K_j$, do nothing in
479: this step.)
480: \item[c)] For $k\,=\,\nu_j-1,\ldots,0$, draw $x_{j,k}$ according to the
481: reverse Markov chain transition probabilities
482: $\underline{T}_{\eta_j}(x_{j,k+1},x_{j,k})$. (If $\nu_j=0$, do
483: nothing in this step.)
484: \item[d)] If $j<n$:\ \ Pick a value for $\mu_j$ from
485: $\{0,\ldots,K_j\}$ according to the following
486: probabilities:\vspace*{-2pt}
487: \beq
488: \Pi_0(\mu_j\,|\,x_j) & = &
489: {p_{j*j+1}(x_{j,\mu_j}) \over p_{\eta_j}(x_{j,\mu_j})}
490: \ \Big/\
491: \sum_{k=0}^{K_j} {p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k})}
492: \label{eq-pmuj}
493: \eeq
494: and then set $x_{j*j+1}$ to $x_{j,\mu_j}$.
495: \end{enumerate}
496: \item[3)] Set $\mu_n$ to a value chosen uniformly at random from
497: $\{0,\ldots,K_n\}$. (This selection has no effect on
498: the estimate, but is used in the proof of correctness.)
499: \item[4)] Compute the estimate from this run as follows:
500: \beq
501: \rhatLIS^{(i)} & = & \prod_{j=0}^{n-1} \left[
502: {1 \over K_j+1}\, \sum_{k=0}^{K_j}\,
503: { p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) }
504: \ \Big/\
505: {1 \over K_{j+1}+1}\, \sum_{k=0}^{K_{j+1}}\,
506: { p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }
507: \right]
508: \label{eq-lis}
509: \eeq
          (Note that most of the factors of $1/(K_j\!+\!1)$ and
          $1/(K_{j+1}\!+\!1)$ cancel, leaving an overall factor of
          $(K_n\!+\!1)\,/\,(K_0\!+\!1)$, but the redundant factors
          are retained above for clarity of meaning.)\vspace*{-6pt}
514: \end{enumerate}
515: The result of performing steps (1) through (3) is illustrated in
516: Figure~\ref{fig-lis}. After $M$ runs of this procedure, the final
517: estimate is computed as
518: \beq
519: \rhatLIS & = & {1 \over M} \sum_{i=1}^M \rhatLIS^{(i)}
520: \eeq
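
As a concrete instance, consider the simplest case of no intermediate
distributions ($n=1$), with the geometric bridge of
equation~(\ref{eq-geo-bridge}) used for $p_{0*1}$.  This particular
combination is chosen here only for illustration, and the ratios below
are assumed to be well defined at all the sampled points.  The link
state is then selected according to
\begin{eqnarray*}
  \Pi_0(\mu_0\,|\,x_0) & = &
  \sqrt{p_1(x_{0,\mu_0}) \over p_0(x_{0,\mu_0})}
  \ \ \Big/\ \
  \sum_{k=0}^{K_0} \sqrt{p_1(x_{0,k}) \over p_0(x_{0,k})}
\end{eqnarray*}
and the estimate of equation~(\ref{eq-lis}) from one run reduces to
\begin{eqnarray*}
  \rhatLIS^{(i)} & = &
  {1 \over K_0+1}\, \sum_{k=0}^{K_0}\, \sqrt{p_1(x_{0,k}) \over p_0(x_{0,k})}
  \ \ \Big/\ \
  {1 \over K_1+1}\, \sum_{k=0}^{K_1}\, \sqrt{p_0(x_{1,k}) \over p_1(x_{1,k})}
\end{eqnarray*}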
521:
522: \begin{figure}[t]
523:
524: \centerline{\includegraphics[width=6.5in]{fig-lis.eps}}
525:
526: \caption[]{An illustration of Linked Importance Sampling. One
527: intermediate distribution is used, with $\eta_1=1/2$. The
528: distributions $\pi_0$, $\pi_{1/2}$, and $\pi_1$ are represented by
529: ovals enclosing the regions of high probability under each
530: distribution. Nine Markov chain transitions are performed at each
531: stage. The two link states are shown as black dots. The initial and
532: final states (indexed by $\nu_0$ and $\mu_n$) are shown as gray dots.
533: Other states generated by the forward and reverse Markov chain
534: transitions are shown as empty dots. For this run, $\nu_0\!=\!4$,
535: $\mu_0\!=\!9$, $\nu_1\!=\!1$, $\mu_1\!=\!8$, $\nu_2\!=\!3$, and
536: $\mu_2\!=\!7$.}\label{fig-lis}
537:
538: \end{figure}
539:
540: The crucial aspect of Linked Importance Sampling is that when moving
541: from distribution $\pi_{\eta_j}$ to $\pi_{\eta_{j+1}}$, a link state,
542: $x_{j*j+1}$, is randomly selected from among the sample of points
$x_{j,0},\ldots,x_{j,K_j}$ that are associated with $\pi_{\eta_j}$.
544: We can view the link state as part of the sample associated with
545: $\pi_{\eta_{j+1}}$ as well as that associated with $\pi_{\eta_j}$.
546: Accordingly, when using the `optimal' bridge of
547: equation~(\ref{eq-opt-bridge}), I will set $N_0/N_1$ to
548: $(K_j\!+\!1)/(K_{j+1}\!+\!1)$, though the proof of optimality for
549: bridge sampling does not guarantee that this is an optimal choice when
550: using this bridge distribution for LIS.
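
Written out, this choice of bridge for linking $\pi_{\eta_j}$ to
$\pi_{\eta_{j+1}}$ would take the following form, where $r_j$ is
notation assumed here for the value used in place of the unknown ratio
$Z_{\eta_{j+1}}/Z_{\eta_j}$ (for instance, a preliminary guess):
\begin{eqnarray*}
  p\opt_{j*j+1}(x) & = &
  { p_{\eta_j}(x)\, p_{\eta_{j+1}}(x) \over
    r_j\, \displaystyle{K_j+1 \over K_{j+1}+1}\ p_{\eta_j}(x)\, +\,
    p_{\eta_{j+1}}(x) }
\end{eqnarray*}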
551:
552: \subsection{\hspace*{-4pt}Proof that LIS estimates are unbiased}\vspace*{-4pt}
553:
554: In order to prove that $\rhatLIS^{(i)}$ is an unbiased estimate of
555: $r=Z_1/Z_0$, we can regard steps (1) through (3) above as defining a
556: distribution,
557: $\Pi_0$, over all the quantities involved in the procedure --- namely,
558: $x_j$, $\mu_j$, and $\nu_j$, for $j=0,\ldots,n$, with $x_j$ representing
559: $x_{j,0},\ldots,x_{j,K_j}$. We then
560: consider the procedure for generating these same quantities in reverse,
561: which operates as follows:\vspace*{5pt}
562:
563: \pagebreak
564:
565: \begin{center}\bf The Reverse LIS Procedure\end{center}\vspace*{-5pt}
566:
567: \begin{enumerate}
568: \item[1)] Pick an integer $\mu_n$ uniformly at random from $\{0,\ldots,K_n\}$,
569: and then set $x_{n,\mu_n}$ to a value drawn from $\pi_{\eta_n}$.
570: \item[2)] For $j\,=\,n,\ldots,0$, sample $K_j\!+\!1$ states drawn (at
571: least approximately) from $\pi_{\eta_j}$ as follows:
572: \begin{enumerate}
573: \item[a)] If $j<n$:\ \ Pick an integer $\mu_j$ uniformly at random from
574: $\{0,\ldots,K_j\}$, and then set $x_{j,\mu_j}$ to $x_{j*j+1}$.
575: \item[b)] For $k\,=\,\mu_j+1,\ldots,K_j$, draw $x_{j,k}$ according to the
576: forward Markov chain transition probabilities
577: $T_{\eta_j}(x_{j,k-1},x_{j,k})$. (If $\mu_j=K_j$, do nothing
578: in this step.)
579: \item[c)] For $k\,=\,\mu_j-1,\ldots,0$, draw $x_{j,k}$ according to the
580: reverse Markov chain transition probabilities
581: $\underline{T}_{\eta_j}(x_{j,k+1},x_{j,k})$. (If $\mu_j=0$,
582: do nothing in this step.)
583: \item[d)] If $j>0$:\ \ Pick a value for $\nu_j$ from
584: $\{0,\ldots,K_j\}$ according to the following
585: probabilities:\vspace*{-3pt}
586: \beq
587: \Pi_1(\nu_j\,|\,x_j) & = &
588: {p_{j-1*j}(x_{j,\nu_j}) \over p_{\eta_j}(x_{j,\nu_j})}
589: \ \Big/\
590: \sum_{k=0}^{K_j} {p_{j-1*j}(x_{j,k}) \over p_{\eta_j}(x_{j,k})}
591: \label{eq-pnuj}
592: \eeq
593: and then set $x_{j-1*j}$ to $x_{j,\nu_j}$.
594: \end{enumerate}
595: \item[3)] Set $\nu_0$ to a value chosen uniformly at random from
596: $\{0,\ldots,K_0\}$.\vspace*{-6pt}
597: \end{enumerate}
598: This reverse procedure also defines a distribution over all the
599: quantities generated ($x_j$, $\mu_j$, and $\nu_j$ for $j=0,\ldots,n$),
600: which will be denoted by $\Pi_1$.
601:
602: We now define the unnormalized probability (density) functions
603: $P_0(x,\mu,\nu) = Z_0 \Pi_0(x,\mu,\nu)$ and
604: $P_1(x,\mu,\nu) = Z_1 \Pi_1(x,\mu,\nu)$. The ratio of normalizing constants
605: for these distributions is obviously $r=Z_1/Z_0$. We can estimate this
606: ratio by simple importance sampling, using the ratios
607: \beq
608: {P_1(x,\mu,\nu) \over P_0(x,\mu,\nu)} & = &
609: { Z_1\, \Pi_1(\mu_n)\, \pi_{\eta_n}(x_{n,\mu_n})\,
610: \prod\limits_{j=0}^{n-1} \Pi_1(\mu_j)\,
611: \prod\limits_{j=0}^n \Pi_1(x_j\,|\,\mu_j,x_{j,\mu_j})\,
612: \prod\limits_{j=1}^{n} \Pi_1(\nu_j\,|\,x_j)\, \Pi_1(\nu_0)
613: \over
614: Z_0\, \Pi_0(\nu_0)\, \pi_{\eta_0}(x_{0,\nu_0})\,
615: \prod\limits_{j=1}^n \Pi_0(\nu_j)\,
616: \prod\limits_{j=0}^n \Pi_0(x_j\,|\,\nu_j,x_{j,\nu_j})\,
617: \prod\limits_{j=0}^{n-1} \Pi_0(\mu_j\,|\,x_j)\, \Pi_0(\mu_n)
618: }\ \ \
619: \label{eq-ratio01}
620: \eeq
621:
622: From Steps (2b) and (2c) of the forward and reverse procedures, along
623: with the mutual reversibility relationship of equation~(\ref{eq-rev}), we see
624: that
\beq
  \Pi_0(x_j\,|\,\nu_j,x_{j,\nu_j}) 
  & = &
  \prod_{k=\nu_j+1}^{K_j}\!\! T_{\eta_j}(x_{j,k-1},x_{j,k})\ \cdot\ 
  \prod_{k=0}^{\nu_j-1} \underline{T}_{\eta_j}(x_{j,k+1},x_{j,k}) \\[4pt]
  & = &
  \prod_{k=\nu_j+1}^{K_j}\!\! T_{\eta_j}(x_{j,k-1},x_{j,k})\ \cdot\ 
  \prod_{k=0}^{\nu_j-1} T_{\eta_j}(x_{j,k},x_{j,k+1})\,
                       {\pi_{\eta_j}(x_{j,k})\over\pi_{\eta_j}(x_{j,k+1})}
  \\[4pt]
  & = &
  {\pi_{\eta_j}(x_{j,0})\over\pi_{\eta_j}(x_{j,\nu_j})}\
  \prod_{k=1}^{K_j}\, T_{\eta_j}(x_{j,k-1},x_{j,k})
\label{eq-chain1}
\eeq
and similarly,
\beq
  \Pi_1(x_j\,|\,\mu_j,x_{j,\mu_j}) 
  & = &
  {\pi_{\eta_j}(x_{j,0})\over\pi_{\eta_j}(x_{j,\mu_j})}\
  \prod_{k=1}^{K_j}\, T_{\eta_j}(x_{j,k-1},x_{j,k})
\label{eq-chain2}
\eeq
648: From this, we see that parts of the ratio in equation~(\ref{eq-ratio01})
649: can be written as
650: \beq
651: { Z_1\,\pi_{\eta_n}(x_{n,\mu_n})\,
652: \prod\limits_{j=0}^n \, \Pi_1(x_j\,|\,\mu_j,x_{j,\mu_j})\,
653: \over
654: Z_0\,\pi_{\eta_0}(x_{0,\nu_0})\,
655: \prod\limits_{j=0}^n \, \Pi_0(x_j\,|\,\nu_j,x_{j,\nu_j})\,
656: }
657: & = &
658: {p_{\eta_n}(x_{n,\mu_n}) \over p_{\eta_0}(x_{0,\nu_0})}\,
659: \prod_{j=0}^n\, {\pi_{\eta_j}(x_{j,\nu_j}) \over \pi_{\eta_j}(x_{j,\mu_j})}
660: \ \ =\ \
661: \prod_{j=0}^{n-1}\,
662: {p_{\eta_{j+1}}(x_{j,\mu_j}) \over p_{\eta_j}(x_{j,\mu_j})}\ \ \
663: \label{eq-fact1}
664: \eeq
665: The last step uses the fact that for $j=1,\ldots,n$,
666: $x_{j,\nu_j} = x_{j-1*j} = x_{j-1,\mu_{j-1}}$.
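
As a concrete check of equation~(\ref{eq-chain1}), consider (purely for
illustration) a chain of three states, $x_{j,0}$, $x_{j,1}$, $x_{j,2}$, with
$\nu_j=1$.  One forward transition generates $x_{j,2}$ and one reverse
transition generates $x_{j,0}$, so that
\beq
  \Pi_0(x_j\,|\,\nu_j\!=\!1,\,x_{j,1})
  & = & T_{\eta_j}(x_{j,1},x_{j,2})\ \, \underline{T}_{\eta_j}(x_{j,1},x_{j,0})
  \nonumber \\[4pt]
  & = & {\pi_{\eta_j}(x_{j,0})\over\pi_{\eta_j}(x_{j,1})}\ \,
        T_{\eta_j}(x_{j,0},x_{j,1})\ \, T_{\eta_j}(x_{j,1},x_{j,2})
  \nonumber
\eeq
The product of forward transitions does not depend on which state the chain
was started from.  The starting index enters only through the factor
$1\,/\,\pi_{\eta_j}(x_{j,\nu_j})$, which is why the ratio of
equation~(\ref{eq-chain2}) to equation~(\ref{eq-chain1}) reduces to
$\pi_{\eta_j}(x_{j,\nu_j})\,/\,\pi_{\eta_j}(x_{j,\mu_j})$.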
667:
668: From Steps (1) and (2a), we
669: see that $\Pi_0(\nu_j) = 1\,/\,(K_j\!+\!1)$ and $\Pi_1(\mu_j) =
670: 1\,/\,(K_j\!+\!1)$. Using this, and again using
671: $x_{j,\nu_j} = x_{j-1,\mu_{j-1}}$, we get that
672: \beq
673: \lefteqn {{
674: \prod\limits_{j=0}^{n-1} \Pi_1(\mu_j)\,
675: \prod\limits_{j=1}^{n} \Pi_1(\nu_j\,|\,x_j)
676: \over
677: \prod\limits_{j=1}^{n} \Pi_0(\nu_j)\,
678: \prod\limits_{j=0}^{n-1} \Pi_0(\mu_j\,|\,x_j)}
679: \ \ = \ \
680: { \prod\limits_{j=0}^{n-1} \Pi_1(\nu_{j+1}\,|\,x_{j+1})\,(K_{j+1}\!+\!1)
681: \over
682: \prod\limits_{j=0}^{n-1} \Pi_0(\mu_j\,|\,x_j)\,(K_j\!+\!1)
683: }}\ \ \ \ \ \ \ \ \\[5pt]
684: & = &
685: \prod_{j=0}^{n-1}\,\
686: {\displaystyle
687: \ {p_{j*j+1}(x_{j+1,\nu_{j+1}}) \over p_{\eta_{j+1}}(x_{j+1,\nu_{j+1}})}
688: \ \Big/\ {1 \over K_{j+1}\!+\!1}
689: \sum_{k=0}^{K_{j+1}} {p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k})}\
690: \over\displaystyle
691: {p_{j*j+1}(x_{j,\mu_j}) \over p_{\eta_j}(x_{j,\mu_j})}
692: \ \Big/\ {1 \over K_j\!+\!1}
693: \sum_{k=0}^{K_j} {p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k})}
694: } \\[5pt]
695: & = &
696: \prod_{j=0}^{n-1}\,
697: {p_{\eta_j}(x_{j,\mu_j}) \over p_{\eta_{j+1}}(x_{j,\mu_j})}\
698: \prod_{j=0}^{n-1}\,
699: \left[
700: {1 \over K_j\!+\!1}
701: \sum_{k=0}^{K_j} {p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k})}
702: \ \Big/\
703: {1 \over K_{j+1}\!+\!1}
704: \sum_{k=0}^{K_{j+1}} {p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k})}
705: \right]\ \ \ \
706: \label{eq-fact2}
707: \eeq
708:
709: From Steps (1) and (3), we see that
710: $\Pi_0(\nu_0) = \Pi_1(\nu_0) = 1\,/\,(K_0\!+\!1)$ and
711: $\Pi_1(\mu_n) = \Pi_0(\mu_n) = 1\,/\,(K_n\!+\!1)$, so these factors
712: cancel in equation~(\ref{eq-ratio01}). The factors in
713: equation~(\ref{eq-fact1}) cancel with the first part of
714: equation~(\ref{eq-fact2}). The final result is that the simple importance
715: sampling estimate based on a single LIS run is as shown in
716: equation~(\ref{eq-lis}), demonstrating that $\rhatLIS$ is indeed an unbiased
717: estimate of $r=Z_1/Z_0$.
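
Written out, the cancellation just described is as follows, with the uniform
factors for $\nu_0$ and $\mu_n$ already removed.  This only restates the
product of equations~(\ref{eq-fact1}) and~(\ref{eq-fact2}); the symbol $A_j$
is a label used here alone, standing for the bracketed ratio of averages in
the last line of equation~(\ref{eq-fact2}):
\beq
  {P_1(x,\mu,\nu) \over P_0(x,\mu,\nu)}
  & = &
  \prod_{j=0}^{n-1}\,
    {p_{\eta_{j+1}}(x_{j,\mu_j}) \over p_{\eta_j}(x_{j,\mu_j})}\ \cdot\
  \prod_{j=0}^{n-1}\,
    {p_{\eta_j}(x_{j,\mu_j}) \over p_{\eta_{j+1}}(x_{j,\mu_j})}\ \cdot\
  \prod_{j=0}^{n-1}\, A_j
  \ \ =\ \ \prod_{j=0}^{n-1}\, A_j \nonumber
\eeq
Unbiasedness is then the usual simple importance sampling identity, which
holds provided that $\Pi_0$ is non-zero wherever $\Pi_1$ is non-zero:
the expectation of $P_1/P_0$ with respect to $\Pi_0 = P_0/Z_0$ is the
total mass of $P_1$ divided by $Z_0$, which is $Z_1/Z_0$.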
718:
719: \subsection{\hspace*{-4pt}Bridged LIS estimates}\vspace*{-4pt}
720:
721: Since the LIS estimate can be viewed as a simple importance sampling
722: estimate on an extended space, we can consider a `bridged LIS'
723: estimate in which this top-level SIS estimate is replaced by a bridge
724: sampling estimate. This will require that we actually perform the reverse
725: LIS procedure described above, from which an LIS estimate for
726: the reverse ratio, $\underline{r} = Z_0/Z_1$, can be computed:
727: \beq
728: \rhatLISrev^{(i)} & = & \prod_{j=1}^{n} \left[
729: {1 \over K_j+1}\, \sum_{k=0}^{K_j}\,
730: { p_{j-1*j}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) }
731: \ \Big/\
732: {1 \over K_{j-1}+1}\, \sum_{k=0}^{K_{j-1}}\,
733: { p_{j-1*j}(x_{j-1,k}) \over p_{\eta_{j-1}}(x_{j-1,k}) }
734: \right]
735: \label{eq-lis-rev}
736: \eeq
737: The reversed procedure requires independent sampling from $\pi_1$.
738: This will usually not be possible directly, but well-separated states
739: from a Markov chain sampler with $\pi_1$ as its invariant distribution will
740: provide a good approximation, provided that this sampler moves around the
741: whole distribution, without being trapped in an isolated mode. Indeed,
742: the entire sample of $K_n\!+\!1$ states from $\pi_1$ that is needed
743: at the start of the reverse procedure can be obtained by taking consecutive
744: states from such a Markov chain sampler.
745:
746: For the bridged form of LIS, we also need a suitable bridge
747: distribution, $P_*$, for which we must be able to evaluate the ratios
748: $P_*/P_0$ and $P_*/P_1$. (Note that this choice of a
749: `top-level' bridge distribution is separate from the choices of
750: `low-level' bridge distributions, $p_{j*j+1}$, though we might use the same
751: form for both.) With the optimal bridge of
752: equation~(\ref{eq-opt-bridge}), these ratios can be written as follows,
753: if the forward procedure is performed $M$ times and the reverse procedure
754: $\underline{M}$ times:
755: \beq
756: {P\opt_*(x,\mu,\nu) \over P_0(x,\mu,\nu)} & = &
757: \left[\,r\,(M/\underline{M})\,
758: \left({P_1(x,\mu,\nu) \over P_0(x,\mu,\nu)}\right)^{-1}
759: \!\! +\ 1\,\right]^{-1}
760: \\[6pt]
761: {P\opt_*(x,\mu,\nu) \over P_1(x,\mu,\nu)} & = &
762: \left[\, r\,(M/\underline{M})\ +\
763: \left({P_0(x,\mu,\nu) \over P_1(x,\mu,\nu)}\right)^{-1}
764: \right]^{-1}
765: \eeq
766: The geometric bridge of equation~(\ref{eq-geo-bridge}) results in
767: \beq
768: {P\geo_*(x,\mu,\nu) \over P_0(x,\mu,\nu)} & = &
769: \sqrt{P_1(x,\mu,\nu) \over P_0(x,\mu,\nu)}
770: \\[6pt]
771: {P\geo_*(x,\mu,\nu) \over P_1(x,\mu,\nu)} & = &
772: \sqrt{P_0(x,\mu,\nu) \over P_1(x,\mu,\nu)}
773: \eeq
774: These expressions allow us to express bridged LIS estimates in terms
775: of the simple LIS estimate of equation~(\ref{eq-lis}), and its reverse
776: version of equation~(\ref{eq-lis-rev}). For the optimal bridge, we get
777: \beq
778: \rhatLISbridged\opt & = &
779: {1 \over M} \sum_{i=1}^M\,
780: {1 \over r\,(M/\underline{M})\,/\,\rhatLIS^{(i)}\ +\ 1}
781: \,\ \Big/\
782: {1 \over \underline{M}} \sum_{i=1}^{\underline{M}}\,
783: {1 \over r\,(M/\underline{M})\ +\ 1/\rhatLISrev^{(i)}}
784: \label{eq-bridged-lis1}
785: \eeq
786: Similarly, for the geometric bridge, we get
787: \beq
788: \rhatLISbridged\geo
789: & = & {1 \over M} \sum_{i=1}^M\, \sqrt{\rhatLIS^{(i)}} \,\ \Big/\
790: {1 \over \underline{M}} \sum_{i=1}^{\underline{M}}\,
791: \sqrt{\rhatLISrev^{(i)}}
792: \label{eq-bridged-lis2}
793: \eeq
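
Equation~(\ref{eq-bridged-lis1}) has the unknown $r$ on its right side.  The
following Python sketch shows one way the two bridged estimates might be
computed from lists of forward and reverse single-run estimates.  It assumes
(this is not spelled out in this section) that the optimal-bridge estimate is
found by fixed-point iteration, started from the geometric-bridge estimate.
The function names, the starting value, and the fixed number of iterations
are illustrative choices only.

\begin{verbatim}
import math

def bridged_geo(fwd, rev):
    # fwd: forward LIS estimates of r, one per forward run (all > 0)
    # rev: reverse LIS estimates of 1/r, one per reverse run (all > 0)
    num = sum(math.sqrt(f) for f in fwd) / len(fwd)
    den = sum(math.sqrt(b) for b in rev) / len(rev)
    return num / den

def bridged_opt(fwd, rev, iters=50):
    # Assumption: the unknown r on the right side is handled by
    # fixed-point iteration, started at the geometric-bridge estimate.
    c = len(fwd) / len(rev)            # M / M-underbar
    r = bridged_geo(fwd, rev)
    for _ in range(iters):
        num = sum(1.0 / (r * c / f + 1.0) for f in fwd) / len(fwd)
        den = sum(1.0 / (r * c + 1.0 / b) for b in rev) / len(rev)
        r = num / den
    return r
\end{verbatim}

If every forward estimate equals some value $r_0$ and every reverse estimate
equals $1/r_0$, both functions return $r_0$, whatever the current guess for
$r$, which is a quick sanity check on the two formulas.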
794:
795: \subsection{\hspace*{-4pt}LIS estimates with independent sampling with no
796: intermediate distributions}\vspace*{-4pt}
797:
It is interesting to look at the special case of Linked Importance
Sampling with $n=1$ (ie, with no intermediate
distributions between $\pi_0$ and $\pi_1$) in which the points from both
$\pi_0$ and $\pi_1$ are sampled independently.  The LIS procedure
802: can then be simplified somewhat, and it is also possible to improve
803: the LIS estimate by averaging over the choice of link state. Such
804: averaging is not feasible when Markov chain sampling is used, since
805: choosing a different link state would require a new simulation of the
806: Markov transitions.
807:
808: Since we will sample points independently, there is no need to decide
809: how many points will be sampled by the forward transitions and how
many by the reverse transitions in Steps (2b) and (2c) of the LIS
811: procedure. We simply obtain a pair of samples consisting of
812: points $x_{0,0},\ldots,x_{0,K_0}$ drawn independently from $\pi_0$,
813: and points $x_{1,1},\ldots,x_{1,K_1}$ drawn independently from
814: $\pi_1$. We then randomly select a link state, indexed by $\mu$, from
815: among $x_{0,0},\ldots,x_{0,K_0}$ according to the
816: following probabilities, which depend on the choice of a single
817: bridge distribution, denoted by $p_*(x)$:
818: \beq
819: \Pi_0 (\mu \,|\, x_0) & = &
820: { p_*(x_{0,\mu}) \over p_0(x_{0,\mu})}\ \Big/\
821: \sum\limits_{k=0}^{K_0} {p_*(x_{0,k}) \over p_0(x_{0,k}) }
822: \eeq
823: The LIS estimate for $r = Z_1/Z_0$ based on this pair of samples
824: from $\pi_0$ and $\pi_1$ is
825: \beq
826: \rhatLIS^{(i)} & = &
827: {1 \over K_0\!+\!1} \sum_{k=0}^{K_0} {p_*(x_{0,k}) \over p_0(x_{0,k})}
828: \ \Big/\,
829: {1 \over K_1\!+\!1} \left[ {p_*(x_{0,\mu}) \over p_1(x_{0,\mu})}
830: \, +\, \sum_{k=1}^{K_1} {p_*(x_{1,k}) \over p_1(x_{1,k})}
831: \right]
832: \label{eq-lis-indep}
833: \eeq
834: The superscript $i$ is used here
835: to indicate that this estimate is based on the $i$'th
836: pair of samples. We can see that it is very similar to the bridge sampling
837: estimate of equation~(\ref{eq-bridge}), except that the link state is included
838: in both samples. Since these LIS estimates are unbiased, we can
839: average $M$ of them to obtain a final LIS estimate.
840:
841: We can also average the estimate of equation~(\ref{eq-lis-indep})
842: over the random choice of link state, which
843: is guaranteed to produce an estimate (also unbiased) with smaller
844: mean-squared-error (see Schervish 1995, Section 3.2). The result is
845: \beq
846: \rhatLISave^{(i)} & = &
847: \sum_{\mu=0}^{K_0} \Pi_0(\mu\,|\,x_0) \
848: {1 \over K_0\!+\!1} \sum_{k=0}^{K_0} {p_*(x_{0,k}) \over p_0(x_{0,k})}
849: \ \Big/\,
850: {1 \over K_1\!+\!1} \left[ {p_*(x_{0,\mu}) \over p_1(x_{0,\mu})}
851: \, +\, \sum_{k=1}^{K_1} {p_*(x_{1,k}) \over p_1(x_{1,k})}
852: \right] \\[5pt]
853: & = &
854: {K_1\!+\!1 \over K_0\!+\!1}\ \sum_{\mu=0}^{K_0} \
855: {p_*(x_{0,\mu}) \over p_0(x_{0,\mu})}
856: \ \Big/\,
857: \left[ {p_*(x_{0,\mu}) \over p_1(x_{0,\mu})}
858: \, +\, \sum_{k=1}^{K_1} {p_*(x_{1,k}) \over p_1(x_{1,k})}
859: \right]
860: \label{eq-lis-ave}
861: \eeq
862: Averaging these estimates over $M$ pairs of samples produces a final estimate
863: denoted by $\rhatLISave$.
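
The relation between equations~(\ref{eq-lis-indep}) and~(\ref{eq-lis-ave})
can be sketched in Python as follows.  The inputs are assumed to be
precomputed density ratios, and all of the function and argument names are
hypothetical, introduced only for this illustration: \texttt{w0[k]} holds
$p_*(x_{0,k})/p_0(x_{0,k})$ and \texttt{v0[k]} holds
$p_*(x_{0,k})/p_1(x_{0,k})$ for $k=0,\ldots,K_0$, while \texttt{v1} holds
$p_*(x_{1,k})/p_1(x_{1,k})$ for the $K_1$ points drawn from $\pi_1$.

\begin{verbatim}
def link_probs(w0):
    # Probabilities of choosing each point from pi_0 as the link state
    total = sum(w0)
    return [w / total for w in w0]

def lis_single(w0, v0, v1, mu):
    # LIS estimate for n=1 when the link state is the point with index mu
    num = sum(w0) / len(w0)                    # average over K_0+1 points
    den = (v0[mu] + sum(v1)) / (len(v1) + 1)   # link state plus K_1 points
    return num / den

def lis_ave(w0, v0, v1):
    # Estimate averaged over the choice of link state (simplified form)
    scale = (len(v1) + 1) / len(w0)            # (K_1+1) / (K_0+1)
    tail = sum(v1)
    return scale * sum(w / (v + tail) for w, v in zip(w0, v0))
\end{verbatim}

Weighting \texttt{lis\_single} by \texttt{link\_probs} and summing over the
link index gives the same value as \texttt{lis\_ave}, which is the
simplification made in the second line of equation~(\ref{eq-lis-ave}).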
864:
865: To use bridged LIS in this context, we need to find reverse estimates
866: as well, but these reverse estimates needn't be independent of the
867: forward estimates, since the asymptotic validity of the bridge
868: sampling estimate of equation~(\ref{eq-bridge}) does not depend on the
869: samples $x_0$ and $x_1$ being independent. Accordingly, we can use
870: the same samples from $\pi_0$ and $\pi_1$ for the forward and the
871: reverse operations. However, to perform reverse sampling, we need to
872: have a sample of $K_1\!+\!1$ points drawn from $\pi_1$, the first of
873: which is ignored when performing forward sampling. Conversely, the
874: first of the $K_0\!+\!1$ points drawn from $\pi_0$ is ignored when
875: performing the reverse sampling.
876:
877: We can improve the bridged LIS estimates by averaging the numerator
878: and the denominator of equation~(\ref{eq-bridged-lis1})
879: or~(\ref{eq-bridged-lis2}) with respect to the random choice of link
880: state. We can also average with respect to the omission of one of the
881: points from one of the samples --- ie, rather than omitting the first
882: of $K_1 + 1$ points in the sample from $\pi_1$ when computing a
883: forward estimate, we average with respect to a random choice of point
884: to omit, and similarly for reverse estimates. Note that the averaging
885: should be done over the sums in the numerator and denominator, not
886: with respect to the entire estimate, nor with respect to the values of
887: $\rhatLIS^{(i)}$ and $\rhatLISrev^{(i)}$ appearing inside the
888: summands. The effective sample size after this additional averaging
889: of dependent points is unclear, so it is not obvious what the ratio
890: of sample sizes in equation~(\ref{eq-opt-bridge}) should be, but
891: using $(K_0\!+\!1)/(K_1\!+\!1)$ is probably adequate.
892:
893:
894: \section{\hspace*{-7pt}Analytical comparisons of AIS and
895: LIS}\label{sec-anal}\vspace*{-10pt}
896:
897: In this section, I analyse (somewhat informally) the performance of
898: AIS and LIS asymptotically, and in other situations where analytical
899: results are possible.
900:
901:
902: \subsection{\hspace*{-4pt}Asymptotic properties of
903: AIS and LIS estimates}\label{sec-asym}\vspace*{-4pt}
904:
905: I begin by analysing the asymptotic performance of AIS and LIS when
906: the sequence of distributions is defined by an unnormalized density function
907: of the following form:
908: \beq
909: p_{\eta}(x) & = & p_0(x)\, \exp (-\eta U(x))
910: \label{eq-U-dist}
911: \eeq
912: This class includes sequences of canonical distributions defined by
913: equation~(\ref{eq-canonical}) in which the inverse temperature
914: varies, as well as
915: sequences that can be used for Bayesian analysis, in which $p_0$ defines the
916: prior and $\eta$ is a power that the likelihood (expressed as $\exp(-U(x))$) is
917: raised to, with $\eta=1$ giving the posterior distribution.
For these distributions, we can express $\log(r)$ using the well-known
`thermodynamic integration' formula as follows:
\beq
   \log(r)\ \ =\ \ \log(Z_1/Z_0)\ \ =\ \ - \int_0^1 E_{\pi_{\eta}}(U)\,d\eta
\label{eq-therm-int}
\eeq
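
For completeness, the identity behind equation~(\ref{eq-therm-int}) can be
sketched as follows, assuming that differentiation under the integral sign
is justified.  With $Z_{\eta} = \int p_0(x)\,\exp(-\eta U(x))\,dx$,
\beq
  {d \over d\eta}\, \log Z_{\eta}
  & = & {1 \over Z_{\eta}} \int -\,U(x)\ p_0(x)\, \exp(-\eta U(x))\, dx
  \ \ =\ \ -\, E_{\pi_{\eta}}(U) \nonumber
\eeq
and integrating both sides over $\eta$ from $0$ to $1$ gives
$\log(Z_1/Z_0) \,=\, -\int_0^1 E_{\pi_{\eta}}(U)\,d\eta$.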
924:
925: The analysis here is asymptotic, as the number of intermediate
926: distributions used, given by $n\!-\!1$, goes to infinity. I will
927: assume the $\eta_j$ defining these distributions are chosen according to a
928: scheme in which for any
929: $a \in (0,1)$, the spacing $\eta_{j+1}-\eta_j$ when $j = \lfloor a\,n \rfloor$
930: is asymptotically proportional to $1/n$ --- in other words,
931: the relative density of intermediate distributions in the neighborhood
932: of different values of $\eta$ stays the same as the overall density increases.
933: The simplest such scheme is to let $\eta_j = j/n$, though other schemes
934: may sometimes be better.
935:
936: With the above form for $p_{\eta}$, the AIS estimate from a single run
937: (from equation~(\ref{eq-ais-est})) can be written as follows:
938: \beq
939: \log\ \rhatAIS^{(i)}
940: & = &
941: \sum_{j=0}^{n-1}\, \log \Big(p_{\eta_{j+1}}(x^{(i)}_j)
942: \,\Big/\,p_{\eta_j}(x^{(i)}_j)\Big)
943: \ \ =\ \
944: \sum_{j=0}^{n-1}\, - (\eta_{j+1}-\eta_j)\, U \Big(x^{(i)}_j\Big)
945: \label{eq-ais-reim}
946: \eeq
947: When $\eta_j=j/n$, this can be seen as a stochastic form of Riemann's Rule
948: for numerically integrating equation~(\ref{eq-therm-int}), though one
949: difference is that $\log\ \rhatAIS$ converges to the correct value as $M$ goes
950: to infinity even if $n$ stays fixed.
951:
952: Provided that there is some finite bound on the variance of $U$ under all
953: the distributions $\pi_{\eta}$, and that the Markov transitions used mix well,
954: a Central Limit Theorem will apply, allowing us to conclude that the
955: distribution of $\ell_n = \log\ \rhatAIS^{(i)}$ becomes
956: Gaussian as $n$ goes to infinity. Let the mean of $\ell_n$ be $\mu_n$,
957: and let the variance of $\ell_n$ asymptotically be $\sigma^2/n$, where $\sigma$
958: is determined by details of the spacing of intermediate distributions and
959: of the degree of autocorrelation in the Markov transitions.
960: Note that $E[Y^q]=\exp(q\mu+q^2\varsigma^2/2)$ when $Y=\exp(X)$
961: and $X$ is Gaussian with mean $\mu$ and variance $\varsigma^2$.
962: Using this, the mean of $\exp(\ell_n)$ is $\exp(\mu_n+\sigma^2/2n)$. This
963: must equal $r$, since $\rhatAIS$ is unbiased, so $\mu_n = \log(r)-\sigma^2/2n$.
Applying the same formula with $q=2$ gives
$E[\exp(2\ell_n)] = \exp(2\mu_n+2\sigma^2/n) = r^2\exp(\sigma^2/n)$,
so the variance of $\rhatAIS^{(i)}=\exp(\ell_n)$ is 
$r^2\,[\exp(\sigma^2/n) - 1]$, which for large $n$ will be approximately 
$r^2\sigma^2/n$.  The variance of $\rhatAIS$ will therefore be $r^2\sigma^2/nM$.
967: Asymptotically, the total computational effort, which will generally be
968: proportional to $nM$, can be divided in any way between more intermediate
969: distributions ($n$) or more runs ($M$) without affecting the accuracy
970: of estimation of $r$, provided that $n$ is kept large enough that
971: these asymptotic results apply --- a fact noted by Hendrix and Jarzynski (2001).
972: We can therefore use a value of $M$ greater than one without penalty,
973: in order to obtain an error estimate from the degree of variation
974: over the $M$ runs.
975:
976: For LIS, we can write the log of the estimate from one run
977: (equation~(\ref{eq-lis})) as follows:
978: \beq
979: \log\ \rhatLIS^{(i)} & = & \sum_{j=0}^{n-1} \left[
980: \log \left({1 \over K_j+1}\, \sum_{k=0}^{K_j}\,
981: { p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) } \right)
982: \ -\
983: \log \left({1 \over K_{j+1}+1}\, \sum_{k=0}^{K_{j+1}}\,
984: { p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }\right)
985: \right]\ \ \ \ \ \
986: \label{eq-logrhatLIS}
987: \eeq
988: Suppose that we let $K_j = \lceil m K_j^0 \rceil$ for all $j$ and some set of
989: $K^0_j$, and that we then let $m$ go to infinity. Assuming that the variances
990: of the ratios of probabilities are finite, and that the Markov chain transitions
991: used mix sufficiently well, a Central Limit
992: Theorem will again apply, and we can conclude that all of the $n$ terms in
993: the sum above, and therefore also the sum itself, will approach Gaussian
994: distributions, with variances proportional to $1/m$.
995:
996: To analyse the LIS estimate in more detail, we need to assume a form of
997: bridge distribution, as well as a form for $p_{\eta}$. If $p_{\eta}$
998: has the form of equation~(\ref{eq-U-dist}) and we use the geometric bridge
999: of equation~(\ref{eq-geo-bridge}), we can write
1000: \beq
1001: \log\ \rhatLIS^{(i)} & = & \sum_{j=0}^{n-1}\, \left[\
1002: \log \left( {1 \over K_j+1}\, \sum\limits_{k=0}^{K_j}\,
1003: \exp(-(\eta_{j+1}\!-\!\eta_j)\, U(x_{j,k})\, /\, 2) \right)
1004: \ -\ \right. \nonumber \\[4pt]
1005: & & \ \ \ \ \ \ \ \ \left.
                   \log \left( {1 \over K_{j+1}+1}\,\sum\limits_{k=0}^{K_{j+1}}\,
1007: \exp(-(\eta_j\!-\!\eta_{j+1})\, U(x_{j+1,k})\, /\, 2) \right)
1008: \ \right]
1009: \eeq
1010: Since $\exp(z)\approx1+z$ and $\log(1+z)\approx z$ when $z$ is small, we can
1011: rewrite this when $n$ is large (and hence $\eta_{j+1}\!-\!\eta_j$ is small) as
1012: \beq
1013: \log\ \rhatLIS^{(i)} & \approx & \sum_{j=0}^{n-1}\, \left[\
1014: \log \left( 1 \ -\ { \eta_{j+1}\!-\!\eta_j \over 2}\,
1015: {1 \over K_j+1}\, \sum\limits_{k=0}^{K_j} U(x_{j,k}) \right)
1016: \ -\ \right. \nonumber \\[4pt]
1017: & & \ \ \ \ \ \ \ \ \left.
1018: \log \left( 1 \ +\ {\eta_{j+1}\!-\!\eta_j \over 2}\,
1019: {1 \over K_{j+1}+1}\, \sum\limits_{k=0}^{K_{j+1}} U(x_{j+1,k}) \right)
1020: \ \right] \\[6pt]
1021: & \approx & \sum_{j=0}^{n-1}\,
1022: - {\eta_{j+1}\!-\!\eta_j \over 2}\,
1023: \left[ {1 \over K_j+1}\, \sum\limits_{k=0}^{K_j} U(x_{j,k})
1024: \ +\ {1 \over K_{j+1}+1}\,
1025: \sum\limits_{k=0}^{K_{j+1}} U(x_{j+1,k}) \right] \\[6pt]
1026: & = &
1027: -\ {\eta_1\!-\!\eta_0 \over 2}\,
1028: {1 \over K_0+1}\, \sum\limits_{k=0}^{K_0} U(x_{0,k})
1029: \ -\ {\eta_n\!-\!\eta_{n-1} \over 2}\,
1030: {1 \over K_n+1}\, \sum\limits_{k=0}^{K_n} U(x_{n,k})
1031: \nonumber \\[4pt]
1032: & & -\ \sum_{j=1}^{n-1}\,
1033: {\eta_{j+1}\!-\!\eta_{j-1} \over 2}\,
1034: {1 \over K_j+1}\, \sum\limits_{k=0}^{K_j} U(x_{j,k})
1035: \eeq
1036:
1037: When $\eta_j=j/n$, this looks like a stochastic form of the
1038: Trapezoidal Rule for numerically integrating
1039: equation~(\ref{eq-therm-int}). Since the Trapezoidal Rule converges
faster than Riemann's Rule, one might expect LIS to perform better
1041: than AIS asymptotically, but this is not so in this stochastic
1042: situation. Suppose for simplicity that we set all $K_j=m$. The
1043: variance of $\log\ \rhatLIS^{(i)}$ will be dominated by the variance
1044: of the last sum above, which will be proportional to $1/nm$, assuming
1045: that $m$ is large, so that the dependence between terms (from sharing
1046: link states) is negligible. Using the same argument as for AIS above,
the variance of $\rhatLIS$ will be proportional to $1/nmM$.
1048: Considering that the computation time for an LIS run will be
1049: proportional to $nm$, versus $n$ for AIS, we see that the variances of
1050: the AIS and LIS estimates go down the same way in proportion to
1051: computation time, asymptotically as $n$ and $m$ go to infinity.
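
To make the comparison of the two quadrature rules concrete, write
$\bar U_j = {1 \over K_j+1} \sum_{k=0}^{K_j} U(x_{j,k})$ for the average of
$U$ over the points sampled for distribution $j$ (a notation introduced here
only for this illustration).  With $\eta_j = j/n$, the interior weights
$(\eta_{j+1}\!-\!\eta_{j-1})/2$ equal $1/n$ and the two end weights equal
$1/2n$, so the approximation above becomes
\beq
  \log\ \rhatLIS^{(i)} & \approx &
  -\, {1 \over n}\, \left[\, {1 \over 2}\,\bar U_0 \ +\ \bar U_1 \ +\ \cdots
    \ +\ \bar U_{n-1} \ +\ {1 \over 2}\,\bar U_n \,\right] \nonumber
\eeq
whereas equation~(\ref{eq-ais-reim}) gives
\beq
  \log\ \rhatAIS^{(i)} & = &
  -\, {1 \over n}\, \left[\, U\Big(x^{(i)}_0\Big) \ +\ U\Big(x^{(i)}_1\Big)
    \ +\ \cdots \ +\ U\Big(x^{(i)}_{n-1}\Big) \,\right] \nonumber
\eeq
The first uses trapezoidal weights and the second uses left-endpoint weights,
but in both cases the error is dominated by the random variation in the
values of $U$, not by the deterministic quadrature error.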
1052:
1053: Furthermore, the proportionality constant should be the same for
1054: AIS and LIS, assuming that the overhead of the two procedures is
1055: negligible compared to the time spent performing Markov transitions,
1056: so that the proportionality constants for computation time are the
1057: same for AIS (multiplying $n$) and for LIS (multiplying $nm$). The
1058: proportionality constants for variance for AIS (multiplying $1/nM$)
1059: and for LIS (multiplying $1/nmM$) depend in a complex way on the form of
1060: the density of $\eta_j$ values and on the mixing properties of the
1061: Markov transitions, but the result should be the same for AIS and
1062: LIS, provided the same scheme is used for choosing $\eta_j$ values,
1063: and the same Markov transitions are used, parameterized smoothly in
1064: terms of $\eta$. A difference that might appear significant is that
1065: for AIS only one Markov transition is done for each $\eta_j$, whereas
1066: for LIS, $m$ such transitions are done. However, as $n$ goes to
1067: infinity, nearby distributions become more similar, so transitions for
1068: $m$ consecutive distributions become similar to $m$ transitions for
1069: one of these distributions.
1070:
1071: The apparently pessimistic conclusion from this is that when both $n$
1072: and $m$ (and hence the $K_j$) are large, the performance of LIS should
1073: be about the same as that of AIS (with $n$ for AIS chosen to equalize
1074: the computation time), assuming that the distributions used have the
1075: form of equation~(\ref{eq-U-dist}), that the variance of $U$ is finite
1076: under all of the distributions $\pi_{\eta}$, and that the Markov
1077: transitions used mix well enough. Fortunately, however, there is no
1078: reason to make both $m$ and $n$ large with LIS. For good performance,
1079: $n$ must be large enough that $\pi_{\eta_j}$ and $\pi_{\eta_{j+1}}$
1080: overlap significantly, but there is no reason to make $n$ much larger
1081: than this. The accuracy of the estimates can be improved as desired
1082: by increasing $m$ and/or $M$ while keeping $n$ fixed. The results
1083: below show that LIS estimates with $n$ fixed are sometimes much better
1084: than AIS estimates.
1085:
1086: Finally, let us consider the asymptotic performance of the bridged
1087: versions of AIS and LIS, assuming that the variance of $U$ is finite,
1088: so that the distribution of the estimates from individual runs becomes
1089: Gaussian as $n$ (for AIS) or $m$ (for LIS) goes to infinity. Looking
1090: at equations~(\ref{eq-bridged-lis1}) and~(\ref{eq-bridged-lis2}),
1091: which also are applicable to bridged AIS estimates, we see that the
1092: log of $\rhatLISbridged^{(i)}$ can for both optimal and geometric
1093: bridges be expressed as the difference of the log of the numerator,
1094: which is the mean of a function of the forward estimates,
1095: $\rhatLIS^{(i)}$, and the log of the denominator, which is the mean of
1096: a function of the reverse estimates, $\rhatLISrev^{(i)}$. If these
1097: forward and reverse estimates have Gaussian distributions with small
1098: variances, $\sigma^2$ and $\underline{\sigma}^2$, then
1099: $\rhatLISbridged^{(i)}$ will also be Gaussian, with a variance that
1100: can be computed in terms of the derivatives of the summands in the
1101: numerator and the denominator, with respect to $\rhatLIS^{(i)}$ and
1102: $\rhatLISrev^{(i)}$, evaluated at the true values of $r$ and $1/r$.
1103: I will assume that $r=1$ below, as can be done without loss of generality.
1104:
1105: For the geometric bridge, these derivatives are both $1/2$, from which
1106: it follows that the variance of the numerator in
1107: equation~(\ref{eq-bridged-lis2}) is $\sigma^2/4M$ and that of the
1108: denominator is $\underline{\sigma}^2/4\underline{M}$. Since the
1109: numerator and denominator evaluate to one for $\rhatLIS^{(i)}=r=1$ and
1110: $\rhatLISrev^{(i)}=1/r=1$, the sum of the variances of the logs of the
1111: numerator and denominator is $\sigma^2/4M +
1112: \underline{\sigma}^2/4\underline{M}$. If
1113: $\sigma^2=\underline{\sigma}^2$ and $M=\underline{M}$, this reduces to
1114: $\sigma^2/2M$. The variance of an unbridged LIS estimate will be
1115: $\sigma^2/M$. However, the bridged estimate requires time
1116: proportional to $M+\underline{M}$, compared to just $M$ for the
1117: unbridged estimate. The value of $M$ for the unbridged method can
1118: therefore be twice as large as for the bridged method, with the result
1119: that bridged and unbridged estimates perform equally well
1120: asymptotically (assuming the variance of $U$ is finite).
1121:
1122: For the optimal bridge, the derivatives of the summands in the
1123: numerator and denominator are both $1/4$, when evaluated at
$\rhatLIS^{(i)}=r=1$ and $\rhatLISrev^{(i)}=1/r=1$, and assuming that
1125: $M=\underline{M}$. The numerator and denominator both evaluate to
1126: $1/2$, with the result that asymptotically the variance of the bridged
1127: estimate, assuming $\sigma^2=\underline{\sigma}^2$, is $\sigma^2/2M$,
1128: the same as for the geometric bridge.
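
These figures for the optimal bridge can be checked directly.  With $r=1$ and
$M=\underline{M}$, each summand in the numerator of
equation~(\ref{eq-bridged-lis1}) has the form $f(\rho) = 1\,/\,(1/\rho+1)
= \rho\,/\,(1+\rho)$, with $\rho$ a forward estimate, and each summand in the
denominator has the same form with $\rho$ a reverse estimate (the symbols
$f$ and $\rho$ are labels used only here).  Then
\beq
  f(1)\ =\ {1 \over 2}, \ \ \ \ \ \
  f'(\rho)\ =\ {1 \over (1+\rho)^2}, \ \ \ \ \ \
  f'(1)\ =\ {1 \over 4} \nonumber
\eeq
By the usual first-order approximation, the log of the numerator has variance
of about $(f'(1)/f(1))^2\,\sigma^2/M = \sigma^2/4M$, and the log of the
denominator has variance of about $\underline{\sigma}^2/4\underline{M}$.
Their sum is $\sigma^2/2M$ when $\sigma^2=\underline{\sigma}^2$ and
$M=\underline{M}$.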
1129:
1130: In conclusion, bridged AIS and LIS estimates asymptotically have the
1131: same performance as the corresponding unbridged estimates (with twice
1132: the value of $M$), for both the optimal and geometric bridges,
1133: assuming $U$ has finite variance. This conclusion applies more
1134: generally, as long as a Central Limit Theorem holds for the individual
1135: estimates, $\rhatLIS^{(i)}$ and $\rhatLISrev^{(i)}$. However, the
1136: bridged methods may be much better when the variance of $U$ is
1137: infinite, or for classes of distributions other than that of
1138: equation~(\ref{eq-U-dist}). The bridged methods may also provide
1139: improvement when the values of $n$ or $m$ are not large enough for the
1140: asymptotic results to apply.
1141:
1142:
1143: \subsection{\hspace*{-4pt}Properties of AIS and LIS when sampling from
1144: uniform distributions}\label{sec-unif}\vspace*{-4pt}
1145:
1146: In this section, I will demonstrate that when $n$ is kept suitably
1147: small, LIS can perform much better than AIS when these methods are
1148: applied to sequences of uniform distributions.
1149:
1150: As a first example, consider the class of nested uniform
1151: distributions with unnormalized densities given by\vspace*{-6pt}
1152: \beq
1153: p_{\eta}(x) & = & \left\{ \begin{array}{ll}
1154: 1 & \mbox{if $-s^{\eta} < x < s^{\eta}$} \\ 0 & \mbox{otherwise}
1155: \end{array}\right.
1156: \eeq
1157: for which the normalizing constants are $Z_{\eta} = 2s^{\eta}$, so that
1158: $r = Z_1/Z_0 = s$. The results concerning this class of distributions
1159: can easily be extended to any class of uniform distributions, in any
1160: number of dimensions, that have nested regions of support.
1161: For both AIS and LIS, I will assume that the intermediate
1162: distributions are defined by $\eta_j = j/n$. With this choice, the
probability that a point, $x$, randomly sampled from $\pi_{\eta_j}$ will have
$p_{\eta_{j+1}}(x)=1$ is $s^{1/n}$, for any $j$.
1165:
1166: During an AIS run, only a single point is sampled from each
1167: distribution. An AIS run will produce an estimate for $r$ of zero if
1168: any of the ratios ${p_{\eta_{j+1}}(x^{(i)}_j)\,/\,
1169: p_{\eta_j}(x^{(i)}_j)}$ in equation~(\ref{eq-ais-est}) are zero, which
1170: happens with probability $1 - (s^{1/n})^n\, =\, 1-s$, and will
1171: otherwise produce an estimate of one. Note that the distribution of
1172: estimates is independent of $n$. AIS is therefore not a useful
1173: technique for nested uniform distributions --- simple importance
1174: sampling (ie, AIS with $n\!=\!1$) would work just as well (or just as
1175: poorly, if $s$ is very small). Bridged AIS produces no improvement in
1176: this context.
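
In other words, each AIS run yields a Bernoulli variable.  Taking the
probabilities stated above as given, its mean and variance are
\beq
  E\Big[\rhatAIS^{(i)}\Big] & = & 1 \cdot s \ +\ 0 \cdot (1\!-\!s)
  \ \ =\ \ s \ \ =\ \ r \nonumber \\[4pt]
  \Var \Big(\rhatAIS^{(i)}\Big) & = & s\,(1\!-\!s) \nonumber
\eeq
so the variance of the average of $M$ runs, relative to $r^2$, is
$(1\!-\!s)\,/\,(sM)$, whatever the value of $n$.  As an illustration with
numbers chosen here only as an example, $s=0.01$ and $M=9900$ give
$0.99\,/\,99 = 0.01$, that is, a relative standard error of 10\%.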
1177:
1178: Suppose instead we use LIS with all $K_j=m$, and suppose that the
1179: Markov transitions, $T_j$, produce points that are almost independent
1180: of the previous point. For this problem, both the geometric and
1181: optimal forms of the bridge distribution result in $p_{j*j+1}(x) =
1182: p_{\eta_{j+1}}(x)$. If $m+1$ points are sampled independently from
1183: $\pi_{\eta_j}$, the fraction of these points for which
1184: $p_{\eta_{j+1}}(x)$ is one will have variance
1185: $s^{1/n}\,(1\!-\!s^{1/n})\,/\,(m\!+\!1)$. For sufficiently large
1186: $m$, the variance of the log of this fraction will be
1187: approximately $(s^{1/n}\,(1\!-\!s^{1/n})\,/\,(m\!+\!1))\,/\,s^{2/n}$,
1188: which simplifies to $(s^{-1/n}\!-\!1)\,/\,(m\!+\!1)$. For this
1189: approximation to be useful, the probability that none of the $m+1$
1190: points sampled from $\pi_{\eta_j}$ lie in the region where
1191: $p_{\eta_{j+1}}$ is one, equal to $(1-s^{1/n})^{m+1}$, must be negligible.
1192: This probability must be fairly small anyway, if LIS is to perform well.
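
The step from the variance of the fraction to the variance of its log is the
first-order approximation $\Var(\log F) \approx \Var(F)\,/\,E(F)^2$, where
$F$ (a label used only here) is the fraction and $E(F) = s^{1/n}$.  As an
illustration with values chosen only as an example, take $s=0.01$, $n=4$,
and $m\!+\!1=100$.  Then $s^{1/n} \approx 0.316$ and
\beq
  {s^{-1/n}\!-\!1 \over m\!+\!1} & \approx & {3.162 - 1 \over 100}
  \ \ \approx\ \ 0.0216 \nonumber
\eeq
while the probability that none of the 100 points falls in the region where
$p_{\eta_{j+1}}$ is one is about $0.684^{100} \approx 3 \times 10^{-17}$,
which is indeed negligible.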
1193:
1194: Suppose that the computational cost of an LIS run is proportional to
1195: the sum of the number of points sampled from $\pi_0$ and the number of
1196: Markov transitions performed. If we fix this cost, the number of
1197: intermediate distributions, $n$, and the number of transitions for
1198: each distribution, $m$, will be related by $m(n\!+\!1)\,=\,C$, for
1199: some constant $C$. Assume for the moment that both $n$ and $m$ are
1200: large. The probability of a run producing a zero estimate will then
1201: be negligible, and we can assess the accuracy of the estimate for one
1202: run by the variance of $\log \rhatLIS^{(i)}$ (modified in some way
1203: to eliminate the infinity resulting from the negligible, but non-zero,
1204: probability that $\rhatLIS^{(i)}$ is zero). Looking at
1205: equation~(\ref{eq-logrhatLIS}), we see that for these nested uniform
1206: distributions, the second log term vanishes ---
1207: $p_{j*j+1}(x_{j+1,k})\,/\,p_{\eta_{j+1}}(x_{j+1,k})$ is always one,
1208: since $p_{j*j+1}$ is the same as $p_{\eta_{j+1}}$. When $m$ is large,
1209: the dependence between terms with different values of $j$ will be
1210: negligible, so we can add the variances of the terms to get the variance
1211: of the estimate, obtaining the result that
1212: \beq
1213: \Var \Big(\log\ \rhatLIS^{(i)}\Big)
1214: & \approx & n\,(s^{-1/n}\!-\!1)\,/\,(m\!+\!1)
1215: \label{eq-varLIS-nest}
1216: \eeq
1217: When $n$ is large, $s^{-1/n}=\, \exp(\log(1/s)/n)$ is approximately
1218: $1+\log(1/s)/n$, and hence the variance above is
1219: approximately $\log(1/s)\,/\,(m\!+\!1)$.
1220: So it seems that the larger the value of $m$, the better ---
1221: until we reach a value of $m$ for which the corresponding value of $n$,
1222: equal to $C/m\,-\,1$, is small enough that this result no longer applies.
1223:
1224: Best performance will therefore come using a fairly small value of
1225: $n$, but a large value of $m$. Substituting $m=C/(n\!+\!1)$ into
1226: equation~(\ref{eq-varLIS-nest}), and assuming $m/(m\!+\!1)\approx 1$, we get
1227: \beq
1228: \Var \Big(\log\ \rhatLIS^{(i)}\Big)
1229: & \approx & n\,(s^{-1/n}\!-\!1)\,/\,(C/(n\!+\!1))
1230: \ \ =\ \ n(n\!+\!1)\,(s^{-1/n}\!-\!1)\,/\,C
1231: \eeq
1232: The value of $n$ that minimizes this depends only on $s$, not on $C$.
1233: The optimal choice of $n$ increases slowly as $s$ gets smaller:\ \
1234: $s=0.1$ gives $n=2$, $s=0.05$ gives $n=3$, $s=0.01$ gives $n=4$, and
1235: $s=0.0001$ gives $n=7$.
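
As a check on one of these values, the factor multiplying $1/C$ in the
expression above can be evaluated directly for $s=0.05$ (so that
$s^{-1}=20$):
\beq
  n(n\!+\!1)\,(s^{-1/n}\!-\!1) & \approx &
  \left\{\begin{array}{ll}
    6\,(20^{1/2}-1)\ \approx\ 20.83 & \ \ \mbox{if $n=2$} \\[2pt]
    12\,(20^{1/3}-1)\ \approx\ 20.57 & \ \ \mbox{if $n=3$} \\[2pt]
    20\,(20^{1/4}-1)\ \approx\ 22.29 & \ \ \mbox{if $n=4$}
  \end{array}\right. \nonumber
\eeq
The minimum is at $n=3$, with $n=4$ worse by a factor of about
$22.29/20.57 \approx 1.084$. The minimum is quite flat, so under this
approximation nearby values of $n$ lose little.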
1236:
1237: As a second example, consider the class of non-nested uniform distributions
1238: with unnormalized densities given by\vspace*{-6pt}
1239: \beq
1240: p_{\eta}(x) & = & \left\{ \begin{array}{ll}
1241: 1 & \mbox{if $\eta t-1 < x < \eta t+1$} \\ 0 & \mbox{otherwise}
1242: \end{array}\right.
1243: \eeq
1244: For this class, $Z_{\eta} = 2$ for all $\eta$, so $r = Z_1/Z_0 = 1$.
1245: I will again assume that the intermediate
1246: distributions are defined by $\eta_j = j/n$, and that all $K_j=m$. Assuming
1247: that $n$ is greater than
1248: $t/2$, the probability that a point, $x$, randomly sampled from $\pi_{\eta_j}$
1249: will have $p_{\eta_{j+1}}(x)=1$ is $1-t/2n$, for any $j$.
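
This probability follows from the lengths of the intervals involved.
The support of $\pi_{\eta_j}$ is the interval $(jt/n-1,\ jt/n+1)$, of
length $2$, and its intersection with the support of $\pi_{\eta_{j+1}}$
is $((j\!+\!1)t/n-1,\ jt/n+1)$, which is non-empty when $t/n < 2$. Hence
\beq
  P\big(p_{\eta_{j+1}}(X)=1\big) & = & {2 - t/n \over 2}
  \ \ =\ \ 1 - t/2n \nonumber
\eeq
when $X$ is uniformly distributed over the support of $\pi_{\eta_j}$.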
1250:
1251: For this example, AIS estimates do not converge to the true value of
1252: $r$ as $M$ increases, regardless of the value of $n$. To see this,
1253: note that the ratios in equation~(\ref{eq-ais-est}) will all be either
1254: zero or one, and that the estimate from one run, $\rhatAIS^{(i)}$,
1255: will be one if all of these ratios are one, and zero otherwise. The
1256: probability of a particular ratio being one is $1-t/2n$, so the
1257: probability that all are one (assuming the $T_{\eta}$ produce points
1258: independent of the current point) is $(1-t/2n)^n$, which approaches
$\exp(-t/2)$ as $n$ goes to infinity. The AIS estimate, averaging
over $M$ runs, will have mean $(1-t/2n)^n$, which is less than
$\exp(-t/2)$ for every $n$, rather than the correct value of one.
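
To give a sense of the size of this bias, take $t=4$ as an illustrative
value, and assume as above that the transitions produce independent
points. The probability that a single AIS run gives a non-zero estimate
is then
\beq
  (1-t/2n)^n & = & \left\{\begin{array}{ll}
    (1/2)^4\ =\ 0.0625 & \ \ \mbox{if $n=4$} \\[2pt]
    (0.992)^{250}\ \approx\ 0.134 & \ \ \mbox{if $n=250$}
  \end{array}\right. \nonumber
\eeq
compared with the limiting value of $\exp(-2) \approx 0.135$. Increasing
$n$ therefore cannot bring the mean of the estimate close to one.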
1262:
1263: In contrast, bridged AIS estimates will converge to the true value as $M$
increases, as long as $n$ is greater than $t/2$, so that there is overlap
1265: between successive distributions in the sequence. However, when $t$
1266: is large, the overlap between the distributions over paths produced by
1267: forward and reverse AIS runs, given by $\exp(-t/2)$, will be very
1268: small, and the procedure will be very inefficient.
1269:
1270: To see how well LIS performs, recall the formula for $\log \rhatLIS$
1271: from equation~(\ref{eq-logrhatLIS}):
1272: \beq
1273: \log\ \rhatLIS^{(i)} & = & \sum_{j=0}^{n-1} \left[
1274: \log \left({1 \over K_j+1}\, \sum_{k=0}^{K_j}\,
1275: { p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) } \right)
1276: \ -\
1277: \log \left({1 \over K_{j+1}+1}\, \sum_{k=0}^{K_{j+1}}\,
1278: { p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }\right)
1279: \right]\ \ \ \ \ \
1280: \eeq
1281: Due to symmetry, the two log terms above have the same distribution,
1282: for all $j$. The variance of one of these log
1283: terms (for large $m$) is
1284: $((t/2n)\,(1\!-\!t/2n)\,/\,(m\!+\!1))\,/\,(1\!-\!t/2n)^2$, which
1285: simplifies to $1\,/\,((2n/t\!-\!1)\,(m\!+\!1))$. The second log
1286: term in equation~(\ref{eq-logrhatLIS}) for one $j$ will involve the
1287: same points, $x_{j+1,k}$, as the first log term for the next $j$. The
1288: effect of this is that these terms will be negatively correlated, with
1289: correlation of $-1$ if $n\!=\!t$. However, since the
1290: two terms occur with opposite signs, the effect on the final sum is
1291: that $n\!-\!1$ pairs of terms (out of $2n$ terms total) are positively
1292: correlated. Straightforward calculations show that this correlation is
1293: $2n/t - 1$ for $t/2 < n \le t$ and $1\,/\,(2n/t - 1)$ for $n \ge t$.
1294: Using the fact that when $X$ and $Y$ have the same
1295: distribution, $\Var(X+Y) = 2\,\Var(X)\,[1+\Cor(X,Y)]$, we obtain the
1296: result that, for large $m$,
1297: \beq
1298: \Var \Big(\log\ \rhatLIS^{(i)}\Big)
1299: & \approx & {2 \over (2n/t\!-\!1)\,(m\!+\!1)}
1300: \left\{\begin{array}{ll}
1301: n\ +\ (n\!-\!1)\,(2n/t-1)
1302: & \ \ \mbox{if $t/2 < n \le t$}
1303: \\[4pt]
1304: n\ +\ (n\!-\!1)\,/\, (2n/t-1)
1305: & \ \ \mbox{if $n \ge t$}
1306: \end{array}\right\}
1307: \eeq
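
The correlations quoted above can be checked with a short calculation,
sketched here under the assumption that the $m\!+\!1$ points for each
distribution are independent. Let $p = 1-t/2n$, and let $F_1$ and $F_2$
be the fractions of the points $x_{j+1,k}$ that lie in the support of
$\pi_{\eta_j}$ and of $\pi_{\eta_{j+2}}$ respectively (this notation is
introduced only for this illustration). When $t/2 < n \le t$, these two
regions are disjoint, so the corresponding counts are multinomial, and
\beq
  \Cor(F_1,F_2) & = &
  { -p^2/(m\!+\!1) \over p\,(1\!-\!p)/(m\!+\!1) }
  \ \ =\ \ -\,{p \over 1\!-\!p} \ \ =\ \ -\,(2n/t-1) \nonumber
\eeq
When $n \ge t$, the same argument applies to $1\!-\!F_1$ and
$1\!-\!F_2$, which are the fractions of points in two disjoint regions
that each have probability $1\!-\!p$, giving
\beq
  \Cor(F_1,F_2) & = & -\,{1\!-\!p \over p}
  \ \ =\ \ -\,{1 \over 2n/t-1} \nonumber
\eeq
For large $m$, the logs of these fractions have approximately the same
correlation, and the opposite signs with which the two log terms enter
the sum turn it into the positive correlation used above.
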
1308: Setting $m = C/(n\!+\!1)$, and assuming $m/(m\!+\!1)\approx 1$, gives
1309: \beq
1310: \Var \Big(\log\ \rhatLIS^{(i)}\Big)
1311: & \approx & {2(n\!+\!1) \over C(2n/t\!-\!1)}
1312: \left\{\begin{array}{ll}
1313: n\ +\ (n\!-\!1)\,(2n/t-1)
1314: & \ \ \mbox{if $t/2 < n \le t$}
1315: \\[4pt]
1316: n\ +\ (n\!-\!1)\,/\, (2n/t-1)
1317: & \ \ \mbox{if $n \ge t$}
1318: \end{array}\right\}
1319: \eeq
1320: Numerical investigation shows that the global minimum of the variance
1321: occurs where $n$ is near $(3/2)\,t$. A second local minimum where $n$
1322: is near $(3/4)\,t$ also exists. The two minima are nearly equally good
1323: when $t$ is large. There is a local maximum where $n$ is near $t$,
1324: with the variance there being about 19\% greater than at the global
1325: minimum. The variance is much larger for very large and very small values
1326: of $n$. We therefore see that for this example too, the best results
1327: are obtained by fixing $n$ to a moderate value; any desired level of
1328: accuracy can then be obtained by increasing $m$ and/or $M$.
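
As an illustration, take $t=4$, and multiply the last expression for
the variance by $C$. This gives approximately the following values:
\beq
  \begin{array}{c|cccccc}
    n & 3 & 4 & 5 & 6 & 7 & 8 \\ \hline
    C\,\Var\big(\log \rhatLIS^{(i)}\big)
      & 64 & 70 & 61.3 & 59.5 & 60.2 & 62
  \end{array} \nonumber
\eeq
The global minimum is at $n=6=(3/2)\,t$, there is a second local minimum
at $n=3=(3/4)\,t$, and the local maximum at $n=4=t$ is worse than the
global minimum by a factor of $70/59.5 \approx 1.176$. At $n=2=t/2$
successive distributions do not overlap, and the expression is infinite.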
1329:
1330:
1331: \section{\hspace*{-7pt}Empirical comparisons of AIS and
1332: LIS}\label{sec-cmp}\vspace*{-10pt}
1333:
1334: The analytical results of the previous section indicate that LIS can
1335: sometimes perform much better than AIS, but that the benefits of LIS
1336: may only be seen when the number of intermediate distributions used is
1337: kept suitably small (but not so small that they do not overlap). In
1338: this section, I investigate the performance of AIS and LIS (and their
1339: bridged versions) empirically. The programs used for these tests
1340: (written in R) are available from my web page.
1341:
1342: These tests were done using sequences of one-dimensional distributions
1343: having unnormalized density functions of the following form:
1344: \beq
1345: p_{\eta}(x) & = &
1346: \exp\Big(\!-\!\Big|(x\!-\!\eta t)\,/\,s^{\eta}\,\Big|^q\,\Big)
1347: \eeq
1348: where $s$, $t$, and $q$ are fixed constants. As $\eta$ moves from 0 to 1,
1349: the centre of this distribution shifts by $t$, and changes width by the
1350: factor $s$. The power $q$ controls how thick the tails of the distributions
1351: are. When $q=2$, the distributions are Gaussian; a larger value
1352: produces lighter tails. Note that $Z_{\eta}$ is
1353: proportional to $s^{\eta}$, and hence $r = Z_1/Z_0$ is equal to $s$.
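
This proportionality can be seen by substituting $u = (x-\eta t)/s^{\eta}$
in the integral defining the normalizing constant:
\beq
  Z_{\eta} & = & \int_{-\infty}^{\infty}
    \exp\Big(\!-\!\Big|(x\!-\!\eta t)\,/\,s^{\eta}\,\Big|^q\,\Big)\,dx
  \ \ =\ \ s^{\eta} \int_{-\infty}^{\infty} \exp\big(\!-\!|u|^q\big)\,du
  \ \ =\ \ 2\,\Gamma(1\!+\!1/q)\ s^{\eta} \nonumber
\eeq
The factor $2\,\Gamma(1\!+\!1/q)$ does not depend on $\eta$, and so
cancels in the ratio $Z_1/Z_0$. For $q=2$ it equals $\sqrt{\pi}$, the
familiar Gaussian integral.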
1354:
1355: If $t=0$, the distributions can be written in the form of
1356: equation~(\ref{eq-U-dist}), after reparameterizing in terms of $\eta'
1357: = 1/s^{\eta q}$, so that $p_{\eta'}(x) = \exp(-\eta' |x|^q)$. In this
1358: case, we expect the asymptotic behaviour to be as discussed in
1359: Section~\ref{sec-asym}, but the behaviour with samples of practical
1360: size may be different. As $q$ goes to infinity, the distributions
1361: converge to uniform distributions over $(\eta t\!-\!s^{\eta},\,\eta
1362: t\!+\!s^{\eta})$, and the results of Section~\ref{sec-unif} become relevant.
1363:
1364: I did an initial set of tests using six sequences of distributions.
1365: Three of these sequences were of Gaussian distributions, with $q\!=\!2$.
1366: The first of these used $s\!=\!1$ and $t\!=\!4$, producing a shift with no
1367: change in scale as $\eta$ increases from 0 to 1. The second used
1368: $s\!=\!0.05$ and $t\!=\!0$, producing a contraction with no shift. The last
1369: used $s\!=\!0.3$ and $t\!=\!2$, combining a shift with a contraction. A
1370: second set of three sequences used the same values of $s$ and $t$, but
1371: with $q\!=\!10$, which produces more `rectangular' distributions with
1372: lighter tails. The six sequences are shown in Figure~\ref{fig-seq}.
1373: Each sequence in these plots consists of five distributions,
1374: corresponding to $\eta\, =\, 0,\, 1/4,\, 2/4,\, 3/4,\, 1$. These were
1375: the sequences used for the LIS runs (hence $n\!=\!4$ for these runs). The
1376: AIS runs used more distributions, spaced more finely with respect to
1377: $\eta$, so as to produce the same number of Markov transitions and
1378: sampling operations as in the LIS runs.
1379:
1380:
1381: \begin{figure}[t]
1382:
1383: \vspace*{-29pt}
1384:
1385: \centerline{\includegraphics{epow-plts.ps}}
1386:
1387: \caption[]{The sequences of unnormalized density functions used for the
1388: tests. The plots show the unnormalized density functions for
1389: $\eta\, =\, 0,\, 1/4,\, 2/4,\, 3/4,\, 1$, for six combinations
1390: of $s$, $t$, and $q$.}\label{fig-seq}
1391:
1392: \end{figure}
1393:
1394:
1395: These distributions (for any $\eta$) can easily be sampled from using
1396: rejection sampling. Samples from $\pi_0$ and $\pi_1$ were used to
1397: initialize forward and reverse runs of AIS and LIS. For this test, we
1398: pretend that sampling for other $\pi_{\eta}$ must be done using Markov
1399: chain methods. The transition used for $\pi_{\eta}$, $T_{\eta}$, was
1400: a random-walk Metropolis update, using a Gaussian proposal
1401: distribution with mean equal to the current point and standard
1402: deviation $s^{\eta}$. Since Metropolis updates are reversible,
1403: $\underline{T}_{\eta}$ was the same.
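
For concreteness, the standard form of such a random-walk Metropolis
update can be written as follows, where the symbols $x^*$ and $\epsilon$
are introduced only for this illustration. From the current point, $x$,
a proposal $x^* = x + s^{\eta}\epsilon$ is generated, with $\epsilon$
drawn from the standard Gaussian distribution. The proposal is accepted
as the new point with probability
\beq
  \min\!\left[\,1,\ {p_{\eta}(x^*) \over p_{\eta}(x)}\,\right]
  & = &
  \min\!\left[\,1,\ \exp\Big(\,\Big|(x\!-\!\eta t)\,/\,s^{\eta}\,\Big|^q
    -\, \Big|(x^*\!-\!\eta t)\,/\,s^{\eta}\,\Big|^q\,\Big)\right] \nonumber
\eeq
and otherwise the new point is the same as the current point. Only the
unnormalized densities are needed, since $Z_{\eta}$ cancels in the ratio.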
1404:
1405: Two sets of forward and reverse LIS runs were done with $n\!=\!4$, all
1406: $K_j\!=\!50$, and $M\!=\!20$, one set using the geometric bridge, the
1407: other using the optimal bridge with the true value of $r$. The
1408: forward estimates were computed from equation~(\ref{eq-lis}); the
1409: reverse estimates from equation~(\ref{eq-lis-rev}), which is
1410: equivalent to using the forward procedure with the reverse sequence of
1411: distributions. Bridged LIS estimates were also found using
1412: equation~(\ref{eq-bridged-lis1}), with the value of $r$ found by
1413: iteration. To make the comparison with forward and reverse estimates
fair, the bridged LIS estimates used $M\!=\!10$; that is, only half of
the forward and half of the reverse runs were used, for a total of
$20$ runs.
1417:
1418: A corresponding set of forward, reverse, and bridged AIS runs were
1419: also done, with $n\!=\!250$ and $M\!=\!20$ ($M\!=\!10$ for the bridged
1420: estimates). If sampling a point from $\pi_0$ or $\pi_1$ takes about
1421: the same computation time as a Metropolis update, these AIS runs will
1422: take about the same time as the LIS runs. (This assumes that sampling
1423: and Markov transitions dominate the time, which is typically true for
1424: real problems but perhaps not for this simple test problem.)
1425:
1426: Sets of longer LIS and AIS runs were also done, which were the same as
1427: the sets above except that for LIS, $K_j\!=\!200$ for all $j$, and for
1428: AIS, $n\!=\!1000$, which again equalizes the computation time.
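
The arithmetic behind this matching is simple, if we count roughly one
unit of cost for each Markov transition or sampling operation, as in the
cost model of Section~\ref{sec-unif}. An LIS run with $n\!=\!4$ uses
$n\!+\!1 = 5$ distributions, so its cost is about
\beq
  (n\!+\!1)\,K_j & = & 5 \times 50 \ \ =\ \ 250
  \ \ \ \mbox{(short runs)}, \ \ \ \ \ \
  5 \times 200 \ \ =\ \ 1000 \ \ \ \mbox{(long runs)} \nonumber
\eeq
which is approximately the cost of an AIS run with $n\!=\!250$ or
$n\!=\!1000$ distributions and one transition for each.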
1429:
1430: Experience, together with the asymptotic results of
1431: Section~\ref{sec-asym}, shows that estimates produced using a small
1432: value of $M$ are better than, or at least as good as, those produced
1433: with larger $M$. I chose $M\!=\!20$ ($M\!=\!10$ for bridged estimates) since
1434: this is about the smallest value that allows reliable estimation of
1435: standard errors, which would usually be needed in practice.
1436:
1437: The standard errors for AIS and LIS estimates of $\rhat$ were
1438: estimated by the sample standard deviation of the $\rhat^{(i)}$
1439: divided by $\sqrt{M}$. When comparing the methods, I looked primarily
1440: at the mean squared error when estimating $\log(r)$ (rather than when
1441: estimating $r$). The estimate I used was $\log(\rhat)$, and the
1442: standard error for this estimate was estimated by the standard error
1443: for $\rhat$ divided by $\rhat$. For the reverse runs, $\log(r)$ was
1444: estimated by $-\log(\rhatrev)$. For bridged AIS and LIS, the standard
1445: errors for the log of the numerator and the log of the denominator of
1446: equation~(\ref{eq-bridged-lis1}) were found, and the overall standard
1447: error was computed as the square root of the sum of the squares of
1448: these two standard errors. This method of converting estimates and
1449: standard errors for $r$ to those for $\log(r)$ is valid
1450: asymptotically. It might be improved upon for finite samples, but
1451: such improvements would probably not affect the relative merits of the
1452: methods compared here.
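
In symbols, the conversion described above is a first-order (delta
method) approximation. Writing $\mbox{se}[\,\cdot\,]$ for an estimated
standard error (notation introduced only for this summary), the
unbridged estimates use
\beq
  \mbox{se}\big[\log(\rhat)\big] & \approx & \mbox{se}\big[\rhat\big]\,/\,\rhat,
  \ \ \ \ \mbox{with}\ \ \
  \mbox{se}\big[\rhat\big] \ =\ 
  \mbox{sd}\big(\rhat^{(1)},\ldots,\rhat^{(M)}\big)\,/\,\sqrt{M} \nonumber
\eeq
For the bridged estimates, with $A$ and $B$ denoting the numerator and
denominator of equation~(\ref{eq-bridged-lis1}), the standard error
used for the log of the ratio is
$\sqrt{\mbox{se}[\log A]^2 + \mbox{se}[\log B]^2}$. This treats the
numerator and denominator as independent, which is reasonable here
since they come from separate forward and reverse runs.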
1453:
1454: \begin{figure}[p]
1455:
1456: \centerline{\includegraphics{tst-plt1.ps}}
1457:
1458: \vspace*{-8pt}
1459:
1460: %\hspace*{0.1in}\makebox[3.2in]{Short Runs}\hfill\makebox[3.2in]{Long Runs}
1461:
1462: \caption[]{Results of short and long runs
1463: on the distribution sequence with $s\!=\!1$, $t\!=\!4$, and
1464: $q\!=\!2$.}\label{fig-r1}
1465:
1466: \end{figure}
1467:
1468: \begin{figure}[p]
1469:
1470: \centerline{\includegraphics{tst-plt2.ps}}
1471:
1472: \vspace*{-8pt}
1473:
1474: %\hspace*{0.1in}\makebox[3.2in]{Short Runs}\hfill\makebox[3.2in]{Long Runs}
1475:
1476: \caption[]{Results of short and long runs
1477: on the distribution sequence with $s\!=\!1$, $t\!=\!4$, and
1478: $q\!=\!10$.}\label{fig-r2}
1479:
1480: \end{figure}
1481:
1482:
1483: \begin{figure}[p]
1484:
1485: \centerline{\includegraphics{tst-plt3.ps}}
1486:
1487: \vspace*{-8pt}
1488:
1489: %\hspace*{0.1in}\makebox[3.2in]{Short Runs}\hfill\makebox[3.2in]{Long Runs}
1490:
1491: \caption[]{Results of short and long runs
1492: on the distribution sequence with $s\!=\!0.05$, $t\!=\!0$, and
1493: $q\!=\!2$.}\label{fig-r3}
1494:
1495: \end{figure}
1496:
1497: \begin{figure}[p]
1498:
1499: \centerline{\includegraphics{tst-plt4.ps}}
1500:
1501: \vspace*{-8pt}
1502:
%\hspace*{0.1in}\makebox[3.2in]{Short Runs}\hfill\makebox[3.2in]{Long Runs}
1504:
1505: \caption[]{Results of short and long runs
1506: on the distribution sequence with $s\!=\!0.05$, $t\!=\!0$, and
1507: $q\!=\!10$.}\label{fig-r4}
1508:
1509: \end{figure}
1510:
1511:
1512: \begin{figure}[p]
1513:
1514: \centerline{\includegraphics{tst-plt5.ps}}
1515:
1516: \vspace*{-8pt}
1517:
1518: %\hspace*{0.1in}\makebox[3.2in]{Short Runs}\hfill\makebox[3.2in]{Long Runs}
1519:
1520: \caption[]{Results of short and long runs
1521: on the distribution sequence with $s\!=\!0.3$, $t\!=\!2$, and
1522: $q\!=\!2$.}\label{fig-r5}
1523:
1524: \end{figure}
1525:
1526: \begin{figure}[p]
1527:
1528: \centerline{\includegraphics{tst-plt6.ps}}
1529:
1530: \vspace*{-8pt}
1531:
1532: %\hspace*{0.1in}\makebox[3.2in]{Short Runs}\hfill\makebox[3.2in]{Long Runs}
1533:
1534: \caption[]{Results of short and long runs
1535: on the distribution sequence with $s\!=\!0.3$, $t\!=\!2$, and
1536: $q\!=\!10$.}\label{fig-r6}
1537:
1538: \end{figure}
1539:
1540: Figures~\ref{fig-r1} through \ref{fig-r6} plot the mean squared errors
1541: of estimates for $\log(r)$ for the six sets of runs. Results are
1542: shown for AIS, for LIS using the geometric bridge, and for LIS using
1543: the optimal bridge, with the true value of $r$. Results for both the
1544: forward and reverse versions of each method are shown, together with
1545: the bridged version, using the optimal bridge, with $r$ obtained by
1546: iteration. Results for the short runs ($n\!=\!4$, $K_j\!=\!50$ for
1547: LIS, $n\!=\!250$ for AIS) are on the left, and for the long runs
($n\!=\!4$, $K_j\!=\!200$ for LIS, $n\!=\!1000$ for AIS) on the right.
1549: The mean squared error for each method was estimated by simulating
1550: each method 2000 times, and comparing the estimates with the true
1551: value of $\log(r)$. The bars in the plots are dark up to the
1552: estimated mean squared error minus twice its standard error, and are
1553: then light up to the estimated mean squared error plus twice its
1554: standard error. For bars that extend above the plot the estimated
1555: mean squared error is shown at the top of the bar.
1556:
1557: The results for translated sequences of distributions ($t\!=\!4$ and
1558: $s\!=\!1$) are shown in Figures~\ref{fig-r1} and~\ref{fig-r2}. When the
1559: distributions are Gaussian ($q\!=\!2$), no advantage is seen for LIS --- if
1560: anything, LIS performs slightly worse than AIS, particularly when the
1561: geometric bridge is used. The forward and reverse forms of AIS and
1562: LIS should have identical performance for these distribution
1563: sequences, due to symmetry; any differences seen result from random
1564: variation. The bridged forms of both AIS and LIS perform better than
1565: the unbridged forward and reverse forms. The advantage of bridging is
1566: less for the longer runs, however, as expected from the analysis at
1567: the end of Section~\ref{sec-asym}.
1568:
1569: When $q\!=\!10$, the distributions have much lighter tails than the
1570: Gaussian, more closely resembling the uniform distributions analysed
1571: in Section~\ref{sec-unif}. For these sequences of distributions, LIS
1572: performs substantially better than AIS. The unbridged version of AIS
1573: does particularly badly. The mean squared error for the bridged
1574: version of AIS is about 2.5 times greater than for the bridged version
1575: of LIS. It makes little difference whether the geometric or optimal
1576: bridge is used for LIS.
1577:
1578: Figures~\ref{fig-r3} and~\ref{fig-r4} show the results for sequences
1579: of distributions with the same mean ($t\!=\!0$) but decreasing width
1580: ($s\!=\!0.05$). For these sequences, a modest advantage of LIS over AIS
1581: is apparent for the sequence of Gaussian distributions ($q\!=\!2$), with
1582: the variance for AIS estimates being about a factor of 1.3 greater
1583: than for LIS estimates with the geometric bridge, and about a factor
1584: of 1.7 greater than for LIS estimates with the optimal bridge. The
1585: reversed AIS and LIS estimates are somewhat worse than the forward
1586: estimates for this sequence of distributions. No advantage is seen for
1587: bridged AIS or LIS estimates.
1588:
The results for the sequence of distributions with $q\!=\!10$ are similar,
1590: except that the advantage of LIS over AIS is much greater --- about a
1591: factor of 6.
1592:
1593: Results for the last type of sequence, with $s\!=\!0.3$ and $t\!=\!2$, are
1594: shown in Figures~\ref{fig-r5} and~\ref{fig-r6}. This problem is a
1595: hybrid of the previous two, with both translation and change in width,
1596: producing results intermediate between those for the previous two
1597: problems. No difference in performance between AIS and LIS is
1598: apparent for the Gaussian distributions ($q\!=\!2$), but the bridged forms
1599: of both perform slightly better. For the sequence of distributions
1600: with $q\!=\!10$, a clear advantage of LIS over AIS can be seen, but this
1601: advantage is not as great as for the sequence with $t\!=\!0$ and $s\!=\!0.05$.
1602: The bridged forms of both AIS and LIS are again better, more so for
1603: the short runs than for the long runs.
1604:
1605: In addition to looking at the mean squared error of estimates found
1606: with these methods, I also looked at the fraction of times that the
1607: estimate for $\log(r)$ differed from the true value by more than twice
1608: the standard error estimated using the $M$ runs. This should be
1609: approximately 5\% if the distribution of estimates is Gaussian, and
1610: the standard errors are accurate. For the longer runs, this fraction
1611: was indeed near or only slightly above 5\% for all methods, except for
1612: the unbridged AIS runs when these performed very poorly. For the
1613: shorter runs, however, the unbridged AIS and LIS methods produced
estimates more than two standard errors from the true value around 10\% of
1615: the time (sometimes much more often, when unbridged AIS performed
1616: poorly). Both the bridged AIS and the bridged LIS methods gave more
1617: reliable standard errors. However, it is possible that better
1618: standard errors for the unbridged methods might be obtained with a
1619: more sophisticated approach than I used.
1620:
1621: I performed additional runs to verify and extend some of the analytic
1622: results from Section~\ref{sec-anal}. Figures~\ref{fig-r7}
1623: and~\ref{fig-r8} show results obtained using LIS with increasing
1624: numbers of intermediate distributions, starting with the value of
1625: $n\!=\!4$ used for the tests above, and continuing to $n\!=\!9$,
1626: $n\!=\!19$, and $n\!=\!39$, while keeping the computation time
constant by decreasing $m$ in inverse proportion to $n\!+\!1$. The two
1628: distribution sequences with $s\!=\!1$ and $t\!=\!4$ and with
1629: $s\!=\!0.05$ and $t\!=\!0$ were used, in both cases with $q\!=\!10$.
1630: The sequence with $t\!=\!0$ and $s\!=\!0.05$ has the form of
1631: equation~(\ref{eq-U-dist}), so in accordance with the analysis of
1632: Section~\ref{sec-asym}, we expect that asymptotically, as $n$
1633: increases, LIS and AIS should have the same performance. This is
indeed what we see in Figure~\ref{fig-r8}. We also see the same
behaviour for the sequence with $t\!=\!4$ and $s\!=\!1$ in
Figure~\ref{fig-r7}.
1637:
1638:
1639: \begin{figure}[p]
1640:
1641: \centerline{\includegraphics{tst-plt-2four.ps}}
1642:
1643: \vspace*{-8pt}
1644:
1645: \caption[]{Results using increasing values of $n$ for LIS, while keeping
1646: computation time constant, for the distribution sequence with
1647: $s\!=\!1$, $t\!=\!4$, and $q\!=\!10$. The same AIS procedure was
1648: used for all plots, but results vary randomly.}\label{fig-r7}
1649:
1650: \end{figure}
1651:
1652:
1653: \begin{figure}[p]
1654:
1655: \centerline{\includegraphics{tst-plt-4four.ps}}
1656:
1657: \vspace*{-8pt}
1658:
1659: \caption[]{Results using increasing values of $n$ for LIS, while keeping
1660: computation time constant, for the distribution sequence with
1661: $s\!=\!0.05$, $t\!=\!0$, and $q\!=\!10$. The same AIS procedure was
1662: used for all plots, but results vary randomly.}\label{fig-r8}
1663:
1664: \end{figure}
1665:
1666:
1667: \begin{figure}[p]
1668:
1669: \centerline{\includegraphics{tst-plt-222.ps}}
1670:
1671: \vspace*{-8pt}
1672:
1673: \caption[]{Results with increasing values of $q$, for sequences of
1674: distributions with $s\!=\!1$ and $t\!=\!4$. The AIS runs used
1675: $n\!=\!250$; the LIS runs used $n\!=\!4$ and $m\!=\!50$,
1676: requiring the same amount of computation.}\label{fig-r9}
1677:
1678: \end{figure}
1679:
1680:
1681: \begin{figure}[p]
1682:
1683: \centerline{\includegraphics{tst-plt-444.ps}}
1684:
1685: \vspace*{-8pt}
1686:
1687: \caption[]{Results with increasing values of $q$, for sequences of
distributions with $s\!=\!0.05$ and $t\!=\!0$. The AIS runs used
1689: $n\!=\!250$; the LIS runs used $n\!=\!4$ and $m\!=\!50$,
1690: requiring the same amount of computation.}\label{fig-r10}
1691:
1692: \end{figure}
1693:
1694: As $q$ increases, the distributions become close to uniform, and the
1695: results of Section~\ref{sec-unif} should apply. To test this, I tried
1696: values of $q\!=\!2$, $q\!=\!10$, $q\!=\!20$, and $q\!=\!30$ for the
1697: distribution sequence with $s\!=\!1$ and $t\!=\!4$ and the sequence with
1698: $s\!=\!0.05$ and $t\!=\!0$. Results are shown in Figures~\ref{fig-r9}
1699: and~\ref{fig-r10}. (The results for $q\!=\!2$ and $q\!=\!10$ are the same as
1700: on the left in Figures~\ref{fig-r1} to~\ref{fig-r4}, though the scale
1701: differs.)
1702:
1703: For the sequences with $s\!=\!1$ and $t\!=\!4$, the limiting uniform
1704: distributions have the form of the second example in
1705: Section~\ref{sec-unif}. As noted there, AIS estimates do not converge
1706: to the correct value of $r$ for this distribution sequence; bridged AIS
1707: estimates do converge, but may be rather inefficient. We see
1708: analogous behaviour in Figure~\ref{fig-r9} when $q$ is large. The
1709: mean squared error of the AIS estimates increases approximately
1710: linearly with $q$ over the range $q\!=\!10$ to $q\!=\!30$. The
1711: bridged AIS estimates also get worse as $q$ increases, but more
1712: slowly. In contrast, the mean squared error of the LIS estimates
1713: changes hardly at all as $q$ increases.
1714:
The story is similar for sequences with $s\!=\!0.05$ and $t\!=\!0$,
1716: for which the limiting uniform distributions correspond to those in
1717: the first example of Section~\ref{sec-unif}. The LIS estimates
1718: perform about equally well for all values of $q$, but the AIS
1719: estimates are dramatically worse for large values of $q$. For this
1720: sequence, reverse AIS estimates are much worse than forward AIS
1721: estimates, and bridging does not help.
1722:
According to the analysis of Section~\ref{sec-unif}, the
choice of $n\!=\!4$ for LIS used above is not optimal for either of
1725: these distribution sequences when $q$ is large. For the sequence with
1726: $s\!=\!1$ and $t\!=\!4$, using $n\!=\!6$ should be better by a factor
1727: of 1.176. However, in LIS runs with $q=30$, the mean squared error
using $n\!=\!4$ and $m\!=\!200$ is indistinguishable from that using
1729: $n\!=\!6$ and $m\!=\!143$, given the standard errors (a factor of 1.09
1730: or more should have been detectable). Of course, $q=30$ does not give
1731: exactly uniform distributions, and these values of $m$ may not be
1732: large enough for the asymptotic results to apply, especially since the
1733: Markov transitions do not sample independently. For the sequence with
$s\!=\!0.05$ and $t\!=\!0$, the results in Section~\ref{sec-unif}
1735: indicate that using $n\!=\!3$ should be better by a factor of 1.084.
1736: In this case, LIS runs with $q=30$ using $n\!=\!3$ and $m\!=\!250$ are
1737: better than runs using $n\!=\!4$ and $m\!=\!200$ by a factor of 1.16,
1738: significantly greater than one given the standard errors, but not
1739: significantly different from the expected ratio of 1.084.
1740:
1741:
1742: \section{\hspace*{-7pt}Other applications of linked
1743: sampling}\label{sec-gen}\vspace*{-10pt}
1744:
1745: So far in this paper, I have focused on how Linked Importance Sampling
1746: can be used to estimate ratios of normalizing constants. LIS can also
1747: be used to estimate expectations with respect to $\pi_1$, however, and
1748: in some applications, this may be its most important use. Linked
1749: sampling methods related to LIS can also be applied in other ways. I
briefly describe these other applications here, outlining the use of
1751: linked sampling for `dragging' fast variables in some detail.
1752:
1753:
1754: \subsection{\hspace*{-4pt}Estimating expectations}\vspace*{-4pt}
1755:
1756: The expectation of some function, $a(x)$, with respect to $\pi_1$
1757: can be estimated using simple importance sampling, with points drawn
1758: from $\pi_0$, as follows:
1759: \beq
1760: E_{\pi_1}\big[a(X)\big]
1761: \ \ = \ \ E_{\pi_0}\!\left[ a(X) {p_1(X) \over p_0(X)}\right] \, \Big/\
1762: {Z_1 \over Z_0}
1763: \ \ \approx\ \
1764: {1 \over N}\sum_{i=1}^N\, a(x^{(i)})\, {p_1(x^{(i)}) \over p_0(x^{(i)})}\ \Big/\
1765: {1 \over N}\sum_{i=1}^N\, {p_1(x^{(i)}) \over p_0(x^{(i)})}
1766: \label{eq-is-exp}
1767: \eeq
where $x^{(1)},\ldots,x^{(N)}$ are drawn from $\pi_0$.
1769: Like equation~(\ref{eq-simple}), this estimate is valid only if
1770: no region having zero probability under $\pi_0$ has non-zero probability
1771: under $\pi_1$. The two factors of $1/N$ of course cancel, but are included
1772: to emphasize the connection with the estimate for $r=Z_1/Z_0$, which is
1773: simply the denominator of the estimate above.
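
The first equality in equation~(\ref{eq-is-exp}) follows by writing the
expectation with respect to $\pi_0$ as an integral, with
$\pi_0(x) = p_0(x)/Z_0$ and $\pi_1(x) = p_1(x)/Z_1$:
\beq
  E_{\pi_0}\!\left[ a(X) {p_1(X) \over p_0(X)}\right]
  & = & \int a(x)\, {p_1(x) \over p_0(x)}\, {p_0(x) \over Z_0}\, dx
  \ \ =\ \ {Z_1 \over Z_0} \int a(x)\, {p_1(x) \over Z_1}\, dx
  \ \ =\ \ {Z_1 \over Z_0}\ E_{\pi_1}\big[a(X)\big] \nonumber
\eeq
The cancellation of $p_0(x)$ is where the condition on regions of zero
probability is needed: any region where $p_0$ is zero but $p_1$ is not
would be missed by the integral on the left.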
1774:
1775: Since LIS can be viewed as simple importance sampling on an extended
1776: state space, with distributions $\Pi_0$ and $\Pi_1$ defined by the
1777: forward and reverse procedures of Section~\ref{sec-lis}, we can use
1778: equation~(\ref{eq-is-exp}) to estimate any quantity that can be
expressed as an expectation with respect to $\Pi_1$. Step (1) of the
1780: reverse procedure defining $\Pi_1$ sets $x_{n,\mu_n}$ to a value
1781: randomly chosen from $\pi_{\eta_n} = \pi_1$. Step (2) then sets the
1782: other $x_{n,k}$ to values obtained from $x_{n,\mu_n}$ by applying
1783: Markov chain transitions that leave $\pi_1$ invariant. It follows
1784: that under $\Pi_1$, all the points $x_{n,k}$ have marginal
1785: distribution $\pi_1$ (though they may not be independent). Accordingly,
1786: \beq
1787: E_{\pi_1}\big[a(X)\big] & = & E_{\,\Pi_1}\!\left[
1788: {1 \over K_n\!+\!1}\, \sum_{k=0}^{K_n} a(X_{n,k}) \right]
1789: \eeq
1790: Estimating the right side as in equation~(\ref{eq-is-exp}), and using
1791: the fact that the ratio of probabilities under $\Pi_1$ over those
1792: under $\Pi_0$ is given by $\rhatLIS^{(i)}$ in equation~(\ref{eq-lis}),
1793: we get the estimate
1794: \beq
1795: E_{\pi_1}\big[a(X)\big] & \approx &
1796: \sum_{i=1}^M {\rhatLIS^{(i)} \over K_n\!+\!1}
1797: \sum_{k=0}^{K_n} a(x^{(i)}_{n,k})
1798: \ \Big/\
1799: \sum_{i=1}^M \rhatLIS^{(i)}
1800: \label{eq-lis-exp}
1801: \eeq
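It may help to read equation~(\ref{eq-lis-exp}) as a weighted average
over runs. With notation introduced only for this remark, let
$\bar a^{(i)}$ be the average of $a$ over the final chain of run $i$,
and let $w^{(i)}$ be the normalized weight of that run:
\beq
  \bar a^{(i)} & = & {1 \over K_n\!+\!1} \sum_{k=0}^{K_n} a(x^{(i)}_{n,k}),
  \ \ \ \ \ \
  w^{(i)} \ \ =\ \ \rhatLIS^{(i)} \,\Big/\, \sum_{i'=1}^M \rhatLIS^{(i')}
  \nonumber
\eeq
The estimate is then $\sum_{i=1}^M w^{(i)}\, \bar a^{(i)}$. If all the
$\rhatLIS^{(i)}$ were equal, this would reduce to the simple average of
the $\bar a^{(i)}$; when a few runs have much larger weights than the
rest, the estimate is dominated by those runs.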
1802:
1803: If the $M$ runs of LIS are started by sampling independently from
1804: $\pi_0$ (as will often be possible), the standard error of this
1805: estimate can be assessed in the usual fashion for importance sampling,
1806: as I have discussed for the analogous AIS estimates in (Neal 2001).
1807: This error assessment can be difficult, since when some
1808: $\rhatLIS^{(i)}$ are much larger than others, the variance of
1809: $\rhatLIS^{(i)}$ is hard to estimate. Note, however, that the degree
1810: to which the Markov chain transitions used have converged need not be
1811: assessed, a possible advantage compared with simple MCMC estimates. The
1812: estimate of equation~(\ref{eq-lis-exp}) will be asymptotically correct
1813: (as $M\rightarrow\infty$) regardless of how far these Markov chain
1814: transitions are from convergence.
1815:
1816: The primary reason one might wish to use LIS to estimate expectations
1817: is that going through the sequence of distributions parameterized by
1818: $\eta_0,\ldots,\eta_n$ may produce an `annealing' effect, which
1819: prevents the Markov chain sampler from being trapped in a local mode
1820: of the distribution. Compared with the analogous AIS procedure, LIS
1821: may perform better for some forms of distributions, for the same
1822: reasons as were discussed in Sections~\ref{sec-anal}
1823: and~\ref{sec-cmp}. One should also note that LIS estimates for
1824: expectations with respect to $\pi_{\eta_j}$ for all $j$ can easily be
1825: obtained from a single set of runs, by simply considering the results
1826: of each LIS run up to the point where the sample for $\pi_{\eta_j}$ is
1827: obtained.
1828:
1829:
1830: \subsection{\hspace*{-4pt}A linked form of tempered transitions}\vspace*{-4pt}
1831:
1832: My `tempered transition' method (Neal 1996) is another approach to
1833: sampling from distributions with isolated modes, between which
1834: movement is difficult for Markov chain transitions such as simple
1835: Metropolis updates. In this approach, such simple Markov chain
1836: transitions are supplemented by occasional complex `tempered
1837: transitions', composed of many simple Markov chain transitions. A
1838: tempered transition consists of several stages, which proceed through
1839: a sequence of distributions, from the distribution being sampled, to a
1840: `higher temperature' distribution in which movement between modes is
1841: easier, and then back down to the distribution being sampled. At each
1842: stage of a tempered transition, we generate a single new state by
1843: applying a Markov chain transition to the current state, after which
1844: we switch to the next distribution in the sequence. The second half of
1845: a tempered transition is similar to an Annealed Importance Sampling
1846: run, while the first half is similar to an AIS run with the reversed
1847: sequence of distributions.
1848:
1849: A similar `linked' procedure can be defined, in which at each stage we
generate a chain of states by repeatedly applying a Markov chain transition.
1851: We then select a `link state' from this sequence (using a suitable
1852: bridge distribution) which serves as the starting point for the chain
1853: of states generated in the next stage. In the final stage, a chain of
1854: states is produced using a Markov chain transition that leaves the
1855: distribution being sampled invariant, and a candidate state is
1856: selected uniformly at random from this chain. The appropriate
1857: probability for accepting this candidate state is computed using
1858: ratios similar to those going into the LIS estimate of
1859: equation~(\ref{eq-lis}).
1860:
1861: As discussed in Section~\ref{sec-cmp}, for AIS to work well, all
1862: distributions in the sequence must assign reasonably high probability
1863: to regions of the space that have non-negligible probability under the
1864: next distribution in the sequence. One would expect tempered
1865: transitions to work well only when this holds for both the sequence
1866: and its reversal. In contrast, one would expect the `linked' version
1867: of tempered transitions to work well as long as the sequence satisfies
1868: the weaker condition that there be some `overlap' between adjacent
1869: distributions (assuming a suitable bridge distribution is used).
1870:
1871:
1872: \subsection{\hspace*{-4pt}Dragging fast variables using linked
1873: chains}\vspace*{-4pt}
1874:
1875: A slight modification of the tempered transition method can be applied
1876: to problems in which the state is composed of both `fast' and `slow'
1877: variables. We will write the distribution of interest for such a problem
1878: as
1879: \beq
1880: \pi(x,y) & = & (1/Z)\, \exp(-U(x,y))
1881: \eeq
1882: where $x$ denotes the `fast' variables and $y$ the `slow' variables.
1883: We assume that the computation is dominated by the time required to
1884: evaluate $U(x,y)$, but that once $U(x,y)$ has been evaluated, with
1885: relevant intermediate quantities saved,
1886: evaluating $U(x',y)$ for any new $x'$ is much faster than evaluating
1887: $U(x',y')$ for some $y'$ not previously encountered. One example of
1888: such a problem is inference for Gaussian process classification models
1889: (Neal 1999), in which $y$ consists of the hyperparameters defining the
1890: covariance function used, and $x$ consists of the latent variables
1891: associated with the $n$ observations. After a change to $y$, we must
1892: recompute the Cholesky decomposition of an $n \times n$ covariance matrix,
1893: which takes time proportional to $n^3$, whereas after a change to $x$
1894: only, $U(x,y)$ can be re-computed in time proportional to $n^2$,
1895: assuming the Cholesky decomposition for this value of $y$ has been
1896: saved.
1897:
1898: In my method for `dragging' fast variables (Neal 2004), the ability
1899: to quickly re-evaluate $U(x,y)$ when only $x$ changes is exploited to
1900: allow larger changes to be made to $y$ than would be possible if $x$
1901: were kept fixed, or were given a new value from some simple proposal
1902: distribution. From the state $(x_0,y_0)$, a dragging
1903: update proposes a new value $y_1$, drawn from some symmetrical proposal
1904: distribution, in conjunction with a new value $x_1$ that is found by
applying a succession of Markov chain updates that leave invariant
the successive distributions in the series, $\pi_{\eta_j}(x)$, for
$j=1,\ldots,n\!-\!1$, with $0<\eta_j<\eta_{j+1}<1$. The proposed state,
1908: $(x_1,y_1)$, is then accepted or rejected in a fashion analogous to tempered
1909: transitions.
1910:
1911: The distributions in the sequence used are defined by the following
1912: unnormalized probability or density function, which depends on the
1913: current and proposed values for $y$:
1914: \beq
1915: p_{\eta}(x) & = &
1916: \exp\,(\,-\,((1\!-\!\eta)\, U(x,y_0)\ +\ \eta\, U(x,y_1)))
1917: \label{eq-drag-p}
1918: \eeq
1919: The corresponding normalized probability or density function will be
1920: written as $\pi_{\eta}$. Note that $\pi_0(x) = \pi(x|y_0)$ and
1921: $\pi_1(x)=\pi(x|y_1)$. Crucially,
1922: after $U(x,y_0)$ and $U(x,y_1)$ have been evaluated once (for any~$x$),
1923: we can evaluate $p_{\eta}(x)$ for any $\eta$ and any $x$
1924: without any further `slow' computations.
1925: Indeed, since $U(x_0,y_0)$ will usually have already been evaluated as part of
1926: the previous Markov chain transition, only one slow computation will be required
1927: to evaluate $p_{\eta}(x)$ for any number of values of $\eta$ and $x$.
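
The following identity, which simply rearranges
equation~(\ref{eq-drag-p}), may make this concrete. Writing
$\Delta(x) = U(x,y_1)-U(x,y_0)$, a symbol introduced here only for
convenience, we have
\beq
  \log p_{\eta}(x) & = & -\,U(x,y_0)\ -\ \eta\,\Delta(x) \nonumber
\eeq
so once the two energies at a given $x$ are available, $p_{\eta}(x)$ for
every $\eta$ is obtained with one multiplication and one subtraction.
Setting $\eta=0$ and $\eta=1$ recovers $p_0(x)=\exp(-U(x,y_0))$ and
$p_1(x)=\exp(-U(x,y_1))$, consistent with $\pi_0(x) = \pi(x|y_0)$ and
$\pi_1(x)=\pi(x|y_1)$.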
1928:
1929: A `linked' dragging update can be defined as follows. Given
1930: the sequence of distributions defined by $\eta_0,\ldots,\eta_n$, with
1931: $\eta_0=0$ and $\eta_n=1$, the numbers of transitions ($T$ or $\underline{T}$)
1932: to perform for each distribution over $x$, denoted by $K_0,\ldots,K_n$, and a
1933: set of bridge distributions, denoted by $p_{j*j+1}$, for $j=0,\ldots,n\!-\!1$,
1934: an update from the current state $(x_0,y_0)$ is done as follows:\vspace*{5pt}
1935:
1936: \begin{center}\bf The Linked Dragging Procedure\end{center}\vspace*{-5pt}
1937:
1938: \begin{enumerate}
1939: \item[1)] Propose a new value, $y_1$, from some proposal distribution
1940: $S(y_1|y_0)$, which satisfies the symmetry condition that $S(y_1|y_0)
1941: =S(y_0|y_1)$.
1942: \item[2)] Pick an integer $\nu_0$ uniformly at random from $\{0,\ldots,K_0\}$,
1943: and then set $x_{0,\nu_0}$ to the current values of the fast
1944: variables, $x_0$.
1945: \item[3)] For $j\,=\,0,\ldots,n$, create a chain of values for $x$ associated
1946: with $\pi_{\eta_j}$ as follows:
1947: \begin{enumerate}
1948: \item[a)] If $j>0$:\ \ Pick an integer $\nu_j$ uniformly at random from
1949: $\{0,\ldots,K_j\}$, and then set $x_{j,\nu_j}$ to $x_{j-1*j}$.
1950: \item[b)] For $k\,=\,\nu_j+1,\ldots,K_j$, draw $x_{j,k}$ according to the
1951: forward Markov chain transition probabilities
1952: $T_{\eta_j}(x_{j,k-1},x_{j,k})$. (If $\nu_j=K_j$, do nothing in
1953: this step.)
1954: \item[c)] For $k\,=\,\nu_j-1,\ldots,0$, draw $x_{j,k}$ according to the
1955: reverse Markov chain transition probabilities
1956: $\underline{T}_{\eta_j}(x_{j,k+1},x_{j,k})$. (If $\nu_j=0$, do
1957: nothing in this step.)
1958: \item[d)] If $j<n$:\ \ Pick a value for $\mu_j$ from
1959: $\{0,\ldots,K_j\}$ according to the following probabilities
1960: \beq
1961: \Pi_0(\mu_j\,|\,x_j) & = &
1962: {p_{j*j+1}(x_{j,\mu_j}) \over p_{\eta_j}(x_{j,\mu_j})}
1963: \ \Big/\
1964: \sum_{k=0}^{K_j} {p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k})}
1965: \eeq
1966: and then set $x_{j*j+1}$ to $x_{j,\mu_j}$.
1967: \end{enumerate}
\item[4)] Set $\mu_n$ to a value chosen uniformly at random from
$\{0,\ldots,K_n\}$, and let the proposed new values for the fast
variables, $x_1$, be equal to $x_{n,\mu_n}$.
\item[5)] Accept $(x_1,y_1)$ as the new state with probability
1972: \beq
1973: \min \left\{\, 1,\ \
1974: \prod_{j=0}^{n-1} \left[
1975: {1 \over K_j+1}\, \sum_{k=0}^{K_j}\,
1976: { p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) }
1977: \ \Big/\
1978: {1 \over K_{j+1}+1}\, \sum_{k=0}^{K_{j+1}}\,
1979: { p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }
1980: \right]
1981: \,\right\}
1982: \eeq
1983: If $(x_1,y_1)$ is not accepted, the new state is the same as
1984: the old state, $(x_0,y_0)$.\vspace*{-6pt}
1985: \end{enumerate}
1986: One can show that this update leaves $\pi(x,y)$ invariant by showing
that it satisfies detailed balance, which in turn follows from the
1988: stronger property that the probability of starting at $(x_0,y_0)$,
1989: assuming this start state comes from $\pi(x,y)$, then generating the various
1990: quantities produced by the above procedure, and finally accepting $(x_1,y_1)$
1991: as the new state, is the same as the probability of starting this procedure
1992: at $(x_1,y_1)$, generating the same quantities in reverse, and finally accepting
1993: $(x_0,y_0)$. The proof of this is analogous to the derivation of LIS in
1994: Section~\ref{sec-lis}.
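
One way to read the acceptance probability, offered here as an
interpretive sketch rather than as part of the proof, is as follows. Let
$Z_{\eta} = \int p_{\eta}(x)\,dx$ (or the corresponding sum, if $x$ is
discrete), and let $\pi(y) = \int \pi(x,y)\,dx$ denote the marginal
distribution of the slow variables, a notation introduced here for
illustration. Since $p_0(x) = \exp(-U(x,y_0)) = Z\,\pi(x,y_0)$, and
similarly for $p_1$,
\beq
  {Z_1 \over Z_0} & = &
  {\int \exp(-U(x,y_1))\,dx \over \int \exp(-U(x,y_0))\,dx}
  \ \ =\ \ {\pi(y_1) \over \pi(y_0)} \nonumber
\eeq
The product in the acceptance probability appears to have the same form as
the LIS estimate of equation~(\ref{eq-lis}) applied to this ratio. On
this reading, the update resembles a Metropolis update for $y$ with the
symmetric proposal $S$, in which the marginal ratio $\pi(y_1)/\pi(y_0)$ is
replaced by an estimate found from the linked chains.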
1995:
1996: To use the linked dragging procedure, we need to select suitable
1997: bridge distributions. Since the characteristics of $\pi_{\eta}(x)$
1998: will depend on $y_0$ and $y_1$, and of course $\eta$, we may not know
1999: enough to select good estimates for the values of $r$ needed to use
2000: the optimal bridge of equation~(\ref{eq-opt-bridge}), though we might
2001: try just setting $r$ to one. This is not a problem for the geometric bridge of
2002: equation~(\ref{eq-geo-bridge}), for which the acceptance probability
2003: above can be written as\vspace*{2pt}
2004: \beq
2005: \min \left\{\, 1,\ \
2006: \prod_{j=0}^{n-1} \left[
2007: {1 \over K_j+1}\, \sum_{k=0}^{K_j}\,
2008: \sqrt{{ p_{\eta_{j+1}}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) }}
2009: \ \Big/\
2010: {1 \over K_{j+1}+1}\, \sum_{k=0}^{K_{j+1}}\,
2011: \sqrt{{ p_{\eta_j}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }}\,
2012: \right]
2013: \,\right\}\\[-10pt]\nonumber
2014: \eeq
2015: From equation~(\ref{eq-drag-p}), we see that
2016: \beq
2017: { p_{\eta_{j+1}}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) }
2018: & = & \exp\,(\,-\,(\eta_{j+1}\!-\!\eta_j)\,
2019: (U(x_{j,k},y_1)\!-\!U(x_{j,k},y_0))) \\[6pt]
2020: { p_{\eta_j}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }
2021: & = & \exp\,(\,-\,(\eta_{j+1}\!-\!\eta_j)\,
2022: (U(x_{j+1,k},y_0)\!-\!U(x_{j+1,k},y_1)))
2023: \eeq
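It may also help to note, as an observation that follows from
equation~(\ref{eq-drag-p}) and from the form
$p_{j*j+1}(x) = \sqrt{p_{\eta_j}(x)\,p_{\eta_{j+1}}(x)}$ implied by the
acceptance probability above, that for this sequence the geometric bridge
is itself a member of the family $p_{\eta}$:
\beq
  \sqrt{p_{\eta_j}(x)\,p_{\eta_{j+1}}(x)} & = & p_{\bar\eta_j}(x),
  \ \ \ \mbox{where}\ \ \bar\eta_j \,=\, (\eta_j+\eta_{j+1})\,/\,2 \nonumber
\eeq
Here $\bar\eta_j$ is a symbol introduced only for this remark. Evaluating
the bridge therefore requires no `slow' computations beyond those already
needed for $p_{\eta}$.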
2024: For the simplest case with no intermediate distributions (ie, with $n\!=\!1$),
2025: the acceptance probability simplifies to
2026: \beq
2027: \min \left\{\, 1,\ \
2028: { \displaystyle {1 \over K_0+1}\, \sum_{k=0}^{K_0}\,
\exp\,(\,-\,(U(x_{0,k},y_1)\!-\!U(x_{0,k},y_0))\,/\,2)
\over
\displaystyle {1 \over K_1+1}\, \sum_{k=0}^{K_1}\,
\exp\,(\,-\,(U(x_{1,k},y_0)\!-\!U(x_{1,k},y_1))\,/\,2)
2033: } \right\}
2034: \eeq
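
As a purely illustrative check of this formula, using made-up numbers that
are not results from any experiment, suppose $K_0=K_1=1$, and write
$\Delta(x)$ for $U(x,y_1)-U(x,y_0)$ (a symbol introduced here only for
convenience). If $\Delta$ takes the values 2 and 4 at the two states of
the chain for $\pi_0$, and the values 0 and 2 at the two states of the
chain for $\pi_1$, then the ratio inside the minimum is
\beq
  { (e^{-1}+e^{-2})\,/\,2 \over (e^{0}+e^{1})\,/\,2 }
  & = & e^{-2} \ \approx\ 0.135 \nonumber
\eeq
so the proposal would be accepted with probability of about 0.135.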
2035:
2036:
\section{\hspace*{-7pt}Conclusions and future work}\vspace*{-10pt}
2038:
2039: In this paper, I have demonstrated that in some situations Linked
2040: Importance Sampling is substantially more efficient than Annealed
2041: Importance Sampling, provided a suitable number of intermediate
2042: distributions are used. However, in other situations, where the tails
2043: of the distributions involved are sufficiently heavy, the two methods
2044: are about equally efficient. More research is therefore needed to
2045: determine for which problems of practical interest LIS, and related
2046: linked sampling methods, will be useful.
2047:
2048: In tests on multivariate Gaussian distributions, I have not seen an
2049: advantage for LIS over AIS. Both perform about equally well on a
2050: sequence of 100-dimensional spherical Gaussian distributions with
2051: variances changing by a factor of two, so that $\log(r) = -100$. This
2052: is in accord with the results in Section~\ref{sec-cmp}, where LIS had
2053: little or no advantage over AIS when the distributions were Gaussian.
2054: LIS is more likely to be useful for problems involving continuous
2055: distributions with lighter tails.
2056:
2057: One problem that may benefit from LIS is that of computing the
2058: probability of a very rare event, which can be cast as computing the
2059: normalizing constant for a distribution with the constraint that the
2060: state be in the set corresponding to this event. Intermediate
2061: distributions might use looser forms of this constraint. If, in all
2062: these distributions, states violating the constraints have zero
2063: probability, AIS will tend to have the same bad behaviour seen with
2064: uniform distributions in Section~\ref{sec-unif}, while LIS may work
2065: much better.
2066:
2067: Another context where LIS may outperform AIS is when only a fixed
2068: number of intermediate distributions are available --- ie, only a
2069: finite number of values are allowed for $\eta$. This is the situation
2070: for the `sequential importance sampler' of MacEachern, Clyde, and Liu
2071: (1999), which can be seen as an instance of AIS (Neal 2001). Here,
2072: the intermediate distributions use only a fraction of the $n$ items in
2073: the data set; such a fraction can only have the form $j/n$ with $j$ an
2074: integer. The distance between successive distributions for this
2075: problem may sometimes be too great for AIS to work well, but their
2076: overlap might nevertheless be sufficient for LIS.
2077:
2078: It may be possible to improve LIS by reducing the variance in how well
2079: it samples at each stage. Instead of performing a predetermined
2080: number, $K_j$, of Markov transitions at stage $j$, we might instead
2081: perform as many transitions as are necessary to obtain a good sample.
2082: Define a `tour' to be a sequence of transitions that moves from a high
2083: value of some key quantity (eg, $U(x)$ for the canonical distributions
2084: of equation~(\ref{eq-canonical})) to a low value of this quantity, or
2085: vice versa. Good sampling might be ensured by performing some
2086: predetermined number of tours, with the number of these tours that
2087: occur before and after the link state being chosen at random.
2088: Suitable `high' and `low' values would probably need to be found using
2089: preliminary runs.
2090:
2091: More speculatively, it seems as if there should be some method that
2092: has the advantages of LIS over AIS, but that like AIS uses many
2093: intermediate distributions, performing only a single Markov transition
2094: for each. Intuitively, it seems that such a `smooth' method that does
2095: not abruptly change $\eta$ should be more efficient. One can use LIS
2096: with all $K_j$ set to one, but this will produce good results only if
2097: $n$ is large, which we saw in the analysis of Section~\ref{sec-asym}
2098: does not lead to an advantage over AIS. Perhaps some way could be
2099: found of using states associated with all values of $\eta$ when
2100: estimating each of the ratios $Z_{\eta_{j+1}}/Z_{\eta_j}$, while still
2101: producing an estimate that is exactly unbiased even when the Markov transitions
2102: do not reach equilibrium.
2103:
2104:
2105: \section*{Acknowledgements}\vspace{-10pt}
2106:
2107: This research was supported by the Natural Sciences and Engineering
2108: Research Council of Canada. I hold a Canada Research Chair in
2109: Statistics and Machine Learning.
2110:
2111:
2112: \section*{References}\vspace{-10pt}
2113:
2114: \leftmargini 0.2in
2115: \labelsep 0in
2116:
2117: \begin{description}
2118: \itemsep 2pt
2119:
2120: \item
2121: Bennett, C.~H.\ (1976) ``Efficient estimation of free energy differences
2122: from Monte Carlo data'', {\em Journal of Computational Physics}, vol.~22,
2123: pp.~245-268.
2124:
2125: \item
2126: Crooks, G.~E.\ (2000) ``Path-ensemble averages in systems driven far
2127: from equilibrium'', \textit{Physical Review E}, vol.~61, pp.~2361-2366.
2128:
2129: %\item
2130: % Crooks, G.~E.\ (1999) \textit{Excursions in Statistical Dynamics},
2131: % PhD thesis, Chemistry, University of California at
2132: % Berkeley, available from \texttt{http://threeplusone.com/pubs/GECthesis.html}
2133:
2134: \item
2135: Gelman, A.\ and Meng, X.-L.\ (1998) ``Simulating normalizing constants:
2136: From importance sampling to bridge sampling to path sampling'',
2137: \textit{Statistical Science}, vol.~13, pp.~163-185.
2138:
2139: \item
2140: Hendrix, D.~A.\ and Jarzynski, C.\ (2001) ``A ``fast growth'' method of
2141: computing free energy differences'', \textit{Journal of Chemical Physics},
2142: vol.~114, pp.~5974-5981.
2143:
2144: \item
2145: Jarzynski, C.\ (1997) ``Nonequilibrium equality for free energy differences'',
2146: \textit{Physical Review Letters}, vol.~78, pp.~2690-2693.
2147:
2148: \item
2149: Jarzynski, C.\ (2001) ``A ``fast growth'' method of computing free energy
2150: differences'', \textit{Journal of Chemical Physics}, vol.~114, pp.~5974-5981.
2151:
2152: %\item
2153: % Liu, J.~S.\ (2001) \textit{Monte Carlo Strategies in Scientific Computing},
2154: % Springer-Verlag.
2155:
2156: \item
2157: Lu, N., Singh, J.~K., and Kofke, D.~A.\ (2003) ``Appropriate methods
2158: to combine forward and reverse free-energy perturbation averages'',
2159: \textit{Journal of Chemical Physics}, vol.~118, pp.~2977-2984.
2160:
2161: \item
MacEachern, S.~N., Clyde, M., and Liu, J.~S.\ (1999) ``Sequential
2163: importance sampling for nonparametric Bayes models:\ The next generation'',
2164: \textit{Canadian Journal of Statistics}, vol.~27, pp.~251-267.
2165:
2166: \item
2167: Meng, X.-L.\ and Wong, H.~W.\ (1996) ``Simulating ratios of normalizing
2168: constants via a simple identity: A theoretical exploration'',
2169: \textit{Statistica Sinica}, vol.~6, pp.~831-860.
2170:
2171: \item
2172: Metropolis, N., Rosenbluth, A.~W., Rosenbluth, M.~N., Teller, A.~H.,
2173: and Teller, E.\ (1953) ``Equation of state calculations by fast computing
2174: machines'', {\em Journal of Chemical Physics}, vol.~21, pp.~1087-1092.
2175:
2176: \item
2177: Neal, R.~M.\ (1993) {\em Probabilistic Inference Using Markov Chain
2178: Monte Carlo Methods}, Technical Report CRG-TR-93-1, Dept.\
2179: of Computer Science, University of Toronto, 140 pages.
2180: Obtainable from \texttt{http://www.cs.utoronto.ca/$\sim$radford/}.
2181:
2182: \item
2183: Neal, R.~M.\ (1996) ``Sampling from multimodal distributions using tempered
2184: transitions'', \textit{Statistics and Computing}, vol.~6, pp.~353-366.
2185:
2186: \item
2187: Neal, R.~M.\ (1999) ``Regression and classification using Gaussian process
priors'' (with discussion), in J.~M.~Bernardo, {\em et al.}
2189: (editors) {\em Bayesian Statistics 6}, Oxford University Press,
2190: pp.~475-501.
2191:
2192: \item
2193: Neal, R.~M.\ (2001) ``Annealed importance sampling'',
2194: \textit{Statistics and Computing}, vol.~11, pp.~125-139.
2195:
2196: %\item
2197: % Neal, R.~M.\ (2003) ``Slice sampling'' (with discussion),
2198: % {\em Annals of Statistics}, vol.~1, pp.~705-767.
2199:
2200: \item
2201: Neal, R.~M.\ (2004) ``Taking bigger Metropolis steps by dragging fast
variables'', Technical Report No.~0411, Dept.\ of Statistics, University of
2203: Toronto, 9 pages.
2204:
2205: \item
2206: Schervish, M.~J.\ (1995) \textit{Theory of Statistics}, Springer.
2207:
2208: \item
Shirts, M.~R., Bair, E., Hooker, G., and Pande, V.~S.\ (2003)
2210: ``Equilibrium free energies from nonequilibrium measurements using
2211: maximum-likelihood methods'', \textit{Physical Review Letters},
2212: vol.~91, p.~140601.
2213:
2214: \end{description}
2215:
2216: \end{document}
2217: