
1: \documentclass[11pt]{article}

2: \usepackage{graphicx}

3: \usepackage{latexsym}

4: \usepackage{amssymb}

5:

6: %\topmargin -0.75in

7: %\textheight 9.18in

8: \topmargin -0.5in

9: \textheight 8.8in

10: \oddsidemargin -0.1in

11: \evensidemargin -0.1in

12: \textwidth 6.7in

13: \tolerance=1600 %allow some tolerance in extending after line.

14: \parskip=6pt

15: \overfullrule=0pt %no dark lines if overfull

16: \setlength{\parindent}{12pt}

17: \setlength{\partopsep}{0pt}

18: \setlength{\topsep}{0pt}

19: \renewcommand{\topfraction}{0.9}

20: \renewcommand{\textfraction}{0.1}

21: \setcounter{bottomnumber}{1}

22: \renewcommand{\bottomfraction}{0.5}

23:

24: \def\beq{\begin{eqnarray}}

25: \def\eeq{\end{eqnarray}}

26: \def\I{\mbox{\bf I}}

27: \def\mod{\mbox{mod}\ }

28: \def\geo{^{\mbox{\tiny geo}}}

29: \def\opt{^{\mbox{\tiny opt}}}

30: \def\rhat{\hat r}

31: \def\rhatrev{\underline{\hat r}}

32: \def\rhatSIS{\hat r_{\mbox{\tiny SIS}}}

33: \def\rhatbridge{\hat r_{\mbox{\tiny bridge}}}

34: \def\rhatAIS{\hat r_{\mbox{\tiny AIS}}}

35: \def\rhatLIS{\hat r_{\mbox{\tiny LIS}}}

36: \def\rhatLISave{\hat r_{\mbox{\tiny LIS-ave}}}

37: \def\rhatLISrev{\underline{\hat r}_{\,\mbox{\tiny LIS}}}

38: \def\rhatLISbridged{\hat r_{\mbox{\tiny LIS-bridged}}}

39: \def\Var{\mbox{Var}}

40: \def\Cor{\mbox{Cor}}

41:

42: \begin{document}

43:

44: \fontsize{11}{16pt}\selectfont

45:

46: \begin{center}

47:

48: {\small Technical Report No.\ 0511,

49:  Department of Statistics, University of Toronto}

50:

51: \vspace*{0.45in}

52:

53: {\LARGE \bf Estimating Ratios of Normalizing Constants Using \\[6pt]

54:             Linked Importance Sampling}

55:

56: \vspace*{9pt}

57:

58: {\large Radford M. Neal}\\[4pt]

59:  Department of Statistics and Department of Computer Science \\

60:  University of Toronto, Toronto, Ontario, Canada \\

61:  \texttt{http://www.cs.utoronto.ca/$\sim$radford/} \\

62:  \texttt{radford@stat.utoronto.ca}\\[6pt]

63:

64:  8 November 2005

65:

66: \end{center}

67:

68: \vspace*{8pt}

69:

70:

71: \noindent \textbf{Abstract.}\ \ Ratios of normalizing constants for

72: two distributions are needed both in Bayesian statistics, where they

73: are used to compare models, and in statistical physics, where they

74: correspond to differences in free energy.  Two approaches have long

75: been used to estimate ratios of normalizing constants.  The `simple

76: importance sampling' (SIS) or `free energy perturbation' method uses a

77: sample drawn from just one of the two distributions.  The `bridge

78: sampling' or `acceptance ratio' estimate can be viewed as the ratio of

79: two SIS estimates involving a bridge distribution.  For both methods,

80: difficult problems must be handled by introducing a sequence of

81: intermediate distributions linking the two distributions of interest,

82: with the final ratio of normalizing constants being estimated by the

83: product of estimates of ratios for adjacent distributions in this

84: sequence.  Recently, work by Jarzynski, and independently by Neal, has

85: shown how one can view such a product of estimates, each based on

86: simple importance sampling using a single point, as an SIS estimate on

87: an extended state space.  This `Annealed Importance Sampling' (AIS)

88: method produces an exactly unbiased estimate for the ratio of

89: normalizing constants even when the Markov transitions used do not

90: reach equilibrium.  In this paper, I show how a corresponding `Linked

91: Importance Sampling' (LIS) method can be constructed in which the

92: estimates for individual ratios are similar to bridge sampling

93: estimates.  As a further elaboration, bridge sampling rather than

94: simple importance sampling can be employed at the top level for both

95: AIS and LIS, which sometimes produces further improvement.  I show

96: empirically that for some problems, LIS estimates are much more

97: accurate than AIS estimates found using the same computation time,

98: although for other problems the two methods have similar performance.

99: Like AIS, LIS can also produce estimates for expectations, even when

100: the distribution contains multiple isolated modes.  AIS is related to

101: the `tempered transition' method for handling isolated modes, and to a

102: method for `dragging' fast variables.  Linked sampling methods similar

103: to LIS can be constructed that are analogous to tempered transitions

104: and to this method for dragging fast variables, which may sometimes

105: work better than those analogous to AIS.

106:

107: \newpage

108:

109:

110: \section{\hspace*{-7pt}Introduction}\label{sec-intro}\vspace*{-10pt}

111:

112: Consider two distributions on the same space, with probability mass or

113: density functions $\pi_0(x) = p_0(x)/Z_0$ and $\pi_1(x) =

114: p_1(x)/Z_1$.  Suppose that we are not able to directly compute $\pi_0$

115: and $\pi_1$, but only $p_0$ and $p_1$, since we do not know the

116: normalizing constants, $Z_0$ and $Z_1$.  We wish to find a Monte Carlo

117: estimate for the ratio of these normalizing constants, $Z_1/Z_0$,

118: which we sometimes denote by $r$, using samples of values drawn (at

119: least approximately) from $\pi_0$ and from $\pi_1$.  Sometimes, we may

120: know $Z_0$, in which case we can arrange for it to be one, so that

121: estimation of this ratio will give the numerical value of $Z_1$.

122: Other times, we will be able to obtain only the ratio of normalizing

123: constants, but this may be sufficient for our purposes.

124:

125: In statistical physics, $x$ represents the state of some physical

126: system, and the distributions are typically `canonical' distributions

127: having the following form (for $j=0,1$):

128: \beq

129:   p_j(x) & = & \exp(-\beta_j U(x,\lambda_j))

130: \label{eq-canonical}

131: \eeq

132: where $U(x,\lambda_j)$ is an `energy' function, which may depend on the

133: parameter $\lambda_j$, and $\beta_j$ is the inverse temperature of

134: system $j$.  Many interesting properties of the systems are related

135: to the `free energy', defined as $-\log(Z_j)\,/\,\beta_j$.  Often, only

136: the difference in free energy between systems $0$ and $1$ is relevant,

137: and this is determined by the ratio $Z_1/Z_0$.
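
For example, in the common case where both systems are at the same
inverse temperature, $\beta_0=\beta_1=\beta$ (an assumption made here
only for illustration), the free energy difference follows directly
from the definition above.  Writing $F_j$, a shorthand introduced here,
for the free energy of system $j$:
\begin{eqnarray*}
  F_1 - F_0 & = & -{\log(Z_1) \over \beta} \,+\, {\log(Z_0) \over \beta}
     \ \ =\ \ -{1 \over \beta}\, \log\! \left( {Z_1 \over Z_0} \right)
\end{eqnarray*}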

138:

139: In Bayesian statistics, $x$ comprises the parameters and latent

140: variables for some statistical model, $\pi_0$ is the prior

141: distribution for these quantities (for which the normalizing constant

142: is usually known), and $\pi_1$ is the posterior distribution given the

143: observed data.  We can compute $p_1(x)$ as the product of the prior

144: density for $x$ and the probability of the data given $x$, but the

145: normalizing constant, $Z_1$, is difficult to compute.  We can

146: interpret $Z_1$ as the `marginal likelihood' --- the probability of

147: the observed data under this model, integrating over possible values

148: of the model's parameters and latent variables.  The marginal

149: likelihood for a model indicates how well it is supported

150: by the data.

151:

152: Although I will use simple distributions as illustrations in this

153: paper, in real applications, $x$ is usually high dimensional, and at

154: least one of $\pi_0$ and $\pi_1$ is usually quite complex.

155: Accordingly, sampling from these distributions generally requires use

156: of Markov chain methods, such as the venerable Metropolis algorithm

157: (Metropolis, \textit{et al} 1953).  See (Neal 1993) for a review of

158: Markov chain sampling methods.  Sometimes, however, $\pi_0$ will be

159: relatively simple, and independent points drawn from it can be

160: generated efficiently, as would often be the case with the prior

161: distribution for a Bayesian model, or for a physical system at

162: infinite temperature ($\beta_0=0$).

163:

164: Many methods for estimating ratios of normalizing constants from Monte

165: Carlo data have been investigated in the physics literature (for a

166: review, see (Neal 1993, Section 6.2)), and later rediscovered in the

167: statistics literature (Gelman and Meng 1998).  A logical method to

168: start with is `simple importance sampling' (SIS), also called `free energy

169: perturbation', based on the following identity, which can

170: easily be proved on the assumption that no region having zero probability

171: under $\pi_0$ has non-zero probability under $\pi_1$:

172: \beq

173:   {Z_1 \over Z_0} & = & E_{\pi_0}\! \left[ {p_1(X) \over p_0(X)} \right]

174:    \ \ \approx \ \ {1 \over N} \sum_{i=1}^N {p_1(x^{(i)}) \over p_0(x^{(i)})}

175:    \ \ =\ \  {1 \over N} \sum_{i=1}^N \rhatSIS^{(i)}

176:    \ \ =\ \ \rhatSIS

177: \label{eq-simple}

178: \eeq

179: In the above equation, $E_{\pi_0}$ denotes an expectation with

180: respect to the distribution

181: $\pi_0$, which is estimated by a Monte Carlo average over points

182: $x^{(1)},\ldots,x^{(N)}$ drawn from $\pi_0$ (either independently, or using a

183: Markov chain sampler).

184: Here and later, $\hat r_{\mbox{\tiny M}}$ will denote an estimate of

185: $r=Z_1/Z_0$, found by method M.  If this estimate is an average of

186: unbiased estimates based on a number of samples, these individual

187: estimates will be denoted by $\hat r_{\mbox{\tiny M}}^{(i)}$.
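
As a sketch of why the identity in equation~(\ref{eq-simple}) holds,
written here for densities (sums replace integrals for probability mass
functions) and under the support condition stated above:
\begin{eqnarray*}
  E_{\pi_0}\! \left[ {p_1(X) \over p_0(X)} \right]
   & = & \int \pi_0(x)\, {p_1(x) \over p_0(x)}\, dx
   \ \ =\ \ \int {p_0(x) \over Z_0}\, {p_1(x) \over p_0(x)}\, dx
   \ \ =\ \ {1 \over Z_0} \int p_1(x)\, dx
   \ \ =\ \ {Z_1 \over Z_0}
\end{eqnarray*}
The integrals are over the region where $p_0(x)>0$.  By assumption this
region contains all of the probability under $\pi_1$, so the final
integral is indeed $Z_1$.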

188:

189: The simple importance sampling estimate, $\rhatSIS$, will be poor

190: if $\pi_0$ and $\pi_1$ are not close enough --- in particular, if any

191: region with non-negligible probability under $\pi_1$ has very small

192: probability under $\pi_0$.  Such a region would have an important

193: effect on the value of $r$, but very little information about it would

194: be contained in the sample from $\pi_0$.  In such a situation, it may

195: be possible to obtain a good estimate by introducing intermediate

196: distributions.  Parameterizing these distributions in some way using

197: $\eta$, we can define a sequence of distributions,

198: $\pi_{\eta_0},\ldots,\pi_{\eta_n}$, with $\eta_0=0$ and $\eta_n=1$ so

199: that the first and last distributions in the sequence are $\pi_0$ and

200: $\pi_1$, with the intermediate distributions interpolating between

201: them.  We can then write

202: \beq

203:   {Z_1 \over Z_0} & = & \prod_{j=0}^{n-1} {Z_{\eta_{j+1}} \over Z_{\eta_j}}

204: \label{eq-intermed}

205: \eeq

206: Provided that $\pi_{\eta_{j+1}}$ and $\pi_{\eta_j}$ are close enough,

207: we can estimate each of the factors

208: $Z_{\eta_{j+1}}/Z_{\eta_j}$ using simple

209: importance sampling, and from these estimates obtain an estimate for $Z_1/Z_0$.

210:

211: We can obtain good estimates in a wider range of situations, or using

212: fewer intermediate distributions (sometimes none), by applying a

213: technique introduced by Bennett (1976), who called it the `acceptance

214: ratio' method.  This method was later rediscovered by Meng and Wong

215: (1996), who called it `bridge sampling'.  Lu, Singh, and Kofke (2003)

216: provide a recent review and assessment.  One way of viewing this

217: method is that it replaces the simple importance sampling estimate for

218: $Z_1/Z_0$ by a ratio of estimates for $Z_*/Z_0$ and $Z_*/Z_1$, where

219: $Z_*$ is the normalizing constant for a `bridge distribution',

220: $\pi_*(x) = p_*(x)/Z_*$, which is chosen so that it is overlapped by

221: both $\pi_0$ and $\pi_1$.  Using simple importance sampling estimates

222: for $Z_*/Z_0$ and $Z_*/Z_1$, we can obtain the estimate

223: \beq

224:   {Z_1 \over Z_0} & = &

225:      E_{\pi_0}\! \left[ {p_*(X) \over p_0(X)} \right] \, \Big/\,

226:      E_{\pi_1}\! \left[ {p_*(X) \over p_1(X)} \right]

227:   \ \ \approx \ \

228:  {1 \over N_0} \sum_{k=1}^{N_0} {p_*(x_{0,k}) \over p_0(x_{0,k})} \ \Big/\

229:  {1 \over N_1} \sum_{k=1}^{N_1} {p_*(x_{1,k}) \over p_1(x_{1,k})}

230:    \ \ =\ \ \rhatbridge\ \ \ \ \

231: \label{eq-bridge}

232: \eeq

233: where $x_{0,1},\ldots,x_{0,N_0}$ are drawn from $\pi_0$ and

234: $x_{1,1},\ldots,x_{1,N_1}$ are drawn from $\pi_1$.
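
The identity underlying equation~(\ref{eq-bridge}) can be checked by
writing each expectation as an integral.  As a sketch, assuming
densities, and assuming that $p_*(x)$ is zero wherever $p_0(x)$ or
$p_1(x)$ is zero:
\begin{eqnarray*}
  E_{\pi_0}\! \left[ {p_*(X) \over p_0(X)} \right]
   & = & {1 \over Z_0} \int p_*(x)\, dx \ \ =\ \ {Z_* \over Z_0} \\[4pt]
  E_{\pi_1}\! \left[ {p_*(X) \over p_1(X)} \right]
   & = & {1 \over Z_1} \int p_*(x)\, dx \ \ =\ \ {Z_* \over Z_1}
\end{eqnarray*}
The ratio of these two expectations is therefore
$(Z_*/Z_0)\,/\,(Z_*/Z_1) = Z_1/Z_0$, whatever the value of $Z_*$,
provided that $Z_*>0$.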

235:

236: One simple choice for the bridge distribution is the `geometric' bridge:

237: \beq

238:   p\geo_*(x) & = & \sqrt{p_0(x)p_1(x)}

239: \label{eq-geo-bridge}

240: \eeq

241: which is in a sense half-way between $\pi_0$ and $\pi_1$.

242: As discussed by Bennett (1976) and by Meng and Wong (1996), the asymptotically

243: optimal choice of bridge distribution is

244: \beq

245:   p\opt_*(x) & = & { p_0(x)p_1(x) \over r (N_0/N_1) p_0(x)\, +\, p_1(x)}

246: \label{eq-opt-bridge}

247: \eeq

248: where $r=Z_1/Z_0$.  Of course, we cannot use this bridge distribution

249: in practice, since we do not know $r$.  We can use a preliminary guess

250: at $r$ to define an initial bridge distribution, however, which will

251: give us a bridge sampling estimate for $Z_1/Z_0$.  Using this estimate

252: as the new value of $r$, we can refine our bridge distribution, iterating

253: this process as many times as desired.  The result of this iteration can

254: also be viewed as a maximum likelihood estimate for $r$, as discussed by

255: Shirts, \textit{et~al} (2003), who argue on this basis that it is

256: asymptotically as good as any estimate for $r$.   I have found that

257: estimates with $r$ set iteratively are often better than those found

258: with the true value of $r$ (which does not contradict optimality of the true

259: value for a fixed choice of bridge distribution).
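
Written out, one step of this iteration takes the following form.
This is a sketch obtained by substituting
equation~(\ref{eq-opt-bridge}) into equation~(\ref{eq-bridge}); the
notation $\tilde r_t$ for the value of $r$ used at iteration $t$ is
introduced here only for illustration:
\begin{eqnarray*}
  \tilde r_{t+1} & = &
   {1 \over N_0} \sum_{k=1}^{N_0}
     {p_1(x_{0,k}) \over \tilde r_t\, (N_0/N_1)\, p_0(x_{0,k}) \,+\, p_1(x_{0,k})}
   \ \Big/\
   {1 \over N_1} \sum_{k=1}^{N_1}
     {p_0(x_{1,k}) \over \tilde r_t\, (N_0/N_1)\, p_0(x_{1,k}) \,+\, p_1(x_{1,k})}
\end{eqnarray*}
This follows because $p\opt_*(x)/p_0(x) = p_1(x)\,/\,(r (N_0/N_1) p_0(x) + p_1(x))$
and $p\opt_*(x)/p_1(x) = p_0(x)\,/\,(r (N_0/N_1) p_0(x) + p_1(x))$.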

260:

261: If $\pi_0$ and $\pi_1$ do not overlap sufficiently, no bridge

262: distribution will produce good estimates, and we will have to

263: introduce intermediate distributions as in

264: equation~(\ref{eq-intermed}).  Note, however, that the bridge sampling

265: estimate with either of the above bridge distributions converges

266: to the correct ratio asymptotically as long as there is some region that

267: has non-zero probability under both $\pi_0$ and $\pi_1$, a much weaker

268: requirement than that for simple importance sampling.

269:

270: This advantage of bridge sampling over SIS can be seen in a simple

271: example involving distributions that are uniform over an interval of the

272: reals.  Let $p_0(x) = I_{(0,3)}(x)$ and $p_1(x)=I_{(2,4)}(x)$, so that

273: $Z_0=3$ and $Z_1=2$.  The simple importance sampling estimate of

274: equation~(\ref{eq-simple}) does not work, as it converges to $1/3$

275: rather than $2/3$.  However, using a bridge distribution with

276: $p_*(x)=I_{(2,3)}$, which is effectively what both $p_*\opt$ and

277: $p_*\geo$ will be in this example, the bridge sampling estimate of

278: equation~(\ref{eq-bridge}) converges to the correct value, since the

279: numerator converges to $1/3$ and the denominator to $1/2$.
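
The limits quoted in this example can be verified directly.  With
$\pi_0(x) = I_{(0,3)}(x)/3$, $\pi_1(x) = I_{(2,4)}(x)/2$, and
$p_*(x) = I_{(2,3)}(x)$:
\begin{eqnarray*}
  E_{\pi_0}\! \left[ {p_1(X) \over p_0(X)} \right]
    & = & \int_0^3 {1 \over 3}\, I_{(2,4)}(x)\, dx \ \ =\ \ {1 \over 3} \\[4pt]
  E_{\pi_0}\! \left[ {p_*(X) \over p_0(X)} \right] \, \Big/\,
  E_{\pi_1}\! \left[ {p_*(X) \over p_1(X)} \right]
    & = & \int_0^3 {1 \over 3}\, I_{(2,3)}(x)\, dx \ \Big/
          \int_2^4 {1 \over 2}\, I_{(2,3)}(x)\, dx
    \ \ =\ \ {1/3 \over 1/2} \ \ =\ \ {2 \over 3}
\end{eqnarray*}
The geometric bridge gives $\sqrt{p_0(x)p_1(x)} = I_{(2,3)}(x)$ exactly.
The optimal bridge is zero outside $(2,3)$ and equal to the constant
$1\,/\,(r (N_0/N_1) + 1)$ inside it; since this constant cancels in the
ratio, the two bridges give the same estimate.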

280:

281: Although both simple importance sampling and bridge sampling have been

282: successfully used in many applications, they have some deficiencies.

283: One issue is that although the SIS estimate of

284: equation~(\ref{eq-simple}) is unbiased for $Z_1/Z_0$, the bridge

285: sampling estimate of equation~(\ref{eq-bridge}) is not, and the same

286: would appear to be the case for an estimate using intermediate

287: distributions (via equation~(\ref{eq-intermed})).  This is of no

288: direct importance, particularly since we are often more interested in

289: $\log(Z_1/Z_0)$ than in $Z_1/Z_0$ itself.  However, it does preclude

290: averaging independent replications of the bridge sampling estimate to

291: obtain a better estimate, since the bias would prevent convergence to

292: the correct value as the number of replications increases.  A more

293: vexing difficulty is that, except sometimes for $\pi_0$, sampling from

294: the distributions $\pi_{\eta}$ must usually be done by Markov chain

295: methods, which approach the desired distribution only asymptotically.

296: To speed convergence, the Markov chain for sampling $\pi_{\eta_j}$ is

297: often started from the last state sampled for $\pi_{\eta_{j-1}}$, but

298: it is unclear how many iterations should then be discarded before an

299: adequate approximation to the correct distribution is reached.

300:

301: Surprisingly, these difficulties can be completely overcome when using

302: simple importance sampling with a single point.  As shown by Jarzynski

303: (1997, 2001), and later independently by myself (Neal 2001), an estimate for

304: $Z_1/Z_0$ using intermediate distributions as in

305: equation~(\ref{eq-intermed}) will be exactly unbiased if each of the

306: ratios $Z_{\eta_{j+1}}/Z_{\eta_j}$ is estimated using the simple

307: importance sampling estimate of equation~(\ref{eq-simple}) with $N=1$,

308: sampling each distribution with a Markov chain update starting with the

309: point for the previous distribution.

310: Averaging the estimates obtained from $M$ independent replications of this

311: process (called `runs') produces the following estimate:

312: \beq

313:   {Z_1 \over Z_0} & \approx &

314:     {1 \over M}\, \sum_{i=1}^M\, \prod_{j=0}^{n-1}\,

315:     {p_{\eta_{j+1}}(x^{(i)}_j) \over p_{\eta_j}(x^{(i)}_j)}

316:     \ \ =\ \ {1 \over M} \sum_{i=1}^M \rhatAIS^{(i)}

317:     \ \ =\ \ \rhatAIS

318: \label{eq-ais-est}

319: \eeq

320: Here, $x^{(1)}_0,\ldots,x^{(M)}_0$ are drawn independently from $\pi_0$,

321: and each $x^{(i)}_j$ for $j>0$ is generated by applying a Markov chain

322: transition that leaves $\pi_{\eta_j}$ invariant to $x^{(i)}_{j-1}$.  This

323: single Markov transition (which could, however, consist of several Metropolis

324: or other updates if we so choose), will usually not be enough to reach

325: equilibrium, but the estimate $\rhatAIS$ is nevertheless exactly unbiased, and

326: will converge to the true value as $M$ increases, provided that no region

327: having zero probability under $\pi_{\eta_j}$ has non-zero probability

328: under $\pi_{\eta_{j+1}}$.  This can be proved by showing how the

329: estimate above can be seen as a simple importance sampling estimate on an

330: extended state space that includes the values sampled for the intermediate

331: distributions.
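
As a complement to the extended-state-space argument, unbiasedness can
also be checked by direct integration in a small case.  The following
sketch takes $n=2$ (one intermediate distribution), assumes densities
and the support condition above, and drops the run index on the states.
It writes $T_{\eta_1}(x,x')$, notation introduced here, for the
transition density used to generate $x_1$ from $x_0$:
\begin{eqnarray*}
  E\Big[\rhatAIS^{(i)}\Big]
  & = & \int\!\!\int \pi_0(x_0)\, T_{\eta_1}(x_0,x_1)\,
        {p_{\eta_1}(x_0) \over p_0(x_0)}\, {p_1(x_1) \over p_{\eta_1}(x_1)}\,
        dx_0\, dx_1 \\[4pt]
  & = & {Z_{\eta_1} \over Z_0} \int \left[ \int \pi_{\eta_1}(x_0)\,
        T_{\eta_1}(x_0,x_1)\, dx_0 \right]
        {p_1(x_1) \over p_{\eta_1}(x_1)}\, dx_1 \\[4pt]
  & = & {Z_{\eta_1} \over Z_0} \int \pi_{\eta_1}(x_1)\,
        {p_1(x_1) \over p_{\eta_1}(x_1)}\, dx_1
  \ \ =\ \ {Z_{\eta_1} \over Z_0} \cdot {Z_1 \over Z_{\eta_1}}
  \ \ =\ \ {Z_1 \over Z_0}
\end{eqnarray*}
The second line uses $\pi_0(x_0)\,p_{\eta_1}(x_0)/p_0(x_0) =
(Z_{\eta_1}/Z_0)\,\pi_{\eta_1}(x_0)$, and the third uses only the
invariance of $\pi_{\eta_1}$ under $T_{\eta_1}$.  Nowhere is it assumed
that $x_1$ has reached equilibrium.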

332:

333: I call this method `Annealed Importance Sampling' (AIS), since the

334: sequence of distributions used often corresponds to an `annealing'

335: procedure, in which the temperature is gradually decreased.  As I

336: discuss in (Neal 2001), this allows the procedure to sample different

337: isolated modes of the distribution on different runs, properly

338: weighting the points obtained from each of these runs to produce the

339: correct probability for each mode.  AIS is related to an earlier

340: method for moving between isolated modes that I call `tempered

341: transitions' (Neal 1996).  In a recent paper (Neal 2004), I show how

342: tempered transitions can be modified to produce a method for efficient

343: Markov chain sampling when some of the state variables are `fast' ---

344: ie, when it is possible to more quickly recompute the probability of a

345: state when only these fast variables change than when the other `slow'

346: variables change as well.  In this method, the fast variables are

347: `dragged' through intermediate distributions in order to produce more

348: appropriate values to go with a proposed change to the slow variables.

349: Deciding whether to accept the final proposal involves what is in

350: effect an estimate of the ratio of normalizing constants for the

351: conditional distributions of the fast variables.

352:

353: In this paper, I show how the ideas behind Annealed Importance

354: Sampling and bridge sampling can be combined.  I call the resulting

355: method `Linked Importance Sampling' (LIS), since the two samples

356: needed for bridge sampling are linked by a single state that is used

357: in both.  Intermediate distributions can be used, with each

358: distribution being linked by a single state to the next distribution.

359: In contrast to bridge sampling, LIS estimates are unbiased, and as is

360: the case for AIS, they remain exactly unbiased even when intermediate

361: distributions are used, and when sampling is done using Markov chain

362: transitions that have not converged to their equilibrium

363: distributions.

364:

365: Crooks (2000) mentions a different way of combining AIS with bridge

366: sampling --- since AIS estimates are simple importance sampling

367: estimates on an extended state space, we can combine `forward' and

368: `reverse' estimates to produce a bridge sampling estimate that may be

369: superior.  I will call this method `bridged AIS'.  Similarly,

370: such a top-level application of bridge sampling can be combined with

371: the low-level application of bridge sampling in LIS, giving what I

372: call `bridged LIS'.

373:

374: Using tests on sequences of one-dimensional distributions, I

375: demonstrate that for some problems LIS is much more efficient than AIS

376: --- a result that should be expected, since in extreme cases, such as

377: for the uniform distributions discussed above, the simple importance

378: sampling estimates underlying AIS do not converge to the correct

379: answer even asymptotically, whereas bridge sampling estimates do.  For

380: some other problems, however, AIS and LIS perform about equally well.

381: The bridged version of AIS sometimes performs much better than the

382: unbridged version, but still performs less well than LIS and its

383: bridged version on some problems.  I also analyse the asymptotic

384: properties of AIS and LIS for some types of distribution, providing

385: additional insight into their behaviour.

386:

387: Variants of tempered transitions and of my method for dragging fast

388: variables can be constructed that are analogous to LIS rather than to

389: AIS.  I discuss the `linked' variant of tempered transitions briefly,

390: and include a more detailed description of a linked version of

391: dragging, which may sometimes be better than the version related to

392: AIS.  I conclude by discussing some possibilities for future research.

393:

394:

395: \section{\hspace*{-7pt}The Linked Importance Sampling

396:          procedure}\label{sec-lis}\vspace*{-10pt}

397:

398: Assume that we can evaluate the unnormalized probability or density

399: functions $p_{\eta}(x)$, for any value of the parameter $\eta$, with

400: the normalized form of such a distribution being denoted by

401: $\pi_{\eta}$.  The values $\eta=0$ and $\eta=1$ define the two

402: distributions we are interested in, for which the normalizing

403: constants are $Z_0$ and $Z_1$.  A sequence of $n\!-\!1$ intermediate

404: values for $\eta$ define distributions that will assist in estimating

405: the ratio of these normalizing constants, $r=Z_1/Z_0$.  We denote the

406: values of $\eta$ for the distributions used by $\eta_0,\ldots,\eta_n$,

407: with $\eta_0=0$ and $\eta_n=1$.  Typically, $\eta_j<\eta_{j+1}$ for

408: all $j$.

409:

410: For problems in statistical physics, $\eta$ might be proportional to

411: the inverse temperature, $\beta$, of equation~(\ref{eq-canonical}), or

412: might map to a value for $\lambda$.  For a Bayesian inference

413: problem, $\eta$ might be a power that the likelihood is raised to, so

414: that $\eta=0$ causes the data to be ignored, and $\eta=1$ gives full

415: weight to the data; the ratio $Z_1/Z_0$ will then be the marginal

416: likelihood.  In both of these examples, progressing in small steps

417: from $\eta=0$ to $\eta=1$ is not only useful in estimating $Z_1/Z_0$,

418: but also often has an `annealing' effect, which helps avoid being

419: trapped in a local mode of the distribution.
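
To make these two examples concrete, one possible parameterization of
each is shown below.  These forms are illustrative assumptions rather
than the only choices: $p(x)$ and $L(x)$ are notation introduced here
for the prior density and the likelihood, and the physical example
holds $\lambda$ fixed and takes the inverse temperature to be
$\eta\beta$:
\begin{eqnarray*}
  p_{\eta}(x) & = & \exp(-\eta\,\beta\, U(x,\lambda))
     \ \ \ \ \mbox{(statistical physics)} \\[4pt]
  p_{\eta}(x) & = & p(x)\, L(x)^{\eta}
     \ \ \ \ \mbox{(Bayesian inference)}
\end{eqnarray*}
In the Bayesian case, $p_0(x) = p(x)$, so that $Z_0=1$ when the prior
density is normalized, and $p_1(x) = p(x)\,L(x)$, so that $Z_1$ is the
marginal likelihood.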

420:

421: \subsection{\hspace*{-4pt}Details of the LIS procedure}\vspace*{-4pt}

422:

423: For each distribution, $\pi_{\eta}$, assume we have a pair of Markov chain

424: transition probability (or density) functions, denoted by $T_{\eta}(x,x')$

425: and $\underline{T}_{\eta}(x,x')$, satisfying $\int T_{\eta}(x,x') dx' = 1$

426: and $\int \underline{T}_{\eta}(x,x') dx' = 1$, for which the following mutual

427: reversibility relationship holds:

428: \beq

429:   \pi_{\eta}(x)\,T_{\eta}(x,x') & = &

430:      \pi_{\eta}(x')\,\underline{T}_{\eta}(x',x),\ \ \ \

431:   \mbox{for all $x$ and $x'$}

432: \label{eq-rev}

433: \eeq

434: From this relationship, one can easily show that both $T_{\eta}$ and

435: $\underline{T}_{\eta}$ leave $\pi_{\eta}$ invariant --- ie, that

436: $\int \pi_{\eta}(x)

437: T_{\eta}(x,x') dx = \pi_{\eta}(x')$, and the same for $\underline{T}_{\eta}$.

438: If $T_{\eta}$ is reversible (ie, satisfies `detailed balance'), then

439: $\underline{T}_{\eta}$ will be the same as $T_{\eta}$.  Non-reversible

440: transitions often arise when components of state are updated in some

441: predetermined order, in which case the reverse transition simply updates

442: components in the opposite order.  As a special case, $T_{\eta}$ might

443: draw the next state from $\pi_{\eta}$ independently of the current state.

444: Such independent sampling may often be possible for $T_0$.
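
The invariance claim can be verified in one line.  As a sketch,
integrating both sides of equation~(\ref{eq-rev}) over $x$ and using
$\int \underline{T}_{\eta}(x',x)\, dx = 1$ gives
\begin{eqnarray*}
  \int \pi_{\eta}(x)\, T_{\eta}(x,x')\, dx
   & = & \int \pi_{\eta}(x')\, \underline{T}_{\eta}(x',x)\, dx
   \ \ =\ \ \pi_{\eta}(x') \int \underline{T}_{\eta}(x',x)\, dx
   \ \ =\ \ \pi_{\eta}(x')
\end{eqnarray*}
Integrating over $x'$ instead, and using $\int T_{\eta}(x,x')\, dx' = 1$,
gives $\int \pi_{\eta}(x')\, \underline{T}_{\eta}(x',x)\, dx' = \pi_{\eta}(x)$,
which is invariance under $\underline{T}_{\eta}$.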

445:

446: These Markov chain transitions are used to obtain samples that are

447: approximately drawn from each of the $n\!+\!1$ distributions,

448: $\pi_{\eta_0},\ldots,\pi_{\eta_n}$.  We assume that we can begin

449: sampling from $\pi_0$ by drawing a single point independently from

450: $\pi_0$.  For $j>0$, we begin sampling from $\pi_{\eta_j}$ by

451: selecting a link state, $x_{j-1*j}$, from the sample associated with

452: $\pi_{\eta_{j-1}}$.  For all $j$, we produce a sample of $K_j\!+\!1$

453: states from this starting point by applying a total of $K_j$ forward

454: ($T_{\eta_j}$) or reversed ($\underline T_{\eta_j}$) Markov

455: transitions.  Link states are selected using bridge distributions,

456: $p_{j*j+1}$, which are defined in terms of $p_{\eta_j}$ and

457: $p_{\eta_{j+1}}$, perhaps using the form of

458: equation~(\ref{eq-geo-bridge}) or~(\ref{eq-opt-bridge}), with $p_0$

459: replaced by $p_{\eta_j}$ and $p_1$ by $p_{\eta_{j+1}}$.

460:

461: In detail, the Linked Importance Sampling procedure produces $M$ estimates,

462: $\rhatLIS^{(1)},\ldots,\rhatLIS^{(M)}$, that are averaged to produce

463: the final estimate, $\rhatLIS$.  Each $\rhatLIS^{(i)}$ is

464: obtained by performing the following:\vspace*{5pt}

465:

466: \begin{center}\bf The LIS Procedure\end{center}\vspace*{-5pt}

467:

468: \begin{enumerate}

469: \item[1)] Pick an integer $\nu_0$ uniformly at random from $\{0,\ldots,K_0\}$,

470:           and then set $x_{0,\nu_0}$ to a value drawn from $\pi_{\eta_0}$.

471: \item[2)] For $j\,=\,0,\ldots,n$, sample $K_j\!+\!1$ states drawn (at

472:   least approximately) from $\pi_{\eta_j}$  as follows:

473:   \begin{enumerate}

474:   \item[a)] If $j>0$:\ \ Pick an integer $\nu_j$ uniformly at random from

475:             $\{0,\ldots,K_j\}$, and then set $x_{j,\nu_j}$ to $x_{j-1*j}$.

476:   \item[b)] For $k\,=\,\nu_j+1,\ldots,K_j$, draw $x_{j,k}$ according to the

477:             forward Markov chain transition probabilities

478:             $T_{\eta_j}(x_{j,k-1},x_{j,k})$.  (If $\nu_j=K_j$, do nothing in

479:             this step.)

480:   \item[c)] For $k\,=\,\nu_j-1,\ldots,0$, draw $x_{j,k}$ according to the

481:             reverse Markov chain transition probabilities

482:             $\underline{T}_{\eta_j}(x_{j,k+1},x_{j,k})$. (If $\nu_j=0$, do

483:             nothing in this step.)

484:   \item[d)] If $j<n$:\ \ Pick a value for $\mu_j$ from

485:             $\{0,\ldots,K_j\}$ according to the following

486:             probabilities:\vspace*{-2pt}

487:             \beq

488:               \Pi_0(\mu_j\,|\,x_j) & = &

489:                  {p_{j*j+1}(x_{j,\mu_j}) \over p_{\eta_j}(x_{j,\mu_j})}

490:                  \ \Big/\

491:                  \sum_{k=0}^{K_j} {p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k})}

492:             \label{eq-pmuj}

493:             \eeq

494:             and then set $x_{j*j+1}$ to $x_{j,\mu_j}$.

495:   \end{enumerate}

496: \item[3)] Set $\mu_n$ to a value chosen uniformly at random from

497:           $\{0,\ldots,K_n\}$.  (This selection has no effect on

498:           the estimate, but is used in the proof of correctness.)

499: \item[4)] Compute the estimate from this run as follows:

500: \beq

501:    \rhatLIS^{(i)} & = & \prod_{j=0}^{n-1} \left[

502:      {1 \over K_j+1}\, \sum_{k=0}^{K_j}\,

503:           { p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) }

504:      \ \Big/\

505:      {1 \over K_{j+1}+1}\, \sum_{k=0}^{K_{j+1}}\,

506:           { p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }

507:      \right]

508: \label{eq-lis}

509: \eeq

510: (Note that most of the factors of $1/(K_j\!+\!1)$ and

511: $1/(K_{j+1}\!+\!1)$ cancel, leaving an overall factor of

512: $(K_n\!+\!1)\,/\,(K_0\!+\!1)$, but the redundant factors

513: are retained above for clarity of meaning.)\vspace*{-6pt}

514: \end{enumerate}

515: The result of performing steps (1) through (3) is illustrated in

516: Figure~\ref{fig-lis}.  After $M$ runs of this procedure, the final

517: estimate is computed as

518: \beq

519:   \rhatLIS & = & {1 \over M} \sum_{i=1}^M \rhatLIS^{(i)}

520: \eeq
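
For instance, with no intermediate distributions ($n=1$),
equation~(\ref{eq-lis}) reduces to a single factor:
\begin{eqnarray*}
  \rhatLIS^{(i)} & = &
     {1 \over K_0+1}\, \sum_{k=0}^{K_0}\,
          { p_{0*1}(x_{0,k}) \over p_{0}(x_{0,k}) }
     \ \Big/\
     {1 \over K_1+1}\, \sum_{k=0}^{K_1}\,
          { p_{0*1}(x_{1,k}) \over p_{1}(x_{1,k}) }
\end{eqnarray*}
This has the same form as the bridge sampling estimate of
equation~(\ref{eq-bridge}), with $N_0=K_0\!+\!1$, $N_1=K_1\!+\!1$, and
$p_*=p_{0*1}$.  The difference is that the sample for $\pi_1$ is
generated starting from a link state selected from the sample for
$\pi_0$.  For general $n$, the cancellation noted in step (4) can be
written explicitly, using
$\prod_{j=0}^{n-1} (K_{j+1}\!+\!1)/(K_j\!+\!1) = (K_n\!+\!1)/(K_0\!+\!1)$:
\begin{eqnarray*}
  \rhatLIS^{(i)} & = & {K_n+1 \over K_0+1}\ \prod_{j=0}^{n-1} \left[\,
     \sum_{k=0}^{K_j}\,
          { p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) }
     \ \Big/\
     \sum_{k=0}^{K_{j+1}}\,
          { p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }
     \,\right]
\end{eqnarray*}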

521:

522: \begin{figure}[t]

523:

524: \centerline{\includegraphics[width=6.5in]{fig-lis.eps}}

525:

526: \caption[]{An illustration of Linked Importance Sampling.  One

527: intermediate distribution is used, with $\eta_1=1/2$.  The

528: distributions $\pi_0$, $\pi_{1/2}$, and $\pi_1$ are represented by

529: ovals enclosing the regions of high probability under each

530: distribution.  Nine Markov chain transitions are performed at each

531: stage.  The two link states are shown as black dots.  The initial and

532: final states (indexed by $\nu_0$ and $\mu_n$) are shown as gray dots.

533: Other states generated by the forward and reverse Markov chain

534: transitions are shown as empty dots.  For this run, $\nu_0\!=\!4$,

535: $\mu_0\!=\!9$, $\nu_1\!=\!1$, $\mu_1\!=\!8$, $\nu_2\!=\!3$, and

536: $\mu_2\!=\!7$.}\label{fig-lis}

537:

538: \end{figure}

539:

540: The crucial aspect of Linked Importance Sampling is that when moving

541: from distribution $\pi_{\eta_j}$ to $\pi_{\eta_{j+1}}$, a link state,

542: $x_{j*j+1}$, is randomly selected from among the sample of points

543: $x_{j,0},\ldots,x_{j,K_j}$ that are associated with $\pi_{\eta_j}$.

544: We can view the link state as part of the sample associated with

545: $\pi_{\eta_{j+1}}$ as well as that associated with $\pi_{\eta_j}$.

546: Accordingly, when using the `optimal' bridge of

547: equation~(\ref{eq-opt-bridge}), I will set $N_0/N_1$ to

548: $(K_j\!+\!1)/(K_{j+1}\!+\!1)$, though the proof of optimality for

549: bridge sampling does not guarantee that this is an optimal choice when

550: using this bridge distribution for LIS.

551:

552: \subsection{\hspace*{-4pt}Proof that LIS estimates are unbiased}\vspace*{-4pt}

553:

554: In order to prove that $\rhatLIS^{(i)}$ is an unbiased estimate of

555: $r=Z_1/Z_0$, we can regard steps (1) through (3) above as defining a

556: distribution,

557: $\Pi_0$, over all the quantities involved in the procedure --- namely,

558: $x_j$, $\mu_j$, and $\nu_j$, for $j=0,\ldots,n$, with $x_j$ representing

559: $x_{j,0},\ldots,x_{j,K_j}$.  We then

560: consider the procedure for generating these same quantities in reverse,

561: which operates as follows:\vspace*{5pt}

562:

563: \pagebreak

564:

565: \begin{center}\bf The Reverse LIS Procedure\end{center}\vspace*{-5pt}

566:

567: \begin{enumerate}

568: \item[1)] Pick an integer $\mu_n$ uniformly at random from $\{0,\ldots,K_n\}$,

569:           and then set $x_{n,\mu_n}$ to a value drawn from $\pi_{\eta_n}$.

570: \item[2)] For $j\,=\,n,\ldots,0$, sample $K_j\!+\!1$ states drawn (at

571:   least approximately) from $\pi_{\eta_j}$  as follows:

572:   \begin{enumerate}

573:   \item[a)] If $j<n$:\ \ Pick an integer $\mu_j$ uniformly at random from

574:             $\{0,\ldots,K_j\}$, and then set $x_{j,\mu_j}$ to $x_{j*j+1}$.

575:   \item[b)] For $k\,=\,\mu_j+1,\ldots,K_j$, draw $x_{j,k}$ according to the

576:             forward Markov chain transition probabilities

577:             $T_{\eta_j}(x_{j,k-1},x_{j,k})$.  (If $\mu_j=K_j$, do nothing

578:             in this step.)

579:   \item[c)] For $k\,=\,\mu_j-1,\ldots,0$, draw $x_{j,k}$ according to the

580:             reverse Markov chain transition probabilities

581:             $\underline{T}_{\eta_j}(x_{j,k+1},x_{j,k})$. (If $\mu_j=0$,

582:             do nothing in this step.)

583:   \item[d)] If $j>0$:\ \ Pick a value for $\nu_j$ from

584:             $\{0,\ldots,K_j\}$ according to the following

585:             probabilities:\vspace*{-3pt}

586:             \beq

587:               \Pi_1(\nu_j\,|\,x_j) & = &

588:                  {p_{j-1*j}(x_{j,\nu_j}) \over p_{\eta_j}(x_{j,\nu_j})}

589:                  \ \Big/\

590:                  \sum_{k=0}^{K_j} {p_{j-1*j}(x_{j,k}) \over p_{\eta_j}(x_{j,k})}

591:             \label{eq-pnuj}

592:             \eeq

593:             and then set $x_{j-1*j}$ to $x_{j,\nu_j}$.

594:   \end{enumerate}

595: \item[3)] Set $\nu_0$ to a value chosen uniformly at random from

596:           $\{0,\ldots,K_0\}$.\vspace*{-6pt}

597: \end{enumerate}

598: This reverse procedure also defines a distribution over all the

599: quantities generated ($x_j$, $\mu_j$, and $\nu_j$ for $j=0,\ldots,n$),

600: which will be denoted by $\Pi_1$.

601:

602: We now define the unnormalized probability (density) functions

603: $P_0(x,\mu,\nu) = Z_0 \Pi_0(x,\mu,\nu)$ and

604: $P_1(x,\mu,\nu) = Z_1 \Pi_1(x,\mu,\nu)$.  The ratio of normalizing constants

605: for these distributions is obviously $r=Z_1/Z_0$.  We can estimate this

606: ratio by simple importance sampling, using the ratios

607: \beq

608:    {P_1(x,\mu,\nu) \over P_0(x,\mu,\nu)} & = &

609:    { Z_1\, \Pi_1(\mu_n)\, \pi_{\eta_n}(x_{n,\mu_n})\,

610:      \prod\limits_{j=0}^{n-1} \Pi_1(\mu_j)\,

611:      \prod\limits_{j=0}^n \Pi_1(x_j\,|\,\mu_j,x_{j,\mu_j})\,

612:      \prod\limits_{j=1}^{n} \Pi_1(\nu_j\,|\,x_j)\, \Pi_1(\nu_0)

613:      \over

614:      Z_0\, \Pi_0(\nu_0)\, \pi_{\eta_0}(x_{0,\nu_0})\,

615:      \prod\limits_{j=1}^n \Pi_0(\nu_j)\,

616:      \prod\limits_{j=0}^n \Pi_0(x_j\,|\,\nu_j,x_{j,\nu_j})\,

617:      \prod\limits_{j=0}^{n-1} \Pi_0(\mu_j\,|\,x_j)\, \Pi_0(\mu_n)

618:    }\ \ \

619: \label{eq-ratio01}

620: \eeq

621:

622: From Steps (2b) and (2c) of the forward and reverse procedures, along

623: with the mutual reversibility relationship of equation~(\ref{eq-rev}), we see

624: that

625: \beq

626:   \Pi_0(x_j\,|\,\nu_j,x_{j,\nu_j})

627:   & = &

628:   \prod_{k=\nu_j+1}^{K_j}\!\! T_{\eta_j}(x_{j,k-1},x_{j,k})\ \cdot\

629:   \prod_{k=0}^{\nu_j-1} \underline{T}_{\eta_j}(x_{j,k+1},x_{j,k}) \\[4pt]

630:   & = &

631:   \prod_{k=\nu_j+1}^{K_j}\!\! T_{\eta_j}(x_{j,k-1},x_{j,k})\ \cdot\

632:   \prod_{k=0}^{\nu_j-1} T_{\eta_j}(x_{j,k},x_{j,k+1})\,

633:                         {\pi_{\eta_j}(x_{j,k})\over\pi_{\eta_j}(x_{j,k+1})}

634:   \\[4pt]

635:   & = &

636:   {\pi_{\eta_j}(x_{j,0})\over\pi_{\eta_j}(x_{j,\nu_j})}\

637:   \prod_{k=1}^{K_j}\, T_{\eta_j}(x_{j,k-1},x_{j,k})

638: \label{eq-chain1}

639: \eeq

640: and similarly,

641: \beq

642:   \Pi_1(x_j\,|\,\mu_j,x_{j,\mu_j})

643:   & = &

644:   {\pi_{\eta_j}(x_{j,0})\over\pi_{\eta_j}(x_{j,\mu_j})}\

645:   \prod_{k=1}^{K_j}\, T_{\eta_j}(x_{j,k-1},x_{j,k})

646: \label{eq-chain2}

647: \eeq

648: From this, we see that parts of the ratio in equation~(\ref{eq-ratio01})

649: can be written as

650: \beq

651:    { Z_1\,\pi_{\eta_n}(x_{n,\mu_n})\,

652:      \prod\limits_{j=0}^n \, \Pi_1(x_j\,|\,\mu_j,x_{j,\mu_j})\,

653:      \over

654:      Z_0\,\pi_{\eta_0}(x_{0,\nu_0})\,

655:      \prod\limits_{j=0}^n \, \Pi_0(x_j\,|\,\nu_j,x_{j,\nu_j})\,

656:    }

657:    & = &

658:    {p_{\eta_n}(x_{n,\mu_n}) \over p_{\eta_0}(x_{0,\nu_0})}\,

659:    \prod_{j=0}^n\, {\pi_{\eta_j}(x_{j,\nu_j}) \over \pi_{\eta_j}(x_{j,\mu_j})}

660:    \ \ =\ \

661:    \prod_{j=0}^{n-1}\,

662:       {p_{\eta_{j+1}}(x_{j,\mu_j}) \over p_{\eta_j}(x_{j,\mu_j})}\ \ \

663: \label{eq-fact1}

664: \eeq

665: The last step uses the fact that for $j=1,\ldots,n$,

666: $x_{j,\nu_j} = x_{j-1*j} = x_{j-1,\mu_{j-1}}$.

667:

668: From Steps (1) and (2a), we

669: see that $\Pi_0(\nu_j) = 1\,/\,(K_j\!+\!1)$ and $\Pi_1(\mu_j) =

670: 1\,/\,(K_j\!+\!1)$.   Using this, and again using

671: $x_{j,\nu_j} = x_{j-1,\mu_{j-1}}$, we get that

672: \beq

673:    \lefteqn {{

674:      \prod\limits_{j=0}^{n-1} \Pi_1(\mu_j)\,

675:      \prod\limits_{j=1}^{n} \Pi_1(\nu_j\,|\,x_j)

676:      \over

677:      \prod\limits_{j=1}^{n} \Pi_0(\nu_j)\,

678:      \prod\limits_{j=0}^{n-1} \Pi_0(\mu_j\,|\,x_j)}

679:    \ \ = \ \

680:    { \prod\limits_{j=0}^{n-1} \Pi_1(\nu_{j+1}\,|\,x_{j+1})\,(K_{j+1}\!+\!1)

681:      \over

682:      \prod\limits_{j=0}^{n-1} \Pi_0(\mu_j\,|\,x_j)\,(K_j\!+\!1)

683:    }}\ \ \ \ \ \ \ \ \\[5pt]

684:    & = &

685:    \prod_{j=0}^{n-1}\,\

686:    {\displaystyle

687:     \ {p_{j*j+1}(x_{j+1,\nu_{j+1}}) \over p_{\eta_{j+1}}(x_{j+1,\nu_{j+1}})}

688:     \ \Big/\ {1 \over K_{j+1}\!+\!1}

689:     \sum_{k=0}^{K_{j+1}} {p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k})}\

690:    \over\displaystyle

691:     {p_{j*j+1}(x_{j,\mu_j}) \over p_{\eta_j}(x_{j,\mu_j})}

692:     \ \Big/\ {1 \over K_j\!+\!1}

693:     \sum_{k=0}^{K_j} {p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k})}

694:    } \\[5pt]

695:   & = &

696:    \prod_{j=0}^{n-1}\,

697:     {p_{\eta_j}(x_{j,\mu_j}) \over p_{\eta_{j+1}}(x_{j,\mu_j})}\

698:    \prod_{j=0}^{n-1}\,

699:    \left[

700:     {1 \over K_j\!+\!1}

701:     \sum_{k=0}^{K_j} {p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k})}

702:     \ \Big/\

703:     {1 \over K_{j+1}\!+\!1}

704:     \sum_{k=0}^{K_{j+1}} {p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k})}

705:    \right]\ \ \ \

706: \label{eq-fact2}

707: \eeq

708:

709: From Steps (1) and (3), we see that

710: $\Pi_0(\nu_0) = \Pi_1(\nu_0) = 1\,/\,(K_0\!+\!1)$ and

711: $\Pi_1(\mu_n) = \Pi_0(\mu_n) = 1\,/\,(K_n\!+\!1)$, so these factors

712: cancel in equation~(\ref{eq-ratio01}).  The factors in

713: equation~(\ref{eq-fact1}) cancel with the first part of

714: equation~(\ref{eq-fact2}).  The final result is that the simple importance

715: sampling estimate based on a single LIS run is as shown in

716: equation~(\ref{eq-lis}), demonstrating that $\rhatLIS$ is indeed an unbiased

717: estimate of $r=Z_1/Z_0$.
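To spell out the final cancellation (a restatement of the steps above, not an additional result), let $A_j = {1 \over K_j+1} \sum_{k=0}^{K_j} p_{j*j+1}(x_{j,k})\,/\,p_{\eta_j}(x_{j,k})$ and $B_{j+1} = {1 \over K_{j+1}+1} \sum_{k=0}^{K_{j+1}} p_{j*j+1}(x_{j+1,k})\,/\,p_{\eta_{j+1}}(x_{j+1,k})$, shorthand used only in this remark.  Once the uniform factors for $\nu_0$ and $\mu_n$ have cancelled, the ratio in equation~(\ref{eq-ratio01}) is the product of the left sides of equations~(\ref{eq-fact1}) and~(\ref{eq-fact2}), so that
\begin{eqnarray*}
   {P_1(x,\mu,\nu) \over P_0(x,\mu,\nu)} & = &
   \prod_{j=0}^{n-1}\, {p_{\eta_{j+1}}(x_{j,\mu_j}) \over p_{\eta_j}(x_{j,\mu_j})}
   \ \cdot\
   \prod_{j=0}^{n-1}\, {p_{\eta_j}(x_{j,\mu_j}) \over p_{\eta_{j+1}}(x_{j,\mu_j})}
   \ \cdot\
   \prod_{j=0}^{n-1}\, \Big[ A_j \,\Big/\, B_{j+1} \Big] \\[5pt]
   & = & \prod_{j=0}^{n-1}\, \Big[ A_j \,\Big/\, B_{j+1} \Big]
\end{eqnarray*}
This is the quantity whose logarithm appears in equation~(\ref{eq-logrhatLIS}).  Unbiasedness then follows from the usual importance sampling identity, $E_{\Pi_0}[P_1/P_0] = Z_1/Z_0$, which holds provided that $\Pi_0$ is non-zero wherever $\Pi_1$ is non-zero.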

718:

719: \subsection{\hspace*{-4pt}Bridged LIS estimates}\vspace*{-4pt}

720:

721: Since the LIS estimate can be viewed as a simple importance sampling

722: estimate on an extended space, we can consider a `bridged LIS'

723: estimate in which this top-level SIS estimate is replaced by a bridge

724: sampling estimate.  This will require that we actually perform the reverse

725: LIS procedure described above, from which an  LIS estimate for

726: the reverse ratio, $\underline{r} = Z_0/Z_1$, can be computed:

727: \beq

728:    \rhatLISrev^{(i)} & = & \prod_{j=1}^{n} \left[

729:      {1 \over K_j+1}\, \sum_{k=0}^{K_j}\,

730:           { p_{j-1*j}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) }

731:      \ \Big/\

732:      {1 \over K_{j-1}+1}\, \sum_{k=0}^{K_{j-1}}\,

733:           { p_{j-1*j}(x_{j-1,k}) \over p_{\eta_{j-1}}(x_{j-1,k}) }

734:      \right]

735: \label{eq-lis-rev}

736: \eeq

737: The reversed procedure requires independent sampling from $\pi_1$.

738: This will usually not be possible directly, but well-separated states

739: from a Markov chain sampler with $\pi_1$ as its invariant distribution will

740: provide a good approximation, provided that this sampler moves around the

741: whole distribution, without being trapped in an isolated mode.  Indeed,

742: the entire sample of $K_n\!+\!1$ states from $\pi_1$ that is needed

743: at the start of the reverse procedure can be obtained by taking consecutive

744: states from such a Markov chain sampler.

745:

746: For the bridged form of LIS, we also need a suitable bridge

747: distribution, $P_*$, for which we must be able to evaluate the ratios

748: $P_*/P_0$ and $P_*/P_1$.  (Note that this choice of a

749: `top-level' bridge distribution is separate from the choices of

750: `low-level' bridge distributions, $p_{j*j+1}$, though we might use the same

751: form for both.)  With the optimal bridge of

752: equation~(\ref{eq-opt-bridge}), these ratios can be written as follows,

753: if the forward procedure is performed $M$ times and the reverse procedure

754: $\underline{M}$ times:

755: \beq

756:  {P\opt_*(x,\mu,\nu) \over P_0(x,\mu,\nu)} & = &

757:  \left[\,r\,(M/\underline{M})\,

758:        \left({P_1(x,\mu,\nu) \over P_0(x,\mu,\nu)}\right)^{-1}

759:        \!\! +\ 1\,\right]^{-1}

760:  \\[6pt]

761:  {P\opt_*(x,\mu,\nu) \over P_1(x,\mu,\nu)} & = &

762:  \left[\, r\,(M/\underline{M})\ +\

763:        \left({P_0(x,\mu,\nu) \over P_1(x,\mu,\nu)}\right)^{-1}

764:        \right]^{-1}

765: \eeq

766: The geometric bridge of equation~(\ref{eq-geo-bridge}) results in

767: \beq

768:  {P\geo_*(x,\mu,\nu) \over P_0(x,\mu,\nu)} & = &

769:  \sqrt{P_1(x,\mu,\nu) \over P_0(x,\mu,\nu)}

770:  \\[6pt]

771:  {P\geo_*(x,\mu,\nu) \over P_1(x,\mu,\nu)} & = &

772:  \sqrt{P_0(x,\mu,\nu) \over P_1(x,\mu,\nu)}

773: \eeq

774: These expressions allow us to express bridged LIS estimates in terms

775: of the simple LIS estimate of equation~(\ref{eq-lis}), and its reverse

776: version of equation~(\ref{eq-lis-rev}).  For the optimal bridge, we get

777: \beq

778:  \rhatLISbridged\opt & = &

779:    {1 \over M} \sum_{i=1}^M\,

780:       {1 \over r\,(M/\underline{M})\,/\,\rhatLIS^{(i)}\ +\ 1}

781:    \,\ \Big/\

782:    {1 \over \underline{M}} \sum_{i=1}^{\underline{M}}\,

783:       {1 \over r\,(M/\underline{M})\ +\ 1/\rhatLISrev^{(i)}}

784: \label{eq-bridged-lis1}

785: \eeq

786: Similarly, for the geometric bridge, we get

787: \beq

788:  \rhatLISbridged\geo

789:    & = & {1 \over M} \sum_{i=1}^M\, \sqrt{\rhatLIS^{(i)}} \,\ \Big/\

790:          {1 \over \underline{M}} \sum_{i=1}^{\underline{M}}\,

791:            \sqrt{\rhatLISrev^{(i)}}

792: \label{eq-bridged-lis2}

793: \eeq
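As a small numerical illustration of equation~(\ref{eq-bridged-lis2}), using made-up values chosen only to keep the arithmetic easy (they are not results from this report), suppose $M=\underline{M}=2$, with forward estimates $4$ and $9$ and reverse estimates $1/4$ and $1/9$.  Then
\begin{eqnarray*}
 \rhatLISbridged\geo
   & = & {\textstyle {1 \over 2}} \Big(\sqrt{4}+\sqrt{9}\,\Big) \ \Big/\
         {\textstyle {1 \over 2}} \Big(\sqrt{1/4}+\sqrt{1/9}\,\Big)
   \ \ =\ \ {5/2 \over 5/12} \ \ =\ \ 6
\end{eqnarray*}
In this toy case the result equals the geometric mean of the two forward estimates, which happens because the reverse estimates were taken to be exact reciprocals of the forward ones.  In practice the forward and reverse runs would not match in this way.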

794:

795: \subsection{\hspace*{-4pt}LIS estimates with independent sampling with no

796:                           intermediate distributions}\vspace*{-4pt}

797:

798: It is interesting to look at the special case of Linked Importance

799: Sampling with $n=1$ --- ie, in which there are no intermediate

800: distributions between $\pi_0$ and $\pi_1$ --- in which the points from both

801: $\pi_0$ and $\pi_1$ are sampled independently.  The LIS procedure

802: can then be simplified somewhat, and it is also possible to improve

803: the LIS estimate by averaging over the choice of link state.  Such

804: averaging is not feasible when Markov chain sampling is used, since

805: choosing a different link state would require a new simulation of the

806: Markov transitions.

807:

808: Since we will sample points independently, there is no need to decide

809: how many points will be sampled by the forward transitions and how

810: many by the reverse transitions in Steps (2b) and (2c) of the LIS

811: procedure.  We simply obtain a pair of samples consisting of

812: points $x_{0,0},\ldots,x_{0,K_0}$ drawn independently from $\pi_0$,

813: and points $x_{1,1},\ldots,x_{1,K_1}$ drawn independently from

814: $\pi_1$.  We then randomly select a link state, indexed by $\mu$, from

815: among $x_{0,0},\ldots,x_{0,K_0}$ according to the

816: following probabilities, which depend on the choice of a single

817: bridge distribution, denoted by $p_*(x)$:

818: \beq

819:   \Pi_0 (\mu \,|\, x_0) & = &

820:   { p_*(x_{0,\mu}) \over p_0(x_{0,\mu})}\ \Big/\

821:                \sum\limits_{k=0}^{K_0} {p_*(x_{0,k}) \over p_0(x_{0,k}) }

822: \eeq

823: The LIS estimate for $r = Z_1/Z_0$ based on this pair of samples

824: from $\pi_0$ and $\pi_1$ is

825: \beq

826:  \rhatLIS^{(i)} & = &

827:  {1 \over K_0\!+\!1} \sum_{k=0}^{K_0} {p_*(x_{0,k}) \over p_0(x_{0,k})}

828:  \ \Big/\,

829:  {1 \over K_1\!+\!1} \left[ {p_*(x_{0,\mu}) \over p_1(x_{0,\mu})}

830:          \, +\, \sum_{k=1}^{K_1} {p_*(x_{1,k}) \over p_1(x_{1,k})}

831:  \right]

832: \label{eq-lis-indep}

833: \eeq

834: The superscript $i$ is used here

835: to indicate that this estimate is based on the $i$'th

836: pair of samples.  We can see that it is very similar to the bridge sampling

837: estimate of equation~(\ref{eq-bridge}), except that the link state is included

838: in both samples.  Since these LIS estimates are unbiased, we can

839: average $M$ of them to obtain a final LIS estimate.

840:

841: We can also average the estimate of equation~(\ref{eq-lis-indep})

842: over the random choice of link state, which

843: is guaranteed to produce an estimate (also unbiased) with smaller

844: mean-squared-error (see Schervish 1995, Section 3.2).  The result is

845: \beq

846:  \rhatLISave^{(i)} & = &

847:  \sum_{\mu=0}^{K_0} \Pi_0(\mu\,|\,x_0) \

848:  {1 \over K_0\!+\!1} \sum_{k=0}^{K_0} {p_*(x_{0,k}) \over p_0(x_{0,k})}

849:  \ \Big/\,

850:  {1 \over K_1\!+\!1} \left[ {p_*(x_{0,\mu}) \over p_1(x_{0,\mu})}

851:          \, +\, \sum_{k=1}^{K_1} {p_*(x_{1,k}) \over p_1(x_{1,k})}

852:  \right] \\[5pt]

853: & = &

854:  {K_1\!+\!1 \over K_0\!+\!1}\ \sum_{\mu=0}^{K_0} \

855:  {p_*(x_{0,\mu}) \over p_0(x_{0,\mu})}

856:  \ \Big/\,

857:   \left[ {p_*(x_{0,\mu}) \over p_1(x_{0,\mu})}

858:          \, +\, \sum_{k=1}^{K_1} {p_*(x_{1,k}) \over p_1(x_{1,k})}

859:  \right]

860: \label{eq-lis-ave}

861: \eeq

862: Averaging these estimates over $M$ pairs of samples produces a final estimate

863: denoted by $\rhatLISave$.
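The equality of the two forms in equation~(\ref{eq-lis-ave}) can be checked on a toy case.  The numbers below are hypothetical, chosen for easy arithmetic, and the symbols $a_k$, $b_k$, and $c$ are shorthand used only here.  Take $K_0=K_1=1$, and let $a_k = p_*(x_{0,k})/p_0(x_{0,k})$, $b_k = p_*(x_{0,k})/p_1(x_{0,k})$, and $c = \sum_{k=1}^{K_1} p_*(x_{1,k})/p_1(x_{1,k})$, with $a_0=1$, $a_1=3$, $b_0=2$, $b_1=1$, and $c=2$.  The link state probabilities are then $\Pi_0(\mu\!=\!0\,|\,x_0)=1/4$ and $\Pi_0(\mu\!=\!1\,|\,x_0)=3/4$, and equation~(\ref{eq-lis-indep}) gives $2/2=1$ when $\mu=0$ and $2/(3/2)=4/3$ when $\mu=1$, so that
\begin{eqnarray*}
 \rhatLISave^{(i)} & = & {1 \over 4}\cdot 1 \ +\ {3 \over 4}\cdot{4 \over 3}
   \ \ =\ \ {5 \over 4} \\[5pt]
 & = & {K_1\!+\!1 \over K_0\!+\!1}\,
   \left[ {a_0 \over b_0+c} \ +\ {a_1 \over b_1+c} \right]
   \ \ =\ \ {1 \over 4} \ +\ {3 \over 3} \ \ =\ \ {5 \over 4}
\end{eqnarray*}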

864:

865: To use bridged LIS in this context, we need to find reverse estimates

866: as well, but these reverse estimates needn't be independent of the

867: forward estimates, since the asymptotic validity of the bridge

868: sampling estimate of equation~(\ref{eq-bridge}) does not depend on the

869: samples $x_0$ and $x_1$ being independent.  Accordingly, we can use

870: the same samples from $\pi_0$ and $\pi_1$ for the forward and the

871: reverse operations.  However, to perform reverse sampling, we need to

872: have a sample of $K_1\!+\!1$ points drawn from $\pi_1$, the first of

873: which is ignored when performing forward sampling.  Conversely, the

874: first of the $K_0\!+\!1$ points drawn from $\pi_0$ is ignored when

875: performing the reverse sampling.

876:

877: We can improve the bridged LIS estimates by averaging the numerator

878: and the denominator of equation~(\ref{eq-bridged-lis1})

879: or~(\ref{eq-bridged-lis2}) with respect to the random choice of link

880: state.  We can also average with respect to the omission of one of the

881: points from one of the samples --- ie, rather than omitting the first

882: of $K_1 + 1$ points in the sample from $\pi_1$ when computing a

883: forward estimate, we average with respect to a random choice of point

884: to omit, and similarly for reverse estimates.  Note that the averaging

885: should be done over the sums in the numerator and denominator, not

886: with respect to the entire estimate, nor with respect to the values of

887: $\rhatLIS^{(i)}$ and $\rhatLISrev^{(i)}$ appearing inside the

888: summands.  The effective sample size after this additional averaging

889: of dependent points is unclear, so it is not obvious what the ratio

890: of sample sizes in equation~(\ref{eq-opt-bridge}) should be, but

891: using $(K_0\!+\!1)/(K_1\!+\!1)$ is probably adequate.

892:

893:

894: \section{\hspace*{-7pt}Analytical comparisons of AIS and

895:                        LIS}\label{sec-anal}\vspace*{-10pt}

896:

897: In this section, I analyse (somewhat informally) the performance of

898: AIS and LIS asymptotically, and in other situations where analytical

899: results are possible.

900:

901:

902: \subsection{\hspace*{-4pt}Asymptotic properties of

903:                        AIS and LIS estimates}\label{sec-asym}\vspace*{-4pt}

904:

905: I begin by analysing the asymptotic performance of AIS and LIS when

906: the sequence of distributions is defined by an unnormalized density function

907: of the following form:

908: \beq

909:    p_{\eta}(x) & = & p_0(x)\, \exp (-\eta U(x))

910: \label{eq-U-dist}

911: \eeq

912: This class includes sequences of canonical distributions defined by

913: equation~(\ref{eq-canonical}) in which the inverse temperature

914: varies, as well as

915: sequences that can be used for Bayesian analysis, in which $p_0$ defines the

916: prior and $\eta$ is a power that the likelihood (expressed as $\exp(-U(x))$) is

917: raised to, with $\eta=1$ giving the posterior distribution.

918: For these distributions, we can express $\log(r)$ using the well-known

919: `thermodynamic integration' formula as follows:

920: \beq

921:   \log(r)\ \ =\ \ \log(Z_1/Z_0)\ \ =\ \ - \int_0^1 E_{\pi_{\eta}}(U)\,d\eta

922: \label{eq-therm-int}

923: \eeq
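This identity follows from differentiating $\log Z_{\eta}$.  As a brief sketch, assuming that differentiation under the integral sign is justified (and with sums replacing integrals when $x$ is discrete), write $Z_{\eta} = \int p_0(x) \exp(-\eta U(x))\,dx$, so that
\begin{eqnarray*}
  {d \over d\eta} \log Z_{\eta}
  & = & {1 \over Z_{\eta}} \int -U(x)\, p_0(x) \exp(-\eta U(x))\,dx
  \ \ =\ \ - E_{\pi_{\eta}}(U)
\end{eqnarray*}
Integrating both sides from $\eta=0$ to $\eta=1$ gives $\log Z_1 - \log Z_0$ on the left and the integral of equation~(\ref{eq-therm-int}) on the right.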

924:

925: The analysis here is asymptotic, as the number of intermediate

926: distributions used, given by $n\!-\!1$, goes to infinity.  I will

927: assume the $\eta_j$ defining these distributions are chosen according to a

928: scheme in which for any

929: $a \in (0,1)$, the spacing $\eta_{j+1}-\eta_j$ when $j = \lfloor a\,n \rfloor$

930: is asymptotically proportional to $1/n$ --- in other words,

931: the relative density of intermediate distributions in the neighborhood

932: of different values of $\eta$ stays the same as the overall density increases.

933: The simplest such scheme is to let $\eta_j = j/n$, though other schemes

934: may sometimes be better.

935:

936: With the above form for $p_{\eta}$, the AIS estimate from a single run

937: (from equation~(\ref{eq-ais-est})) can be written as follows:

938: \beq

939:   \log\ \rhatAIS^{(i)}

940:   & = &

941:    \sum_{j=0}^{n-1}\, \log \Big(p_{\eta_{j+1}}(x^{(i)}_j)

942:                                 \,\Big/\,p_{\eta_j}(x^{(i)}_j)\Big)

943:    \ \ =\ \

944:    \sum_{j=0}^{n-1}\, - (\eta_{j+1}-\eta_j)\, U \Big(x^{(i)}_j\Big)

945: \label{eq-ais-reim}

946: \eeq

947: When $\eta_j=j/n$, this can be seen as a stochastic form of Riemann's Rule

948: for numerically integrating equation~(\ref{eq-therm-int}), though one

949: difference is that $\log\ \rhatAIS$ converges to the correct value as $M$ goes

950: to infinity even if $n$ stays fixed.

951:

952: Provided that there is some finite bound on the variance of $U$ under all

953: the distributions $\pi_{\eta}$, and that the Markov transitions used mix well,

954: a Central Limit Theorem will apply, allowing us to conclude that the

955: distribution of $\ell_n = \log\ \rhatAIS^{(i)}$ becomes

956: Gaussian as $n$ goes to infinity.  Let the mean of $\ell_n$ be $\mu_n$,

957: and let the variance of $\ell_n$ asymptotically be $\sigma^2/n$, where $\sigma$

958: is determined by details of the spacing of intermediate distributions and

959: of the degree of autocorrelation in the Markov transitions.

960: Note that $E[Y^q]=\exp(q\mu+q^2\varsigma^2/2)$ when $Y=\exp(X)$

961: and $X$ is Gaussian with mean $\mu$ and variance $\varsigma^2$.

962: Using this, the mean of $\exp(\ell_n)$ is $\exp(\mu_n+\sigma^2/2n)$.  This

963: must equal $r$, since $\rhatAIS$ is unbiased, so $\mu_n = \log(r)-\sigma^2/2n$.

964: Using this, and noting that $E[\exp(2\ell_n)] = \exp(2\mu_n+2\sigma^2/n) = r^2\exp(\sigma^2/n)$, we can see that the variance of $\rhatAIS^{(i)}=\exp(\ell_n)$ is

965: $r^2\,[\exp(\sigma^2/n) - 1]$, which for large $n$ will be approximately

966: $r^2\sigma^2/n$.  The variance of $\rhatAIS$ will therefore be $r^2\sigma^2/nM$.

967: Asymptotically, the total computational effort, which will generally be

968: proportional to $nM$, can be divided in any way between more intermediate

969: distributions ($n$) or more runs ($M$) without affecting the accuracy

970: of estimation of $r$, provided that $n$ is kept large enough that

971: these asymptotic results apply --- a fact noted by Hendrix and Jarzynski (2001).

972: We can therefore use a value of $M$ greater than one without penalty,

973: in order to obtain an error estimate from the degree of variation

974: over the $M$ runs.

975:

976: For LIS, we can write the log of the estimate from one run

977: (equation~(\ref{eq-lis})) as follows:

978: \beq

979:    \log\ \rhatLIS^{(i)} & = & \sum_{j=0}^{n-1} \left[

980:      \log \left({1 \over K_j+1}\, \sum_{k=0}^{K_j}\,

981:           { p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) } \right)

982:      \ -\

983:      \log \left({1 \over K_{j+1}+1}\, \sum_{k=0}^{K_{j+1}}\,

984:           { p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }\right)

985:      \right]\ \ \ \ \ \

986: \label{eq-logrhatLIS}

987: \eeq

988: Suppose that we let $K_j = \lceil m K_j^0 \rceil$ for all $j$ and some set of

989: $K^0_j$, and that we then let $m$ go to infinity.  Assuming that the variances

990: of the ratios of probabilities are finite, and that the Markov chain transitions

991: used mix sufficiently well, a Central Limit

992: Theorem will again apply, and we can conclude that all of the $n$ terms in

993: the sum above, and therefore also the sum itself, will approach Gaussian

994: distributions, with variances proportional to $1/m$.

995:

996: To analyse the LIS estimate in more detail, we need to assume a form of

997: bridge distribution, as well as a form for $p_{\eta}$.  If $p_{\eta}$

998: has the form of equation~(\ref{eq-U-dist}) and we use the geometric bridge

999: of equation~(\ref{eq-geo-bridge}), we can write

1000: \beq

1001:    \log\ \rhatLIS^{(i)} & = & \sum_{j=0}^{n-1}\, \left[\

1002:      \log \left( {1 \over K_j+1}\, \sum\limits_{k=0}^{K_j}\,

1003:           \exp(-(\eta_{j+1}\!-\!\eta_j)\, U(x_{j,k})\, /\, 2)  \right)

1004:    \ -\ \right. \nonumber \\[4pt]

1005:    & & \ \ \ \ \ \ \ \ \left.

1006:      \log \left( {1 \over K_{j+1}+1}\,\sum\limits_{k=0}^{K_{j+1}}\,

1007:           \exp(-(\eta_j\!-\!\eta_{j+1})\, U(x_{j+1,k})\, /\, 2) \right)

1008:      \ \right]

1009: \eeq

1010: Since $\exp(z)\approx1+z$ and $\log(1+z)\approx z$ when $z$ is small, we can

1011: rewrite this when $n$ is large (and hence $\eta_{j+1}\!-\!\eta_j$ is small) as

1012: \beq

1013:    \log\ \rhatLIS^{(i)} & \approx & \sum_{j=0}^{n-1}\, \left[\

1014:      \log \left( 1 \ -\ { \eta_{j+1}\!-\!\eta_j \over 2}\,

1015:          {1 \over K_j+1}\, \sum\limits_{k=0}^{K_j} U(x_{j,k}) \right)

1016:    \ -\ \right. \nonumber \\[4pt]

1017:    & & \ \ \ \ \ \ \ \ \left.

1018:      \log \left( 1 \ +\ {\eta_{j+1}\!-\!\eta_j \over 2}\,

1019:          {1 \over K_{j+1}+1}\, \sum\limits_{k=0}^{K_{j+1}} U(x_{j+1,k}) \right)

1020:      \ \right] \\[6pt]

1021:    & \approx & \sum_{j=0}^{n-1}\,

1022:          - {\eta_{j+1}\!-\!\eta_j \over 2}\,

1023:          \left[ {1 \over K_j+1}\, \sum\limits_{k=0}^{K_j} U(x_{j,k})

1024:          \ +\   {1 \over K_{j+1}+1}\,

1025:                 \sum\limits_{k=0}^{K_{j+1}} U(x_{j+1,k}) \right]  \\[6pt]

1026:    & = &

1027:          -\ {\eta_1\!-\!\eta_0 \over 2}\,

1028:            {1 \over K_0+1}\, \sum\limits_{k=0}^{K_0} U(x_{0,k})

1029:          \ -\ {\eta_n\!-\!\eta_{n-1} \over 2}\,

1030:               {1 \over K_n+1}\, \sum\limits_{k=0}^{K_n} U(x_{n,k})

1031:          \nonumber \\[4pt]

1032:    &   &  -\ \sum_{j=1}^{n-1}\,

1033:            {\eta_{j+1}\!-\!\eta_{j-1} \over 2}\,

1034:            {1 \over K_j+1}\, \sum\limits_{k=0}^{K_j} U(x_{j,k})

1035: \eeq
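For concreteness, write $\bar U_j = {1 \over K_j+1} \sum_{k=0}^{K_j} U(x_{j,k})$ for the sample average of $U$ at level $j$ (a shorthand introduced only for this remark).  In the special case $\eta_j = j/n$, the end spacings are $(\eta_1\!-\!\eta_0)/2 = (\eta_n\!-\!\eta_{n-1})/2 = 1/2n$ and the interior spacings are $(\eta_{j+1}\!-\!\eta_{j-1})/2 = 1/n$, so the last expression becomes
\begin{eqnarray*}
   \log\ \rhatLIS^{(i)} & \approx &
     -\ {1 \over n}\, \left[\, {\bar U_0 \over 2} \ +\
        \sum_{j=1}^{n-1} \bar U_j \ +\ {\bar U_n \over 2} \,\right]
\end{eqnarray*}
in which the two endpoint averages receive weight $1/2$ and the interior averages receive weight $1$.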

1036:

1037: When $\eta_j=j/n$, this looks like a stochastic form of the

1038: Trapezoidal Rule for numerically integrating

1039: equation~(\ref{eq-therm-int}).  Since the Trapezoidal Rule converges

1040: faster than Riemann's Rule, one might expect LIS to perform better

1041: than AIS asymptotically, but this is not so in this stochastic

1042: situation.  Suppose for simplicity that we set all $K_j=m$.  The

1043: variance of $\log\ \rhatLIS^{(i)}$ will be dominated by the variance

1044: of the last sum above, which will be proportional to $1/nm$, assuming

1045: that $m$ is large, so that the dependence between terms (from sharing

1046: link states) is negligible.  Using the same argument as for AIS above,

1047: the variance of $\log \rhatLIS$ will be proportional to $1/nmM$.

1048: Considering that the computation time for an LIS run will be

1049: proportional to $nm$, versus $n$ for AIS, we see that the variances of

1050: the AIS and LIS estimates go down the same way in proportion to

1051: computation time, asymptotically as $n$ and $m$ go to infinity.

1052:

1053: Furthermore, the proportionality constant should be the same for

1054: AIS and LIS, assuming that the overhead of the two procedures is

1055: negligible compared to the time spent performing Markov transitions,

1056: so that the proportionality constants for computation time are the

1057: same for AIS (multiplying $n$) and for LIS (multiplying $nm$).  The

1058: proportionality constants for variance for AIS (multiplying $1/nM$)

1059: and for LIS (multiplying $1/nmM$) depend in a complex way on the form of

1060: the density of $\eta_j$ values and on the mixing properties of the

1061: Markov transitions, but the result should be the same for AIS and

1062: LIS, provided the same scheme is used for choosing $\eta_j$ values,

1063: and the same Markov transitions are used, parameterized smoothly in

1064: terms of $\eta$.  A difference that might appear significant is that

1065: for AIS only one Markov transition is done for each $\eta_j$, whereas

1066: for LIS, $m$ such transitions are done.  However, as $n$ goes to

1067: infinity, nearby distributions become more similar, so transitions for

1068: $m$ consecutive distributions become similar to $m$ transitions for

1069: one of these distributions.

1070:

1071: The apparently pessimistic conclusion from this is that when both $n$

1072: and $m$ (and hence the $K_j$) are large, the performance of LIS should

1073: be about the same as that of AIS (with $n$ for AIS chosen to equalize

1074: the computation time), assuming that the distributions used have the

1075: form of equation~(\ref{eq-U-dist}), that the variance of $U$ is finite

1076: under all of the distributions $\pi_{\eta}$, and that the Markov

1077: transitions used mix well enough.  Fortunately, however, there is no

1078: reason to make both $m$ and $n$ large with LIS.  For good performance,

1079: $n$ must be large enough that $\pi_{\eta_j}$ and $\pi_{\eta_{j+1}}$

1080: overlap significantly, but there is no reason to make $n$ much larger

1081: than this.  The accuracy of the estimates can be improved as desired

1082: by increasing $m$ and/or $M$ while keeping $n$ fixed.  The results

1083: below show that LIS estimates with $n$ fixed are sometimes much better

1084: than AIS estimates.

1085:

1086: Finally, let us consider the asymptotic performance of the bridged

1087: versions of AIS and LIS, assuming that the variance of $U$ is finite,

1088: so that the distribution of the estimates from individual runs becomes

1089: Gaussian as $n$ (for AIS) or $m$ (for LIS) goes to infinity.  Looking

1090: at equations~(\ref{eq-bridged-lis1}) and~(\ref{eq-bridged-lis2}),

1091: which also are applicable to bridged AIS estimates, we see that the

1092: log of $\rhatLISbridged^{(i)}$ can for both optimal and geometric

1093: bridges be expressed as the difference of the log of the numerator,

1094: which is the mean of a function of the forward estimates,

1095: $\rhatLIS^{(i)}$, and the log of the denominator, which is the mean of

1096: a function of the reverse estimates, $\rhatLISrev^{(i)}$.  If these

1097: forward and reverse estimates have Gaussian distributions with small

1098: variances, $\sigma^2$ and $\underline{\sigma}^2$, then

1099: $\rhatLISbridged^{(i)}$ will also be Gaussian, with a variance that

1100: can be computed in terms of the derivatives of the summands in the

1101: numerator and the denominator, with respect to $\rhatLIS^{(i)}$ and

1102: $\rhatLISrev^{(i)}$, evaluated at the true values of $r$ and $1/r$.

1103: I will assume that $r=1$ below, as can be done without loss of generality.

1104:

1105: For the geometric bridge, these derivatives are both $1/2$, from which

1106: it follows that the variance of the numerator in

1107: equation~(\ref{eq-bridged-lis2}) is $\sigma^2/4M$ and that of the

1108: denominator is $\underline{\sigma}^2/4\underline{M}$.  Since the

1109: numerator and denominator evaluate to one for $\rhatLIS^{(i)}=r=1$ and

1110: $\rhatLISrev^{(i)}=1/r=1$, the sum of the variances of the logs of the

1111: numerator and denominator is $\sigma^2/4M +

1112: \underline{\sigma}^2/4\underline{M}$. If

1113: $\sigma^2=\underline{\sigma}^2$ and $M=\underline{M}$, this reduces to

1114: $\sigma^2/2M$.  The variance of an unbridged LIS estimate will be

1115: $\sigma^2/M$.  However, the bridged estimate requires time

1116: proportional to $M+\underline{M}$, compared to just $M$ for the

1117: unbridged estimate.  The value of $M$ for the unbridged method can

1118: therefore be twice as large as for the bridged method, with the result

1119: that bridged and unbridged estimates perform equally well

1120: asymptotically (assuming the variance of $U$ is finite).

1121:

1122: For the optimal bridge, the derivatives of the summands in the

1123: numerator and denominator are both $1/4$, when evaluated at

1124: $\rhatLIS^{(i)}=r=1$ and $\rhatLISrev^{(i)}=1/r=1$, and assuming that

1125: $M=\underline{M}$.  The numerator and denominator both evaluate to

1126: $1/2$, with the result that asymptotically the variance of the bridged

1127: estimate, assuming $\sigma^2=\underline{\sigma}^2$, is $\sigma^2/2M$,

1128: the same as for the geometric bridge.
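The arithmetic behind these figures can be written out as follows.  This is a sketch with $r=1$ and $M=\underline{M}$, in which $f$ is a label (not used elsewhere) for the summand in the numerator of equation~(\ref{eq-bridged-lis1}), viewed as a function of a single forward estimate $\rho$:
\begin{eqnarray*}
  f(\rho) & = & {1 \over 1/\rho\, +\, 1} \ \ =\ \ {\rho \over 1+\rho},
  \ \ \ \ \ f(1) \ =\ {1 \over 2},
  \ \ \ \ \ f'(\rho) \ =\ {1 \over (1+\rho)^2},
  \ \ \ \ \ f'(1) \ =\ {1 \over 4} \\[5pt]
  \Var(\log \mbox{numerator}) & \approx &
  {f'(1)^2\, \sigma^2 / M \over f(1)^2} \ \ =\ \ {\sigma^2 \over 4M}
\end{eqnarray*}
The summand in the denominator has the same form as a function of a reverse estimate, and so contributes $\underline{\sigma}^2/4\underline{M}$.  Adding the two gives $\sigma^2/2M$ when $\sigma^2=\underline{\sigma}^2$ and $M=\underline{M}$.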

1129:

1130: In conclusion, bridged AIS and LIS estimates asymptotically have the

1131: same performance as the corresponding unbridged estimates (with twice

1132: the value of $M$), for both the optimal and geometric bridges,

1133: assuming $U$ has finite variance.  This conclusion applies more

1134: generally, as long as a Central Limit Theorem holds for the individual

1135: estimates, $\rhatLIS^{(i)}$ and $\rhatLISrev^{(i)}$.  However, the

1136: bridged methods may be much better when the variance of $U$ is

1137: infinite, or for classes of distributions other than that of

1138: equation~(\ref{eq-U-dist}).  The bridged methods may also provide

1139: improvement when the values of $n$ or $m$ are not large enough for the

1140: asymptotic results to apply.

1141:

1142:

1143: \subsection{\hspace*{-4pt}Properties of AIS and LIS when sampling from

1144:             uniform distributions}\label{sec-unif}\vspace*{-4pt}

1145:

1146: In this section, I will demonstrate that when $n$ is kept suitably

1147: small, LIS can perform much better than AIS when these methods are

1148: applied to sequences of uniform distributions.

1149:

1150: As a first example, consider the class of nested uniform

1151: distributions with unnormalized densities given by\vspace*{-6pt}

1152: \beq

1153:   p_{\eta}(x) & = & \left\{ \begin{array}{ll}

1154:       1 & \mbox{if $-s^{\eta} < x < s^{\eta}$} \\ 0 & \mbox{otherwise}

1155:   \end{array}\right.

1156: \eeq

1157: for which the normalizing constants are $Z_{\eta} = 2s^{\eta}$, so that

1158: $r = Z_1/Z_0 = s$.  The results concerning this class of distributions

1159: can easily be extended to any class of uniform distributions, in any

1160: number of dimensions, that have nested regions of support.

1161: For both AIS and LIS, I will assume that the intermediate

1162: distributions are defined by $\eta_j = j/n$.  With this choice, the

1163: probability that a point, $x$, randomly sampled from $\pi_{\eta_j}$ will have

1164: $p_{\eta_{j+1}}(x)=1$ is $s^{1/n}$, for any $j$.

1165:

1166: During an AIS run, only a single point is sampled from each

1167: distribution.  An AIS run will produce an estimate for $r$ of zero if

1168: any of the ratios ${p_{\eta_{j+1}}(x^{(i)}_j)\,/\,

1169: p_{\eta_j}(x^{(i)}_j)}$ in equation~(\ref{eq-ais-est}) are zero, which

1170: happens with probability $1 - (s^{1/n})^n\, =\, 1-s$, and will

1171: otherwise produce an estimate of one.  Note that the distribution of

1172: estimates is independent of $n$.  AIS is therefore not a useful

1173: technique for nested uniform distributions --- simple importance

1174: sampling (i.e., AIS with $n\!=\!1$) would work just as well (or just as

1175: poorly, if $s$ is very small).  Bridged AIS produces no improvement in

1176: this context.
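
The claim that the distribution of AIS estimates does not depend on $n$ can be checked with a small simulation.  The following is only a sketch, under the idealizing assumption that each transition draws a point independently from the next distribution; the function names are hypothetical and are not taken from the programs used for this report.

```python
import random

def ais_run_nested(s, n, rng):
    """One idealized AIS run for the nested uniform family.

    Assumption: each transition draws an independent point from the next
    distribution, which is uniform on (-s**eta, s**eta) with eta = j/n.
    """
    x = rng.uniform(-1.0, 1.0)            # draw from pi_0, uniform on (-1, 1)
    for j in range(1, n + 1):
        half_width = s ** (j / n)         # support of the next distribution
        if abs(x) >= half_width:          # this ratio is zero, so the run gives zero
            return 0.0
        x = rng.uniform(-half_width, half_width)  # idealized transition
    return 1.0                            # every ratio was one


def ais_estimate(s, n, runs, seed=0):
    """Average of the per-run estimates over the given number of runs."""
    rng = random.Random(seed)
    return sum(ais_run_nested(s, n, rng) for _ in range(runs)) / runs
```

With $s=0.3$, calling \texttt{ais\_estimate(0.3, n, 20000)} for $n=1$, $5$, and $25$ should give averages close to $0.3$ in every case, since each run is one with probability $s$ whatever the value of $n$.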

1177:

1178: Suppose instead we use LIS with all $K_j=m$, and suppose that the

1179: Markov transitions, $T_j$, produce points that are almost independent

1180: of the previous point.  For this problem, both the geometric and

1181: optimal forms of the bridge distribution result in $p_{j*j+1}(x) =

1182: p_{\eta_{j+1}}(x)$.  If $m+1$ points are sampled independently from

1183: $\pi_{\eta_j}$, the fraction of these points for which

1184: $p_{\eta_{j+1}}(x)$ is one will have variance

1185: $s^{1/n}\,(1\!-\!s^{1/n})\,/\,(m\!+\!1)$.  For sufficiently large

1186: $m$, the variance of the log of this fraction will be

1187: approximately $(s^{1/n}\,(1\!-\!s^{1/n})\,/\,(m\!+\!1))\,/\,s^{2/n}$,

1188: which simplifies to $(s^{-1/n}\!-\!1)\,/\,(m\!+\!1)$.  For this

1189: approximation to be useful, the probability that none of the $m+1$

1190: points sampled from $\pi_{\eta_j}$ lie in the region where

1191: $p_{\eta_{j+1}}$ is one, equal to $(1-s^{1/n})^{m+1}$, must be negligible.

1192: This probability must be fairly small anyway, if LIS is to perform well.

1193:

1194: Suppose that the computational cost of an LIS run is proportional to

1195: the sum of the number of points sampled from $\pi_0$ and the number of

1196: Markov transitions performed.  If we fix this cost, the number of

1197: intermediate distributions, $n$, and the number of transitions for

1198: each distribution, $m$, will be related by $m(n\!+\!1)\,=\,C$, for

1199: some constant $C$.  Assume for the moment that both $n$ and $m$ are

1200: large.  The probability of a run producing a zero estimate will then

1201: be negligible, and we can assess the accuracy of the estimate for one

1202: run by the variance of $\log \rhatLIS^{(i)}$ (modified in some way

1203: to eliminate the infinity resulting from the negligible, but non-zero,

1204: probability that $\rhatLIS^{(i)}$ is zero).  Looking at

1205: equation~(\ref{eq-logrhatLIS}), we see that for these nested uniform

1206: distributions, the second log term vanishes ---

1207: $p_{j*j+1}(x_{j+1,k})\,/\,p_{\eta_{j+1}}(x_{j+1,k})$ is always one,

1208: since $p_{j*j+1}$ is the same as $p_{\eta_{j+1}}$.  When $m$ is large,

1209: the dependence between terms with different values of $j$ will be

1210: negligible, so we can add the variances of the terms to get the variance

1211: of the estimate, obtaining the result that

1212: \beq

1213:   \Var \Big(\log\ \rhatLIS^{(i)}\Big)

1214:     & \approx & n\,(s^{-1/n}\!-\!1)\,/\,(m\!+\!1)

1215: \label{eq-varLIS-nest}

1216: \eeq

1217: When $n$ is large, $s^{-1/n}=\, \exp(\log(1/s)/n)$ is approximately

1218: $1+\log(1/s)/n$, and hence the variance above is

1219: approximately $\log(1/s)\,/\,(m\!+\!1)$.

1220: So it seems that the larger the value of $m$, the better ---

1221: until we reach a value of $m$ for which the corresponding value of $n$,

1222: equal to $C/m\,-\,1$, is small enough that this result no longer applies.
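
The approximation in equation~(\ref{eq-varLIS-nest}) can be compared with a simulation.  The sketch below makes the same idealization as the derivation (the $m+1$ points for each distribution are treated as independent draws, rather than being linked by Markov transitions), and its function names and default settings are illustrative assumptions only.

```python
import math
import random
import statistics

def log_lis_nested(s, n, m, rng):
    """Log of one idealized LIS estimate for the nested uniform family.

    Assumption: the m+1 points for each distribution are independent,
    so the estimate is a product of n sample fractions.
    """
    total = 0.0
    for j in range(n):
        width = s ** (j / n)              # pi_{eta_j} is uniform on (-width, width)
        inner = s ** ((j + 1) / n)        # region where p_{eta_{j+1}} is one
        hits = sum(abs(rng.uniform(-width, width)) < inner
                   for _ in range(m + 1))
        if hits == 0:
            return float("-inf")          # zero estimate; negligible for large m
        total += math.log(hits / (m + 1))
    return total


def compare(s=0.1, n=4, m=200, runs=2000, seed=1):
    """Return (empirical variance, variance predicted by the approximation)."""
    rng = random.Random(seed)
    logs = [log_lis_nested(s, n, m, rng) for _ in range(runs)]
    predicted = n * (s ** (-1.0 / n) - 1.0) / (m + 1)
    return statistics.variance(logs), predicted
```

For the default settings the predicted variance is about $0.0155$, and the empirical variance should agree with it to within sampling error.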

1223:

1224: Best performance will therefore come using a fairly small value of

1225: $n$, but a large value of $m$.  Substituting $m=C/(n\!+\!1)$ into

1226: equation~(\ref{eq-varLIS-nest}), and assuming $m/(m\!+\!1)\approx 1$, we get

1227: \beq

1228:   \Var \Big(\log\ \rhatLIS^{(i)}\Big)

1229:     & \approx & n\,(s^{-1/n}\!-\!1)\,/\,(C/(n\!+\!1))

1230:     \ \ =\ \ n(n\!+\!1)\,(s^{-1/n}\!-\!1)\,/\,C

1231: \eeq

1232: The value of $n$ that minimizes this depends only on $s$, not on $C$.

1233: The optimal choice of $n$ increases slowly as $s$ gets smaller:\ \

1234: $s=0.1$ gives $n=2$, $s=0.05$ gives $n=3$, $s=0.01$ gives $n=4$, and

1235: $s=0.0001$ gives $n=7$.
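
The optimal values of $n$ quoted above can be reproduced by direct search over integers.  This is a minimal sketch; \texttt{scaled\_variance} and \texttt{best\_n} are names introduced here for illustration, and the search limit of 100 is an arbitrary assumption.

```python
def scaled_variance(n, s):
    """C times the approximate variance: n (n+1) (s^(-1/n) - 1)."""
    return n * (n + 1) * (s ** (-1.0 / n) - 1.0)


def best_n(s, n_max=100):
    """Integer n in 1..n_max that minimizes the scaled variance."""
    return min(range(1, n_max + 1), key=lambda n: scaled_variance(n, s))


for s in (0.1, 0.05, 0.01, 0.0001):
    print(s, best_n(s))
```

Since $C$ enters only as an overall factor, it does not affect the minimizing $n$, which is why the search uses $C$ times the variance.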

1236:

1237: As a second example, consider the class of non-nested uniform distributions

1238: with unnormalized densities given by\vspace*{-6pt}

1239: \beq

1240:   p_{\eta}(x) & = & \left\{ \begin{array}{ll}

1241:       1 & \mbox{if $\eta t-1 < x < \eta t+1$} \\ 0 & \mbox{otherwise}

1242:   \end{array}\right.

1243: \eeq

1244: For this class, $Z_{\eta} = 2$ for all $\eta$, so $r = Z_1/Z_0 = 1$.

1245: I will again assume that the intermediate

1246: distributions are defined by $\eta_j = j/n$, and that all $K_j=m$.  Assuming

1247: that $n$ is greater than

1248: $t/2$, the probability that a point, $x$, randomly sampled from $\pi_{\eta_j}$

1249: will have $p_{\eta_{j+1}}(x)=1$ is $1-t/2n$, for any $j$.

1250:

1251: For this example, AIS estimates do not converge to the true value of

1252: $r$ as $M$ increases, regardless of the value of $n$.  To see this,

1253: note that the ratios in equation~(\ref{eq-ais-est}) will all be either

1254: zero or one, and that the estimate from one run, $\rhatAIS^{(i)}$,

1255: will be one if all of these ratios are one, and zero otherwise.  The

1256: probability of a particular ratio being one is $1-t/2n$, so the

1257: probability that all are one (assuming the $T_{\eta}$ produce points

1258: independent of the current point) is $(1-t/2n)^n$, which approaches

1259: $\exp(-t/2)$ as $n$ goes to infinity.  The AIS estimate, averaging

1260: over $M$ runs, will therefore have mean $(1-t/2n)^n$, which is at most $\exp(-t/2)$, rather than the correct

1261: value of one.
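
As a numerical illustration, with values chosen here only for concreteness, take $t=4$ and $n=250$, and assume as above that each transition produces an independent point.  Then
\[
  (1-t/2n)^n \ =\ 0.992^{250} \ \approx\ 0.134, \ \ \ \ \ \
  \exp(-t/2) \ =\ \exp(-2) \ \approx\ 0.135
\]
so roughly 87\% of runs produce an estimate of zero, and the average of the $\rhatAIS^{(i)}$ settles near $0.134$ however large $M$ is made.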

1262:

1263: In contrast, bridged AIS estimates will converge to the true value as $M$

1264: increases, as long as $n$ is greater than $t/2$, so that there is overlap

1265: between successive distributions in the sequence.  However, when $t$

1266: is large, the overlap between the distributions over paths produced by

1267: forward and reverse AIS runs, given by $\exp(-t/2)$, will be very

1268: small, and the procedure will be very inefficient.

1269:

1270: To see how well LIS performs, recall the formula for $\log \rhatLIS$

1271: from equation~(\ref{eq-logrhatLIS}):

1272: \beq

1273:    \log\ \rhatLIS^{(i)} & = & \sum_{j=0}^{n-1} \left[

1274:      \log \left({1 \over K_j+1}\, \sum_{k=0}^{K_j}\,

1275:           { p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) } \right)

1276:      \ -\

1277:      \log \left({1 \over K_{j+1}+1}\, \sum_{k=0}^{K_{j+1}}\,

1278:           { p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }\right)

1279:      \right]\ \ \ \ \ \

1280: \eeq

1281: Due to symmetry, the two log terms above have the same distribution,

1282: for all $j$.  The variance of one of these log

1283: terms (for large $m$) is

1284: $((t/2n)\,(1\!-\!t/2n)\,/\,(m\!+\!1))\,/\,(1\!-\!t/2n)^2$, which

1285: simplifies to $1\,/\,((2n/t\!-\!1)\,(m\!+\!1))$.  The second log

1286: term in equation~(\ref{eq-logrhatLIS}) for one $j$ will involve the

1287: same points, $x_{j+1,k}$, as the first log term for the next $j$.  The

1288: effect of this is that these terms will be negatively correlated, with

1289: correlation of $-1$ if $n\!=\!t$.  However, since the

1290: two terms occur with opposite signs, the effect on the final sum is

1291: that $n\!-\!1$ pairs of terms (out of $2n$ terms total) are positively

1292: correlated.  Straightforward calculations show that this correlation is

1293: $2n/t - 1$ for $t/2 < n \le t$ and $1\,/\,(2n/t - 1)$ for $n \ge t$.

1294: Using the fact that when $X$ and $Y$ have the same

1295: distribution, $\Var(X+Y) = 2\,\Var(X)\,[1+\Cor(X,Y)]$, we obtain the

1296: result that, for large $m$,

1297: \beq

1298:   \Var \Big(\log\ \rhatLIS^{(i)}\Big)

1299:     & \approx & {2 \over (2n/t\!-\!1)\,(m\!+\!1)}

1300:                 \left\{\begin{array}{ll}

1301:                    n\ +\ (n\!-\!1)\,(2n/t-1)

1302:                      & \ \ \mbox{if $t/2 < n \le t$}

1303:                 \\[4pt]

1304:                    n\ +\ (n\!-\!1)\,/\, (2n/t-1)

1305:                      & \ \ \mbox{if $n \ge t$}

1306:                 \end{array}\right\}

1307: \eeq

1308: Setting $m = C/(n\!+\!1)$, and assuming $m/(m\!+\!1)\approx 1$, gives

1309: \beq

1310:   \Var \Big(\log\ \rhatLIS^{(i)}\Big)

1311:     & \approx & {2(n\!+\!1) \over C(2n/t\!-\!1)}

1312:                 \left\{\begin{array}{ll}

1313:                    n\ +\ (n\!-\!1)\,(2n/t-1)

1314:                      & \ \ \mbox{if $t/2 < n \le t$}

1315:                 \\[4pt]

1316:                    n\ +\ (n\!-\!1)\,/\, (2n/t-1)

1317:                      & \ \ \mbox{if $n \ge t$}

1318:                 \end{array}\right\}

1319: \eeq

1320: Numerical investigation shows that the global minimum of the variance

1321: occurs where $n$ is near $(3/2)\,t$.  A second local minimum where $n$

1322: is near $(3/4)\,t$ also exists.  The two minima are nearly equally good

1323: when $t$ is large.  There is a local maximum where $n$ is near $t$,

1324: with the variance there being about 19\% greater than at the global

1325: minimum.  The variance is much larger for very large and very small values

1326: of $n$.  We therefore see that for this example too, the best results

1327: are obtained by fixing $n$ to a moderate value; any desired level of

1328: accuracy can then be obtained by increasing $m$ and/or $M$.
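
The location of the minima can be checked by evaluating the last formula directly.  The sketch below computes $C$ times the approximate variance; the function names are hypothetical, and the search simply compares each integer $n$ with its two neighbours.

```python
def scaled_variance(n, t):
    """C times the approximate variance of the log LIS estimate for the
    shifted uniform family; valid only for n > t/2."""
    if 2 * n <= t:
        raise ValueError("need n > t/2 so that successive supports overlap")
    u = 2.0 * n / t - 1.0                  # the factor (2n/t - 1)
    cor = u if n <= t else 1.0 / u         # correlation of adjacent terms
    return 2.0 * (n + 1) / u * (n + (n - 1) * cor)


def local_minima(t, n_max):
    """Integers n whose scaled variance is below that of both neighbours."""
    lo = int(t // 2) + 2                   # smallest n with a valid lower neighbour
    return [n for n in range(lo, n_max)
            if scaled_variance(n, t) < scaled_variance(n - 1, t)
            and scaled_variance(n, t) < scaled_variance(n + 1, t)]
```

For example, with $t=100$ the call \texttt{local\_minima(100, 400)} finds minima at $n=75$ and $n=150$, with the one at $n=150$ slightly lower, and the value at $n=100$ is about 19\% above it.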

1329:

1330:

1331: \section{\hspace*{-7pt}Empirical comparisons of AIS and

1332:                        LIS}\label{sec-cmp}\vspace*{-10pt}

1333:

1334: The analytical results of the previous section indicate that LIS can

1335: sometimes perform much better than AIS, but that the benefits of LIS

1336: may only be seen when the number of intermediate distributions used is

1337: kept suitably small (but not so small that they do not overlap).  In

1338: this section, I investigate the performance of AIS and LIS (and their

1339: bridged versions) empirically.  The programs used for these tests

1340: (written in R) are available from my web page.

1341:

1342: These tests were done using sequences of one-dimensional distributions

1343: having unnormalized density functions of the following form:

1344: \beq

1345:   p_{\eta}(x) & = &

1346:     \exp\Big(\!-\!\Big|(x\!-\!\eta t)\,/\,s^{\eta}\,\Big|^q\,\Big)

1347: \eeq

1348: where $s$, $t$, and $q$ are fixed constants.  As $\eta$ moves from 0 to 1,

1349: the centre of this distribution shifts by $t$, and changes width by the

1350: factor $s$.  The power $q$ controls how thick the tails of the distributions

1351: are.  When $q=2$, the distributions are Gaussian; a larger value

1352: produces lighter tails.  Note that $Z_{\eta}$ is

1353: proportional to $s^{\eta}$, and hence $r = Z_1/Z_0$ is equal to $s$.
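
The proportionality can be verified by substituting $u = (x-\eta t)/s^{\eta}$ in the integral defining $Z_{\eta}$:
\[
  Z_{\eta} \ =\ \int_{-\infty}^{\infty}
     \exp\Big(\!-\Big|(x\!-\!\eta t)\,/\,s^{\eta}\,\Big|^q\,\Big)\,dx
  \ =\ s^{\eta} \int_{-\infty}^{\infty} \exp(-|u|^q)\,du
  \ =\ 2\,\Gamma(1\!+\!1/q)\,s^{\eta}
\]
The factor $2\,\Gamma(1\!+\!1/q)$ does not depend on $\eta$ or $t$, so it cancels in the ratio $Z_1/Z_0$.  It equals $\sqrt{\pi}$ when $q=2$, and approaches 2 as $q$ goes to infinity, matching the width $2s^{\eta}$ of the limiting uniform distribution.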

1354:

1355: If $t=0$, the distributions can be written in the form of

1356: equation~(\ref{eq-U-dist}), after reparameterizing in terms of $\eta'

1357: = 1/s^{\eta q}$, so that $p_{\eta'}(x) = \exp(-\eta' |x|^q)$.  In this

1358: case, we expect the asymptotic behaviour to be as discussed in

1359: Section~\ref{sec-asym}, but the behaviour with samples of practical

1360: size may be different.  As $q$ goes to infinity, the distributions

1361: converge to uniform distributions over $(\eta t\!-\!s^{\eta},\,\eta

1362: t\!+\!s^{\eta})$, and the results of Section~\ref{sec-unif} become relevant.

1363:

1364: I did an initial set of tests using six sequences of distributions.

1365: Three of these sequences were of Gaussian distributions, with $q\!=\!2$.

1366: The first of these used $s\!=\!1$ and $t\!=\!4$, producing a shift with no

1367: change in scale as $\eta$ increases from 0 to 1.  The second used

1368: $s\!=\!0.05$ and $t\!=\!0$, producing a contraction with no shift.  The last

1369: used $s\!=\!0.3$ and $t\!=\!2$, combining a shift with a contraction.  A

1370: second set of three sequences used the same values of $s$ and $t$, but

1371: with $q\!=\!10$, which produces more `rectangular' distributions with

1372: lighter tails.  The six sequences are shown in Figure~\ref{fig-seq}.

1373: Each sequence in these plots consists of five distributions,

1374: corresponding to $\eta\, =\, 0,\, 1/4,\, 2/4,\, 3/4,\, 1$.  These were

1375: the sequences used for the LIS runs (hence $n\!=\!4$ for these runs).  The

1376: AIS runs used more distributions, spaced more finely with respect to

1377: $\eta$, so as to produce the same number of Markov transitions and

1378: sampling operations as in the LIS runs.

1379:

1380:

1381: \begin{figure}[t]

1382:

1383: \vspace*{-29pt}

1384:

1385: \centerline{\includegraphics{epow-plts.ps}}

1386:

1387: \caption[]{The sequences of unnormalized density functions used for the

1388:            tests.  The plots show the unnormalized density functions for

1389:            $\eta\, =\, 0,\, 1/4,\, 2/4,\, 3/4,\, 1$, for six combinations

1390:            of $s$, $t$, and $q$.}\label{fig-seq}

1391:

1392: \end{figure}

1393:

1394:

1395: These distributions (for any $\eta$) can easily be sampled from using

1396: rejection sampling.  Samples from $\pi_0$ and $\pi_1$ were used to

1397: initialize forward and reverse runs of AIS and LIS.  For this test, we

1398: pretend that sampling for other $\pi_{\eta}$ must be done using Markov

1399: chain methods.  The transition used for $\pi_{\eta}$, $T_{\eta}$, was

1400: a random-walk Metropolis update, using a Gaussian proposal

1401: distribution with mean equal to the current point and standard

1402: deviation $s^{\eta}$.  Since Metropolis updates are reversible,

1403: $\underline{T}_{\eta}$ was the same.
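
A random-walk Metropolis update of this kind might be sketched as follows.  This is an illustration written in Python, not the R code used for the tests, and the function names are assumptions made here.

```python
import math
import random

def log_p(x, eta, s, t, q):
    """Log of the unnormalized density exp(-|(x - eta*t)/s**eta|**q)."""
    return -abs((x - eta * t) / s ** eta) ** q


def metropolis_step(x, eta, s, t, q, rng):
    """One random-walk Metropolis update that leaves pi_eta invariant."""
    proposal = rng.gauss(x, s ** eta)      # Gaussian proposal with sd s**eta
    log_accept = log_p(proposal, eta, s, t, q) - log_p(x, eta, s, t, q)
    if log_accept >= 0 or rng.random() < math.exp(log_accept):
        return proposal                    # accept the proposed point
    return x                               # reject and stay at the current point


def run_chain(x0, eta, s, t, q, steps, seed=0):
    """Apply the update repeatedly, returning the list of visited points."""
    rng = random.Random(seed)
    xs, x = [], x0
    for _ in range(steps):
        x = metropolis_step(x, eta, s, t, q, rng)
        xs.append(x)
    return xs
```

Working with the log of the density avoids forming very small density values when $q$ is large.  As a check, with $q=2$, $s=1$, $t=4$, and $\eta=1/2$ the target is Gaussian with mean 2 and variance $1/2$, and a long chain should reproduce these moments approximately.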

1404:

1405: Two sets of forward and reverse LIS runs were done with $n\!=\!4$, all

1406: $K_j\!=\!50$, and $M\!=\!20$, one set using the geometric bridge, the

1407: other using the optimal bridge with the true value of $r$.  The

1408: forward estimates were computed from equation~(\ref{eq-lis}); the

1409: reverse estimates from equation~(\ref{eq-lis-rev}), which is

1410: equivalent to using the forward procedure with the reverse sequence of

1411: distributions.  Bridged LIS estimates were also found using

1412: equation~(\ref{eq-bridged-lis1}), with the value of $r$ found by

1413: iteration.  To make the comparison with forward and reverse estimates

1414: fair, the bridged LIS estimates used $M\!=\!10$; that is, only half of

1415: the forward and half of the reverse runs were used, for a total of

1416: $20$ runs.

1417:

1418: A corresponding set of forward, reverse, and bridged AIS runs were

1419: also done, with $n\!=\!250$ and $M\!=\!20$ ($M\!=\!10$ for the bridged

1420: estimates).  If sampling a point from $\pi_0$ or $\pi_1$ takes about

1421: the same computation time as a Metropolis update, these AIS runs will

1422: take about the same time as the LIS runs.  (This assumes that sampling

1423: and Markov transitions dominate the time, which is typically true for

1424: real problems but perhaps not for this simple test problem.)

1425:

1426: Sets of longer LIS and AIS runs were also done, which were the same as

1427: the sets above except that for LIS, $K_j\!=\!200$ for all $j$, and for

1428: AIS, $n\!=\!1000$, which again equalizes the computation time.

1429:

1430: Experience, together with the asymptotic results of

1431: Section~\ref{sec-asym}, shows that estimates produced using a small

1432: value of $M$ are better than, or at least as good as, those produced

1433: with larger $M$.  I chose $M\!=\!20$ ($M\!=\!10$ for bridged estimates) since

1434: this is about the smallest value that allows reliable estimation of

1435: standard errors, which would usually be needed in practice.

1436:

1437: The standard errors for AIS and LIS estimates of $\rhat$ were

1438: estimated by the sample standard deviation of the $\rhat^{(i)}$

1439: divided by $\sqrt{M}$.  When comparing the methods, I looked primarily

1440: at the mean squared error when estimating $\log(r)$ (rather than when

1441: estimating $r$).  The estimate I used was $\log(\rhat)$, and the

1442: standard error for this estimate was estimated by the standard error

1443: for $\rhat$ divided by $\rhat$.  For the reverse runs, $\log(r)$ was

1444: estimated by $-\log(\rhatrev)$.  For bridged AIS and LIS, the standard

1445: errors for the log of the numerator and the log of the denominator of

1446: equation~(\ref{eq-bridged-lis1}) were found, and the overall standard

1447: error was computed as the square root of the sum of the squares of

1448: these two standard errors.  This method of converting estimates and

1449: standard errors for $r$ to those for $\log(r)$ is valid

1450: asymptotically.  It might be improved upon for finite samples, but

1451: such improvements would probably not affect the relative merits of the

1452: methods compared here.
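
The conversion just described can be summarized in a few lines of code.  The sketch below is illustrative only; the function names are hypothetical, and the second function treats the numerator and denominator of the bridged estimate as independent, which is an assumption stated here rather than a result derived above.

```python
import math
import statistics

def summarize(r_runs):
    """Combine per-run estimates into estimates of r and log(r).

    Returns (r_hat, se_r, log_r_hat, se_log_r).
    """
    m_runs = len(r_runs)
    r_hat = statistics.fmean(r_runs)                      # average over the M runs
    se_r = statistics.stdev(r_runs) / math.sqrt(m_runs)   # sample sd / sqrt(M)
    return r_hat, se_r, math.log(r_hat), se_r / r_hat     # log scale: se / estimate


def combine_bridged(se_log_num, se_log_den):
    """Square root of the sum of squares of the two log-scale standard errors."""
    return math.hypot(se_log_num, se_log_den)
```

For the reverse runs, the sign of the log estimate is flipped, while the standard error on the log scale is unchanged.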

1453:

1454: \begin{figure}[p]

1455:

1456: \centerline{\includegraphics{tst-plt1.ps}}

1457:

1458: \vspace*{-8pt}

1459:

1460: %\hspace*{0.1in}\makebox[3.2in]{Short Runs}\hfill\makebox[3.2in]{Long Runs}

1461:

1462: \caption[]{Results of short and long runs

1463:            on the distribution sequence with $s\!=\!1$, $t\!=\!4$, and

1464:            $q\!=\!2$.}\label{fig-r1}

1465:

1466: \end{figure}

1467:

1468: \begin{figure}[p]

1469:

1470: \centerline{\includegraphics{tst-plt2.ps}}

1471:

1472: \vspace*{-8pt}

1473:

1474: %\hspace*{0.1in}\makebox[3.2in]{Short Runs}\hfill\makebox[3.2in]{Long Runs}

1475:

1476: \caption[]{Results of short and long runs

1477:            on the distribution sequence with $s\!=\!1$, $t\!=\!4$, and

1478:            $q\!=\!10$.}\label{fig-r2}

1479:

1480: \end{figure}

1481:

1482:

1483: \begin{figure}[p]

1484:

1485: \centerline{\includegraphics{tst-plt3.ps}}

1486:

1487: \vspace*{-8pt}

1488:

1489: %\hspace*{0.1in}\makebox[3.2in]{Short Runs}\hfill\makebox[3.2in]{Long Runs}

1490:

1491: \caption[]{Results of short and long runs

1492:            on the distribution sequence with $s\!=\!0.05$, $t\!=\!0$, and

1493:            $q\!=\!2$.}\label{fig-r3}

1494:

1495: \end{figure}

1496:

1497: \begin{figure}[p]

1498:

1499: \centerline{\includegraphics{tst-plt4.ps}}

1500:

1501: \vspace*{-8pt}

1502:

1503: %\hspace*{0.1in}\makebox[3.2in]{Short Runs}\hfill\makebox[3.2in]{Long Runs}

1504:

1505: \caption[]{Results of short and long runs

1506:            on the distribution sequence with $s\!=\!0.05$, $t\!=\!0$, and

1507:            $q\!=\!10$.}\label{fig-r4}

1508:

1509: \end{figure}

1510:

1511:

1512: \begin{figure}[p]

1513:

1514: \centerline{\includegraphics{tst-plt5.ps}}

1515:

1516: \vspace*{-8pt}

1517:

1518: %\hspace*{0.1in}\makebox[3.2in]{Short Runs}\hfill\makebox[3.2in]{Long Runs}

1519:

1520: \caption[]{Results of short and long runs

1521:            on the distribution sequence with $s\!=\!0.3$, $t\!=\!2$, and

1522:            $q\!=\!2$.}\label{fig-r5}

1523:

1524: \end{figure}

1525:

1526: \begin{figure}[p]

1527:

1528: \centerline{\includegraphics{tst-plt6.ps}}

1529:

1530: \vspace*{-8pt}

1531:

1532: %\hspace*{0.1in}\makebox[3.2in]{Short Runs}\hfill\makebox[3.2in]{Long Runs}

1533:

1534: \caption[]{Results of short and long runs

1535:            on the distribution sequence with $s\!=\!0.3$, $t\!=\!2$, and

1536:            $q\!=\!10$.}\label{fig-r6}

1537:

1538: \end{figure}

1539:

1540: Figures~\ref{fig-r1} through \ref{fig-r6} plot the mean squared errors

1541: of estimates for $\log(r)$ for the six sets of runs.  Results are

1542: shown for AIS, for LIS using the geometric bridge, and for LIS using

1543: the optimal bridge, with the true value of $r$.  Results for both the

1544: forward and reverse versions of each method are shown, together with

1545: the bridged version, using the optimal bridge, with $r$ obtained by

1546: iteration.  Results for the short runs ($n\!=\!4$, $K_j\!=\!50$ for

1547: LIS, $n\!=\!250$ for AIS) are on the left, and for the long runs

1548: ($n\!=\!4$, $K_j\!=\!200$ for LIS, $n\!=\!1000$ for AIS) on the right.

1549: The mean squared error for each method was estimated by simulating

1550: each method 2000 times, and comparing the estimates with the true

1551: value of $\log(r)$.  The bars in the plots are dark up to the

1552: estimated mean squared error minus twice its standard error, and are

1553: then light up to the estimated mean squared error plus twice its

1554: standard error.  For bars that extend above the plot the estimated

1555: mean squared error is shown at the top of the bar.

1556:

1557: The results for translated sequences of distributions ($t\!=\!4$ and

1558: $s\!=\!1$) are shown in Figures~\ref{fig-r1} and~\ref{fig-r2}.  When the

1559: distributions are Gaussian ($q\!=\!2$), no advantage is seen for LIS --- if

1560: anything, LIS performs slightly worse than AIS, particularly when the

1561: geometric bridge is used.  The forward and reverse forms of AIS and

1562: LIS should have identical performance for these distribution

1563: sequences, due to symmetry; any differences seen result from random

1564: variation.  The bridged forms of both AIS and LIS perform better than

1565: the unbridged forward and reverse forms.  The advantage of bridging is

1566: less for the longer runs, however, as expected from the analysis at

1567: the end of Section~\ref{sec-asym}.

1568:

1569: When $q\!=\!10$, the distributions have much lighter tails than the

1570: Gaussian, more closely resembling the uniform distributions analysed

1571: in Section~\ref{sec-unif}.  For these sequences of distributions, LIS

1572: performs substantially better than AIS.  The unbridged version of AIS

1573: does particularly badly.  The mean squared error for the bridged

1574: version of AIS is about 2.5 times greater than for the bridged version

1575: of LIS.  It makes little difference whether the geometric or optimal

1576: bridge is used for LIS.

1577:

1578: Figures~\ref{fig-r3} and~\ref{fig-r4} show the results for sequences

1579: of distributions with the same mean ($t\!=\!0$) but decreasing width

1580: ($s\!=\!0.05$).  For these sequences, a modest advantage of LIS over AIS

1581: is apparent for the sequence of Gaussian distributions ($q\!=\!2$), with

1582: the variance for AIS estimates being about a factor of 1.3 greater

1583: than for LIS estimates with the geometric bridge, and about a factor

1584: of 1.7 greater than for LIS estimates with the optimal bridge.  The

1585: reversed AIS and LIS estimates are somewhat worse than the forward

1586: estimates for this sequence of distributions.  No advantage is seen for

1587: bridged AIS or LIS estimates.

1588:

1589: The results for the sequence of distributions with $q\!=\!10$ are similar,

1590: except that the advantage of LIS over AIS is much greater --- about a

1591: factor of 6.

1592:

1593: Results for the last type of sequence, with $s\!=\!0.3$ and $t\!=\!2$, are

1594: shown in Figures~\ref{fig-r5} and~\ref{fig-r6}.  This problem is a

1595: hybrid of the previous two, with both translation and change in width,

1596: producing results intermediate between those for the previous two

1597: problems.  No difference in performance between AIS and LIS is

1598: apparent for the Gaussian distributions ($q\!=\!2$), but the bridged forms

1599: of both perform slightly better.  For the sequence of distributions

1600: with $q\!=\!10$, a clear advantage of LIS over AIS can be seen, but this

1601: advantage is not as great as for the sequence with $t\!=\!0$ and $s\!=\!0.05$.

1602: The bridged forms of both AIS and LIS are again better, more so for

1603: the short runs than for the long runs.

1604:

1605: In addition to looking at the mean squared error of estimates found

1606: with these methods, I also looked at the fraction of times that the

1607: estimate for $\log(r)$ differed from the true value by more than twice

1608: the standard error estimated using the $M$ runs.  This should be

1609: approximately 5\% if the distribution of estimates is Gaussian, and

1610: the standard errors are accurate.  For the longer runs, this fraction

1611: was indeed near or only slightly above 5\% for all methods, except for

1612: the unbridged AIS runs when these performed very poorly.  For the

1613: shorter runs, however, the unbridged AIS and LIS methods produced

1614: estimates more than two standard errors from the true value around 10\% of

1615: the time (sometimes much more often, when unbridged AIS performed

1616: poorly).  Both the bridged AIS and the bridged LIS methods gave more

1617: reliable standard errors.  However, it is possible that better

1618: standard errors for the unbridged methods might be obtained with a

1619: more sophisticated approach than I used.

1620:

1621: I performed additional runs to verify and extend some of the analytic

1622: results from Section~\ref{sec-anal}.  Figures~\ref{fig-r7}

1623: and~\ref{fig-r8} show results obtained using LIS with increasing

1624: numbers of intermediate distributions, starting with the value of

1625: $n\!=\!4$ used for the tests above, and continuing to $n\!=\!9$,

1626: $n\!=\!19$, and $n\!=\!39$, while keeping the computation time

1627: constant by decreasing $m$ in inverse proportion to $n\!+\!1$.  The two

1628: distribution sequences with $s\!=\!1$ and $t\!=\!4$ and with

1629: $s\!=\!0.05$ and $t\!=\!0$ were used, in both cases with $q\!=\!10$.

1630: The sequence with $t\!=\!0$ and $s\!=\!0.05$ has the form of

1631: equation~(\ref{eq-U-dist}), so in accordance with the analysis of

1632: Section~\ref{sec-asym}, we expect that asymptotically, as $n$

1633: increases, LIS and AIS should have the same performance.  This is

1634: indeed what we see in Figure~\ref{fig-r8}.  We also see the same

1635: behaviour for the sequence with $t\!=\!4$ and $s\!=\!1$ in

1636: Figure~\ref{fig-r7}.

1637:

1638:

1639: \begin{figure}[p]

1640:

1641: \centerline{\includegraphics{tst-plt-2four.ps}}

1642:

1643: \vspace*{-8pt}

1644:

1645: \caption[]{Results using increasing values of $n$ for LIS, while keeping

1646:            computation time constant, for the distribution sequence with

1647:            $s\!=\!1$, $t\!=\!4$, and $q\!=\!10$.  The same AIS procedure was

1648:            used for all plots, but results vary randomly.}\label{fig-r7}

1649:

1650: \end{figure}

1651:

1652:

1653: \begin{figure}[p]

1654:

1655: \centerline{\includegraphics{tst-plt-4four.ps}}

1656:

1657: \vspace*{-8pt}

1658:

1659: \caption[]{Results using increasing values of $n$ for LIS, while keeping

1660:            computation time constant, for the distribution sequence with

1661:            $s\!=\!0.05$, $t\!=\!0$, and $q\!=\!10$.  The same AIS procedure was

1662:            used for all plots, but results vary randomly.}\label{fig-r8}

1663:

1664: \end{figure}

1665:

1666:

1667: \begin{figure}[p]

1668:

1669: \centerline{\includegraphics{tst-plt-222.ps}}

1670:

1671: \vspace*{-8pt}

1672:

1673: \caption[]{Results with increasing values of $q$, for sequences of

1674:            distributions with $s\!=\!1$ and $t\!=\!4$.  The AIS runs used

1675:            $n\!=\!250$; the LIS runs used $n\!=\!4$ and $m\!=\!50$,

1676:            requiring the same amount of computation.}\label{fig-r9}

1677:

1678: \end{figure}

1679:

1680:

1681: \begin{figure}[p]

1682:

1683: \centerline{\includegraphics{tst-plt-444.ps}}

1684:

1685: \vspace*{-8pt}

1686:

1687: \caption[]{Results with increasing values of $q$, for sequences of

1688:            distributions with $s\!=\!0.05$ and $t\!=\!0$.  The AIS runs used

1689:            $n\!=\!250$; the LIS runs used $n\!=\!4$ and $m\!=\!50$,

1690:            requiring the same amount of computation.}\label{fig-r10}

1691:

1692: \end{figure}

1693:

1694: As $q$ increases, the distributions become close to uniform, and the

1695: results of Section~\ref{sec-unif} should apply.  To test this, I tried

1696: values of $q\!=\!2$, $q\!=\!10$, $q\!=\!20$, and $q\!=\!30$ for the

1697: distribution sequence with $s\!=\!1$ and $t\!=\!4$ and the sequence with

1698: $s\!=\!0.05$ and $t\!=\!0$.  Results are shown in Figures~\ref{fig-r9}

1699: and~\ref{fig-r10}.  (The results for $q\!=\!2$ and $q\!=\!10$ are the same as

1700: on the left in Figures~\ref{fig-r1} to~\ref{fig-r4}, though the scale

1701: differs.)

1702:

1703: For the sequences with $s\!=\!1$ and $t\!=\!4$, the limiting uniform

1704: distributions have the form of the second example in

1705: Section~\ref{sec-unif}.  As noted there, AIS estimates do not converge

1706: to the correct value of $r$ for this distribution sequence; bridged AIS

1707: estimates do converge, but may be rather inefficient.  We see

1708: analogous behaviour in Figure~\ref{fig-r9} when $q$ is large.  The

1709: mean squared error of the AIS estimates increases approximately

1710: linearly with $q$ over the range $q\!=\!10$ to $q\!=\!30$.  The

1711: bridged AIS estimates also get worse as $q$ increases, but more

1712: slowly.  In contrast, the mean squared error of the LIS estimates

1713: changes hardly at all as $q$ increases.

1714:

1715: The story is similar for sequences with $s\!=\!0.05$ and $t\!=\!0$,

1716: for which the limiting uniform distributions correspond to those in

1717: the first example of Section~\ref{sec-unif}.  The LIS estimates

1718: perform about equally well for all values of $q$, but the AIS

1719: estimates are dramatically worse for large values of $q$.  For this

1720: sequence, reverse AIS estimates are much worse than forward AIS

1721: estimates, and bridging does not help.

1722:

1723: According to the analysis of Section~\ref{sec-unif}, the

1724: choice of $n\!=\!4$ for LIS used above is not optimal for either of

1725: these distribution sequences when $q$ is large.  For the sequence with

1726: $s\!=\!1$ and $t\!=\!4$, using $n\!=\!6$ should be better by a factor

1727: of 1.176.  However, in LIS runs with $q=30$, the mean squared error

1728: using $n\!=\!4$ and $m\!=\!200$ is indistinguishable from that using

1729: $n\!=\!6$ and $m\!=\!143$, given the standard errors (a factor of 1.09

1730: or more should have been detectable).  Of course, $q=30$ does not give

1731: exactly uniform distributions, and these values of $m$ may not be

1732: large enough for the asymptotic results to apply, especially since the

1733: Markov transitions do not sample independently.  For the sequence with

1734: $s\!=\!0.05$ and $t\!=\!0$, the results in Section~\ref{sec-unif}

1735: indicate that using $n\!=\!3$ should be better by a factor of 1.084.

1736: In this case, LIS runs with $q=30$ using $n\!=\!3$ and $m\!=\!250$ are

1737: better than runs using $n\!=\!4$ and $m\!=\!200$ by a factor of 1.16,

1738: significantly greater than one given the standard errors, but not

1739: significantly different from the expected ratio of 1.084.
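
For reference, the two expected factors quoted above follow from the approximate variance formulas for the uniform examples of Section~\ref{sec-unif}, evaluated at fixed $C$.  For the shifted uniform distributions with $t=4$,
\[
  C\,\Var \ \approx\ {2\cdot 5 \over 1}\,\big[4 + 3\cdot 1\big] \ =\ 70
  \ \ \mbox{for $n=4$}, \ \ \ \ \ \
  C\,\Var \ \approx\ {2\cdot 7 \over 2}\,\big[6 + 5/2\big] \ =\ 59.5
  \ \ \mbox{for $n=6$}
\]
giving a ratio of $70/59.5 \approx 1.176$.  For the nested uniform distributions with $s=0.05$,
\[
  C\,\Var \ \approx\ 4\cdot 5\,(20^{1/4}-1) \ \approx\ 22.29
  \ \ \mbox{for $n=4$}, \ \ \ \ \ \
  C\,\Var \ \approx\ 3\cdot 4\,(20^{1/3}-1) \ \approx\ 20.57
  \ \ \mbox{for $n=3$}
\]
giving a ratio of about $1.084$.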

1740:

1741:

1742: \section{\hspace*{-7pt}Other applications of linked

1743:                        sampling}\label{sec-gen}\vspace*{-10pt}

1744:

1745: So far in this paper, I have focused on how Linked Importance Sampling

1746: can be used to estimate ratios of normalizing constants.  LIS can also

1747: be used to estimate expectations with respect to $\pi_1$, however, and

1748: in some applications, this may be its most important use.  Linked

1749: sampling methods related to LIS can also be applied in other ways.  I

1750: briefly describe these other applications here, outlining the use of

1751: linked sampling for `dragging' fast variables in some detail.

1752:

1753:

1754: \subsection{\hspace*{-4pt}Estimating expectations}\vspace*{-4pt}

1755:

1756: The expectation of some function, $a(x)$, with respect to $\pi_1$

1757: can be estimated using simple importance sampling, with points drawn

1758: from $\pi_0$, as follows:

1759: \beq

1760:    E_{\pi_1}\big[a(X)\big]

1761:    \ \ = \ \ E_{\pi_0}\!\left[ a(X) {p_1(X) \over p_0(X)}\right] \, \Big/\

1762:              {Z_1 \over Z_0}

1763:    \ \ \approx\ \

1764: {1 \over N}\sum_{i=1}^N\, a(x^{(i)})\, {p_1(x^{(i)}) \over p_0(x^{(i)})}\ \Big/\

1765: {1 \over N}\sum_{i=1}^N\, {p_1(x^{(i)}) \over p_0(x^{(i)})}

1766: \label{eq-is-exp}

1767: \eeq

1768: where $x^{(1)},\ldots,x^{(N)}$ are drawn from $\pi_0$.

1769: Like equation~(\ref{eq-simple}), this estimate is valid only if

1770: no region having zero probability under $\pi_0$ has non-zero probability

1771: under $\pi_1$.  The two factors of $1/N$ of course cancel, but are included

1772: to emphasize the connection with the estimate for $r=Z_1/Z_0$, which is

1773: simply the denominator of the estimate above.
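To see where the first equality in equation~(\ref{eq-is-exp}) comes from,
one can write the expectations out as integrals (or sums, for discrete $x$).
The following is only a sketch, assuming as elsewhere in this paper that
$\pi_j(x) = p_j(x)/Z_j$, and that $p_0(x)>0$ wherever $p_1(x)>0$:
\beq
   E_{\pi_0}\!\left[ a(X) {p_1(X) \over p_0(X)}\right]
   & = & \int a(x)\, {p_1(x) \over p_0(x)}\, {p_0(x) \over Z_0}\, dx
   \nonumber \\[4pt]
   & = & {Z_1 \over Z_0} \int a(x)\, {p_1(x) \over Z_1}\, dx
   \ \ = \ \ {Z_1 \over Z_0}\ E_{\pi_1}\big[a(X)\big]
   \nonumber
\eeq
Setting $a(x)=1$ gives $E_{\pi_0}[\,p_1(X)/p_0(X)\,] = Z_1/Z_0$, which is why
the denominator of the estimate is itself an estimate of $r$.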

1774:

1775: Since LIS can be viewed as simple importance sampling on an extended

1776: state space, with distributions $\Pi_0$ and $\Pi_1$ defined by the

1777: forward and reverse procedures of Section~\ref{sec-lis}, we can use

1778: equation~(\ref{eq-is-exp}) to estimate any quantity that can be

1779: expressed as an expectation with respect to $\Pi_1$.  Step (1) of the

1780: reverse procedure defining $\Pi_1$ sets $x_{n,\mu_n}$ to a value

1781: randomly chosen from $\pi_{\eta_n} = \pi_1$.  Step (2) then sets the

1782: other $x_{n,k}$ to values obtained from $x_{n,\mu_n}$ by applying

1783: Markov chain transitions that leave $\pi_1$ invariant.  It follows

1784: that under $\Pi_1$, all the points $x_{n,k}$ have marginal

1785: distribution $\pi_1$ (though they may not be independent).  Accordingly,

1786: \beq

1787:   E_{\pi_1}\big[a(X)\big] & = & E_{\,\Pi_1}\!\left[

1788:     {1 \over K_n\!+\!1}\, \sum_{k=0}^{K_n} a(X_{n,k}) \right]

1789: \eeq

1790: Estimating the right side as in equation~(\ref{eq-is-exp}), and using

1791: the fact that the ratio of probabilities under $\Pi_1$ over those

1792: under $\Pi_0$ is given by $\rhatLIS^{(i)}$ in equation~(\ref{eq-lis}),

1793: we get the estimate

1794: \beq

1795:   E_{\pi_1}\big[a(X)\big] & \approx &

1796:   \sum_{i=1}^M {\rhatLIS^{(i)} \over K_n\!+\!1}

1797:      \sum_{k=0}^{K_n} a(x^{(i)}_{n,k})

1798:   \ \Big/\

1799:   \sum_{i=1}^M \rhatLIS^{(i)}

1800: \label{eq-lis-exp}

1801: \eeq

1802:

1803: If the $M$ runs of LIS are started by sampling independently from

1804: $\pi_0$ (as will often be possible), the standard error of this

1805: estimate can be assessed in the usual fashion for importance sampling,

1806: as I have discussed for the analogous AIS estimates in (Neal 2001).

1807: This error assessment can be difficult, since when some

1808: $\rhatLIS^{(i)}$ are much larger than others, the variance of

1809: $\rhatLIS^{(i)}$ is hard to estimate.  Note, however, that the degree

1810: to which the Markov chain transitions used have converged need not be

1811: assessed, a possible advantage compared with simple MCMC estimates.  The

1812: estimate of equation~(\ref{eq-lis-exp}) will be asymptotically correct

1813: (as $M\rightarrow\infty$) regardless of how far these Markov chain

1814: transitions are from convergence.
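As a rough check on the difficulty with unequal weights mentioned above,
one can rewrite equation~(\ref{eq-lis-exp}) in terms of normalized weights
and see how evenly they are spread.  The symbols $w^{(i)}$, $\bar a^{(i)}$,
and $M_{\mbox{\tiny eff}}$ below are introduced only for this illustration,
and the `effective sample size' is a common heuristic for importance sampling
in general, not a result derived in this paper:
\beq
   w^{(i)} & = & \rhatLIS^{(i)} \ \Big/\ \sum_{i'=1}^M \rhatLIS^{(i')},
   \ \ \ \ \ \
   \bar a^{(i)} \ \ = \ \ {1 \over K_n\!+\!1}\, \sum_{k=0}^{K_n} a(x^{(i)}_{n,k})
   \nonumber \\[4pt]
   E_{\pi_1}\big[a(X)\big] & \approx & \sum_{i=1}^M w^{(i)}\, \bar a^{(i)},
   \ \ \ \ \ \
   M_{\mbox{\tiny eff}} \ \ = \ \ 1 \ \Big/\ \sum_{i=1}^M \big(w^{(i)}\big)^2
   \nonumber
\eeq
$M_{\mbox{\tiny eff}}$ equals $M$ when all the $\rhatLIS^{(i)}$ are equal, and
approaches one when a single run dominates.  A small value suggests, but does
not prove, that the estimate and its standard error are unreliable.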

1815:

1816: The primary reason one might wish to use LIS to estimate expectations

1817: is that going through the sequence of distributions parameterized by

1818: $\eta_0,\ldots,\eta_n$ may produce an `annealing' effect, which

1819: prevents the Markov chain sampler from being trapped in a local mode

1820: of the distribution.  Compared with the analogous AIS procedure, LIS

1821: may perform better for some forms of distributions, for the same

1822: reasons as were discussed in Sections~\ref{sec-anal}

1823: and~\ref{sec-cmp}.  One should also note that LIS estimates for

1824: expectations with respect to $\pi_{\eta_j}$ for all $j$ can easily be

1825: obtained from a single set of runs, by simply considering the results

1826: of each LIS run up to the point where the sample for $\pi_{\eta_j}$ is

1827: obtained.

1828:

1829:

1830: \subsection{\hspace*{-4pt}A linked form of tempered transitions}\vspace*{-4pt}

1831:

1832: My `tempered transition' method (Neal 1996) is another approach to

1833: sampling from distributions with isolated modes, between which

1834: movement is difficult for Markov chain transitions such as simple

1835: Metropolis updates.  In this approach, such simple Markov chain

1836: transitions are supplemented by occasional complex `tempered

1837: transitions', composed of many simple Markov chain transitions.  A

1838: tempered transition consists of several stages, which proceed through

1839: a sequence of distributions, from the distribution being sampled, to a

1840: `higher temperature' distribution in which movement between modes is

1841: easier, and then back down to the distribution being sampled.  At each

1842: stage of a tempered transition, we generate a single new state by

1843: applying a Markov chain transition to the current state, after which

1844: we switch to the next distribution in the sequence. The second half of

1845: a tempered transition is similar to an Annealed Importance Sampling

1846: run, while the first half is similar to an AIS run with the reversed

1847: sequence of distributions.

1848:

1849: A similar `linked' procedure can be defined, in which at each stage we

1850: generate a chain of states by applying a Markov chain transition.

1851: We then select a `link state' from this sequence (using a suitable

1852: bridge distribution) which serves as the starting point for the chain

1853: of states generated in the next stage.  In the final stage, a chain of

1854: states is produced using a Markov chain transition that leaves the

1855: distribution being sampled invariant, and a candidate state is

1856: selected uniformly at random from this chain.  The appropriate

1857: probability for accepting this candidate state is computed using

1858: ratios similar to those going into the LIS estimate of

1859: equation~(\ref{eq-lis}).

1860:

1861: As discussed in Section~\ref{sec-cmp}, for AIS to work well, all

1862: distributions in the sequence must assign reasonably high probability

1863: to regions of the space that have non-negligible probability under the

1864: next distribution in the sequence.  One would expect tempered

1865: transitions to work well only when this holds for both the sequence

1866: and its reversal.  In contrast, one would expect the `linked' version

1867: of tempered transitions to work well as long as the sequence satisfies

1868: the weaker condition that there be some `overlap' between adjacent

1869: distributions (assuming a suitable bridge distribution is used).

1870:

1871:

1872: \subsection{\hspace*{-4pt}Dragging fast variables using linked

1873:                           chains}\vspace*{-4pt}

1874:

1875: A slight modification of the tempered transition method can be applied

1876: to problems in which the state is composed of both `fast' and `slow'

1877: variables.  We will write the distribution of interest for such a problem

1878: as

1879: \beq

1880:    \pi(x,y) & = & (1/Z)\, \exp(-U(x,y))

1881: \eeq

1882: where $x$ denotes the `fast' variables and $y$ the `slow' variables.

1883: We assume that the computation is dominated by the time required to

1884: evaluate $U(x,y)$, but that once $U(x,y)$ has been evaluated, with

1885: relevant intermediate quantities saved,

1886: evaluating $U(x',y)$ for any new $x'$ is much faster than evaluating

1887: $U(x',y')$ for some $y'$ not previously encountered.  One example of

1888: such a problem is inference for Gaussian process classification models

1889: (Neal 1999), in which $y$ consists of the hyperparameters defining the

1890: covariance function used, and $x$ consists of the latent variables

1891: associated with the $n$ observations.  After a change to $y$, we must

1892: recompute the Cholesky decomposition of an $n \times n$ covariance matrix,

1893: which takes time proportional to $n^3$, whereas after a change to $x$

1894: only, $U(x,y)$ can be re-computed in time proportional to $n^2$,

1895: assuming the Cholesky decomposition for this value of $y$ has been

1896: saved.

1897:

1898: In my method for `dragging' fast variables (Neal 2004), the ability

1899: to quickly re-evaluate $U(x,y)$ when only $x$ changes is exploited to

1900: allow larger changes to be made to $y$ than would be possible if $x$

1901: were kept fixed, or were given a new value from some simple proposal

1902: distribution.  From the state $(x_0,y_0)$, a dragging

1903: update proposes a new value $y_1$, drawn from some symmetrical proposal

1904: distribution, in conjunction with a new value $x_1$ that is found by

1905: applying a succession of Markov chain updates that leave

1906: invariant distributions in the series, $\pi_{\eta_j}(x)$, for

1907: $j=1,\ldots,n\!-\!1$, with $0<\eta_j<\eta_{j+1}<1$.  The proposed state,

1908: $(x_1,y_1)$, is then accepted or rejected in a fashion analogous to tempered

1909: transitions.

1910:

1911: The distributions in the sequence used are defined by the following

1912: unnormalized probability or density function, which depends on the

1913: current and proposed values for $y$:

1914: \beq

1915:   p_{\eta}(x) & = &

1916:     \exp\,(\,-\,((1\!-\!\eta)\, U(x,y_0)\ +\ \eta\, U(x,y_1)))

1917: \label{eq-drag-p}

1918: \eeq

1919: The corresponding normalized probability or density function will be

1920: written as $\pi_{\eta}$.  Note that $\pi_0(x) = \pi(x|y_0)$ and

1921: $\pi_1(x)=\pi(x|y_1)$.  Crucially,

1922: after $U(x,y_0)$ and $U(x,y_1)$ have been evaluated once (for any~$x$),

1923: we can evaluate $p_{\eta}(x)$ for any $\eta$ and any $x$

1924: without any further `slow' computations.

1925: Indeed, since $U(x_0,y_0)$ will usually have already been evaluated as part of

1926: the previous Markov chain transition, only one slow computation will be required

1927: to evaluate $p_{\eta}(x)$ for any number of values of $\eta$ and $x$.
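One way to see this is to note that equation~(\ref{eq-drag-p}) is a geometric
interpolation between the two end-point functions,
$p_0(x) = \exp(-U(x,y_0))$ and $p_1(x) = \exp(-U(x,y_1))$:
\beq
   p_{\eta}(x) & = & p_0(x)^{1-\eta}\ p_1(x)^{\eta}
   \nonumber \\[4pt]
   \log p_{\eta}(x) & = & -\,(1\!-\!\eta)\, U(x,y_0)\ -\ \eta\, U(x,y_1)
   \nonumber
\eeq
So for a given $x$, once the two energies $U(x,y_0)$ and $U(x,y_1)$ are
available, changing $\eta$ requires only a weighted sum of two numbers.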

1928:

1929: A `linked' dragging update can be defined as follows.  Given

1930: the sequence of distributions defined by $\eta_0,\ldots,\eta_n$, with

1931: $\eta_0=0$ and $\eta_n=1$, the numbers of transitions ($T$ or $\underline{T}$)

1932: to perform for each distribution over $x$, denoted by $K_0,\ldots,K_n$, and a

1933: set of bridge distributions, denoted by $p_{j*j+1}$, for $j=0,\ldots,n\!-\!1$,

1934: an update from the current state $(x_0,y_0)$ is done as follows:\vspace*{5pt}

1935:

1936: \begin{center}\bf The Linked Dragging Procedure\end{center}\vspace*{-5pt}

1937:

1938: \begin{enumerate}

1939: \item[1)] Propose a new value, $y_1$, from some proposal distribution

1940:           $S(y_1|y_0)$, which satisfies the symmetry condition that $S(y_1|y_0)

1941:           =S(y_0|y_1)$.

1942: \item[2)] Pick an integer $\nu_0$ uniformly at random from $\{0,\ldots,K_0\}$,

1943:           and then set $x_{0,\nu_0}$ to the current values of the fast

1944:           variables, $x_0$.

1945: \item[3)] For $j\,=\,0,\ldots,n$, create a chain of values for $x$ associated

1946:           with $\pi_{\eta_j}$ as follows:

1947: \begin{enumerate}

1948:   \item[a)] If $j>0$:\ \ Pick an integer $\nu_j$ uniformly at random from

1949:             $\{0,\ldots,K_j\}$, and then set $x_{j,\nu_j}$ to $x_{j-1*j}$.

1950:   \item[b)] For $k\,=\,\nu_j+1,\ldots,K_j$, draw $x_{j,k}$ according to the

1951:             forward Markov chain transition probabilities

1952:             $T_{\eta_j}(x_{j,k-1},x_{j,k})$.  (If $\nu_j=K_j$, do nothing in

1953:             this step.)

1954:   \item[c)] For $k\,=\,\nu_j-1,\ldots,0$, draw $x_{j,k}$ according to the

1955:             reverse Markov chain transition probabilities

1956:             $\underline{T}_{\eta_j}(x_{j,k+1},x_{j,k})$. (If $\nu_j=0$, do

1957:             nothing in this step.)

1958:   \item[d)] If $j<n$:\ \ Pick a value for $\mu_j$ from

1959:             $\{0,\ldots,K_j\}$ according to the following probabilities

1960:             \beq

1961:               \Pi_0(\mu_j\,|\,x_j) & = &

1962:                  {p_{j*j+1}(x_{j,\mu_j}) \over p_{\eta_j}(x_{j,\mu_j})}

1963:                  \ \Big/\

1964:                  \sum_{k=0}^{K_j} {p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k})}

1965:             \eeq

1966:             and then set $x_{j*j+1}$ to $x_{j,\mu_j}$.

1967: \end{enumerate}

1968: \item[4)] Set $\mu_n$ to a value chosen uniformly at random from

1969:           $\{0,\ldots,K_n\}$, and let the proposed new values for the fast

1970:           variables, $x_1$, be equal to $x_{n,\mu_n}$.

1971: \item[5)] Accept $(x_1,y_1)$ as the new state with probability

1972: \beq

1973:    \min \left\{\, 1,\ \

1974:    \prod_{j=0}^{n-1} \left[

1975:      {1 \over K_j+1}\, \sum_{k=0}^{K_j}\,

1976:           { p_{j*j+1}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) }

1977:      \ \Big/\

1978:      {1 \over K_{j+1}+1}\, \sum_{k=0}^{K_{j+1}}\,

1979:           { p_{j*j+1}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }

1980:      \right]

1981:    \,\right\}

1982: \eeq

1983:           If $(x_1,y_1)$ is not accepted, the new state is the same as

1984:           the old state, $(x_0,y_0)$.\vspace*{-6pt}

1985: \end{enumerate}

1986: One can show that this update leaves $\pi(x,y)$ invariant by showing

1987: that it satisfies detailed balance, which in turn follows from the

1988: stronger property that the probability of starting at $(x_0,y_0)$,

1989: assuming this start state comes from $\pi(x,y)$, then generating the various

1990: quantities produced by the above procedure, and finally accepting $(x_1,y_1)$

1991: as the new state, is the same as the probability of starting this procedure

1992: at $(x_1,y_1)$, generating the same quantities in reverse, and finally accepting

1993: $(x_0,y_0)$. The proof of this is analogous to the derivation of LIS in

1994: Section~\ref{sec-lis}.

1995:

1996: To use the linked dragging procedure, we need to select suitable

1997: bridge distributions.  Since the characteristics of $\pi_{\eta}(x)$

1998: will depend on $y_0$ and $y_1$, and of course $\eta$, we may not know

1999: enough to select good estimates for the values of $r$ needed to use

2000: the optimal bridge of equation~(\ref{eq-opt-bridge}), though we might

2001: try just setting $r$ to one.  This is not a problem for the geometric bridge of

2002: equation~(\ref{eq-geo-bridge}), for which the acceptance probability

2003: above can be written as\vspace*{2pt}

2004: \beq

2005:    \min \left\{\, 1,\ \

2006:    \prod_{j=0}^{n-1} \left[

2007:      {1 \over K_j+1}\, \sum_{k=0}^{K_j}\,

2008:           \sqrt{{ p_{\eta_{j+1}}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) }}

2009:      \ \Big/\

2010:      {1 \over K_{j+1}+1}\, \sum_{k=0}^{K_{j+1}}\,

2011:           \sqrt{{ p_{\eta_j}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }}\,

2012:      \right]

2013:    \,\right\}\\[-10pt]\nonumber

2014: \eeq

2015: From equation~(\ref{eq-drag-p}), we see that

2016: \beq

2017:   { p_{\eta_{j+1}}(x_{j,k}) \over p_{\eta_j}(x_{j,k}) }

2018:   & = & \exp\,(\,-\,(\eta_{j+1}\!-\!\eta_j)\,

2019:                     (U(x_{j,k},y_1)\!-\!U(x_{j,k},y_0))) \\[6pt]

2020:   { p_{\eta_j}(x_{j+1,k}) \over p_{\eta_{j+1}}(x_{j+1,k}) }

2021:   & = & \exp\,(\,-\,(\eta_{j+1}\!-\!\eta_j)\,

2022:                     (U(x_{j+1,k},y_0)\!-\!U(x_{j+1,k},y_1)))

2023: \eeq
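The intermediate step, spelled out here as a sketch for the first of these
ratios (the second follows by exchanging the roles of $\eta_j$ and
$\eta_{j+1}$), is to take logs in equation~(\ref{eq-drag-p}) and subtract:
\beq
   \log p_{\eta_{j+1}}(x) \,-\, \log p_{\eta_j}(x)
   & = & -\,\big[(1\!-\!\eta_{j+1}) - (1\!-\!\eta_j)\big]\, U(x,y_0)
         \ -\ (\eta_{j+1}\!-\!\eta_j)\, U(x,y_1)
   \nonumber \\[4pt]
   & = & -\,(\eta_{j+1}\!-\!\eta_j)\, (U(x,y_1)-U(x,y_0))
   \nonumber
\eeq
With the geometric bridge, every term in the acceptance probability therefore
depends on a state $x$ only through the energy difference $U(x,y_1)-U(x,y_0)$.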

2024: For the simplest case with no intermediate distributions (ie, with $n\!=\!1$),

2025: the acceptance probability simplifies to

2026: \beq

2027:    \min \left\{\, 1,\ \

2028:    { \displaystyle {1 \over K_0+1}\, \sum_{k=0}^{K_0}\,

2029:        \exp\,(\,-\,(U(x_{0,k},y_1)\!-\!U(x_{0,k},y_0))\,/\,2)

2030:      \over

2031:      \displaystyle {1 \over K_1+1}\, \sum_{k=0}^{K_1}\,

2032:        \exp\,(\,-\,(U(x_{1,k},y_0)\!-\!U(x_{1,k},y_1))\,/\,2)

2033:    } \right\}

2034: \eeq
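As a sanity check on this expression (a worked special case added for
illustration, not a recommended setting), consider $K_0=K_1=0$.  Each chain
then consists of a single state, so the procedure gives
$x_{0,0} = x_{0*1} = x_{1,0} = x_0$, and the proposed state is $(x_0,y_1)$.
The ratio in the acceptance probability becomes
\beq
   { \exp\,(\,-\,(U(x_0,y_1)-U(x_0,y_0))\,/\,2)
     \over
     \exp\,(\,-\,(U(x_0,y_0)-U(x_0,y_1))\,/\,2) }
   & = & \exp\,(\,-\,(U(x_0,y_1)-U(x_0,y_0)))
   \ \ = \ \ {\pi(x_0,y_1) \over \pi(x_0,y_0)}
   \nonumber
\eeq
which is the usual Metropolis acceptance ratio for a symmetric proposal that
changes $y$ while keeping $x$ fixed.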

2035:

2036:

2037: \section{\hspace*{-7pt}Conclusions and future work}\vspace*{-10pt}

2038:

2039: In this paper, I have demonstrated that in some situations Linked

2040: Importance Sampling is substantially more efficient than Annealed

2041: Importance Sampling, provided a suitable number of intermediate

2042: distributions are used.  However, in other situations, where the tails

2043: of the distributions involved are sufficiently heavy, the two methods

2044: are about equally efficient.  More research is therefore needed to

2045: determine for which problems of practical interest LIS, and related

2046: linked sampling methods, will be useful.

2047:

2048: In tests on multivariate Gaussian distributions, I have not seen an

2049: advantage for LIS over AIS.  Both perform about equally well on a

2050: sequence of 100-dimensional spherical Gaussian distributions with

2051: variances changing by a factor of two, so that $\log(r) = -100$.  This

2052: is in accord with the results in Section~\ref{sec-cmp}, where LIS had

2053: little or no advantage over AIS when the distributions were Gaussian.

2054: LIS is more likely to be useful for problems involving continuous

2055: distributions with lighter tails.

2056:

2057: One problem that may benefit from LIS is that of computing the

2058: probability of a very rare event, which can be cast as computing the

2059: normalizing constant for a distribution with the constraint that the

2060: state be in the set corresponding to this event.  Intermediate

2061: distributions might use looser forms of this constraint.  If, in all

2062: these distributions, states violating the constraints have zero

2063: probability, AIS will tend to have the same bad behaviour seen with

2064: uniform distributions in Section~\ref{sec-unif}, while LIS may work

2065: much better.

2066:

2067: Another context where LIS may outperform AIS is when only a fixed

2068: number of intermediate distributions are available --- ie, only a

2069: finite number of values are allowed for $\eta$.  This is the situation

2070: for the `sequential importance sampler' of MacEachern, Clyde, and Liu

2071: (1999), which can be seen as an instance of AIS (Neal 2001).  Here,

2072: the intermediate distributions use only a fraction of the $n$ items in

2073: the data set; such a fraction can only have the form $j/n$ with $j$ an

2074: integer.  The distance between successive distributions for this

2075: problem may sometimes be too great for AIS to work well, but their

2076: overlap might nevertheless be sufficient for LIS.

2077:

2078: It may be possible to improve LIS by reducing the variance in how well

2079: it samples at each stage.  Instead of performing a predetermined

2080: number, $K_j$, of Markov transitions at stage $j$, we might instead

2081: perform as many transitions as are necessary to obtain a good sample.

2082: Define a `tour' to be a sequence of transitions that moves from a high

2083: value of some key quantity (eg, $U(x)$ for the canonical distributions

2084: of equation~(\ref{eq-canonical})) to a low value of this quantity, or

2085: vice versa.  Good sampling might be ensured by performing some

2086: predetermined number of tours, with the number of these tours that

2087: occur before and after the link state being chosen at random.

2088: Suitable `high' and `low' values would probably need to be found using

2089: preliminary runs.

2090:

2091: More speculatively, it seems as if there should be some method that

2092: has the advantages of LIS over AIS, but that like AIS uses many

2093: intermediate distributions, performing only a single Markov transition

2094: for each.  Intuitively, it seems that such a `smooth' method that does

2095: not abruptly change $\eta$ should be more efficient.  One can use LIS

2096: with all $K_j$ set to one, but this will produce good results only if

2097: $n$ is large, which we saw in the analysis of Section~\ref{sec-asym}

2098: does not lead to an advantage over AIS.  Perhaps some way could be

2099: found of using states associated with all values of $\eta$ when

2100: estimating each of the ratios $Z_{\eta_{j+1}}/Z_{\eta_j}$, while still

2101: producing an estimate that is exactly unbiased even when the Markov transitions

2102: do not reach equilibrium.

2103:

2104:

2105: \section*{Acknowledgements}\vspace{-10pt}

2106:

2107: This research was supported by the Natural Sciences and Engineering

2108: Research Council of Canada.  I hold a Canada Research Chair in

2109: Statistics and Machine Learning.

2110:

2111:

2112: \section*{References}\vspace{-10pt}

2113:

2114: \leftmargini 0.2in

2115: \labelsep 0in

2116:

2117: \begin{description}

2118: \itemsep 2pt

2119:

2120: \item

2121:   Bennett, C.~H.\ (1976) ``Efficient estimation of free energy differences

2122:   from Monte Carlo data'', {\em Journal of Computational Physics}, vol.~22,

2123:   pp.~245-268.

2124:

2125: \item

2126:   Crooks, G.~E.\ (2000) ``Path-ensemble averages in systems driven far

2127:   from equilibrium'', \textit{Physical Review E}, vol.~61, pp.~2361-2366.

2128:

2129: %\item

2130: %  Crooks, G.~E.\ (1999) \textit{Excursions in Statistical Dynamics},

2131: %  PhD thesis, Chemistry, University of California at

2132: %  Berkeley, available from \texttt{http://threeplusone.com/pubs/GECthesis.html}

2133:

2134: \item

2135:   Gelman, A.\ and Meng, X.-L.\ (1998) ``Simulating normalizing constants:

2136:   From importance sampling to bridge sampling to path sampling'',

2137:   \textit{Statistical Science}, vol.~13, pp.~163-185.

2138:

2139: \item

2140:   Hendrix, D.~A.\ and Jarzynski, C.\ (2001) ``A ``fast growth'' method of

2141:   computing free energy differences'', \textit{Journal of Chemical Physics},

2142:   vol.~114, pp.~5974-5981.

2143:

2144: \item

2145:   Jarzynski, C.\ (1997) ``Nonequilibrium equality for free energy differences'',

2146:   \textit{Physical Review Letters}, vol.~78, pp.~2690-2693.

2147:

2148: \item

2149:   Jarzynski, C.\ (2001) ``A ``fast growth'' method of computing free energy

2150:   differences'', \textit{Journal of Chemical Physics}, vol.~114, pp.~5974-5981.

2151:

2152: %\item

2153: %  Liu, J.~S.\ (2001) \textit{Monte Carlo Strategies in Scientific Computing},

2154: %  Springer-Verlag.

2155:

2156: \item

2157:   Lu, N., Singh, J.~K., and Kofke, D.~A.\ (2003) ``Appropriate methods

2158:   to combine forward and reverse free-energy perturbation averages'',

2159:   \textit{Journal of Chemical Physics}, vol.~118, pp.~2977-2984.

2160:

2161: \item

2162:   MacEachern, S.~N., Clyde, M., and Liu, J.~S.\ (1999) ``Sequential

2163:   importance sampling for nonparametric Bayes models:\ The next generation'',

2164:   \textit{Canadian Journal of Statistics}, vol.~27, pp.~251-267.

2165:

2166: \item

2167:   Meng, X.-L.\ and Wong, H.~W.\ (1996) ``Simulating ratios of normalizing

2168:   constants via a simple identity: A theoretical exploration'',

2169:   \textit{Statistica Sinica}, vol.~6, pp.~831-860.

2170:

2171: \item

2172:   Metropolis, N., Rosenbluth, A.~W., Rosenbluth, M.~N., Teller, A.~H.,

2173:   and Teller, E.\ (1953) ``Equation of state calculations by fast computing

2174:   machines'', {\em Journal of Chemical Physics}, vol.~21, pp.~1087-1092.

2175:

2176: \item

2177:   Neal, R.~M.\ (1993) {\em Probabilistic Inference Using Markov Chain

2178:   Monte Carlo Methods}, Technical Report CRG-TR-93-1, Dept.\

2179:   of Computer Science, University of Toronto, 140 pages.

2180:   Obtainable from \texttt{http://www.cs.utoronto.ca/$\sim$radford/}.

2181:

2182: \item

2183:   Neal, R.~M.\ (1996) ``Sampling from multimodal distributions using tempered

2184:   transitions'', \textit{Statistics and Computing}, vol.~6, pp.~353-366.

2185:

2186: \item

2187:   Neal, R.~M.\ (1999) ``Regression and classification using Gaussian process

2188:       priors'' (with discussion), in J.~M.~Bernardo, {\em et al}

2189:       (editors) {\em Bayesian Statistics 6}, Oxford University Press,

2190:       pp.~475-501.

2191:

2192: \item

2193:   Neal, R.~M.\ (2001) ``Annealed importance sampling'',

2194:   \textit{Statistics and Computing}, vol.~11, pp.~125-139.

2195:

2196: %\item

2197: %  Neal, R.~M.\ (2003) ``Slice sampling'' (with discussion),

2198: %  {\em Annals of Statistics}, vol.~1, pp.~705-767.

2199:

2200: \item

2201:   Neal, R.~M.\ (2004) ``Taking bigger Metropolis steps by dragging fast

2202:   variables'', Technical Report No.~0411, Dept.\ of Statistics, University of

2203:   Toronto, 9 pages.

2204:

2205: \item

2206:   Schervish, M.~J.\ (1995) \textit{Theory of Statistics}, Springer.

2207:

2208: \item

2209:   Shirts, M.~R., Bair, E., Hooker, G., and Pande, V.~S.\ (2003)

2210:   ``Equilibrium free energies from nonequilibrium measurements using

2211:   maximum-likelihood methods'', \textit{Physical Review Letters},

2212:   vol.~91, p.~140601.

2213:

2214: \end{description}

2215:

2216: \end{document}

2217: