0506:physics0506055/pca.tex

1: \documentclass{article}

2:

3: \usepackage{graphicx}

4: \usepackage{psfig}

5: \usepackage{epsfig}

6: \usepackage[round]{natbib}

7:

8: \setlength{\hoffset}{-1in}\setlength{\oddsidemargin}{2.5cm}

9: \setlength{\textwidth}{16cm} \setlength{\voffset}{-1in}

10: %\setlength{\topmargin}{1cm} \setlength{\textheight}{11cm}

11: \setlength{\topmargin}{1cm} \setlength{\textheight}{25cm}

12: \setlength{\unitlength}{1cm}

13:

14: \setlength{\parindent}{0cm}

15:

16: \newcommand{\bx}[1]{\fbox{\begin{minipage}{15.8cm}#1\end{minipage}}}

17:

18: \bibliographystyle{plainnat}

19:

20: \title{

21: Improving on the empirical covariance matrix using truncated PCA with white noise residuals

22:  }

23:

24: \author{Stephen Jewson}

25: \begin{document}

26:

27: \author{Stephen Jewson\footnote{\emph{Correspondence address}: Email: \texttt{x@stephenjewson.com}}\\}

28:

29: \maketitle

30:

31: \begin{abstract}

32: The empirical covariance matrix is not necessarily the best estimator

33: for the population covariance matrix:

34: we describe a simple method which gives better estimates in two examples.

35: The method models the covariance matrix using truncated PCA with white noise residuals.

36: Jack-knife cross-validation is used to find the truncation that maximises the

37: out-of-sample likelihood score.

38: \end{abstract}

39:

40: \section{Introduction}

41:

42: There are many applications in which it is necessary to estimate

43: population covariance matrices from sample data.

44: Our own particular interest is in the statistical modelling of weather data

45: for the valuation of weather-related insurance contracts~\citep{jewsonbz05},

46: but there are other uses in fields as diverse as ecology and pattern recognition.

47: A simple and commonly used estimator for the population covariance matrix is the empirical covariance matrix.

48: However, there seems to be no reason why this should be the best estimator, and

49: we present a recipe that we show generates better estimates in two examples.

50: The recipe is based on principal component analysis (PCA). We apply PCA to the sample

51: data, truncate the series of singular vectors and model the residuals using white noise.

52: The truncation is then varied and the optimal truncation is chosen as that which

53: maximises the out-of-sample likelihood in a jack-knife test.

54: The resulting estimate of the population covariance matrix

55: is a better estimate than the empirical covariance matrix in the

56: sense that it gives higher out-of-sample likelihood scores for the sample data.

57:

58: %Principal component analysis (PCA) is a linear multivariate statistical

59: %Our own current interest in PCA is in statistical modelling, by which we mean taking a

60: %multivariate data set

61: %and fitting a statistical model that can be used to generate a long

62: %series of surrogate data that has similar statistical properties to the

63: %original data set.

64: %One of the questions that arises when PCA

65: %is used for this purpose

66: %is the \emph{truncation} that should be used i.e. how many of the singular vectors

67: %to keep (we will explain what this means in more detail below).

68: %There is also a large literature on this question of how to choose the truncation: see, for instance,

69: %the review of a number of articles by CITE. However, the methods

70: %that have been proposed are strikingly ad-hoc and subjective. For instance,

71: %many depend on statistical testing at an arbitrary confidence interval.

72: %Others are simply rules of thumb, with only rather vague justification.

73: %For our own particular applications we would rather use a method that is

74: %more objective, and to that end we describe what we \emph{think} is a new

75: %method for determining the truncation, based on a simple jack-knife cross-validation

76: %scheme. Given a cost function, this method is completely objective.

77: %The most natural cost function seems to be the likelihood, although in particular

78: %situations other cost functions may be appropriate.

79:

80: In section~\ref{pca} we briefly review PCA,

81: in section~\ref{method} we describe our method for determining the optimal truncation,

82: in section~\ref{example} we give two examples and

83: in section~\ref{summary} we summarise.

84:

85: \section{Principal Component Analysis}

86: \label{pca}

87:

88: Consider a matrix of data $X$ with dimensions $s$ by $t$ and rank $r$.

89: We will think of $s$ and $t$ as representing space and time, but many other

90: interpretations are possible.

91: Mathematically speaking, we know that $r \le \min(s,t)$. Practically speaking, for any

92: genuine observed data, we can usually assume that $r=\min(s,t)$. This is because,

93: for continuously distributed measurements, an exact linear relation between the columns or the rows

94: in $X$ occurs with probability zero (unless one of the columns or rows has deliberately been produced as a linear combination of the others).

95: Such is the typical nature of real measured data.

96:

97: The mathematical theory of singular value decomposition states that every matrix can be decomposed

98: in a certain way, which is unique up to the signs of the singular vectors when the non-zero singular values are distinct. Applying this theory to our matrix $X$ gives:

99:

100: \begin{equation}\label{X=}

101:  X=E \Lambda P^T

102: \end{equation}

103:

104: where $E$ has the dimensions $s$ by $r$, $\Lambda$ has dimensions $r$ by $r$ and $P$ has

105: dimensions $t$ by $r$. By the singular value decomposition theorem these matrices have the

106: following properties (\emph{inter alia}):

107: \begin{itemize}

108:     \item $E^T E=I$

109:     \item $P^T P=I$

110:     \item $\Lambda$ is diagonal

111: \end{itemize}

112:

113: PCA is very closely related to eigenvalue decomposition: $E$ contains the eigenvectors of the

114: covariance matrix $XX^T$, $P$ contains the eigenvectors

115: of the covariance matrix $X^TX$ and the two covariance matrices have the same eigenvalues,

116: which are the diagonal terms of  $\Lambda^2$ (we discuss the relations between PCA

117: and eigenvalue decomposition in a little more detail in~\citet{jewson03x}).

118:
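
To make this relation explicit, the following short derivation uses only equation~\ref{X=} and the three properties listed above. It is a sketch of a standard result and assumes, as in the text, that $XX^T$ and $X^TX$ are taken without any normalisation factor:
\begin{eqnarray*}
 XX^T &=& E \Lambda P^T P \Lambda E^T = E \Lambda^2 E^T \\
 X^TX &=& P \Lambda E^T E \Lambda P^T = P \Lambda^2 P^T
\end{eqnarray*}
Multiplying the first line on the right by $E$, and the second by $P$, gives $XX^T E=E\Lambda^2$ and $X^TX P=P\Lambda^2$. The columns of $E$ and $P$ are therefore eigenvectors, and the diagonal terms of $\Lambda^2$ are the shared eigenvalues.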

119: We can write equation~\ref{X=} in terms of the elements of the matrices as:

120: \begin{equation}\label{x=}

121:  x_{ij}=\sum_{k=1}^r e_{ik} \lambda_k p_{jk}

122: \end{equation}

123:

124: In this form we can see more clearly that we are writing the original data in terms of a sum of $r$

125: rank 1 matrices, each of which is formed as the product of two vectors and a scalar.

126: Since we are thinking of the two dimensions as space and time

127: we can think of the two vectors that make up the $k$'th rank 1 matrix

128: as being a set of weights in space (a spatial pattern $e_{ik}$) and

129: a set of weights in time (a time series $p_{jk}$).

130: The ordering of the rank 1 matrices is arbitrary, but by convention is always taken

131: with the highest values of $\lambda$ first. This has the consequence that the first of the $r$ matrices

132: contains the most variance, the second contains the next-most, and so on.

133: One of the properties of PCA is that the variance accounted for by the first rank 1 matrix

134: is actually the largest possible

135: (among all rank 1 matrices, subject to the orthonormality constraints),

136: and the variance accounted for by the second is the largest possible from the remaining variance.

137:
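
As a worked illustration of this ordering, the fraction of the total sum of squares accounted for by the first $r'$ rank 1 matrices can be written down directly. This is a sketch: the symbol $V(r')$ is introduced here for illustration only, and sums of squares are proportional to variances only if $X$ has been centred. Using the standard identity that the total sum of squares of $X$ equals the sum of the squared singular values,
\[
 \sum_{i=1}^{s}\sum_{j=1}^{t} x_{ij}^2=\sum_{k=1}^{r}\lambda_k^2 ,
 \qquad
 V(r')=\frac{\sum_{k=1}^{r'} \lambda_k^2}{\sum_{k=1}^{r} \lambda_k^2} .
\]
Because the $\lambda_k$ are sorted in decreasing order, $V(r')$ increases with $r'$ in steps that never grow, and reaches 1 at $r'=r$.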

138: There are various adaptations of this basic version of PCA.

139: For instance, the matrix $X$ may be centred and/or standardized prior to deriving the

140: patterns.

141:

142: Given equation~\ref{x=} we can consider approximating the data by truncating

143: the sum to fewer than $r$ of the rank 1 matrices. If we let $r'$ be the number

144: of matrices retained this gives:

145:

146: \begin{equation}\label{x-hat=}

147:  \hat{x}_{ij}=\sum_{k=1}^{r'} e_{ik} \lambda_k p_{jk}

148: \end{equation}

149:

150: This truncation may make sense for two reasons. Firstly, the retained patterns

151: together may account for a large fraction of the total variance, but in only a small

152: number of patterns. PCA can thus act as an efficient way to represent a large fraction of the information in $X$.

153: Secondly, the retained patterns are presumably the more accurately estimated patterns,

154: in a statistical sense. This is useful if the PCA is to be used for simulation or

155: extrapolation of any kind.

156:

157: We will now make the restrictive assumption that the data in $X$ is independent in time, dependent in space and

158: distributed with a multivariate normal distribution.

159: In this case the spatial patterns show structure while the time series are uncorrelated.

160: We wish to generate surrogate data that has the same

161: correlation structure in space as $X$,

162: and this can be done by replacing the time series

163: in expression~\ref{x=} with simulated values:

164:

165: \begin{equation}\label{x-sim=}

166:  x^{sim}_{ij}=\sum_{k=1}^r e_{ik} \lambda_k p^{sim}_{jk}

167: \end{equation}

168:

169: It is easy to show that $x^{sim}$ has the same spatial covariance matrix as the original $x_{ij}$.

170: However, the rank 1 matrices for high values of $k$ are likely to be very poorly estimated,

171: and this may be bad for our simulations.

172: This motivates the idea that we should perhaps truncate the sum and use only the

173: well estimated patterns in the simulation, up to the $r'$'th.

174: There are two problems with this, however:

175: first, that the variance

176: of the resulting simulated data would be lower than the variance of the observations,

177: and second that the rank of

178: the simulated data could be too low (the dimension of the space spanned by the simulated data

179: could be smaller than the dimension of the space spanned by the sample data).

180: This might result in simulations which could never explore the space of possible observations

181: fully, and we find this to be undesirable.

182: These problems can both be corrected by adding appropriate amounts of white noise as `padding'.

183:

184: This gives:

185: \begin{equation}\label{x-sim2=}

186:  x^{sim}_{ij}=\sum_{k=1}^{r'} e_{ik} \lambda_k p^{sim}_{jk}+\sigma_i \epsilon_{ij}

187: \end{equation}

188:

189: where $\epsilon$ is white noise and the $\sigma_i$ are chosen so that the simulations

190: have the correct variance. The lower $r'$, the greater the $\sigma_i$ have to be to make

191: up the full variance.

192:
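
One way to make the choice of $\sigma_i$ concrete is sketched below, in the notation of equation~\ref{x-sim=}. It rests on assumptions that are not stated in the text: that $X$ has been centred in time, that the spatial covariance matrix is normalised as $C=\frac{1}{t}XX^T$, that the simulated time series $p^{sim}_{jk}$ are independent with mean zero and variance $\frac{1}{t}$ (matching $P^TP=I$), and that $\epsilon_{ij}$ is independent of them with mean zero and unit variance. Under these assumptions the covariance of the simulated data between locations $i$ and $i'$ is
\[
 \Sigma_{ii'}(r')=\frac{1}{t}\sum_{k=1}^{r'} e_{ik}\lambda_k^2 e_{i'k}+\sigma_i^2\delta_{ii'} ,
\]
where $\delta_{ii'}$ is 1 if $i=i'$ and 0 otherwise. Requiring the diagonal to match the observed variances, $\Sigma_{ii}(r')=C_{ii}=\frac{1}{t}\sum_{k=1}^{r} e_{ik}^2\lambda_k^2$, gives
\[
 \sigma_i^2=\frac{1}{t}\sum_{k=r'+1}^{r} e_{ik}^2\lambda_k^2 .
\]
With $r'=r$ the noise vanishes and $\Sigma(r)$ is the empirical covariance matrix $C$, which is the sense in which the untruncated simulation has the same spatial covariance matrix as the data. With $r'=0$ the matrix is diagonal, which corresponds to a purely independent model.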

193: Within this setup the question we wish to ask is: how should the truncation $r'$ be chosen?

194:

195:

196:

197: \section{Choosing the truncation}

198: \label{method}

199:

200: The method we propose for choosing the truncation works as follows.

201: As the truncation $r'$ is increased, more information about the correlation structure of $X$

202: is included in the simulations.

203: But more spurious information is also included because the higher order patterns are less well estimated.

204: Because of these competing effects the benefit of increasing

205: $r'$ presumably disappears at some point: we wish to find exactly the value of $r'$ at which

206: this occurs. To do so we use a jack-knife cross-validation technique: we

207: test the extent to which a certain truncation is able to represent

208: data that is outside the sample of data on which the PCA is estimated. This test

209: allows us to compare different truncations in a fair and honest way, and find

210: which performs the best.

211:

212: What cost function should we use for our test? A particular truncation along with the white noise padding

213: is effectively an estimate of the multivariate distribution of $X$.

214: This motivates us to use the standard cost function used for the fitting of distributions in classical statistics,

215: which is the log-likelihood. Given a particular truncation, and the

216: amplitudes of the supplementary white noise, we can calculate the covariance matrix

217: of the multivariate distribution.

218: From this we can calculate the log-likelihood using the standard expression for the density

219: for the multivariate normal with dimension $p$:

220: \begin{equation}

221:  f=\frac{1}{(2\pi)^{\frac{p}{2}} D^\frac{1}{2}} \mbox{exp}\left(-\frac{1}{2}(z-\mu)^T\Sigma^{-1}(z-\mu)\right)

222: \end{equation}

223: where

224: $\Sigma$ is the covariance matrix (size $p$ by $p$),

225: $D$ is the determinant of the covariance matrix (a single number),

226: $z$ is an observation vector of length $p$ and

227: $\mu$ is the mean vector, also of length $p$.

228:

229: The log-density is then:

230: \begin{equation}\label{logf}

231:  \mbox{log}f=-\frac{1}{2}p\mbox{log}(2\pi)

232:                -\frac{1}{2}\mbox{log}D

233:                -\frac{1}{2}(z-\mu)^T\Sigma^{-1}(z-\mu)

234: \end{equation}

235:

236: We will refer to the 2nd and 3rd terms of this equation as the `dispersion term' $(-\frac{1}{2}\mbox{log}D)$

237: and the `standardisation term' $(-\frac{1}{2}(z-\mu)^T\Sigma^{-1}(z-\mu))$.

238: $D$ is a measure of the dispersion in the multivariate distribution:

239: for instance, when $p=1$ we have $D=\sigma^2$, the variance. The dispersion term (which has a negative coefficient)

240: penalizes distributions with a large dispersion.

241: $(z-\mu)^T\Sigma^{-1}(z-\mu)$ is the square of the `z value' or standardised value of the spatial pattern $z-\mu$,

242: in the multivariate normal distribution described by $\Sigma$. If $z-\mu$ is very unlikely

243: in this distribution then this term will be very large. The standardisation term penalizes

244: the distribution if there are many points with large standardised values.

245: The distribution which maximises the log-likelihood is a trade-off between these two effects:

246: the dispersion has to be small, but not so small that the standardised values of the out-of-sample

247: data are too large.

248:
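
The trade-off can be seen explicitly in one dimension. As an illustrative sketch (the symbols $v$, $n$ and $z_j$ are introduced here and do not appear elsewhere in the text), take $p=1$ with mean $\mu$ and variance $v$, and $n$ out-of-sample values $z_1,\ldots,z_n$. Summing equation~\ref{logf} over these values gives
\[
 \sum_{j=1}^{n}\mbox{log}f(z_j)=-\frac{n}{2}\mbox{log}(2\pi)-\frac{n}{2}\mbox{log}\,v-\frac{1}{2v}\sum_{j=1}^{n}(z_j-\mu)^2 .
\]
Setting the derivative with respect to $v$ to zero gives
\[
 -\frac{n}{2v}+\frac{1}{2v^2}\sum_{j=1}^{n}(z_j-\mu)^2=0
 \quad\Rightarrow\quad
 v=\frac{1}{n}\sum_{j=1}^{n}(z_j-\mu)^2 .
\]
Reducing $v$ below this value improves the dispersion term but worsens the standardisation term by more. Increasing $v$ above it improves the standardisation term but worsens the dispersion term by more.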

249: One aspect of using log-likelihood as a cost function is that it rejects a distribution and covariance matrix completely

250: if there is even a single observation that could not have come from the distribution. For instance,

251: if we use truncated PCA without the white noise padding then many of the out-of-sample observations would be

252: impossible, simply because they come from a higher dimensional space. We consider

253: this strict rejection of distributions that do not span the space of the observed data to be desirable.

254:

255:

256: We now summarise our method. For each truncation we run over the data,

257: missing out each time point in turn, applying PCA to the remaining data,

258: truncating at the given level, estimating the amplitude of the supplementary white noise,

259: calculating the covariance matrix for the combination of truncated singular vector series

260: and white noise,

261: and calculating the log-likelihood for the missed data. We combine

262: all the log-likelihoods for a particular truncation to give a single score for that

263: truncation. We then compare these log-likelihood scores across the different truncations to find

264: which truncation is the best at predicting the distribution of the out-of-sample data.

265:
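
In symbols, the procedure can be sketched as follows. The notation is introduced here for illustration, and combining the log-likelihoods by summation is an assumption (the text says only that they are combined). Let $x_j$ be the $j$'th column of $X$, and let $\mu^{(-j)}$ and $\Sigma^{(-j)}(r')$ be the mean and the covariance matrix (truncated PCA plus white noise) estimated from $X$ with column $j$ removed. The score for truncation $r'$ is then
\[
 S(r')=\sum_{j=1}^{t}\mbox{log}f\left(x_j;\mu^{(-j)},\Sigma^{(-j)}(r')\right),
\]
where $\mbox{log}f$ is given by equation~\ref{logf} with $p=s$. The chosen truncation is the value of $r'$ that maximises $S(r')$.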

266: \section{Examples}

267: \label{example}

268:

269: We now give two simple examples of the method described above.

270: They are both motivated by our interest in simulating the risk in weather derivative

271: portfolios, for which we wish to create many thousands of years of surrogate weather data

272: (see chapter 7 in~\cite{jewsonbz05}).

273:

274: In both examples we standardise the data in time before we apply PCA.

275: For the first example $s<t$, while for the second $s>t$.

276: This alters the nature of the problem significantly, as we will see below.

277:

278: \subsection{Example 1: UK temperatures}

279:

280: In our first example we take a matrix $X$ of data consisting of winter average

281: daily average temperatures for 5 UK locations. There are 44 winters of data and

282: so $s=5$ and $t=44$. The rank of the data is 5, and is unaffected by the standardisation, which

283: is only applied in the time dimension.

284: The space of possible spatial

285: patterns, which has dimension 5, can be spanned by the 5 spatial singular vectors

286: if there is no truncation. If there is truncation then this is no longer the case,

287: and a general spatial pattern could not be represented as a linear combination of

288: the remaining spatial singular vectors.

289: The `padding' with white noise solves this problem, as described above.

290:

291: Figure~\ref{f01} shows (minus one times) the log-likelihood versus the truncation for this example.

292: We see that there is a big decrease in the cost function as we move from a purely independent

293: model to one that uses the first singular vector only:

294: we conclude that this data is definitely correlated in space.

295:  There is a much smaller further decrease

296: when the second singular vector is added, and adding further singular vectors beyond the second

297: actually increases the cost function.

298: A truncation to two singular vectors is therefore optimal

299: in this case.

300: Truncations of two, three and four all perform better than using the empirical covariance matrix

301: (which is a truncation of five).

302: The covariance matrix based on all five singular vectors, and the change in the covariance

303: matrix caused by truncation to the first two, are shown below. We see that the changes in the

304: individual covariances are fairly small (below 1\% of the corresponding entry in most cases, and about 4\% at most).

305:

306: \begin{center}

307: \begin{tabular}{|c|c|c|c|c|}

308:   \hline

309:    46.00 &    42.40 &    37.25 &    41.17 &    40.49 \\

310:    42.40 &    46.00 &    38.24 &    42.69 &    41.17 \\

311:    37.25 &    38.24 &    46.00 &    44.04 &    44.46 \\

312:    41.17 &    42.69 &    44.04 &    46.00 &    44.92 \\

313:    40.49 &    41.17 &    44.46 &    44.92 &    46.00 \\

314:   \hline

315: \end{tabular}

316: \end{center}

317:

318: \begin{center}

319: \begin{tabular}{|c|c|c|c|c|}

320:   \hline

321:     0.00 &     1.72 &    -0.35 &     0.39 &    -0.14 \\

322:     1.72 &     0.00 &     0.16 &    -0.22 &     0.27 \\

323:    -0.35 &     0.16 &     0.00 &     0.26 &     0.36 \\

324:     0.39 &    -0.22 &     0.26 &     0.00 &     0.27 \\

325:    -0.14 &     0.27 &     0.36 &     0.27 &     0.00 \\

326:   \hline

327: \end{tabular}

328: \end{center}

329:

330: Going further, we can test whether a truncation of two is \emph{significantly}

331: better than a truncation of one. We will do this using the method we used

332: in~\citet{hallj05b} in which we consider each individual time point of the data and count

333: the number of times each of the two methods beats the other. The resulting

334: test statistic follows a binomial distribution under the null hypothesis

335: that there is no significant difference between the two truncations.

336:

337: The results of this year by year comparison are shown in figures~\ref{f02} and~\ref{f03}.

338: We see that, for every comparison of adjacent truncations,

339: one or the other wins \emph{in every year}. We conclude that the ordering of the

340: results in figure~\ref{f01} is extremely highly significant.

341:
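
To indicate the size of this effect, here is a rough calculation under assumptions that the text does not spell out: that the $t=44$ years are independent, and that under the null hypothesis each of the two truncations being compared is equally likely to win in any given year. The number of wins $W$ for one truncation is then binomial with 44 trials and probability $\frac{1}{2}$, and the two-sided probability of one truncation winning in every year is
\[
 P(W=0)+P(W=44)=2\left(\frac{1}{2}\right)^{44}=2^{-43}\approx 1.1\times 10^{-13} .
\]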

342: We can also try to understand the variations in the log-likelihood score curve shown in figure~\ref{f01}

343: by breaking the curve down into the determinant term (the `dispersion term') and the standardisation term in equation~\ref{logf}.

344: This breakdown is shown in figure~\ref{f04}. We see that, in this case, the shape of the log-likelihood

345: score curve is fixed by the determinant term. Had we known this in advance we

346: could have found the optimum truncation by simply calculating the determinant as a function of

347: truncation. This is a simple in-sample calculation, and much less complex than the full

348: cross-validation calculation. We suspect that it may always be the case

349: that the determinant term dominates when $s<t$, and this possibility seems to merit further investigation.

350: We also suspect that the dominance of the determinant term explains why the breakdown by year

351: gives such clear results.

352:

353: With some trepidation we now attempt to explain the behaviour of the determinant and

354: standardisation curves. The standardisation curve seems to be the easier of the two

355: to understand. For all 6 truncations this term is very small: this means that all of the out-of-sample

356: spatial patterns are quite consistent with the fitted distribution. This is presumably because

357: the out-of-sample patterns live in a 5 dimensional space, and the fitted distributions

358: have significant variance in all of these dimensions.

359: The determinant curve is a little harder to understand. As the truncation increases

360: it shows a decrease and then an increase.

361: The decrease seems to be because as the truncation is increased the degree of specialisation

362: of the model increases. The subsequent increase is presumably because of sampling error

363: on the higher singular vectors.

364:

365: \subsection{Example 2: US temperatures}

366:

367: In our second example we take a matrix $X$ of data consisting of winter average

368: daily average temperatures for 308 US locations. There are 54 winters of data and

369: so $s=308$ and $t=54$. The rank of the data is 53 because of the temporal standardisation.

370: Because $s>t$ we are now

371: in a situation where the space of possible spatial patterns, which has dimension

372: 308, cannot be spanned by the spatial singular vectors, of which there are only 53.

373: Truncation and the white noise padding are therefore essential: this is a case

374: where it seems that we are \emph{guaranteed} to find a better estimate of the covariance

375: matrix than that given by the empirical covariance matrix, because the empirical

376: covariance matrix will immediately fail. In fact, the simple example of a purely independent model

377: (a full-rank diagonal covariance matrix) will always beat the empirical covariance matrix.

378:

379: The likelihood score versus truncation is shown in figure~\ref{f05}.

380: We can only evaluate the likelihood score up to a truncation of 52. This is because

381: the rank of the data is 53, and so the truncation of 53, which has no white noise

382: padding, gives a correlation matrix that cannot be inverted.

383:
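
The reason can be sketched in one line, assuming for illustration that the covariance matrix at a given truncation is the sum of that many rank 1 terms plus a diagonal white noise term, and that the white noise term vanishes at a truncation of 53. The 308 by 308 matrix $\Sigma$ is then a sum of 53 rank 1 matrices, so
\[
 \mbox{rank}(\Sigma)\le 53<308 \quad\Rightarrow\quad D=\det\Sigma=0 ,
\]
and neither $\mbox{log}D$ nor $\Sigma^{-1}$ in equation~\ref{logf} is defined.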

384: We see that the cost function (minus the log-likelihood) gradually reduces as the truncation is increased, up to

385: a truncation of 47. It then rapidly increases to very large values between 47 and 52.

386: 47 is thus the optimum truncation.

387:

388: In figure~\ref{f06} we decompose the log-likelihood curve into determinant

389: and standardization terms. In this case we see that it is the interplay of these two

390: terms that fixes the minimum, and it would not be possible to determine the minimum

391: using the determinant curve alone (which is monotonic).

392:

393: Again, with some trepidation, we attempt to explain the shapes of these two curves.

394: The determinant curve decreases as the truncation increases: we think this is

395: because adding more singular vectors, at the expense of white noise variance,

396: makes the multivariate distribution more specific, i.e.\ it concentrates the

397: variance into fewer dimensions. Ultimately, for a truncation of 53, there is only

398: non-zero variance in 53 of the 308 dimensions (and the correlation matrix

399: is no longer invertible). The standardisation term gradually increases

400: as a result of this specialisation. Then, as the truncation approaches

401: 53, the variance in the other dimensions becomes very small, and the probability

402: of some of the out-of-sample patterns, which come from a 308-dimensional space,

403: becomes very low. At this point the standardisation term becomes very large.

404: We think that this trade-off between the determinant term and the standardisation

405: term is likely to occur whenever $s>t$.

406:

407:

408: \section{Summary}

409: \label{summary}

410:

411: We have investigated a simple approach for making a better estimate of the population covariance

412: matrix than that given by the empirical covariance matrix.

413: The method is based on truncated PCA with white noise residuals.

414: The question of how to truncate PCA has been addressed before, but we introduce

415: a simple new method based on a very straightforward reasoning: we want to choose the truncation

416: so that we maximise the likelihood of out-of-sample data. Finding the best truncation

417: under this definition of optimum is relatively easy. We give two examples, and in both

418: cases we find better estimates of the population covariance matrix than that given by

419: the empirical covariance matrix (where \emph{better} is defined as giving higher

420: out-of-sample likelihood scores).

421:

422: Based on the results from our examples we conclude that using

423: the empirical covariance matrix for statistical modelling

424: may not be a very good

425: idea since the higher order singular vectors tend to be poorly estimated and thus decrease the

426: out-of-sample likelihood.

427: In the $s>t$ case there is the additional problem that the empirical covariance matrix does not

428: describe a space large enough to contain the observations.

429: Optimal truncation with white noise `padding' solves both these problems,

430: and thus may give better modelling results.

431:

432: In some cases, such as the two examples we have used in this study, one of the dimensions of the

433: sample data is a genuine spatial dimension. In this case it may be possible to do even better by

434: modelling the residuals using `red' noise, rather than just white noise. We plan to test this idea next.

435: It would also be interesting to compare our method with other possible methods for improving

436: the estimate of the covariance matrix, such as linear combinations of the empirical covariance

437: matrix with an independent model.

438:

439:

440: \section{Acknowledgements}

441:

442: The author would like to thank Dag Lohmann, Sergio Pezzuli and

443: Christine Ziehmann for interesting discussions on this topic.

444:

445: \section{Legal statement}

446:

447: SJ was employed by RMS at the time that this article was written.

448:

449: However, neither the research behind this article nor the writing

450: of this article were in the course of his employment (where 'in

451: the course of their employment' is within the meaning of the

452: Copyright, Designs and Patents Act 1988, Section 11), nor were

453: they in the course of his normal duties, or in the course of

454: duties falling outside his normal duties but specifically assigned

455: to him (where 'in the course of his normal duties' and 'in the

456: course of duties falling outside his normal duties' are within the

457: meanings of the Patents Act 1977, Section 39). Furthermore the

458: article does not contain any proprietary information or trade

459: secrets of RMS. As a result, the author is the owner of all the

460: intellectual property rights (including, but not limited to,

461: copyright, moral rights, design rights and rights to inventions)

462: associated with and arising from this article. The author reserves

463: all these rights. No-one may reproduce, store or transmit, in any

464: form or by any means, any part of this article without the

465: author's prior written permission. The moral rights of the author

466: have been asserted.

467:

468: The contents of this article reflect the author's personal

469: opinions at the point in time at which this article was submitted

470: for publication. However, by the very nature of ongoing research,

471: they do not necessarily reflect the author's current opinions. In

472: addition, they do not necessarily reflect the opinions of the

473: author's employers.

474:

475: \bibliography{pca}

476:

477: \newpage

478: \begin{figure}[!htb]

479:   \begin{center}

480:     \scalebox{0.8}{\includegraphics{fig1}}

481: %    \scalebox{0.8}{\includegraphics{figs/likelihood}}

482:   \end{center}

483:   \caption{

484: The log-likelihood versus truncation for example 1 described in the text.

485:      }

486:   \label{f01}

487: \end{figure}

488:

489: \newpage

490: \begin{figure}[!htb]

491:   \begin{center}

492:     \scalebox{0.8}{\includegraphics{fig2}}

493: %    \scalebox{0.8}{\includegraphics{figs/likelihood}}

494:   \end{center}

495:   \caption{

496: The log-likelihood on a yearly basis for the six truncations used in example 1.

497:      }

498:   \label{f02}

499: \end{figure}

500:

501: \newpage

502: \begin{figure}[!htb]

503:   \begin{center}

504:     \scalebox{0.8}{\includegraphics{fig3}}

505: %    \scalebox{0.8}{\includegraphics{figs/likelihood}}

506:   \end{center}

507:   \caption{

508: Same as figure~\ref{f02} but with a different scale to clarify the differences between

509: the curves.

510: }

511:   \label{f03}

512: \end{figure}

513:

514: \newpage

515: \begin{figure}[!htb]

516:   \begin{center}

517:     \scalebox{0.8}{\includegraphics{fig4}}

518: %    \scalebox{0.8}{\includegraphics{figs/likelihood}}

519:   \end{center}

520:   \caption{

521: Decomposition of the log-likelihood curve in figure~\ref{f01} into the

522: determinant and standardization terms. We see that the curve in figure~\ref{f01}

523: is completely dominated by the determinant term.

524:     }

525:   \label{f04}

526: \end{figure}

527:

528: \newpage

529: \begin{figure}[!htb]

530:   \begin{center}

531:     \scalebox{0.8}{\includegraphics{fig5}}

532: %    \scalebox{0.8}{\includegraphics{figs/likelihood}}

533:   \end{center}

534:   \caption{

535: The log-likelihood versus truncation for example 2 described in the text,

536: with two different vertical and horizontal scales.

537:     }

538:   \label{f05}

539: \end{figure}

540:

541: \newpage

542: \begin{figure}[!htb]

543:   \begin{center}

544:     \scalebox{0.8}{\includegraphics{fig6}}

545: %    \scalebox{0.8}{\includegraphics{figs/likelihood}}

546:   \end{center}

547:   \caption{

548: Decomposition of the log-likelihood curve in figure~\ref{f05} into the

549: determinant and standardization terms. In this case the curve in figure~\ref{f05}

550: is not dominated by either term, and the minimum in the curve in figure~\ref{f05}

551: arises from interplay between these two terms.

552:     }

553:   \label{f06}

554: \end{figure}

555:

556:

557: \end{document}

558: