1: \documentclass{article}
2:
3: \usepackage{graphicx}
4: \usepackage{psfig}
5: \usepackage{epsfig}
6: \usepackage[round]{natbib}
7:
8: \setlength{\hoffset}{-1in}\setlength{\oddsidemargin}{2.5cm}
9: \setlength{\textwidth}{16cm} \setlength{\voffset}{-1in}
10: %\setlength{\topmargin}{1cm} \setlength{\textheight}{11cm}
11: \setlength{\topmargin}{1cm} \setlength{\textheight}{25cm}
12: \setlength{\unitlength}{1cm}
13:
14: \setlength{\parindent}{0cm}
15:
16: \newcommand{\bx}[1]{\fbox{\begin{minipage}{15.8cm}#1\end{minipage}}}
17:
18: \bibliographystyle{plainnat}
19:
20: \title{
21: Improving on the empirical covariance matrix using truncated PCA with white noise residuals
22: }
23:
25: \begin{document}
26:
27: \author{Stephen Jewson\footnote{\emph{Correspondence address}: Email: \texttt{x@stephenjewson.com}}\\}
28:
29: \maketitle
30:
31: \begin{abstract}
32: The empirical covariance matrix is not necessarily the best estimator
33: for the population covariance matrix:
34: we describe a simple method which gives better estimates in two examples.
35: The method models the covariance matrix using truncated PCA with white noise residuals.
36: Jack-knife cross-validation is used to find the truncation that maximises the
37: out-of-sample likelihood score.
38: \end{abstract}
39:
40: \section{Introduction}
41:
42: There are many applications in which it is necessary to estimate
43: population covariance matrices from sample data.
44: Our own particular interest is in the statistical modelling of weather data
45: for the valuation of weather-related insurance contracts~\citep{jewsonbz05},
46: but there are other uses in fields as diverse as ecology and pattern recognition.
47: A simple and commonly used estimator for the population covariance matrix is the empirical covariance matrix.
48: However, there seems to be no reason why this should be the best estimator, and
49: we present a recipe that we show generates better estimates in two examples.
50: The recipe is based on PCA. We apply PCA to the sample
51: data, truncate the series of singular vectors and model the residuals using white noise.
52: The truncation is then varied and the optimal truncation is chosen as that which
53: maximises the out-of-sample likelihood in a jack-knife test.
54: The resulting estimate of the population covariance matrix
55: is a better estimate than the empirical covariance matrix in the
56: sense that it gives higher out-of-sample likelihood scores for the sample data.
57:
58: %Principal component analysis (PCA) is a linear multivariate statistical
59: %Our own current interest in PCA is in statistical modelling, by which we mean taking a
60: %multivariate data set
61: %and fitting a statistical model that can be used to generate a long
62: %series of surrogate data that has similar statistical properties to the
63: %original data set.
64: %One of the questions that arises when PCA
65: %is used for this purpose
66: %is the \emph{truncation} that should be used i.e. how many of the singular vectors
67: %to keep (we will explain what this means in more detail below).
68: %There is also a large literature on this question of how to choose the truncation: see, for instance,
69: %the review of a number of articles by CITE. However, the methods
70: %that have been proposed are strikingly ad-hoc and subjective. For instance,
71: %many depend on statistical testing at an arbitrary confidence interval.
72: %Others are simply rules of thumb, with only rather vague justification.
73: %For our own particular applications we would rather use a method that is
74: %more objective, and to that end we describe what we \emph{think} is a new
75: %method for determining the truncation, based on a simple jack-knife cross-validation
76: %scheme. Given a cost function, this method is completely objective.
77: %The most natural cost function seems to be the likelihood, although in particular
78: %situations other cost functions may be appropriate.
79:
80: In section~\ref{pca} we briefly review PCA,
81: in section~\ref{method} we describe our method for determining the optimal truncation,
82: in section~\ref{example} we give two examples and
83: in section~\ref{summary} we summarise.
84:
85: \section{Principal Component Analysis}
86: \label{pca}
87:
88: Consider a matrix of data $X$ with dimensions $s$ by $t$ and rank $r$.
89: We will think of $s$ and $t$ as representing space and time, but many other
90: interpretations are possible.
Mathematically speaking, we know that $r \le \min(s,t)$. Practically speaking, for any
genuine observed data, we can usually assume that $r=\min(s,t)$, at least before any centring or standardisation is applied. This is because
it is extremely unlikely that there is an exact linear relation between the columns or the rows
in $X$ (unless one of the columns or rows has deliberately been produced as a linear combination of the others).
Such is the typical nature of real measured data.
96:
The mathematical theory of singular value decomposition states that every matrix can be decomposed
in a certain way, which is unique up to the signs of the singular vectors when the singular values are distinct. Applying this theory to our matrix $X$ gives:
99:
100: \begin{equation}\label{X=}
101: X=E \Lambda P^T
102: \end{equation}
103:
104: where $E$ has the dimensions $s$ by $r$, $\Lambda$ has dimensions $r$ by $r$ and $P$ has
105: dimensions $t$ by $r$. By the singular value decomposition theorem these matrices have the
106: following properties (\emph{inter alia}):
107: \begin{itemize}
108: \item $E^T E=I$
109: \item $P^T P=I$
110: \item $\Lambda$ is diagonal
111: \end{itemize}
112:
113: PCA is very closely related to eigenvalue decomposition: $E$ contains the eigenvectors of the
114: covariance matrix $XX^T$, $P$ contains the eigenvectors
115: of the covariance matrix $X^TX$ and the two covariance matrices have the same eigenvalues,
116: which are the diagonal terms of $\Lambda^2$ (we discuss the relations between PCA
117: and eigenvalue decomposition in a little more detail in~\citet{jewson03x}).
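
The relation to eigenvalue decomposition can be checked directly from equation~\ref{X=} and the orthonormality properties listed above. As a brief sketch (we assume here that $X$ has been centred and that any normalising factor such as $1/t$ is applied separately, although neither is needed for the algebra itself):
\begin{equation}
XX^T = E \Lambda P^T P \Lambda E^T = E \Lambda^2 E^T,
\qquad
X^TX = P \Lambda E^T E \Lambda P^T = P \Lambda^2 P^T
\end{equation}
Multiplying the first expression on the right by $E$, and the second by $P$, gives $XX^TE=E\Lambda^2$ and $X^TXP=P\Lambda^2$, so the columns of $E$ and $P$ are eigenvectors of the two matrices, with the shared eigenvalues $\lambda_k^2$.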
118:
119: We can write equation~\ref{X=} in terms of the elements of the matrices as:
120: \begin{equation}\label{x=}
121: x_{ij}=\sum_{k=1}^r e_{ik} \lambda_k p_{jk}
122: \end{equation}
123:
124: In this form we can see more clearly that we are writing the original data in terms of a sum of $r$
125: rank 1 matrices, each of which is formed as the product of two vectors and a scalar.
126: Since we are thinking of the two dimensions as space and time
127: we can think of the two vectors that make up the $k$'th rank 1 matrix
128: as being a set of weights in space (a spatial pattern $e_{ik}$) and
129: a set of weights in time (a time series $p_{jk}$).
130: The ordering of the rank 1 matrices is arbitrary, but by convention is always taken
131: with the highest values of $\lambda$ first. This has the consequence that the first of the $r$ matrices
132: contains the most variance, the second contains the next-most, and so on.
133: One of the properties of PCA is that the variance accounted for by the first rank 1 matrix
134: is actually the largest possible
135: (among all rank 1 matrices, subject to the orthonormality constraints),
136: and the variance accounted for by the second is the largest possible from the remaining variance.
137:
There are various adaptations of this basic version of PCA.
For instance, the matrix $X$ may be centred and/or standardised prior to deriving the
patterns.
141:
142: Given equation~\ref{x=} we can consider approximating the data by truncating
143: the sum to fewer than $r$ of the rank 1 matrices. If we let $r'$ be the number
144: of matrices retained this gives:
145:
146: \begin{equation}\label{x-hat=}
147: \hat{x}_{ij}=\sum_{k=1}^{r'} e_{ik} \lambda_k p_{jk}
148: \end{equation}
149:
150: This truncation may make sense for two reasons. Firstly, the retained patterns
151: together may account for a large fraction of the total variance, but in only a small
152: number of patterns. PCA can thus act as an efficient way to represent a large fraction of the information in $X$.
153: Secondly, the retained patterns are presumably the more accurately estimated patterns,
154: in a statistical sense. This is useful if the PCA is to be used for simulation or
155: extrapolation of any kind.
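
The fraction of variance retained by a given truncation can be written down explicitly. As a sketch, taking `variance' to mean the total sum of squares of the (centred) data, equation~\ref{X=} and the orthonormality of $E$ and $P$ give
\begin{equation}
\sum_{i=1}^{s}\sum_{j=1}^{t} x_{ij}^2 = \mbox{tr}(XX^T) = \mbox{tr}(E\Lambda^2E^T) = \sum_{k=1}^{r} \lambda_k^2
\end{equation}
so that the fraction retained by the truncation in equation~\ref{x-hat=}, and the sum of squares left in the residuals, are
\begin{equation}
\frac{\sum_{k=1}^{r'} \lambda_k^2}{\sum_{k=1}^{r} \lambda_k^2}
\qquad\mbox{and}\qquad
\sum_{i=1}^{s}\sum_{j=1}^{t} (x_{ij}-\hat{x}_{ij})^2 = \sum_{k=r'+1}^{r} \lambda_k^2
\end{equation}
respectively.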
156:
157: We will now make the restrictive assumption that the data in $X$ is independent in time, dependent in space and
158: distributed with a multivariate normal distribution.
159: In this case the spatial patterns show structure while the time series are uncorrelated.
160: We wish to generate surrogate data that has the same
161: correlation structure in space as $X$,
162: and this can be done by replacing the time series
163: in expression~\ref{x=} with simulated values:
164:
165: \begin{equation}\label{x-sim=}
166: x^{sim}_{ij}=\sum_{k=1}^r e_{ik} \lambda_k p^{sim}_{jk}
167: \end{equation}
168:
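It is easy to show that $x^{sim}_{ij}$ has the same spatial covariance matrix as the original $x_{ij}$,
provided that the simulated time series are uncorrelated with each other and have the same variance as the original time series.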
170: However, the rank 1 matrices for high values of $k$ are likely to be very poorly estimated,
171: and this may be bad for our simulations.
172: This motivates the idea that we should perhaps truncate the sum and use only the
173: well estimated patterns in the simulation, up to the $r'$'th.
174: There are two problems with this, however:
175: first, that the variance
176: of the resulting simulated data would be lower than the variance of the observations,
177: and second that the rank of
178: the simulated data could be too low (the dimension of the space spanned by the simulated data
179: could be smaller than the dimension of the space spanned by the sample data).
180: This might result in simulations which could never explore the space of possible observations
181: fully, and we find this to be undesirable.
182: These problems can both be corrected by adding appropriate amounts of white noise as `padding'.
183:
184: This gives:
185: \begin{equation}\label{x-sim2=}
x^{sim}_{ij}=\sum_{k=1}^{r'} e_{ik} \lambda_k p^{sim}_{jk}+\sigma_i \epsilon_{ij}
187: \end{equation}
188:
189: where $\epsilon$ is white noise and the $\sigma_i$ are chosen so that the simulations
190: have the correct variance. The lower $r'$, the greater the $\sigma_i$ have to be to make
191: up the full variance.
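
One way to make the choice of the $\sigma_i$ explicit is sketched below. The assumptions are ours rather than part of the definition above: the simulated time series $p^{sim}_{jk}$ are taken to be independent, with zero mean and variance $1/t$ (matching the normalisation $P^TP=I$ for centred data), and $\epsilon_{ij}$ is taken to be independent of them with unit variance. The spatial covariance of the simulated data, which we write here as $\Sigma_{r'}$, is then
\begin{equation}
(\Sigma_{r'})_{ii'} = \frac{1}{t}\sum_{k=1}^{r'} e_{ik}\lambda_k^2 e_{i'k} + \sigma_i^2\delta_{ii'}
\end{equation}
where $\delta_{ii'}$ is one if $i=i'$ and zero otherwise. Requiring the diagonal to match the variances given by the untruncated sum in equation~\ref{x-sim=} gives
\begin{equation}
\sigma_i^2 = \frac{1}{t}\sum_{k=r'+1}^{r} e_{ik}^2\lambda_k^2
\end{equation}
For $r'=r$ the padding vanishes and $\Sigma_{r}=\frac{1}{t}XX^T$, which for centred data is the empirical covariance matrix under the same normalisation.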
192:
193: Within this setup the question we wish to ask is: how should the truncation $r'$ be chosen?
194:
195:
196:
197: \section{Choosing the truncation}
198: \label{method}
199:
200: The method we propose for choosing the truncation works as follows.
201: As the truncation $r'$ is increased, more information about the correlation structure of $X$
202: is included in the simulations.
203: But more spurious information is also included because the higher order patterns are less well estimated.
204: Because of these competing effects the benefit of increasing
205: $r'$ presumably disappears at some point: we wish to find exactly the value of $r'$ at which
206: this occurs. To do so we use a jack-knife cross-validation technique: we
207: test the extent to which a certain truncation is able to represent
208: data that is outside the sample of data on which the PCA is estimated. This test
209: allows us to compare different truncations in a fair and honest way, and find
210: which performs the best.
211:
212: What cost function should we use for our test? A particular truncation along with the white noise padding
213: is effectively an estimate of the multivariate distribution of $X$.
214: This motivates us to use the standard cost function used for the fitting of distributions in classical statistics,
215: which is the log-likelihood. Given a particular truncation, and the
216: amplitudes of the supplementary white noise, we can calculate the covariance matrix
217: of the multivariate distribution.
218: From this we can calculate the log-likelihood using the standard expression for the density
219: for the multivariate normal with dimension $p$:
220: \begin{equation}
221: f=\frac{1}{(2\pi)^{\frac{p}{2}} D^\frac{1}{2}} \mbox{exp}\left(-\frac{1}{2}(z-\mu)^T\Sigma^{-1}(z-\mu)\right)
222: \end{equation}
where
$\Sigma$ is the covariance matrix (size $p$ by $p$),
$D$ is the determinant of the covariance matrix (a single number),
$z$ is the observation vector of length $p$ and
$\mu$ is the mean vector of length $p$.
In our setting the distribution is over spatial patterns, and so $p=s$.
228:
229: The log-density is then:
230: \begin{equation}\label{logf}
231: \mbox{log}f=-\frac{1}{2}p\mbox{log}(2\pi)
232: -\frac{1}{2}\mbox{log}D
233: -\frac{1}{2}(z-\mu)^T\Sigma^{-1}(z-\mu)
234: \end{equation}
235:
We will refer to the 2nd and 3rd terms of this equation as the `dispersion term' $(-\frac{1}{2}\mbox{log}D)$, which we will also call the `determinant term',
and the `standardisation term' $(-\frac{1}{2}(z-\mu)^T\Sigma^{-1}(z-\mu))$.
$D$ is a measure of the dispersion in the multivariate distribution:
for instance, when $p=1$ we have $D=\sigma^2$. The dispersion term (which has a negative coefficient)
penalizes distributions with a large dispersion.
$(z-\mu)^T\Sigma^{-1}(z-\mu)$ is the square of the `z value' or standardised value of the spatial pattern $z-\mu$,
in the multivariate normal distribution described by $\Sigma$. If $z-\mu$ is very unlikely
in this distribution then this term will be very large. The standardisation term penalizes
the distribution if there are many points with large standardised values.
The distribution which maximises the log-likelihood is a trade-off between these two effects:
the dispersion has to be small, but not so small that the standardised values of the out-of-sample
data are too large.
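
The trade-off can be seen in the simplest case. As an illustrative sketch (not part of the method itself), take $p=1$, so that $D=\sigma^2$ and equation~\ref{logf} becomes
\begin{equation}
\mbox{log}f=-\frac{1}{2}\mbox{log}(2\pi)-\frac{1}{2}\mbox{log}\sigma^2-\frac{(z-\mu)^2}{2\sigma^2}
\end{equation}
Differentiating with respect to $\sigma^2$ and setting the result to zero gives
\begin{equation}
-\frac{1}{2\sigma^2}+\frac{(z-\mu)^2}{2\sigma^4}=0
\quad\Rightarrow\quad
\sigma^2=(z-\mu)^2
\end{equation}
A smaller $\sigma^2$ improves the dispersion term but is more than offset by the standardisation term, and a larger $\sigma^2$ does the reverse.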
248:
249: One aspect of using log-likelihood as a cost function is that it rejects a distribution and covariance matrix completely
250: if there is even a single observation that could not have come from the distribution. For instance,
251: if we use truncated PCA without the white noise padding then many of the out-of-sample observations would be
252: impossible, simply because they come from a higher dimensional space. We consider
253: this strict rejection of distributions that do not span the space of the observed data to be desirable.
254:
255:
256: We now summarise our method. For each truncation we run over the data,
257: missing out each time point in turn, applying PCA to the remaining data,
258: truncating at the given level, estimating the amplitude of the supplementary white noise,
259: calculating the covariance matrix for the combination of truncated singular vector series
260: and white noise,
261: and calculating the log-likelihood for the missed data. We combine
262: all the log-likelihoods for a particular truncation to give a single score for that
263: truncation. We then compare these log-likelihood scores across the different truncations to find
264: which truncation is the best at predicting the distribution of the out-of-sample data.
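
In symbols, the procedure can be sketched as follows. The notation is introduced here for illustration only: $x_j$ is the $j$'th column of $X$ (the spatial pattern at time $j$), and $\hat{\mu}^{(-j)}$ and $\hat{\Sigma}^{(-j)}_{r'}$ are the mean and the covariance matrix (truncated singular vectors plus white noise) estimated from $X$ with column $j$ removed. We also assume that the individual log-likelihoods are combined by summation, which is the natural choice for data that is independent in time. The score for truncation $r'$ is then
\begin{equation}
S(r')=\sum_{j=1}^{t}\mbox{log}f\left(x_j;\hat{\mu}^{(-j)},\hat{\Sigma}^{(-j)}_{r'}\right)
\end{equation}
with $\mbox{log}f$ as in equation~\ref{logf}, and the chosen truncation is the value of $r'$ that maximises $S(r')$.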
265:
266: \section{Examples}
267: \label{example}
268:
269: We now give two simple examples of the method described above.
270: They are both motivated by our interest in simulating the risk in weather derivative
271: portfolios, for which we wish to create many thousands of years of surrogate weather data
272: (see chapter 7 in~\cite{jewsonbz05}).
273:
274: In both examples we standardise the data in time before we apply PCA.
275: For the first example $s<t$, while for the second $s>t$.
276: This alters the nature of the problem significantly, as we will see below.
277:
278: \subsection{Example 1: UK temperatures}
279:
280: In our first example we take a matrix $X$ of data consisting of winter average
281: daily average temperatures for 5 UK locations. There are 44 winters of data and
282: so $s=5$ and $t=44$. The rank of the data is 5, and is unaffected by the standardisation, which
283: is only applied in the time dimension.
284: The space of possible spatial
285: patterns, which has dimension 5, can be spanned by the 5 spatial singular vectors
286: if there is no truncation. If there is truncation then this is no longer the case,
287: and a general spatial pattern could not be represented as a linear combination of
288: the remaining spatial singular vectors.
289: The `padding' with white noise solves this problem, as described above.
290:
291: Figure~\ref{f01} shows (minus one times) the log-likelihood versus the truncation for this example.
292: We see that there is a big decrease in the cost function as we move from a purely independent
293: model to one that uses the first singular vector only:
294: we conclude that this data is definitely correlated in space.
295: There is a much smaller further decrease
296: when the second singular vector is added, and adding further singular vectors beyond the second
297: actually increases the cost function.
298: A truncation to two singular vectors is therefore optimal
299: in this case.
300: Truncations of two, three and four all perform better than using the empirical covariance matrix
301: (which is a truncation of five).
The covariance matrix based on all five singular vectors, and the change in the covariance
matrix caused by truncation to the first two, are shown below. We see that the changes in the
individual covariances are fairly small (mostly below 1\%, with the largest about 4\%).
305:
306: \begin{center}
307: \begin{tabular}{|c|c|c|c|c|}
308: \hline
309: 46.00 & 42.40 & 37.25 & 41.17 & 40.49 \\
310: 42.40 & 46.00 & 38.24 & 42.69 & 41.17 \\
311: 37.25 & 38.24 & 46.00 & 44.04 & 44.46 \\
312: 41.17 & 42.69 & 44.04 & 46.00 & 44.92 \\
313: 40.49 & 41.17 & 44.46 & 44.92 & 46.00 \\
314: \hline
315: \end{tabular}
316: \end{center}
317:
318: \begin{center}
319: \begin{tabular}{|c|c|c|c|c|}
320: \hline
321: 0.00 & 1.72 & -0.35 & 0.39 & -0.14 \\
322: 1.72 & 0.00 & 0.16 & -0.22 & 0.27 \\
323: -0.35 & 0.16 & 0.00 & 0.26 & 0.36 \\
324: 0.39 & -0.22 & 0.26 & 0.00 & 0.27 \\
325: -0.14 & 0.27 & 0.36 & 0.27 & 0.00 \\
326: \hline
327: \end{tabular}
328: \end{center}
329:
330: Going further, we can test whether a truncation of two is \emph{significantly}
331: better than a truncation of one. We will do this using the method we used
332: in~\citet{hallj05b} in which we consider each individual time point of the data and count
the number of times each of the two methods beats the other. The resulting
test statistic follows a binomial distribution under the null hypothesis
that there is no difference between the two truncations.
336:
337: The results of this year by year comparison are shown in figures~\ref{f02} and~\ref{f03}.
338: We see that, for every comparison of adjacent truncations,
339: one or the other wins \emph{in every year}. We conclude that the ordering of the
340: results in figure~\ref{f01} is extremely highly significant.
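
To give a sense of scale, consider a simple sign test, which we assume here for illustration (the details of the test in~\citet{hallj05b} may differ). If the two truncations were equally good, and the years are treated as independent, the number of years $W$ in which one of them wins would satisfy
\begin{equation}
P(W=w)={t \choose w}\left(\frac{1}{2}\right)^{t}
\end{equation}
With $t=44$ winters the probability that one particular truncation wins in every year is $2^{-44}\approx 5.7\times10^{-14}$. The treatment of years as independent is only approximate, since the jack-knife fits share most of their data.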
341:
We can also try to understand the variations in the log-likelihood score curve shown in figure~\ref{f01}
by breaking the curve down into the determinant and standardisation terms in equation~\ref{logf}.
344: This breakdown is shown in figure~\ref{f04}. We see that, in this case, the shape of the log-likelihood
345: score curve is fixed by the determinant term. Had we known this in advance we
346: could have found the optimum truncation by simply calculating the determinant as a function of
347: truncation. This is a simple in-sample calculation, and much less complex than the full
348: cross-validation calculation. We suspect that it may always be the case
349: that the determinant term dominates when $s<t$, and this possibility seems to merit further investigation.
350: We also suspect that the dominance of the determinant term explains why the breakdown by year
351: gives such clear results.
352:
353: With some trepidation we now attempt to explain the behaviour of the determinant and
354: standardisation curves. The standardisation curve seems to be the easier of the two
355: to understand. For all 6 truncations this term is very small: this means that all of the out-of-sample
356: spatial patterns are quite consistent with the fitted distribution. This is presumably because
357: the out-of-sample patterns live in a 5 dimensional space, and the fitted distributions
358: have significant variance in all of these dimensions.
359: The determinant curve is a little harder to understand. As the truncation increases
360: it shows a decrease and then an increase.
361: The decrease seems to be because as the truncation is increased the degree of specialisation
362: of the model increases. The subsequent increase is presumably because of sampling error
363: on the higher singular vectors.
364:
365: \subsection{Example 2: US temperatures}
366:
367: In our second example we take a matrix $X$ of data consisting of winter average
368: daily average temperatures for 308 US locations. There are 54 winters of data and
369: so $s=308$ and $t=54$. The rank of the data is 53 because of the temporal standardisation.
370: Because $s>t$ we are now
371: in a situation where the space of possible spatial patterns, which has dimension
372: 308, cannot be spanned by the spatial singular vectors, of which there are only 53.
373: Truncation and the white noise padding are therefore essential: this is a case
374: where it seems that we are \emph{guaranteed} to find a better estimate of the covariance
375: matrix than that given by the empirical covariance matrix, because the empirical
376: covariance matrix will immediately fail. In fact, the simple example of a purely independent model
377: (a full-rank diagonal covariance matrix) will always beat the empirical covariance matrix.
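
The failure of the empirical covariance matrix can be made explicit. As a sketch, write its eigenvalues as $\lambda_k^2/t$ (the factor $1/t$ is our assumed normalisation), with $\lambda_k=0$ for $k>r$. Its determinant is then
\begin{equation}
D=\prod_{k=1}^{s}\frac{\lambda_k^2}{t}=0 \qquad \mbox{whenever } r<s
\end{equation}
so $\Sigma^{-1}$ does not exist and the log-density of equation~\ref{logf} cannot be evaluated. Here $r=53$ and $s=308$. A held-out spatial pattern will in general have a component outside the space spanned by the fitted singular vectors, and so has zero likelihood under such a model.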
378:
The cost function (minus one times the log-likelihood) versus truncation is shown in figure~\ref{f05}.
We can only evaluate it up to a truncation of 52. This is because
the rank of the data is 53, and so the truncation of 53, which has no white noise
padding, gives a covariance matrix that cannot be inverted.

We see that the cost function gradually reduces as the truncation is increased, up to
a truncation of 47. It then rapidly increases to very large values between 47 and 52.
47 is thus the optimum truncation.

In figure~\ref{f06} we decompose the curve into determinant
and standardisation terms. In this case we see that it is the interplay of these two
terms that fixes the minimum, and it would not be possible to determine the minimum
using the determinant curve alone (which is monotonic).
392:
393: Again, with some trepidation, we attempt to explain the shapes of these two curves.
394: The determinant curve decreases as the truncation increases: we think this is
395: because adding more singular vectors, at the expense of white noise variance,
396: makes the multivariate distribution more specific i.e. it concentrates the
variance into fewer dimensions. Ultimately, for a truncation of 53, there is only
non-zero variance in 53 of the 308 dimensions (and the covariance matrix
is no longer invertible). The standardisation term gradually increases
as a result of this specialisation. Then, as the truncation approaches
53, the variance in the other dimensions becomes very small, and the probability
of some of the out-of-sample patterns, which come from a 308 dimensional space,
becomes very low. At this point the standardisation term becomes very large.
We think that this trade-off between the determinant term and the standardisation
term is likely to occur whenever $s>t$.
406:
407:
408: \section{Summary}
409: \label{summary}
410:
411: We have investigated a simple approach for making a better estimate of the population covariance
412: matrix than that given by the empirical covariance matrix.
413: The method is based on truncated PCA with white noise residuals.
414: The question of how to truncate PCA has been addressed before, but we introduce
415: a simple new method based on a very straightforward reasoning: we want to choose the truncation
416: so that we maximise the likelihood of out-of-sample data. Finding the best truncation
417: under this definition of optimum is relatively easy. We give two examples, and in both
418: cases we find better estimates of the population covariance matrix than that given by
419: the empirical covariance matrix (where \emph{better} is defined as giving higher
420: out-of-sample likelihood scores).
421:
422: Based on the results from our examples we conclude that using
423: the empirical covariance matrix for statistical modelling
424: may not be a very good
425: idea since the higher order singular vectors tend to be poorly estimated and thus decrease the
426: out-of-sample likelihood.
427: In the $s>t$ case there is the additional problem that the empirical covariance matrix does not
428: describe a space large enough to contain the observations.
429: Optimal truncation with white noise `padding' solves both these problems,
430: and thus may give better modelling results.
431:
432: In some cases, such as the two examples we have used in this study, one of the dimensions of the
433: sample data is a genuine spatial dimension. In this case it may be possible to do even better by
modelling the residuals using `red' noise, rather than just white noise. We plan to test this idea next.
435: It would also be interesting to compare our method with other possible methods for improving
436: the estimate of the covariance matrix, such as linear combinations of the empirical covariance
437: matrix with an independent model.
438:
439:
440: \section{Acknowledgements}
441:
The author would like to thank Dag Lohmann, Sergio Pezzuli and
443: Christine Ziehmann for interesting discussions on this topic.
444:
445: \section{Legal statement}
446:
447: SJ was employed by RMS at the time that this article was written.
448:
However, neither the research behind this article nor the writing
of this article were in the course of his employment (where `in
the course of his employment' is within the meaning of the
Copyright, Designs and Patents Act 1988, Section 11), nor were
they in the course of his normal duties, or in the course of
duties falling outside his normal duties but specifically assigned
to him (where `in the course of his normal duties' and `in the
course of duties falling outside his normal duties' are within the
meanings of the Patents Act 1977, Section 39). Furthermore the
458: article does not contain any proprietary information or trade
459: secrets of RMS. As a result, the author is the owner of all the
460: intellectual property rights (including, but not limited to,
461: copyright, moral rights, design rights and rights to inventions)
462: associated with and arising from this article. The author reserves
463: all these rights. No-one may reproduce, store or transmit, in any
464: form or by any means, any part of this article without the
465: author's prior written permission. The moral rights of the author
466: have been asserted.
467:
468: The contents of this article reflect the author's personal
469: opinions at the point in time at which this article was submitted
470: for publication. However, by the very nature of ongoing research,
471: they do not necessarily reflect the author's current opinions. In
472: addition, they do not necessarily reflect the opinions of the
473: author's employers.
474:
475: \bibliography{pca}
476:
477: \newpage
478: \begin{figure}[!htb]
479: \begin{center}
480: \scalebox{0.8}{\includegraphics{fig1}}
481: % \scalebox{0.8}{\includegraphics{figs/likelihood}}
482: \end{center}
483: \caption{
Minus one times the log-likelihood (the cost function) versus truncation for example 1 described in the text.
485: }
486: \label{f01}
487: \end{figure}
488:
489: \newpage
490: \begin{figure}[!htb]
491: \begin{center}
492: \scalebox{0.8}{\includegraphics{fig2}}
493: % \scalebox{0.8}{\includegraphics{figs/likelihood}}
494: \end{center}
495: \caption{
496: The log-likelihood on a yearly basis for the six truncations used in example 1.
497: }
498: \label{f02}
499: \end{figure}
500:
501: \newpage
502: \begin{figure}[!htb]
503: \begin{center}
504: \scalebox{0.8}{\includegraphics{fig3}}
505: % \scalebox{0.8}{\includegraphics{figs/likelihood}}
506: \end{center}
507: \caption{
508: Same as figure~\ref{f02} but with a different scale to clarify the differences between
509: the curves.
510: }
511: \label{f03}
512: \end{figure}
513:
514: \newpage
515: \begin{figure}[!htb]
516: \begin{center}
517: \scalebox{0.8}{\includegraphics{fig4}}
518: % \scalebox{0.8}{\includegraphics{figs/likelihood}}
519: \end{center}
520: \caption{
Decomposition of the log-likelihood curve in figure~\ref{f01} into the
determinant and standardisation terms. We see that the curve in figure~\ref{f01}
is completely dominated by the determinant term.
524: }
525: \label{f04}
526: \end{figure}
527:
528: \newpage
529: \begin{figure}[!htb]
530: \begin{center}
531: \scalebox{0.8}{\includegraphics{fig5}}
532: % \scalebox{0.8}{\includegraphics{figs/likelihood}}
533: \end{center}
534: \caption{
Minus one times the log-likelihood (the cost function) versus truncation for example 2 described in the text,
with two different vertical and horizontal scales.
537: }
538: \label{f05}
539: \end{figure}
540:
541: \newpage
542: \begin{figure}[!htb]
543: \begin{center}
544: \scalebox{0.8}{\includegraphics{fig6}}
545: % \scalebox{0.8}{\includegraphics{figs/likelihood}}
546: \end{center}
547: \caption{
Decomposition of the log-likelihood curve in figure~\ref{f05} into the
determinant and standardisation terms. In this case the curve in figure~\ref{f05}
is not dominated by either term, and the minimum in the curve in figure~\ref{f05}
arises from interplay between these two terms.
552: }
553: \label{f06}
554: \end{figure}
555:
556:
557: \end{document}
558: