1: % Last changed by Volodya, 17 Oct 2006
2: % Spell checked (UK): 17 Oct 2006
3: % Main message: you can control the number of mistakes.
4: % 1957 lines, 79 KB
5: 
6: \newif\ifJOURNAL
7: \JOURNALfalse
8: \newif\ifWP
9: \WPfalse
10: \newif\ifarXiv
11: \arXivfalse
12: 
13: %\JOURNALtrue			% choose JOURNAL, WP, or arXiv
14: %\WPtrue
15: \arXivtrue
16: 
17: \newif\ifnotJOURNAL	% derivative conditional
18: \notJOURNALtrue
19: \ifJOURNAL\notJOURNALfalse\fi
20: 
21: \newif\ifLATIN		% LATIN means that the Cyrillic references should be set in Latin
22: 
23: \ifJOURNAL
24:   \documentclass{cja4}
25: 
26:   %%the optional argument is used to get times font instead of CMR
27:   %\documentclass[mathtime]{cja4}
28: 
29:   \copyrightyear{2006}
30:   \vol{00}
31:   \issue{0}
32:   \DOI{000}
33:   \usepackage{amsmath,amsfonts,latexsym,graphicx}
34:   \LATINfalse
35: \fi
36: 
37: \ifWP
38:   \documentclass[toc]{kpnsarticle}
39:   \usepackage{amsmath,amsfonts,latexsym,graphicx,epsfig}
40:   \LATINfalse
41: \fi
42: 
43: \ifarXiv
44:   \documentclass{article}
45:   \usepackage{amsmath,amsfonts,latexsym,graphicx}
46:   \LATINtrue
47: \fi
48: 
49: \newif\ifnotLATIN	% derivative conditional
50: \notLATINtrue
51: \ifLATIN\notLATINfalse\fi
52: 
53: \emergencystretch=5mm
54: \tolerance=400
55: \allowdisplaybreaks[3]
56: %\input{hyphenation.txt}
57: 
58: \ifnotLATIN
59:   \usepackage{CJK}
60:   \input{OT2enc.def}
61:   \newenvironment{cyr}
62:   {\fontencoding{OT2}\fontfamily{wncyr}\fontseries{m}\fontshape{n}\selectfont}
63:   {\fontencoding{OT1}\fontfamily{tir}\selectfont}
64: \fi
65: 
66: \newcommand{\bbbr}{{\mathbb{R}}}
67: \newcommand{\bbbn}{{\mathbb{N}}}
68: \newcommand{\st}{:}
69: \newcommand{\given}{\mathbin{|}}
70: 
71: \newlength{\picturewidth}
72: \ifJOURNAL
73:   \setlength{\picturewidth}{0.98\columnwidth}
74: \fi
75: \ifnotJOURNAL
76:   \setlength{\picturewidth}{0.72\columnwidth}
77: \fi
78: 
79: \newcommand{\E}{{\bf E}}
80: 
81: \newcommand{\bbbe}{{\mathbb{E}}}		% expected value
82: \newcommand{\Expect}{\mathop{\bbbe}\nolimits}
83: 
84: \newcommand{\Err}{\mathop{{\rm Err}}\nolimits}
85: \newcommand{\err}{\mathop{{\rm err}}\nolimits}
86: 
87: \newcommand{\Mult}{\mathop{{\rm Mult}}\nolimits}
88: \newcommand{\mult}{\mathop{{\rm mult}}\nolimits}
89: 
90: \newcommand{\Emp}{\mathop{{\rm Emp}}\nolimits}
91: \newcommand{\emp}{\mathop{{\rm emp}}\nolimits}
92: 
93: \ifnotJOURNAL
94:   \newtheorem{lemma}{Lemma}
95:   \newtheorem{proposition}{Proposition}
96:   \newtheorem{corollary}{Corollary}
97:   \newtheorem{theorem}{Theorem}
98:   \newenvironment{proof}
99:     {\trivlist\item[\hskip\labelsep\textbf{Proof}]}
100:     {\endtrivlist}
101: \fi
102: 
103: \newenvironment{remark*}
104:   {\trivlist\item[\hskip\labelsep{\bfseries Remark}]\relax}
105:   {\endtrivlist}
106: \newenvironment{definition*}
107:   {\trivlist\item[\hskip\labelsep{\bfseries Definition}]\relax}
108:   {\endtrivlist}
109: 
110: \ifWP
111:   \title{Hedging Predictions in Machine Learning}
112:   \author{Alexander Gammerman and Vladimir Vovk}
113:   \newcommand{\No}{2}
114:   %For the two dates option: uncomment the next 2 lines
115:   %\twodatestrue
116:   %\newcommand{\firstposted}{November 2, 2006}
117: \fi
118: 
119: \ifarXiv
120:   \title{Hedging Predictions in Machine Learning}
121:   \author{Alexander Gammerman and Vladimir Vovk\\
122:       Computer Learning Research Centre\\
123:       Department of Computer Science\\
124:       Royal Holloway, University of London\\
125:       Egham, Surrey TW20 0EX, UK\\
126:       \texttt{\{alex,vovk\}@cs.rhul.ac.uk}}
127: \fi
128: 
129: \begin{document}
130: \ifJOURNAL
131:   \title[Hedging Predictions]{Hedging Predictions\\in Machine Learning}
132:   % {\large preliminary draft, 28 April 2006}}
133:   \author{Alexander Gammerman}
134:   \author{Vladimir Vovk}
135:   \affiliation{Computer Learning Research Centre,
136:     Royal Holloway, University of London\\
137:     Egham, Surrey TW20 0EX}
138:   \email{\{alex,vovk\}@cs.rhul.ac.uk}
139: 
140:   \shortauthors{A.~Gammerman and V.~Vovk}
141: 
142:   \received{00 Month 2006}
143:   \revised{00 Month 2006}
144: \fi
145: 
146: \ifnotJOURNAL
147:   \maketitle
148: \fi
149: 
150: \begin{abstract}
151:   Recent advances in machine learning make it possible
152:   to design efficient prediction algorithms for data sets with huge numbers of parameters.
153:   This paper describes a new technique for ``hedging'' the predictions
154:   output by many such algorithms,
155:   including support vector machines, kernel ridge regression, kernel nearest neighbours,
156:   and by many other state-of-the-art methods.
157:   The hedged predictions for the labels of new objects
158:   include quantitative measures of their own accuracy and reliability.
159:   These measures are provably valid under the assumption of randomness,
160:   traditional in machine learning:
161:   the objects and their labels are assumed to be generated independently
162:   from the same probability distribution.
163:   In particular, it becomes possible to control (up to statistical fluctuations)
164:   the number of erroneous predictions by selecting a suitable confidence level.
165:   Validity being achieved automatically,
166:   the remaining goal of hedged prediction is efficiency:
167:   taking full account of the new objects' features
168:   and other available information to produce as accurate predictions as possible.
169:   This can be done successfully using the powerful machinery of modern machine learning.
170: \end{abstract}
171: 
172: \ifJOURNAL
173:   \keywords{Classification, confidence, induction, learning, prediction, randomness, regression, transduction}
174: 
175:   \maketitle
176: \fi
177: 
178: \section{Introduction}
179: \label{sec:introduction}
180: 
181: % 1. Successes of machine learning:
182: %    prediction under only one assumption (randomness)
183: %    kernel methods: high-dimensional data
184: % 2. Weak point: no confidence, or loose bounds, or strong assumptions (Bayesian)
185: % 3. Advantages of conformal prediction
186: % 4. Contents of this paper
187: 
188: The two main varieties of the problem of prediction,
189: classification and regression,
190: % I talk about classification and regression
191: % since prediction is often associated with the Kalman filter,
192: % which is not covered in this paper
193: % (because it works outside the randomness assumption)
194: are standard subjects in statistics and machine learning.
195: The classical classification and regression techniques
196: can deal successfully with conventional small-scale, low-dimensional data sets;
197: however, attempts to apply these techniques to modern high-dimensional and high-throughput data sets
198: encounter serious conceptual and computational difficulties.
199: Several new techniques,
200: first of all support vector machines \cite{vapnik:1995,vapnik:1998}
201: and other kernel methods,
202: have been developed in machine learning recently
203: with the explicit goal of dealing with high-dimensional data sets
204: % kernel methods: we do not need to process many attributes explicitly
205: with large numbers of objects.
206: % at some point we can discard all elements that are not support vectors
207: 
208: A typical drawback of the new techniques is the lack of useful measures of confidence
209: in their predictions.
210: For example, some of the tightest upper bounds of the popular PAC theory
211: on the probability of error exceed~1 even for relatively clean data sets
212: (\cite{vovk/etal:2005}, p.~249).
213: This paper describes an efficient way to ``hedge'' the predictions
214: produced by the new and traditional machine-learning methods,
215: i.e., to complement them with measures of their accuracy and reliability.
216: Appropriately chosen,
217: not only are these measures valid and informative,
218: but they also take full account of the special features
219: of the object to be predicted.
220: 
221: We call our algorithms for producing hedged predictions ``conformal predictors'';
222: they are formally introduced in Section \ref{sec:conformal}.
223: Their most important property is the automatic validity under the randomness assumption
224: (to be discussed shortly).
225: Informally, validity means that conformal predictors never overrate
226: the accuracy and reliability of their predictions.
227: This property, stated in Sections \ref{sec:conformal} and \ref{sec:on-line},
228: is formalized in terms of finite data sequences,
229: without any recourse to asymptotics.
230: 
231: The claim of validity of conformal predictors
232: depends on an assumption that is shared by many other algorithms in machine learning,
233: which we call the assumption of randomness:
234: the objects and their labels are assumed to be generated independently
235: from the same probability distribution.
236: Admittedly, this is a strong assumption,
237: and areas of machine learning are emerging
238: that rely on other assumptions
239: (such as the Markovian assumption of reinforcement learning;
240: see, e.g., \cite{sutton/barto:1998})
241: or dispense with any stochastic assumptions altogether
242: (competitive on-line learning;
243: see, e.g., \cite{cesabianchi/lugosi:2006,vovk:2001}).
244: It is, however, much weaker than assuming a parametric statistical model,
245: sometimes complemented with a prior distribution on the parameter space,
246: which is customary in the statistical theory of prediction.
247: And taking into account the strength of the guarantees that can be proved
248: under this assumption,
249: it does not appear overly restrictive.
250: 
251: So we know that conformal predictors tell the truth.
252: Clearly, this is not enough:
253: truth can be uninformative and so useless.
254: We will refer to various measures of informativeness of conformal predictors
255: as their ``efficiency''.
256: As conformal predictors are provably valid,
257: efficiency is the only thing we need to worry about
258: when designing conformal predictors
259: for solving specific problems.
260: Virtually any classification or regression algorithm
261: can be transformed into a conformal predictor,
262: and so most of the arsenal of methods of modern machine learning
263: can be brought to bear on the design of efficient conformal predictors.
264: 
265: We start the main part of the paper, in Section \ref{sec:ideal},
266: with the description of an idealized predictor
267: based on Kolmogorov's algorithmic theory of randomness.
268: This ``universal predictor'' produces the best possible hedged predictions
269: but, unfortunately, is noncomputable.
270: We can, however, set ourselves the task of approximating the universal predictor
271: as well as possible.
272: 
273: In Section \ref{sec:conformal} we formally introduce the notion of conformal predictors
274: and state a simple result about their validity.
275: In that section we also briefly describe results of computer experiments
276: demonstrating the methodology of conformal prediction.
277: 
278: In Section \ref{sec:Bayesian} we consider an example demonstrating
279: how conformal predictors react to the violation of our model
280: of the stochastic mechanism generating the data
281: (within the framework of the randomness assumption).
282: If the model coincides with the actual stochastic mechanism,
283: we can construct an optimal conformal predictor,
284: which turns out to be almost as good as the Bayes-optimal confidence predictor
285: (the formal definitions will be given later).
286: When the stochastic mechanism significantly deviates from the model,
287: conformal predictions remain valid but their efficiency inevitably suffers.
288: The Bayes-optimal predictor starts producing very misleading results
289: which superficially look as good as when the model is correct.
290: 
291: In Section \ref{sec:on-line} we describe the ``on-line'' setting
292: of the problem of prediction,
293: and in Section \ref{sec:slow} contrast it with the more standard ``batch'' setting.
294: The notion of validity introduced in Section \ref{sec:conformal}
295: is applicable to both settings,
296: but in the on-line setting it can be strengthened:
we can now prove that the percentage of erroneous predictions
will be close, with high probability,
to one minus the chosen confidence level.
300: For the batch setting,
301: the stronger property of validity for conformal predictors
302: remains an empirical fact.
303: In Section \ref{sec:slow} we also discuss limitations of the on-line setting
304: and introduce new settings intermediate between on-line and batch.
305: To a large degree,
306: conformal predictors still enjoy the stronger property of validity
307: for the intermediate settings.
308: 
309: Section \ref{sec:induction-transduction} is devoted
310: to the discussion of the difference between two kinds of inference from empirical data,
311: induction and transduction
312: (emphasized by Vladimir Vapnik \cite{vapnik:1995,vapnik:1998}).
313: Conformal predictors belong to transduction,
314: but combining them with elements of induction
315: can lead to a significant improvement in their computational efficiency
316: (Section \ref{sec:ICP}).
317: 
318: We show how some popular methods of machine learning
319: can be used as underlying algorithms for hedged prediction.
320: We do not give the full description of these methods
321: and refer the reader to the existing readily accessible descriptions.
322: This paper is, however, self-contained in the sense
323: that we explain all features of the underlying algorithms
324: that are used in hedging their predictions.
325: We hope that the information we provide will enable the reader
326: to apply our hedging techniques
327: to their favourite machine-learning methods.
328: 
329: \section{Ideal hedged predictions}
330: \label{sec:ideal}
331: 
332: % Algorithmic randomness and idealized conformal predictors
333: % (interesting objects for math research)
334: 
335: The most basic problem of machine learning is perhaps the following.
336: We are given a \emph{training set} of \emph{examples}
337: \begin{equation}\label{eq:training-set}
338:   (x_1,y_1),\ldots,(x_l,y_l),
339: \end{equation}
340: each example $(x_i,y_i)$, $i=1,\ldots,l$, consisting of an \emph{object} $x_i$
341: (typically, a vector of attributes)
342: and its label $y_i$;
343: the problem is to predict the label $y_{l+1}$
344: of a new object $x_{l+1}$.
345: Two important special cases are where the labels are known \emph{a priori}
346: to belong to a relatively small finite set
347: (the problem of \emph{classification})
348: and where the labels are allowed to be any real numbers
349: (the problem of \emph{regression}).
350: 
351: The usual goal of classification is to produce a prediction $\hat y_{l+1}$
352: that is likely to coincide with the true label $y_{l+1}$,
353: and the usual goal of regression is to produce a prediction $\hat y_{l+1}$
354: that is likely to be close to the true label $y_{l+1}$.
355: In the case of classification,
356: our goal will be to complement the prediction $\hat y_{l+1}$
357: with some measure of its reliability.
358: In the case of regression,
359: we would like to have some measure of accuracy and reliability of our prediction.
360: There is a clear trade-off between accuracy and reliability:
361: we can improve the former by relaxing the latter
362: and vice versa.
363: We are looking for algorithms that achieve the best possible trade-off
364: and for a measure that would quantify the achieved trade-off.
365: 
Let us start with the case of classification.
367: The idea is to try every possible label $Y$ as a candidate for $x_{l+1}$'s label
368: and see how well the resulting sequence
369: \begin{equation}\label{eq:completion}
370:   (x_1,y_1),\dots,(x_l,y_l),(x_{l+1},Y)
371: \end{equation}
372: conforms to the randomness assumption
373: (if it does conform to this assumption, we will say that it is ``random'';
374: this will be formalized later in this section).
375: The ideal case is where all $Y$s but one lead to sequences (\ref{eq:completion})
376: that are not random;
377: we can then use the remaining $Y$ as a confident prediction for $y_{l+1}$.
378: 
379: In the case of regression,
380: we can output the set of all $Y$s that lead to random (\ref{eq:completion})
381: as our ``prediction set''.
382: An obvious obstacle is that the set of all possible $Y$s is infinite
383: and so we cannot go through all the $Y$s explicitly,
384: but we will see in the next section that there are ways to overcome this difficulty.
385: 
386: We can see that the problem of hedged prediction
387: is intimately connected with the problem of testing randomness.
388: Different versions of the ``universal'' notion of randomness
389: were defined by Kolmogorov, Martin-L\"of and Levin (see, e.g., \cite{li/vitanyi:1997})
390: based on the existence of universal Turing machines.
391: Adapted to our current setting,
392: Martin-L\"of's definition is as follows.
393: Let $\mathbf{Z}$ be the set of all possible examples;
394: as each example consists of an object and a label,
395: $\mathbf{Z}=\mathbf{X}\times\mathbf{Y}$,
396: where $\mathbf{X}$ is the set of all possible objects
397: and $\mathbf{Y}$, $\left|\mathbf{Y}\right|>1$, is the set of all possible labels.
398: We will use $\mathbf{Z}^*$ as the notation for all finite sequences of examples.
399: A function $t:\mathbf{Z}^*\to[0,1]$
400: is a \emph{randomness test} if
401: \begin{enumerate}
402: \item
403:   for all $\epsilon\in(0,1)$, all $n\in\{1,2,\dots\}$
404:   and all probability distributions $P$ on $\mathbf{Z}$,
405:   \begin{equation}\label{eq:test-validity}
406:     P^n
407:     \left\{
408:       z\in\mathbf{Z}^n
409:       \st
410:       t(z)\le\epsilon
411:     \right\}
412:     \le
413:     \epsilon;
414:   \end{equation}
415: \item
416:   $t$ is upper semicomputable.
417: \end{enumerate}
418: The first condition means that the randomness test is required to be valid:
419: if, for example, we observe $t(z)\le1\%$ for our data set $z$,
420: then either the data set was not generated independently from the same probability distribution $P$
421: or a rare (of probability at most 1\%, under any $P$) event has occurred.
422: The second condition means that
423: we should be able to compute the test, in a weak sense
424: (we cannot require computability in the usual sense,
425: since the universal test can only be upper semicomputable:
426: it can work forever to discover \emph{all} patterns in the data sequence
427: that make it non-random).
428: Martin-L\"of (developing Kolmogorov's earlier ideas) proved
429: that there exists a smallest, to within a constant factor,
430: randomness test.
431: 
432: Let us fix a smallest randomness test,
433: call it the \emph{universal test},
434: and call the value it takes on a data sequence
435: the \emph{randomness level} of this sequence.
436: A random sequence is one whose randomness level is not small;
437: this is rather informal,
438: but it is clear that for finite data sequences we cannot have a clear-cut division
439: of all sequences into random and non-random
440: (like the one defined by Martin-L\"of \cite{martin-lof:1966} for infinite sequences).
441: If $t$ is a randomness test, not necessarily universal,
442: the value that it takes on a data sequence will be called
443: the \emph{randomness level detected by} $t$.
444: 
445: \begin{remark*}
446:   The word ``random'' is used in (at least) two different senses in the existing literature.
447:   In this paper we need both but, luckily,
448:   the difference does not matter within our current framework.
449:   First, randomness can refer to the assumption that the examples
450:   are generated independently from the same distribution;
451:   this is the origin of our ``assumption of randomness''.
452:   Second, a data sequence is said to be random with respect to a statistical model
453:   if the universal test (a generalization of the notion of universal test as defined above)
454:   does not detect any lack of conformity between the two.
  Since the only statistical model of interest to us in this paper
456:   is the one embodying the assumption of randomness,
457:   we have a perfect agreement between the two senses.
458: \end{remark*}
459: 
460: \subsection*{Prediction with Confidence and Credibility}
461: 
462: Once we have a randomness test $t$, universal or not,
463: we can use it for hedged prediction.
464: There are two natural ways to package the results
465: of such predictions:
466: in this subsection we will describe the way that can only be used
467: in classification problems.
468: If the randomness test is not computable,
469: we can imagine an oracle answering questions about its values.
470: 
471: Given the training set (\ref{eq:training-set}) and the test object $x_{l+1}$,
472: we can act as follows:
473: \begin{itemize}
474: \item
475:   consider all possible values $Y\in\mathbf{Y}$
476:   for the label $y_{l+1}$;
477: \item
478:   find the randomness level detected by $t$ for every possible completion (\ref{eq:completion});
479: \item
480:   predict the label $Y$ corresponding to a completion
481:   with the largest randomness level detected by $t$;
482: \item
483:   output as the \emph{confidence} in this prediction
484:   one minus the second largest randomness level detected by $t$;
485: \item
486:   output as the \emph{credibility} of this prediction
487:   the randomness level detected by $t$
488:   of the output prediction $Y$
489:   (i.e., the largest randomness level detected by $t$ over all possible labels).
490: \end{itemize}
491: To understand the intuition behind confidence,
492: let us tentatively choose a conventional ``significance level'', such as $1\%$.
493: (In the terminology of this paper, this corresponds to a ``confidence level'' of $99\%$,
494: i.e.,
495: $100\%$ minus $1\%$.)
496: If the confidence in our prediction is $99\%$ or more
497: and the prediction is wrong,
498: the actual data sequence belongs to an \emph{a priori} chosen
499: set of probability at most $1\%$
500: (the set of all data sequences with randomness level detected by $t$
501: not exceeding $1\%$).
502: 
503: Intuitively, low credibility means that
504: either the training set is non-random
505: or the test object is not representative of the training set
506: (say, in the training set we have images of digits
507: and the test object is that of a letter).
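
The procedure above is easy to express in code
once the randomness levels detected by $t$ are available.
The following Python sketch is only an illustration:
the function name \texttt{hedge\_classification} and the numerical values
are our assumptions, not part of any existing library,
and the test $t$ itself is taken as given
(its values are simply passed in, one per candidate label).

```python
from typing import Dict, Hashable, Tuple


def hedge_classification(
    levels: Dict[Hashable, float],
) -> Tuple[Hashable, float, float]:
    """Return (prediction, confidence, credibility).

    `levels` maps each candidate label Y to the randomness level
    detected by the test for the completion ending in (x_{l+1}, Y).
    """
    if len(levels) < 2:
        raise ValueError("need at least two candidate labels")
    # Rank the labels by decreasing randomness level.
    ranked = sorted(levels.items(), key=lambda item: item[1], reverse=True)
    (prediction, largest), (_, second_largest) = ranked[0], ranked[1]
    confidence = 1.0 - second_largest  # one minus the second largest level
    credibility = largest              # the largest level
    return prediction, confidence, credibility


# Hypothetical randomness levels for three candidate labels.
levels = {"a": 0.62, "b": 0.02, "c": 0.004}
prediction, confidence, credibility = hedge_classification(levels)
print(prediction, confidence, credibility)
```

With these illustrative numbers the label \texttt{a} is predicted
with confidence $98\%$, because the best competing label has level $2\%$,
and credibility $62\%$.
Ties between labels are broken in favour of the label listed first;
this is a choice of the sketch, not something prescribed by the procedure.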
508: 
509: \subsection*{Confidence Predictors}
510: 
511: In regression problems,
512: confidence, as defined in the previous subsection,
513: is not a useful quantity:
514: it will typically be equal to 0.
515: A better approach is to choose a range of confidence levels $1-\epsilon$,
516: and for each of them specify a \emph{prediction set}
517: $\Gamma^{\epsilon}\subseteq\mathbf{Y}$,
518: the set of labels deemed possible at the confidence level $1-\epsilon$.
519: We will always consider nested prediction sets:
520: $\Gamma^{\epsilon_1}\subseteq\Gamma^{\epsilon_2}$ when $\epsilon_1\ge\epsilon_2$.
521: A \emph{confidence predictor} is a function
522: that maps each training set, each new object, and each confidence level $1-\epsilon$
523: (formally, we allow $\epsilon$ to take any value in $(0,1)$)
524: to the corresponding prediction set $\Gamma^{\epsilon}$.
For the confidence predictor to be \emph{valid}, the probability that the true label
526: will fall outside the prediction set $\Gamma^{\epsilon}$ should not exceed $\epsilon$,
527: for each $\epsilon$.
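
In symbols, the requirement of validity can be sketched as follows.
The notation $\Gamma^{\epsilon}(x_1,y_1,\ldots,x_l,y_l,x_{l+1})$,
which makes explicit the dependence of the prediction set on the data,
is introduced here only for illustration.
For examples generated independently from a probability distribution $P$ on $\mathbf{Z}$,
\[
  P^{l+1}
  \left\{
    (x_1,y_1,\ldots,x_{l+1},y_{l+1})
    \st
    y_{l+1}\notin\Gamma^{\epsilon}(x_1,y_1,\ldots,x_l,y_l,x_{l+1})
  \right\}
  \le
  \epsilon,
\]
for all $\epsilon\in(0,1)$ and all $P$.
For instance, at the confidence level $95\%$ (so that $\epsilon=5\%$)
the true label may fall outside $\Gamma^{5\%}$ with probability at most $5\%$.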
528: 
529: We might, for example, choose the confidence levels 99\%, 95\% and 80\%,
530: and refer to the 99\% prediction set $\Gamma^{1\%}$ as the highly confident prediction,
531: to the 95\% prediction set $\Gamma^{5\%}$ as the confident prediction,
532: and to the 80\% prediction set $\Gamma^{20\%}$ as the casual prediction.
533: Figure \ref{fig:predset} shows how such a family of prediction sets might look
534: in the case of a rectangular label space $\mathbf{Y}$.
535: The casual prediction pinpoints the target quite well,
536: but we know that this kind of prediction can be wrong with probability 20\%.
537: The confident prediction is much bigger.
538: If we want to be highly confident
539: (make a mistake only with probability 1\%),
540: we must accept an even lower accuracy;
541: there is even a completely different location that we cannot rule out
542: at this level of confidence.
543: % In principle, a confidence predictor outputs prediction sets
544: % for all confidence levels, and these sets are nested,
545: % as in the figure above.
546: 
547: \begin{figure}
548:   \centering
549:   \makebox{\includegraphics[width=\picturewidth,clip=true]{predset.eps}}
550:   \caption{\label{fig:predset}An example of a nested family of prediction sets
551:     (casual prediction in black,
552:     confident prediction in dark grey,
553:     and highly confident prediction in light grey).}
554: \end{figure}
555: 
556: Given a randomness test, again universal or not,
557: we can define the corresponding confidence predictor as follows:
558: for any confidence level $1-\epsilon$,
559: the corresponding prediction set consists of the $Y$s
560: such that the randomness level of the completion (\ref{eq:completion})
561: detected by the test is greater than $\epsilon$.
562: The condition (\ref{eq:test-validity}) of validity for statistical tests
563: implies that a confidence predictor defined in this way
564: is always valid.
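
For a finite label space this definition can be sketched directly in Python.
The function name \texttt{prediction\_set} and the numbers below are hypothetical;
the randomness levels are assumed to have been computed already by some test $t$,
one for each candidate label.

```python
from typing import Dict, Hashable, Set


def prediction_set(levels: Dict[Hashable, float], epsilon: float) -> Set[Hashable]:
    """Labels whose completion has detected randomness level greater than epsilon."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    return {label for label, level in levels.items() if level > epsilon}


# Hypothetical randomness levels for three candidate labels.
levels = {"a": 0.62, "b": 0.08, "c": 0.02}
highly_confident = prediction_set(levels, 0.01)  # confidence level 99%
confident = prediction_set(levels, 0.05)         # confidence level 95%
casual = prediction_set(levels, 0.20)            # confidence level 80%

# Lowering epsilon can only add labels, so the sets are nested.
assert casual <= confident <= highly_confident
print(sorted(casual), sorted(confident), sorted(highly_confident))
```

The nesting of the prediction sets is automatic in this construction:
a label whose level exceeds a larger $\epsilon$ also exceeds any smaller one.
For regression the label space is infinite,
so the enumeration over labels used in this sketch is not available.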
565: 
566: The confidence predictor based on the universal test
567: (the \emph{universal confidence predictor})
568: is an interesting object for mathematical investigation
569: (see, e.g., \cite{vovk/etal:1999}, Section 4),
570: but it is not computable and so cannot be used in practice.
571: Our goal in the following sections will be
572: to find computable approximations to it.
573: 
574: \section{Conformal Prediction}
575: \label{sec:conformal}
576: 
577: % Practical approximation: conformal prediction (universal for invariant predictors)
578: 
579: In the previous section we explained how randomness tests
580: can be used for prediction.
581: The connection between testing and prediction is, of course, well understood
and has been discussed at length by philosophers \cite{popper:1934}
583: and statisticians
584: (see, e.g., the textbook \cite{cox/hinkley:1974}, Section 7.5).
585: % In fact, this connection is two-way,
586: % so we do not lose anything basing our predictions on testing.
587: In this section we will see how some popular prediction algorithms
588: can be transformed into randomness tests
589: and, therefore, be used for producing hedged predictions.
590: 
591: Let us start with the most successful recent development in machine learning,
592: support vector machines
593: (\cite{vapnik:1995,vapnik:1998},
594: with a key idea going back
595: to the generalized portrait method \cite{vapnik/chervonenkis:1974}).
596: Suppose the label space is $\mathbf{Y}=\{-1,1\}$
597: (we are dealing with the binary classification problem).
598: With each set of examples
599: \begin{equation}\label{eq:set}
600:   (x_1,y_1),
601:   \ldots,
602:   (x_n,y_n)
603: \end{equation}
604: one associates an optimization problem
605: whose solution produces nonnegative numbers $\alpha_1,\ldots,\alpha_n$
606: (``Lagrange multipliers'').
607: These numbers determine the prediction rule used by the support vector machine
608: (see \cite{vapnik:1998}, Chapter 10, for details),
609: but they also are interesting objects in their own right.
610: Each $\alpha_i$, $i=1,\ldots,n$, tells us
611: how ``strange'' an element of the set (\ref{eq:set})
612: the corresponding example $(x_i,y_i)$ is.
613: If $\alpha_i=0$, $(x_i,y_i)$ fits (\ref{eq:set}) very well
614: (in fact so well that such examples are uninformative,
615: and the support vector machine ignores them when making predictions).
616: The elements with $\alpha_i>0$ are called \emph{support vectors},
and a large value of $\alpha_i$ indicates
618: that the corresponding $(x_i,y_i)$ is an outlier.
619: % It is customary to impose an upper bound $C$ on the values of $\alpha_i$,
620: % one reason being to prevent the outliers affecting too much the prediction
621: % (the other to delimit the search space).
622: 
623: Taking the completion (\ref{eq:completion}) as (\ref{eq:set})
624: (so that $n=l+1$),
625: we can find the corresponding $\alpha_1,\ldots,\alpha_{l+1}$.
626: If $Y$ is different from the actual label $y_{l+1}$,
627: we expect $(x_{l+1},Y)$ to be an outlier in (\ref{eq:completion})
and so $\alpha_{l+1}$ to be large as compared with $\alpha_1,\ldots,\alpha_l$.
629: A natural way to compare $\alpha_{l+1}$ to the other $\alpha$s
630: is to look at the ratio
631: \begin{equation}\label{eq:p}
632:   p_Y
633:   :=
634:   \frac
635:   {
636:     \left|
637:       \{i=1,\ldots,l+1 \st \alpha_i\ge\alpha_{l+1}\}
638:     \right|
639:   }
640:   {l+1},
641: \end{equation}
642: which we call the \emph{p-value} associated with the possible label $Y$ for $x_{l+1}$.
643: In words, the p-value is the proportion of the $\alpha$s
644: which are at least as large as the last $\alpha$.
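
The ratio (\ref{eq:p}) can be computed in a few lines.
The Python sketch below assumes that the strangeness values
$\alpha_1,\ldots,\alpha_{l+1}$ have already been obtained
(for example, from a support vector machine trained on the completion)
and are passed in as a sequence whose last element is $\alpha_{l+1}$;
the function name \texttt{p\_value} and the numbers are ours.

```python
from typing import Sequence


def p_value(alphas: Sequence[float]) -> float:
    """Proportion of strangeness values at least as large as the last one."""
    if not alphas:
        raise ValueError("need at least one strangeness value")
    last = alphas[-1]
    return sum(1 for a in alphas if a >= last) / len(alphas)


# Hypothetical strangeness values for l = 4 training examples plus the test example.
print(p_value([0.0, 0.3, 0.0, 1.2, 0.7]))  # two of the five values are >= 0.7
print(p_value([0.0, 0.3, 0.0, 1.2, 0.0]))  # all five values are >= 0.0
```

The last $\alpha$ is always counted in the numerator,
so the p-value is never smaller than $1/(l+1)$.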
645: 
646: The methodology of support vector machines
647: (as described in \cite{vapnik:1995,vapnik:1998})
648: is directly applicable
only to binary classification problems,
650: but the general case can be reduced to the binary case
651: by the standard ``one-against-one'' or ``one-against-the-rest'' procedures.
652: This allows us to define the strangeness values $\alpha_1,\ldots,\alpha_{l+1}$
653: for general classification problems
654: (see \cite{vovk/etal:2005}, p.~59, for details),
655: which in turn determine the p-values (\ref{eq:p}).
656: 
657: The function that assigns to each sequence (\ref{eq:completion})
658: the corresponding p-value, defined by (\ref{eq:p}),
659: is a randomness test
660: (this will follow from Theorem \ref{thm:on-line}
661: stated in Section \ref{sec:on-line} below).
662: Therefore, the p-values,
663: which are our approximations to the corresponding randomness levels,
664: can be used for hedged prediction
665: as described in the previous section.
666: For example, if the p-value $p_{-1}$ is small while $p_1$ is not small,
667: we can predict $1$ with confidence $1-p_{-1}$ and credibility $p_1$.
668: Typical credibility will be 1:
669: for most data sets the percentage of support vectors is small
670: (\cite{vapnik:1998}, Chapter 12),
671: and so we can expect $\alpha_{l+1}=0$ when $Y=y_{l+1}$.
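
The claim that the p-values satisfy the validity condition (\ref{eq:test-validity})
can be checked by brute force in a tiny case.
The following Python sketch is a sanity check, not a proof,
and rests on a simplifying assumption that we state explicitly:
each strangeness value is attached to its example,
so that reordering the examples merely reorders the values,
and every ordering is taken to be equally likely.
The function names and the numbers are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def p_value(alphas):
    """Exact proportion of strangeness values at least as large as the last one."""
    last = alphas[-1]
    return Fraction(sum(1 for a in alphas if a >= last), len(alphas))


def prob_p_value_at_most(scores, epsilon):
    """Probability that p <= epsilon when all orderings of `scores` are equally likely."""
    orderings = list(permutations(scores))
    hits = sum(1 for order in orderings if p_value(order) <= epsilon)
    return Fraction(hits, len(orderings))


# Hypothetical strangeness values; the second list contains ties.
for scores in ([0, 1, 2, 3, 4], [0, 0, 1, 1, 2]):
    for epsilon in (Fraction(1, 10), Fraction(1, 5), Fraction(1, 2), Fraction(9, 10)):
        # The probability of a p-value not exceeding epsilon is at most epsilon.
        assert prob_p_value_at_most(scores, epsilon) <= epsilon
```

Exact rational arithmetic is used so that the comparison with $\epsilon$
is not affected by rounding.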
672: 
673: \begin{remark*}
674:   When the order of examples is irrelevant,
675:   we refer to the data set (\ref{eq:set}) as a set,
676:   although as a mathematical object it is a multiset rather than a set
677:   since it can contain several copies of the same example.
678:   We will continue to use this informal terminology
679:   (to be completely accurate,
  we would have to say ``data multiset'' instead of ``data set''!).
681: \end{remark*}
682: 
683: % [This in fact demonstrate the $\mathbf{X}$ is large,
684: % not that it is high-dimensional]
685: % Already this data set can be used to illustrate the high-dimensional character
686: % of many modern data sets.
687: % Each object (handwritten digit) is a $16\times16$ grey-scale matrix,
688: % with 31 shades of grey,
689: % so there are $31^{16 \times 16}$ (approximately $10^{381}$)
690: % possible objects.
691: % This greatly exceeds the number of objects in the USPS data set, which is 9298.
692: 
693: % Several kernels are used.
694: % The results show that the method works well in predicting classifications;
695: % in addition, of course,
696: % the method also provides valid and practically useful confidence information,
697: % in sharp contrast with typical PAC error bounds
698: % (valid but not useful)
699: % and Bayesian methods
700: % (usually not valid).
701: 
702: \ifJOURNAL
703: \begin{table*}
704: \processtable{Selected test examples from the USPS data set:
705:   the p-values of digits (0--9), true and predicted labels,
706:   and confidence and credibility values.\label{tab:examples}}
707: %\begingroup\tiny
708: {\footnotesize\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
709: \hline 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 &
710:   \vbox{\hbox{\strut true}\hbox{\strut label}} &
711:   \vbox{\hbox{\strut pre-}\hbox{\strut diction}} &
712:   \vbox{\hbox{\strut confi-}\hbox{\strut dence}} &
713:   \vbox{\hbox{\strut credi-}\hbox{\strut bility}}\\
714: \hline 0.01\% & 0.11\% & 0.01\% & 0.01\% & 0.07\% & 0.01\% & 100\% & 0.01\% & 0.01\% & 0.01\%
715:    & 6 & 6 & 99.89\% & 100\%\\
716: \hline 0.32\% & 0.38\% & 1.07\% & 0.67\% & 1.43\% & 0.67\% & 0.38\% & 0.33\% & 0.73\% & 0.78\%
717:    & 6 & 4 & 98.93\% & 1.43\%\\
718: \hline 0.01\% & 0.27\% & 0.03\% & 0.04\% & 0.18\% & 0.01\% & 0.04\% & 0.01\% & 0.12\% & 100\%
719:  & 9 & 9 & 99.73\% & 100\%\\
720: %\hline 100\% & 0.03\% & 0.01\% & 0.01\% & 0.04\% & 0.01\% & 0.01\% & 0.01\% & 0.01\% & 0.01\%
721: % & 0 & 0 & 99.96\% & 100\%\\
722: %\hline 0.04\% & 0.30\% & 0.05\% & 0.38\% & 0.29\% & 0.01\% & 0.08\% & 0.07\% & 0.40\% & 0.22\%
723: % & 2 & 8 & 99.62\% & 0.40\%\\
724: %\hline 0.01\% & 0.22\% & 0.03\% & 0.55\% & 0.16\% & 0.04\% & 0.03\% & 0.01\% & 0.04\% & 0.05\%
725: % & 3 & 3 & 99.78\% & 0.55\%\\
726: %\hline 0.04\% & 0.32\% & 0.10\% & 2.06\% & 0.29\% & 2.98\% & 0.04\% & 0.07\% & 0.37\% & 0.34\%
727: % & 3 & 5 & 97.94\% & 2.98\%\\
728: %\hline 0.30\% & 0.49\% & 0.43\% & 0.36\% & 1.28\% & 0.51\% & 0.29\% & 0.21\% & 0.38\% & 1.19\%
729: % & 4 & 4 & 98.81\% & 1.28\%\\
730: %\hline 0.01\% & 0.04\% & 0.01\% & 0.01\% & 0.03\% & 0.01\% & 0.01\% & 0.01\% & 0.01\% & 100\%
731: % & 9 & 9 & 99.96\% & 100\%\\
732: %\hline 0.01\% & 0.32\% & 0.04\% & 0.01\% & 0.26\% & 100\% & 0.01\% & 0.05\% & 0.11\% & 0.18\%
733: % & 5 & 5 & 99.68\% & 100\%\\
734: %\hline 0.41\% & 0.44\% & 0.27\% & 2.07\% & 0.70\% & 1.87\% & 0.23\% & 0.29\% & 0.44\% & 0.80\%
735: % & 5 & 3 & 98.13\% & 2.07\%\\
736: \hline
737: \end{tabular}}{}
738: %\endgroup
739: \end{table*}
740: \fi
741: 
742: \ifnotJOURNAL
743: \begin{table*}
744: \caption{Selected test examples from the USPS data set:
745:   the p-values of digits (0--9), true and predicted labels,
746:   and confidence and credibility values.\label{tab:examples}}
747: 
748: \medskip
749: 
750: {\tiny\hspace{-12mm}\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
751: \hline 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 &
752:   \vbox{\hbox{\strut true}\hbox{\strut label}} &
753:   \vbox{\hbox{\strut pre-}\hbox{\strut diction}} &
754:   \vbox{\hbox{\strut confi-}\hbox{\strut dence}} &
755:   \vbox{\hbox{\strut credi-}\hbox{\strut bility}}\\
756: \hline 0.01\% & 0.11\% & 0.01\% & 0.01\% & 0.07\% & 0.01\% & 100\% & 0.01\% & 0.01\% & 0.01\%
757:    & 6 & 6 & 99.89\% & 100\%\\
758: \hline 0.32\% & 0.38\% & 1.07\% & 0.67\% & 1.43\% & 0.67\% & 0.38\% & 0.33\% & 0.73\% & 0.78\%
759:    & 6 & 4 & 98.93\% & 1.43\%\\
760: \hline 0.01\% & 0.27\% & 0.03\% & 0.04\% & 0.18\% & 0.01\% & 0.04\% & 0.01\% & 0.12\% & 100\%
761:  & 9 & 9 & 99.73\% & 100\%\\
762: %\hline 100\% & 0.03\% & 0.01\% & 0.01\% & 0.04\% & 0.01\% & 0.01\% & 0.01\% & 0.01\% & 0.01\%
763: % & 0 & 0 & 99.96\% & 100\%\\
764: %\hline 0.04\% & 0.30\% & 0.05\% & 0.38\% & 0.29\% & 0.01\% & 0.08\% & 0.07\% & 0.40\% & 0.22\%
765: % & 2 & 8 & 99.62\% & 0.40\%\\
766: %\hline 0.01\% & 0.22\% & 0.03\% & 0.55\% & 0.16\% & 0.04\% & 0.03\% & 0.01\% & 0.04\% & 0.05\%
767: % & 3 & 3 & 99.78\% & 0.55\%\\
768: %\hline 0.04\% & 0.32\% & 0.10\% & 2.06\% & 0.29\% & 2.98\% & 0.04\% & 0.07\% & 0.37\% & 0.34\%
769: % & 3 & 5 & 97.94\% & 2.98\%\\
770: %\hline 0.30\% & 0.49\% & 0.43\% & 0.36\% & 1.28\% & 0.51\% & 0.29\% & 0.21\% & 0.38\% & 1.19\%
771: % & 4 & 4 & 98.81\% & 1.28\%\\
772: %\hline 0.01\% & 0.04\% & 0.01\% & 0.01\% & 0.03\% & 0.01\% & 0.01\% & 0.01\% & 0.01\% & 100\%
773: % & 9 & 9 & 99.96\% & 100\%\\
774: %\hline 0.01\% & 0.32\% & 0.04\% & 0.01\% & 0.26\% & 100\% & 0.01\% & 0.05\% & 0.11\% & 0.18\%
775: % & 5 & 5 & 99.68\% & 100\%\\
776: %\hline 0.41\% & 0.44\% & 0.27\% & 2.07\% & 0.70\% & 1.87\% & 0.23\% & 0.29\% & 0.44\% & 0.80\%
777: % & 5 & 3 & 98.13\% & 2.07\%\\
778: \hline
779: \end{tabular}}{}
780: %\endgroup
781: \end{table*}
782: \fi
783: 
784: Table~\ref{tab:examples} illustrates the results of hedged prediction
785: for a popular data set of hand-written digits
786: called the USPS data set \cite{lecun/etal:1990}.
The data set contains 9298 digits, each represented as a $16\times16$ matrix of pixels;
788: it is divided into a training set of size 7291 and a test set of size 2007.
789: For several test examples the table shows
790: the p-values for each possible label, the actual label,
791: the predicted label, confidence, and credibility,
792: computed using the support vector method with the polynomial kernel of degree 5.
793: To interpret the numbers in this table,
794: remember that high (i.e., close to 100\%) confidence
795: means that all labels except the predicted one are unlikely.
796: If, say, the first example were predicted wrongly,
797: this would mean that a rare event (of probability less than 1\%) had occurred;
798: therefore, we expect the prediction to be correct (which it is).
799: In the case of the second example,
800: confidence is also quite high (more than 95\%),
801: but we can see that the credibility is low (less than 5\%).
802: From the confidence we can conclude that the labels other than 4
803: are excluded at level 5\%,
804: but the label 4 itself is also excluded at the level 5\%.
805: This shows that the prediction algorithm
806: was unable to extract from the training set enough information
807: to allow us to confidently classify this example:
808: the strangeness of the labels different from 4 may be due
809: to the fact that the object itself is strange;
810: perhaps the test example is very different from all examples in the training set.
811: Unsurprisingly, the prediction for the second example is wrong.
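
The way the last three columns of Table~\ref{tab:examples}
follow from the ten p-values can be sketched in a few lines of Python.
This is only an illustration:
it assumes that the prediction is the label with the largest p-value,
the confidence is one minus the second largest p-value,
and the credibility is the largest p-value;
the function name \texttt{summarize} is ours.

```python
def summarize(p_values):
    """Return (prediction, confidence, credibility) from a dict label -> p-value.

    Assumption for this sketch: p-values are given in percent, as in the table.
    """
    # Sort the labels by decreasing p-value.
    ranked = sorted(p_values, key=p_values.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    confidence = 100.0 - p_values[runner_up]  # every other label is excluded at this level
    credibility = p_values[best]              # how typical the best completion looks
    return best, confidence, credibility


# Second row of the table (p-values in percent for the digits 0..9).
row2 = dict(enumerate([0.32, 0.38, 1.07, 0.67, 1.43, 0.67, 0.38, 0.33, 0.73, 0.78]))
prediction, confidence, credibility = summarize(row2)
print(prediction, round(confidence, 2), round(credibility, 2))
```

For the second row this reproduces the prediction 4
with confidence 98.93\% and credibility 1.43\%.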
812: 
813: In general,
814: high confidence shows that all alternatives
815: to the predicted label are unlikely.
816: Low credibility means that the whole situation is suspect;
817: as we have already mentioned, we will obtain a very low credibility
818: if the new example is a letter (whereas all training examples are digits).
819: Credibility will also be low if the new example is a digit
820: written in an unusual way.
821: Notice that typically credibility will not be low
822: provided the data set was generated independently from the same distribution:
823: the probability that credibility
824: will not exceed some threshold $\epsilon$ (such as 1\%)
825: is at most $\epsilon$.
826: In summary,
827: we can trust a prediction if
828: (1) the confidence is close to 100\% and
829: (2) the credibility is not low (say, is not less than 5\%).
830: % Table~\ref{tab:examples} gives credibility values typical
831: % when using support vector machines
832: % for computing p-values:
833: % credibility is exactly 100\% on a few occasions.
834: % This happens because most of the $\alpha$'s computed
835: % by support vector machines are zero.
836: % For many  other learning methods typical values of credibility
837: % are in the range 5\%--95\%.
838: 
839: Many other prediction algorithms can be used as underlying algorithms
840: for hedged prediction.
841: For example, we can use the nearest neighbours technique to associate
842: \begin{equation}\label{eq:NN}
843:   \alpha_i
844:   :=
845:   \frac
846:   {\sum_{j=1}^k d_{ij}^+}
847:   {\sum_{j=1}^k d_{ij}^-},
848:   \quad
849:   i=1,\ldots,n,
850: \end{equation}
851: with the elements $(x_i,y_i)$ of the set (\ref{eq:set}),
852: where $d_{ij}^+$ is the $j$th shortest distance from $x_i$
853: to other objects labelled in the same way as $x_i$,
854: and $d_{ij}^-$ is the $j$th shortest distance
855: from $x_i$ to the objects labelled differently from $x_i$;
856: the parameter $k\in\{1,2,\dots\}$ in~(\ref{eq:NN})
857: is the number of nearest neighbours taken into account.
858: The distances can be computed in a feature space
859: (that is, the distance between $x\in\mathbf{X}$ and $x'\in\mathbf{X}$
860: can be understood as $\left\|F(x)-F(x')\right\|$,
861: $F$ mapping the object space $\mathbf{X}$ into a feature, typically Hilbert, space),
862: and so (\ref{eq:NN}) can also be used with the kernel nearest neighbours.
863: 
864: The intuition behind (\ref{eq:NN}) is as follows:
865: a typical object $x_i$ labelled by, say, $y$
866: will tend to be surrounded by other objects labelled by $y$;
867: and if this is the case, the corresponding $\alpha_i$ will be small.
868: In the untypical case that there are objects whose labels are different from $y$
869: nearer than objects labelled $y$,
870: $\alpha_i$ will become larger.
871: Therefore, the $\alpha$s reflect the strangeness of examples.
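
A minimal Python sketch of (\ref{eq:NN}) is given next.
It is an illustration under stated assumptions rather than the implementation used in our experiments:
distances are Euclidean (no kernel),
every label occurs at least $k+1$ times,
and no object coincides with an object carrying a different label
(which would make the denominator zero);
the name \texttt{nn\_scores} is hypothetical.

```python
import numpy as np


def nn_scores(X, y, k=1):
    """Nonconformity scores: sum of the k smallest same-label distances
    divided by the sum of the k smallest other-label distances."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Matrix of pairwise Euclidean distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    alphas = np.empty(len(y))
    for i in range(len(y)):
        not_self = np.arange(len(y)) != i              # exclude the object itself
        same = np.sort(D[i, not_self & (y == y[i])])[:k]
        diff = np.sort(D[i, y != y[i]])[:k]
        alphas[i] = same.sum() / diff.sum()
    return alphas


# Toy data: the last object carries label 0 but sits among objects labelled 1.
X = [[0.0], [1.0], [10.0], [11.0], [9.0]]
y = [0, 0, 1, 1, 0]
alphas = nn_scores(X, y, k=1)
print(alphas)
```

In this toy data set the last example receives by far the largest score,
in agreement with the intuition described above.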
872: 
873: The p-values computed by (\ref{eq:NN})
874: can again be used for hedged prediction.
875: % as described in Section \ref{sec:ideal}.
876: It is a general empirical fact that
877: the accuracy and reliability of the hedged predictions
878: are in line with the error rate of the underlying algorithm.
879: For example, in the case of the USPS data set,
880: the 1-nearest neighbour algorithm
881: (i.e., the one with $k=1$)
882: achieves the error rate of 2.2\%,
883: and the hedged predictions based on (\ref{eq:NN}) are highly confident
884: (achieve confidence of at least $99\%$)
885: for more than 95\% of the test examples.
886: 
887: \subsection*{General Definition}
888: 
889: The general notion of conformal predictor can be defined as follows.
890: A \emph{nonconformity measure} is a function that assigns
891: to every data sequence (\ref{eq:set}) a sequence of numbers
892: $\alpha_1,\ldots,\alpha_n$,
893: called \emph{nonconformity scores},
894: in such a way that interchanging any two examples $(x_i,y_i)$ and $(x_j,y_j)$
895: leads to the interchange of the corresponding nonconformity scores $\alpha_i$ and $\alpha_j$
896: (with all the other nonconformity scores unaffected).
897: The corresponding \emph{conformal predictor} maps each data set (\ref{eq:training-set}),
898: $l=0,1,\ldots$,
899: each new object $x_{l+1}$,
900: and each confidence level $1-\epsilon\in(0,1)$,
901: to the prediction set
902: \begin{equation}\label{eq:Gamma}
903:   \Gamma^{\epsilon}
904:   \left(
905:     x_1,y_1,\ldots,x_{l},y_{l},x_{l+1}
906:   \right)
907:   :=
908:   \left\{
909:     Y\in\mathbf{Y}
910:     \st
911:     p_Y
912:     >
913:     \epsilon
914:   \right\},
915: \end{equation}
916: where $p_Y$ are defined by (\ref{eq:p})
917: with $\alpha_1,\ldots,\alpha_{l+1}$ being the nonconformity scores
918: corresponding to the data sequence (\ref{eq:completion}).
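
The definition (\ref{eq:Gamma}) can be turned into code almost literally.
The following Python sketch assumes that the p-value (\ref{eq:p})
is the fraction of $i=1,\ldots,l+1$ with $\alpha_i\ge\alpha_{l+1}$
and that the label space is finite;
the interface (a function \texttt{nonconformity} mapping a list of examples to a list of scores)
and the toy nonconformity measure are chosen only for this illustration.

```python
def conformal_prediction_set(train, x_new, labels, nonconformity, eps):
    """Return the set {Y : p_Y > eps} for the new object x_new.

    train          -- list of (x, y) pairs
    labels         -- iterable of candidate labels Y
    nonconformity  -- function mapping a list of examples to a list of scores
                      (hypothetical interface chosen for this sketch)
    """
    prediction_set = set()
    for Y in labels:
        completed = train + [(x_new, Y)]       # try Y as the label of x_new
        alphas = nonconformity(completed)
        # p-value: fraction of examples at least as nonconforming as the new one
        p_Y = sum(a >= alphas[-1] for a in alphas) / len(alphas)
        if p_Y > eps:
            prediction_set.add(Y)
    return prediction_set


def distance_to_class_mean(examples):
    """Toy nonconformity measure for real-valued objects:
    distance from x_i to the mean of the objects sharing its label."""
    scores = []
    for x, y in examples:
        same = [u for u, v in examples if v == y]
        scores.append(abs(x - sum(same) / len(same)))
    return scores


train = [(0.0, "a"), (0.2, "a"), (0.4, "a"), (5.0, "b"), (5.2, "b"), (5.4, "b")]
for eps in (0.1, 0.2, 0.8):
    print(eps, conformal_prediction_set(train, 0.3, "ab", distance_to_class_mean, eps))
```

The toy measure depends only on the multiset of examples,
so interchanging two examples interchanges their scores, as the definition requires.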
919: 
920: We have already remarked that associating with each completion (\ref{eq:completion})
921: the p-value (\ref{eq:p}) gives a randomness test;
922: this is true in general.
923: This implies that for each $l$ the probability of the event
924: \begin{equation*}
925:   y_{l+1}
926:   \in
927:   \Gamma^{\epsilon}
928:   \left(
929:     x_1,y_1,\ldots,x_{l},y_{l},x_{l+1}
930:   \right)
931: \end{equation*}
932: is at least $1-\epsilon$.
933: 
934: This definition works for both classification and regression,
935: but in the case of classification we can summarize (\ref{eq:Gamma})
936: by two numbers:
937: the confidence
938: \begin{equation}\label{eq:conf}
939:   \sup
940:   \left\{
941:     1-\epsilon
942:     \st
943:     \left|
944:       \Gamma^{\epsilon}
945:     \right|
946:     \le
947:     1
948:   \right\}
949: \end{equation}
950: and the credibility
951: \begin{equation}\label{eq:cred}
952:   \inf
953:   \left\{
954:     \epsilon
955:     \st
956:     \left|
957:       \Gamma^{\epsilon}
958:     \right|
959:     =
960:     0
961:   \right\}.
962: \end{equation}
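
For a finite label space with at least two labels
these two summaries have a simple closed form
(a short derivation added for illustration;
the notation $p_{(1)}\ge p_{(2)}\ge\cdots$
for the p-values $p_Y$, $Y\in\mathbf{Y}$, sorted in decreasing order
is introduced only for this purpose).
By (\ref{eq:Gamma}),
$\left|\Gamma^{\epsilon}\right|\le1$ if and only if $\epsilon\ge p_{(2)}$,
and $\left|\Gamma^{\epsilon}\right|=0$ if and only if $\epsilon\ge p_{(1)}$.
Therefore, when $p_{(1)}<1$,
\begin{equation*}
  \sup
  \left\{
    1-\epsilon
    \st
    \left|\Gamma^{\epsilon}\right|\le1
  \right\}
  =
  1-p_{(2)},
  \qquad
  \inf
  \left\{
    \epsilon
    \st
    \left|\Gamma^{\epsilon}\right|=0
  \right\}
  =
  p_{(1)}.
\end{equation*}
When $p_{(1)}=1$ no $\epsilon\in(0,1)$ gives an empty prediction set,
and it is natural to take the credibility to be $1$.
For instance, for the third example in Table~\ref{tab:examples}
we have $p_{(1)}=100\%$ and $p_{(2)}=0.27\%$,
which gives confidence $99.73\%$ and credibility $100\%$.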
963: 
964: \subsection*{Computationally Efficient Regression}
965: 
966: As we have already mentioned,
967: the algorithms described so far
968: cannot be applied directly in the case of regression,
969: even if the randomness test is efficiently computable:
970: now we cannot consider all possible values $Y$ for $y_{l+1}$
971: since there are infinitely many of them.
972: However, there might still be computationally efficient
973: % (in the sense of required computational resources)
974: ways to find the prediction sets $\Gamma^{\epsilon}$.
975: The idea is that if $\alpha_i$ are defined as the residuals
976: \begin{equation}\label{eq:residual}
977:   \alpha_i
978:   :=
979:   \left|
980:     y_i - f_Y(x_i)
  \right|,
982: \end{equation}
983: where $f_Y:\mathbf{X}\to\bbbr$ is a regression function
984: fitted to the completed data set~(\ref{eq:completion}),
985: then $\alpha_i$ may have a simple expression in terms of $Y$,
986: leading to an efficient way of computing the prediction sets
987: (via (\ref{eq:p}) and (\ref{eq:Gamma})).
988: This idea was implemented in \cite{nouretdinov/etal:2001rr}
989: in the case where $f_Y$ is found from the ridge regression,
990: or kernel ridge regression, procedure,
991: with the resulting algorithm of hedged prediction
992: called the \emph{ridge regression confidence machine}.
993: For a much fuller description of the ridge regression confidence machine
994: (and its modifications in the case where (\ref{eq:residual})
995: are replaced by the fancier ``deleted'' or ``studentized'' residuals)
996: see \cite{vovk/etal:2005}, Section 2.3.
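
To illustrate why the residuals have a simple expression in terms of $Y$,
consider ridge regression without an intercept:
the vector of fitted values is linear in the vector of labels,
so each $\alpha_i$ has the form $\left|A_i+B_iY\right|$.
The Python sketch below computes $A_i$ and $B_i$
and then approximates the prediction interval on a grid of candidate values $Y$.
It is not the algorithm of \cite{nouretdinov/etal:2001rr}
(which does not need a grid);
it assumes that the p-value (\ref{eq:p}) is the fraction of $i$ with $\alpha_i\ge\alpha_{l+1}$,
and all function names and data are made up for the illustration.

```python
import numpy as np


def rrcm_coefficients(X, y, x_new, a=1.0):
    """Return A, B such that alpha_i(Y) = |A[i] + B[i] * Y|, i = 1, ..., l+1."""
    X_ext = np.vstack([X, x_new])            # the l training objects plus the new one
    n, p = X_ext.shape
    # Ridge regression hat matrix: the fitted values are H @ labels.
    H = X_ext @ np.linalg.solve(X_ext.T @ X_ext + a * np.eye(p), X_ext.T)
    C = np.eye(n) - H                        # the residuals are C @ labels
    A = C @ np.append(y, 0.0)                # contribution of the known labels
    B = C[:, -1]                             # contribution of the unknown label Y
    return A, B


def p_value(A, B, Y):
    """Fraction of examples whose residual is at least that of the new example."""
    alphas = np.abs(A + B * Y)
    return float(np.mean(alphas >= alphas[-1]))


def interval_on_grid(A, B, grid, eps):
    """Convex hull of the grid points Y with p-value above eps (None if there are none)."""
    kept = [Y for Y in grid if p_value(A, B, Y) > eps]
    return (min(kept), max(kept)) if kept else None


# Synthetic illustration (all numbers below are made up for this sketch).
rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(20, 2))
y = X @ np.array([1.0, -2.0]) + rng.standard_normal(20)
x_new = np.array([3.0, 1.0])

A, B = rrcm_coefficients(X, y, x_new, a=1.0)
interval = interval_on_grid(A, B, np.arange(-20.0, 20.0, 0.01), eps=0.1)
print(interval)
```

Since the p-value can change only at the points $Y$
where $\left|A_i+B_iY\right|=\left|A_{l+1}+B_{l+1}Y\right|$ for some $i$,
an exact computation can replace the grid by these finitely many points.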
997: 
998: \section{Bayesian Approach to Conformal Prediction}
999: \label{sec:Bayesian}
1000: 
1001: Bayesian methods have become very popular in both machine learning and statistics
1002: thanks to their power and versatility,
1003: and in this section we will see
1004: how Bayesian ideas can be used for designing efficient conformal predictors.
1005: We will only describe results of computer experiments
1006: (following \cite{melluish/etal:2001})
1007: with artificial data sets,
1008: since for real-world data sets there is no way
1009: to make sure that the Bayesian assumption is satisfied.
1010: 
1011: Suppose $\mathbf{X}=\bbbr^p$
1012: (each object is a vector of $p$ real-valued attributes)
1013: and our model of the data-generating mechanism is
1014: \begin{equation}\label{eq:model}
1015:   y_i
1016:   =
1017:   w\cdot x_i
1018:   +
1019:   \xi_i,
1020:   \quad
1021:   i=1,2,\ldots,
1022: \end{equation}
1023: where $\xi_i$ are independent standard Gaussian random variables
1024: % (we use the notation $N(\mu,\sigma^2)$ for the Gaussian distribution
1025: % with mean $\mu$ and variance $\sigma^2$)
1026: and the weight vector $w\in\bbbr^p$ is distributed as $N(0,(1/a)I_p)$
1027: (we use the notation $I_p$ for the unit $p\times p$ matrix
and $N(0,A)$ for the $p$-dimensional Gaussian distribution
with mean $0$ and covariance matrix $A$);
1030: $a$ is a positive constant.
1031: % which we believe to be $1$.
1032: The actual data-generating mechanism used in our experiments
1033: will correspond to this model with $a$ set to 1.
1034: 
1035: Under the model (\ref{eq:model}) the best (in the mean-square sense) fit
1036: to a data set (\ref{eq:set})
1037: is provided by the ridge regression procedure with parameter $a$
1038: (for details, see, e.g., \cite{vovk/etal:2005}, Section 10.3).
1039: Using the residuals (\ref{eq:residual}) with $f_Y$
1040: found by ridge regression with parameter $a$
1041: leads to an efficient conformal predictor
1042: which will be referred to as the ridge regression confidence machine with parameter $a$.
1043: Each prediction set output by the ridge regression confidence machine
1044: will be replaced by its convex hull,
1045: the corresponding \emph{prediction interval}.
1046: 
1047: To test the validity and efficiency of the ridge regression confidence machine
1048: the following procedure was used.
1049: Ten times a vector $w\in\bbbr^5$ was independently generated from the distribution $N(0,I_5)$.
1050: For each of the 10 values of $w$,
1051: 100 training objects and 100 test objects
1052: were independently generated from the uniform distribution on $[-10,10]^5$
1053: and for each object $x$ its label $y$ was generated as $w\cdot x+\xi$,
1054: with all the $\xi$ standard Gaussian and independent.
1055: For each of the 1000 test objects and each confidence level $1-\epsilon$
1056: the prediction set $\Gamma^{\epsilon}$ for its label
1057: was found from the corresponding training set
1058: using the ridge regression confidence machine with parameter $a=1$.
1059: The solid line in Figure~\ref{fig:rrcm-errors} shows the confidence level
1060: against the percentage of test examples whose labels
1061: were not covered by the corresponding prediction intervals at that confidence level.
1062: Since conformal predictors are always valid,
1063: the percentage outside the prediction interval
1064: should never exceed 100 minus the confidence level,
1065: up to statistical fluctuations,
1066: and this is confirmed by the picture.
1067: 
1068: \begin{figure}
1069:   \centering
1070:   \makebox{\includegraphics[width=\picturewidth,clip=true]{rrcm_errors.eps}}
1071:   \caption{\label{fig:rrcm-errors}Validity for the ridge regression confidence machine.}
1072: \end{figure}
1073: 
1074: A natural measure of efficiency of confidence predictors
1075: is the mean width of their prediction intervals,
1076: at different confidence levels:
the narrower the prediction intervals an algorithm produces, the more efficient it is.
1078: The solid line in Figure~\ref{fig:rrcm-widths} shows
1079: the confidence level against the mean
1080: (over all test examples)
1081: width of the prediction intervals at that confidence level.
1082: 
1083: \begin{figure}
1084:   \centering
1085:   \makebox{\includegraphics[width=\picturewidth,clip=true]{rrcm_widths.eps}}
1086:   \caption{\label{fig:rrcm-widths}Efficiency for the ridge regression confidence machine.}
1087: \end{figure}
1088: 
1089: Since we know the data-generating mechanism,
1090: the approach via conformal prediction appears somewhat roundabout:
1091: for each test object we could instead find
1092: the conditional probability distribution of its label,
1093: which is Gaussian,
1094: and output as the prediction set $\Gamma^{\epsilon}$
1095: the shortest 
1096: (i.e., centred at the mean of the conditional distribution)
1097: interval of conditional probability $1-\epsilon$.
1098: Figures \ref{fig:Bayes-errors} and \ref{fig:Bayes-widths}
1099: are the analogues of Figures \ref{fig:rrcm-errors} and \ref{fig:rrcm-widths}
1100: for this \emph{Bayes-optimal confidence predictor}.
1101: The solid line in Figure \ref{fig:Bayes-errors}
1102: demonstrates the validity of the Bayes-optimal confidence predictor.
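
A minimal Python sketch of such a Bayes-optimal prediction interval follows.
It relies on the standard conjugate Gaussian formulas for the model (\ref{eq:model}),
which are not spelled out in this paper and are assumed here:
the posterior mean of $w$ is the ridge regression estimate with parameter $a$,
and the conditional variance of the new label is $1+x'(X'X+aI_p)^{-1}x$,
where $X$ is the matrix whose rows are the training objects.
The function name \texttt{bayes\_interval} and the synthetic data are ours.

```python
import numpy as np
from statistics import NormalDist


def bayes_interval(X, y, x_new, eps, a=1.0):
    """Central interval of conditional probability 1 - eps for the label of x_new."""
    p = X.shape[1]
    S = np.linalg.inv(X.T @ X + a * np.eye(p))   # posterior covariance matrix of w
    w_hat = S @ X.T @ y                          # posterior mean of w (ridge estimate)
    mean = x_new @ w_hat
    sd = np.sqrt(1.0 + x_new @ S @ x_new)        # noise variance 1 plus uncertainty about w
    z = NormalDist().inv_cdf(1.0 - eps / 2.0)    # standard Gaussian quantile
    return mean - z * sd, mean + z * sd


# Synthetic data mimicking the experimental set-up (w drawn from N(0, I_5), i.e. a = 1).
rng = np.random.default_rng(1)
w = rng.standard_normal(5)
X = rng.uniform(-10, 10, size=(100, 5))
y = X @ w + rng.standard_normal(100)
x_new = rng.uniform(-10, 10, size=5)

lo90, hi90 = bayes_interval(X, y, x_new, eps=0.10)
lo99, hi99 = bayes_interval(X, y, x_new, eps=0.01)
print(hi90 - lo90, hi99 - lo99)
```

The validity of these intervals depends on the parameter $a$ passed to the function
matching the distribution that actually generated $w$.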
1103: 
1104: \begin{figure}
1105:   \centering
1106:   \makebox{\includegraphics[width=\picturewidth,clip=true]{bayes_errors.eps}}
1107:   \caption{\label{fig:Bayes-errors}Validity for the Bayes-optimal confidence predictor.}
1108: \end{figure}
1109: 
1110: \begin{figure}
1111:   \centering
1112:   \makebox{\includegraphics[width=\picturewidth,clip=true]{bayes_widths.eps}}
1113:   \caption{\label{fig:Bayes-widths}Efficiency for the Bayes-optimal confidence predictor.}
1114: \end{figure}
1115: 
1116: What is interesting is that the solid lines
1117: in Figures~\ref{fig:Bayes-widths} and \ref{fig:rrcm-widths}
1118: look exactly the same,
1119: taking account of the different scales of the vertical axes.
1120: The ridge regression confidence machine
1121: appears as good as the Bayes-optimal predictor.
1122: (This is a general phenomenon;
1123: it is also illustrated, in the case of classification,
1124: by the construction in Section 3.3 of \cite{vovk/etal:2005}
1125: of a conformal predictor that is asymptotically
1126: as good as the Bayes-optimal confidence predictor.)
1127: 
1128: The similarity between the two algorithms disappears
1129: when they are given wrong values for $a$.
1130: For example,
1131: let us see what happens if we tell the algorithms
1132: that the expected value of $\|w\|$ is just $1\%$ of what it really is
1133: (this corresponds to taking $a=10000$).
1134: The ridge regression confidence machine stays valid
1135: (see the dashed line in Figure \ref{fig:rrcm-errors}),
1136: but its efficiency deteriorates
1137: (the dashed line in Figure \ref{fig:rrcm-widths}).
1138: The efficiency of the Bayes-optimal confidence predictor
1139: (the dashed line in Figure \ref{fig:Bayes-widths})
1140: is hardly affected,
1141: but its predictions become invalid
1142: (the dashed line in Figure \ref{fig:Bayes-errors}
1143: deviates significantly from the diagonal,
1144: especially for the most important large confidence levels:
1145: e.g., only about 15\% of labels fall within the 90\% prediction sets).
1146: The worst that can happen to the ridge regression confidence machine
1147: is that its predictions will become useless
1148: (but at least harmless),
1149: whereas the Bayes-optimal predictions can become misleading.
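
The correspondence between the factor $1\%$ and the value $a=10000$
can be checked directly
(a small calculation added for illustration).
Under the prior $N(0,(1/a)I_p)$ each coordinate of $w$ has standard deviation $a^{-1/2}$,
so the typical size of $\|w\|$ is proportional to $a^{-1/2}$;
relative to the true value $a=1$ this gives
\begin{equation*}
  a=10000:
  \enspace
  a^{-1/2}=0.01,
  \qquad
  a=1000:
  \enspace
  a^{-1/2}\approx0.032.
\end{equation*}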
1150: 
1151: Figures \ref{fig:rrcm-errors}--\ref{fig:Bayes-widths} also show the graphs
1152: for the intermediate value $a=1000$.
1153: Similar results but for different data sets
1154: are also given in \cite{vovk/etal:2005}, Section 10.3.
1155: A general scheme of Bayes-type conformal prediction
1156: is described in \cite{vovk/etal:2005}, pp.~102--103.
1157: 
1158: \iffalse
1159: \begin{figure}
1160: \centering
1161:   \makebox{\includegraphics[width=0.25\picturewidth,clip=true]{autompg_errors.eps}}
1162:   \makebox{\includegraphics[width=0.25\picturewidth,clip=true]{autompg_widths.eps}}
1163:   \makebox{\includegraphics[width=0.25\picturewidth,clip=true]{boston_errors.eps}}
1164:   \makebox{\includegraphics[width=0.25\picturewidth,clip=true]{boston_widths.eps}}
1165: \caption{\label{fig:benchmarks}Bayesian RR and RRCM applied to Auto mpg and Boston housing benchmarks.}
1166: \end{figure}
1167: 
1168: Figure~\ref{fig:benchmarks} extends these results
1169: to two benchmark data sets taken from the UCI machine learning repository,
1170: the auto-mpg data set and the Boston housing data set.
1171: For the benchmark data sets,
1172: the training and test examples were randomly drawn from the set of all data points.
1173: The ridge coefficient $a$ in Figure \ref{fig:benchmarks}
1174: is chosen so that a reasonable mean square error is obtained.
1175: The top graphs in the figure show
1176: that Bayesian Ridge Regression is overconfident on the auto-mpg dataset,
1177: predicting tolerance regions that are too narrow.
1178: The RRCM predicts valid tolerance regions,
1179: and the top right graph shows that to do so
1180: it gives wider tolerance regions than Bayesian Ridge Regression.
1181: On the Boston housing data set,
1182: Bayesian Ridge Regression is too conservative.
1183: The bottom left graph shows that its predicted tolerance regions are always valid;
1184: however, it also shows that they are much wider than those given by the RRCM.
1185: As the RRCM's tolerance regions are also valid,
1186: we prefer the more accurate RRCM's predictions.
1187: 
1188: \textbf{These results probably do not make much sense
1189: since \cite{melluish/etal:2001} assumes the standard deviation $\sigma$ of $\xi_i$ known:
1190: $\sigma=1$.
1191: This assumption alone might lead to the gross inadequacies of the Bayesian method
1192: that show in Figure \ref{fig:benchmarks}.}
1193: \fi
1194: 
\section{On-line Prediction}
1196: \label{sec:on-line}
1197: 
1198: % Properties in the on-line framework
1199: 
1200: We know from Section \ref{sec:conformal}
1201: that conformal predictors are valid in the sense that the probability of error
1202: \begin{equation}\label{eq:error}
1203:   y_{l+1}
1204:   \notin
1205:   \Gamma^{\epsilon}
1206:   \left(
1207:     x_1,y_1,
    \ldots,
1209:     x_l,y_l,
1210:     x_{l+1}
1211:   \right)
1212: \end{equation}
1213: at confidence level $1-\epsilon$
1214: never exceeds $\epsilon$.
1215: The word ``probability'' means ``unconditional probability'' here:
1216: the frequentist meaning of the statement that the probability of (\ref{eq:error})
1217: does not exceed $\epsilon$
1218: is that,
1219: if we repeatedly generate many sequences
1220: \begin{equation*}
1221:   x_1,y_1,\ldots,x_l,y_l,x_{l+1},y_{l+1},
1222: \end{equation*}
1223: the fraction of them satisfying (\ref{eq:error})
1224: will be at most $\epsilon$,
1225: to within statistical fluctuations.
1226: To say that we are controlling the number of errors
1227: would be an exaggeration
1228: because of the artificial character of this scheme
1229: of repeatedly generating a new training set and a new test example.
1230: Can we say that the confidence level $1-\epsilon$
1231: translates into a bound on the number of mistakes
1232: for a natural learning protocol?
1233: In this section we show that the answer is ``yes''
1234: for the popular on-line learning protocol,
1235: and in the next section we will see to what degree
1236: this carries over to other protocols.
1237: 
1238: In on-line learning the examples are presented one by one.
1239: Each time, we observe the object and predict its label.
1240: Then we observe the label and go on to the next example.
1241: We start by observing the first object $x_1$ and predicting its label $y_1$.
1242: Then we observe $y_1$ and the second object $x_2$, and predict its label $y_2$.
1243: And so on.
1244: At the $n$th step,
1245: we have observed the previous examples
1246: $ %\begin{equation*}
1247:   (x_1,y_1),\dots,(x_{n-1},y_{n-1})
1248: $ %\end{equation*}
1249: and the new object $x_n$, and our task is to predict $y_n$.
1250: The quality of our predictions should improve
1251: as we accumulate more and more old examples.
1252: This is the sense in which we are learning.
1253: 
1254: Our prediction for $y_n$ is a nested family of prediction sets
1255: $\Gamma_n^{\epsilon}\subseteq\mathbf{Y}$,
1256: $\epsilon\in(0,1)$.
1257: The process of prediction can be summarized by the following protocol:
1258: 
1259: \medskip
1260: 
1261: \noindent\textsc{On-line prediction protocol}
1262: \ifJOURNAL
1263:   \newcommand{\Indent}{\quad}
1264: \fi
1265: \ifnotJOURNAL
1266:   \newcommand{\Indent}{\quad\enspace}
1267: 
1268:   \smallskip
1269: 
1270: \fi
1271: 
1272: \noindent
\Indent$\Err_0^{\epsilon}:=0$ for all $\epsilon\in(0,1)$;

\noindent
\Indent$\Mult_0^{\epsilon}:=0$ for all $\epsilon\in(0,1)$;

\noindent
\Indent$\Emp_0^{\epsilon}:=0$ for all $\epsilon\in(0,1)$;
1280: 
1281: \noindent
1282: \Indent FOR $n=1,2,\ldots$:
1283: 
1284: \noindent
1285: \Indent\Indent Reality outputs $x_n\in\mathbf{X}$;
1286: 
1287: \noindent
1288: \Indent\Indent Predictor outputs $\Gamma_n^{\epsilon}\subseteq\mathbf{Y}$ for all $\epsilon\in(0,1)$;
1289: 
1290: \noindent
1291: \Indent\Indent Reality outputs $y_n\in\mathbf{Y}$;
1292: 
1293: \noindent
1294: \Indent\Indent$\err_n^{\epsilon}
1295:   :=
1296:   \left\{
1297:     \begin{array}{ll}
1298:       1 & \text{if $y_n \notin \Gamma_n^{\epsilon}$}\\
1299:       0 & \text{otherwise},
1300:     \end{array}
1301:   \right.
1302:   \quad
1303:   \epsilon\in(0,1)$;
1304: 
1305: \noindent
1306: \Indent\Indent\strut$\Err_n^{\epsilon}:=\Err^{\epsilon}_{n-1}+\err_n^{\epsilon},
1307:   \quad
1308:   \epsilon\in(0,1)$;
1309: 
1310: \noindent
1311: \Indent\Indent$\mult_n^{\epsilon}
1312:   :=
1313:   \left\{
1314:     \begin{array}{ll}
1315:       1 & \text{if $\left|\Gamma_n^{\epsilon}\right|>1$}\\
1316:       0 & \text{otherwise},
1317:     \end{array}
1318:   \right.
1319:   \quad
1320:   \epsilon\in(0,1)$;
1321: 
1322: \noindent
1323: \Indent\Indent\strut$\Mult_n^{\epsilon}:=\Mult_{n-1}^{\epsilon}+\mult_n^{\epsilon},
1324:   \quad
1325:   \epsilon\in(0,1)$;
1326: 
1327: \noindent
1328: \Indent\Indent$\emp_n^{\epsilon}
1329:   :=
1330:   \left\{
1331:     \begin{array}{ll}
1332:       1 & \text{if $\left|\Gamma_n^{\epsilon}\right|=0$}\\
1333:       0 & \text{otherwise},
1334:     \end{array}
1335:   \right.
1336:   \quad
1337:   \epsilon\in(0,1)$;
1338: 
1339: \noindent
\Indent\Indent\strut$\Emp_n^{\epsilon}:=\Emp_{n-1}^{\epsilon}+\emp_n^{\epsilon},
1341:   \quad
1342:   \epsilon\in(0,1)$
1343: 
1344: \noindent
1345: \Indent END FOR.
1346: 
1347: \medskip
1348: 
1349: \noindent
1350: As we said, the family $\Gamma_n^{\epsilon}$
1351: is assumed nested:
1352: $\Gamma_n^{\epsilon_1}\subseteq\Gamma_n^{\epsilon_2}$ when $\epsilon_1\ge\epsilon_2$.
1353: In this protocol we also record the cumulative numbers
1354: $\Err_n^{\epsilon}$ of erroneous prediction sets,
1355: $\Mult_n^{\epsilon}$ of \emph{multiple} prediction sets
1356: (i.e., prediction sets containing more than one label)
1357: and $\Emp_n^{\epsilon}$ of empty prediction sets
1358: at each confidence level $1-\epsilon$.
1359: We will discuss the significance of each of these numbers in turn.
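
For a single significance level $\epsilon$
the bookkeeping of the protocol can be sketched in Python as follows.
The interface (a function \texttt{predictor} taking the past examples, the new object and $\epsilon$
and returning a set of labels) is hypothetical,
and the toy predictor at the end is not a conformal predictor;
it is there only to exercise the counters.

```python
def run_online(examples, predictor, eps):
    """Run the on-line protocol at one significance level eps.

    examples  -- iterable of (x, y) pairs, in the order Reality outputs them
    predictor -- function (past_examples, x, eps) -> set of labels
                 (hypothetical interface chosen for this sketch)
    Returns the cumulative counts (Err, Mult, Emp) after the last example.
    """
    err = mult = emp = 0
    past = []
    for x, y in examples:
        gamma = predictor(past, x, eps)   # Predictor outputs the prediction set
        err += y not in gamma             # erroneous prediction set
        mult += len(gamma) > 1            # multiple prediction set
        emp += len(gamma) == 0            # empty prediction set
        past.append((x, y))               # Reality reveals the label
    return err, mult, emp


def seen_labels(past, x, eps):
    """Toy predictor: all labels observed so far for exactly this object."""
    return {v for u, v in past if u == x}


stream = [(0, "a"), (0, "a"), (0, "b"), (0, "a"), (1, "c")]
print(run_online(stream, seen_labels, eps=0.05))
```

Note that an empty prediction set is always counted as an error as well,
since it cannot contain the true label.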
1360: 
1361: The number of erroneous predictions is a measure of validity of our confidence predictors:
1362: we would like to have $\Err_n^{\epsilon}\le\epsilon n$,
1363: up to statistical fluctuations.
1364: In Figure~\ref{fig:CP0err} we can see the lines $n\mapsto\Err_n^{\epsilon}$
1365: for one particular conformal predictor
1366: and for three confidence levels $1-\epsilon$:
1367: the solid line for 99\%, the dash-dot line for 95\%, and the dotted line for 80\%.
1368: The number of errors made grows linearly,
1369: and the slope is approximately
1370: 20\% for the confidence level 80\%,
1371: 5\% for the confidence level 95\%,
1372: and 1\% for the confidence level 99\%.
1373: We will see below that this is not accidental.
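
As a back-of-the-envelope check of these slopes
(an illustration added here, not a reading of the figure):
if the frequency of errors at confidence level $1-\epsilon$ were exactly $\epsilon$,
then after all $n=9298$ examples of the USPS data set
we would expect approximately
\begin{equation*}
  0.20\times9298\approx1860,
  \qquad
  0.05\times9298\approx465,
  \qquad
  0.01\times9298\approx93
\end{equation*}
errors at the confidence levels 80\%, 95\% and 99\%, respectively.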
1374: 
1375: \begin{figure}
1376:   \centering
1377:   \makebox{\includegraphics[width=\picturewidth]{CP0err.eps}}
1378:   \caption{\label{fig:CP0err}Cumulative numbers of errors for a conformal predictor
1379:     (the 1-nearest neighbour conformal predictor)
1380:     run in the on-line mode on the USPS data set
1381:     (9298 hand-written digits, randomly permuted)
1382:     at the confidence levels 80\%, 95\% and 99\%.}
1383: \end{figure}
1384: 
The number of multiple predictions $\Mult_n^{\epsilon}$
1386: is a useful measure of efficiency in the case of classification:
1387: we would like as many as possible of our predictions to be singletons.
1388: Figure \ref{fig:TCM975} shows the cumulative numbers of errors
1389: $n\mapsto\Err_n^{2.5\%}$ (solid line)
1390: and multiple predictions
1391: $n\mapsto\Mult_n^{2.5\%}$ (dotted line)
1392: at the fixed confidence level 97.5\%.
1393: We can see that out of approximately 10,000 predictions
1394: about 250 (approximately 2.5\%) were errors
1395: and about 300 (approximately 3\%) were multiple predictions.
1396: 
1397: \begin{figure}
1398:   \centering
1399:   \makebox{\includegraphics[width=\picturewidth]{TCM0_975bwF.eps}}
1400:   \caption{\label{fig:TCM975}The on-line performance of the 1-nearest neighbour conformal predictor
1401:     at the confidence level 97.5\% on the USPS data set (randomly permuted).}
1402: \end{figure}
1403: 
1404: We can see that by choosing $\epsilon$ we are able to control the number of errors.
1405: For small $\epsilon$
1406: (relative to the difficulty of the data set)
this might sometimes force us to give
multiple predictions.
1409: On the other hand,
1410: for larger $\epsilon$ this might lead to empty predictions at some steps,
1411: as can be seen from the bottom right corner of Figure \ref{fig:TCM975}:
1412: when the predictor ceases to make multiple predictions
1413: it starts making occasional empty predictions
1414: (the dash-dot line).
1415: An empty prediction is a warning that the object to be predicted is unusual
1416: (the credibility, as defined in Section \ref{sec:ideal}, is $\epsilon$ or less).
1417: 
1418: It would be a mistake to concentrate exclusively on one confidence level $1-\epsilon$.
1419: If the prediction $\Gamma_n^{\epsilon}$ is empty,
1420: this does not mean that we cannot make any prediction at all:
1421: we should just shift our attention to other confidence levels
1422: (perhaps look at the range of $\epsilon$ for which $\Gamma_n^{\epsilon}$ is a singleton).
1423: Likewise, $\Gamma_n^{\epsilon}$ being multiple
1424: does not mean that all labels in $\Gamma_n^{\epsilon}$ are equally likely:
1425: slightly increasing $\epsilon$ might lead to the removal of some labels.
1426: Of course,
taking in the continuum of prediction sets, for all $\epsilon\in(0,1)$,
1428: might be too difficult or tiresome for a human mind,
1429: and concentrating on a few conventional levels,
1430: as in Figure \ref{fig:predset},
1431: might be a reasonable compromise.
1432: 
1433: \ifJOURNAL
1434: \begin{table*}
1435: \processtable{A selected test example from a data set of hospital records of patients
1436:   who suffered acute abdominal pain \cite{gammerman/thatcher:1992}:
1437:   the p-values for the nine possible diagnostic groups
1438:   (appendicitis APP, diverticulitis DIV, perforated peptic ulcer PPU,
1439:   non-specific abdominal pain NAP, cholecystitis CHO, intestinal obstruction INO,
1440:   pancreatitis PAN, renal colic RCO, dyspepsia DYS)
1441:   and the true label.\label{tab:abdominal}}
1442: %\begingroup\tiny
1443: {\footnotesize\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
1444: \hline APP & DIV & PPU & NAP & CHO & INO & PAN & RCO & DYS & true label\\
1445: \hline 1.23\% & 0.36\% & 0.16\% & 2.83\% & 5.72\% & 0.89\% & 1.37\% & 0.48\% & 80.56\% & DYS\\
1446: \hline
1447: \end{tabular}}{}
1448: %\endgroup
1449: \end{table*}
1450: \fi
1451: 
1452: \ifnotJOURNAL
1453: \begin{table*}
1454: \caption{A selected test example from a data set of hospital records of patients
1455:   who suffered acute abdominal pain \cite{gammerman/thatcher:1992}:
1456:   the p-values for the nine possible diagnostic groups
1457:   (appendicitis APP, diverticulitis DIV, perforated peptic ulcer PPU,
1458:   non-specific abdominal pain NAP, cholecystitis CHO, intestinal obstruction INO,
1459:   pancreatitis PAN, renal colic RCO, dyspepsia DYS)
1460:   and the true label.\label{tab:abdominal}}
1461: 
1462: \medskip
1463: 
1464: {\footnotesize\hspace{-2mm}\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
1465: \hline APP & DIV & PPU & NAP & CHO & INO & PAN & RCO & DYS & true label\\
1466: \hline 1.23\% & 0.36\% & 0.16\% & 2.83\% & 5.72\% & 0.89\% & 1.37\% & 0.48\% & 80.56\% & DYS\\
1467: \hline
1468: \end{tabular}}{}
1469: %\endgroup
1470: \end{table*}
1471: \fi
1472: 
1473: % Typical output: example 5 (correctly predicted)
1474: %
1475: % Real class = Dyspepsia (8) [starting from 0 rather than 1, as in the paper]
1476: % Predicted class = Dyspepsia (8)
1477: %
1478: % p-values for each class:
1479: % Class 0: Appendicitis = 0.012306289881494986
1480: % Class 1: Diverticulitis = 0.0036463081130355514
1481: % Class 2: Perforated peptic ulcer = 0.0015952597994530538
1482: % Class 3: Non-specific abdominal pain = 0.028258887876025523
1483: % Class 4: Cholecystitis = 0.057201458523245215
1484: % Class 5: Intestinal obstruction = 0.008887876025524157
1485: % Class 6: Pancreatitis = 0.013673655423883319
1486: % Class 7: Renal colic = 0.004785779398359161
1487: % Class 8: Dyspepsia = 0.8056061987237921
1488: 
1489: For example, Table \ref{tab:abdominal}
1490: gives the p-values for different kinds of abdominal pain
1491: obtained for a specific patient based on his symptoms.
1492: % check his sex with Sasha!
1493: We can see that at the confidence level 95\% the prediction set
1494: is multiple,
1495: $\{$cholecystitis, dyspepsia$\}$.
1496: When we relax the confidence level to 90\%,
1497: the prediction set narrows down to $\{$dyspepsia$\}$
1498: (the singleton containing only the true label);
1499: on the other hand,
1500: at the confidence level 99\% the prediction set widens to
1501: $\{$appendicitis, non-specific abdominal pain, cholecystitis, pancreatitis, dyspepsia$\}$.
1502: Such detailed confidence information,
1503: in combination with the property of validity,
1504: is especially valuable in medicine
1505: (and some of the first applications of conformal predictors
1506: have been to the fields of medicine and bioinformatics:
1507: see, e.g., \cite{bellotti/etal:2005,shahmuradov/etal:2005}).
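
To make the thresholding explicit
(a worked illustration of ours,
assuming that the prediction set (\ref{eq:Gamma}) consists of the labels $Y$ with $p_Y>\epsilon$),
the three prediction sets above are obtained from Table \ref{tab:abdominal} as
\begin{align*}
  \Gamma^{10\%} &= \{Y \st p_Y>10\%\} = \{\text{DYS}\},\\
  \Gamma^{5\%}  &= \{Y \st p_Y>5\%\} = \{\text{CHO},\text{DYS}\},\\
  \Gamma^{1\%}  &= \{Y \st p_Y>1\%\} = \{\text{APP},\text{NAP},\text{CHO},\text{PAN},\text{DYS}\}.
\end{align*}
Under the same assumption the prediction set is the singleton $\{\text{DYS}\}$
exactly when $5.72\%\le\epsilon<80.56\%$,
that is, between the second largest and the largest p-values in the table.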
1508: 
1509: In the case of regression,
1510: we will usually have $\Mult_n^{\epsilon}=n$ and $\Emp_n^{\epsilon}=0$,
1511: and so these are not useful measures of efficiency.
1512: Better measures,
1513: such as the ones used in the previous section,
1514: would, e.g., take into account the widths of the prediction intervals.
1515: 
1516: \subsection*{Theoretical Analysis}
1517: 
1518: Looking at Figures \ref{fig:CP0err} and \ref{fig:TCM975}
1519: we might be tempted to guess that the probability of error
1520: at each step of the on-line protocol
1521: is $\epsilon$
1522: and that errors are made independently at different steps.
1523: This is not literally true,
1524: as a closer examination of the bottom left corner of Figure \ref{fig:TCM975} reveals.
1525: It, however, becomes true
1526: (as noticed in \cite{vovk:2002})
1527: if the p-values (\ref{eq:p}) are redefined as
1528: \begin{equation}\label{eq:p-smoothed}
1529:   p_Y
1530:   :=
1531:   \frac
1532:   {
1533:     \left|
1534:       \{i \st \alpha_i>\alpha_{l+1}\}
1535:     \right|
1536:     +
1537:     \eta
1538:     \left|
1539:       \{i \st \alpha_i=\alpha_{l+1}\}
1540:     \right|
1541:   }
1542:   {l+1},
1543: \end{equation}
1544: where $i$ ranges over $\{1,\ldots,l+1\}$
1545: and $\eta\in[0,1]$ is generated randomly from the uniform distribution on $[0,1]$
1546: (the $\eta$s should be independent between themselves and of everything else;
1547: in practice they are produced by pseudo-random number generators).
1548: The only difference between (\ref{eq:p}) and (\ref{eq:p-smoothed})
1549: is that the expression (\ref{eq:p-smoothed}) takes more care in breaking the ties
1550: $\alpha_i=\alpha_{l+1}$.
1551: Replacing (\ref{eq:p}) by (\ref{eq:p-smoothed})
1552: in the definition of conformal predictor
1553: we obtain the notion of \emph{smoothed conformal predictor}.
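
As a small numerical illustration
(the numbers are hypothetical and chosen only for convenience;
we assume that (\ref{eq:p}) counts the ties $\alpha_i=\alpha_{l+1}$ in full),
let $l+1=5$ and
$(\alpha_1,\ldots,\alpha_5)=(0.3,0.7,0.5,0.5,0.5)$.
Then exactly one score exceeds $\alpha_5$ and three scores
(including $\alpha_5$ itself) are equal to it, so that
\begin{equation*}
  \text{(\ref{eq:p}) gives }
  \frac{1+3}{5}=0.8
  \quad\text{and}\quad
  \text{(\ref{eq:p-smoothed}) gives }
  \frac{1+3\eta}{5}\in[0.2,0.8].
\end{equation*}
The smoothed p-value never exceeds the unsmoothed one,
and in this example the two coincide only when $\eta=1$.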
1554: 
1555: The validity property for smoothed conformal predictors can now be stated as follows.
1556: \begin{theorem}\label{thm:on-line}
1557:   Suppose the examples
1558:   \begin{equation*}
1559:     (x_1,y_1),(x_2,y_2),\ldots
1560:   \end{equation*}
1561:   are generated independently
1562:   from the same distribution.
1563:   For any smoothed conformal predictor working in the on-line prediction protocol
1564:   and any confidence level $1-\epsilon$,
1565:   the random variables $\err_1^{\epsilon},\err_2^{\epsilon},\ldots$
1566:   are independent and take value 1 with probability $\epsilon$.
1567: \end{theorem}
1568: 
1569: Combining Theorem \ref{thm:on-line}
1570: with the strong law of large numbers
1571: we can see that
1572: \begin{equation*}
1573:   \lim_{n\to\infty}
1574:   \frac{\Err_n^{\epsilon}}{n}
1575:   =
1576:   \epsilon
1577: \end{equation*}
1578: holds with probability one for smoothed conformal predictors.
1579: (They are ``well calibrated''.)
1580: Since the number of mistakes made by a conformal predictor
1581: never exceeds the number of mistakes
1582: made by the corresponding smoothed conformal predictor,
1583: \begin{equation*}
1584:   \limsup_{n\to\infty}
1585:   \frac{\Err_n^{\epsilon}}{n}
1586:   \le
1587:   \epsilon
1588: \end{equation*}
1589: holds with probability one for conformal predictors.
1590: (They are ``conservatively well calibrated''.)
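
Theorem \ref{thm:on-line} also indicates the size of the fluctuations to expect
(a back-of-the-envelope calculation of ours, which applies to smoothed conformal predictors only).
For such a predictor $\Err_n^{\epsilon}$ is a sum of $n$ independent random variables
taking value 1 with probability $\epsilon$,
and so it has the binomial distribution with mean $n\epsilon$
and standard deviation $\sqrt{n\epsilon(1-\epsilon)}$.
Taking the round figures $n=10{,}000$ and $\epsilon=2.5\%$
suggested by Figure \ref{fig:TCM975},
\begin{equation*}
  n\epsilon = 10{,}000\times0.025 = 250,
  \qquad
  \sqrt{n\epsilon(1-\epsilon)} = \sqrt{243.75}\approx15.6.
\end{equation*}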
1591: 
1592: \section{Slow teachers, lazy teachers, and the batch setting}
1593: \label{sec:slow}
1594: 
1595: % Lazy and slow teachers; batch and mixtures on-line/batch
1596: 
1597: In the pure on-line setting, considered in the previous section,
we get immediate feedback (the true label) for every example that we predict.
1599: This makes practical applications of this scenario questionable.
1600: Imagine, for example, a mail sorting centre
1601: using an on-line prediction algorithm
1602: for zip code recognition;
1603: suppose the feedback about the ``true'' label comes from a human ``teacher''.
1604: If the feedback is given for every object $x_i$,
1605: there is no point in having the prediction algorithm:
1606: we can just as well use the label provided by the teacher.
1607: It would help if the prediction algorithm could still work well,
1608: in particular be valid,
1609: if only every, say, tenth object were classified by a human teacher
1610: (the scenario of ``lazy'' teachers).
1611: Alternatively,
1612: even if the prediction algorithm requires the knowledge of all labels,
1613: it might still be useful if the labels were allowed to be given not immediately
1614: but with a delay (``slow'' teachers).
1615: In our mail sorting example,
1616: such a delay might make sure that we hear
1617: from local post offices about any mistakes made
before giving feedback to the algorithm.
1619: 
1620: In the pure on-line protocol we had validity in the strongest possible sense:
1621: at each confidence level $1-\epsilon$ each smoothed conformal predictor
1622: made errors independently with probability $\epsilon$.
1623: In the case of weaker teachers
1624: (as usual, we are using the word ``teacher'' in the general sense of the entity
1625: providing the feedback,
1626: called Reality in the previous section),
1627: we have to accept a weaker notion of validity.
Suppose the predictor receives feedback from the teacher
1629: at the end of steps $n_1,n_2,\ldots$,
1630: $n_1<n_2<\cdots$;
1631: the feedback is the label of one of the objects that the predictor
1632: has already seen (and predicted).
1633: This scheme \cite{ryabko/etal:2003} covers both slow and lazy teachers
1634: (as well as teachers who are both slow and lazy).
1635: It was proved in \cite{nouretdinov/vovk:2003}
1636: (see also \cite{vovk/etal:2005}, Theorem 4.2)
1637: that the smoothed conformal predictors
1638: (using only the examples with known labels)
1639: remain valid in the sense
1640: \begin{equation*}
1641:   \forall\epsilon\in(0,1):
1642:   \Err_n^{\epsilon}/n\to\epsilon
1643:   \text{ in probability}
1644: \end{equation*}
1645: if and only if $n_k/n_{k-1}\to1$ as $k\to\infty$.
1646: In other words,
1647: the validity in the sense of convergence in probability holds
1648: if and only if the growth rate of $n_k$ is subexponential.
1649: (This condition is amply satisfied for our example
1650: of a teacher giving feedback for every tenth object.)
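
To illustrate the criterion
(the sequences below are our own examples),
compare three feedback schedules:
\begin{align*}
  n_k&=10k: & \frac{n_k}{n_{k-1}}&=\frac{k}{k-1}\to1,\\
  n_k&=k^2: & \frac{n_k}{n_{k-1}}&=\left(\frac{k}{k-1}\right)^2\to1,\\
  n_k&=2^k: & \frac{n_k}{n_{k-1}}&=2.
\end{align*}
Validity in the sense of convergence in probability is therefore retained under the first two schedules,
even though under the second the gaps between consecutive labels grow without bound,
and is lost under the third, where the ratio does not converge to 1.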
1651: 
1652: \iffalse
Below are two examples of ``weak'' (slow and lazy) teachers at 99\%
confidence using the well-known NIST data set.
1655: 
1656: \begin{figure}
1657:   \centering
1658:   \makebox{\includegraphics[width=\picturewidth]{tcmSlow10.eps}}
1659:   \caption{\label{fig:slow teachers}An example of a Slow Teacher Predictor
1660:     with a delay of 10 examples on the NIST data set.}
1661: \end{figure}
1662: 
1663: \begin{figure}
1664:   \centering
1665:   \makebox{\includegraphics[width=\picturewidth]{tcmLazyAP10.eps}}
1666:   \caption{\label{fig:lazy}An example of a Lazy Teacher Predictor
    with delays following the arithmetic progression with coefficient 10 on the NIST data set.}
1668: \end{figure}
1669: \fi
1670: 
1671: The most standard \emph{batch} setting of the problem of prediction
1672: is in one respect even more demanding than our scenarios of weak teachers.
1673: In this setting we are given a training set (\ref{eq:training-set})
1674: and our goal is to predict the labels
1675: given the objects in the test set
1676: \begin{equation}\label{eq:test-set}
1677:   (x_{l+1},y_{l+1}),\ldots,(x_{l+k},y_{l+k}).
1678: \end{equation}
1679: This can be interpreted as a finite-horizon version
1680: of the lazy-teacher setting:
1681: no labels are returned after step $l$.
1682: Computer experiments (see, e.g., Figure \ref{fig:batch-errors})
1683: show that approximate validity still holds;
1684: for related theoretical results,
1685: see \cite{vovk/etal:2005}, Section 4.4.
1686: 
1687: \begin{figure}
1688:   \centering
1689:   \makebox{\includegraphics[width=\picturewidth]{TCM_test_errors_bw.eps}}
1690:   \caption{\label{fig:batch-errors}Cumulative numbers of errors made on the test set
1691:     by the 1-nearest neighbour conformal predictor
1692:     used in the batch mode on the USPS data set
1693:     (randomly permuted and split into a training set of size 7291 and a test set of size 2007)
1694:     at the confidence levels 80\%, 95\% and 99\%.}
1695: \end{figure}
1696: 
1697: \section{Induction and transduction}
1698: \label{sec:induction-transduction}
1699: 
1700: % Transductive vs. inductive inference
1701: 
1702: Vapnik's \cite{vapnik:1995,vapnik:1998}
1703: distinction between induction and transduction,
1704: as applied to the problem of prediction,
1705: is depicted in Figure \ref{fig:trans}.
1706: In \emph{inductive prediction}
1707: we first move from examples in hand to some more or less general rule,
1708: which we might call a prediction or decision rule,
1709: a model, or a theory;
1710: this is the \emph{inductive step}.
1711: When presented with a new object,
1712: we derive a prediction from the general rule;
1713: this is the \emph{deductive step}.
1714: In \emph{transductive prediction},
1715: we take a shortcut,
1716: moving from the old examples directly
1717: to the prediction about the new object.
1718: 
1719: \begin{figure}
1720:   \centering
1721:   \input{trans.pic}
1722:   \caption{\label{fig:trans}Inductive and transductive prediction.}
1723: \end{figure}
1724: 
1725: Typical examples of the inductive step
1726: are estimating parameters in statistics
1727: and finding an approximating function
1728: in statistical learning theory.
1729: Examples of transductive prediction
1730: are estimation of future observations in statistics
1731: (\cite{cox/hinkley:1974}, Section 7.5, \cite{takeuchi:1975})
1732: and nearest neighbours algorithms
1733: in machine learning.
1734: 
1735: In the case of simple (i.e., traditional, not hedged) predictions
1736: the distinction between induction and transduction
1737: is less than crisp.
1738: A method for doing transduction,
1739: in the simplest setting of predicting one label,
1740: is a method for predicting $y_{l+1}$
1741: from (\ref{eq:training-set}) and $x_{l+1}$.
1742: Such a method gives a prediction for any object
1743: that might be presented as $x_{l+1}$, and so it defines,
1744: at least implicitly, a rule,
1745: which might be extracted from the training set (\ref{eq:training-set}) (induction),
1746: stored, and then subsequently applied to $x_{l+1}$ to predict $y_{l+1}$ (deduction).
So any real distinction lies at a practical and computational level:
1748: do we extract and store the general rule or not?
1749: 
1750: For hedged predictions the difference between induction and transduction goes deeper.
1751: We will typically want different notions of hedged prediction
1752: in the two frameworks.
1753: Mathematical results about induction usually involve two parameters,
1754: often denoted $\epsilon$ (the desired accuracy of the prediction rule)
1755: and $\delta$ (the probability of achieving the accuracy of $\epsilon$),
1756: whereas results about transduction involve only one parameter,
1757: which we denote $\epsilon$ in this paper
1758: (the probability of error we are willing to tolerate);
1759: see Figure \ref{fig:trans}.
1760: For a review of inductive prediction
1761: from this point of view, see \cite{vovk/etal:2005}, Section 10.1.
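
Schematically
(in notation of ours: $F$ is the prediction rule found at the inductive step,
$R(F)$ is its probability of error on a new example,
and $\delta$ is read, as is common, as a bound on the probability of failing to achieve the accuracy $\epsilon$),
the two kinds of guarantee have the forms
\begin{align*}
  \mathbb{P}\left\{R(F)\le\epsilon\right\}&\ge1-\delta
  &&\text{(induction)},\\
  \mathbb{P}\left\{y_{l+1}\notin\Gamma^{\epsilon}\right\}&\le\epsilon
  &&\text{(transduction)},
\end{align*}
where the first probability is over the random choice of the training set
and the second is over the training set and the new example together.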
1762: 
1763: \section{Inductive conformal predictors}
1764: \label{sec:ICP}
1765: 
1766: % Computational issues: inductive conformal predictors
1767: 
1768: Our approach to prediction is thoroughly transductive,
1769: and this is what makes valid and efficient hedged prediction possible.
1770: In this section we will see, however,
1771: that there is also room for an element of induction
1772: in conformal prediction.
1773: 
1774: Let us take a closer look at the process of conformal prediction,
1775: as described in Section \ref{sec:conformal}.
1776: Suppose we are given a training set (\ref{eq:training-set})
1777: and the objects in a test set (\ref{eq:test-set}),
1778: and our goal is to predict the label of each test object.
1779: If we want to use the conformal predictor based on the support vector method,
1780: as described in Section \ref{sec:conformal},
1781: we will have to find the set of the Lagrange multipliers
1782: for each test object and for each potential label $Y$ that can be assigned to it.
1783: This would involve solving
1784: $k\left|\mathbf{Y}\right|$ essentially independent optimization problems.
1785: Using the nearest neighbours approach
1786: is typically more computationally efficient,
1787: but even it is much slower than the following procedure,
1788: suggested in \cite{papadopoulos/etal:2002a,papadopoulos/etal:2002b}.
1789: 
1790: Suppose we have an inductive algorithm which,
given a training set (\ref{eq:training-set}) and a new object $x$,
1792: outputs a prediction $\hat y$ for $x$'s label $y$.
1793: Fix some measure $\Delta(y,\hat y)$ of difference between $y$ and $\hat y$.
1794: The procedure is:
1795: \begin{enumerate}
1796: \item
1797:   Divide the original training set (\ref{eq:training-set})
1798:   into two subsets:
1799:   the \emph{proper training set}
1800:   $(x_1,y_1),\ldots,(x_m,y_m)$
1801:   and the \emph{calibration set}
1802:   $(x_{m+1},y_{m+1}),\ldots,(x_l,y_l)$.
1803: \item
1804:   Construct a prediction rule $F$ from the proper training set.
1805: \item
1806:   Compute the nonconformity score
1807:   \begin{equation*}
1808:     \alpha_i:=\Delta(y_i,F(x_i)),
1809:     \quad
1810:     i=m+1,\ldots,l,
1811:   \end{equation*}
1812:   for each example in the calibration set.
1813: \item
1814:   For every test object $x_i$,
1815:   $i=l+1,\ldots,l+k$,
1816:   do the following:
1817:   \begin{enumerate}
1818:   \item
1819:     for every possible label $Y\in\mathbf{Y}$
    compute the nonconformity score $\alpha_i:=\Delta(Y,F(x_i))$
1821:     and the p-value
1822:     \begin{equation*}
1823:       p_Y
1824:       :=
1825:       \frac
1826:       {
1827:         \#\{j\in\{m+1,\ldots,l,i\} \st \alpha_j\ge\alpha_i\}
1828:       }
1829:       {l-m+1};
1830:     \end{equation*}
1831:   \item
1832:     output the prediction sets
1833:     $
1834:       \Gamma^{\epsilon}
1835:       \left(
1836:         x_1,y_1,\ldots,x_{l},y_{l},x_{i}
1837:       \right)
1838:     $
1839:     given by the right-hand side of (\ref{eq:Gamma}).
1840:   \end{enumerate}
1841: \end{enumerate}
1842: This is a special case of ``inductive conformal predictors'',
1843: as defined in \cite{vovk/etal:2005}, Section 4.1.
1844: In the case of classification,
1845: of course,
1846: we could package the p-values as a simple prediction
1847: complemented with confidence (\ref{eq:conf}) and credibility (\ref{eq:cred}).
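
As a toy illustration of step 4
(the scores are hypothetical;
we assume that (\ref{eq:Gamma}) retains the labels with $p_Y>\epsilon$),
suppose the calibration set contains $l-m=4$ examples
with nonconformity scores $0.1$, $0.4$, $0.6$, $0.9$,
and that two candidate labels $Y$ and $Y'$ for a test object $x_i$
produce the scores $0.5$ and $1.2$, respectively.
Counting the scores that are at least as large
(the test example itself is always counted) gives
\begin{equation*}
  p_Y=\frac{2+1}{5}=0.6,
  \qquad
  p_{Y'}=\frac{0+1}{5}=0.2,
\end{equation*}
so that $Y'$ is excluded from $\Gamma^{\epsilon}$ for every $\epsilon\ge0.2$,
whereas $Y$ is excluded only for $\epsilon\ge0.6$.
Since no p-value can be smaller than $1/(l-m+1)$,
a label can be excluded at the confidence level $1-\epsilon$
only if $l-m+1\ge1/\epsilon$;
for example, at least $99$ calibration examples are needed at the level $99\%$.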
1848: 
1849: Inductive conformal predictors are valid in the sense that
1850: the probability of error
1851: \begin{equation*}
1852:   y_{i}
1853:   \notin
1854:   \Gamma^{\epsilon}
1855:   \left(
1856:     x_1,y_1,
    \ldots,
1858:     x_l,y_l,
1859:     x_{i}
1860:   \right)
1861: \end{equation*}
1862: ($i=l+1,\ldots,l+k$, $\epsilon\in(0,1)$)
1863: never exceeds $\epsilon$
1864: (cf.\ (\ref{eq:error})).
1865: The on-line version of inductive conformal predictors,
1866: with a stronger notion of validity,
1867: is described in \cite{vovk:2002}
1868: and \cite{vovk/etal:2005} (Section 4.1).
1869: 
1870: The main advantage of inductive conformal predictors
1871: is their computational efficiency:
1872: the bulk of the computations is performed only once,
1873: and what remains to do for each test example
1874: is to apply the prediction rule found at the inductive step,
1875: to apply $\Delta$ to find the nonconformity score $\alpha$ for this example,
1876: and to find the position of $\alpha$ among the nonconformity scores
1877: of the calibration examples.
The main disadvantage is a possible loss of predictive efficiency:
inductive conformal predictors have to split the training set into two parts,
whereas for conformal predictors
we can effectively use the whole training set
as both the proper training set and the calibration set.
1882: 
1883: \section{Conclusion}
1884: \label{sec:conclusion}
1885: 
1886: This paper shows how many machine-learning techniques
1887: can be complemented with provably valid measures
1888: of accuracy and reliability.
1889: We explained briefly how this can be done
1890: for support vector machines, nearest neighbours algorithms,
1891: and the ridge regression procedure,
1892: but the principle is general:
1893: virtually any (we are not aware of exceptions) successful prediction technique
1894: designed to work under the randomness assumption
1895: can be used to produce equally successful hedged predictions.
1896: Further examples are given in our recent book \cite{vovk/etal:2005}
1897: (joint with Glenn Shafer),
1898: where we construct conformal predictors and inductive conformal predictors
1899: based on nearest neighbours regression, logistic regression,
1900: bootstrap, decision trees, boosting, and neural networks;
1901: general schemes for constructing conformal predictors
1902: and inductive conformal predictors
1903: are given on pp.~28--29 and on pp.~99--100 of \cite{vovk/etal:2005},
1904: respectively.
1905: Replacing the original simple predictions with hedged predictions
1906: enables us to control the number of errors made
1907: by appropriately choosing the confidence level.
1908: 
1909: \section*{Acknowledgements}
1910: 
1911: This work is partially supported by MRC
1912: (grant % G0301107
1913: ``Pro\-te\-o\-mic analysis of the human serum pro\-te\-ome'')
1914: and the Royal Society
1915: (grant ``Efficient pseudo-random number generators'').
1916: 
1917: \begin{thebibliography}{99}
1918: 
1919: \bibitem{bellotti/etal:2005}
1920:   Bellotti, T., Luo, Z., Gammerman, A., van Delft, F.~W.\ and Saha, V.\ (2005)
1921:   Qualified predictions for microarray and proteomics pattern diagnostics with confidence machines.
1922:   \emph{International Journal of Neural Systems}, \textbf{15}, 247--258.
1923:   Yang, Z.~R.\ and Dalby, A.~R.\ (eds),
1924:   Special Issue on Bioinformatics.
1925: \bibitem{cesabianchi/lugosi:2006}
1926:   Cesa-Bianchi, N.\ and Lugosi, G.\ (2006)
1927:   \emph{Prediction, Learning, and Games}.
1928:   Cambridge University Press, Cambridge.
1929: \bibitem{cox/hinkley:1974}
1930:   Cox, D.~R.\ and Hinkley, D.~V.\ (1974)
1931:   \emph{Theoretical Statistics}.
1932:   Chapman and Hall, London.
1933: % \bibitem{gammerman/etal:1998}
1934: %   A.~Gammerman, V.~N.~Vapnik and V.~Vovk,
1935: %   Learning by transduction,
1936: %   in: G.~F.~Cooper and S.~Moral, eds.,
1937: %   \emph{Proceedings of the Fourteenth Conference
1938: %   on Uncertainty in Artificial Intelligence}
1939: %   (Morgan Kaufmann, San Francisco, CA, 1998)
1940: %   148--156.
1941: \bibitem{gammerman/thatcher:1992}
1942:   Gammerman, A.\ and Thatcher, A.~R.\ (1992)
1943:   Bayes\-ian diagnostic probabilities without assuming in\-de\-pen\-dence of symptoms.
1944:   \emph{Yearbook of Medical In\-for\-mat\-ics}, pp.~323--330.
1945: \bibitem{lecun/etal:1990}
1946:    LeCun, Y., Boser, B., Denker, J.~S., Henderson, D., How\-ard, R.~E.,
1947:    Hubbard, W.\ and Jackel, L.~J.\ (1990)
1948:    Handwritten digit recognition with backpropagation network.
1949:    In \emph{Advances in Neural Information Processing Systems 2},
1950:    pp.~396--404,
1951:    Morgan Kaufmann, San Ma\-teo, CA.
1952: \bibitem{li/vitanyi:1997}
1953:   Li, M.\ and Vit\'anyi, P.\ (1993)
1954:   \emph{An Introduction to Kolmogorov Complexity and Its Applications}.
1955:   Springer, New York.
1956:   Second edition: 1997.
1957: \bibitem{martin-lof:1966}
1958:   Martin-L\"of, P.\ (1966)
1959:   The definition of random sequences.
1960:   \emph{Information and Control}, \textbf{9}, 602--619.
1961: \bibitem{melluish/etal:2001}
  Melluish, T., Saunders, C., Nouretdinov, I.\ and Vovk, V.\ (2001)
1963:   Comparing the Bayes and typicalness frameworks.
  In De Raedt, L.\ and Flach, P.\ (eds),
1965:   \emph{Machine Learning: ECML 2001,
1966:   Proceedings of the Twelfth European Conference on Machine Learning,
1967:   LNAI}, \textbf{2167}, pp.~360--371,
1968:   Springer, Heidelberg.
1969:   Full version published as Technical Report TR-01-05,
1970:   Computer Learning Research Centre,
1971:   Royal Holloway, University of London.
1972: % (can be downloaded from \texttt{http://www.clrc.rhul.ac.uk}).
1973: \bibitem{nouretdinov/etal:2001rr}
1974:    Nouretdinov, I., Melluish, T.\ and Vovk, V.\ (2001)
1975:    Ridge Regression Confidence Machine.
1976:    In \emph{Proceedings of the Eighteenth International Conference
1977:    on Machine Learning}, pp.~385--392,
1978:    Morgan Kaufmann, San Fran\-cis\-co, CA.
1979: \bibitem{nouretdinov/vovk:2003}
1980:   Nouretdinov, I.\ and Vovk, V.\ (2003)
1981:   Criterion of calibration for transductive confidence machine with limited feedback.
1982:   In Gavald\`a, R., Jantke, K.~P.\ and Takimoto, E.\ (eds),
1983:   \emph{Proceedings of the Fourteenth International Conference on Algorithmic Learning Theory,
1984:   LNAI}, \textbf{2842}, pp.~259--267,
1985:   Springer, Berlin.
1986:   To appear in \emph{Theoretical Computer Science}
1987:   (special issue devoted to the ALT'2003 conference).
1988: % \bibitem{nouretdinov/etal:2001de}
1989: %   I.~Nouretdinov, V.~Vovk, M.~Vyugin and A.~Gammerman,
1990: %   Pattern recognition and density estimation under the general iid assumption,
1991: %   in: D.~Helmbold and B.~Williamson, eds.,
1992: %   \emph{Proceedings of the Fourteenth Annual Conference
1993: %   on Computational Learning Theory
1994: %   and Fifth European Conference
1995: %   on Computational Learning Theory},
1996: %   \emph{Lecture Notes in Artificial Intelligence},
1997: %   \textbf{2111} (2001) 337--353;
1998: %   Full version published as a CLRC technical report
1999: %   (can be downloaded from \texttt{http://www.clrc.rhul.ac.uk}).
2000: \bibitem{papadopoulos/etal:2002a}
2001:   Papadopoulos, H., Proedrou, K., Vovk, V.\ and Gammerman, A.\ (2002)
2002:   Inductive Confidence Machines for regression.
  In Elomaa, T., Mannila, H.\ and Toivonen, H.\ (eds),
2004:   \emph{Machine Learning: ECML 2002,
2005:   Proceedings of the Thirteenth European Conference on Machine Learning,
2006:   LNCS}, \textbf{2430}, pp.~345--356,
2007:   Springer, Berlin.
2008: \bibitem{papadopoulos/etal:2002b}
2009:   Papadopoulos, H., Vovk, V.\ and Gammerman, A.\ (2002)
2010:   Qualified predictions for large data sets in the case of pattern recognition.
2011:   In \emph{Proceedings of the International Conference on Machine Learning and Applications
2012:   (ICMLA'2002)}, pp.~159--163,
2013:   CSREA Press.
2014: \bibitem{popper:1934}
2015:   Popper, K.~R.\ (1934)
2016:   \emph{Logik der Forschung}.
2017:   Springer, Vienna.
2018:   English translation (1959):
2019:   \emph{The Logic of Sci\-en\-tif\-ic Discovery},
2020:   Hutchinson, London.
2021: % \bibitem{proedrou/etal:2002}
2022: %   Proedrou, K., Papadopoulos, H., Vovk, V.\ and Gammerman, A.\ (2002)
2023: %   Nearest Neighbours Transductive Confidence Machine,
2024: %   in: \emph{Proceedings of the Artificial Intelligence and Statistics Conference}
2025: \bibitem{ryabko/etal:2003}
2026:   Ryabko, D., Vovk, V.\ and Gammerman, A.\ (2003)
2027:   Online prediction with real teachers.
2028:   Technical Report CS-TR-03-09, Department of Computer Science,
2029:   Royal Holloway, University of London.
2030: % \bibitem{saunders/etal:1999}
2031: %   C.~Saunders, A.~Gammerman and V.~Vovk,
2032: %   Transduction with confidence and credibility,
2033: %   in: \emph{Proceedings of the Sixteenth International Joint Conference
2034: %   on Artificial Intelligence}
2035: %   (Morgan Kaufmann, 1999)
2036: %   722--726.
2037: % \bibitem{scholkopf/etal:1999}
2038: %   B.~Sch\"olkopf, C.~J.~C.~Burges and A.~J.~Smola, eds.,
2039: %   \emph{Advances in Kernel Methods, Support Vector Learning}
2040: %   (MIT Press, 1999).
2041: \bibitem{shahmuradov/etal:2005}
2042:   Shahmuradov, I.~A., Solovyev, V.~V.\ and Gammerman, A.\ (2005)
2043:   Plant promoter prediction with confidence estimation.
2044:   \emph{Nucleic Acids Research}, \textbf{33}, 1069--1076.
2045: \bibitem{sutton/barto:1998}
2046:   Sutton, R.~S.\ and Barto, A.~G.\ (1998)
2047:   \emph{Reinforcement Learning: An Introduction}.
2048:   MIT Press, Cambridge, MA.
2049: \ifLATIN
2050:   \bibitem{takeuchi:1975}
2051:     Takeuchi, K.\ (1975)
2052:     \emph{Statistical Pre\-dic\-tion Theory} (in Japanese).
2053:     Baih\=ukan, Tokyo.
2054: \fi
2055: \ifnotLATIN
2056:   \bibitem{takeuchi:1975}
2057:     Takeuchi, K.\ (1975)
2058:     \begin{CJK*}[dnp]{JIS}{min}Åý·×Ūͽ¬ÏÀ\end{CJK*}
2059:     (\emph{Statistical Pre\-dic\-tion Theory}).
2060:     Baih\=ukan, Tokyo.
2061: \fi
2062: \bibitem{vapnik:1995}
2063:   Vapnik, V.~N.\ (1995)
2064:   \emph{The Nature of Statistical Learning Theory}.
2065:   Springer, New York.
2066:   Second edition: 2000.
2067: \bibitem{vapnik:1998}
2068:   Vapnik, V.~N.\ (1998)
2069:   \emph{Statistical Learning Theory}.
2070:   Wiley, New York.
2071: \ifLATIN
2072:   \bibitem{vapnik/chervonenkis:1974}
2073:     Vapnik, V.~N.\ and Chervonenkis, A.~Y.\ (1974)
2074:     \emph{Theory of Pattern Rec\-og\-ni\-tion} (in Russian).
2075:     Nauka, Moscow.
2076:     German translation (1979): \emph{Theorie der Zeichenerkennung},
2077:     Akademie, Berlin.
2078: \fi
2079: \ifnotLATIN
2080:   \bibitem{vapnik/chervonenkis:1974}
2081:     Vapnik, V.~N.\ and Chervonenkis, A.~Y.\ (1974)
2082:     \begin{cyr}Te\-o\-ri{ya}\ ras\-po\-zna\-va\-ni{ya}\
2083:     ob\-ra\-zov\end{cyr} (\emph{Theory of Pattern Rec\-og\-ni\-tion}).
2084:     Nauka, Moscow.
2085:     German translation (1979): \emph{Theorie der Zeichenerkennung},
2086:     Akademie, Berlin.
2087: \fi
2088: \bibitem{vovk/etal:1999}
2089:   Vovk, V., Gammerman, A.\ and Saunders, C.\ (1999)
2090:   Machine-learning applications of algorithmic ran\-dom\-ness.
2091:   In Bratko, I.\ and Dzeroski, S.\ (eds),
2092:   \emph{Proceedings of the Sixteenth International Conference on Machine Learning},
2093:   pp.~444--453,
2094:   Morgan Kaufmann, San Fran\-cis\-co, CA.
2095: \bibitem{vovk:2001}
2096:   Vovk, V.\ (2001)
2097:   Competitive on-line statistics.
2098:   \emph{International Statistical Review}, \textbf{69}, 213--248.
2099: \bibitem{vovk:2002}
2100:   Vovk, V.\ (2002)
2101:   On-line Confidence Machines are well-calibrated.
2102:   In \emph{Proceedings of the Forty Third Annual Symposium on Foundations of Computer Science},
2103:   pp.~187--196,
2104:   IEEE Computer Society, Los Alamitos, CA.
2105: \bibitem{vovk/etal:2005}
2106:   Vovk, V., Gammerman, A.\ and Shafer, G.\ (2005)
2107:   \emph{Al\-go\-rith\-mic Learning in a Random World}.
2108:   Springer, New York.
2109: \end{thebibliography}
2110: \end{document}
2111: 
2112: 
2113: Remove:
2114: 
2115: \emergencystretch=5mm
2116: \tolerance=400
2117: \allowdisplaybreaks[3]
2118: 
2119: \newcommand{\Vladimir}{Vladimir }
2120: \newcommand{\DOT}{.}
2121: \newcommand{\zzrelax}[1]{}
2122: 
2123: \DeclareMathAlphabet{\mathbfit}{OT1}{cmr}{bx}{it}	% description: LATEX companion, pp.177 and 181
2124: 
2125: \newcommand{\st}{\mathrel{:}}
2126: \newcommand{\given}{\mathrel{|}}
2127: 
2128: \newcommand{\bbbr}{\mathbb{R}}		% real numbers
2129: \newcommand{\bbbc}{\mathbb{C}}		% complex numbers
2130: \newcommand{\bbbq}{\mathbb{Q}}		% rational numbers
2131: \newcommand{\bbbn}{\mathbb{N}}		% natural numbers
2132: \newcommand{\III}{\mathbb{I}}		% indicator
2133: \newcommand{\bbbp}{\mathbb{P}}		% auxiliary (probability)
2134: \newcommand{\bbbe}{\mathbb{E}}		% auxiliary (expectation)
2135: \newcommand{\K}{\mathcal{K}}		% capital
2136: \newcommand{\FFF}{\mathcal{F}}		% sigma-algebra
2137: \newcommand{\GGG}{\mathcal{G}}		% sigma-algebra
2138: \newcommand{\PPP}{\mathcal{P}}		% statistical model
2139: 
2140: \newcommand{\Prob}{\mathop{\bbbp}\nolimits}
2141: \newcommand{\Expect}{\mathop{\bbbe}\nolimits}
2142: %\newcommand{\LP}{\mathop{\underline{\bbbp}}\nolimits}
2143: %\newcommand{\UP}{\mathop{\overline{\bbbp}}\nolimits}
2144: %\newcommand{\ULP}{\mathop{\overline{\underline{\bbbp}}}\nolimits}
2145: \newcommand{\sign}{\mathop{{\rm sign}}\nolimits}
2146: \newcommand{\var}{\mathop{{\rm var}}\nolimits}
2147: \newcommand{\co}{\mathop{{\rm co}}\nolimits}
2148: \newcommand{\rank}{\mathop{{\rm rank}}\nolimits}
2149: \newcommand{\err}{\mathop{{\rm err}}\nolimits}
2150: \newcommand{\Err}{\mathop{{\rm Err}}\nolimits}
2151: \newcommand{\length}{\mathop{{\rm length}}\nolimits}
2152: \newcommand{\lth}{\mathop{{\rm lth}}\nolimits}
2153: \newcommand{\Lth}{\mathop{{\rm Lth}}\nolimits}
2154: 
2155: \newenvironment{Proof}[1]
2156:   {\trivlist\item[\hskip\labelsep\textbf{Proof #1}]}
2157:   {\endtrivlist}
2158: \newcommand{\boxforqed}{\rule{.3em}{1.5ex}}
2159: \newcommand{\qedtext}{\unskip\nobreak\hfil
2160:   \penalty50\hskip1em\null\nobreak\hfil\boxforqed
2161:   \parfillskip=0pt\finalhyphendemerits=0\endgraf}
2162: \newcommand{\qedmath}{\eqno\boxforqed}
2163: \newtheorem{Remark}{Remark}
2164: \newenvironment{remark}
2165:   {\begin{Remark} \begingroup\rm}
2166:   {\endgroup \end{Remark}}
2167: \newenvironment{remark*}
2168:   {\trivlist\item[\hskip\labelsep{\bfseries Remark}]\relax}
2169:   {\endtrivlist}
2170: 
2171: \begin{document}
2172: \label{firstpage}
2173: \maketitle
2174: 
2175: \begin{abstract}
2176:   We consider the on-line predictive version
2177:   of the standard problem of linear regression;
2178:   the goal is to predict each consecutive response
2179:   given the corresponding explanatory variables
2180:   and all the previous observations.
2181:   The standard treatment of prediction in linear regression analysis
2182:   has two drawbacks:
2183:   (1) the usual prediction intervals
2184:   guarantee that the probability of error
2185:   is equal to the nominal significance level $\epsilon$,
2186:   but this property per se does not imply that the long-run frequency of error
2187:   is close to $\epsilon$;
2188:   (2) it is not suitable for prediction of complex systems
2189:   as it assumes that the number of observations
2190:   exceeds the number of parameters.
2191:   We state a general result showing that in the on-line protocol
2192:   the frequency of error does equal the nominal significance level,
2193:   up to statistical fluctuations,
2194:   and we describe alternative regression models
2195:   in which informative prediction intervals can be found
2196:   before the number of observations exceeds the number of parameters.
2197:   One of these models,
2198:   which only assumes that the observations are independent and identically distributed,
2199:   is popular in machine learning but
2200:   greatly underused in the statistical theory of regression.
2201: \end{abstract}
2202: 
2203: \ifJOURNAL
2204:   \noindent
2205:   \textbf{Key words:}
2206:   Gauss linear model; independent identically distributed observations;
2207:   multivariate analysis; on-line protocol; prequential statistics; regression
2208: \fi
2209: