1: % Last changed by Volodya, 17 Oct 2006
2: % Spell checked (UK): 17 Oct 2006
3: % Main message: you can control the number of mistakes.
4: % 1957 lines, 79 KB
5:
6: \newif\ifJOURNAL
7: \JOURNALfalse
8: \newif\ifWP
9: \WPfalse
10: \newif\ifarXiv
11: \arXivfalse
12:
13: %\JOURNALtrue % choose JOURNAL, WP, or arXiv
14: %\WPtrue
15: \arXivtrue
16:
17: \newif\ifnotJOURNAL % derivative conditional
18: \notJOURNALtrue
19: \ifJOURNAL\notJOURNALfalse\fi
20:
21: \newif\ifLATIN % LATIN means that the Cyrillic references should be set in Latin
22:
23: \ifJOURNAL
24: \documentclass{cja4}
25:
26: %%the optional argument is used to get times font instead of CMR
27: %\documentclass[mathtime]{cja4}
28:
29: \copyrightyear{2006}
30: \vol{00}
31: \issue{0}
32: \DOI{000}
33: \usepackage{amsmath,amsfonts,latexsym,graphicx}
34: \LATINfalse
35: \fi
36:
37: \ifWP
38: \documentclass[toc]{kpnsarticle}
39: \usepackage{amsmath,amsfonts,latexsym,graphicx,epsfig}
40: \LATINfalse
41: \fi
42:
43: \ifarXiv
44: \documentclass{article}
45: \usepackage{amsmath,amsfonts,latexsym,graphicx}
46: \LATINtrue
47: \fi
48:
49: \newif\ifnotLATIN % derivative conditional
50: \notLATINtrue
51: \ifLATIN\notLATINfalse\fi
52:
53: \emergencystretch=5mm
54: \tolerance=400
55: \allowdisplaybreaks[3]
56: %\input{hyphenation.txt}
57:
58: \ifnotLATIN
59: \usepackage{CJK}
60: \input{OT2enc.def}
61: \newenvironment{cyr}
62: {\fontencoding{OT2}\fontfamily{wncyr}\fontseries{m}\fontshape{n}\selectfont}
63: {\fontencoding{OT1}\fontfamily{tir}\selectfont}
64: \fi
65:
66: \newcommand{\bbbr}{{\mathbb{R}}}
67: \newcommand{\bbbn}{{\mathbb{N}}}
68: \newcommand{\st}{:}
69: \newcommand{\given}{\mathbin{|}}
70:
71: \newlength{\picturewidth}
72: \ifJOURNAL
73: \setlength{\picturewidth}{0.98\columnwidth}
74: \fi
75: \ifnotJOURNAL
76: \setlength{\picturewidth}{0.72\columnwidth}
77: \fi
78:
79: \newcommand{\E}{{\bf E}}
80:
81: \newcommand{\bbbe}{{\mathbb{E}}} % expected value
82: \newcommand{\Expect}{\mathop{\bbbe}\nolimits}
83:
84: \newcommand{\Err}{\mathop{{\rm Err}}\nolimits}
85: \newcommand{\err}{\mathop{{\rm err}}\nolimits}
86:
87: \newcommand{\Mult}{\mathop{{\rm Mult}}\nolimits}
88: \newcommand{\mult}{\mathop{{\rm mult}}\nolimits}
89:
90: \newcommand{\Emp}{\mathop{{\rm Emp}}\nolimits}
91: \newcommand{\emp}{\mathop{{\rm emp}}\nolimits}
92:
93: \ifnotJOURNAL
94: \newtheorem{lemma}{Lemma}
95: \newtheorem{proposition}{Proposition}
96: \newtheorem{corollary}{Corollary}
97: \newtheorem{theorem}{Theorem}
98: \newenvironment{proof}
99: {\trivlist\item[\hskip\labelsep\textbf{Proof}]}
100: {\endtrivlist}
101: \fi
102:
103: \newenvironment{remark*}
104: {\trivlist\item[\hskip\labelsep{\bfseries Remark}]\relax}
105: {\endtrivlist}
106: \newenvironment{definition*}
107: {\trivlist\item[\hskip\labelsep{\bfseries Definition}]\relax}
108: {\endtrivlist}
109:
110: \ifWP
111: \title{Hedging Predictions in Machine Learning}
112: \author{Alexander Gammerman and Vladimir Vovk}
113: \newcommand{\No}{2}
114: %For the two dates option: uncomment the next 2 lines
115: %\twodatestrue
116: %\newcommand{\firstposted}{November 2, 2006}
117: \fi
118:
119: \ifarXiv
120: \title{Hedging Predictions in Machine Learning}
121: \author{Alexander Gammerman and Vladimir Vovk\\
122: Computer Learning Research Centre\\
123: Department of Computer Science\\
124: Royal Holloway, University of London\\
125: Egham, Surrey TW20 0EX, UK\\
126: \texttt{\{alex,vovk\}@cs.rhul.ac.uk}}
127: \fi
128:
129: \begin{document}
130: \ifJOURNAL
131: \title[Hedging Predictions]{Hedging Predictions\\in Machine Learning}
132: % {\large preliminary draft, 28 April 2006}}
133: \author{Alexander Gammerman}
134: \author{Vladimir Vovk}
135: \affiliation{Computer Learning Research Centre,
136: Royal Holloway, University of London\\
137: Egham, Surrey TW20 0EX}
138: \email{\{alex,vovk\}@cs.rhul.ac.uk}
139:
140: \shortauthors{A.~Gammerman and V.~Vovk}
141:
142: \received{00 Month 2006}
143: \revised{00 Month 2006}
144: \fi
145:
146: \ifnotJOURNAL
147: \maketitle
148: \fi
149:
150: \begin{abstract}
151: Recent advances in machine learning make it possible
152: to design efficient prediction algorithms for data sets with huge numbers of parameters.
153: This paper describes a new technique for ``hedging'' the predictions
154: output by many such algorithms,
155: including support vector machines, kernel ridge regression, kernel nearest neighbours,
156: and by many other state-of-the-art methods.
157: The hedged predictions for the labels of new objects
158: include quantitative measures of their own accuracy and reliability.
159: These measures are provably valid under the assumption of randomness,
160: traditional in machine learning:
161: the objects and their labels are assumed to be generated independently
162: from the same probability distribution.
163: In particular, it becomes possible to control (up to statistical fluctuations)
164: the number of erroneous predictions by selecting a suitable confidence level.
165: Validity being achieved automatically,
166: the remaining goal of hedged prediction is efficiency:
167: taking full account of the new objects' features
168: and other available information to produce as accurate predictions as possible.
169: This can be done successfully using the powerful machinery of modern machine learning.
170: \end{abstract}
171:
172: \ifJOURNAL
173: \keywords{Classification, confidence, induction, learning, prediction, randomness, regression, transduction}
174:
175: \maketitle
176: \fi
177:
178: \section{Introduction}
179: \label{sec:introduction}
180:
181: % 1. Successes of machine learning:
182: % prediction under only one assumption (randomness)
183: % kernel methods: high-dimensional data
184: % 2. Weak point: no confidence, or loose bounds, or strong assumptions (Bayesian)
185: % 3. Advantages of conformal prediction
186: % 4. Contents of this paper
187:
188: The two main varieties of the problem of prediction,
189: classification and regression,
190: % I talk about classification and regression
191: % since prediction is often associated with the Kalman filter,
192: % which is not covered in this paper
193: % (because it works outside the randomness assumption)
194: are standard subjects in statistics and machine learning.
195: The classical classification and regression techniques
196: can deal successfully with conventional small-scale, low-dimensional data sets;
197: however, attempts to apply these techniques to modern high-dimensional and high-throughput data sets
198: encounter serious conceptual and computational difficulties.
199: Several new techniques,
most notably support vector machines \cite{vapnik:1995,vapnik:1998}
201: and other kernel methods,
202: have been developed in machine learning recently
203: with the explicit goal of dealing with high-dimensional data sets
204: % kernel methods: we do not need to process many attributes explicitly
205: with large numbers of objects.
206: % at some point we can discard all elements that are not support vectors
207:
208: A typical drawback of the new techniques is the lack of useful measures of confidence
209: in their predictions.
210: For example, some of the tightest upper bounds of the popular PAC theory
211: on the probability of error exceed~1 even for relatively clean data sets
212: (\cite{vovk/etal:2005}, p.~249).
213: This paper describes an efficient way to ``hedge'' the predictions
214: produced by the new and traditional machine-learning methods,
215: i.e., to complement them with measures of their accuracy and reliability.
216: Appropriately chosen,
217: not only are these measures valid and informative,
218: but they also take full account of the special features
219: of the object to be predicted.
220:
221: We call our algorithms for producing hedged predictions ``conformal predictors'';
222: they are formally introduced in Section \ref{sec:conformal}.
Their most important property is their automatic validity under the randomness assumption
224: (to be discussed shortly).
225: Informally, validity means that conformal predictors never overrate
226: the accuracy and reliability of their predictions.
227: This property, stated in Sections \ref{sec:conformal} and \ref{sec:on-line},
228: is formalized in terms of finite data sequences,
229: without any recourse to asymptotics.
230:
231: The claim of validity of conformal predictors
232: depends on an assumption that is shared by many other algorithms in machine learning,
233: which we call the assumption of randomness:
234: the objects and their labels are assumed to be generated independently
235: from the same probability distribution.
236: Admittedly, this is a strong assumption,
237: and areas of machine learning are emerging
238: that rely on other assumptions
239: (such as the Markovian assumption of reinforcement learning;
240: see, e.g., \cite{sutton/barto:1998})
241: or dispense with any stochastic assumptions altogether
242: (competitive on-line learning;
243: see, e.g., \cite{cesabianchi/lugosi:2006,vovk:2001}).
244: It is, however, much weaker than assuming a parametric statistical model,
245: sometimes complemented with a prior distribution on the parameter space,
246: which is customary in the statistical theory of prediction.
247: And taking into account the strength of the guarantees that can be proved
248: under this assumption,
249: it does not appear overly restrictive.
250:
251: So we know that conformal predictors tell the truth.
252: Clearly, this is not enough:
253: truth can be uninformative and so useless.
254: We will refer to various measures of informativeness of conformal predictors
255: as their ``efficiency''.
256: As conformal predictors are provably valid,
257: efficiency is the only thing we need to worry about
258: when designing conformal predictors
259: for solving specific problems.
260: Virtually any classification or regression algorithm
261: can be transformed into a conformal predictor,
262: and so most of the arsenal of methods of modern machine learning
263: can be brought to bear on the design of efficient conformal predictors.
264:
265: We start the main part of the paper, in Section \ref{sec:ideal},
266: with the description of an idealized predictor
267: based on Kolmogorov's algorithmic theory of randomness.
268: This ``universal predictor'' produces the best possible hedged predictions
269: but, unfortunately, is noncomputable.
270: We can, however, set ourselves the task of approximating the universal predictor
271: as well as possible.
272:
273: In Section \ref{sec:conformal} we formally introduce the notion of conformal predictors
274: and state a simple result about their validity.
275: In that section we also briefly describe results of computer experiments
276: demonstrating the methodology of conformal prediction.
277:
278: In Section \ref{sec:Bayesian} we consider an example demonstrating
279: how conformal predictors react to the violation of our model
280: of the stochastic mechanism generating the data
281: (within the framework of the randomness assumption).
282: If the model coincides with the actual stochastic mechanism,
283: we can construct an optimal conformal predictor,
284: which turns out to be almost as good as the Bayes-optimal confidence predictor
285: (the formal definitions will be given later).
286: When the stochastic mechanism significantly deviates from the model,
287: conformal predictions remain valid but their efficiency inevitably suffers.
288: The Bayes-optimal predictor starts producing very misleading results
289: which superficially look as good as when the model is correct.
290:
291: In Section \ref{sec:on-line} we describe the ``on-line'' setting
292: of the problem of prediction,
293: and in Section \ref{sec:slow} contrast it with the more standard ``batch'' setting.
294: The notion of validity introduced in Section \ref{sec:conformal}
295: is applicable to both settings,
296: but in the on-line setting it can be strengthened:
we can now prove that the percentage of erroneous predictions
will be close, with high probability,
to the chosen significance level (one minus the chosen confidence level).
300: For the batch setting,
301: the stronger property of validity for conformal predictors
302: remains an empirical fact.
303: In Section \ref{sec:slow} we also discuss limitations of the on-line setting
304: and introduce new settings intermediate between on-line and batch.
305: To a large degree,
306: conformal predictors still enjoy the stronger property of validity
307: for the intermediate settings.
308:
309: Section \ref{sec:induction-transduction} is devoted
310: to the discussion of the difference between two kinds of inference from empirical data,
311: induction and transduction
312: (emphasized by Vladimir Vapnik \cite{vapnik:1995,vapnik:1998}).
313: Conformal predictors belong to transduction,
314: but combining them with elements of induction
315: can lead to a significant improvement in their computational efficiency
316: (Section \ref{sec:ICP}).
317:
318: We show how some popular methods of machine learning
319: can be used as underlying algorithms for hedged prediction.
320: We do not give the full description of these methods
321: and refer the reader to the existing readily accessible descriptions.
322: This paper is, however, self-contained in the sense
323: that we explain all features of the underlying algorithms
324: that are used in hedging their predictions.
325: We hope that the information we provide will enable the reader
326: to apply our hedging techniques
327: to their favourite machine-learning methods.
328:
329: \section{Ideal hedged predictions}
330: \label{sec:ideal}
331:
332: % Algorithmic randomness and idealized conformal predictors
333: % (interesting objects for math research)
334:
335: The most basic problem of machine learning is perhaps the following.
336: We are given a \emph{training set} of \emph{examples}
337: \begin{equation}\label{eq:training-set}
338: (x_1,y_1),\ldots,(x_l,y_l),
339: \end{equation}
340: each example $(x_i,y_i)$, $i=1,\ldots,l$, consisting of an \emph{object} $x_i$
341: (typically, a vector of attributes)
342: and its label $y_i$;
343: the problem is to predict the label $y_{l+1}$
344: of a new object $x_{l+1}$.
345: Two important special cases are where the labels are known \emph{a priori}
346: to belong to a relatively small finite set
347: (the problem of \emph{classification})
348: and where the labels are allowed to be any real numbers
349: (the problem of \emph{regression}).
350:
351: The usual goal of classification is to produce a prediction $\hat y_{l+1}$
352: that is likely to coincide with the true label $y_{l+1}$,
353: and the usual goal of regression is to produce a prediction $\hat y_{l+1}$
354: that is likely to be close to the true label $y_{l+1}$.
355: In the case of classification,
356: our goal will be to complement the prediction $\hat y_{l+1}$
357: with some measure of its reliability.
358: In the case of regression,
359: we would like to have some measure of accuracy and reliability of our prediction.
360: There is a clear trade-off between accuracy and reliability:
361: we can improve the former by relaxing the latter
362: and vice versa.
363: We are looking for algorithms that achieve the best possible trade-off
364: and for a measure that would quantify the achieved trade-off.
365:
Let us start with the case of classification.
367: The idea is to try every possible label $Y$ as a candidate for $x_{l+1}$'s label
368: and see how well the resulting sequence
369: \begin{equation}\label{eq:completion}
370: (x_1,y_1),\dots,(x_l,y_l),(x_{l+1},Y)
371: \end{equation}
372: conforms to the randomness assumption
373: (if it does conform to this assumption, we will say that it is ``random'';
374: this will be formalized later in this section).
375: The ideal case is where all $Y$s but one lead to sequences (\ref{eq:completion})
376: that are not random;
377: we can then use the remaining $Y$ as a confident prediction for $y_{l+1}$.
378:
379: In the case of regression,
380: we can output the set of all $Y$s that lead to random (\ref{eq:completion})
381: as our ``prediction set''.
382: An obvious obstacle is that the set of all possible $Y$s is infinite
383: and so we cannot go through all the $Y$s explicitly,
384: but we will see in the next section that there are ways to overcome this difficulty.
385:
386: We can see that the problem of hedged prediction
387: is intimately connected with the problem of testing randomness.
388: Different versions of the ``universal'' notion of randomness
389: were defined by Kolmogorov, Martin-L\"of and Levin (see, e.g., \cite{li/vitanyi:1997})
390: based on the existence of universal Turing machines.
391: Adapted to our current setting,
392: Martin-L\"of's definition is as follows.
393: Let $\mathbf{Z}$ be the set of all possible examples;
394: as each example consists of an object and a label,
395: $\mathbf{Z}=\mathbf{X}\times\mathbf{Y}$,
396: where $\mathbf{X}$ is the set of all possible objects
397: and $\mathbf{Y}$, $\left|\mathbf{Y}\right|>1$, is the set of all possible labels.
398: We will use $\mathbf{Z}^*$ as the notation for all finite sequences of examples.
399: A function $t:\mathbf{Z}^*\to[0,1]$
400: is a \emph{randomness test} if
401: \begin{enumerate}
402: \item
403: for all $\epsilon\in(0,1)$, all $n\in\{1,2,\dots\}$
404: and all probability distributions $P$ on $\mathbf{Z}$,
405: \begin{equation}\label{eq:test-validity}
406: P^n
407: \left\{
408: z\in\mathbf{Z}^n
409: \st
410: t(z)\le\epsilon
411: \right\}
412: \le
413: \epsilon;
414: \end{equation}
415: \item
416: $t$ is upper semicomputable.
417: \end{enumerate}
418: The first condition means that the randomness test is required to be valid:
419: if, for example, we observe $t(z)\le1\%$ for our data set $z$,
420: then either the data set was not generated independently from the same probability distribution $P$
421: or a rare (of probability at most 1\%, under any $P$) event has occurred.
422: The second condition means that
423: we should be able to compute the test, in a weak sense
424: (we cannot require computability in the usual sense,
425: since the universal test can only be upper semicomputable:
426: it can work forever to discover \emph{all} patterns in the data sequence
427: that make it non-random).
428: Martin-L\"of (developing Kolmogorov's earlier ideas) proved
429: that there exists a smallest, to within a constant factor,
430: randomness test.
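
To see why validity alone is not enough,
consider a simple illustration (ours, not part of Martin-L\"of's construction):
the constant function $t(z):=1$ is a randomness test.
Indeed, for every $\epsilon\in(0,1)$ the event $t(z)\le\epsilon$ is empty, so
\[
  P^n
  \left\{
    z\in\mathbf{Z}^n
    \st
    t(z)\le\epsilon
  \right\}
  =
  0
  \le
  \epsilon,
\]
and a constant function is trivially upper semicomputable.
This test never detects any lack of randomness and so is useless;
the smaller the values that a valid test takes,
the more patterns it detects,
which is why a \emph{smallest} randomness test is of interest.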
431:
432: Let us fix a smallest randomness test,
433: call it the \emph{universal test},
434: and call the value it takes on a data sequence
435: the \emph{randomness level} of this sequence.
436: A random sequence is one whose randomness level is not small;
437: this is rather informal,
438: but it is clear that for finite data sequences we cannot have a clear-cut division
439: of all sequences into random and non-random
440: (like the one defined by Martin-L\"of \cite{martin-lof:1966} for infinite sequences).
441: If $t$ is a randomness test, not necessarily universal,
442: the value that it takes on a data sequence will be called
443: the \emph{randomness level detected by} $t$.
444:
445: \begin{remark*}
446: The word ``random'' is used in (at least) two different senses in the existing literature.
447: In this paper we need both but, luckily,
448: the difference does not matter within our current framework.
449: First, randomness can refer to the assumption that the examples
450: are generated independently from the same distribution;
451: this is the origin of our ``assumption of randomness''.
452: Second, a data sequence is said to be random with respect to a statistical model
453: if the universal test (a generalization of the notion of universal test as defined above)
454: does not detect any lack of conformity between the two.
Since the only statistical model of interest in this paper
456: is the one embodying the assumption of randomness,
457: we have a perfect agreement between the two senses.
458: \end{remark*}
459:
460: \subsection*{Prediction with Confidence and Credibility}
461:
462: Once we have a randomness test $t$, universal or not,
463: we can use it for hedged prediction.
464: There are two natural ways to package the results
465: of such predictions:
466: in this subsection we will describe the way that can only be used
467: in classification problems.
468: If the randomness test is not computable,
469: we can imagine an oracle answering questions about its values.
470:
471: Given the training set (\ref{eq:training-set}) and the test object $x_{l+1}$,
472: we can act as follows:
473: \begin{itemize}
474: \item
475: consider all possible values $Y\in\mathbf{Y}$
476: for the label $y_{l+1}$;
477: \item
478: find the randomness level detected by $t$ for every possible completion (\ref{eq:completion});
479: \item
480: predict the label $Y$ corresponding to a completion
481: with the largest randomness level detected by $t$;
482: \item
483: output as the \emph{confidence} in this prediction
484: one minus the second largest randomness level detected by $t$;
485: \item
486: output as the \emph{credibility} of this prediction
487: the randomness level detected by $t$
488: of the output prediction $Y$
489: (i.e., the largest randomness level detected by $t$ over all possible labels).
490: \end{itemize}
491: To understand the intuition behind confidence,
492: let us tentatively choose a conventional ``significance level'', such as $1\%$.
493: (In the terminology of this paper, this corresponds to a ``confidence level'' of $99\%$,
494: i.e.,
495: $100\%$ minus $1\%$.)
496: If the confidence in our prediction is $99\%$ or more
497: and the prediction is wrong,
498: the actual data sequence belongs to an \emph{a priori} chosen
499: set of probability at most $1\%$
500: (the set of all data sequences with randomness level detected by $t$
501: not exceeding $1\%$).
502:
503: Intuitively, low credibility means that
504: either the training set is non-random
505: or the test object is not representative of the training set
506: (say, in the training set we have images of digits
507: and the test object is that of a letter).
508:
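As a numerical illustration
(the numbers are hypothetical and chosen only for concreteness),
suppose that $\mathbf{Y}=\{1,2,3\}$
and that the randomness levels detected by $t$
for the three completions (\ref{eq:completion}) are
\[
  Y=1\st 0.004,
  \qquad
  Y=2\st 0.62,
  \qquad
  Y=3\st 0.031.
\]
The largest randomness level corresponds to $Y=2$, which is therefore the prediction;
the second largest is $0.031$, so the confidence is $1-0.031=0.969$,
and the credibility is $0.62$.
At the significance level $5\%$ both alternative labels can be rejected,
but at the significance level $1\%$ the label $3$ cannot be ruled out,
which is why the confidence falls short of $99\%$.
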
509: \subsection*{Confidence Predictors}
510:
511: In regression problems,
512: confidence, as defined in the previous subsection,
513: is not a useful quantity:
514: it will typically be equal to 0.
515: A better approach is to choose a range of confidence levels $1-\epsilon$,
516: and for each of them specify a \emph{prediction set}
517: $\Gamma^{\epsilon}\subseteq\mathbf{Y}$,
518: the set of labels deemed possible at the confidence level $1-\epsilon$.
519: We will always consider nested prediction sets:
520: $\Gamma^{\epsilon_1}\subseteq\Gamma^{\epsilon_2}$ when $\epsilon_1\ge\epsilon_2$.
521: A \emph{confidence predictor} is a function
522: that maps each training set, each new object, and each confidence level $1-\epsilon$
523: (formally, we allow $\epsilon$ to take any value in $(0,1)$)
524: to the corresponding prediction set $\Gamma^{\epsilon}$.
For the confidence predictor to be \emph{valid}, the probability that the true label
526: will fall outside the prediction set $\Gamma^{\epsilon}$ should not exceed $\epsilon$,
527: for each $\epsilon$.
528:
529: We might, for example, choose the confidence levels 99\%, 95\% and 80\%,
530: and refer to the 99\% prediction set $\Gamma^{1\%}$ as the highly confident prediction,
531: to the 95\% prediction set $\Gamma^{5\%}$ as the confident prediction,
532: and to the 80\% prediction set $\Gamma^{20\%}$ as the casual prediction.
533: Figure \ref{fig:predset} shows how such a family of prediction sets might look
534: in the case of a rectangular label space $\mathbf{Y}$.
535: The casual prediction pinpoints the target quite well,
536: but we know that this kind of prediction can be wrong with probability 20\%.
537: The confident prediction is much bigger.
538: If we want to be highly confident
539: (make a mistake only with probability 1\%),
540: we must accept an even lower accuracy;
541: there is even a completely different location that we cannot rule out
542: at this level of confidence.
543: % In principle, a confidence predictor outputs prediction sets
544: % for all confidence levels, and these sets are nested,
545: % as in the figure above.
546:
547: \begin{figure}
548: \centering
549: \makebox{\includegraphics[width=\picturewidth,clip=true]{predset.eps}}
550: \caption{\label{fig:predset}An example of a nested family of prediction sets
551: (casual prediction in black,
552: confident prediction in dark grey,
553: and highly confident prediction in light grey).}
554: \end{figure}
555:
556: Given a randomness test, again universal or not,
557: we can define the corresponding confidence predictor as follows:
558: for any confidence level $1-\epsilon$,
559: the corresponding prediction set consists of the $Y$s
560: such that the randomness level of the completion (\ref{eq:completion})
561: detected by the test is greater than $\epsilon$.
562: The condition (\ref{eq:test-validity}) of validity for statistical tests
563: implies that a confidence predictor defined in this way
564: is always valid.
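
The argument can be spelled out in symbols
(the shorthand $z_Y$ is introduced here only for this illustration).
Write $z_Y$ for the completion (\ref{eq:completion}) with the candidate label $Y$,
so that the prediction set at the confidence level $1-\epsilon$ is
\[
  \Gamma^{\epsilon}
  :=
  \left\{
    Y\in\mathbf{Y}
    \st
    t(z_Y)>\epsilon
  \right\}.
\]
The true label $y_{l+1}$ falls outside $\Gamma^{\epsilon}$
if and only if $t(z_{y_{l+1}})\le\epsilon$,
where $z_{y_{l+1}}$ is the actual data sequence of length $l+1$.
By (\ref{eq:test-validity}) applied with $n=l+1$,
the probability of this event under the randomness assumption is at most $\epsilon$.
The prediction sets are also automatically nested:
if $\epsilon_1\ge\epsilon_2$,
then $t(z_Y)>\epsilon_1$ implies $t(z_Y)>\epsilon_2$,
and so $\Gamma^{\epsilon_1}\subseteq\Gamma^{\epsilon_2}$.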
565:
566: The confidence predictor based on the universal test
567: (the \emph{universal confidence predictor})
568: is an interesting object for mathematical investigation
569: (see, e.g., \cite{vovk/etal:1999}, Section 4),
570: but it is not computable and so cannot be used in practice.
571: Our goal in the following sections will be
572: to find computable approximations to it.
573:
574: \section{Conformal Prediction}
575: \label{sec:conformal}
576:
577: % Practical approximation: conformal prediction (universal for invariant predictors)
578:
579: In the previous section we explained how randomness tests
580: can be used for prediction.
581: The connection between testing and prediction is, of course, well understood
and has been discussed at length by philosophers \cite{popper:1934}
583: and statisticians
584: (see, e.g., the textbook \cite{cox/hinkley:1974}, Section 7.5).
585: % In fact, this connection is two-way,
586: % so we do not lose anything basing our predictions on testing.
587: In this section we will see how some popular prediction algorithms
588: can be transformed into randomness tests
589: and, therefore, be used for producing hedged predictions.
590:
591: Let us start with the most successful recent development in machine learning,
592: support vector machines
593: (\cite{vapnik:1995,vapnik:1998},
594: with a key idea going back
595: to the generalized portrait method \cite{vapnik/chervonenkis:1974}).
596: Suppose the label space is $\mathbf{Y}=\{-1,1\}$
597: (we are dealing with the binary classification problem).
598: With each set of examples
599: \begin{equation}\label{eq:set}
600: (x_1,y_1),
601: \ldots,
602: (x_n,y_n)
603: \end{equation}
604: one associates an optimization problem
605: whose solution produces nonnegative numbers $\alpha_1,\ldots,\alpha_n$
606: (``Lagrange multipliers'').
607: These numbers determine the prediction rule used by the support vector machine
608: (see \cite{vapnik:1998}, Chapter 10, for details),
but they are also interesting objects in their own right.
610: Each $\alpha_i$, $i=1,\ldots,n$, tells us
611: how ``strange'' an element of the set (\ref{eq:set})
612: the corresponding example $(x_i,y_i)$ is.
613: If $\alpha_i=0$, $(x_i,y_i)$ fits (\ref{eq:set}) very well
614: (in fact so well that such examples are uninformative,
615: and the support vector machine ignores them when making predictions).
616: The elements with $\alpha_i>0$ are called \emph{support vectors},
and a large value of $\alpha_i$ indicates
618: that the corresponding $(x_i,y_i)$ is an outlier.
619: % It is customary to impose an upper bound $C$ on the values of $\alpha_i$,
620: % one reason being to prevent the outliers affecting too much the prediction
621: % (the other to delimit the search space).
622:
623: Taking the completion (\ref{eq:completion}) as (\ref{eq:set})
624: (so that $n=l+1$),
625: we can find the corresponding $\alpha_1,\ldots,\alpha_{l+1}$.
626: If $Y$ is different from the actual label $y_{l+1}$,
627: we expect $(x_{l+1},Y)$ to be an outlier in (\ref{eq:completion})
and so $\alpha_{l+1}$ to be large as compared with $\alpha_1,\ldots,\alpha_l$.
629: A natural way to compare $\alpha_{l+1}$ to the other $\alpha$s
630: is to look at the ratio
631: \begin{equation}\label{eq:p}
632: p_Y
633: :=
634: \frac
635: {
636: \left|
637: \{i=1,\ldots,l+1 \st \alpha_i\ge\alpha_{l+1}\}
638: \right|
639: }
640: {l+1},
641: \end{equation}
642: which we call the \emph{p-value} associated with the possible label $Y$ for $x_{l+1}$.
643: In words, the p-value is the proportion of the $\alpha$s
644: which are at least as large as the last $\alpha$.
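
For a small worked illustration
(the numbers are hypothetical and chosen only for concreteness),
take $l=4$ and suppose that a candidate label $Y$ leads to the strangeness values
\[
  (\alpha_1,\alpha_2,\alpha_3,\alpha_4,\alpha_5)
  =
  (0,\;0.3,\;0,\;0.1,\;0.2).
\]
The values that are at least as large as $\alpha_5=0.2$ are $\alpha_2$ and $\alpha_5$ itself,
so (\ref{eq:p}) gives
\[
  p_Y
  =
  \frac{2}{5}
  =
  0.4.
\]
Had the last value been $\alpha_5=0.5$,
only $\alpha_5$ itself would have been counted,
giving $p_Y=1/5$,
the smallest value that (\ref{eq:p}) can take when $l=4$.
In general $p_Y\ge1/(l+1)$,
so small p-values can only be obtained with a sufficiently large training set.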
645:
646: The methodology of support vector machines
647: (as described in \cite{vapnik:1995,vapnik:1998})
648: is directly applicable
only to binary classification problems,
650: but the general case can be reduced to the binary case
651: by the standard ``one-against-one'' or ``one-against-the-rest'' procedures.
652: This allows us to define the strangeness values $\alpha_1,\ldots,\alpha_{l+1}$
653: for general classification problems
654: (see \cite{vovk/etal:2005}, p.~59, for details),
655: which in turn determine the p-values (\ref{eq:p}).
656:
657: The function that assigns to each sequence (\ref{eq:completion})
658: the corresponding p-value, defined by (\ref{eq:p}),
659: is a randomness test
660: (this will follow from Theorem \ref{thm:on-line}
661: stated in Section \ref{sec:on-line} below).
662: Therefore, the p-values,
663: which are our approximations to the corresponding randomness levels,
664: can be used for hedged prediction
665: as described in the previous section.
666: For example, if the p-value $p_{-1}$ is small while $p_1$ is not small,
667: we can predict $1$ with confidence $1-p_{-1}$ and credibility $p_1$.
668: Typical credibility will be 1:
669: for most data sets the percentage of support vectors is small
670: (\cite{vapnik:1998}, Chapter 12),
671: and so we can expect $\alpha_{l+1}=0$ when $Y=y_{l+1}$.
672:
673: \begin{remark*}
674: When the order of examples is irrelevant,
675: we refer to the data set (\ref{eq:set}) as a set,
676: although as a mathematical object it is a multiset rather than a set
677: since it can contain several copies of the same example.
678: We will continue to use this informal terminology
679: (to be completely accurate,
we would have to say ``data multiset'' instead of ``data set''!).
681: \end{remark*}
682:
683: % [This in fact demonstrate the $\mathbf{X}$ is large,
684: % not that it is high-dimensional]
685: % Already this data set can be used to illustrate the high-dimensional character
686: % of many modern data sets.
687: % Each object (handwritten digit) is a $16\times16$ grey-scale matrix,
688: % with 31 shades of grey,
689: % so there are $31^{16 \times 16}$ (approximately $10^{381}$)
690: % possible objects.
691: % This greatly exceeds the number of objects in the USPS data set, which is 9298.
692:
693: % Several kernels are used.
694: % The results show that the method works well in predicting classifications;
695: % in addition, of course,
696: % the method also provides valid and practically useful confidence information,
697: % in sharp contrast with typical PAC error bounds
698: % (valid but not useful)
699: % and Bayesian methods
700: % (usually not valid).
701:
702: \ifJOURNAL
703: \begin{table*}
704: \processtable{Selected test examples from the USPS data set:
705: the p-values of digits (0--9), true and predicted labels,
706: and confidence and credibility values.\label{tab:examples}}
707: %\begingroup\tiny
708: {\footnotesize\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
709: \hline 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 &
710: \vbox{\hbox{\strut true}\hbox{\strut label}} &
711: \vbox{\hbox{\strut pre-}\hbox{\strut diction}} &
712: \vbox{\hbox{\strut confi-}\hbox{\strut dence}} &
713: \vbox{\hbox{\strut credi-}\hbox{\strut bility}}\\
714: \hline 0.01\% & 0.11\% & 0.01\% & 0.01\% & 0.07\% & 0.01\% & 100\% & 0.01\% & 0.01\% & 0.01\%
715: & 6 & 6 & 99.89\% & 100\%\\
716: \hline 0.32\% & 0.38\% & 1.07\% & 0.67\% & 1.43\% & 0.67\% & 0.38\% & 0.33\% & 0.73\% & 0.78\%
717: & 6 & 4 & 98.93\% & 1.43\%\\
718: \hline 0.01\% & 0.27\% & 0.03\% & 0.04\% & 0.18\% & 0.01\% & 0.04\% & 0.01\% & 0.12\% & 100\%
719: & 9 & 9 & 99.73\% & 100\%\\
720: %\hline 100\% & 0.03\% & 0.01\% & 0.01\% & 0.04\% & 0.01\% & 0.01\% & 0.01\% & 0.01\% & 0.01\%
721: % & 0 & 0 & 99.96\% & 100\%\\
722: %\hline 0.04\% & 0.30\% & 0.05\% & 0.38\% & 0.29\% & 0.01\% & 0.08\% & 0.07\% & 0.40\% & 0.22\%
723: % & 2 & 8 & 99.62\% & 0.40\%\\
724: %\hline 0.01\% & 0.22\% & 0.03\% & 0.55\% & 0.16\% & 0.04\% & 0.03\% & 0.01\% & 0.04\% & 0.05\%
725: % & 3 & 3 & 99.78\% & 0.55\%\\
726: %\hline 0.04\% & 0.32\% & 0.10\% & 2.06\% & 0.29\% & 2.98\% & 0.04\% & 0.07\% & 0.37\% & 0.34\%
727: % & 3 & 5 & 97.94\% & 2.98\%\\
728: %\hline 0.30\% & 0.49\% & 0.43\% & 0.36\% & 1.28\% & 0.51\% & 0.29\% & 0.21\% & 0.38\% & 1.19\%
729: % & 4 & 4 & 98.81\% & 1.28\%\\
730: %\hline 0.01\% & 0.04\% & 0.01\% & 0.01\% & 0.03\% & 0.01\% & 0.01\% & 0.01\% & 0.01\% & 100\%
731: % & 9 & 9 & 99.96\% & 100\%\\
732: %\hline 0.01\% & 0.32\% & 0.04\% & 0.01\% & 0.26\% & 100\% & 0.01\% & 0.05\% & 0.11\% & 0.18\%
733: % & 5 & 5 & 99.68\% & 100\%\\
734: %\hline 0.41\% & 0.44\% & 0.27\% & 2.07\% & 0.70\% & 1.87\% & 0.23\% & 0.29\% & 0.44\% & 0.80\%
735: % & 5 & 3 & 98.13\% & 2.07\%\\
736: \hline
737: \end{tabular}}{}
738: %\endgroup
739: \end{table*}
740: \fi
741:
742: \ifnotJOURNAL
743: \begin{table*}
744: \caption{Selected test examples from the USPS data set:
745: the p-values of digits (0--9), true and predicted labels,
746: and confidence and credibility values.\label{tab:examples}}
747:
748: \medskip
749:
750: {\tiny\hspace{-12mm}\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
751: \hline 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 &
752: \vbox{\hbox{\strut true}\hbox{\strut label}} &
753: \vbox{\hbox{\strut pre-}\hbox{\strut diction}} &
754: \vbox{\hbox{\strut confi-}\hbox{\strut dence}} &
755: \vbox{\hbox{\strut credi-}\hbox{\strut bility}}\\
756: \hline 0.01\% & 0.11\% & 0.01\% & 0.01\% & 0.07\% & 0.01\% & 100\% & 0.01\% & 0.01\% & 0.01\%
757: & 6 & 6 & 99.89\% & 100\%\\
758: \hline 0.32\% & 0.38\% & 1.07\% & 0.67\% & 1.43\% & 0.67\% & 0.38\% & 0.33\% & 0.73\% & 0.78\%
759: & 6 & 4 & 98.93\% & 1.43\%\\
760: \hline 0.01\% & 0.27\% & 0.03\% & 0.04\% & 0.18\% & 0.01\% & 0.04\% & 0.01\% & 0.12\% & 100\%
761: & 9 & 9 & 99.73\% & 100\%\\
762: %\hline 100\% & 0.03\% & 0.01\% & 0.01\% & 0.04\% & 0.01\% & 0.01\% & 0.01\% & 0.01\% & 0.01\%
763: % & 0 & 0 & 99.96\% & 100\%\\
764: %\hline 0.04\% & 0.30\% & 0.05\% & 0.38\% & 0.29\% & 0.01\% & 0.08\% & 0.07\% & 0.40\% & 0.22\%
765: % & 2 & 8 & 99.62\% & 0.40\%\\
766: %\hline 0.01\% & 0.22\% & 0.03\% & 0.55\% & 0.16\% & 0.04\% & 0.03\% & 0.01\% & 0.04\% & 0.05\%
767: % & 3 & 3 & 99.78\% & 0.55\%\\
768: %\hline 0.04\% & 0.32\% & 0.10\% & 2.06\% & 0.29\% & 2.98\% & 0.04\% & 0.07\% & 0.37\% & 0.34\%
769: % & 3 & 5 & 97.94\% & 2.98\%\\
770: %\hline 0.30\% & 0.49\% & 0.43\% & 0.36\% & 1.28\% & 0.51\% & 0.29\% & 0.21\% & 0.38\% & 1.19\%
771: % & 4 & 4 & 98.81\% & 1.28\%\\
772: %\hline 0.01\% & 0.04\% & 0.01\% & 0.01\% & 0.03\% & 0.01\% & 0.01\% & 0.01\% & 0.01\% & 100\%
773: % & 9 & 9 & 99.96\% & 100\%\\
774: %\hline 0.01\% & 0.32\% & 0.04\% & 0.01\% & 0.26\% & 100\% & 0.01\% & 0.05\% & 0.11\% & 0.18\%
775: % & 5 & 5 & 99.68\% & 100\%\\
776: %\hline 0.41\% & 0.44\% & 0.27\% & 2.07\% & 0.70\% & 1.87\% & 0.23\% & 0.29\% & 0.44\% & 0.80\%
777: % & 5 & 3 & 98.13\% & 2.07\%\\
778: \hline
779: \end{tabular}}{}
780: %\endgroup
781: \end{table*}
782: \fi
783:
784: Table~\ref{tab:examples} illustrates the results of hedged prediction
785: for a popular data set of hand-written digits
786: called the USPS data set \cite{lecun/etal:1990}.
The data set contains 9298 digits, each represented as a $16\times16$ matrix of pixels;
788: it is divided into a training set of size 7291 and a test set of size 2007.
789: For several test examples the table shows
790: the p-values for each possible label, the actual label,
791: the predicted label, confidence, and credibility,
792: computed using the support vector method with the polynomial kernel of degree 5.
793: To interpret the numbers in this table,
794: remember that high (i.e., close to 100\%) confidence
795: means that all labels except the predicted one are unlikely.
796: If, say, the first example were predicted wrongly,
797: this would mean that a rare event (of probability less than 1\%) had occurred;
798: therefore, we expect the prediction to be correct (which it is).
799: In the case of the second example,
800: confidence is also quite high (more than 95\%),
801: but we can see that the credibility is low (less than 5\%).
802: From the confidence we can conclude that the labels other than 4
803: are excluded at level 5\%,
804: but the label 4 itself is also excluded at the level 5\%.
805: This shows that the prediction algorithm
806: was unable to extract from the training set enough information
807: to allow us to confidently classify this example:
808: the strangeness of the labels different from 4 may be due
809: to the fact that the object itself is strange;
810: perhaps the test example is very different from all examples in the training set.
811: Unsurprisingly, the prediction for the second example is wrong.
812:
813: In general,
814: high confidence shows that all alternatives
815: to the predicted label are unlikely.
816: Low credibility means that the whole situation is suspect;
817: as we have already mentioned, we will obtain a very low credibility
818: if the new example is a letter (whereas all training examples are digits).
819: Credibility will also be low if the new example is a digit
820: written in an unusual way.
821: Notice that typically credibility will not be low
822: provided the data set was generated independently from the same distribution:
823: the probability that credibility
824: will not exceed some threshold $\epsilon$ (such as 1\%)
825: is at most $\epsilon$.
826: In summary,
827: we can trust a prediction if
828: (1) the confidence is close to 100\% and
829: (2) the credibility is not low (say, is not less than 5\%).
830: % Table~\ref{tab:examples} gives credibility values typical
831: % when using support vector machines
832: % for computing p-values:
833: % credibility is exactly 100\% on a few occasions.
834: % This happens because most of the $\alpha$'s computed
835: % by support vector machines are zero.
836: % For many other learning methods typical values of credibility
837: % are in the range 5\%--95\%.
838:
839: Many other prediction algorithms can be used as underlying algorithms
840: for hedged prediction.
841: For example, we can use the nearest neighbours technique to associate
842: \begin{equation}\label{eq:NN}
843: \alpha_i
844: :=
845: \frac
846: {\sum_{j=1}^k d_{ij}^+}
847: {\sum_{j=1}^k d_{ij}^-},
848: \quad
849: i=1,\ldots,n,
850: \end{equation}
851: with the elements $(x_i,y_i)$ of the set (\ref{eq:set}),
852: where $d_{ij}^+$ is the $j$th shortest distance from $x_i$
853: to other objects labelled in the same way as $x_i$,
854: and $d_{ij}^-$ is the $j$th shortest distance
855: from $x_i$ to the objects labelled differently from $x_i$;
856: the parameter $k\in\{1,2,\dots\}$ in~(\ref{eq:NN})
857: is the number of nearest neighbours taken into account.
858: The distances can be computed in a feature space
859: (that is, the distance between $x\in\mathbf{X}$ and $x'\in\mathbf{X}$
860: can be understood as $\left\|F(x)-F(x')\right\|$,
861: $F$ mapping the object space $\mathbf{X}$ into a feature, typically Hilbert, space),
862: and so (\ref{eq:NN}) can also be used with the kernel nearest neighbours.
863:
864: The intuition behind (\ref{eq:NN}) is as follows:
865: a typical object $x_i$ labelled by, say, $y$
866: will tend to be surrounded by other objects labelled by $y$;
867: and if this is the case, the corresponding $\alpha_i$ will be small.
In the atypical case where some objects labelled differently from $y$
are nearer to $x_i$ than the objects labelled $y$,
$\alpha_i$ will be larger.
871: Therefore, the $\alpha$s reflect the strangeness of examples.
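
As a purely illustrative calculation
(the distances below are hypothetical and are not taken from any experiment),
take $k=1$ in (\ref{eq:NN}).
If the nearest object labelled in the same way as $x_i$ is at distance $d_{i1}^+=2$
and the nearest object labelled differently is at distance $d_{i1}^-=6$,
then
\begin{equation*}
  \alpha_i
  =
  \frac{d_{i1}^+}{d_{i1}^-}
  =
  \frac{2}{6}
  \approx
  0.33,
\end{equation*}
a small score, as we would expect for a typical example.
If instead $d_{i1}^+=5$ and $d_{i1}^-=1$,
then $\alpha_i=5/1=5$,
a large score signalling a strange example.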
872:
873: The p-values computed by (\ref{eq:NN})
874: can again be used for hedged prediction.
875: % as described in Section \ref{sec:ideal}.
876: It is a general empirical fact that
877: the accuracy and reliability of the hedged predictions
878: are in line with the error rate of the underlying algorithm.
879: For example, in the case of the USPS data set,
880: the 1-nearest neighbour algorithm
881: (i.e., the one with $k=1$)
achieves an error rate of 2.2\%,
883: and the hedged predictions based on (\ref{eq:NN}) are highly confident
884: (achieve confidence of at least $99\%$)
885: for more than 95\% of the test examples.
886:
887: \subsection*{General Definition}
888:
889: The general notion of conformal predictor can be defined as follows.
890: A \emph{nonconformity measure} is a function that assigns
891: to every data sequence (\ref{eq:set}) a sequence of numbers
892: $\alpha_1,\ldots,\alpha_n$,
893: called \emph{nonconformity scores},
894: in such a way that interchanging any two examples $(x_i,y_i)$ and $(x_j,y_j)$
895: leads to the interchange of the corresponding nonconformity scores $\alpha_i$ and $\alpha_j$
896: (with all the other nonconformity scores unaffected).
897: The corresponding \emph{conformal predictor} maps each data set (\ref{eq:training-set}),
898: $l=0,1,\ldots$,
899: each new object $x_{l+1}$,
900: and each confidence level $1-\epsilon\in(0,1)$,
901: to the prediction set
902: \begin{equation}\label{eq:Gamma}
903: \Gamma^{\epsilon}
904: \left(
905: x_1,y_1,\ldots,x_{l},y_{l},x_{l+1}
906: \right)
907: :=
908: \left\{
909: Y\in\mathbf{Y}
910: \st
911: p_Y
912: >
913: \epsilon
914: \right\},
915: \end{equation}
916: where $p_Y$ are defined by (\ref{eq:p})
917: with $\alpha_1,\ldots,\alpha_{l+1}$ being the nonconformity scores
918: corresponding to the data sequence (\ref{eq:completion}).
919:
920: We have already remarked that associating with each completion (\ref{eq:completion})
921: the p-value (\ref{eq:p}) gives a randomness test;
922: this is true in general.
923: This implies that for each $l$ the probability of the event
924: \begin{equation*}
925: y_{l+1}
926: \in
927: \Gamma^{\epsilon}
928: \left(
929: x_1,y_1,\ldots,x_{l},y_{l},x_{l+1}
930: \right)
931: \end{equation*}
932: is at least $1-\epsilon$.
933:
934: This definition works for both classification and regression,
935: but in the case of classification we can summarize (\ref{eq:Gamma})
936: by two numbers:
937: the confidence
938: \begin{equation}\label{eq:conf}
939: \sup
940: \left\{
941: 1-\epsilon
942: \st
943: \left|
944: \Gamma^{\epsilon}
945: \right|
946: \le
947: 1
948: \right\}
949: \end{equation}
950: and the credibility
951: \begin{equation}\label{eq:cred}
952: \inf
953: \left\{
954: \epsilon
955: \st
956: \left|
957: \Gamma^{\epsilon}
958: \right|
959: =
960: 0
961: \right\}.
962: \end{equation}
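
As a check, (\ref{eq:conf}) and (\ref{eq:cred}) can be applied by hand
to the second row of Table~\ref{tab:examples}.
Since $\Gamma^{\epsilon}$ consists of the labels $Y$ with $p_Y>\epsilon$,
it contains at most one label as soon as $\epsilon$ reaches the second largest p-value,
and it becomes empty as soon as $\epsilon$ reaches the largest p-value.
In the second row the largest p-value is $1.43\%$ (label 4)
and the second largest is $1.07\%$ (label 2),
so
\begin{align*}
  \text{confidence} &= 1 - 1.07\% = 98.93\%,\\
  \text{credibility} &= 1.43\%,
\end{align*}
in agreement with the last two columns of the table.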
963:
964: \subsection*{Computationally Efficient Regression}
965:
966: As we have already mentioned,
967: the algorithms described so far
968: cannot be applied directly in the case of regression,
969: even if the randomness test is efficiently computable:
970: now we cannot consider all possible values $Y$ for $y_{l+1}$
971: since there are infinitely many of them.
972: However, there might still be computationally efficient
973: % (in the sense of required computational resources)
974: ways to find the prediction sets $\Gamma^{\epsilon}$.
975: The idea is that if $\alpha_i$ are defined as the residuals
976: \begin{equation}\label{eq:residual}
977: \alpha_i
978: :=
979: \left|
980: y_i - f_Y(x_i)
981: \right|
982: \end{equation}
983: where $f_Y:\mathbf{X}\to\bbbr$ is a regression function
984: fitted to the completed data set~(\ref{eq:completion}),
985: then $\alpha_i$ may have a simple expression in terms of $Y$,
986: leading to an efficient way of computing the prediction sets
987: (via (\ref{eq:p}) and (\ref{eq:Gamma})).
988: This idea was implemented in \cite{nouretdinov/etal:2001rr}
989: in the case where $f_Y$ is found from the ridge regression,
990: or kernel ridge regression, procedure,
991: with the resulting algorithm of hedged prediction
992: called the \emph{ridge regression confidence machine}.
993: For a much fuller description of the ridge regression confidence machine
994: (and its modifications in the case where (\ref{eq:residual})
995: are replaced by the fancier ``deleted'' or ``studentized'' residuals)
996: see \cite{vovk/etal:2005}, Section 2.3.
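
The following sketch indicates why such a simple expression is available for ridge regression;
it is only an outline under stated assumptions
(the objects are vectors in $\bbbr^p$,
and the symbols $X$, $H$, $I_m$, $u_i$, and $v_i$ are introduced here purely for illustration),
and the cited references should be consulted for the actual algorithm.
Let $X$ be the $(l+1)\times p$ matrix whose rows are $x_1,\ldots,x_{l+1}$,
let $a>0$ be the ridge parameter,
and let $H:=X(X'X+aI_p)^{-1}X'$,
where $X'$ is the transpose of $X$ and $I_m$ is the unit $m\times m$ matrix.
The vector of ridge regression predictions for the completed data set
is $H(y_1,\ldots,y_l,Y)'$,
which is linear in $Y$,
and so the residuals (\ref{eq:residual}) take the form
\begin{equation*}
  \alpha_i
  =
  \left|
    u_i + v_i Y
  \right|,
  \quad
  i=1,\ldots,l+1,
\end{equation*}
where $u_i$ and $v_i$ are the $i$th components of the vectors
$(I_{l+1}-H)(y_1,\ldots,y_l,0)'$ and $(I_{l+1}-H)(0,\ldots,0,1)'$, respectively.
Each comparison $\left|u_i+v_iY\right|\ge\left|u_{l+1}+v_{l+1}Y\right|$
is a polynomial inequality of degree at most two in $Y$,
so, unless it holds or fails for all $Y$,
it changes its truth value at no more than two points.
If the p-value depends on the nonconformity scores only through such comparisons,
it is therefore a piecewise constant function of $Y$ with finitely many breakpoints,
and the prediction set can be found by examining these breakpoints.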
997:
998: \section{Bayesian Approach to Conformal Prediction}
999: \label{sec:Bayesian}
1000:
1001: Bayesian methods have become very popular in both machine learning and statistics
1002: thanks to their power and versatility,
1003: and in this section we will see
1004: how Bayesian ideas can be used for designing efficient conformal predictors.
1005: We will only describe results of computer experiments
1006: (following \cite{melluish/etal:2001})
1007: with artificial data sets,
1008: since for real-world data sets there is no way
1009: to make sure that the Bayesian assumption is satisfied.
1010:
1011: Suppose $\mathbf{X}=\bbbr^p$
1012: (each object is a vector of $p$ real-valued attributes)
1013: and our model of the data-generating mechanism is
1014: \begin{equation}\label{eq:model}
1015: y_i
1016: =
1017: w\cdot x_i
1018: +
1019: \xi_i,
1020: \quad
1021: i=1,2,\ldots,
1022: \end{equation}
1023: where $\xi_i$ are independent standard Gaussian random variables
1024: % (we use the notation $N(\mu,\sigma^2)$ for the Gaussian distribution
1025: % with mean $\mu$ and variance $\sigma^2$)
1026: and the weight vector $w\in\bbbr^p$ is distributed as $N(0,(1/a)I_p)$
1027: (we use the notation $I_p$ for the unit $p\times p$ matrix
1028: and $N(0,A)$ for the $p$-dimensional Gaussian distribution
1029: with covariance matrix $A$);
1030: $a$ is a positive constant.
1031: % which we believe to be $1$.
1032: The actual data-generating mechanism used in our experiments
1033: will correspond to this model with $a$ set to 1.
1034:
1035: Under the model (\ref{eq:model}) the best (in the mean-square sense) fit
1036: to a data set (\ref{eq:set})
1037: is provided by the ridge regression procedure with parameter $a$
1038: (for details, see, e.g., \cite{vovk/etal:2005}, Section 10.3).
1039: Using the residuals (\ref{eq:residual}) with $f_Y$
1040: found by ridge regression with parameter $a$
1041: leads to an efficient conformal predictor
1042: which will be referred to as the ridge regression confidence machine with parameter $a$.
1043: Each prediction set output by the ridge regression confidence machine
1044: will be replaced by its convex hull,
1045: the corresponding \emph{prediction interval}.
1046:
1047: To test the validity and efficiency of the ridge regression confidence machine
1048: the following procedure was used.
1049: Ten times a vector $w\in\bbbr^5$ was independently generated from the distribution $N(0,I_5)$.
1050: For each of the 10 values of $w$,
1051: 100 training objects and 100 test objects
1052: were independently generated from the uniform distribution on $[-10,10]^5$
1053: and for each object $x$ its label $y$ was generated as $w\cdot x+\xi$,
1054: with all the $\xi$ standard Gaussian and independent.
1055: For each of the 1000 test objects and each confidence level $1-\epsilon$
1056: the prediction set $\Gamma^{\epsilon}$ for its label
1057: was found from the corresponding training set
1058: using the ridge regression confidence machine with parameter $a=1$.
1059: The solid line in Figure~\ref{fig:rrcm-errors} shows the confidence level
1060: against the percentage of test examples whose labels
1061: were not covered by the corresponding prediction intervals at that confidence level.
1062: Since conformal predictors are always valid,
1063: the percentage outside the prediction interval
should never exceed 100\% minus the confidence level,
1065: up to statistical fluctuations,
1066: and this is confirmed by the picture.
1067:
1068: \begin{figure}
1069: \centering
1070: \makebox{\includegraphics[width=\picturewidth,clip=true]{rrcm_errors.eps}}
1071: \caption{\label{fig:rrcm-errors}Validity for the ridge regression confidence machine.}
1072: \end{figure}
1073:
1074: A natural measure of efficiency of confidence predictors
1075: is the mean width of their prediction intervals,
1076: at different confidence levels:
the narrower the prediction intervals an algorithm produces, the more efficient it is.
1078: The solid line in Figure~\ref{fig:rrcm-widths} shows
1079: the confidence level against the mean
1080: (over all test examples)
1081: width of the prediction intervals at that confidence level.
1082:
1083: \begin{figure}
1084: \centering
1085: \makebox{\includegraphics[width=\picturewidth,clip=true]{rrcm_widths.eps}}
1086: \caption{\label{fig:rrcm-widths}Efficiency for the ridge regression confidence machine.}
1087: \end{figure}
1088:
1089: Since we know the data-generating mechanism,
1090: the approach via conformal prediction appears somewhat roundabout:
1091: for each test object we could instead find
1092: the conditional probability distribution of its label,
1093: which is Gaussian,
1094: and output as the prediction set $\Gamma^{\epsilon}$
1095: the shortest
1096: (i.e., centred at the mean of the conditional distribution)
1097: interval of conditional probability $1-\epsilon$.
1098: Figures \ref{fig:Bayes-errors} and \ref{fig:Bayes-widths}
1099: are the analogues of Figures \ref{fig:rrcm-errors} and \ref{fig:rrcm-widths}
1100: for this \emph{Bayes-optimal confidence predictor}.
1101: The solid line in Figure \ref{fig:Bayes-errors}
1102: demonstrates the validity of the Bayes-optimal confidence predictor.
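
For concreteness, here is a sketch of the standard conjugate Gaussian calculation behind this predictor
(the notation $X$, $y$, $\hat w$, $s$, and $z_{\epsilon/2}$ is introduced only for this illustration,
and the formulas are not quoted from \cite{melluish/etal:2001}).
Let $X$ be the $l\times p$ matrix whose rows are the training objects $x_1,\ldots,x_l$,
and let $y$ be the column vector of their labels.
Under the model (\ref{eq:model}),
the conditional distribution of $w$ given the training set is Gaussian
with mean
\begin{equation*}
  \hat w
  :=
  (X'X + aI_p)^{-1} X'y
\end{equation*}
(the ridge regression estimate with parameter $a$)
and covariance matrix $(X'X+aI_p)^{-1}$,
where $X'$ stands for the transpose of $X$.
Therefore, the label of a new object $x$ (regarded as a column vector)
is conditionally Gaussian
with mean $\hat w\cdot x$ and variance $s^2:=1+x'(X'X+aI_p)^{-1}x$,
and the Bayes-optimal prediction interval at confidence level $1-\epsilon$ is
\begin{equation*}
  \left[
    \hat w\cdot x - z_{\epsilon/2}\,s,
    \;
    \hat w\cdot x + z_{\epsilon/2}\,s
  \right],
\end{equation*}
where $z_{\epsilon/2}$ is the upper $\epsilon/2$-quantile of the standard Gaussian distribution
(for example, $z_{\epsilon/2}\approx1.96$ for $\epsilon=5\%$).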
1103:
1104: \begin{figure}
1105: \centering
1106: \makebox{\includegraphics[width=\picturewidth,clip=true]{bayes_errors.eps}}
1107: \caption{\label{fig:Bayes-errors}Validity for the Bayes-optimal confidence predictor.}
1108: \end{figure}
1109:
1110: \begin{figure}
1111: \centering
1112: \makebox{\includegraphics[width=\picturewidth,clip=true]{bayes_widths.eps}}
1113: \caption{\label{fig:Bayes-widths}Efficiency for the Bayes-optimal confidence predictor.}
1114: \end{figure}
1115:
1116: What is interesting is that the solid lines
1117: in Figures~\ref{fig:Bayes-widths} and \ref{fig:rrcm-widths}
1118: look exactly the same,
1119: taking account of the different scales of the vertical axes.
1120: The ridge regression confidence machine
1121: appears as good as the Bayes-optimal predictor.
1122: (This is a general phenomenon;
1123: it is also illustrated, in the case of classification,
1124: by the construction in Section 3.3 of \cite{vovk/etal:2005}
1125: of a conformal predictor that is asymptotically
1126: as good as the Bayes-optimal confidence predictor.)
1127:
1128: The similarity between the two algorithms disappears
1129: when they are given wrong values for $a$.
1130: For example,
1131: let us see what happens if we tell the algorithms
1132: that the expected value of $\|w\|$ is just $1\%$ of what it really is
1133: (this corresponds to taking $a=10000$).
1134: The ridge regression confidence machine stays valid
1135: (see the dashed line in Figure \ref{fig:rrcm-errors}),
1136: but its efficiency deteriorates
1137: (the dashed line in Figure \ref{fig:rrcm-widths}).
1138: The efficiency of the Bayes-optimal confidence predictor
1139: (the dashed line in Figure \ref{fig:Bayes-widths})
1140: is hardly affected,
1141: but its predictions become invalid
1142: (the dashed line in Figure \ref{fig:Bayes-errors}
1143: deviates significantly from the diagonal,
1144: especially for the most important large confidence levels:
1145: e.g., only about 15\% of labels fall within the 90\% prediction sets).
1146: The worst that can happen to the ridge regression confidence machine
1147: is that its predictions will become useless
1148: (but at least harmless),
1149: whereas the Bayes-optimal predictions can become misleading.
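
The correspondence between the factor $1\%$ and the value $a=10000$ can be checked directly.
If $w$ is distributed as $N(0,(1/a)I_p)$,
then $w$ has the same distribution as $a^{-1/2}\zeta$
with $\zeta$ distributed as $N(0,I_p)$
(the auxiliary vector $\zeta$ is introduced only for this check),
and so
\begin{equation*}
  \Expect\|w\|
  =
  a^{-1/2}
  \Expect\|\zeta\|.
\end{equation*}
The data were generated with $a=1$,
so telling the algorithms that $a=10000$
multiplies the expected value of $\|w\|$ by $10000^{-1/2}=0.01$,
i.e., reduces it to $1\%$ of its true value.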
1150:
1151: Figures \ref{fig:rrcm-errors}--\ref{fig:Bayes-widths} also show the graphs
1152: for the intermediate value $a=1000$.
1153: Similar results but for different data sets
1154: are also given in \cite{vovk/etal:2005}, Section 10.3.
1155: A general scheme of Bayes-type conformal prediction
1156: is described in \cite{vovk/etal:2005}, pp.~102--103.
1157:
1158: \iffalse
1159: \begin{figure}
1160: \centering
1161: \makebox{\includegraphics[width=0.25\picturewidth,clip=true]{autompg_errors.eps}}
1162: \makebox{\includegraphics[width=0.25\picturewidth,clip=true]{autompg_widths.eps}}
1163: \makebox{\includegraphics[width=0.25\picturewidth,clip=true]{boston_errors.eps}}
1164: \makebox{\includegraphics[width=0.25\picturewidth,clip=true]{boston_widths.eps}}
1165: \caption{\label{fig:benchmarks}Bayesian RR and RRCM applied to Auto mpg and Boston housing benchmarks.}
1166: \end{figure}
1167:
1168: Figure~\ref{fig:benchmarks} extends these results
1169: to two benchmark data sets taken from the UCI machine learning repository,
1170: the auto-mpg data set and the Boston housing data set.
1171: For the benchmark data sets,
1172: the training and test examples were randomly drawn from the set of all data points.
1173: The ridge coefficient $a$ in Figure \ref{fig:benchmarks}
1174: is chosen so that a reasonable mean square error is obtained.
1175: The top graphs in the figure show
1176: that Bayesian Ridge Regression is overconfident on the auto-mpg dataset,
1177: predicting tolerance regions that are too narrow.
1178: The RRCM predicts valid tolerance regions,
1179: and the top right graph shows that to do so
1180: it gives wider tolerance regions than Bayesian Ridge Regression.
1181: On the Boston housing data set,
1182: Bayesian Ridge Regression is too conservative.
1183: The bottom left graph shows that its predicted tolerance regions are always valid;
1184: however, it also shows that they are much wider than those given by the RRCM.
1185: As the RRCM's tolerance regions are also valid,
1186: we prefer the more accurate RRCM's predictions.
1187:
1188: \textbf{These results probably do not make much sense
1189: since \cite{melluish/etal:2001} assumes the standard deviation $\sigma$ of $\xi_i$ known:
1190: $\sigma=1$.
1191: This assumption alone might lead to the gross inadequacies of the Bayesian method
1192: that show in Figure \ref{fig:benchmarks}.}
1193: \fi
1194:
\section{On-line Prediction}
1196: \label{sec:on-line}
1197:
1198: % Properties in the on-line framework
1199:
1200: We know from Section \ref{sec:conformal}
1201: that conformal predictors are valid in the sense that the probability of error
1202: \begin{equation}\label{eq:error}
1203: y_{l+1}
1204: \notin
1205: \Gamma^{\epsilon}
1206: \left(
1207: x_1,y_1,
\ldots,
1209: x_l,y_l,
1210: x_{l+1}
1211: \right)
1212: \end{equation}
1213: at confidence level $1-\epsilon$
1214: never exceeds $\epsilon$.
1215: The word ``probability'' means ``unconditional probability'' here:
1216: the frequentist meaning of the statement that the probability of (\ref{eq:error})
1217: does not exceed $\epsilon$
1218: is that,
1219: if we repeatedly generate many sequences
1220: \begin{equation*}
1221: x_1,y_1,\ldots,x_l,y_l,x_{l+1},y_{l+1},
1222: \end{equation*}
1223: the fraction of them satisfying (\ref{eq:error})
1224: will be at most $\epsilon$,
1225: to within statistical fluctuations.
1226: To say that we are controlling the number of errors
1227: would be an exaggeration
1228: because of the artificial character of this scheme
1229: of repeatedly generating a new training set and a new test example.
1230: Can we say that the confidence level $1-\epsilon$
1231: translates into a bound on the number of mistakes
1232: for a natural learning protocol?
1233: In this section we show that the answer is ``yes''
1234: for the popular on-line learning protocol,
1235: and in the next section we will see to what degree
1236: this carries over to other protocols.
1237:
1238: In on-line learning the examples are presented one by one.
1239: Each time, we observe the object and predict its label.
1240: Then we observe the label and go on to the next example.
1241: We start by observing the first object $x_1$ and predicting its label $y_1$.
1242: Then we observe $y_1$ and the second object $x_2$, and predict its label $y_2$.
1243: And so on.
1244: At the $n$th step,
1245: we have observed the previous examples
1246: $ %\begin{equation*}
1247: (x_1,y_1),\dots,(x_{n-1},y_{n-1})
1248: $ %\end{equation*}
1249: and the new object $x_n$, and our task is to predict $y_n$.
1250: The quality of our predictions should improve
1251: as we accumulate more and more old examples.
1252: This is the sense in which we are learning.
1253:
1254: Our prediction for $y_n$ is a nested family of prediction sets
1255: $\Gamma_n^{\epsilon}\subseteq\mathbf{Y}$,
1256: $\epsilon\in(0,1)$.
1257: The process of prediction can be summarized by the following protocol:
1258:
1259: \medskip
1260:
1261: \noindent\textsc{On-line prediction protocol}
1262: \ifJOURNAL
1263: \newcommand{\Indent}{\quad}
1264: \fi
1265: \ifnotJOURNAL
1266: \newcommand{\Indent}{\quad\enspace}
1267:
1268: \smallskip
1269:
1270: \fi
1271:
1272: \noindent
\Indent$\Err_0^{\epsilon}:=0,
\quad
\epsilon\in(0,1)$;

\noindent
\Indent$\Mult_0^{\epsilon}:=0,
\quad
\epsilon\in(0,1)$;

\noindent
\Indent$\Emp_0^{\epsilon}:=0,
\quad
\epsilon\in(0,1)$;
1280:
1281: \noindent
1282: \Indent FOR $n=1,2,\ldots$:
1283:
1284: \noindent
1285: \Indent\Indent Reality outputs $x_n\in\mathbf{X}$;
1286:
1287: \noindent
1288: \Indent\Indent Predictor outputs $\Gamma_n^{\epsilon}\subseteq\mathbf{Y}$ for all $\epsilon\in(0,1)$;
1289:
1290: \noindent
1291: \Indent\Indent Reality outputs $y_n\in\mathbf{Y}$;
1292:
1293: \noindent
1294: \Indent\Indent$\err_n^{\epsilon}
1295: :=
1296: \left\{
1297: \begin{array}{ll}
1298: 1 & \text{if $y_n \notin \Gamma_n^{\epsilon}$}\\
1299: 0 & \text{otherwise},
1300: \end{array}
1301: \right.
1302: \quad
1303: \epsilon\in(0,1)$;
1304:
1305: \noindent
1306: \Indent\Indent\strut$\Err_n^{\epsilon}:=\Err^{\epsilon}_{n-1}+\err_n^{\epsilon},
1307: \quad
1308: \epsilon\in(0,1)$;
1309:
1310: \noindent
1311: \Indent\Indent$\mult_n^{\epsilon}
1312: :=
1313: \left\{
1314: \begin{array}{ll}
1315: 1 & \text{if $\left|\Gamma_n^{\epsilon}\right|>1$}\\
1316: 0 & \text{otherwise},
1317: \end{array}
1318: \right.
1319: \quad
1320: \epsilon\in(0,1)$;
1321:
1322: \noindent
1323: \Indent\Indent\strut$\Mult_n^{\epsilon}:=\Mult_{n-1}^{\epsilon}+\mult_n^{\epsilon},
1324: \quad
1325: \epsilon\in(0,1)$;
1326:
1327: \noindent
1328: \Indent\Indent$\emp_n^{\epsilon}
1329: :=
1330: \left\{
1331: \begin{array}{ll}
1332: 1 & \text{if $\left|\Gamma_n^{\epsilon}\right|=0$}\\
1333: 0 & \text{otherwise},
1334: \end{array}
1335: \right.
1336: \quad
1337: \epsilon\in(0,1)$;
1338:
1339: \noindent
\Indent\Indent\strut$\Emp_n^{\epsilon}:=\Emp_{n-1}^{\epsilon}+\emp_n^{\epsilon},
1341: \quad
1342: \epsilon\in(0,1)$
1343:
1344: \noindent
1345: \Indent END FOR.
1346:
1347: \medskip
1348:
1349: \noindent
1350: As we said, the family $\Gamma_n^{\epsilon}$
1351: is assumed nested:
1352: $\Gamma_n^{\epsilon_1}\subseteq\Gamma_n^{\epsilon_2}$ when $\epsilon_1\ge\epsilon_2$.
1353: In this protocol we also record the cumulative numbers
1354: $\Err_n^{\epsilon}$ of erroneous prediction sets,
1355: $\Mult_n^{\epsilon}$ of \emph{multiple} prediction sets
1356: (i.e., prediction sets containing more than one label)
1357: and $\Emp_n^{\epsilon}$ of empty prediction sets
1358: at each confidence level $1-\epsilon$.
1359: We will discuss the significance of each of these numbers in turn.
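
To make the bookkeeping concrete,
here is a hypothetical run of the first three steps of the protocol
for one fixed $\epsilon$
(the prediction sets and labels are invented for illustration,
with $\mathbf{Y}$ taken to be the set of digits):

\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline
$n$ & $\Gamma_n^{\epsilon}$ & $y_n$
& $\err_n^{\epsilon}$ & $\Err_n^{\epsilon}$
& $\mult_n^{\epsilon}$ & $\Mult_n^{\epsilon}$
& $\emp_n^{\epsilon}$ & $\Emp_n^{\epsilon}$\\
\hline
1 & $\{3,5\}$ & 5 & 0 & 0 & 1 & 1 & 0 & 0\\
\hline
2 & $\{7\}$ & 1 & 1 & 1 & 0 & 1 & 0 & 0\\
\hline
3 & $\emptyset$ & 2 & 1 & 2 & 0 & 1 & 1 & 1\\
\hline
\end{tabular}
\end{center}

\noindent
At the first step the prediction set is multiple but contains the true label;
at the second it is a singleton that misses the true label;
at the third it is empty.
An empty prediction set cannot contain the true label,
so every empty prediction is also an error
and $\Emp_n^{\epsilon}\le\Err_n^{\epsilon}$ for all $n$.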
1360:
1361: The number of erroneous predictions is a measure of validity of our confidence predictors:
1362: we would like to have $\Err_n^{\epsilon}\le\epsilon n$,
1363: up to statistical fluctuations.
1364: In Figure~\ref{fig:CP0err} we can see the lines $n\mapsto\Err_n^{\epsilon}$
1365: for one particular conformal predictor
1366: and for three confidence levels $1-\epsilon$:
1367: the solid line for 99\%, the dash-dot line for 95\%, and the dotted line for 80\%.
1368: The number of errors made grows linearly,
1369: and the slope is approximately
1370: 20\% for the confidence level 80\%,
1371: 5\% for the confidence level 95\%,
1372: and 1\% for the confidence level 99\%.
1373: We will see below that this is not accidental.
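
In concrete terms
(this is simple arithmetic on the target $\epsilon n$,
not a set of values read off the figure),
for the $n=9298$ examples of the USPS data set we have
\begin{equation*}
  0.20\times9298\approx1860,
  \quad
  0.05\times9298\approx465,
  \quad
  0.01\times9298\approx93
\end{equation*}
for the confidence levels 80\%, 95\%, and 99\%, respectively;
these are the numbers of errors to which slopes of 20\%, 5\%, and 1\% lead
by the end of the data set.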
1374:
1375: \begin{figure}
1376: \centering
1377: \makebox{\includegraphics[width=\picturewidth]{CP0err.eps}}
1378: \caption{\label{fig:CP0err}Cumulative numbers of errors for a conformal predictor
1379: (the 1-nearest neighbour conformal predictor)
1380: run in the on-line mode on the USPS data set
1381: (9298 hand-written digits, randomly permuted)
1382: at the confidence levels 80\%, 95\% and 99\%.}
1383: \end{figure}
1384:
The number of multiple predictions $\Mult_n^{\epsilon}$
1386: is a useful measure of efficiency in the case of classification:
1387: we would like as many as possible of our predictions to be singletons.
1388: Figure \ref{fig:TCM975} shows the cumulative numbers of errors
1389: $n\mapsto\Err_n^{2.5\%}$ (solid line)
1390: and multiple predictions
1391: $n\mapsto\Mult_n^{2.5\%}$ (dotted line)
1392: at the fixed confidence level 97.5\%.
1393: We can see that out of approximately 10,000 predictions
1394: about 250 (approximately 2.5\%) were errors
1395: and about 300 (approximately 3\%) were multiple predictions.
1396:
1397: \begin{figure}
1398: \centering
1399: \makebox{\includegraphics[width=\picturewidth]{TCM0_975bwF.eps}}
1400: \caption{\label{fig:TCM975}The on-line performance of the 1-nearest neighbour conformal predictor
1401: at the confidence level 97.5\% on the USPS data set (randomly permuted).}
1402: \end{figure}
1403:
1404: We can see that by choosing $\epsilon$ we are able to control the number of errors.
1405: For small $\epsilon$
1406: (relative to the difficulty of the data set)
this may sometimes force the predictor to give
multiple predictions.
1409: On the other hand,
1410: for larger $\epsilon$ this might lead to empty predictions at some steps,
1411: as can be seen from the bottom right corner of Figure \ref{fig:TCM975}:
1412: when the predictor ceases to make multiple predictions
1413: it starts making occasional empty predictions
1414: (the dash-dot line).
1415: An empty prediction is a warning that the object to be predicted is unusual
1416: (the credibility, as defined in Section \ref{sec:ideal}, is $\epsilon$ or less).
1417:
1418: It would be a mistake to concentrate exclusively on one confidence level $1-\epsilon$.
1419: If the prediction $\Gamma_n^{\epsilon}$ is empty,
1420: this does not mean that we cannot make any prediction at all:
1421: we should just shift our attention to other confidence levels
1422: (perhaps look at the range of $\epsilon$ for which $\Gamma_n^{\epsilon}$ is a singleton).
1423: Likewise, $\Gamma_n^{\epsilon}$ being multiple
1424: does not mean that all labels in $\Gamma_n^{\epsilon}$ are equally likely:
1425: slightly increasing $\epsilon$ might lead to the removal of some labels.
1426: Of course,
taking in the continuum of prediction sets, for all $\epsilon\in(0,1)$,
1428: might be too difficult or tiresome for a human mind,
1429: and concentrating on a few conventional levels,
1430: as in Figure \ref{fig:predset},
1431: might be a reasonable compromise.
1432:
1433: \ifJOURNAL
1434: \begin{table*}
1435: \processtable{A selected test example from a data set of hospital records of patients
1436: who suffered acute abdominal pain \cite{gammerman/thatcher:1992}:
1437: the p-values for the nine possible diagnostic groups
1438: (appendicitis APP, diverticulitis DIV, perforated peptic ulcer PPU,
1439: non-specific abdominal pain NAP, cholecystitis CHO, intestinal obstruction INO,
1440: pancreatitis PAN, renal colic RCO, dyspepsia DYS)
1441: and the true label.\label{tab:abdominal}}
1442: %\begingroup\tiny
1443: {\footnotesize\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
1444: \hline APP & DIV & PPU & NAP & CHO & INO & PAN & RCO & DYS & true label\\
1445: \hline 1.23\% & 0.36\% & 0.16\% & 2.83\% & 5.72\% & 0.89\% & 1.37\% & 0.48\% & 80.56\% & DYS\\
1446: \hline
1447: \end{tabular}}{}
1448: %\endgroup
1449: \end{table*}
1450: \fi
1451:
1452: \ifnotJOURNAL
1453: \begin{table*}
1454: \caption{A selected test example from a data set of hospital records of patients
1455: who suffered acute abdominal pain \cite{gammerman/thatcher:1992}:
1456: the p-values for the nine possible diagnostic groups
1457: (appendicitis APP, diverticulitis DIV, perforated peptic ulcer PPU,
1458: non-specific abdominal pain NAP, cholecystitis CHO, intestinal obstruction INO,
1459: pancreatitis PAN, renal colic RCO, dyspepsia DYS)
1460: and the true label.\label{tab:abdominal}}
1461:
1462: \medskip
1463:
1464: {\footnotesize\hspace{-2mm}\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
1465: \hline APP & DIV & PPU & NAP & CHO & INO & PAN & RCO & DYS & true label\\
1466: \hline 1.23\% & 0.36\% & 0.16\% & 2.83\% & 5.72\% & 0.89\% & 1.37\% & 0.48\% & 80.56\% & DYS\\
1467: \hline
1468: \end{tabular}}{}
1469: %\endgroup
1470: \end{table*}
1471: \fi
1472:
1473: % Typical output: example 5 (correctly predicted)
1474: %
1475: % Real class = Dyspepsia (8) [starting from 0 rather than 1, as in the paper]
1476: % Predicted class = Dyspepsia (8)
1477: %
1478: % p-values for each class:
1479: % Class 0: Appendicitis = 0.012306289881494986
1480: % Class 1: Diverticulitis = 0.0036463081130355514
1481: % Class 2: Perforated peptic ulcer = 0.0015952597994530538
1482: % Class 3: Non-specific abdominal pain = 0.028258887876025523
1483: % Class 4: Cholecystitis = 0.057201458523245215
1484: % Class 5: Intestinal obstruction = 0.008887876025524157
1485: % Class 6: Pancreatitis = 0.013673655423883319
1486: % Class 7: Renal colic = 0.004785779398359161
1487: % Class 8: Dyspepsia = 0.8056061987237921
1488:
1489: For example, Table \ref{tab:abdominal}
1490: gives the p-values for different kinds of abdominal pain
1491: obtained for a specific patient based on his symptoms.
1492: % check his sex with Sasha!
1493: We can see that at the confidence level 95\% the prediction set
1494: is multiple,
1495: $\{$cholecystitis, dyspepsia$\}$.
1496: When we relax the confidence level to 90\%,
1497: the prediction set narrows down to $\{$dyspepsia$\}$
1498: (the singleton containing only the true label);
1499: on the other hand,
1500: at the confidence level 99\% the prediction set widens to
1501: $\{$appendicitis, non-specific abdominal pain, cholecystitis, pancreatitis, dyspepsia$\}$.
1502: Such detailed confidence information,
1503: in combination with the property of validity,
1504: is especially valuable in medicine
1505: (and some of the first applications of conformal predictors
1506: have been to the fields of medicine and bioinformatics:
1507: see, e.g., \cite{bellotti/etal:2005,shahmuradov/etal:2005}).
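
As a worked illustration, the three prediction sets can be read off Table \ref{tab:abdominal}
under the assumption that, as in (\ref{eq:Gamma}),
the prediction set at the confidence level $1-\epsilon$
consists of the labels whose p-values exceed $\epsilon$:
\begin{align*}
  \epsilon=10\%: &\quad \Gamma^{\epsilon}=\{\text{DYS}\}
    && \text{(only $80.56\%$ exceeds $10\%$)},\\
  \epsilon=5\%: &\quad \Gamma^{\epsilon}=\{\text{CHO},\text{DYS}\}
    && \text{($5.72\%$ and $80.56\%$ exceed $5\%$)},\\
  \epsilon=1\%: &\quad \Gamma^{\epsilon}=\{\text{APP},\text{NAP},\text{CHO},\text{PAN},\text{DYS}\}
    && \text{(DIV, PPU, INO, RCO are below $1\%$)}.
\end{align*}
Under the same assumption the prediction set is a singleton
exactly for $5.72\%\le\epsilon<80.56\%$
and becomes empty only for $\epsilon\ge80.56\%$.
If confidence and credibility are read off as one minus the second largest p-value
and the largest p-value, respectively,
this example has confidence $94.28\%$ and credibility $80.56\%$.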
1508:
1509: In the case of regression,
1510: we will usually have $\Mult_n^{\epsilon}=n$ and $\Emp_n^{\epsilon}=0$,
1511: and so these are not useful measures of efficiency.
1512: Better measures,
1513: such as the ones used in the previous section,
1514: would, e.g., take into account the widths of the prediction intervals.
1515:
1516: \subsection*{Theoretical Analysis}
1517:
1518: Looking at Figures \ref{fig:CP0err} and \ref{fig:TCM975}
1519: we might be tempted to guess that the probability of error
1520: at each step of the on-line protocol
1521: is $\epsilon$
1522: and that errors are made independently at different steps.
1523: This is not literally true,
1524: as a closer examination of the bottom left corner of Figure \ref{fig:TCM975} reveals.
1525: It, however, becomes true
1526: (as noticed in \cite{vovk:2002})
1527: if the p-values (\ref{eq:p}) are redefined as
1528: \begin{equation}\label{eq:p-smoothed}
1529: p_Y
1530: :=
1531: \frac
1532: {
1533: \left|
1534: \{i \st \alpha_i>\alpha_{l+1}\}
1535: \right|
1536: +
1537: \eta
1538: \left|
1539: \{i \st \alpha_i=\alpha_{l+1}\}
1540: \right|
1541: }
1542: {l+1},
1543: \end{equation}
1544: where $i$ ranges over $\{1,\ldots,l+1\}$
1545: and $\eta\in[0,1]$ is generated randomly from the uniform distribution on $[0,1]$
(the $\eta$s should be independent of each other and of everything else;
1547: in practice they are produced by pseudo-random number generators).
1548: The only difference between (\ref{eq:p}) and (\ref{eq:p-smoothed})
1549: is that the expression (\ref{eq:p-smoothed}) takes more care in breaking the ties
1550: $\alpha_i=\alpha_{l+1}$.
1551: Replacing (\ref{eq:p}) by (\ref{eq:p-smoothed})
in the definition of a conformal predictor
1553: we obtain the notion of \emph{smoothed conformal predictor}.
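
To see the effect of the tie-breaking on a small example
(the numbers are made up purely for illustration),
take $l+1=5$ and nonconformity scores
$(\alpha_1,\ldots,\alpha_5)=(0.2,0.7,0.5,0.5,0.5)$.
One score is strictly greater than $\alpha_5$
and three scores (including $\alpha_5$ itself) are equal to it,
so (\ref{eq:p-smoothed}) gives
\begin{equation*}
  p_Y
  =
  \frac{1+3\eta}{5}
  \in
  [0.2,0.8],
\end{equation*}
for instance $p_Y=0.5$ when $\eta=0.5$.
Assuming that (\ref{eq:p}) counts all $i$ with $\alpha_i\ge\alpha_{l+1}$,
the unsmoothed p-value is $4/5=0.8$,
which is the value of the smoothed p-value at $\eta=1$.
In general, since $\eta\le1$,
the smoothed p-value never exceeds the unsmoothed one.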
1554:
1555: The validity property for smoothed conformal predictors can now be stated as follows.
1556: \begin{theorem}\label{thm:on-line}
1557: Suppose the examples
1558: \begin{equation*}
1559: (x_1,y_1),(x_2,y_2),\ldots
1560: \end{equation*}
1561: are generated independently
1562: from the same distribution.
1563: For any smoothed conformal predictor working in the on-line prediction protocol
1564: and any confidence level $1-\epsilon$,
1565: the random variables $\err_1^{\epsilon},\err_2^{\epsilon},\ldots$
1566: are independent and take value 1 with probability $\epsilon$.
1567: \end{theorem}
1568:
1569: Combining Theorem \ref{thm:on-line}
1570: with the strong law of large numbers
1571: we can see that
1572: \begin{equation*}
1573: \lim_{n\to\infty}
1574: \frac{\Err_n^{\epsilon}}{n}
1575: =
1576: \epsilon
1577: \end{equation*}
1578: holds with probability one for smoothed conformal predictors.
1579: (They are ``well calibrated''.)
1580: Since the number of mistakes made by a conformal predictor
1581: never exceeds the number of mistakes
1582: made by the corresponding smoothed conformal predictor,
1583: \begin{equation*}
1584: \limsup_{n\to\infty}
1585: \frac{\Err_n^{\epsilon}}{n}
1586: \le
1587: \epsilon
1588: \end{equation*}
1589: holds with probability one for conformal predictors.
1590: (They are ``conservatively well calibrated''.)
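
To give a feeling for the size of the fluctuations of $\Err_n^{\epsilon}$ around $\epsilon n$
(the values of $n$ and $\epsilon$ below are chosen only for illustration),
assume, as the notation suggests,
that $\Err_n^{\epsilon}=\err_1^{\epsilon}+\cdots+\err_n^{\epsilon}$.
By Theorem \ref{thm:on-line},
for a smoothed conformal predictor $\Err_n^{\epsilon}$ is then a binomial random variable with
\begin{equation*}
  \Expect\Err_n^{\epsilon}
  =
  \epsilon n
  \quad\text{and standard deviation}\quad
  \sqrt{n\epsilon(1-\epsilon)}.
\end{equation*}
For $n=1000$ and $\epsilon=0.05$ this gives an expected number of errors of $50$
with a standard deviation of $\sqrt{47.5}\approx6.9$.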
1591:
1592: \section{Slow teachers, lazy teachers, and the batch setting}
1593: \label{sec:slow}
1594:
1595: % Lazy and slow teachers; batch and mixtures on-line/batch
1596:
1597: In the pure on-line setting, considered in the previous section,
we get immediate feedback (the true label) for every example that we predict.
1599: This makes practical applications of this scenario questionable.
1600: Imagine, for example, a mail sorting centre
1601: using an on-line prediction algorithm
1602: for zip code recognition;
1603: suppose the feedback about the ``true'' label comes from a human ``teacher''.
1604: If the feedback is given for every object $x_i$,
1605: there is no point in having the prediction algorithm:
1606: we can just as well use the label provided by the teacher.
1607: It would help if the prediction algorithm could still work well,
1608: in particular be valid,
1609: if only every, say, tenth object were classified by a human teacher
1610: (the scenario of ``lazy'' teachers).
1611: Alternatively,
1612: even if the prediction algorithm requires the knowledge of all labels,
1613: it might still be useful if the labels were allowed to be given not immediately
1614: but with a delay (``slow'' teachers).
1615: In our mail sorting example,
1616: such a delay might make sure that we hear
1617: from local post offices about any mistakes made
before giving feedback to the algorithm.
1619:
1620: In the pure on-line protocol we had validity in the strongest possible sense:
1621: at each confidence level $1-\epsilon$ each smoothed conformal predictor
1622: made errors independently with probability $\epsilon$.
1623: In the case of weaker teachers
1624: (as usual, we are using the word ``teacher'' in the general sense of the entity
1625: providing the feedback,
1626: called Reality in the previous section),
1627: we have to accept a weaker notion of validity.
Suppose the predictor receives feedback from the teacher
1629: at the end of steps $n_1,n_2,\ldots$,
1630: $n_1<n_2<\cdots$;
1631: the feedback is the label of one of the objects that the predictor
1632: has already seen (and predicted).
1633: This scheme \cite{ryabko/etal:2003} covers both slow and lazy teachers
1634: (as well as teachers who are both slow and lazy).
1635: It was proved in \cite{nouretdinov/vovk:2003}
1636: (see also \cite{vovk/etal:2005}, Theorem 4.2)
1637: that the smoothed conformal predictors
1638: (using only the examples with known labels)
1639: remain valid in the sense
1640: \begin{equation*}
1641: \forall\epsilon\in(0,1):
1642: \Err_n^{\epsilon}/n\to\epsilon
1643: \text{ in probability}
1644: \end{equation*}
1645: if and only if $n_k/n_{k-1}\to1$ as $k\to\infty$.
1646: In other words,
1647: the validity in the sense of convergence in probability holds
1648: if and only if the growth rate of $n_k$ is subexponential.
1649: (This condition is amply satisfied for our example
1650: of a teacher giving feedback for every tenth object.)
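
The criterion is easy to check on simple feedback schedules
(the three schedules below are illustrative choices,
not taken from the cited papers):
\begin{align*}
  n_k=10k: &\quad \frac{n_k}{n_{k-1}}=\frac{k}{k-1}\to1
    && \text{(valid)},\\
  n_k=k^2: &\quad \frac{n_k}{n_{k-1}}=\left(\frac{k}{k-1}\right)^2\to1
    && \text{(valid)},\\
  n_k=2^k: &\quad \frac{n_k}{n_{k-1}}=2\text{ for all }k
    && \text{(not valid)}.
\end{align*}
The second schedule shows that the gaps between consecutive feedback steps,
here $n_k-n_{k-1}=2k-1$,
may grow without bound;
what is ruled out is geometric growth of the feedback times themselves.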
1651:
1652: \iffalse
Below are two examples of ``weak'' (slow and lazy) teachers at the 99\%
confidence level using the well-known NIST data set.
1655:
1656: \begin{figure}
1657: \centering
1658: \makebox{\includegraphics[width=\picturewidth]{tcmSlow10.eps}}
1659: \caption{\label{fig:slow teachers}An example of a Slow Teacher Predictor
1660: with a delay of 10 examples on the NIST data set.}
1661: \end{figure}
1662:
1663: \begin{figure}
1664: \centering
1665: \makebox{\includegraphics[width=\picturewidth]{tcmLazyAP10.eps}}
\caption{\label{fig:lazy}An example of a Lazy Teacher Predictor
with delays following the arithmetic progression with coefficient 10 on the NIST data set.}
1668: \end{figure}
1669: \fi
1670:
1671: The most standard \emph{batch} setting of the problem of prediction
1672: is in one respect even more demanding than our scenarios of weak teachers.
1673: In this setting we are given a training set (\ref{eq:training-set})
1674: and our goal is to predict the labels
1675: given the objects in the test set
1676: \begin{equation}\label{eq:test-set}
1677: (x_{l+1},y_{l+1}),\ldots,(x_{l+k},y_{l+k}).
1678: \end{equation}
1679: This can be interpreted as a finite-horizon version
1680: of the lazy-teacher setting:
1681: no labels are returned after step $l$.
1682: Computer experiments (see, e.g., Figure \ref{fig:batch-errors})
1683: show that approximate validity still holds;
1684: for related theoretical results,
1685: see \cite{vovk/etal:2005}, Section 4.4.
1686:
1687: \begin{figure}
1688: \centering
1689: \makebox{\includegraphics[width=\picturewidth]{TCM_test_errors_bw.eps}}
1690: \caption{\label{fig:batch-errors}Cumulative numbers of errors made on the test set
1691: by the 1-nearest neighbour conformal predictor
1692: used in the batch mode on the USPS data set
1693: (randomly permuted and split into a training set of size 7291 and a test set of size 2007)
1694: at the confidence levels 80\%, 95\% and 99\%.}
1695: \end{figure}
1696:
1697: \section{Induction and transduction}
1698: \label{sec:induction-transduction}
1699:
1700: % Transductive vs. inductive inference
1701:
1702: Vapnik's \cite{vapnik:1995,vapnik:1998}
1703: distinction between induction and transduction,
1704: as applied to the problem of prediction,
1705: is depicted in Figure \ref{fig:trans}.
1706: In \emph{inductive prediction}
1707: we first move from examples in hand to some more or less general rule,
1708: which we might call a prediction or decision rule,
1709: a model, or a theory;
1710: this is the \emph{inductive step}.
1711: When presented with a new object,
1712: we derive a prediction from the general rule;
1713: this is the \emph{deductive step}.
1714: In \emph{transductive prediction},
1715: we take a shortcut,
1716: moving from the old examples directly
1717: to the prediction about the new object.
1718:
1719: \begin{figure}
1720: \centering
1721: \input{trans.pic}
1722: \caption{\label{fig:trans}Inductive and transductive prediction.}
1723: \end{figure}
1724:
1725: Typical examples of the inductive step
1726: are estimating parameters in statistics
1727: and finding an approximating function
1728: in statistical learning theory.
1729: Examples of transductive prediction
1730: are estimation of future observations in statistics
1731: (\cite{cox/hinkley:1974}, Section 7.5, \cite{takeuchi:1975})
1732: and nearest neighbours algorithms
1733: in machine learning.
1734:
1735: In the case of simple (i.e., traditional, not hedged) predictions
1736: the distinction between induction and transduction
1737: is less than crisp.
1738: A method for doing transduction,
1739: in the simplest setting of predicting one label,
1740: is a method for predicting $y_{l+1}$
1741: from (\ref{eq:training-set}) and $x_{l+1}$.
1742: Such a method gives a prediction for any object
1743: that might be presented as $x_{l+1}$, and so it defines,
1744: at least implicitly, a rule,
1745: which might be extracted from the training set (\ref{eq:training-set}) (induction),
1746: stored, and then subsequently applied to $x_{l+1}$ to predict $y_{l+1}$ (deduction).
So any real distinction is at a practical and computational level:
1748: do we extract and store the general rule or not?
1749:
1750: For hedged predictions the difference between induction and transduction goes deeper.
1751: We will typically want different notions of hedged prediction
1752: in the two frameworks.
1753: Mathematical results about induction usually involve two parameters,
1754: often denoted $\epsilon$ (the desired accuracy of the prediction rule)
1755: and $\delta$ (the probability of achieving the accuracy of $\epsilon$),
1756: whereas results about transduction involve only one parameter,
1757: which we denote $\epsilon$ in this paper
1758: (the probability of error we are willing to tolerate);
1759: see Figure \ref{fig:trans}.
1760: For a review of inductive prediction
1761: from this point of view, see \cite{vovk/etal:2005}, Section 10.1.
1762:
1763: \section{Inductive conformal predictors}
1764: \label{sec:ICP}
1765:
1766: % Computational issues: inductive conformal predictors
1767:
1768: Our approach to prediction is thoroughly transductive,
1769: and this is what makes valid and efficient hedged prediction possible.
1770: In this section we will see, however,
1771: that there is also room for an element of induction
1772: in conformal prediction.
1773:
1774: Let us take a closer look at the process of conformal prediction,
1775: as described in Section \ref{sec:conformal}.
1776: Suppose we are given a training set (\ref{eq:training-set})
1777: and the objects in a test set (\ref{eq:test-set}),
1778: and our goal is to predict the label of each test object.
1779: If we want to use the conformal predictor based on the support vector method,
1780: as described in Section \ref{sec:conformal},
1781: we will have to find the set of the Lagrange multipliers
1782: for each test object and for each potential label $Y$ that can be assigned to it.
1783: This would involve solving
1784: $k\left|\mathbf{Y}\right|$ essentially independent optimization problems.
1785: Using the nearest neighbours approach
1786: is typically more computationally efficient,
1787: but even it is much slower than the following procedure,
1788: suggested in \cite{papadopoulos/etal:2002a,papadopoulos/etal:2002b}.
1789:
1790: Suppose we have an inductive algorithm which,
given a training set (\ref{eq:training-set}) and a new object $x$,
1792: outputs a prediction $\hat y$ for $x$'s label $y$.
1793: Fix some measure $\Delta(y,\hat y)$ of difference between $y$ and $\hat y$.
1794: The procedure is:
1795: \begin{enumerate}
1796: \item
1797: Divide the original training set (\ref{eq:training-set})
1798: into two subsets:
1799: the \emph{proper training set}
1800: $(x_1,y_1),\ldots,(x_m,y_m)$
1801: and the \emph{calibration set}
1802: $(x_{m+1},y_{m+1}),\ldots,(x_l,y_l)$.
1803: \item
1804: Construct a prediction rule $F$ from the proper training set.
1805: \item
1806: Compute the nonconformity score
1807: \begin{equation*}
1808: \alpha_i:=\Delta(y_i,F(x_i)),
1809: \quad
1810: i=m+1,\ldots,l,
1811: \end{equation*}
1812: for each example in the calibration set.
1813: \item
1814: For every test object $x_i$,
1815: $i=l+1,\ldots,l+k$,
1816: do the following:
1817: \begin{enumerate}
1818: \item
1819: for every possible label $Y\in\mathbf{Y}$
compute the nonconformity score $\alpha_i:=\Delta(Y,F(x_i))$
1821: and the p-value
1822: \begin{equation*}
1823: p_Y
1824: :=
1825: \frac
1826: {
1827: \#\{j\in\{m+1,\ldots,l,i\} \st \alpha_j\ge\alpha_i\}
1828: }
1829: {l-m+1};
1830: \end{equation*}
1831: \item
1832: output the prediction sets
1833: $
1834: \Gamma^{\epsilon}
1835: \left(
1836: x_1,y_1,\ldots,x_{l},y_{l},x_{i}
1837: \right)
1838: $
1839: given by the right-hand side of (\ref{eq:Gamma}).
1840: \end{enumerate}
1841: \end{enumerate}
1842: This is a special case of ``inductive conformal predictors'',
1843: as defined in \cite{vovk/etal:2005}, Section 4.1.
1844: In the case of classification,
1845: of course,
1846: we could package the p-values as a simple prediction
1847: complemented with confidence (\ref{eq:conf}) and credibility (\ref{eq:cred}).
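
As a toy numerical illustration of the last step of the procedure
(all numbers are made up),
suppose the calibration set contains $l-m=4$ examples
with nonconformity scores $0.1$, $0.4$, $0.3$, $0.8$,
and that for some test object the two possible labels $a$ and $b$
receive the nonconformity scores $0.2$ and $0.9$, respectively.
Counting the test example itself,
\begin{equation*}
  p_a
  =
  \frac{3+1}{5}
  =
  0.8,
  \qquad
  p_b
  =
  \frac{0+1}{5}
  =
  0.2.
\end{equation*}
If, as in (\ref{eq:Gamma}),
the prediction set consists of the labels whose p-values exceed $\epsilon$,
we obtain $\Gamma^{\epsilon}=\{a,b\}$ for $\epsilon<0.2$
and $\Gamma^{\epsilon}=\{a\}$ for $0.2\le\epsilon<0.8$.
Every p-value computed in this way is at least $1/(l-m+1)$ (here $0.2$),
so a calibration set of size $l-m$ can exclude a label
only at confidence levels $1-\epsilon$ not exceeding $(l-m)/(l-m+1)$.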
1848:
1849: Inductive conformal predictors are valid in the sense that
1850: the probability of error
1851: \begin{equation*}
1852: y_{i}
1853: \notin
1854: \Gamma^{\epsilon}
1855: \left(
1856: x_1,y_1,
\ldots,
1858: x_l,y_l,
1859: x_{i}
1860: \right)
1861: \end{equation*}
1862: ($i=l+1,\ldots,l+k$, $\epsilon\in(0,1)$)
1863: never exceeds $\epsilon$
1864: (cf.\ (\ref{eq:error})).
1865: The on-line version of inductive conformal predictors,
1866: with a stronger notion of validity,
1867: is described in \cite{vovk:2002}
1868: and \cite{vovk/etal:2005} (Section 4.1).
1869:
1870: The main advantage of inductive conformal predictors
1871: is their computational efficiency:
1872: the bulk of the computations is performed only once,
and what remains to be done for each test example
1874: is to apply the prediction rule found at the inductive step,
1875: to apply $\Delta$ to find the nonconformity score $\alpha$ for this example,
1876: and to find the position of $\alpha$ among the nonconformity scores
1877: of the calibration examples.
The main disadvantage is a possible loss of prediction efficiency:
for conformal predictors,
we can effectively use the whole training set
as both the proper training set and the calibration set,
whereas inductive conformal predictors have to split it between the two.
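
For a rough sense of scale,
take the USPS split of Figure \ref{fig:batch-errors},
with $k=2007$ test objects,
and assume the ten digit labels, $\left|\mathbf{Y}\right|=10$.
A conformal predictor based on the support vector method
would then have to solve
\begin{equation*}
  k\left|\mathbf{Y}\right|
  =
  2007\times10
  =
  20070
\end{equation*}
optimization problems,
whereas an inductive conformal predictor constructs the prediction rule $F$ only once
and afterwards needs $2007$ evaluations of $F$ and $20070$ evaluations of $\Delta$,
each followed by a comparison with the $l-m$ stored calibration scores.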
1882:
1883: \section{Conclusion}
1884: \label{sec:conclusion}
1885:
1886: This paper shows how many machine-learning techniques
1887: can be complemented with provably valid measures
1888: of accuracy and reliability.
1889: We explained briefly how this can be done
1890: for support vector machines, nearest neighbours algorithms,
1891: and the ridge regression procedure,
1892: but the principle is general:
1893: virtually any (we are not aware of exceptions) successful prediction technique
1894: designed to work under the randomness assumption
1895: can be used to produce equally successful hedged predictions.
1896: Further examples are given in our recent book \cite{vovk/etal:2005}
1897: (joint with Glenn Shafer),
1898: where we construct conformal predictors and inductive conformal predictors
1899: based on nearest neighbours regression, logistic regression,
1900: bootstrap, decision trees, boosting, and neural networks;
1901: general schemes for constructing conformal predictors
1902: and inductive conformal predictors
1903: are given on pp.~28--29 and on pp.~99--100 of \cite{vovk/etal:2005},
1904: respectively.
1905: Replacing the original simple predictions with hedged predictions
1906: enables us to control the number of errors made
1907: by appropriately choosing the confidence level.
1908:
1909: \section*{Acknowledgements}
1910:
1911: This work is partially supported by MRC
1912: (grant % G0301107
1913: ``Pro\-te\-o\-mic analysis of the human serum pro\-te\-ome'')
1914: and the Royal Society
1915: (grant ``Efficient pseudo-random number generators'').
1916:
1917: \begin{thebibliography}{99}
1918:
1919: \bibitem{bellotti/etal:2005}
1920: Bellotti, T., Luo, Z., Gammerman, A., van Delft, F.~W.\ and Saha, V.\ (2005)
1921: Qualified predictions for microarray and proteomics pattern diagnostics with confidence machines.
1922: \emph{International Journal of Neural Systems}, \textbf{15}, 247--258.
1923: Yang, Z.~R.\ and Dalby, A.~R.\ (eds),
1924: Special Issue on Bioinformatics.
1925: \bibitem{cesabianchi/lugosi:2006}
1926: Cesa-Bianchi, N.\ and Lugosi, G.\ (2006)
1927: \emph{Prediction, Learning, and Games}.
1928: Cambridge University Press, Cambridge.
1929: \bibitem{cox/hinkley:1974}
1930: Cox, D.~R.\ and Hinkley, D.~V.\ (1974)
1931: \emph{Theoretical Statistics}.
1932: Chapman and Hall, London.
1933: % \bibitem{gammerman/etal:1998}
1934: % A.~Gammerman, V.~N.~Vapnik and V.~Vovk,
1935: % Learning by transduction,
1936: % in: G.~F.~Cooper and S.~Moral, eds.,
1937: % \emph{Proceedings of the Fourteenth Conference
1938: % on Uncertainty in Artificial Intelligence}
1939: % (Morgan Kaufmann, San Francisco, CA, 1998)
1940: % 148--156.
1941: \bibitem{gammerman/thatcher:1992}
1942: Gammerman, A.\ and Thatcher, A.~R.\ (1992)
1943: Bayes\-ian diagnostic probabilities without assuming in\-de\-pen\-dence of symptoms.
1944: \emph{Yearbook of Medical In\-for\-mat\-ics}, pp.~323--330.
1945: \bibitem{lecun/etal:1990}
1946: LeCun, Y., Boser, B., Denker, J.~S., Henderson, D., How\-ard, R.~E.,
1947: Hubbard, W.\ and Jackel, L.~J.\ (1990)
1948: Handwritten digit recognition with backpropagation network.
1949: In \emph{Advances in Neural Information Processing Systems 2},
1950: pp.~396--404,
1951: Morgan Kaufmann, San Ma\-teo, CA.
1952: \bibitem{li/vitanyi:1997}
1953: Li, M.\ and Vit\'anyi, P.\ (1993)
1954: \emph{An Introduction to Kolmogorov Complexity and Its Applications}.
1955: Springer, New York.
1956: Second edition: 1997.
1957: \bibitem{martin-lof:1966}
1958: Martin-L\"of, P.\ (1966)
1959: The definition of random sequences.
1960: \emph{Information and Control}, \textbf{9}, 602--619.
1961: \bibitem{melluish/etal:2001}
Melluish, T., Saunders, C., Nouretdinov, I.\ and Vovk, V.\ (2001)
Comparing the Bayes and typicalness frameworks.
In De Raedt, L.\ and Flach, P.\ (eds),
1965: \emph{Machine Learning: ECML 2001,
1966: Proceedings of the Twelfth European Conference on Machine Learning,
1967: LNAI}, \textbf{2167}, pp.~360--371,
1968: Springer, Heidelberg.
1969: Full version published as Technical Report TR-01-05,
1970: Computer Learning Research Centre,
1971: Royal Holloway, University of London.
1972: % (can be downloaded from \texttt{http://www.clrc.rhul.ac.uk}).
1973: \bibitem{nouretdinov/etal:2001rr}
1974: Nouretdinov, I., Melluish, T.\ and Vovk, V.\ (2001)
1975: Ridge Regression Confidence Machine.
1976: In \emph{Proceedings of the Eighteenth International Conference
1977: on Machine Learning}, pp.~385--392,
1978: Morgan Kaufmann, San Fran\-cis\-co, CA.
1979: \bibitem{nouretdinov/vovk:2003}
1980: Nouretdinov, I.\ and Vovk, V.\ (2003)
1981: Criterion of calibration for transductive confidence machine with limited feedback.
1982: In Gavald\`a, R., Jantke, K.~P.\ and Takimoto, E.\ (eds),
1983: \emph{Proceedings of the Fourteenth International Conference on Algorithmic Learning Theory,
1984: LNAI}, \textbf{2842}, pp.~259--267,
1985: Springer, Berlin.
1986: To appear in \emph{Theoretical Computer Science}
1987: (special issue devoted to the ALT'2003 conference).
1988: % \bibitem{nouretdinov/etal:2001de}
1989: % I.~Nouretdinov, V.~Vovk, M.~Vyugin and A.~Gammerman,
1990: % Pattern recognition and density estimation under the general iid assumption,
1991: % in: D.~Helmbold and B.~Williamson, eds.,
1992: % \emph{Proceedings of the Fourteenth Annual Conference
1993: % on Computational Learning Theory
1994: % and Fifth European Conference
1995: % on Computational Learning Theory},
1996: % \emph{Lecture Notes in Artificial Intelligence},
1997: % \textbf{2111} (2001) 337--353;
1998: % Full version published as a CLRC technical report
1999: % (can be downloaded from \texttt{http://www.clrc.rhul.ac.uk}).
2000: \bibitem{papadopoulos/etal:2002a}
2001: Papadopoulos, H., Proedrou, K., Vovk, V.\ and Gammerman, A.\ (2002)
2002: Inductive Confidence Machines for regression.
In Elomaa, T., Mannila, H.\ and Toivonen, H.\ (eds),
2004: \emph{Machine Learning: ECML 2002,
2005: Proceedings of the Thirteenth European Conference on Machine Learning,
2006: LNCS}, \textbf{2430}, pp.~345--356,
2007: Springer, Berlin.
2008: \bibitem{papadopoulos/etal:2002b}
2009: Papadopoulos, H., Vovk, V.\ and Gammerman, A.\ (2002)
2010: Qualified predictions for large data sets in the case of pattern recognition.
2011: In \emph{Proceedings of the International Conference on Machine Learning and Applications
2012: (ICMLA'2002)}, pp.~159--163,
2013: CSREA Press.
2014: \bibitem{popper:1934}
2015: Popper, K.~R.\ (1934)
2016: \emph{Logik der Forschung}.
2017: Springer, Vienna.
2018: English translation (1959):
2019: \emph{The Logic of Sci\-en\-tif\-ic Discovery},
2020: Hutchinson, London.
2021: % \bibitem{proedrou/etal:2002}
2022: % Proedrou, K., Papadopoulos, H., Vovk, V.\ and Gammerman, A.\ (2002)
2023: % Nearest Neighbours Transductive Confidence Machine,
2024: % in: \emph{Proceedings of the Artificial Intelligence and Statistics Conference}
2025: \bibitem{ryabko/etal:2003}
2026: Ryabko, D., Vovk, V.\ and Gammerman, A.\ (2003)
2027: Online prediction with real teachers.
2028: Technical Report CS-TR-03-09, Department of Computer Science,
2029: Royal Holloway, University of London.
2030: % \bibitem{saunders/etal:1999}
2031: % C.~Saunders, A.~Gammerman and V.~Vovk,
2032: % Transduction with confidence and credibility,
2033: % in: \emph{Proceedings of the Sixteenth International Joint Conference
2034: % on Artificial Intelligence}
2035: % (Morgan Kaufmann, 1999)
2036: % 722--726.
2037: % \bibitem{scholkopf/etal:1999}
2038: % B.~Sch\"olkopf, C.~J.~C.~Burges and A.~J.~Smola, eds.,
2039: % \emph{Advances in Kernel Methods, Support Vector Learning}
2040: % (MIT Press, 1999).
2041: \bibitem{shahmuradov/etal:2005}
2042: Shahmuradov, I.~A., Solovyev, V.~V.\ and Gammerman, A.\ (2005)
2043: Plant promoter prediction with confidence estimation.
2044: \emph{Nucleic Acids Research}, \textbf{33}, 1069--1076.
2045: \bibitem{sutton/barto:1998}
2046: Sutton, R.~S.\ and Barto, A.~G.\ (1998)
2047: \emph{Reinforcement Learning: An Introduction}.
2048: MIT Press, Cambridge, MA.
2049: \ifLATIN
2050: \bibitem{takeuchi:1975}
2051: Takeuchi, K.\ (1975)
2052: \emph{Statistical Pre\-dic\-tion Theory} (in Japanese).
2053: Baih\=ukan, Tokyo.
2054: \fi
2055: \ifnotLATIN
2056: \bibitem{takeuchi:1975}
2057: Takeuchi, K.\ (1975)
\begin{CJK*}{UTF8}{min}統計的予測論\end{CJK*}
2059: (\emph{Statistical Pre\-dic\-tion Theory}).
2060: Baih\=ukan, Tokyo.
2061: \fi
2062: \bibitem{vapnik:1995}
2063: Vapnik, V.~N.\ (1995)
2064: \emph{The Nature of Statistical Learning Theory}.
2065: Springer, New York.
2066: Second edition: 2000.
2067: \bibitem{vapnik:1998}
2068: Vapnik, V.~N.\ (1998)
2069: \emph{Statistical Learning Theory}.
2070: Wiley, New York.
2071: \ifLATIN
2072: \bibitem{vapnik/chervonenkis:1974}
2073: Vapnik, V.~N.\ and Chervonenkis, A.~Y.\ (1974)
2074: \emph{Theory of Pattern Rec\-og\-ni\-tion} (in Russian).
2075: Nauka, Moscow.
2076: German translation (1979): \emph{Theorie der Zeichenerkennung},
2077: Akademie, Berlin.
2078: \fi
2079: \ifnotLATIN
2080: \bibitem{vapnik/chervonenkis:1974}
2081: Vapnik, V.~N.\ and Chervonenkis, A.~Y.\ (1974)
2082: \begin{cyr}Te\-o\-ri{ya}\ ras\-po\-zna\-va\-ni{ya}\
2083: ob\-ra\-zov\end{cyr} (\emph{Theory of Pattern Rec\-og\-ni\-tion}).
2084: Nauka, Moscow.
2085: German translation (1979): \emph{Theorie der Zeichenerkennung},
2086: Akademie, Berlin.
2087: \fi
2088: \bibitem{vovk/etal:1999}
2089: Vovk, V., Gammerman, A.\ and Saunders, C.\ (1999)
2090: Machine-learning applications of algorithmic ran\-dom\-ness.
2091: In Bratko, I.\ and Dzeroski, S.\ (eds),
2092: \emph{Proceedings of the Sixteenth International Conference on Machine Learning},
2093: pp.~444--453,
2094: Morgan Kaufmann, San Fran\-cis\-co, CA.
2095: \bibitem{vovk:2001}
2096: Vovk, V.\ (2001)
2097: Competitive on-line statistics.
2098: \emph{International Statistical Review}, \textbf{69}, 213--248.
2099: \bibitem{vovk:2002}
2100: Vovk, V.\ (2002)
2101: On-line Confidence Machines are well-calibrated.
2102: In \emph{Proceedings of the Forty Third Annual Symposium on Foundations of Computer Science},
2103: pp.~187--196,
2104: IEEE Computer Society, Los Alamitos, CA.
2105: \bibitem{vovk/etal:2005}
2106: Vovk, V., Gammerman, A.\ and Shafer, G.\ (2005)
2107: \emph{Al\-go\-rith\-mic Learning in a Random World}.
2108: Springer, New York.
2109: \end{thebibliography}
2110: \end{document}
2111:
2112:
2113: Remove:
2114:
2115: \emergencystretch=5mm
2116: \tolerance=400
2117: \allowdisplaybreaks[3]
2118:
2119: \newcommand{\Vladimir}{Vladimir }
2120: \newcommand{\DOT}{.}
2121: \newcommand{\zzrelax}[1]{}
2122:
2123: \DeclareMathAlphabet{\mathbfit}{OT1}{cmr}{bx}{it} % description: LATEX companion, pp.177 and 181
2124:
2125: \newcommand{\st}{\mathrel{:}}
2126: \newcommand{\given}{\mathrel{|}}
2127:
2128: \newcommand{\bbbr}{\mathbb{R}} % real numbers
2129: \newcommand{\bbbc}{\mathbb{C}} % complex numbers
2130: \newcommand{\bbbq}{\mathbb{Q}} % rational numbers
2131: \newcommand{\bbbn}{\mathbb{N}} % natural numbers
2132: \newcommand{\III}{\mathbb{I}} % indicator
2133: \newcommand{\bbbp}{\mathbb{P}} % auxiliary (probability)
2134: \newcommand{\bbbe}{\mathbb{E}} % auxiliary (expectation)
2135: \newcommand{\K}{\mathcal{K}} % capital
2136: \newcommand{\FFF}{\mathcal{F}} % sigma-algebra
2137: \newcommand{\GGG}{\mathcal{G}} % sigma-algebra
2138: \newcommand{\PPP}{\mathcal{P}} % statistical model
2139:
2140: \newcommand{\Prob}{\mathop{\bbbp}\nolimits}
2141: \newcommand{\Expect}{\mathop{\bbbe}\nolimits}
2142: %\newcommand{\LP}{\mathop{\underline{\bbbp}}\nolimits}
2143: %\newcommand{\UP}{\mathop{\overline{\bbbp}}\nolimits}
2144: %\newcommand{\ULP}{\mathop{\overline{\underline{\bbbp}}}\nolimits}
2145: \newcommand{\sign}{\mathop{{\rm sign}}\nolimits}
2146: \newcommand{\var}{\mathop{{\rm var}}\nolimits}
2147: \newcommand{\co}{\mathop{{\rm co}}\nolimits}
2148: \newcommand{\rank}{\mathop{{\rm rank}}\nolimits}
2149: \newcommand{\err}{\mathop{{\rm err}}\nolimits}
2150: \newcommand{\Err}{\mathop{{\rm Err}}\nolimits}
2151: \newcommand{\length}{\mathop{{\rm length}}\nolimits}
2152: \newcommand{\lth}{\mathop{{\rm lth}}\nolimits}
2153: \newcommand{\Lth}{\mathop{{\rm Lth}}\nolimits}
2154:
2155: \newenvironment{Proof}[1]
2156: {\trivlist\item[\hskip\labelsep\textbf{Proof #1}]}
2157: {\endtrivlist}
2158: \newcommand{\boxforqed}{\rule{.3em}{1.5ex}}
2159: \newcommand{\qedtext}{\unskip\nobreak\hfil
2160: \penalty50\hskip1em\null\nobreak\hfil\boxforqed
2161: \parfillskip=0pt\finalhyphendemerits=0\endgraf}
2162: \newcommand{\qedmath}{\eqno\boxforqed}
2163: \newtheorem{Remark}{Remark}
2164: \newenvironment{remark}
2165: {\begin{Remark} \begingroup\rm}
2166: {\endgroup \end{Remark}}
2167: \newenvironment{remark*}
2168: {\trivlist\item[\hskip\labelsep{\bfseries Remark}]\relax}
2169: {\endtrivlist}
2170:
2171: \begin{document}
2172: \label{firstpage}
2173: \maketitle
2174:
2175: \begin{abstract}
2176: We consider the on-line predictive version
2177: of the standard problem of linear regression;
2178: the goal is to predict each consecutive response
2179: given the corresponding explanatory variables
2180: and all the previous observations.
2181: The standard treatment of prediction in linear regression analysis
2182: has two drawbacks:
2183: (1) the usual prediction intervals
2184: guarantee that the probability of error
2185: is equal to the nominal significance level $\epsilon$,
2186: but this property per se does not imply that the long-run frequency of error
2187: is close to $\epsilon$;
2188: (2) it is not suitable for prediction of complex systems
2189: as it assumes that the number of observations
2190: exceeds the number of parameters.
2191: We state a general result showing that in the on-line protocol
2192: the frequency of error does equal the nominal significance level,
2193: up to statistical fluctuations,
2194: and we describe alternative regression models
2195: in which informative prediction intervals can be found
2196: before the number of observations exceeds the number of parameters.
2197: One of these models,
2198: which only assumes that the observations are independent and identically distributed,
2199: is popular in machine learning but
2200: greatly underused in the statistical theory of regression.
2201: \end{abstract}
2202:
2203: \ifJOURNAL
2204: \noindent
2205: \textbf{Key words:}
2206: Gauss linear model; independent identically distributed observations;
2207: multivariate analysis; on-line protocol; prequential statistics; regression
2208: \fi
2209: