0611:cs0611011/TCJ.TEX

1: % Last changed by Volodya, 17 Oct 2006

2: % Spell checked (UK): 17 Oct 2006

3: % Main message: you can control the number of mistakes.

4: % 1957 lines, 79 KB

5:

6: \newif\ifJOURNAL

7: \JOURNALfalse

8: \newif\ifWP

9: \WPfalse

10: \newif\ifarXiv

11: \arXivfalse

12:

13: %\JOURNALtrue			% choose JOURNAL, WP, or arXiv

14: %\WPtrue

15: \arXivtrue

16:

17: \newif\ifnotJOURNAL	% derivative conditional

18: \notJOURNALtrue

19: \ifJOURNAL\notJOURNALfalse\fi

20:

21: \newif\ifLATIN		% LATIN means that the Cyrillic references should be set in Latin

22:

23: \ifJOURNAL

24:   \documentclass{cja4}

25:

26:   %%the optional argument is used to get times font instead of CMR

27:   %\documentclass[mathtime]{cja4}

28:

29:   \copyrightyear{2006}

30:   \vol{00}

31:   \issue{0}

32:   \DOI{000}

33:   \usepackage{amsmath,amsfonts,latexsym,graphicx}

34:   \LATINfalse

35: \fi

36:

37: \ifWP

38:   \documentclass[toc]{kpnsarticle}

39:   \usepackage{amsmath,amsfonts,latexsym,graphicx,epsfig}

40:   \LATINfalse

41: \fi

42:

43: \ifarXiv

44:   \documentclass{article}

45:   \usepackage{amsmath,amsfonts,latexsym,graphicx}

46:   \LATINtrue

47: \fi

48:

49: \newif\ifnotLATIN	% derivative conditional

50: \notLATINtrue

51: \ifLATIN\notLATINfalse\fi

52:

53: \emergencystretch=5mm

54: \tolerance=400

55: \allowdisplaybreaks[3]

56: %\input{hyphenation.txt}

57:

58: \ifnotLATIN

59:   \usepackage{CJK}

60:   \input{OT2enc.def}

61:   \newenvironment{cyr}

62:   {\fontencoding{OT2}\fontfamily{wncyr}\fontseries{m}\fontshape{n}\selectfont}

63:   {\fontencoding{OT1}\fontfamily{tir}\selectfont}

64: \fi

65:

66: \newcommand{\bbbr}{{\mathbb{R}}}

67: \newcommand{\bbbn}{{\mathbb{N}}}

68: \newcommand{\st}{:}

69: \newcommand{\given}{\mathbin{|}}

70:

71: \newlength{\picturewidth}

72: \ifJOURNAL

73:   \setlength{\picturewidth}{0.98\columnwidth}

74: \fi

75: \ifnotJOURNAL

76:   \setlength{\picturewidth}{0.72\columnwidth}

77: \fi

78:

79: \newcommand{\E}{{\bf E}}

80:

81: \newcommand{\bbbe}{{\mathbb{E}}}		% expected value

82: \newcommand{\Expect}{\mathop{\bbbe}\nolimits}

83:

84: \newcommand{\Err}{\mathop{\mathrm{Err}}\nolimits}

85: \newcommand{\err}{\mathop{\mathrm{err}}\nolimits}

86:

87: \newcommand{\Mult}{\mathop{\mathrm{Mult}}\nolimits}

88: \newcommand{\mult}{\mathop{\mathrm{mult}}\nolimits}

89:

90: \newcommand{\Emp}{\mathop{\mathrm{Emp}}\nolimits}

91: \newcommand{\emp}{\mathop{\mathrm{emp}}\nolimits}

92:

93: \ifnotJOURNAL

94:   \newtheorem{lemma}{Lemma}

95:   \newtheorem{proposition}{Proposition}

96:   \newtheorem{corollary}{Corollary}

97:   \newtheorem{theorem}{Theorem}

98:   \newenvironment{proof}

99:     {\trivlist\item[\hskip\labelsep\textbf{Proof}]}

100:     {\endtrivlist}

101: \fi

102:

103: \newenvironment{remark*}

104:   {\trivlist\item[\hskip\labelsep{\bfseries Remark}]\relax}

105:   {\endtrivlist}

106: \newenvironment{definition*}

107:   {\trivlist\item[\hskip\labelsep{\bfseries Definition}]\relax}

108:   {\endtrivlist}

109:

110: \ifWP

111:   \title{Hedging Predictions in Machine Learning}

112:   \author{Alexander Gammerman and Vladimir Vovk}

113:   \newcommand{\No}{2}

114:   %For the two dates option: uncomment the next 2 lines

115:   %\twodatestrue

116:   %\newcommand{\firstposted}{November 2, 2006}

117: \fi

118:

119: \ifarXiv

120:   \title{Hedging Predictions in Machine Learning}

121:   \author{Alexander Gammerman and Vladimir Vovk\\

122:       Computer Learning Research Centre\\

123:       Department of Computer Science\\

124:       Royal Holloway, University of London\\

125:       Egham, Surrey TW20 0EX, UK\\

126:       \texttt{\{alex,vovk\}@cs.rhul.ac.uk}}

127: \fi

128:

129: \begin{document}

130: \ifJOURNAL

131:   \title[Hedging Predictions]{Hedging Predictions\\in Machine Learning}

132:   % {\large preliminary draft, 28 April 2006}}

133:   \author{Alexander Gammerman}

134:   \author{Vladimir Vovk}

135:   \affiliation{Computer Learning Research Centre,

136:     Royal Holloway, University of London\\

137:     Egham, Surrey TW20 0EX}

138:   \email{\{alex,vovk\}@cs.rhul.ac.uk}

139:

140:   \shortauthors{A.~Gammerman and V.~Vovk}

141:

142:   \received{00 Month 2006}

143:   \revised{00 Month 2006}

144: \fi

145:

146: \ifnotJOURNAL

147:   \maketitle

148: \fi

149:

150: \begin{abstract}

151:   Recent advances in machine learning make it possible

152:   to design efficient prediction algorithms for data sets with huge numbers of parameters.

153:   This paper describes a new technique for ``hedging'' the predictions

154:   output by many such algorithms,

155:   including support vector machines, kernel ridge regression, kernel nearest neighbours,

156:   and by many other state-of-the-art methods.

157:   The hedged predictions for the labels of new objects

158:   include quantitative measures of their own accuracy and reliability.

159:   These measures are provably valid under the assumption of randomness,

160:   traditional in machine learning:

161:   the objects and their labels are assumed to be generated independently

162:   from the same probability distribution.

163:   In particular, it becomes possible to control (up to statistical fluctuations)

164:   the number of erroneous predictions by selecting a suitable confidence level.

165:   Validity being achieved automatically,

166:   the remaining goal of hedged prediction is efficiency:

167:   taking full account of the new objects' features

168:   and other available information to produce as accurate predictions as possible.

169:   This can be done successfully using the powerful machinery of modern machine learning.

170: \end{abstract}

171:

172: \ifJOURNAL

173:   \keywords{Classification, confidence, induction, learning, prediction, randomness, regression, transduction}

174:

175:   \maketitle

176: \fi

177:

178: \section{Introduction}

179: \label{sec:introduction}

180:

181: % 1. Successes of machine learning:

182: %    prediction under only one assumption (randomness)

183: %    kernel methods: high-dimensional data

184: % 2. Weak point: no confidence, or loose bounds, or strong assumptions (Bayesian)

185: % 3. Advantages of conformal prediction

186: % 4. Contents of this paper

187:

188: The two main varieties of the problem of prediction,

189: classification and regression,

190: % I talk about classification and regression

191: % since prediction is often associated with the Kalman filter,

192: % which is not covered in this paper

193: % (because it works outside the randomness assumption)

194: are standard subjects in statistics and machine learning.

195: The classical classification and regression techniques

196: can deal successfully with conventional small-scale, low-dimensional data sets;

197: however, attempts to apply these techniques to modern high-dimensional and high-throughput data sets

198: encounter serious conceptual and computational difficulties.

199: Several new techniques,

200: most notably support vector machines \cite{vapnik:1995,vapnik:1998}

201: and other kernel methods,

202: have been developed in machine learning recently

203: with the explicit goal of dealing with high-dimensional data sets

204: % kernel methods: we do not need to process many attributes explicitly

205: with large numbers of objects.

206: % at some point we can discard all elements that are not support vectors

207:

208: A typical drawback of the new techniques is the lack of useful measures of confidence

209: in their predictions.

210: For example, some of the tightest upper bounds of the popular PAC theory

211: on the probability of error exceed~1 even for relatively clean data sets

212: (\cite{vovk/etal:2005}, p.~249).

213: This paper describes an efficient way to ``hedge'' the predictions

214: produced by the new and traditional machine-learning methods,

215: i.e., to complement them with measures of their accuracy and reliability.

216: Appropriately chosen,

217: not only are these measures valid and informative,

218: but they also take full account of the special features

219: of the object to be predicted.

220:

221: We call our algorithms for producing hedged predictions ``conformal predictors'';

222: they are formally introduced in Section \ref{sec:conformal}.

223: Their most important property is the automatic validity under the randomness assumption

224: (to be discussed shortly).

225: Informally, validity means that conformal predictors never overrate

226: the accuracy and reliability of their predictions.

227: This property, stated in Sections \ref{sec:conformal} and \ref{sec:on-line},

228: is formalized in terms of finite data sequences,

229: without any recourse to asymptotics.

230:

231: The claim of validity of conformal predictors

232: depends on an assumption that is shared by many other algorithms in machine learning,

233: which we call the assumption of randomness:

234: the objects and their labels are assumed to be generated independently

235: from the same probability distribution.

236: Admittedly, this is a strong assumption,

237: and areas of machine learning are emerging

238: that rely on other assumptions

239: (such as the Markovian assumption of reinforcement learning;

240: see, e.g., \cite{sutton/barto:1998})

241: or dispense with any stochastic assumptions altogether

242: (competitive on-line learning;

243: see, e.g., \cite{cesabianchi/lugosi:2006,vovk:2001}).

244: It is, however, much weaker than assuming a parametric statistical model,

245: sometimes complemented with a prior distribution on the parameter space,

246: which is customary in the statistical theory of prediction.

247: And taking into account the strength of the guarantees that can be proved

248: under this assumption,

249: it does not appear overly restrictive.

250:

251: So we know that conformal predictors tell the truth.

252: Clearly, this is not enough:

253: truth can be uninformative and so useless.

254: We will refer to various measures of informativeness of conformal predictors

255: as their ``efficiency''.

256: As conformal predictors are provably valid,

257: efficiency is the only thing we need to worry about

258: when designing conformal predictors

259: for solving specific problems.

260: Virtually any classification or regression algorithm

261: can be transformed into a conformal predictor,

262: and so most of the arsenal of methods of modern machine learning

263: can be brought to bear on the design of efficient conformal predictors.

264:

265: We start the main part of the paper, in Section \ref{sec:ideal},

266: with the description of an idealized predictor

267: based on Kolmogorov's algorithmic theory of randomness.

268: This ``universal predictor'' produces the best possible hedged predictions

269: but, unfortunately, is noncomputable.

270: We can, however, set ourselves the task of approximating the universal predictor

271: as well as possible.

272:

273: In Section \ref{sec:conformal} we formally introduce the notion of conformal predictors

274: and state a simple result about their validity.

275: In that section we also briefly describe results of computer experiments

276: demonstrating the methodology of conformal prediction.

277:

278: In Section \ref{sec:Bayesian} we consider an example demonstrating

279: how conformal predictors react to the violation of our model

280: of the stochastic mechanism generating the data

281: (within the framework of the randomness assumption).

282: If the model coincides with the actual stochastic mechanism,

283: we can construct an optimal conformal predictor,

284: which turns out to be almost as good as the Bayes-optimal confidence predictor

285: (the formal definitions will be given later).

286: When the stochastic mechanism significantly deviates from the model,

287: conformal predictions remain valid but their efficiency inevitably suffers.

288: The Bayes-optimal predictor starts producing very misleading results

289: which superficially look as good as when the model is correct.

290:

291: In Section \ref{sec:on-line} we describe the ``on-line'' setting

292: of the problem of prediction,

293: and in Section \ref{sec:slow} contrast it with the more standard ``batch'' setting.

294: The notion of validity introduced in Section \ref{sec:conformal}

295: is applicable to both settings,

296: but in the on-line setting it can be strengthened:

297: we can now prove that the percentage of the erroneous predictions

298: will be close, with high probability,

299: to one minus the chosen confidence level.

300: For the batch setting,

301: the stronger property of validity for conformal predictors

302: remains an empirical fact.

303: In Section \ref{sec:slow} we also discuss limitations of the on-line setting

304: and introduce new settings intermediate between on-line and batch.

305: To a large degree,

306: conformal predictors still enjoy the stronger property of validity

307: for the intermediate settings.

308:

309: Section \ref{sec:induction-transduction} is devoted

310: to the discussion of the difference between two kinds of inference from empirical data,

311: induction and transduction

312: (emphasized by Vladimir Vapnik \cite{vapnik:1995,vapnik:1998}).

313: Conformal predictors belong to transduction,

314: but combining them with elements of induction

315: can lead to a significant improvement in their computational efficiency

316: (Section \ref{sec:ICP}).

317:

318: We show how some popular methods of machine learning

319: can be used as underlying algorithms for hedged prediction.

320: We do not give the full description of these methods

321: and refer the reader to the existing readily accessible descriptions.

322: This paper is, however, self-contained in the sense

323: that we explain all features of the underlying algorithms

324: that are used in hedging their predictions.

325: We hope that the information we provide will enable the reader

326: to apply our hedging techniques

327: to their favourite machine-learning methods.

328:

329: \section{Ideal hedged predictions}

330: \label{sec:ideal}

331:

332: % Algorithmic randomness and idealized conformal predictors

333: % (interesting objects for math research)

334:

335: The most basic problem of machine learning is perhaps the following.

336: We are given a \emph{training set} of \emph{examples}

337: \begin{equation}\label{eq:training-set}

338:   (x_1,y_1),\ldots,(x_l,y_l),

339: \end{equation}

340: each example $(x_i,y_i)$, $i=1,\ldots,l$, consisting of an \emph{object} $x_i$

341: (typically, a vector of attributes)

342: and its \emph{label} $y_i$;

343: the problem is to predict the label $y_{l+1}$

344: of a new object $x_{l+1}$.

345: Two important special cases are where the labels are known \emph{a priori}

346: to belong to a relatively small finite set

347: (the problem of \emph{classification})

348: and where the labels are allowed to be any real numbers

349: (the problem of \emph{regression}).

350:

351: The usual goal of classification is to produce a prediction $\hat y_{l+1}$

352: that is likely to coincide with the true label $y_{l+1}$,

353: and the usual goal of regression is to produce a prediction $\hat y_{l+1}$

354: that is likely to be close to the true label $y_{l+1}$.

355: In the case of classification,

356: our goal will be to complement the prediction $\hat y_{l+1}$

357: with some measure of its reliability.

358: In the case of regression,

359: we would like to have some measure of accuracy and reliability of our prediction.

360: There is a clear trade-off between accuracy and reliability:

361: we can improve the former by relaxing the latter

362: and vice versa.

363: We are looking for algorithms that achieve the best possible trade-off

364: and for a measure that would quantify the achieved trade-off.

365:

366: Let us start with the case of classification.

367: The idea is to try every possible label $Y$ as a candidate for $x_{l+1}$'s label

368: and see how well the resulting sequence

369: \begin{equation}\label{eq:completion}

370:   (x_1,y_1),\dots,(x_l,y_l),(x_{l+1},Y)

371: \end{equation}

372: conforms to the randomness assumption

373: (if it does conform to this assumption, we will say that it is ``random'';

374: this will be formalized later in this section).

375: The ideal case is where all $Y$s but one lead to sequences (\ref{eq:completion})

376: that are not random;

377: we can then use the remaining $Y$ as a confident prediction for $y_{l+1}$.

378:

379: In the case of regression,

380: we can output the set of all $Y$s that lead to random (\ref{eq:completion})

381: as our ``prediction set''.

382: An obvious obstacle is that the set of all possible $Y$s is infinite

383: and so we cannot go through all the $Y$s explicitly,

384: but we will see in the next section that there are ways to overcome this difficulty.

385:

386: We can see that the problem of hedged prediction

387: is intimately connected with the problem of testing randomness.

388: Different versions of the ``universal'' notion of randomness

389: were defined by Kolmogorov, Martin-L\"of and Levin (see, e.g., \cite{li/vitanyi:1997})

390: based on the existence of universal Turing machines.

391: Adapted to our current setting,

392: Martin-L\"of's definition is as follows.

393: Let $\mathbf{Z}$ be the set of all possible examples;

394: as each example consists of an object and a label,

395: $\mathbf{Z}=\mathbf{X}\times\mathbf{Y}$,

396: where $\mathbf{X}$ is the set of all possible objects

397: and $\mathbf{Y}$, $\left|\mathbf{Y}\right|>1$, is the set of all possible labels.

398: We will use $\mathbf{Z}^*$ as the notation for all finite sequences of examples.

399: A function $t:\mathbf{Z}^*\to[0,1]$

400: is a \emph{randomness test} if

401: \begin{enumerate}

402: \item

403:   for all $\epsilon\in(0,1)$, all $n\in\{1,2,\dots\}$

404:   and all probability distributions $P$ on $\mathbf{Z}$,

405:   \begin{equation}\label{eq:test-validity}

406:     P^n

407:     \left\{

408:       z\in\mathbf{Z}^n

409:       \st

410:       t(z)\le\epsilon

411:     \right\}

412:     \le

413:     \epsilon;

414:   \end{equation}

415: \item

416:   $t$ is upper semicomputable.

417: \end{enumerate}

418: The first condition means that the randomness test is required to be valid:

419: if, for example, we observe $t(z)\le1\%$ for our data set $z$,

420: then either the data set was not generated independently from the same probability distribution $P$

421: or a rare (of probability at most 1\%, under any $P$) event has occurred.

422: The second condition means that

423: we should be able to compute the test, in a weak sense

424: (we cannot require computability in the usual sense,

425: since the universal test can only be upper semicomputable:

426: it can work forever to discover \emph{all} patterns in the data sequence

427: that make it non-random).

428: Martin-L\"of (developing Kolmogorov's earlier ideas) proved

429: that there exists a smallest, to within a constant factor,

430: randomness test.
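
A degenerate example, given here only as an illustration,
shows why the smallest test is the one of interest.
The constant function $t(z):=1$ satisfies both conditions:
for every $\epsilon\in(0,1)$ the set of $z\in\mathbf{Z}^n$ with $t(z)\le\epsilon$ is empty,
so that
\begin{equation*}
  P^n
  \left\{
    z\in\mathbf{Z}^n
    \st
    t(z)\le\epsilon
  \right\}
  =
  0
  \le
  \epsilon,
\end{equation*}
and a constant function is computable and therefore upper semicomputable.
This test is valid but never detects any lack of randomness;
the smaller the values a valid test takes, the more it can detect.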

431:

432: Let us fix a smallest randomness test,

433: call it the \emph{universal test},

434: and call the value it takes on a data sequence

435: the \emph{randomness level} of this sequence.

436: A random sequence is one whose randomness level is not small;

437: this is rather informal,

438: but it is clear that for finite data sequences we cannot have a clear-cut division

439: of all sequences into random and non-random

440: (like the one defined by Martin-L\"of \cite{martin-lof:1966} for infinite sequences).

441: If $t$ is a randomness test, not necessarily universal,

442: the value that it takes on a data sequence will be called

443: the \emph{randomness level detected by} $t$.

444:

445: \begin{remark*}

446:   The word ``random'' is used in (at least) two different senses in the existing literature.

447:   In this paper we need both but, luckily,

448:   the difference does not matter within our current framework.

449:   First, randomness can refer to the assumption that the examples

450:   are generated independently from the same distribution;

451:   this is the origin of our ``assumption of randomness''.

452:   Second, a data sequence is said to be random with respect to a statistical model

453:   if the universal test (a generalization of the notion of universal test as defined above)

454:   does not detect any lack of conformity between the two.

455:   Since the only statistical model that interests us in this paper

456:   is the one embodying the assumption of randomness,

457:   we have a perfect agreement between the two senses.

458: \end{remark*}

459:

460: \subsection*{Prediction with Confidence and Credibility}

461:

462: Once we have a randomness test $t$, universal or not,

463: we can use it for hedged prediction.

464: There are two natural ways to package the results

465: of such predictions:

466: in this subsection we will describe the way that can only be used

467: in classification problems.

468: If the randomness test is not computable,

469: we can imagine an oracle answering questions about its values.

470:

471: Given the training set (\ref{eq:training-set}) and the test object $x_{l+1}$,

472: we can act as follows:

473: \begin{itemize}

474: \item

475:   consider all possible values $Y\in\mathbf{Y}$

476:   for the label $y_{l+1}$;

477: \item

478:   find the randomness level detected by $t$ for every possible completion (\ref{eq:completion});

479: \item

480:   predict the label $Y$ corresponding to a completion

481:   with the largest randomness level detected by $t$;

482: \item

483:   output as the \emph{confidence} in this prediction

484:   one minus the second largest randomness level detected by $t$;

485: \item

486:   output as the \emph{credibility} of this prediction

487:   the randomness level detected by $t$

488:   of the output prediction $Y$

489:   (i.e., the largest randomness level detected by $t$ over all possible labels).

490: \end{itemize}

491: To understand the intuition behind confidence,

492: let us tentatively choose a conventional ``significance level'', such as $1\%$.

493: (In the terminology of this paper, this corresponds to a ``confidence level'' of $99\%$,

494: i.e.,

495: $100\%$ minus $1\%$.)

496: If the confidence in our prediction is $99\%$ or more

497: and the prediction is wrong,

498: the actual data sequence belongs to an \emph{a priori} chosen

499: set of probability at most $1\%$

500: (the set of all data sequences with randomness level detected by $t$

501: not exceeding $1\%$).

502:

503: Intuitively, low credibility means that

504: either the training set is non-random

505: or the test object is not representative of the training set

506: (say, in the training set we have images of digits

507: and the test object is that of a letter).
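
As a purely illustrative example
(the numbers are hypothetical and not taken from any data set),
suppose that $\mathbf{Y}=\{1,2,3\}$
and that the randomness levels detected by $t$ for the three completions (\ref{eq:completion})
are $0.004$ for $Y=1$, $0.31$ for $Y=2$ and $0.012$ for $Y=3$.
The procedure above then gives
\begin{equation*}
  \hat y_{l+1}=2,
  \qquad
  \text{confidence}=1-0.012=0.988,
  \qquad
  \text{credibility}=0.31.
\end{equation*}
In words, every label other than the prediction is rejected at the significance level $1.2\%$,
while the predicted label itself conforms reasonably well to the randomness assumption.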

508:

509: \subsection*{Confidence Predictors}

510:

511: In regression problems,

512: confidence, as defined in the previous subsection,

513: is not a useful quantity:

514: it will typically be equal to 0.

515: A better approach is to choose a range of confidence levels $1-\epsilon$,

516: and for each of them specify a \emph{prediction set}

517: $\Gamma^{\epsilon}\subseteq\mathbf{Y}$,

518: the set of labels deemed possible at the confidence level $1-\epsilon$.

519: We will always consider nested prediction sets:

520: $\Gamma^{\epsilon_1}\subseteq\Gamma^{\epsilon_2}$ when $\epsilon_1\ge\epsilon_2$.

521: A \emph{confidence predictor} is a function

522: that maps each training set, each new object, and each confidence level $1-\epsilon$

523: (formally, we allow $\epsilon$ to take any value in $(0,1)$)

524: to the corresponding prediction set $\Gamma^{\epsilon}$.

525: For the confidence predictor to be \emph{valid}, the probability that the true label

526: will fall outside the prediction set $\Gamma^{\epsilon}$ should not exceed $\epsilon$,

527: for each $\epsilon$.

528:

529: We might, for example, choose the confidence levels 99\%, 95\% and 80\%,

530: and refer to the 99\% prediction set $\Gamma^{1\%}$ as the highly confident prediction,

531: to the 95\% prediction set $\Gamma^{5\%}$ as the confident prediction,

532: and to the 80\% prediction set $\Gamma^{20\%}$ as the casual prediction.

533: Figure \ref{fig:predset} shows how such a family of prediction sets might look

534: in the case of a rectangular label space $\mathbf{Y}$.

535: The casual prediction pinpoints the target quite well,

536: but we know that this kind of prediction can be wrong with probability 20\%.

537: The confident prediction is much bigger.

538: If we want to be highly confident

539: (make a mistake only with probability 1\%),

540: we must accept an even lower accuracy;

541: there is even a completely different location that we cannot rule out

542: at this level of confidence.

543: % In principle, a confidence predictor outputs prediction sets

544: % for all confidence levels, and these sets are nested,

545: % as in the figure above.

546:

547: \begin{figure}

548:   \centering

549:   \makebox{\includegraphics[width=\picturewidth,clip=true]{predset.eps}}

550:   \caption{\label{fig:predset}An example of a nested family of prediction sets

551:     (casual prediction in black,

552:     confident prediction in dark grey,

553:     and highly confident prediction in light grey).}

554: \end{figure}

555:

556: Given a randomness test, again universal or not,

557: we can define the corresponding confidence predictor as follows:

558: for any confidence level $1-\epsilon$,

559: the corresponding prediction set consists of the $Y$s

560: such that the randomness level of the completion (\ref{eq:completion})

561: detected by the test is greater than $\epsilon$.

562: The condition (\ref{eq:test-validity}) of validity for statistical tests

563: implies that a confidence predictor defined in this way

564: is always valid.
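
To spell out this implication
(the shorthand $z_i:=(x_i,y_i)$ is introduced here only for this sketch),
write the prediction set as
\begin{equation*}
  \Gamma^{\epsilon}
  =
  \left\{
    Y\in\mathbf{Y}
    \st
    t(z_1,\ldots,z_l,(x_{l+1},Y))>\epsilon
  \right\}.
\end{equation*}
The true label $y_{l+1}$ falls outside $\Gamma^{\epsilon}$
if and only if $t(z_1,\ldots,z_{l+1})\le\epsilon$.
Under the randomness assumption the sequence $z_1,\ldots,z_{l+1}$ is distributed as $P^{l+1}$,
and so, by (\ref{eq:test-validity}) with $n=l+1$,
\begin{equation*}
  P^{l+1}
  \left\{
    y_{l+1}\notin\Gamma^{\epsilon}
  \right\}
  =
  P^{l+1}
  \left\{
    z\in\mathbf{Z}^{l+1}
    \st
    t(z)\le\epsilon
  \right\}
  \le
  \epsilon.
\end{equation*}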

565:

566: The confidence predictor based on the universal test

567: (the \emph{universal confidence predictor})

568: is an interesting object for mathematical investigation

569: (see, e.g., \cite{vovk/etal:1999}, Section 4),

570: but it is not computable and so cannot be used in practice.

571: Our goal in the following sections will be

572: to find computable approximations to it.

573:

574: \section{Conformal Prediction}

575: \label{sec:conformal}

576:

577: % Practical approximation: conformal prediction (universal for invariant predictors)

578:

579: In the previous section we explained how randomness tests

580: can be used for prediction.

581: The connection between testing and prediction is, of course, well understood

582: and has been discussed at length by philosophers \cite{popper:1934}

583: and statisticians

584: (see, e.g., the textbook \cite{cox/hinkley:1974}, Section 7.5).

585: % In fact, this connection is two-way,

586: % so we do not lose anything basing our predictions on testing.

587: In this section we will see how some popular prediction algorithms

588: can be transformed into randomness tests

589: and, therefore, be used for producing hedged predictions.

590:

591: Let us start with the most successful recent development in machine learning,

592: support vector machines

593: (\cite{vapnik:1995,vapnik:1998},

594: with a key idea going back

595: to the generalized portrait method \cite{vapnik/chervonenkis:1974}).

596: Suppose the label space is $\mathbf{Y}=\{-1,1\}$

597: (we are dealing with the binary classification problem).

598: With each set of examples

599: \begin{equation}\label{eq:set}

600:   (x_1,y_1),

601:   \ldots,

602:   (x_n,y_n)

603: \end{equation}

604: one associates an optimization problem

605: whose solution produces nonnegative numbers $\alpha_1,\ldots,\alpha_n$

606: (``Lagrange multipliers'').

607: These numbers determine the prediction rule used by the support vector machine

608: (see \cite{vapnik:1998}, Chapter 10, for details),

609: but they also are interesting objects in their own right.

610: Each $\alpha_i$, $i=1,\ldots,n$, tells us

611: how ``strange'' an element of the set (\ref{eq:set})

612: the corresponding example $(x_i,y_i)$ is.

613: If $\alpha_i=0$, $(x_i,y_i)$ fits (\ref{eq:set}) very well

614: (in fact so well that such examples are uninformative,

615: and the support vector machine ignores them when making predictions).

616: The elements with $\alpha_i>0$ are called \emph{support vectors},

617: and a large value of $\alpha_i$ indicates

618: that the corresponding $(x_i,y_i)$ is an outlier.

619: % It is customary to impose an upper bound $C$ on the values of $\alpha_i$,

620: % one reason being to prevent the outliers affecting too much the prediction

621: % (the other to delimit the search space).

622:

623: Taking the completion (\ref{eq:completion}) as (\ref{eq:set})

624: (so that $n=l+1$),

625: we can find the corresponding $\alpha_1,\ldots,\alpha_{l+1}$.

626: If $Y$ is different from the actual label $y_{l+1}$,

627: we expect $(x_{l+1},Y)$ to be an outlier in (\ref{eq:completion})

628: and so $\alpha_{l+1}$ to be large compared with $\alpha_1,\ldots,\alpha_l$.

629: A natural way to compare $\alpha_{l+1}$ to the other $\alpha$s

630: is to look at the ratio

631: \begin{equation}\label{eq:p}

632:   p_Y

633:   :=

634:   \frac

635:   {

636:     \left|

637:       \{i=1,\ldots,l+1 \st \alpha_i\ge\alpha_{l+1}\}

638:     \right|

639:   }

640:   {l+1},

641: \end{equation}

642: which we call the \emph{p-value} associated with the possible label $Y$ for $x_{l+1}$.

643: In words, the p-value is the proportion of the $\alpha$s

644: which are at least as large as the last $\alpha$.
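
As a small worked example
(the values are hypothetical, chosen only to illustrate (\ref{eq:p})),
take $l=4$ and suppose that the completion with a given $Y$ yields
$(\alpha_1,\alpha_2,\alpha_3,\alpha_4)=(0,\,0.3,\,0,\,1.2)$ and $\alpha_5=0.7$.
Only $\alpha_4$ and $\alpha_5$ itself are at least as large as $\alpha_5$, so
\begin{equation*}
  p_Y
  =
  \frac
  {
    \left|
      \{4,5\}
    \right|
  }
  {5}
  =
  0.4.
\end{equation*}
Since $i=l+1$ always belongs to the set in the numerator,
the smallest possible p-value is $1/(l+1)$;
at the other extreme, $\alpha_{l+1}=0$ gives $p_Y=1$,
because all the $\alpha$s are nonnegative.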

645:

646: The methodology of support vector machines

647: (as described in \cite{vapnik:1995,vapnik:1998})

648: is directly applicable

649: only to binary classification problems,

650: but the general case can be reduced to the binary case

651: by the standard ``one-against-one'' or ``one-against-the-rest'' procedures.

652: This allows us to define the strangeness values $\alpha_1,\ldots,\alpha_{l+1}$

653: for general classification problems

654: (see \cite{vovk/etal:2005}, p.~59, for details),

655: which in turn determine the p-values (\ref{eq:p}).

656:

657: The function that assigns to each sequence (\ref{eq:completion})

658: the corresponding p-value, defined by (\ref{eq:p}),

659: is a randomness test

660: (this will follow from Theorem \ref{thm:on-line}

661: stated in Section \ref{sec:on-line} below).

662: Therefore, the p-values,

663: which are our approximations to the corresponding randomness levels,

664: can be used for hedged prediction

665: as described in the previous section.

666: For example, if the p-value $p_{-1}$ is small while $p_1$ is not small,

667: we can predict $1$ with confidence $1-p_{-1}$ and credibility $p_1$.

668: Typical credibility will be 1:

669: for most data sets the percentage of support vectors is small

670: (\cite{vapnik:1998}, Chapter 12),

671: and so we can expect $\alpha_{l+1}=0$ when $Y=y_{l+1}$.
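
To illustrate with hypothetical numbers
(not taken from any experiment),
suppose that $l=99$, $p_{-1}=0.02$ and $p_1=0.75$.
The prediction is then $1$, with confidence $1-0.02=0.98$ and credibility $0.75$.
Using the p-values in place of the randomness levels,
the prediction set at the confidence level $1-\epsilon$
consists of the labels $Y$ with $p_Y>\epsilon$, which gives
\begin{equation*}
  \Gamma^{1\%}=\{-1,1\},
  \qquad
  \Gamma^{5\%}=\{1\},
  \qquad
  \Gamma^{80\%}=\emptyset.
\end{equation*}
These sets are nested: they can only shrink as $\epsilon$ grows.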

672:

673: \begin{remark*}

674:   When the order of examples is irrelevant,

675:   we refer to the data set (\ref{eq:set}) as a set,

676:   although as a mathematical object it is a multiset rather than a set

677:   since it can contain several copies of the same example.

678:   We will continue to use this informal terminology

679:   (to be completely accurate,

680:   we would have to say ``data multiset'' instead of ``data set''!)

681: \end{remark*}

682:

683: % [This in fact demonstrates that $\mathbf{X}$ is large,

684: % not that it is high-dimensional]

685: % Already this data set can be used to illustrate the high-dimensional character

686: % of many modern data sets.

687: % Each object (handwritten digit) is a $16\times16$ grey-scale matrix,

688: % with 31 shades of grey,

689: % so there are $31^{16 \times 16}$ (approximately $10^{381}$)

690: % possible objects.

691: % This greatly exceeds the number of objects in the USPS data set, which is 9298.

692:

693: % Several kernels are used.

694: % The results show that the method works well in predicting classifications;

695: % in addition, of course,

696: % the method also provides valid and practically useful confidence information,

697: % in sharp contrast with typical PAC error bounds

698: % (valid but not useful)

699: % and Bayesian methods

700: % (usually not valid).

701:

702: \ifJOURNAL

703: \begin{table*}

704: \processtable{Selected test examples from the USPS data set:

705:   the p-values of digits (0--9), true and predicted labels,

706:   and confidence and credibility values.\label{tab:examples}}

707: %\begingroup\tiny

708: {\footnotesize\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}

709: \hline 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 &

710:   \vbox{\hbox{\strut true}\hbox{\strut label}} &

711:   \vbox{\hbox{\strut pre-}\hbox{\strut diction}} &

712:   \vbox{\hbox{\strut confi-}\hbox{\strut dence}} &

713:   \vbox{\hbox{\strut credi-}\hbox{\strut bility}}\\

714: \hline 0.01\% & 0.11\% & 0.01\% & 0.01\% & 0.07\% & 0.01\% & 100\% & 0.01\% & 0.01\% & 0.01\%

715:    & 6 & 6 & 99.89\% & 100\%\\

716: \hline 0.32\% & 0.38\% & 1.07\% & 0.67\% & 1.43\% & 0.67\% & 0.38\% & 0.33\% & 0.73\% & 0.78\%

717:    & 6 & 4 & 98.93\% & 1.43\%\\

718: \hline 0.01\% & 0.27\% & 0.03\% & 0.04\% & 0.18\% & 0.01\% & 0.04\% & 0.01\% & 0.12\% & 100\%

719:  & 9 & 9 & 99.73\% & 100\%\\

720: %\hline 100\% & 0.03\% & 0.01\% & 0.01\% & 0.04\% & 0.01\% & 0.01\% & 0.01\% & 0.01\% & 0.01\%

721: % & 0 & 0 & 99.96\% & 100\%\\

722: %\hline 0.04\% & 0.30\% & 0.05\% & 0.38\% & 0.29\% & 0.01\% & 0.08\% & 0.07\% & 0.40\% & 0.22\%

723: % & 2 & 8 & 99.62\% & 0.40\%\\

724: %\hline 0.01\% & 0.22\% & 0.03\% & 0.55\% & 0.16\% & 0.04\% & 0.03\% & 0.01\% & 0.04\% & 0.05\%

725: % & 3 & 3 & 99.78\% & 0.55\%\\

726: %\hline 0.04\% & 0.32\% & 0.10\% & 2.06\% & 0.29\% & 2.98\% & 0.04\% & 0.07\% & 0.37\% & 0.34\%

727: % & 3 & 5 & 97.94\% & 2.98\%\\

728: %\hline 0.30\% & 0.49\% & 0.43\% & 0.36\% & 1.28\% & 0.51\% & 0.29\% & 0.21\% & 0.38\% & 1.19\%

729: % & 4 & 4 & 98.81\% & 1.28\%\\

730: %\hline 0.01\% & 0.04\% & 0.01\% & 0.01\% & 0.03\% & 0.01\% & 0.01\% & 0.01\% & 0.01\% & 100\%

731: % & 9 & 9 & 99.96\% & 100\%\\

732: %\hline 0.01\% & 0.32\% & 0.04\% & 0.01\% & 0.26\% & 100\% & 0.01\% & 0.05\% & 0.11\% & 0.18\%

733: % & 5 & 5 & 99.68\% & 100\%\\

734: %\hline 0.41\% & 0.44\% & 0.27\% & 2.07\% & 0.70\% & 1.87\% & 0.23\% & 0.29\% & 0.44\% & 0.80\%

735: % & 5 & 3 & 98.13\% & 2.07\%\\

736: \hline

737: \end{tabular}}{}

738: %\endgroup

739: \end{table*}

740: \fi

741:

742: \ifnotJOURNAL

743: \begin{table*}

744: \caption{Selected test examples from the USPS data set:

745:   the p-values of digits (0--9), true and predicted labels,

746:   and confidence and credibility values.\label{tab:examples}}

747:

748: \medskip

749:

750: {\tiny\hspace{-12mm}\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}

751: \hline 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 &

752:   \vbox{\hbox{\strut true}\hbox{\strut label}} &

753:   \vbox{\hbox{\strut pre-}\hbox{\strut diction}} &

754:   \vbox{\hbox{\strut confi-}\hbox{\strut dence}} &

755:   \vbox{\hbox{\strut credi-}\hbox{\strut bility}}\\

756: \hline 0.01\% & 0.11\% & 0.01\% & 0.01\% & 0.07\% & 0.01\% & 100\% & 0.01\% & 0.01\% & 0.01\%

757:    & 6 & 6 & 99.89\% & 100\%\\

758: \hline 0.32\% & 0.38\% & 1.07\% & 0.67\% & 1.43\% & 0.67\% & 0.38\% & 0.33\% & 0.73\% & 0.78\%

759:    & 6 & 4 & 98.93\% & 1.43\%\\

760: \hline 0.01\% & 0.27\% & 0.03\% & 0.04\% & 0.18\% & 0.01\% & 0.04\% & 0.01\% & 0.12\% & 100\%

761:  & 9 & 9 & 99.73\% & 100\%\\

762: %\hline 100\% & 0.03\% & 0.01\% & 0.01\% & 0.04\% & 0.01\% & 0.01\% & 0.01\% & 0.01\% & 0.01\%

763: % & 0 & 0 & 99.96\% & 100\%\\

764: %\hline 0.04\% & 0.30\% & 0.05\% & 0.38\% & 0.29\% & 0.01\% & 0.08\% & 0.07\% & 0.40\% & 0.22\%

765: % & 2 & 8 & 99.62\% & 0.40\%\\

766: %\hline 0.01\% & 0.22\% & 0.03\% & 0.55\% & 0.16\% & 0.04\% & 0.03\% & 0.01\% & 0.04\% & 0.05\%

767: % & 3 & 3 & 99.78\% & 0.55\%\\

768: %\hline 0.04\% & 0.32\% & 0.10\% & 2.06\% & 0.29\% & 2.98\% & 0.04\% & 0.07\% & 0.37\% & 0.34\%

769: % & 3 & 5 & 97.94\% & 2.98\%\\

770: %\hline 0.30\% & 0.49\% & 0.43\% & 0.36\% & 1.28\% & 0.51\% & 0.29\% & 0.21\% & 0.38\% & 1.19\%

771: % & 4 & 4 & 98.81\% & 1.28\%\\

772: %\hline 0.01\% & 0.04\% & 0.01\% & 0.01\% & 0.03\% & 0.01\% & 0.01\% & 0.01\% & 0.01\% & 100\%

773: % & 9 & 9 & 99.96\% & 100\%\\

774: %\hline 0.01\% & 0.32\% & 0.04\% & 0.01\% & 0.26\% & 100\% & 0.01\% & 0.05\% & 0.11\% & 0.18\%

775: % & 5 & 5 & 99.68\% & 100\%\\

776: %\hline 0.41\% & 0.44\% & 0.27\% & 2.07\% & 0.70\% & 1.87\% & 0.23\% & 0.29\% & 0.44\% & 0.80\%

777: % & 5 & 3 & 98.13\% & 2.07\%\\

778: \hline

779: \end{tabular}}

780: %\endgroup

781: \end{table*}

782: \fi

783:

784: Table~\ref{tab:examples} illustrates the results of hedged prediction

785: for a popular data set of hand-written digits

786: called the USPS data set \cite{lecun/etal:1990}.

787: The data set contains 9298 digits, each represented as a $16\times16$ matrix of pixels;

788: it is divided into a training set of size 7291 and a test set of size 2007.

789: For several test examples the table shows

790: the p-values for each possible label, the actual label,

791: the predicted label, confidence, and credibility,

792: computed using the support vector method with the polynomial kernel of degree 5.

793: To interpret the numbers in this table,

794: remember that high (i.e., close to 100\%) confidence

795: means that all labels except the predicted one are unlikely.

796: If, say, the first example were predicted wrongly,

797: this would mean that a rare event (of probability less than 1\%) had occurred;

798: therefore, we expect the prediction to be correct (which it is).

799: In the case of the second example,

800: confidence is also quite high (more than 95\%),

801: but we can see that the credibility is low (less than 5\%).

802: From the confidence we can conclude that the labels other than 4

803: are excluded at level 5\%,

804: but the label 4 itself is also excluded at level 5\%.

805: This shows that the prediction algorithm

806: was unable to extract from the training set enough information

807: to allow us to confidently classify this example:

808: the strangeness of the labels different from 4 may be due

809: to the fact that the object itself is strange;

810: perhaps the test example is very different from all examples in the training set.

811: Unsurprisingly, the prediction for the second example is wrong.

812:

813: In general,

814: high confidence shows that all alternatives

815: to the predicted label are unlikely.

816: Low credibility means that the whole situation is suspect;

817: as we have already mentioned, we will obtain a very low credibility

818: if the new example is a letter (whereas all training examples are digits).

819: Credibility will also be low if the new example is a digit

820: written in an unusual way.

821: Notice that typically credibility will not be low

822: provided the data set was generated independently from the same distribution:

823: the probability that credibility

824: will not exceed some threshold $\epsilon$ (such as 1\%)

825: is at most $\epsilon$.

826: In summary,

827: we can trust a prediction if

828: (1) the confidence is close to 100\% and

829: (2) the credibility is not low (say, is not less than 5\%).
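
The bound on the probability of low credibility stated above
can be justified in one line
(a sketch, assuming that the examples are generated independently
from the same distribution
and writing $p_{y_{l+1}}$ for the p-value of the true label).
The credibility is the largest of the p-values $p_Y$, $Y\in\mathbf{Y}$,
and so is at least $p_{y_{l+1}}$; therefore,
\begin{equation*}
  \max_{Y\in\mathbf{Y}} p_Y \le \epsilon
  \quad\Longrightarrow\quad
  p_{y_{l+1}} \le \epsilon,
\end{equation*}
and the event on the right-hand side has probability at most $\epsilon$
because the p-values form a randomness test.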

830: % Table~\ref{tab:examples} gives credibility values typical

831: % when using support vector machines

832: % for computing p-values:

833: % credibility is exactly 100\% on a few occasions.

834: % This happens because most of the $\alpha$'s computed

835: % by support vector machines are zero.

836: % For many  other learning methods typical values of credibility

837: % are in the range 5\%--95\%.

838:

839: Many other prediction algorithms can be used as underlying algorithms

840: for hedged prediction.

841: For example, we can use the nearest neighbours technique to associate

842: \begin{equation}\label{eq:NN}

843:   \alpha_i

844:   :=

845:   \frac

846:   {\sum_{j=1}^k d_{ij}^+}

847:   {\sum_{j=1}^k d_{ij}^-},

848:   \quad

849:   i=1,\ldots,n,

850: \end{equation}

851: with the elements $(x_i,y_i)$ of the set (\ref{eq:set}),

852: where $d_{ij}^+$ is the $j$th shortest distance from $x_i$

853: to other objects labelled in the same way as $x_i$,

854: and $d_{ij}^-$ is the $j$th shortest distance

855: from $x_i$ to the objects labelled differently from $x_i$;

856: the parameter $k\in\{1,2,\dots\}$ in~(\ref{eq:NN})

857: is the number of nearest neighbours taken into account.

858: The distances can be computed in a feature space

859: (that is, the distance between $x\in\mathbf{X}$ and $x'\in\mathbf{X}$

860: can be understood as $\left\|F(x)-F(x')\right\|$,

861: $F$ mapping the object space $\mathbf{X}$ into a feature, typically Hilbert, space),

862: and so (\ref{eq:NN}) can also be used with the kernel nearest neighbours.

863:

864: The intuition behind (\ref{eq:NN}) is as follows:

865: a typical object $x_i$ labelled by, say, $y$

866: will tend to be surrounded by other objects labelled by $y$;

867: and if this is the case, the corresponding $\alpha_i$ will be small.

868: In the untypical case where objects labelled differently from $y$

869: are nearer to $x_i$ than objects labelled $y$,

870: $\alpha_i$ will be larger.

871: Therefore, the $\alpha$s reflect the strangeness of examples.
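
As a small illustration of (\ref{eq:NN})
(the distances are purely hypothetical and not taken from any data set),
let $k=1$ and suppose that the nearest object with the same label as $x_i$
is at distance $2$ from $x_i$,
whereas the nearest object with a different label is at distance $8$.
Then
\begin{equation*}
  \alpha_i = \frac{2}{8} = 0.25,
\end{equation*}
a small strangeness value.
If the two distances were interchanged,
we would obtain $\alpha_i = 8/2 = 4$,
reflecting the fact that $x_i$ would then lie closer to objects of another class.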

872:

873: The p-values computed by (\ref{eq:NN})

874: can again be used for hedged prediction.

875: % as described in Section \ref{sec:ideal}.

876: It is a general empirical fact that

877: the accuracy and reliability of the hedged predictions

878: are in line with the error rate of the underlying algorithm.

879: For example, in the case of the USPS data set,

880: the 1-nearest neighbour algorithm

881: (i.e., the one with $k=1$)

882: achieves an error rate of 2.2\%,

883: and the hedged predictions based on (\ref{eq:NN}) are highly confident

884: (achieve confidence of at least $99\%$)

885: for more than 95\% of the test examples.

886:

887: \subsection*{General Definition}

888:

889: The general notion of conformal predictor can be defined as follows.

890: A \emph{nonconformity measure} is a function that assigns

891: to every data sequence (\ref{eq:set}) a sequence of numbers

892: $\alpha_1,\ldots,\alpha_n$,

893: called \emph{nonconformity scores},

894: in such a way that interchanging any two examples $(x_i,y_i)$ and $(x_j,y_j)$

895: leads to the interchange of the corresponding nonconformity scores $\alpha_i$ and $\alpha_j$

896: (with all the other nonconformity scores unaffected).

897: The corresponding \emph{conformal predictor} maps each data set (\ref{eq:training-set}),

898: $l=0,1,\ldots$,

899: each new object $x_{l+1}$,

900: and each confidence level $1-\epsilon\in(0,1)$,

901: to the prediction set

902: \begin{equation}\label{eq:Gamma}

903:   \Gamma^{\epsilon}

904:   \left(

905:     x_1,y_1,\ldots,x_{l},y_{l},x_{l+1}

906:   \right)

907:   :=

908:   \left\{

909:     Y\in\mathbf{Y}

910:     \st

911:     p_Y

912:     >

913:     \epsilon

914:   \right\},

915: \end{equation}

916: where $p_Y$ are defined by (\ref{eq:p})

917: with $\alpha_1,\ldots,\alpha_{l+1}$ being the nonconformity scores

918: corresponding to the data sequence (\ref{eq:completion}).
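
To see how (\ref{eq:Gamma}) works on a toy example,
recall that (\ref{eq:p}) is the fraction of indices $i=1,\ldots,l+1$
with $\alpha_i\ge\alpha_{l+1}$
(the numbers below are hypothetical and serve only as an illustration).
Suppose $l=4$ and, for some label $Y$,
the nonconformity scores are
$(\alpha_1,\ldots,\alpha_5)=(0.2,1.5,0.7,0.4,0.9)$.
Only $\alpha_2$ and $\alpha_5$ itself are at least $\alpha_5=0.9$, so
\begin{equation*}
  p_Y = \frac{2}{5} = 0.4,
\end{equation*}
and $Y\in\Gamma^{\epsilon}$ if and only if $\epsilon<0.4$.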

919:

920: We have already remarked that associating with each completion (\ref{eq:completion})

921: the p-value (\ref{eq:p}) gives a randomness test;

922: this is true in general.

923: This implies that for each $l$ the probability of the event

924: \begin{equation*}

925:   y_{l+1}

926:   \in

927:   \Gamma^{\epsilon}

928:   \left(

929:     x_1,y_1,\ldots,x_{l},y_{l},x_{l+1}

930:   \right)

931: \end{equation*}

932: is at least $1-\epsilon$.

933:

934: This definition works for both classification and regression,

935: but in the case of classification we can summarize (\ref{eq:Gamma})

936: by two numbers:

937: the confidence

938: \begin{equation}\label{eq:conf}

939:   \sup

940:   \left\{

941:     1-\epsilon

942:     \st

943:     \left|

944:       \Gamma^{\epsilon}

945:     \right|

946:     \le

947:     1

948:   \right\}

949: \end{equation}

950: and the credibility

951: \begin{equation}\label{eq:cred}

952:   \inf

953:   \left\{

954:     \epsilon

955:     \st

956:     \left|

957:       \Gamma^{\epsilon}

958:     \right|

959:     =

960:     0

961:   \right\}.

962: \end{equation}
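
As a check on these definitions
(a sketch using the p-values in the second row of Table~\ref{tab:examples}),
notice that $\Gamma^{\epsilon}$ consists of the labels $Y$ with $p_Y>\epsilon$;
therefore, $\left|\Gamma^{\epsilon}\right|\le1$
exactly when $\epsilon$ is at least the second largest p-value,
and $\left|\Gamma^{\epsilon}\right|=0$
exactly when $\epsilon$ is at least the largest p-value.
In that row the largest p-value is $p_4=1.43\%$
and the second largest is $p_2=1.07\%$, which gives
\begin{equation*}
  \text{confidence} = 1 - 1.07\% = 98.93\%,
  \quad
  \text{credibility} = 1.43\%,
\end{equation*}
in agreement with the last two columns of the table.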

963:

964: \subsection*{Computationally Efficient Regression}

965:

966: As we have already mentioned,

967: the algorithms described so far

968: cannot be applied directly in the case of regression,

969: even if the randomness test is efficiently computable:

970: now we cannot consider all possible values $Y$ for $y_{l+1}$

971: since there are infinitely many of them.

972: However, there might still be computationally efficient

973: % (in the sense of required computational resources)

974: ways to find the prediction sets $\Gamma^{\epsilon}$.

975: The idea is that if $\alpha_i$ are defined as the residuals

976: \begin{equation}\label{eq:residual}

977:   \alpha_i

978:   :=

979:   \left|

980:     y_i - f_Y(x_i)

981:   \right|,

982: \end{equation}

983: where $f_Y:\mathbf{X}\to\bbbr$ is a regression function

984: fitted to the completed data set~(\ref{eq:completion}),

985: then $\alpha_i$ may have a simple expression in terms of $Y$,

986: leading to an efficient way of computing the prediction sets

987: (via (\ref{eq:p}) and (\ref{eq:Gamma})).

988: This idea was implemented in \cite{nouretdinov/etal:2001rr}

989: in the case where $f_Y$ is found from the ridge regression,

990: or kernel ridge regression, procedure,

991: with the resulting algorithm of hedged prediction

992: called the \emph{ridge regression confidence machine}.

993: For a much fuller description of the ridge regression confidence machine

994: (and its modifications in the case where (\ref{eq:residual})

995: are replaced by the fancier ``deleted'' or ``studentized'' residuals)

996: see \cite{vovk/etal:2005}, Section 2.3.
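
The following sketch indicates why the residuals (\ref{eq:residual})
have a simple expression in terms of $Y$ in the case of ridge regression.
The notation $Z$, $H$, $u$, $v$ is introduced only for this illustration;
we assume that the objects are column vectors in $\bbbr^p$,
that $a>0$ is the ridge parameter,
that the prime stands for transposition,
and that $I_k$ is the unit $k\times k$ matrix.
Let $Z$ be the $(l+1)\times p$ matrix with rows $x_1',\ldots,x_{l+1}'$
and set $H:=Z(Z'Z+aI_p)^{-1}Z'$.
The ridge regression predictions for the objects in the completed data set
are $H(y_1,\ldots,y_l,Y)'$,
and so the vector of residuals (before taking absolute values) is
\begin{equation*}
  (I_{l+1}-H)(y_1,\ldots,y_l,Y)'
  =
  u + vY,
\end{equation*}
where $u:=(I_{l+1}-H)(y_1,\ldots,y_l,0)'$
and $v:=(I_{l+1}-H)(0,\ldots,0,1)'$ do not depend on $Y$.
Therefore, $\alpha_i=\left|u_i+v_iY\right|$, $i=1,\ldots,l+1$,
and the p-value $p_Y$ can change only at the points $Y$
where $\left|u_i+v_iY\right|=\left|u_{l+1}+v_{l+1}Y\right|$ for some $i$;
unless the two sides coincide identically,
there are at most two such points for each $i$.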

997:

998: \section{Bayesian Approach to Conformal Prediction}

999: \label{sec:Bayesian}

1000:

1001: Bayesian methods have become very popular in both machine learning and statistics

1002: thanks to their power and versatility,

1003: and in this section we will see

1004: how Bayesian ideas can be used for designing efficient conformal predictors.

1005: We will only describe results of computer experiments

1006: (following \cite{melluish/etal:2001})

1007: with artificial data sets,

1008: since for real-world data sets there is no way

1009: to make sure that the Bayesian assumption is satisfied.

1010:

1011: Suppose $\mathbf{X}=\bbbr^p$

1012: (each object is a vector of $p$ real-valued attributes)

1013: and our model of the data-generating mechanism is

1014: \begin{equation}\label{eq:model}

1015:   y_i

1016:   =

1017:   w\cdot x_i

1018:   +

1019:   \xi_i,

1020:   \quad

1021:   i=1,2,\ldots,

1022: \end{equation}

1023: where $\xi_i$ are independent standard Gaussian random variables

1024: % (we use the notation $N(\mu,\sigma^2)$ for the Gaussian distribution

1025: % with mean $\mu$ and variance $\sigma^2$)

1026: and the weight vector $w\in\bbbr^p$ is distributed as $N(0,(1/a)I_p)$

1027: (we use the notation $I_p$ for the unit $p\times p$ matrix

1028: and $N(0,A)$ for the $p$-dimensional Gaussian distribution

1029: with mean $0$ and covariance matrix $A$);

1030: $a$ is a positive constant.

1031: % which we believe to be $1$.

1032: The actual data-generating mechanism used in our experiments

1033: will correspond to this model with $a$ set to 1.

1034:

1035: Under the model (\ref{eq:model}) the best (in the mean-square sense) fit

1036: to a data set (\ref{eq:set})

1037: is provided by the ridge regression procedure with parameter $a$

1038: (for details, see, e.g., \cite{vovk/etal:2005}, Section 10.3).

1039: Using the residuals (\ref{eq:residual}) with $f_Y$

1040: found by ridge regression with parameter $a$

1041: leads to an efficient conformal predictor

1042: which will be referred to as the ridge regression confidence machine with parameter $a$.

1043: Each prediction set output by the ridge regression confidence machine

1044: will be replaced by its convex hull,

1045: the corresponding \emph{prediction interval}.

1046:

1047: To test the validity and efficiency of the ridge regression confidence machine

1048: the following procedure was used.

1049: Ten times a vector $w\in\bbbr^5$ was independently generated from the distribution $N(0,I_5)$.

1050: For each of the 10 values of $w$,

1051: 100 training objects and 100 test objects

1052: were independently generated from the uniform distribution on $[-10,10]^5$

1053: and for each object $x$ its label $y$ was generated as $w\cdot x+\xi$,

1054: with all the $\xi$ standard Gaussian and independent.

1055: For each of the 1000 test objects and each confidence level $1-\epsilon$

1056: the prediction set $\Gamma^{\epsilon}$ for its label

1057: was found from the corresponding training set

1058: using the ridge regression confidence machine with parameter $a=1$.

1059: The solid line in Figure~\ref{fig:rrcm-errors} shows the confidence level

1060: against the percentage of test examples whose labels

1061: were not covered by the corresponding prediction intervals at that confidence level.

1062: Since conformal predictors are always valid,

1063: the percentage outside the prediction interval

1064: should never exceed 100\% minus the confidence level,

1065: up to statistical fluctuations,

1066: and this is confirmed by the picture.

1067:

1068: \begin{figure}

1069:   \centering

1070:   \makebox{\includegraphics[width=\picturewidth,clip=true]{rrcm_errors.eps}}

1071:   \caption{\label{fig:rrcm-errors}Validity for the ridge regression confidence machine.}

1072: \end{figure}

1073:

1074: A natural measure of efficiency of confidence predictors

1075: is the mean width of their prediction intervals,

1076: at different confidence levels:

1077: the narrower the prediction intervals an algorithm produces, the more efficient it is.

1078: The solid line in Figure~\ref{fig:rrcm-widths} shows

1079: the confidence level against the mean

1080: (over all test examples)

1081: width of the prediction intervals at that confidence level.

1082:

1083: \begin{figure}

1084:   \centering

1085:   \makebox{\includegraphics[width=\picturewidth,clip=true]{rrcm_widths.eps}}

1086:   \caption{\label{fig:rrcm-widths}Efficiency for the ridge regression confidence machine.}

1087: \end{figure}

1088:

1089: Since we know the data-generating mechanism,

1090: the approach via conformal prediction appears somewhat roundabout:

1091: for each test object we could instead find

1092: the conditional probability distribution of its label,

1093: which is Gaussian,

1094: and output as the prediction set $\Gamma^{\epsilon}$

1095: the shortest

1096: (i.e., centred at the mean of the conditional distribution)

1097: interval of conditional probability $1-\epsilon$.

1098: Figures \ref{fig:Bayes-errors} and \ref{fig:Bayes-widths}

1099: are the analogues of Figures \ref{fig:rrcm-errors} and \ref{fig:rrcm-widths}

1100: for this \emph{Bayes-optimal confidence predictor}.

1101: The solid line in Figure \ref{fig:Bayes-errors}

1102: demonstrates the validity of the Bayes-optimal confidence predictor.

1103:

1104: \begin{figure}

1105:   \centering

1106:   \makebox{\includegraphics[width=\picturewidth,clip=true]{bayes_errors.eps}}

1107:   \caption{\label{fig:Bayes-errors}Validity for the Bayes-optimal confidence predictor.}

1108: \end{figure}

1109:

1110: \begin{figure}

1111:   \centering

1112:   \makebox{\includegraphics[width=\picturewidth,clip=true]{bayes_widths.eps}}

1113:   \caption{\label{fig:Bayes-widths}Efficiency for the Bayes-optimal confidence predictor.}

1114: \end{figure}

1115:

1116: What is interesting is that the solid lines

1117: in Figures~\ref{fig:Bayes-widths} and \ref{fig:rrcm-widths}

1118: look exactly the same,

1119: taking account of the different scales of the vertical axes.

1120: The ridge regression confidence machine

1121: appears as good as the Bayes-optimal predictor.

1122: (This is a general phenomenon;

1123: it is also illustrated, in the case of classification,

1124: by the construction in Section 3.3 of \cite{vovk/etal:2005}

1125: of a conformal predictor that is asymptotically

1126: as good as the Bayes-optimal confidence predictor.)

1127:

1128: The similarity between the two algorithms disappears

1129: when they are given wrong values for $a$.

1130: For example,

1131: let us see what happens if we tell the algorithms

1132: that the expected value of $\|w\|$ is just $1\%$ of what it really is

1133: (this corresponds to taking $a=10000$).

1134: The ridge regression confidence machine stays valid

1135: (see the dashed line in Figure \ref{fig:rrcm-errors}),

1136: but its efficiency deteriorates

1137: (the dashed line in Figure \ref{fig:rrcm-widths}).

1138: The efficiency of the Bayes-optimal confidence predictor

1139: (the dashed line in Figure \ref{fig:Bayes-widths})

1140: is hardly affected,

1141: but its predictions become invalid

1142: (the dashed line in Figure \ref{fig:Bayes-errors}

1143: deviates significantly from the diagonal,

1144: especially for the most important large confidence levels:

1145: e.g., only about 15\% of labels fall within the 90\% prediction sets).

1146: The worst that can happen to the ridge regression confidence machine

1147: is that its predictions will become useless

1148: (but at least harmless),

1149: whereas the Bayes-optimal predictions can become misleading.
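
The correspondence between $a=10000$ and the factor of $1\%$
can be checked directly
(a short calculation under the model (\ref{eq:model});
the vector $z$ is introduced only for this illustration).
If $w$ is distributed as $N(0,(1/a)I_p)$,
we can write $w=a^{-1/2}z$ with $z$ distributed as $N(0,I_p)$,
so the expected value of $\|w\|$ is proportional to $a^{-1/2}$.
Replacing the true value $a=1$ by $a=10000$
multiplies this expected value by
\begin{equation*}
  10000^{-1/2} = 0.01 = 1\%.
\end{equation*}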

1150:

1151: Figures \ref{fig:rrcm-errors}--\ref{fig:Bayes-widths} also show the graphs

1152: for the intermediate value $a=1000$.

1153: Similar results but for different data sets

1154: are also given in \cite{vovk/etal:2005}, Section 10.3.

1155: A general scheme of Bayes-type conformal prediction

1156: is described in \cite{vovk/etal:2005}, pp.~102--103.

1157:

1158: \iffalse

1159: \begin{figure}

1160: \centering

1161:   \makebox{\includegraphics[width=0.25\picturewidth,clip=true]{autompg_errors.eps}}

1162:   \makebox{\includegraphics[width=0.25\picturewidth,clip=true]{autompg_widths.eps}}

1163:   \makebox{\includegraphics[width=0.25\picturewidth,clip=true]{boston_errors.eps}}

1164:   \makebox{\includegraphics[width=0.25\picturewidth,clip=true]{boston_widths.eps}}

1165: \caption{\label{fig:benchmarks}Bayesian RR and RRCM applied to Auto mpg and Boston housing benchmarks.}

1166: \end{figure}

1167:

1168: Figure~\ref{fig:benchmarks} extends these results

1169: to two benchmark data sets taken from the UCI machine learning repository,

1170: the auto-mpg data set and the Boston housing data set.

1171: For the benchmark data sets,

1172: the training and test examples were randomly drawn from the set of all data points.

1173: The ridge coefficient $a$ in Figure \ref{fig:benchmarks}

1174: is chosen so that a reasonable mean square error is obtained.

1175: The top graphs in the figure show

1176: that Bayesian Ridge Regression is overconfident on the auto-mpg dataset,

1177: predicting tolerance regions that are too narrow.

1178: The RRCM predicts valid tolerance regions,

1179: and the top right graph shows that to do so

1180: it gives wider tolerance regions than Bayesian Ridge Regression.

1181: On the Boston housing data set,

1182: Bayesian Ridge Regression is too conservative.

1183: The bottom left graph shows that its predicted tolerance regions are always valid;

1184: however, it also shows that they are much wider than those given by the RRCM.

1185: As the RRCM's tolerance regions are also valid,

1186: we prefer the more accurate RRCM's predictions.

1187:

1188: \textbf{These results probably do not make much sense

1189: since \cite{melluish/etal:2001} assumes the standard deviation $\sigma$ of $\xi_i$ known:

1190: $\sigma=1$.

1191: This assumption alone might lead to the gross inadequacies of the Bayesian method

1192: that show in Figure \ref{fig:benchmarks}.}

1193: \fi

1194:

1195: \section{On-line Prediction}

1196: \label{sec:on-line}

1197:

1198: % Properties in the on-line framework

1199:

1200: We know from Section \ref{sec:conformal}

1201: that conformal predictors are valid in the sense that the probability of error

1202: \begin{equation}\label{eq:error}

1203:   y_{l+1}

1204:   \notin

1205:   \Gamma^{\epsilon}

1206:   \left(

1207:     x_1,y_1,

1208:     \ldots,

1209:     x_l,y_l,

1210:     x_{l+1}

1211:   \right)

1212: \end{equation}

1213: at confidence level $1-\epsilon$

1214: never exceeds $\epsilon$.

1215: The word ``probability'' means ``unconditional probability'' here:

1216: the frequentist meaning of the statement that the probability of (\ref{eq:error})

1217: does not exceed $\epsilon$

1218: is that,

1219: if we repeatedly generate many sequences

1220: \begin{equation*}

1221:   x_1,y_1,\ldots,x_l,y_l,x_{l+1},y_{l+1},

1222: \end{equation*}

1223: the fraction of them satisfying (\ref{eq:error})

1224: will be at most $\epsilon$,

1225: to within statistical fluctuations.

1226: To say that we are controlling the number of errors

1227: would be an exaggeration

1228: because of the artificial character of this scheme

1229: of repeatedly generating a new training set and a new test example.

1230: Can we say that the confidence level $1-\epsilon$

1231: translates into a bound on the number of mistakes

1232: for a natural learning protocol?

1233: In this section we show that the answer is ``yes''

1234: for the popular on-line learning protocol,

1235: and in the next section we will see to what degree

1236: this carries over to other protocols.

1237:

1238: In on-line learning the examples are presented one by one.

1239: Each time, we observe the object and predict its label.

1240: Then we observe the label and go on to the next example.

1241: We start by observing the first object $x_1$ and predicting its label $y_1$.

1242: Then we observe $y_1$ and the second object $x_2$, and predict its label $y_2$.

1243: And so on.

1244: At the $n$th step,

1245: we have observed the previous examples

1246: $ %\begin{equation*}

1247:   (x_1,y_1),\dots,(x_{n-1},y_{n-1})

1248: $ %\end{equation*}

1249: and the new object $x_n$, and our task is to predict $y_n$.

1250: The quality of our predictions should improve

1251: as we accumulate more and more old examples.

1252: This is the sense in which we are learning.

1253:

1254: Our prediction for $y_n$ is a nested family of prediction sets

1255: $\Gamma_n^{\epsilon}\subseteq\mathbf{Y}$,

1256: $\epsilon\in(0,1)$.

1257: The process of prediction can be summarized by the following protocol:

1258:

1259: \medskip

1260:

1261: \noindent\textsc{On-line prediction protocol}

1262: \ifJOURNAL

1263:   \newcommand{\Indent}{\quad}

1264: \fi

1265: \ifnotJOURNAL

1266:   \newcommand{\Indent}{\quad\enspace}

1267:

1268:   \smallskip

1269:

1270: \fi

1271:

1272: \noindent

1273: \Indent$\Err_0^{\epsilon}:=0,\quad\epsilon\in(0,1)$;

1274:

1275: \noindent

1276: \Indent$\Mult_0^{\epsilon}:=0,\quad\epsilon\in(0,1)$;

1277:

1278: \noindent

1279: \Indent$\Emp_0^{\epsilon}:=0,\quad\epsilon\in(0,1)$;

1280:

1281: \noindent

1282: \Indent FOR $n=1,2,\ldots$:

1283:

1284: \noindent

1285: \Indent\Indent Reality outputs $x_n\in\mathbf{X}$;

1286:

1287: \noindent

1288: \Indent\Indent Predictor outputs $\Gamma_n^{\epsilon}\subseteq\mathbf{Y}$ for all $\epsilon\in(0,1)$;

1289:

1290: \noindent

1291: \Indent\Indent Reality outputs $y_n\in\mathbf{Y}$;

1292:

1293: \noindent

1294: \Indent\Indent$\err_n^{\epsilon}

1295:   :=

1296:   \left\{

1297:     \begin{array}{ll}

1298:       1 & \text{if $y_n \notin \Gamma_n^{\epsilon}$}\\

1299:       0 & \text{otherwise},

1300:     \end{array}

1301:   \right.

1302:   \quad

1303:   \epsilon\in(0,1)$;

1304:

1305: \noindent

1306: \Indent\Indent\strut$\Err_n^{\epsilon}:=\Err^{\epsilon}_{n-1}+\err_n^{\epsilon},

1307:   \quad

1308:   \epsilon\in(0,1)$;

1309:

1310: \noindent

1311: \Indent\Indent$\mult_n^{\epsilon}

1312:   :=

1313:   \left\{

1314:     \begin{array}{ll}

1315:       1 & \text{if $\left|\Gamma_n^{\epsilon}\right|>1$}\\

1316:       0 & \text{otherwise},

1317:     \end{array}

1318:   \right.

1319:   \quad

1320:   \epsilon\in(0,1)$;

1321:

1322: \noindent

1323: \Indent\Indent\strut$\Mult_n^{\epsilon}:=\Mult_{n-1}^{\epsilon}+\mult_n^{\epsilon},

1324:   \quad

1325:   \epsilon\in(0,1)$;

1326:

1327: \noindent

1328: \Indent\Indent$\emp_n^{\epsilon}

1329:   :=

1330:   \left\{

1331:     \begin{array}{ll}

1332:       1 & \text{if $\left|\Gamma_n^{\epsilon}\right|=0$}\\

1333:       0 & \text{otherwise},

1334:     \end{array}

1335:   \right.

1336:   \quad

1337:   \epsilon\in(0,1)$;

1338:

1339: \noindent

1340: \Indent\Indent\strut$\Emp_n^{\epsilon}:=\Emp_{n-1}^{\epsilon}+\emp_n^{\epsilon},

1341:   \quad

1342:   \epsilon\in(0,1)$

1343:

1344: \noindent

1345: \Indent END FOR.

1346:

1347: \medskip

1348:

1349: \noindent

1350: As we said, the family $\Gamma_n^{\epsilon}$

1351: is assumed nested:

1352: $\Gamma_n^{\epsilon_1}\subseteq\Gamma_n^{\epsilon_2}$ when $\epsilon_1\ge\epsilon_2$.

1353: In this protocol we also record the cumulative numbers

1354: $\Err_n^{\epsilon}$ of erroneous prediction sets,

1355: $\Mult_n^{\epsilon}$ of \emph{multiple} prediction sets

1356: (i.e., prediction sets containing more than one label)

1357: and $\Emp_n^{\epsilon}$ of empty prediction sets

1358: at each confidence level $1-\epsilon$.

1359: We will discuss the significance of each of these numbers in turn.
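
One immediate consequence of nestedness may be worth spelling out
(a short derivation that uses only the definitions in the protocol).
If $\epsilon_1\ge\epsilon_2$,
then $\Gamma_n^{\epsilon_1}\subseteq\Gamma_n^{\epsilon_2}$,
so $y_n\notin\Gamma_n^{\epsilon_2}$ implies $y_n\notin\Gamma_n^{\epsilon_1}$
and $\left|\Gamma_n^{\epsilon_1}\right|\le\left|\Gamma_n^{\epsilon_2}\right|$.
Summing over the steps, we obtain
\begin{equation*}
  \Err_n^{\epsilon_1}\ge\Err_n^{\epsilon_2},
  \quad
  \Mult_n^{\epsilon_1}\le\Mult_n^{\epsilon_2},
  \quad
  \Emp_n^{\epsilon_1}\ge\Emp_n^{\epsilon_2}
\end{equation*}
for all $n$:
lowering the confidence level can only increase the numbers of errors
and empty predictions
and can only decrease the number of multiple predictions.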

1360:

1361: The number of erroneous predictions is a measure of validity of our confidence predictors:

1362: we would like to have $\Err_n^{\epsilon}\le\epsilon n$,

1363: up to statistical fluctuations.

1364: In Figure~\ref{fig:CP0err} we can see the lines $n\mapsto\Err_n^{\epsilon}$

1365: for one particular conformal predictor

1366: and for three confidence levels $1-\epsilon$:

1367: the solid line for 99\%, the dash-dot line for 95\%, and the dotted line for 80\%.

1368: The number of errors made grows linearly,

1369: and the slope is approximately

1370: 20\% for the confidence level 80\%,

1371: 5\% for the confidence level 95\%,

1372: and 1\% for the confidence level 99\%.

1373: We will see below that this is not accidental.
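
To get a feeling for the scale
(simple arithmetic, assuming that the slopes were exactly equal to $\epsilon$),
the final numbers of errors after all $n=9298$ examples would be about
\begin{equation*}
  0.20\times9298\approx1860,
  \quad
  0.05\times9298\approx465,
  \quad
  0.01\times9298\approx93
\end{equation*}
at the confidence levels 80\%, 95\%, and 99\%, respectively.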

1374:

1375: \begin{figure}

1376:   \centering

1377:   \makebox{\includegraphics[width=\picturewidth]{CP0err.eps}}

1378:   \caption{\label{fig:CP0err}Cumulative numbers of errors for a conformal predictor

1379:     (the 1-nearest neighbour conformal predictor)

1380:     run in the on-line mode on the USPS data set

1381:     (9298 hand-written digits, randomly permuted)

1382:     at the confidence levels 80\%, 95\% and 99\%.}

1383: \end{figure}

1384:

1385: The number of multiple predictions $\Mult_n^{\epsilon}$

1386: is a useful measure of efficiency in the case of classification:

1387: we would like as many as possible of our predictions to be singletons.

1388: Figure \ref{fig:TCM975} shows the cumulative numbers of errors

1389: $n\mapsto\Err_n^{2.5\%}$ (solid line)

1390: and multiple predictions

1391: $n\mapsto\Mult_n^{2.5\%}$ (dotted line)

1392: at the fixed confidence level 97.5\%.

1393: We can see that out of approximately 10,000 predictions

1394: about 250 (approximately 2.5\%) were errors

1395: and about 300 (approximately 3\%) were multiple predictions.

1396:

1397: \begin{figure}

1398:   \centering

1399:   \makebox{\includegraphics[width=\picturewidth]{TCM0_975bwF.eps}}

1400:   \caption{\label{fig:TCM975}The on-line performance of the 1-nearest neighbour conformal predictor

1401:     at the confidence level 97.5\% on the USPS data set (randomly permuted).}

1402: \end{figure}

1403:

1404: We can see that by choosing $\epsilon$ we are able to control the number of errors.

1405: For small $\epsilon$

1406: (relative to the difficulty of the data set)

1407: this may sometimes force the predictor to give

1408: multiple predictions.

1409: On the other hand,

1410: for larger $\epsilon$ this might lead to empty predictions at some steps,

1411: as can be seen from the bottom right corner of Figure \ref{fig:TCM975}:

1412: when the predictor ceases to make multiple predictions

1413: it starts making occasional empty predictions

1414: (the dash-dot line).

1415: An empty prediction is a warning that the object to be predicted is unusual

1416: (the credibility, as defined in Section \ref{sec:ideal}, is $\epsilon$ or less).

1417:

1418: It would be a mistake to concentrate exclusively on one confidence level $1-\epsilon$.

1419: If the prediction $\Gamma_n^{\epsilon}$ is empty,

1420: this does not mean that we cannot make any prediction at all:

1421: we should just shift our attention to other confidence levels

1422: (perhaps look at the range of $\epsilon$ for which $\Gamma_n^{\epsilon}$ is a singleton).

1423: Likewise, $\Gamma_n^{\epsilon}$ being multiple

1424: does not mean that all labels in $\Gamma_n^{\epsilon}$ are equally likely:

1425: slightly increasing $\epsilon$ might lead to the removal of some labels.

1426: Of course,

1427: taking in the continuum of prediction sets, for all $\epsilon\in(0,1)$,

1428: might be too difficult or tiresome for a human mind,

1429: and concentrating on a few conventional levels,

1430: as in Figure \ref{fig:predset},

1431: might be a reasonable compromise.

1432:

1433: \ifJOURNAL

1434: \begin{table*}

1435: \processtable{A selected test example from a data set of hospital records of patients

1436:   who suffered acute abdominal pain \cite{gammerman/thatcher:1992}:

1437:   the p-values for the nine possible diagnostic groups

1438:   (appendicitis APP, diverticulitis DIV, perforated peptic ulcer PPU,

1439:   non-specific abdominal pain NAP, cholecystitis CHO, intestinal obstruction INO,

1440:   pancreatitis PAN, renal colic RCO, dyspepsia DYS)

1441:   and the true label.\label{tab:abdominal}}

1442: %\begingroup\tiny

1443: {\footnotesize\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}

1444: \hline APP & DIV & PPU & NAP & CHO & INO & PAN & RCO & DYS & true label\\

1445: \hline 1.23\% & 0.36\% & 0.16\% & 2.83\% & 5.72\% & 0.89\% & 1.37\% & 0.48\% & 80.56\% & DYS\\

1446: \hline

1447: \end{tabular}}{}

1448: %\endgroup

1449: \end{table*}

1450: \fi

1451:

1452: \ifnotJOURNAL

1453: \begin{table*}

1454: \caption{A selected test example from a data set of hospital records of patients

1455:   who suffered acute abdominal pain \cite{gammerman/thatcher:1992}:

1456:   the p-values for the nine possible diagnostic groups

1457:   (appendicitis APP, diverticulitis DIV, perforated peptic ulcer PPU,

1458:   non-specific abdominal pain NAP, cholecystitis CHO, intestinal obstruction INO,

1459:   pancreatitis PAN, renal colic RCO, dyspepsia DYS)

1460:   and the true label.\label{tab:abdominal}}

1461:

1462: \medskip

1463:

1464: {\footnotesize\hspace{-2mm}\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}

1465: \hline APP & DIV & PPU & NAP & CHO & INO & PAN & RCO & DYS & true label\\

1466: \hline 1.23\% & 0.36\% & 0.16\% & 2.83\% & 5.72\% & 0.89\% & 1.37\% & 0.48\% & 80.56\% & DYS\\

1467: \hline

1468: \end{tabular}}{}

1469: %\endgroup

1470: \end{table*}

1471: \fi

1472:

1473: % Typical output: example 5 (correctly predicted)

1474: %

1475: % Real class = Dyspepsia (8) [starting from 0 rather than 1, as in the paper]

1476: % Predicted class = Dyspepsia (8)

1477: %

1478: % p-values for each class:

1479: % Class 0: Appendicitis = 0.012306289881494986

1480: % Class 1: Diverticulitis = 0.0036463081130355514

1481: % Class 2: Perforated peptic ulcer = 0.0015952597994530538

1482: % Class 3: Non-specific abdominal pain = 0.028258887876025523

1483: % Class 4: Cholecystitis = 0.057201458523245215

1484: % Class 5: Intestinal obstruction = 0.008887876025524157

1485: % Class 6: Pancreatitis = 0.013673655423883319

1486: % Class 7: Renal colic = 0.004785779398359161

1487: % Class 8: Dyspepsia = 0.8056061987237921

1488:

1489: For example, Table \ref{tab:abdominal}

1490: gives the p-values for different kinds of abdominal pain

1491: obtained for a specific patient based on his symptoms.

1492: % check his sex with Sasha!

1493: We can see that at the confidence level 95\% the prediction set

1494: is multiple,

1495: $\{$cholecystitis, dyspepsia$\}$.

1496: When we relax the confidence level to 90\%,

1497: the prediction set narrows down to $\{$dyspepsia$\}$

1498: (the singleton containing only the true label);

1499: on the other hand,

1500: at the confidence level 99\% the prediction set widens to

1501: $\{$appendicitis, non-specific abdominal pain, cholecystitis, pancreatitis, dyspepsia$\}$.

1502: Such detailed confidence information,

1503: in combination with the property of validity,

1504: is especially valuable in medicine

1505: (and some of the first applications of conformal predictors

1506: have been to the fields of medicine and bioinformatics:

1507: see, e.g., \cite{bellotti/etal:2005,shahmuradov/etal:2005}).
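
To spell out the arithmetic behind this example
(a sketch assuming that (\ref{eq:Gamma}) keeps exactly the labels $Y$ with $p_Y>\epsilon$),
the three prediction sets are obtained by thresholding the p-values
of Table \ref{tab:abdominal}:
\begin{align*}
  \epsilon=10\%: &\quad \{\text{DYS}\}
    && (80.56\%>10\%\ge5.72\%),\\
  \epsilon=5\%: &\quad \{\text{CHO},\text{DYS}\}
    && (5.72\%>5\%\ge2.83\%),\\
  \epsilon=1\%: &\quad \{\text{APP},\text{NAP},\text{CHO},\text{PAN},\text{DYS}\}
    && (1.23\%>1\%\ge0.89\%).
\end{align*}
Under the same assumption the prediction set is the singleton $\{\text{DYS}\}$
exactly when $5.72\%\le\epsilon<80.56\%$,
that is, at every confidence level from just above 19.44\% up to 94.28\%.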

1508:

1509: In the case of regression,

1510: we will usually have $\Mult_n^{\epsilon}=n$ and $\Emp_n^{\epsilon}=0$,

1511: and so these are not useful measures of efficiency.

1512: Better measures,

1513: such as the ones used in the previous section,

1514: would, e.g., take into account the widths of the prediction intervals.

1515:

1516: \subsection*{Theoretical Analysis}

1517:

1518: Looking at Figures \ref{fig:CP0err} and \ref{fig:TCM975}

1519: we might be tempted to guess that the probability of error

1520: at each step of the on-line protocol

1521: is $\epsilon$

1522: and that errors are made independently at different steps.

1523: This is not literally true,

1524: as a closer examination of the bottom left corner of Figure \ref{fig:TCM975} reveals.

1525: It does, however, become true

1526: (as noticed in \cite{vovk:2002})

1527: if the p-values (\ref{eq:p}) are redefined as

1528: \begin{equation}\label{eq:p-smoothed}

1529:   p_Y

1530:   :=

1531:   \frac

1532:   {

1533:     \left|

1534:       \{i \st \alpha_i>\alpha_{l+1}\}

1535:     \right|

1536:     +

1537:     \eta

1538:     \left|

1539:       \{i \st \alpha_i=\alpha_{l+1}\}

1540:     \right|

1541:   }

1542:   {l+1},

1543: \end{equation}

1544: where $i$ ranges over $\{1,\ldots,l+1\}$

1545: and $\eta\in[0,1]$ is generated randomly from the uniform distribution on $[0,1]$

1546: (the $\eta$s should be independent of each other and of everything else;

1547: in practice they are produced by pseudo-random number generators).

1548: The only difference between (\ref{eq:p}) and (\ref{eq:p-smoothed})

1549: is that the expression (\ref{eq:p-smoothed}) takes more care in breaking the ties

1550: $\alpha_i=\alpha_{l+1}$.

1551: Replacing (\ref{eq:p}) by (\ref{eq:p-smoothed})

1552: in the definition of conformal predictor

1553: we obtain the notion of \emph{smoothed conformal predictor}.
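
A small worked instance may make the tie-breaking concrete
(the numbers are hypothetical,
and we assume that (\ref{eq:p}) counts the $i$ with $\alpha_i\ge\alpha_{l+1}$).
Let $l+1=5$ and
$(\alpha_1,\ldots,\alpha_5)=(0.9,0.7,0.7,0.1,0.7)$,
so that one score exceeds $\alpha_5$ and three scores
(including $\alpha_5$ itself) are tied with it.
Then
\begin{equation*}
  \text{(\ref{eq:p}) gives }
  \frac{1+3}{5}=0.8,
  \qquad
  \text{(\ref{eq:p-smoothed}) gives }
  \frac{1+3\eta}{5}\in[0.2,0.8],
\end{equation*}
e.g., $0.5$ for $\eta=0.5$.
The smoothed p-value never exceeds the unsmoothed one,
and the two coincide when $\eta=1$.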

1554:

1555: The validity property for smoothed conformal predictors can now be stated as follows.

1556: \begin{theorem}\label{thm:on-line}

1557:   Suppose the examples

1558:   \begin{equation*}

1559:     (x_1,y_1),(x_2,y_2),\ldots

1560:   \end{equation*}

1561:   are generated independently

1562:   from the same distribution.

1563:   For any smoothed conformal predictor working in the on-line prediction protocol

1564:   and any confidence level $1-\epsilon$,

1565:   the random variables $\err_1^{\epsilon},\err_2^{\epsilon},\ldots$

1566:   are independent and take value 1 with probability $\epsilon$.

1567: \end{theorem}

1568:

1569: Combining Theorem \ref{thm:on-line}

1570: with the strong law of large numbers

1571: we can see that

1572: \begin{equation*}

1573:   \lim_{n\to\infty}

1574:   \frac{\Err_n^{\epsilon}}{n}

1575:   =

1576:   \epsilon

1577: \end{equation*}

1578: holds with probability one for smoothed conformal predictors.

1579: (They are ``well calibrated''.)

1580: Since the number of mistakes made by a conformal predictor

1581: never exceeds the number of mistakes

1582: made by the corresponding smoothed conformal predictor,

1583: \begin{equation*}

1584:   \limsup_{n\to\infty}

1585:   \frac{\Err_n^{\epsilon}}{n}

1586:   \le

1587:   \epsilon

1588: \end{equation*}

1589: holds with probability one for conformal predictors.

1590: (They are ``conservatively well calibrated''.)
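
Theorem \ref{thm:on-line} also tells us how large the ``statistical fluctuations'' can be
for smoothed conformal predictors:
being a sum of $n$ independent indicators, each equal to 1 with probability $\epsilon$,
$\Err_n^{\epsilon}$ is binomial with mean $\epsilon n$
and standard deviation $\sqrt{\epsilon(1-\epsilon)n}$.
As an illustration (our arithmetic, not an experimental result),
for $\epsilon=5\%$ and $n=9298$
\begin{equation*}
  \epsilon n\approx465,
  \qquad
  \sqrt{\epsilon(1-\epsilon)n}
  =
  \sqrt{0.05\times0.95\times9298}
  \approx21,
\end{equation*}
so deviations of a few dozen errors from the line of slope $\epsilon$
would not be surprising.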

1591:

1592: \section{Slow teachers, lazy teachers, and the batch setting}

1593: \label{sec:slow}

1594:

1595: % Lazy and slow teachers; batch and mixtures on-line/batch

1596:

1597: In the pure on-line setting, considered in the previous section,

1598: we get immediate feedback (the true label) for every example that we predict.

1599: This makes practical applications of this scenario questionable.

1600: Imagine, for example, a mail sorting centre

1601: using an on-line prediction algorithm

1602: for zip code recognition;

1603: suppose the feedback about the ``true'' label comes from a human ``teacher''.

1604: If the feedback is given for every object $x_i$,

1605: there is no point in having the prediction algorithm:

1606: we can just as well use the label provided by the teacher.

1607: It would help if the prediction algorithm could still work well,

1608: in particular be valid,

1609: if only every, say, tenth object were classified by a human teacher

1610: (the scenario of ``lazy'' teachers).

1611: Alternatively,

1612: even if the prediction algorithm requires the knowledge of all labels,

1613: it might still be useful if the labels were allowed to be given not immediately

1614: but with a delay (``slow'' teachers).

1615: In our mail sorting example,

1616: such a delay might make sure that we hear

1617: from local post offices about any mistakes made

1618: before giving feedback to the algorithm.

1619:

1620: In the pure on-line protocol we had validity in the strongest possible sense:

1621: at each confidence level $1-\epsilon$ each smoothed conformal predictor

1622: made errors independently with probability $\epsilon$.

1623: In the case of weaker teachers

1624: (as usual, we are using the word ``teacher'' in the general sense of the entity

1625: providing the feedback,

1626: called Reality in the previous section),

1627: we have to accept a weaker notion of validity.

1628: Suppose the predictor receives feedback from the teacher

1629: at the end of steps $n_1,n_2,\ldots$,

1630: $n_1<n_2<\cdots$;

1631: the feedback is the label of one of the objects that the predictor

1632: has already seen (and predicted).

1633: This scheme \cite{ryabko/etal:2003} covers both slow and lazy teachers

1634: (as well as teachers who are both slow and lazy).

1635: It was proved in \cite{nouretdinov/vovk:2003}

1636: (see also \cite{vovk/etal:2005}, Theorem 4.2)

1637: that the smoothed conformal predictors

1638: (using only the examples with known labels)

1639: remain valid in the sense

1640: \begin{equation*}

1641:   \forall\epsilon\in(0,1):

1642:   \Err_n^{\epsilon}/n\to\epsilon

1643:   \text{ in probability}

1644: \end{equation*}

1645: if and only if $n_k/n_{k-1}\to1$ as $k\to\infty$.

1646: In other words,

1647: the validity in the sense of convergence in probability holds

1648: if and only if the growth rate of $n_k$ is subexponential.

1649: (This condition is amply satisfied for our example

1650: of a teacher giving feedback for every tenth object.)
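
To illustrate the criterion on a few feedback schedules
(chosen here purely for illustration):
\begin{align*}
  n_k=10k: &\quad \frac{n_k}{n_{k-1}}=\frac{k}{k-1}\to1
    && \text{(valid)},\\
  n_k=k^2: &\quad \frac{n_k}{n_{k-1}}=\frac{k^2}{(k-1)^2}\to1
    && \text{(valid)},\\
  n_k=2^k: &\quad \frac{n_k}{n_{k-1}}=2\not\to1
    && \text{(not valid)}.
\end{align*}
In the second schedule the gaps between consecutive feedbacks grow without bound,
and validity is nevertheless retained;
it is lost only when the gaps grow as fast as $n_k$ itself, as in the third.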

1651:

1652: \iffalse

1653: Below are two examples of ``weak'' (slow and lazy) teachers at 99\%

1654: confidence using the well-known NIST data set.

1655:

1656: \begin{figure}

1657:   \centering

1658:   \makebox{\includegraphics[width=\picturewidth]{tcmSlow10.eps}}

1659:   \caption{\label{fig:slow teachers}An example of a Slow Teacher Predictor

1660:     with a delay of 10 examples on the NIST data set.}

1661: \end{figure}

1662:

1663: \begin{figure}

1664:   \centering

1665:   \makebox{\includegraphics[width=\picturewidth]{tcmLazyAP10.eps}}

1666:   \caption{\label{fig:lazy}An example of a Lazy Teacher Predictor

1667:     with delays following an arithmetic progression with coefficient 10 on the NIST data set.}

1668: \end{figure}

1669: \fi

1670:

1671: The most standard \emph{batch} setting of the problem of prediction

1672: is in one respect even more demanding than our scenarios of weak teachers.

1673: In this setting we are given a training set (\ref{eq:training-set})

1674: and our goal is to predict the labels

1675: given the objects in the test set

1676: \begin{equation}\label{eq:test-set}

1677:   (x_{l+1},y_{l+1}),\ldots,(x_{l+k},y_{l+k}).

1678: \end{equation}

1679: This can be interpreted as a finite-horizon version

1680: of the lazy-teacher setting:

1681: no labels are returned after step $l$.

1682: Computer experiments (see, e.g., Figure \ref{fig:batch-errors})

1683: show that approximate validity still holds;

1684: for related theoretical results,

1685: see \cite{vovk/etal:2005}, Section 4.4.

1686:

1687: \begin{figure}

1688:   \centering

1689:   \makebox{\includegraphics[width=\picturewidth]{TCM_test_errors_bw.eps}}

1690:   \caption{\label{fig:batch-errors}Cumulative numbers of errors made on the test set

1691:     by the 1-nearest neighbour conformal predictor

1692:     used in the batch mode on the USPS data set

1693:     (randomly permuted and split into a training set of size 7291 and a test set of size 2007)

1694:     at the confidence levels 80\%, 95\% and 99\%.}

1695: \end{figure}

1696:

1697: \section{Induction and transduction}

1698: \label{sec:induction-transduction}

1699:

1700: % Transductive vs. inductive inference

1701:

1702: Vapnik's \cite{vapnik:1995,vapnik:1998}

1703: distinction between induction and transduction,

1704: as applied to the problem of prediction,

1705: is depicted in Figure \ref{fig:trans}.

1706: In \emph{inductive prediction}

1707: we first move from examples in hand to some more or less general rule,

1708: which we might call a prediction or decision rule,

1709: a model, or a theory;

1710: this is the \emph{inductive step}.

1711: When presented with a new object,

1712: we derive a prediction from the general rule;

1713: this is the \emph{deductive step}.

1714: In \emph{transductive prediction},

1715: we take a shortcut,

1716: moving from the old examples directly

1717: to the prediction about the new object.

1718:

1719: \begin{figure}

1720:   \centering

1721:   \input{trans.pic}

1722:   \caption{\label{fig:trans}Inductive and transductive prediction.}

1723: \end{figure}

1724:

1725: Typical examples of the inductive step

1726: are estimating parameters in statistics

1727: and finding an approximating function

1728: in statistical learning theory.

1729: Examples of transductive prediction

1730: are estimation of future observations in statistics

1731: (\cite{cox/hinkley:1974}, Section 7.5, \cite{takeuchi:1975})

1732: and nearest neighbours algorithms

1733: in machine learning.

1734:

1735: In the case of simple (i.e., traditional, not hedged) predictions

1736: the distinction between induction and transduction

1737: is less than crisp.

1738: A method for doing transduction,

1739: in the simplest setting of predicting one label,

1740: is a method for predicting $y_{l+1}$

1741: from (\ref{eq:training-set}) and $x_{l+1}$.

1742: Such a method gives a prediction for any object

1743: that might be presented as $x_{l+1}$, and so it defines,

1744: at least implicitly, a rule,

1745: which might be extracted from the training set (\ref{eq:training-set}) (induction),

1746: stored, and then subsequently applied to $x_{l+1}$ to predict $y_{l+1}$ (deduction).

1747: So any real distinction lies at the practical and computational level:

1748: do we extract and store the general rule or not?

1749:

1750: For hedged predictions the difference between induction and transduction goes deeper.

1751: We will typically want different notions of hedged prediction

1752: in the two frameworks.

1753: Mathematical results about induction usually involve two parameters,

1754: often denoted $\epsilon$ (the desired accuracy of the prediction rule)

1755: and $\delta$ (the probability of achieving the accuracy of $\epsilon$),

1756: whereas results about transduction involve only one parameter,

1757: which we denote $\epsilon$ in this paper

1758: (the probability of error we are willing to tolerate);

1759: see Figure \ref{fig:trans}.

1760: For a review of inductive prediction

1761: from this point of view, see \cite{vovk/etal:2005}, Section 10.1.

1762:

1763: \section{Inductive conformal predictors}

1764: \label{sec:ICP}

1765:

1766: % Computational issues: inductive conformal predictors

1767:

1768: Our approach to prediction is thoroughly transductive,

1769: and this is what makes valid and efficient hedged prediction possible.

1770: In this section we will see, however,

1771: that there is also room for an element of induction

1772: in conformal prediction.

1773:

1774: Let us take a closer look at the process of conformal prediction,

1775: as described in Section \ref{sec:conformal}.

1776: Suppose we are given a training set (\ref{eq:training-set})

1777: and the objects in a test set (\ref{eq:test-set}),

1778: and our goal is to predict the label of each test object.

1779: If we want to use the conformal predictor based on the support vector method,

1780: as described in Section \ref{sec:conformal},

1781: we will have to find the set of the Lagrange multipliers

1782: for each test object and for each potential label $Y$ that can be assigned to it.

1783: This would involve solving

1784: $k\left|\mathbf{Y}\right|$ essentially independent optimization problems.

1785: Using the nearest neighbours approach

1786: is typically more computationally efficient,

1787: but even it is much slower than the following procedure,

1788: suggested in \cite{papadopoulos/etal:2002a,papadopoulos/etal:2002b}.

1789:

1790: Suppose we have an inductive algorithm which,

1791: given a training set (\ref{eq:training-set}) and a new object $x$,

1792: outputs a prediction $\hat y$ for $x$'s label $y$.

1793: Fix some measure $\Delta(y,\hat y)$ of difference between $y$ and $\hat y$.

1794: The procedure is:

1795: \begin{enumerate}

1796: \item

1797:   Divide the original training set (\ref{eq:training-set})

1798:   into two subsets:

1799:   the \emph{proper training set}

1800:   $(x_1,y_1),\ldots,(x_m,y_m)$

1801:   and the \emph{calibration set}

1802:   $(x_{m+1},y_{m+1}),\ldots,(x_l,y_l)$.

1803: \item

1804:   Construct a prediction rule $F$ from the proper training set.

1805: \item

1806:   Compute the nonconformity score

1807:   \begin{equation*}

1808:     \alpha_i:=\Delta(y_i,F(x_i)),

1809:     \quad

1810:     i=m+1,\ldots,l,

1811:   \end{equation*}

1812:   for each example in the calibration set.

1813: \item

1814:   For every test object $x_i$,

1815:   $i=l+1,\ldots,l+k$,

1816:   do the following:

1817:   \begin{enumerate}

1818:   \item

1819:     for every possible label $Y\in\mathbf{Y}$

1820:     compute the nonconformity score $\alpha_i:=\Delta(Y,F(x_i))$

1821:     and the p-value

1822:     \begin{equation*}

1823:       p_Y

1824:       :=

1825:       \frac

1826:       {

1827:         \#\{j\in\{m+1,\ldots,l,i\} \st \alpha_j\ge\alpha_i\}

1828:       }

1829:       {l-m+1};

1830:     \end{equation*}

1831:   \item

1832:     output the prediction sets

1833:     $

1834:       \Gamma^{\epsilon}

1835:       \left(

1836:         x_1,y_1,\ldots,x_{l},y_{l},x_{i}

1837:       \right)

1838:     $

1839:     given by the right-hand side of (\ref{eq:Gamma}).

1840:   \end{enumerate}

1841: \end{enumerate}

1842: This is a special case of ``inductive conformal predictors'',

1843: as defined in \cite{vovk/etal:2005}, Section 4.1.

1844: In the case of classification,

1845: of course,

1846: we could package the p-values as a simple prediction

1847: complemented with confidence (\ref{eq:conf}) and credibility (\ref{eq:cred}).
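
As a toy illustration of step 4 of the procedure (with hypothetical numbers),
suppose the calibration set contains $l-m=4$ examples
with nonconformity scores $0.2$, $0.5$, $0.9$ and $1.4$.
If a postulated label $Y$ for a test object $x_i$ gives $\alpha_i=0.9$,
the calibration scores $0.9$ and $1.4$ are counted together with $\alpha_i$ itself, and
\begin{equation*}
  p_Y=\frac{3}{5}=0.6;
\end{equation*}
if another postulated label gives $\alpha_i=2.0$,
only $\alpha_i$ itself is counted and $p_Y=1/5=0.2$.
In general $p_Y\ge1/(l-m+1)$,
so (assuming that (\ref{eq:Gamma}) keeps the labels with $p_Y>\epsilon$)
no label can be excluded unless $l-m\ge1/\epsilon-1$;
e.g., the confidence level 99\% calls for at least 99 calibration examples.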

1848:

1849: Inductive conformal predictors are valid in the sense that

1850: the probability of error

1851: \begin{equation*}

1852:   y_{i}

1853:   \notin

1854:   \Gamma^{\epsilon}

1855:   \left(

1856:     x_1,y_1,

1857:     \ldots,

1858:     x_l,y_l,

1859:     x_{i}

1860:   \right)

1861: \end{equation*}

1862: ($i=l+1,\ldots,l+k$, $\epsilon\in(0,1)$)

1863: never exceeds $\epsilon$

1864: (cf.\ (\ref{eq:error})).

1865: The on-line version of inductive conformal predictors,

1866: with a stronger notion of validity,

1867: is described in \cite{vovk:2002}

1868: and \cite{vovk/etal:2005} (Section 4.1).

1869:

1870: The main advantage of inductive conformal predictors

1871: is their computational efficiency:

1872: the bulk of the computations is performed only once,

1873: and what remains to do for each test example

1874: is to apply the prediction rule found at the inductive step,

1875: to apply $\Delta$ to find the nonconformity score $\alpha$ for this example,

1876: and to find the position of $\alpha$ among the nonconformity scores

1877: of the calibration examples.

1878: The main disadvantage is a possible loss of prediction efficiency:

1879: for conformal predictors,

1880: we can effectively use the whole training set

1881: as both the proper training set and the calibration set, whereas an inductive conformal predictor has to split it between the two.
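
To put rough numbers on the computational side of this comparison
(our illustration, using the batch USPS split of Figure \ref{fig:batch-errors}
and taking $\left|\mathbf{Y}\right|=10$ for the ten digits):
with $k=2007$ test objects,
a conformal predictor based on the support vector method would have to solve
\begin{equation*}
  k\left|\mathbf{Y}\right|=2007\times10=20{,}070
\end{equation*}
optimization problems,
whereas an inductive conformal predictor constructs the prediction rule $F$ once
and then computes $l-m$ calibration scores
and $k\left|\mathbf{Y}\right|$ test scores,
each requiring only an evaluation of $F$ and $\Delta$.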

1882:

1883: \section{Conclusion}

1884: \label{sec:conclusion}

1885:

1886: This paper shows how many machine-learning techniques

1887: can be complemented with provably valid measures

1888: of accuracy and reliability.

1889: We explained briefly how this can be done

1890: for support vector machines, nearest neighbours algorithms,

1891: and the ridge regression procedure,

1892: but the principle is general:

1893: virtually any (we are not aware of exceptions) successful prediction technique

1894: designed to work under the randomness assumption

1895: can be used to produce equally successful hedged predictions.

1896: Further examples are given in our recent book \cite{vovk/etal:2005}

1897: (joint with Glenn Shafer),

1898: where we construct conformal predictors and inductive conformal predictors

1899: based on nearest neighbours regression, logistic regression,

1900: bootstrap, decision trees, boosting, and neural networks;

1901: general schemes for constructing conformal predictors

1902: and inductive conformal predictors

1903: are given on pp.~28--29 and on pp.~99--100 of \cite{vovk/etal:2005},

1904: respectively.

1905: Replacing the original simple predictions with hedged predictions

1906: enables us to control the number of errors made

1907: by appropriately choosing the confidence level.

1908:

1909: \section*{Acknowledgements}

1910:

1911: This work is partially supported by MRC

1912: (grant % G0301107

1913: ``Pro\-te\-o\-mic analysis of the human serum pro\-te\-ome'')

1914: and the Royal Society

1915: (grant ``Efficient pseudo-random number generators'').

1916:

1917: \begin{thebibliography}{99}

1918:

1919: \bibitem{bellotti/etal:2005}

1920:   Bellotti, T., Luo, Z., Gammerman, A., van Delft, F.~W.\ and Saha, V.\ (2005)

1921:   Qualified predictions for microarray and proteomics pattern diagnostics with confidence machines.

1922:   \emph{International Journal of Neural Systems}, \textbf{15}, 247--258.

1923:   Yang, Z.~R.\ and Dalby, A.~R.\ (eds),

1924:   Special Issue on Bioinformatics.

1925: \bibitem{cesabianchi/lugosi:2006}

1926:   Cesa-Bianchi, N.\ and Lugosi, G.\ (2006)

1927:   \emph{Prediction, Learning, and Games}.

1928:   Cambridge University Press, Cambridge.

1929: \bibitem{cox/hinkley:1974}

1930:   Cox, D.~R.\ and Hinkley, D.~V.\ (1974)

1931:   \emph{Theoretical Statistics}.

1932:   Chapman and Hall, London.

1933: % \bibitem{gammerman/etal:1998}

1934: %   A.~Gammerman, V.~N.~Vapnik and V.~Vovk,

1935: %   Learning by transduction,

1936: %   in: G.~F.~Cooper and S.~Moral, eds.,

1937: %   \emph{Proceedings of the Fourteenth Conference

1938: %   on Uncertainty in Artificial Intelligence}

1939: %   (Morgan Kaufmann, San Francisco, CA, 1998)

1940: %   148--156.

1941: \bibitem{gammerman/thatcher:1992}

1942:   Gammerman, A.\ and Thatcher, A.~R.\ (1992)

1943:   Bayes\-ian diagnostic probabilities without assuming in\-de\-pen\-dence of symptoms.

1944:   \emph{Yearbook of Medical In\-for\-mat\-ics}, pp.~323--330.

1945: \bibitem{lecun/etal:1990}

1946:    LeCun, Y., Boser, B., Denker, J.~S., Henderson, D., How\-ard, R.~E.,

1947:    Hubbard, W.\ and Jackel, L.~J.\ (1990)

1948:    Handwritten digit recognition with a back-propagation network.

1949:    In \emph{Advances in Neural Information Processing Systems 2},

1950:    pp.~396--404,

1951:    Morgan Kaufmann, San Ma\-teo, CA.

1952: \bibitem{li/vitanyi:1997}

1953:   Li, M.\ and Vit\'anyi, P.\ (1993)

1954:   \emph{An Introduction to Kolmogorov Complexity and Its Applications}.

1955:   Springer, New York.

1956:   Second edition: 1997.

1957: \bibitem{martin-lof:1966}

1958:   Martin-L\"of, P.\ (1966)

1959:   The definition of random sequences.

1960:   \emph{Information and Control}, \textbf{9}, 602--619.

1961: \bibitem{melluish/etal:2001}

1962:   Melluish, T., Saunders, C., Nouretdinov, I.\ and Vovk, V.\ (2001)

1963:   Comparing the Bayes and typicalness frameworks.

1964:   In De Raedt, L.\ and Flach, P.\ (eds),

1965:   \emph{Machine Learning: ECML 2001,

1966:   Proceedings of the Twelfth European Conference on Machine Learning,

1967:   LNAI}, \textbf{2167}, pp.~360--371,

1968:   Springer, Heidelberg.

1969:   Full version published as Technical Report TR-01-05,

1970:   Computer Learning Research Centre,

1971:   Royal Holloway, University of London.

1972: % (can be downloaded from \texttt{http://www.clrc.rhul.ac.uk}).

1973: \bibitem{nouretdinov/etal:2001rr}

1974:    Nouretdinov, I., Melluish, T.\ and Vovk, V.\ (2001)

1975:    Ridge Regression Confidence Machine.

1976:    In \emph{Proceedings of the Eighteenth International Conference

1977:    on Machine Learning}, pp.~385--392,

1978:    Morgan Kaufmann, San Fran\-cis\-co, CA.

1979: \bibitem{nouretdinov/vovk:2003}

1980:   Nouretdinov, I.\ and Vovk, V.\ (2003)

1981:   Criterion of calibration for transductive confidence machine with limited feedback.

1982:   In Gavald\`a, R., Jantke, K.~P.\ and Takimoto, E.\ (eds),

1983:   \emph{Proceedings of the Fourteenth International Conference on Algorithmic Learning Theory,

1984:   LNAI}, \textbf{2842}, pp.~259--267,

1985:   Springer, Berlin.

1986:   To appear in \emph{Theoretical Computer Science}

1987:   (special issue devoted to the ALT'2003 conference).

1988: % \bibitem{nouretdinov/etal:2001de}

1989: %   I.~Nouretdinov, V.~Vovk, M.~Vyugin and A.~Gammerman,

1990: %   Pattern recognition and density estimation under the general iid assumption,

1991: %   in: D.~Helmbold and B.~Williamson, eds.,

1992: %   \emph{Proceedings of the Fourteenth Annual Conference

1993: %   on Computational Learning Theory

1994: %   and Fifth European Conference

1995: %   on Computational Learning Theory},

1996: %   \emph{Lecture Notes in Artificial Intelligence},

1997: %   \textbf{2111} (2001) 337--353;

1998: %   Full version published as a CLRC technical report

1999: %   (can be downloaded from \texttt{http://www.clrc.rhul.ac.uk}).

2000: \bibitem{papadopoulos/etal:2002a}

2001:   Papadopoulos, H., Proedrou, K., Vovk, V.\ and Gammerman, A.\ (2002)

2002:   Inductive Confidence Machines for regression.

2003:   In Elomaa, T., Mannila, H.\ and Toivonen, H.\ (eds),

2004:   \emph{Machine Learning: ECML 2002,

2005:   Proceedings of the Thirteenth European Conference on Machine Learning,

2006:   LNCS}, \textbf{2430}, pp.~345--356,

2007:   Springer, Berlin.

2008: \bibitem{papadopoulos/etal:2002b}

2009:   Papadopoulos, H., Vovk, V.\ and Gammerman, A.\ (2002)

2010:   Qualified predictions for large data sets in the case of pattern recognition.

2011:   In \emph{Proceedings of the International Conference on Machine Learning and Applications

2012:   (ICMLA'2002)}, pp.~159--163,

2013:   CSREA Press.

2014: \bibitem{popper:1934}

2015:   Popper, K.~R.\ (1934)

2016:   \emph{Logik der Forschung}.

2017:   Springer, Vienna.

2018:   English translation (1959):

2019:   \emph{The Logic of Sci\-en\-tif\-ic Discovery},

2020:   Hutchinson, London.

2021: % \bibitem{proedrou/etal:2002}

2022: %   Proedrou, K., Papadopoulos, H., Vovk, V.\ and Gammerman, A.\ (2002)

2023: %   Nearest Neighbours Transductive Confidence Machine,

2024: %   in: \emph{Proceedings of the Artificial Intelligence and Statistics Conference}

2025: \bibitem{ryabko/etal:2003}

2026:   Ryabko, D., Vovk, V.\ and Gammerman, A.\ (2003)

2027:   Online prediction with real teachers.

2028:   Technical Report CS-TR-03-09, Department of Computer Science,

2029:   Royal Holloway, University of London.

2030: % \bibitem{saunders/etal:1999}

2031: %   C.~Saunders, A.~Gammerman and V.~Vovk,

2032: %   Transduction with confidence and credibility,

2033: %   in: \emph{Proceedings of the Sixteenth International Joint Conference

2034: %   on Artificial Intelligence}

2035: %   (Morgan Kaufmann, 1999)

2036: %   722--726.

2037: % \bibitem{scholkopf/etal:1999}

2038: %   B.~Sch\"olkopf, C.~J.~C.~Burges and A.~J.~Smola, eds.,

2039: %   \emph{Advances in Kernel Methods, Support Vector Learning}

2040: %   (MIT Press, 1999).

2041: \bibitem{shahmuradov/etal:2005}

2042:   Shahmuradov, I.~A., Solovyev, V.~V.\ and Gammerman, A.\ (2005)

2043:   Plant promoter prediction with confidence estimation.

2044:   \emph{Nucleic Acids Research}, \textbf{33}, 1069--1076.

2045: \bibitem{sutton/barto:1998}

2046:   Sutton, R.~S.\ and Barto, A.~G.\ (1998)

2047:   \emph{Reinforcement Learning: An Introduction}.

2048:   MIT Press, Cambridge, MA.

2049: \ifLATIN

2050:   \bibitem{takeuchi:1975}

2051:     Takeuchi, K.\ (1975)

2052:     \emph{Statistical Pre\-dic\-tion Theory} (in Japanese).

2053:     Baih\=ukan, Tokyo.

2054: \fi

2055: \ifnotLATIN

2056:   \bibitem{takeuchi:1975}

2057:     Takeuchi, K.\ (1975)

2058:     \begin{CJK*}{UTF8}{min}統計的予測論\end{CJK*}

2059:     (\emph{Statistical Pre\-dic\-tion Theory}).

2060:     Baih\=ukan, Tokyo.

2061: \fi

2062: \bibitem{vapnik:1995}

2063:   Vapnik, V.~N.\ (1995)

2064:   \emph{The Nature of Statistical Learning Theory}.

2065:   Springer, New York.

2066:   Second edition: 2000.

2067: \bibitem{vapnik:1998}

2068:   Vapnik, V.~N.\ (1998)

2069:   \emph{Statistical Learning Theory}.

2070:   Wiley, New York.

2071: \ifLATIN

2072:   \bibitem{vapnik/chervonenkis:1974}

2073:     Vapnik, V.~N.\ and Chervonenkis, A.~Y.\ (1974)

2074:     \emph{Theory of Pattern Rec\-og\-ni\-tion} (in Russian).

2075:     Nauka, Moscow.

2076:     German translation (1979): \emph{Theorie der Zeichenerkennung},

2077:     Akademie, Berlin.

2078: \fi

2079: \ifnotLATIN

2080:   \bibitem{vapnik/chervonenkis:1974}

2081:     Vapnik, V.~N.\ and Chervonenkis, A.~Y.\ (1974)

2082:     \begin{cyr}Te\-o\-ri{ya}\ ras\-po\-zna\-va\-ni{ya}\

2083:     ob\-ra\-zov\end{cyr} (\emph{Theory of Pattern Rec\-og\-ni\-tion}).

2084:     Nauka, Moscow.

2085:     German translation (1979): \emph{Theorie der Zeichenerkennung},

2086:     Akademie, Berlin.

2087: \fi

2088: \bibitem{vovk/etal:1999}

2089:   Vovk, V., Gammerman, A.\ and Saunders, C.\ (1999)

2090:   Machine-learning applications of algorithmic ran\-dom\-ness.

2091:   In Bratko, I.\ and D\v{z}eroski, S.\ (eds),

2092:   \emph{Proceedings of the Sixteenth International Conference on Machine Learning},

2093:   pp.~444--453,

2094:   Morgan Kaufmann, San Fran\-cis\-co, CA.

2095: \bibitem{vovk:2001}

2096:   Vovk, V.\ (2001)

2097:   Competitive on-line statistics.

2098:   \emph{International Statistical Review}, \textbf{69}, 213--248.

2099: \bibitem{vovk:2002}

2100:   Vovk, V.\ (2002)

2101:   On-line Confidence Machines are well-calibrated.

2102:   In \emph{Proceedings of the Forty-Third Annual Symposium on Foundations of Computer Science},

2103:   pp.~187--196,

2104:   IEEE Computer Society, Los Alamitos, CA.

2105: \bibitem{vovk/etal:2005}

2106:   Vovk, V., Gammerman, A.\ and Shafer, G.\ (2005)

2107:   \emph{Al\-go\-rith\-mic Learning in a Random World}.

2108:   Springer, New York.

2109: \end{thebibliography}

2110: \end{document}

2111:

2112:

2113: Remove:

2114:

2115: \emergencystretch=5mm

2116: \tolerance=400

2117: \allowdisplaybreaks[3]

2118:

2119: \newcommand{\Vladimir}{Vladimir }

2120: \newcommand{\DOT}{.}

2121: \newcommand{\zzrelax}[1]{}

2122:

2123: \DeclareMathAlphabet{\mathbfit}{OT1}{cmr}{bx}{it}	% description: LATEX companion, pp.177 and 181

2124:

2125: \newcommand{\st}{\mathrel{:}}

2126: \newcommand{\given}{\mathrel{|}}

2127:

2128: \newcommand{\bbbr}{\mathbb{R}}		% real numbers

2129: \newcommand{\bbbc}{\mathbb{C}}		% complex numbers

2130: \newcommand{\bbbq}{\mathbb{Q}}		% rational numbers

2131: \newcommand{\bbbn}{\mathbb{N}}		% natural numbers

2132: \newcommand{\III}{\mathbb{I}}		% indicator

2133: \newcommand{\bbbp}{\mathbb{P}}		% auxiliary (probability)

2134: \newcommand{\bbbe}{\mathbb{E}}		% auxiliary (expectation)

2135: \newcommand{\K}{\mathcal{K}}		% capital

2136: \newcommand{\FFF}{\mathcal{F}}		% sigma-algebra

2137: \newcommand{\GGG}{\mathcal{G}}		% sigma-algebra

2138: \newcommand{\PPP}{\mathcal{P}}		% statistical model

2139:

2140: \newcommand{\Prob}{\mathop{\bbbp}\nolimits}

2141: \newcommand{\Expect}{\mathop{\bbbe}\nolimits}

2142: %\newcommand{\LP}{\mathop{\underline{\bbbp}}\nolimits}

2143: %\newcommand{\UP}{\mathop{\overline{\bbbp}}\nolimits}

2144: %\newcommand{\ULP}{\mathop{\overline{\underline{\bbbp}}}\nolimits}

2145: \newcommand{\sign}{\mathop{{\rm sign}}\nolimits}

2146: \newcommand{\var}{\mathop{{\rm var}}\nolimits}

2147: \newcommand{\co}{\mathop{{\rm co}}\nolimits}

2148: \newcommand{\rank}{\mathop{{\rm rank}}\nolimits}

2149: \newcommand{\err}{\mathop{{\rm err}}\nolimits}

2150: \newcommand{\Err}{\mathop{{\rm Err}}\nolimits}

2151: \newcommand{\length}{\mathop{{\rm length}}\nolimits}

2152: \newcommand{\lth}{\mathop{{\rm lth}}\nolimits}

2153: \newcommand{\Lth}{\mathop{{\rm Lth}}\nolimits}

2154:

2155: \newenvironment{Proof}[1]

2156:   {\trivlist\item[\hskip\labelsep\textbf{Proof #1}]}

2157:   {\endtrivlist}

2158: \newcommand{\boxforqed}{\rule{.3em}{1.5ex}}

2159: \newcommand{\qedtext}{\unskip\nobreak\hfil

2160:   \penalty50\hskip1em\null\nobreak\hfil\boxforqed

2161:   \parfillskip=0pt\finalhyphendemerits=0\endgraf}

2162: \newcommand{\qedmath}{\eqno\boxforqed}

2163: \newtheorem{Remark}{Remark}

2164: \newenvironment{remark}

2165:   {\begin{Remark} \begingroup\rm}

2166:   {\endgroup \end{Remark}}

2167: \newenvironment{remark*}

2168:   {\trivlist\item[\hskip\labelsep{\bfseries Remark}]\relax}

2169:   {\endtrivlist}

2170:

2171: \begin{document}

2172: \label{firstpage}

2173: \maketitle

2174:

2175: \begin{abstract}

2176:   We consider the on-line predictive version

2177:   of the standard problem of linear regression;

2178:   the goal is to predict each consecutive response

2179:   given the corresponding explanatory variables

2180:   and all the previous observations.

2181:   The standard treatment of prediction in linear regression analysis

2182:   has two drawbacks:

2183:   (1) the usual prediction intervals

2184:   guarantee that the probability of error

2185:   is equal to the nominal significance level $\epsilon$,

2186:   but this property per se does not imply that the long-run frequency of error

2187:   is close to $\epsilon$;

2188:   (2) it is not suitable for prediction of complex systems

2189:   as it assumes that the number of observations

2190:   exceeds the number of parameters.

2191:   We state a general result showing that in the on-line protocol

2192:   the frequency of error does equal the nominal significance level,

2193:   up to statistical fluctuations,

2194:   and we describe alternative regression models

2195:   in which informative prediction intervals can be found

2196:   before the number of observations exceeds the number of parameters.

2197:   One of these models,

2198:   which only assumes that the observations are independent and identically distributed,

2199:   is popular in machine learning but

2200:   greatly underused in the statistical theory of regression.

2201: \end{abstract}

2202:

2203: \ifJOURNAL

2204:   \noindent

2205:   \textbf{Key words:}

2206:   Gauss linear model; independent identically distributed observations;

2207:   multivariate analysis; on-line protocol; prequential statistics; regression

2208: \fi

2209: