
1: \documentclass[namedreferences]{article}

2: \usepackage{theapa}

3: \usepackage{epsfig}

4: \usepackage{psfig}

5: \usepackage{xspace}

6: \usepackage{url}

7:

8: \usepackage{latexsym}           % This gives us the $\Box$ symbol

9: \usepackage{endnotes}           % For notes.

10:

11: \newcommand{\POS}{\texttt{\bf p}}

12: \newcommand{\NEG}{\texttt{\bf n}}

13: \newcommand{\YES}{\texttt{\bf Y}}

14: \newcommand{\NO}{\texttt{\bf N}}

15:

16: \newcommand{\rocch}{\textsc{rocch}}

17:

18: \newcommand{\IF}{\textbf{if~}}

19: \newcommand{\THEN}{\textbf{then~}}

20: \newcommand{\ELSE}{\textbf{else~}}

21: \newcommand{\ENDIF}{\textbf{end if}}

22: \newcommand{\ENDFOR}{\textbf{end for}}

23: \newcommand{\ENDWHILE}{\textbf{end while}}

24: \newcommand{\FOR}{\textbf{for~}}

25: \newcommand{\WHILE}{\textbf{while~}}

26: \newcommand{\DO}{\textbf{do~}}

27: \newcommand{\END}{\textbf{end~}}

28:

29: \newcommand{\EndProof}{$\Box$}

30:

31: \newtheorem{theorem}{Theorem}

32: \newtheorem{lemma}[theorem]{Lemma}

33: \newtheorem{corollary}[theorem]{Corollary}

34: \newtheorem{definition}{Definition}

35:

36: \newcommand{\about}{\symbol{126}}

37: \newcommand{\rem}[1]{\marginpar{\scriptsize $\rightarrow$ \raggedright #1}}

38:

39: \newcommand{\Partial}[2]{\frac{\partial #1}{\partial #2}}

40:

41: \newcommand{\mlc}{\ensuremath{\mathcal{MLC\hspace{-.05em}\raisebox{.4ex}{\tiny\bf ++}}}}

42: \def\CC{\mbox{C\hspace{-.05em}\raisebox{.4ex}{\tiny\bf ++}}}

43:

44: \graphicspath{

45:   {./}

46:   {./Figs/}

47:   }

48:

49: \setlength{\textwidth}{6.0in}

50: \setlength{\textheight}{9.2in}

51: \setlength{\oddsidemargin}{0.25in}

52: \setlength{\evensidemargin}{0.25in}

53: \setlength{\marginparwidth}{0in}

54: \setlength{\topmargin}{0in}

55: \addtolength{\voffset}{0.0in}

56: \setlength{\hoffset}{-0.25truein}

57:

58: \newcommand{\eg}{{e.g.},\xspace}

59: \newcommand{\ie}{{i.e.},\xspace}

60: \newcommand{\etal}{et al.\@\xspace}

61: \newcommand{\legit}{{\footnotesize \  }}

62: \newcommand{\fraud}{{\footnotesize \textsc{bandit}}}

63:

64: \begin{document}

65:

66: \centerline{\textbf{\Large Robust Classification for Imprecise Environments}}

67: \vspace{1ex}

68:

69: \begin{flushleft}

70:   Foster Provost \hfill \texttt{provost@acm.org}\\

71:   \hspace*{.1in}\textit{New York University, New York, NY 10012}\\

72:   Tom Fawcett \hfill \texttt{tfawcett@acm.org}\\

73:   \hspace*{.1in}\textit{Hewlett-Packard Laboratories, Palo Alto, CA 94304}\\

74:   \vspace*{.2in}

75: \end{flushleft}

76:

77: \begin{abstract}

78:   In real-world environments it usually is difficult to specify target

79:   operating conditions precisely, for example, target misclassification costs.

80:   This uncertainty makes building robust classification systems problematic.

81:   We show that it is possible to build a hybrid classifier that will perform

82:   at least as well as the best available classifier for any target conditions.

83:   In some cases, the performance of the hybrid actually can surpass that of

84:   the best known classifier.  This robust performance extends across a wide

85:   variety of comparison frameworks, including the optimization of metrics such

86:   as accuracy, expected cost, lift, precision, recall, and workforce

87:   utilization.  The hybrid also is efficient to build, to store, and to

88:   update.  The hybrid is based on a method for the comparison of classifier

89:   performance that is robust to imprecise class distributions and

90:   misclassification costs.  The ROC convex hull (\rocch) method combines

91:   techniques from ROC analysis, decision analysis and computational geometry,

92:   and adapts them to the particulars of analyzing learned classifiers.  The

93:   method is efficient and incremental, minimizes the management of classifier

94:   performance data, and allows for clear visual comparisons and sensitivity

95:   analyses.  Finally, we point to empirical evidence that a robust hybrid

96:   classifier indeed is needed for many real-world problems.

97: \end{abstract}

98:

99: \begin{flushleft}

100:   \textbf{Keywords:} classification, learning, uncertainty, evaluation,

101:   comparison, multiple models, cost-sensitive learning, skewed distributions\\

102:

103:   \vspace*{.1in}

104:   \textbf{\large To appear in \emph{Machine Learning Journal}}

105:

106: \end{flushleft}

107:

108: \vspace{.1in}

109:

110: \section{Introduction}

111:

112: Traditionally, classification systems have been built by experimenting with

113: many different classifiers, comparing their performance and choosing the best.

114: Experimenting with different induction algorithms, parameter settings, and

115: training regimes yields a large number of classifiers to be evaluated and

116: compared.  Unfortunately, comparison often is difficult in real-world

117: environments because key parameters of the target environment are not known.

118: The optimal cost/benefit tradeoffs and the target class priors seldom are

119: known precisely, and often are subject to change

120: \cite{ZahaviLevin:1997:issues_probl_applying_neural_comput,FriedmanWyatt:97,KlinkenbergJoachims:2000}.

121: For example, in fraud detection we cannot ignore misclassification costs or

122: the skewed class distribution, nor can we assume that our estimates are

123: precise or static \cite{FawcettProvost:97}.  We need a method for the

124: management, comparison, and application of multiple classifiers that is robust

125: in imprecise and changing environments.

126:

127: We describe the \textit{ROC convex hull} (\rocch) method, which combines

128: techniques from ROC analysis, decision analysis and computational geometry.

129: The ROC convex hull decouples classifier performance from specific class and

130: cost distributions, and may be used to specify the subset of methods that are

131: potentially optimal under any combination of cost assumptions and class

132: distribution assumptions.  The \rocch\ method is efficient, so it facilitates

133: the comparison of a large number of classifiers.  It minimizes the management

134: of classifier performance data because it can specify exactly those

135: classifiers that are potentially optimal, and it is incremental, easily

136: incorporating new and varied classifiers without having to reevaluate all

137: prior classifiers.

138:

139: We demonstrate that it is possible and desirable to avoid complete commitment

140: to a single best classifier during system construction.  Instead, the \rocch\

141: can be used to build from the available classifiers a hybrid classification

142: system that will perform best under any target cost/benefit and class

143: distributions.  Target conditions can then be specified at run time.

144: Moreover, in cases where precise information is still unavailable when the

145: system is run (or if the conditions change dynamically during operation), the

146: hybrid system can be tuned easily (and optimally) based on feedback from its

147: actual performance.

148:

149: The paper is structured as follows.  First we sketch briefly the traditional

150: approach to building such systems, in order to demonstrate that it is brittle

151: under the types of imprecision common in real-world problems.  We then

152: introduce and describe the \rocch\ and its properties for comparing and

153: visualizing classifier performance in imprecise environments.  In the

154: following sections we formalize the notion of a robust classification system,

155: and show that the \rocch\ is an elegant method for constructing one

156: automatically.  The solution is elegant because the resulting hybrid

157: classifier is robust for a wide variety of problem formulations, including the

158: optimization of metrics such as accuracy, expected cost, lift, precision,

159: recall, and workforce utilization, and it is efficient to build, to store, and

160: to update.  We then show that the hybrid actually can do better than the best

161: known classifier in certain situations.  Finally, by citing results from

162: empirical studies, we provide evidence that this type of system indeed is

163: needed.

164:

165: \subsection{An example}

166:

167: A systems-building team wants to create a system that will take a

168: large number of instances and identify those for which an action

169: should be taken.  The instances could be potential cases of fraudulent

170: account behavior, of faulty equipment, of responsive customers, of

171: interesting science, etc.  We consider problems for which the best

172: method for classifying or ranking instances is not well defined, so

173: the system builders may consider machine learning methods, neural

174: networks, case-based systems, and hand-crafted knowledge bases as

175: potential classification models.  Ignoring for the moment issues of

176: efficiency, the foremost question facing the system builders is: which

177: of the available models performs ``best'' at classification?

178:

179: Traditionally, an experimental approach has been taken to answer this question,

180: because the distribution of instances can be sampled if it is not known a

181: priori.  The standard approach is to estimate the error rate of each model

182: statistically and then to choose the model with the lowest error rate.  This

183: strategy is common in machine learning, pattern recognition, data mining,

184: expert systems and medical diagnosis.  In some cases, other measures such as

185: cost or benefit are used as well.  Applied statistics provides methods such as

186: cross-validation and the bootstrap for estimating model error rates, and recent

187: studies have compared the effectiveness of different methods

188: \cite{Dietterich:98,kohavi-accest,Salzberg:97}.

189:

190: Unfortunately, this experimental approach is brittle under two types

191: of imprecision that are common in real-world environments.

192: Specifically, costs and benefits usually are not known precisely, and

193: target (prior) class distributions often are known only approximately

194: as well.  This observation has been made by many authors

195: \cite{Bradley:97,Catlett:95,ProvostFawcett:97}, and is in fact the

196: concern of a large subfield of decision analysis

197: \cite{WeinsteinFineberg:80}.  Imprecision also arises because the

198: environment may change between the time the system is conceived and

199: the time it is used, and even as it is used.  For example, levels of

200: fraud and levels of customer responsiveness change continually over

201: time and from place to place.

202:

203: \subsection{Basic terminology}

204:

205: \begin{figure}[tb]

206:   \begin{center}

207:     \epsfig{file=NeymanPearson.eps,height=3in}

208:     \caption{Three classifiers under three different Neyman-Pearson decision

209:       criteria}

210:     \label{fig:NP}

211:   \end{center}

212: \end{figure}

213:

214: In this paper we address two-class problems.  Formally, each instance

215: $I$ is mapped to one element of the set $\{\POS,\NEG\}$ of (correct)

216: positive and negative classes.  A \emph{classification model} (or

217: \emph{classifier}) is a mapping from instances to predicted classes.

218: Some classification models produce a continuous output (\eg an

219: estimate of an instance's class membership probability) to which

220: different thresholds may be applied to predict class membership.  To

221: distinguish between the actual class and the predicted class of an

222: instance, we will use the labels $\{\YES,\NO\}$ for the

223: classifications produced by a model.  For our discussion, let

224: $c(\textit{classification}, \textit{class})$ be a two-place error cost

225: function where $c(\YES,\NEG)$ is the cost of a false positive error

226: and $c(\NO,\POS)$ is the cost of a false negative error.\footnote{For

227: this paper, we consider error costs to include benefits not realized,

228: and ignore the costs of correct classifications.}

229: We represent class distributions by the classes' prior probabilities

230: $p(\POS)$ and $p(\NEG) = 1 - p(\POS)$.

231:

232:

233: The true positive rate, or hit rate, of a classifier is:

234: \begin{displaymath}

235:   TP = p(\YES|\POS) \approx \frac{\rm positives\: correctly\: classified}

236:                    {\rm total\: positives}

237: \end{displaymath}

238: The false positive rate, or false alarm rate, of a classifier is:

239: \begin{displaymath}

240:   FP = p(\YES|\NEG) \approx \frac{\rm negatives\: incorrectly\: classified}

241:                    {\rm total\: negatives}

242: \end{displaymath}

243:

244:

245: The traditional experimental approach is brittle because it chooses

246: one model as ``best'' with respect to a specific set of cost functions

247: and class distribution.  If the target conditions change, this system

248: may no longer perform optimally, or even acceptably.  As an example,

249: assume that we have a maximum false positive rate $FP$ that must not

250: be exceeded.  We want to find the classifier with the highest possible

251: true positive rate, $TP$, that does not exceed the $FP$ limit.  This

252: is the Neyman-Pearson decision criterion \cite{Egan:75}.  Three

253: classifiers, under three such $FP$ limits, are shown in

254: figure~\ref{fig:NP}.  A different classifier is best for each $FP$

255: limit; any system built with a single ``best'' classifier is brittle

256: if the $FP$ requirement can change.

257:

258: \section{Evaluating and visualizing classifier performance}

259:

260: \subsection{Classifier comparison: decision analysis and ROC analysis}

261:

262: Most prior work on building classifiers uses classification accuracy (or,

263: equivalently, undifferentiated error rate) as the primary evaluation metric.

264: The use of accuracy assumes that the class priors in the target environment

265: will be \textit{constant and relatively balanced}.  In the real world this

266: rarely is the case.  Classifiers often are used to sift through a large

267: population of normal or uninteresting entities in order to find a relatively

268: small number of unusual ones; for example, looking for defrauded accounts

269: among a large population of customers, screening medical tests for rare

270: diseases, and checking an assembly line for defective parts.  Because the

271: unusual or interesting class is rare among the general population, the class

272: distribution is very skewed

273: \cite{EzawaEtal:96,FawcettProvost:96,FawcettProvost:97,KubatHolteMatwin:98,SaittaNeri:98}.

274:

275: As the class distribution becomes more skewed, evaluation based on accuracy

276: breaks down.  Consider a domain where the classes appear in a 999:1 ratio.  A

277: simple rule---always predict the more prevalent class---gives a 99.9\%

278: accuracy.  This accuracy may be quite difficult for an induction algorithm

279: to beat, though the simple rule presumably is unacceptable if a non-trivial

280: solution is sought.  Skews of $10^2$ are common in fraud detection and skews

281: exceeding $10^6$ have been reported in other applications

282: \cite{ClearwaterStern:91}.
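
The breakdown can be made explicit with a short worked calculation (added
here for illustration, using only the rates $TP$ and $FP$ and the priors
$p(\POS)$ and $p(\NEG)$ defined in the basic terminology above).  Accuracy is
the prior-weighted sum of the rates of correct classification on each class:
\[
\textrm{Accuracy} = p(\POS)\cdot TP \; + \; p(\NEG)\cdot (1 - FP)
\]
Assuming the negative class is the common one, the rule that always predicts
it has $TP = 0$ and $FP = 0$, so under a 999:1 ratio its accuracy is
$p(\NEG) = \frac{999}{1000} = 99.9\%$, even though it never identifies a
single positive instance.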

283:

284: Evaluation by classification accuracy also assumes \textit{equal error costs}:

285: $c(\YES,\NEG)=c(\NO,\POS)$.  In the real world classifications lead to

286: actions, which have consequences.  Actions can be as diverse as denying a

287: credit charge, discarding a manufactured part, moving a control surface on an

288: airplane, or informing a patient of a cancer diagnosis.  The consequences may

289: be grave, and performing an incorrect action may be very costly.  Rarely are

290: the costs of mistakes equivalent.  In mushroom classification, for example,

291: judging a poisonous mushroom to be edible is far worse than judging an edible

292: mushroom to be poisonous.  Indeed, it is hard to imagine a domain in which a

293: classification system may be indifferent to whether it makes a false positive

294: or a false negative error.  In such cases, accuracy maximization should be

295: replaced with cost minimization.

296:

297: The problems of unequal error costs and uneven class distributions are

298: related.  It has been suggested that, for training, high-cost

299: instances can be compensated for by increasing their prevalence in an

300: instance set \cite{bre84}.  Unfortunately, little work has been

301: published on either problem.  There exist several dozen articles in

302: which techniques for cost-sensitive learning are suggested

303: \cite{Turney-cost-bib}, but few studies evaluate and compare them

304: \cite{Domingos:99,pazzani-cost:94,ProvostFawcettKohavi:98}.  The

305: literature provides even less guidance in situations where

306: distributions are imprecise or can change.

307:

308: \begin{figure}[tb]

309:   \begin{center}

310:     \epsfig{file=ROC-curves.eps,height=3in,width=3.2in}

311:     \caption{ROC graph of three classifiers}

312:     \label{fig:ROC-curves}

313:   \end{center}

314: \end{figure}

315:

316: Given an estimate of $p(\POS|I)$, the posterior probability of an instance's

317: class membership, decision analysis gives us a way to produce cost-sensitive

318: classifications \cite{WeinsteinFineberg:80}.  Classifier error frequencies can

319: be used to approximate such probabilities \cite{pazzani-cost:94}.  For an

320: instance $I$, the decision to emit a positive classification from a particular

321: classifier is:

322:

323: \[

324: [1-p(\POS|I)] \cdot c(\YES,\NEG) \; < \; p(\POS|I) \cdot c(\NO,\POS)

325: \]
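
Solving this inequality for $p(\POS|I)$ gives an equivalent threshold form (a
rearrangement added here for illustration, assuming both error costs are
positive):
\[
p(\POS|I) \; > \; \frac{c(\YES,\NEG)}{c(\YES,\NEG) + c(\NO,\POS)}
\]
For instance, if a false negative is assumed to cost 25 times as much as a
false positive, the threshold is $\frac{1}{1+25} = \frac{1}{26} \approx 0.038$,
so a positive classification is emitted even for instances with a small
estimated probability of being positive.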

326:

327: Regardless of whether a classifier produces probabilistic or binary

328: classifications, its normalized cost on a test set can be evaluated

329: empirically as:

330: \[

331: \textrm{Cost} = FP\cdot c(\YES,\NEG) + (1 - TP)\cdot c(\NO,\POS)

332: \]

333: Most published work on cost-sensitive classification uses an equation such as

334: this to rank classifiers.  Given a set of classifiers, a set of examples, and a

335: precise cost function, each classifier's cost is computed and the minimum-cost

336: classifier is chosen.  However, as discussed above, such analyses assume that

337: the distributions are precisely known and static.

338:

339: More general comparisons can be made with Receiver Operating Characteristic

340: (ROC) analysis, a classic methodology from signal detection theory that is

341: common in medical diagnosis and has recently begun to be used more generally

342: in AI classifier work

343: \cite{Beck-Schultz:86,Egan:75,Swets:88,FriedmanWyatt:97}.  ROC graphs depict

344: tradeoffs between hit rate and false alarm rate.

345:

346: We use the term \textit{ROC space} to denote the coordinate system used for

347: visualizing classifier performance.  In ROC space, $TP$ is represented on the Y

348: axis and $FP$ is represented on the X axis.  Each classifier is represented by

349: the point in ROC space corresponding to its $(FP,TP)$ pair.  For models that

350: produce a continuous output, e.g., posterior probabilities, $TP$ and $FP$ vary

351: together as a threshold on the output is varied between its extremes (each

352: threshold defines a classifier); the resulting curve is called the ROC curve.

353: An ROC curve illustrates the error tradeoffs available with a given model.

354: Figure~\ref{fig:ROC-curves} shows a graph of three typical ROC curves; in fact,

355: these are the complete ROC curves of the classifiers shown in

356: figure~\ref{fig:NP}.

357:

358:

359: For orientation, several points on an ROC graph should be noted.  The lower

360: left point $(0,0)$ represents the strategy of never alarming, the upper right

361: point $(1,1)$ represents the strategy of always alarming, the point $(0,1)$

362: represents perfect classification, and the line $y=x$ (not shown) represents

363: the strategy of randomly guessing the class.  Informally, one point in ROC

364: space is better than another if it is to the northwest ($TP$ is higher, $FP$ is

365: lower, or both).  An ROC graph allows an informal visual comparison of a set of

366: classifiers.

367:

368:

369:

370:

371: ROC graphs illustrate the behavior of a classifier \emph{without

372: regard to class distribution or error cost}, and so they decouple

373: classification performance from these factors.  Unfortunately, while

374: an ROC graph is a valuable visualization technique, it does a poor job

375: of aiding the choice of classifiers.  Only when one classifier clearly

376: dominates another over the entire performance space can it be declared

377: better.

378:

379:

380: \subsection{The ROC Convex Hull method}

381:

382: In this section we combine decision analysis with ROC analysis and adapt them

383: for comparing the performance of a set of learned classifiers.  The method is

384: based on three high-level principles.  First, ROC space is used to separate

385: classification performance from class and cost distribution information.

386: Second, decision-analytic information is projected onto the ROC space.  Third,

387: the convex hull in ROC space is used to identify the subset of classifiers

388: that are potentially optimal.

389:

390:

391: \begin{figure}[tb]

392:   \centering

393:   \epsfig{file=ROC2.eps}

394:   \caption{The ROC convex hull identifies potentially optimal classifiers.}

395:   \label{fig:ROC-hull}

396: \end{figure}

397:

398: \subsubsection{Iso-performance lines}

399:

400: By separating classification performance from class and cost distribution

401: assumptions, the decision goal can be projected onto ROC space for a neat

402: visualization.  Specifically, the expected cost of applying the classifier

403: represented by a point ($FP$,$TP$) in ROC space is:

404:

405:

406: \[

407: p(\POS)\cdot (1-TP)\cdot c(\NO,\POS) \; + \; p(\NEG)\cdot FP \cdot c(\YES,\NEG)

408: \]

409:

410: Therefore, two points, ($FP_1$,$TP_1$) and ($FP_2$,$TP_2$),

411: have the same performance if

412:

413: \[

414: \frac{TP_2 - TP_1}{FP_2 - FP_1}

415: =

416: \frac{c(\YES,\NEG)p(\NEG)}{c(\NO,\POS)p(\POS)}

417: \]

418:

419:

420: This equation defines the slope of an \textit{iso-performance line}.

421: That is, all classifiers corresponding to points on the line have the

422: same expected cost.  Each set of class and cost distributions defines

423: a family of iso-performance lines.  Lines ``more northwest'' (having a

424: larger $TP$-intercept) are better because they correspond to

425: classifiers with lower expected cost.
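
The slope condition can be checked by a short derivation (a sketch added for
clarity; the symbols $m$ and $b$ are introduced here only for this purpose).
Equating the expected costs of the two points and cancelling the common
constant term $p(\POS)\cdot c(\NO,\POS)$ gives
\[
p(\POS)\cdot c(\NO,\POS)\cdot (TP_2 - TP_1)
\; = \;
p(\NEG)\cdot c(\YES,\NEG)\cdot (FP_2 - FP_1)
\]
which rearranges to the slope equation above.  Writing an iso-performance line
as $TP = m\cdot FP + b$, with $m$ the slope given above and $b$ the
$TP$-intercept, and substituting into the expected cost, the terms in $FP$
cancel and the cost reduces to
\[
p(\POS)\cdot c(\NO,\POS)\cdot (1 - b)
\]
Hence the expected cost is the same at every point of the line, and it
decreases as the intercept $b$ increases.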

426:

427: \subsubsection{The ROC convex hull}

428:

429: Because in most real-world cases the target distributions are not known

430: precisely, it is valuable to be able to identify those classifiers that

431: potentially are optimal.  Each possible set of distributions defines a family

432: of iso-performance lines, and for a given family, the optimal methods are

433: those that lie on the ``most-northwest'' iso-performance line.  Thus, a

434: classifier is optimal for some conditions if and only if it lies on the

435: northwest boundary (\ie above the line $y=x$) of the convex hull

436: \cite{quickhull:96} of the set of points in ROC space.\footnote{The convex

437:   hull of a set of points is the smallest convex set that contains the

438:   points.}  We discuss this in detail in Section~\ref{sect:rocch-hybrid}.

439:

440:

441: \begin{figure}[tb]

442:   \centering

443:   \epsfig{file=ROC3.eps}

444:   \caption{Lines $\alpha$ and $\beta$ show the optimal classifier under

445:     different sets of conditions.}

446:   \label{fig:ROC-hull2}

447: \end{figure}

448:

449: We call the convex hull of the set of points in ROC space the \textit{ROC

450: convex hull} (\rocch) of the corresponding set of classifiers.

451: Figure~\ref{fig:ROC-hull} shows four ROC curves with the ROC convex hull drawn

452: as the border between the shaded and unshaded areas.  $\mathsf{D}$ is clearly

453: not optimal.  Perhaps surprisingly, $\mathsf{B}$ can never be optimal either

454: because none of the points of its ROC curve lies on the convex hull.  We can

455: also remove from consideration any points of $\mathsf{A}$ and $\mathsf{C}$

456: that do not lie on the hull.

457:

458: Consider these classifiers under two distribution scenarios.  In each, negative

459: examples outnumber positives by 5:1.  In scenario $\mathcal{A}$, false

460: positive and false negative errors have equal cost.  In scenario $\mathcal{B}$,

461: a false negative is 25 times as expensive as a false positive (\eg missing a

462: case of fraud is much worse than a false alarm).  Each scenario defines a

463: family of iso-performance lines.  The lines corresponding to scenario

464: $\mathcal{A}$ have slope 5; those for $\mathcal{B}$ have slope $\frac{1}{5}$.

465: Figure~\ref{fig:ROC-hull2} shows the convex hull and two iso-performance

466: lines, $\alpha$ and $\beta$.  Line $\alpha$ is the ``best'' line

467: with slope $5$ that intersects the convex hull; line $\beta$ is the best line

468: with slope $\frac{1}{5}$ that intersects the convex hull.  Each line

469: identifies the optimal classifier under the given distribution.
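
The two slopes can be verified directly from the iso-performance slope
equation (a worked check added for illustration).  A 5:1 ratio of negatives
to positives means $p(\NEG)/p(\POS) = 5$, so
\[
\textrm{scenario } \mathcal{A}: \quad
\frac{c(\YES,\NEG)}{c(\NO,\POS)} \cdot \frac{p(\NEG)}{p(\POS)} = 1 \cdot 5 = 5
\]
\[
\textrm{scenario } \mathcal{B}: \quad
\frac{c(\YES,\NEG)}{c(\NO,\POS)} \cdot \frac{p(\NEG)}{p(\POS)}
= \frac{1}{25} \cdot 5 = \frac{1}{5}
\]
The steep lines of scenario $\mathcal{A}$ favor points toward the lower left
of ROC space (few false alarms), whereas the shallow lines of scenario
$\mathcal{B}$ favor points further toward the upper right (a higher hit rate
at the price of more false alarms).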

470:

471: \begin{figure}[tb]

472:   \begin{center}

473:     \epsfig{file=ROC-hull.eps,height=3in,width=3.2in}

474:     \caption{ROC curves with convex hull}

475:     \label{fig:ROCCH}

476:   \end{center}

477: \end{figure}

478:

479:

480: Figure~\ref{fig:ROCCH} shows the three ROC curves from our initial

481: example, with the convex hull drawn.

482:

483:

484: \subsubsection{Generating the ROC Convex Hull}

485:

486: The \textit{ROC convex hull method} selects the potentially optimal classifiers

487: based on the ROC convex hull and iso-performance lines.

488:

489: \begin{table}[tb]

490:   \caption{Algorithm for generating an ROC curve from a set of

491:     ranked examples.}

492:   \begin{center}

493:     \rule{\textwidth}{.01in}

494:     \begin{tabbing}

495:       \textbf{\rmfamily Given:}~~ \=E: \= List of \=tuples

496:       $\langle I, p \rangle$ where:\\

497:       \>\>\>$I$: labeled example\\

498:       \>\>\>$p$: numeric ranking assigned to $I$ by the classifier \\

499:       \>$P, N$: count of positive and negative examples in E, respectively.\\

500:       \textbf{\rmfamily Output:}  R: List of points on the ROC curve.\\

501:       \vspace*{1ex}\\

502:       xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=\kill

503:       $Tcount = 0$; \>\>\>\>\>\>{\it /* current TP tally */ }\\

504:       $Fcount = 0$; \>\>\>\>\>\>{\it /* current FP tally */ }\\

505:       $plast = -\infty$; \>\>\>\>\>\>{\it /* last score seen */ }\\

506:       $R = \langle \rangle$; \>\>\>\>\>\>{\it /* list of ROC points */ }\\

507:       sort $E$ in decreasing order by $p$ values;\\

508:       \WHILE (E $\neq \emptyset$) \DO \\

509:       \>remove tuple $\langle I, p \rangle$ from head of E;\\

510:       \>\IF ($p \neq plast$) \THEN\\

511:       \>\>add point ($\frac{Fcount}{N}$, $\frac{Tcount}{P}$) to end of R;\\

512:       \>\>$plast = p$;\\

513:       \>\ENDIF\\

514:       \>\IF ($I$ is a positive example) \THEN\\

515:       \>\>$Tcount = Tcount + 1$;\\

516:       \>\ELSE \>\>\>\>\>{\it /* I is a negative example */}\\

517:       \>\>$Fcount = Fcount + 1$;\\

518:       \>\ENDIF\\

519:       \ENDWHILE\\

520:       add point ($\frac{Fcount}{N}$, $\frac{Tcount}{P}$) to end of R;\\

521:     \end{tabbing}

522:     \rule{\textwidth}{.01in}

523:   \end{center}

524:   \label{tab:ROC-alg}

525: \end{table}

526:

527: \begin{enumerate}

528:

529: \item For each classifier, plot $TP$ and $FP$ in ROC space.  For

530: continuous-output classifiers, vary a threshold over the output range

531: and plot the ROC curve.  Table~\ref{tab:ROC-alg} shows an algorithm

532: for producing such an ROC curve in a single pass.\footnote{There is a

533: subtle complication to producing ROC curves from ranked test-set data,

534: which is reflected in the algorithm shown in Table~\ref{tab:ROC-alg}.

535: Specifically, consecutive examples with the same score can give overly

536: optimistic or overly pessimistic ROC curves, depending on the ordering

537: of positive and negative examples.  The ROC curve generating algorithm

538: shown here waits until all examples with the same score have been

539: tallied before computing the next point of the ROC curve.  The result

540: is a segment that bisects the area that would have resulted from the

541: most optimistic and most pessimistic orderings.}

542:

543: \item Find the convex hull of the set of points representing the predictive

544:   behavior of all classifiers of interest, for example by using the QuickHull

545:   algorithm \cite{quickhull:96}.

546:

547: \item For each set of class and cost distributions of interest, find the slope

548:   (or range of slopes) of the corresponding iso-performance lines.

549:

550: \item For each set of class and cost distributions, the optimal classifier will

551:   be the point on the convex hull that intersects the iso-performance line with

552:   the largest $TP$-intercept.  Ranges of slopes specify hull segments.

553:

554: \end{enumerate}
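
To make step 1 concrete, the following trace (an illustration added here; the
five scored examples are hypothetical) runs the algorithm of
Table~\ref{tab:ROC-alg} on a test set with $P = 3$ positives and $N = 2$
negatives.  Sorted by decreasing score, the examples are
$\langle +, 0.9 \rangle$, $\langle +, 0.8 \rangle$, $\langle -, 0.8 \rangle$,
$\langle -, 0.6 \rangle$ and $\langle +, 0.5 \rangle$.  A point is emitted each
time the score changes, before the current example is tallied, and once more
after the list is exhausted:
\[
R = \left\langle (0,0),\; \left(0,\frac{1}{3}\right),\;
\left(\frac{1}{2},\frac{2}{3}\right),\; \left(1,\frac{2}{3}\right),\;
(1,1) \right\rangle
\]
The two examples tied at score $0.8$ produce no intermediate point, so the
curve moves from $(0,\frac{1}{3})$ directly to $(\frac{1}{2},\frac{2}{3})$
along a diagonal segment, regardless of the order in which the tied positive
and negative examples appear in the list.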

555:

556:

557: Figures~\ref{fig:ROC-hull} and \ref{fig:ROC-hull2} demonstrate how the

558: subset of classifiers that are potentially optimal can be identified

559: and how classifiers can be compared under different cost and class

560: distributions.
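
Steps 3 and 4 can also be carried out numerically.  The line of slope $m$
through a point $(FP,TP)$ has $TP$-intercept $TP - m\cdot FP$, so the optimal
hull vertex for a given slope is the one that maximizes this quantity.  The
following worked example is a sketch under stated assumptions: the hull
vertices are hypothetical and are not read from the figures, and the slopes
$5$ and $\frac{1}{5}$ are those of scenarios $\mathcal{A}$ and $\mathcal{B}$.
\begin{center}
\begin{tabular}{ccc}
  $(FP,TP)$ & $TP - 5\cdot FP$ & $TP - \frac{1}{5}\cdot FP$ \\ \hline
  $(0,0)$     & $0$    & $0$    \\
  $(0.1,0.6)$ & $0.1$  & $0.58$ \\
  $(0.4,0.9)$ & $-1.1$ & $0.82$ \\
  $(1,1)$     & $-4$   & $0.8$  \\
\end{tabular}
\end{center}
With slope $5$ the vertex $(0.1,0.6)$ has the largest intercept; with slope
$\frac{1}{5}$ the vertex $(0.4,0.9)$ does.  The hull segments adjacent to
$(0.1,0.6)$ have slopes $6$ and $1$, so that vertex remains optimal for any
iso-performance slope between $1$ and $6$; similarly, $(0.4,0.9)$ is optimal
for slopes between $\frac{1}{6}$ and $1$.  This is the sense in which a range
of slopes specifies a portion of the hull.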

561:

562: \subsubsection{Comparing a variety of classifiers}

563:

564: The ROC convex hull method accommodates both binary and continuous

565: classifiers.  Binary classifiers are represented by individual points in ROC

566: space.  Continuous classifiers produce numeric outputs to which thresholds can

567: be applied, yielding a series of $(FP, TP)$ pairs forming an ROC curve.  Each

568: point may or may not contribute to the ROC convex hull.

569: Figure~\ref{fig:Adding-EFG} depicts the binary classifiers $\mathsf{E}$,

570: $\mathsf{F}$ and $\mathsf{G}$ added to the previous hull.  $\mathsf{E}$ may be

571: optimal under some circumstances because it extends the convex hull.

572: Classifiers $\mathsf{F}$ and $\mathsf{G}$ never will be optimal because they

573: do not extend the hull.

574:

575: \begin{figure}[tb]

576:   \centering \epsfig{file=Adding-classifiers.eps,height=3in}

577:   \caption{Classifier $\mathsf{E}$ may be optimal for some conditions because

578:     it extends the ROC convex hull.  $\mathsf{F}$ and $\mathsf{G}$ cannot be

579:     optimal because they are not on the hull, nor do they extend it.}

580:   \label{fig:Adding-EFG}

581: \end{figure}

582:

583: New classifiers can be added incrementally to an \rocch\ analysis, as

584: demonstrated in figure~\ref{fig:Adding-EFG} by the addition of classifiers

585: $\mathsf{E}$, $\mathsf{F}$, and $\mathsf{G}$.  Each new classifier either

586: extends the existing hull or it does not.  In the former case the hull must be

587: updated accordingly, but in the latter case the new classifier can be ignored.

588: Therefore, the method does not require saving every classifier (or saving

589: statistics on every classifier) for re-analysis under different

590: conditions---only those points on the convex hull.  Recall that each point is

591: a classifier and might take up considerable space.  Further, the management of

592: knowledge about many classifiers and their statistics from many different runs

593: of learning programs (e.g., with different algorithms or parameter settings)

594: can be a substantial undertaking.  Classifiers not on the \rocch\ can never be

595: optimal, so they need not be saved.  Every classifier that \emph{does} lie on

596: the convex hull must be saved.  In Section~\ref{sect:our-study} we demonstrate

597: the \rocch\ in use, managing the results of many learning experiments.

598:

599: \subsubsection{Changing distributions and costs}

600:

601: Class and cost distributions that change over time necessitate the reevaluation

602: of classifier choice.  In fraud detection, costs change based on workforce and

603: reimbursement issues; the amount of fraud changes monthly.  With the ROC convex

604: hull method, comparing under a new distribution involves only calculating the

605: slope(s) of the corresponding iso-performance lines and intersecting them with

606: the hull, as shown in figure~\ref{fig:ROC-hull2}.

607:

608: The ROC convex hull method scales gracefully to any degree of

609: precision in specifying the cost and class distributions.  If nothing

610: is known about a distribution, the ROC convex hull shows all

611: classifiers that may be optimal under any conditions.

612: Figure~\ref{fig:ROC-hull} showed that, given classifiers $\mathsf{A}$,

613: $\mathsf{B}$, $\mathsf{C}$ and $\mathsf{D}$, only $\mathsf{A}$ and

614: $\mathsf{C}$ can ever be optimal.  With complete information, the

615: method identifies the optimal classifier(s).  In

616: figure~\ref{fig:ROC-hull2} we saw that classifier $\mathsf{A}$ (with a

617: particular threshold value) is optimal under scenario $\mathcal{A}$

618: and classifier $\mathsf{C}$ is optimal under scenario $\mathcal{B}$.

619: Next we will see that with less precise information, the ROC convex

620: hull can show the subset of possibly optimal classifiers.

621:

622: \subsubsection{Sensitivity analysis}

623:

624:

625: \begin{figure}[tb]

626:   \begin{center}

627:     %

628:     %

629:     \epsfig{file=Sensitivity-1.eps,height=2.7in,width=2.6in} \\

630:     a.~~Low sensitivity\\

631:     \vspace*{.2in}

632:     \epsfig{file=Sensitivity-2.eps,height=2.5in,width=2.5in}\\

633:     b.~~High sensitivity\\

634:   \end{center}

635:   \caption{Sensitivity analysis using the ROC convex hull:  (a) low

636:     sensitivity (only $\mathsf{C}$ can be optimal), (b) high sensitivity ($\mathsf{A}$, $\mathsf{E}$, or $\mathsf{C}$ can

637:     be optimal)}

638:   \label{fig:sensitive}

639: \end{figure}

640:

641:

642: Imprecise distribution information defines a \emph{range} of slopes for

643: iso-performance lines.  This range of slopes intersects a segment of the ROC

644: convex hull, which facilitates sensitivity analysis.  For example, if the

645: segment defined by a range of slopes corresponds to a single point in ROC

646: space or a small threshold range for a single classifier, then the choice of

647: classifier is insensitive to the distribution assumptions in question.  Consider a scenario

648: similar to $\mathcal{A}$ and $\mathcal{B}$ in that negative examples are 5

649: times as prevalent as positive ones.  In this scenario, consider the cost of

650: dealing with a false alarm to be between \$10 and \$20, and the cost of

651: missing a positive example to be between \$200 and \$250.  These conditions

652: define a range of slopes for iso-performance lines: $\frac{1}{5}\le m \le

653: \frac{1}{2}$.  Figure~\ref{fig:sensitive}a depicts this range of slopes and

654: the corresponding segment of the ROC convex hull.  The figure shows that the

655: choice of classifier is insensitive to changes within this range (and only

656: fine tuning of the classifier's threshold will be necessary).

657: Figure~\ref{fig:sensitive}b depicts a scenario with a wider range of slopes:

658: $\frac{1}{2} \le m \le 3$.  The figure shows that under this scenario the

659: choice of classifier is very sensitive to the distribution.  Classifiers

660: $\mathsf{A}$, $\mathsf{C}$ and $\mathsf{E}$ each are optimal for some

661: subrange.
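
As a quick check of the first range, and assuming the extreme costs are paired so as to minimize and then maximize the ratio in Equation~\ref{eq:slope}, with $p(\NEG)/p(\POS) = 5$:
\[
m_{\min} = \frac{10 \cdot 5}{250} = \frac{1}{5}, \qquad
m_{\max} = \frac{20 \cdot 5}{200} = \frac{1}{2}.
\]
The lower bound pairs the cheapest false alarm with the costliest miss; the upper bound pairs the costliest false alarm with the cheapest miss.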

662:

663: \section{Building robust classifiers}

664: \label{sect:rocch-hybrid}

665:

666: Up to this point, we have concentrated on the use of the \rocch\ for

667: visualizing and evaluating sets of classifiers.  The \rocch\ helps to

668: delay classifier selection as long as possible, yet provides a rich

669: performance comparison.  However, once system building incorporates a

670: particular classifier, the problem of brittleness resurfaces.  This is

671: important because the delay between system building and deployment may

672: be large, and because many systems must survive for years.  In fact,

673: in many domains a precise, static specification of future costs and

674: class distributions is not just unlikely, it is impossible

675: \cite{ProvostFawcettKohavi:98}.

676:

677: We address this brittleness by using the \rocch\ to produce

678: \textbf{robust classifiers}, defined as satisfying the following.

679: \emph{Under any target cost and class distributions, a robust

680: classifier will perform at least as well as the best classifier for

681: those conditions.}  Our statements about optimality are practical: the

682: ``best'' classifier may not be the Bayes-optimal classifier, but it is

683: at least as good as any known classifier.

684: Srinivasan \citeyear{Srinivasan:99} calls this ``FAPP-optimal''

685: (optimal for all practical purposes).  Stating that a classifier is

686: robust is stronger than stating that it is optimal for a specific set

687: of conditions.  A robust classifier is optimal under all possible

688: conditions.

689:

690: In principle, classification brittleness could be overcome by saving

691: all possible classifiers (neural nets, decision trees, expert systems,

692: probabilistic models, etc.)  and then performing an automated run-time

693: comparison under the desired target conditions.  However, such a

694: system is not feasible because of time and space limitations---there

695: are myriad possible classification models, arising from the many

696: different learning methods under their many different parameter

697: settings.  Storing all the classifiers is not feasible, nor is tuning

698: the system by comparing classifiers on the fly under different

699: conditions.  Fortunately, doing so is not necessary.

700: Moreover, we will show that it is sometimes possible to do \textit{better} than

701: any of these classifiers.

702:

703: \subsection{ROCCH-hybrid classifiers}

704:

705: We now show that robust hybrid classifiers can be built using the \rocch.

706:

707: \begin{definition}

708:   Let $\mathbf{I}$ be the space of possible instances and let $\mathbf{C}$ be

709:   the space of sets of classification models.  Let a

710:   \mathversion{bold}$\mu$\mathversion{normal}\textbf{-hybrid classifier}

711:   comprise a set of classification models $\mathcal{C} \in \mathbf{C}$ and a

712:   function

713:   \[

714:   \mu: \mathbf{I} \times \Re \times \mathbf{C} \rightarrow \{\YES,\NO\}.

715:   \]

716:   A $\mu$-hybrid classifier takes as input an instance $I \in \mathbf{I}$ for

717:   classification and a number $x \in \Re$.  As output, it produces the

718:   classification produced by $\mu(I,x,\mathcal{C})$.

719: \end{definition}

720:

721: Things will get more involved later, but for the time being consider that each

722: set of cost and class distributions defines a value for $x$, which is used to

723: select the (predetermined) best classifier for those conditions.  To build a

724: $\mu$-hybrid classifier, we must define $\mu$ and the set $\mathcal{C}$.  We

725: would like $\mathcal{C}$ to include only those models that perform optimally

726: under some conditions (class and cost distributions), since these will be

727: stored by the system, and we would like $\mu$ to be general enough to apply to

728: a variety of problem formulations.

729:

730: The models comprising the {\sc rocch} can be combined to form a

731: $\mu$-hybrid classifier that is an elegant, robust classifier.

732:

733: \begin{definition}

734:   The \textbf{{\sc \textbf{rocch}}-hybrid} is a $\mu$-hybrid classifier where

735:   $\mathcal{C}$ is the set of classifiers that form the {\sc rocch} and $\mu$

736:   makes classifications using the classifier on the {\sc rocch} with $FP=x$.

737: \end{definition}

738: Note that for the moment the {\sc rocch}-hybrid is defined only for $FP$

739: values corresponding to {\sc rocch} vertices.

740:

741: \subsection{Robust classification}

742:

743: Our definition of robust classifiers was intentionally vague about

744: what it means for one classifier to be better than another, because

745: different situations call for different comparison frameworks.  We now

746: continue with minimizing expected cost, because the process of proving

747: that the {\sc rocch}-hybrid minimizes expected cost for any cost and

748: class distributions provides a deep understanding of why and how the

749: {\sc rocch}-hybrid works.

750: Later we generalize to a wide variety of

751: comparison frameworks.

752:

753: The \rocch-hybrid can be seen as an application of multi-criteria

754: optimization to classifier design and construction.  The classifiers on the

755: \rocch\ are Edgeworth-Pareto optimal\footnote{Edgeworth-Pareto optimality is

756:   the century-old notion that in a multidimensional space of criteria, optimal

757:   performance is the frontier of achievable performance in this space.  In

758:   cases where performance is a linear combination of the criteria, the

759:   optimality frontier is the convex hull.} \cite{Stadler-book} with respect to

760: TP, FP, and the objective functions we discuss.  Multi-criteria optimization

761: was used previously in machine learning by Tcheng, Lambert, Lu and Rendell

762: \shortcite{TchengEtAl:89} for the selection of inductive bias.

763: Alternatively, the \rocch\ can be seen as an application of the theory of

764: games and statistical decisions, for which convex sets (and the convex hull)

765: represent optimal strategies \cite{BlackwellGirshick:54}.

766:

767: \subsubsection{Minimizing expected cost}

768:

769: From above, the expected cost of applying a classifier is:

770:

771: \begin{equation}

772:   \label{eq:expected_cost}

773:   ec(FP,TP) \; = \; p(\POS)  \cdot  (1-TP)\cdot c(\NO,\POS) \;  +

774:   \; p(\NEG)  \cdot  FP \cdot c(\YES,\NEG)

775: \end{equation}

776:

777: For a particular set of cost and class distributions, the

778: slope of the corresponding iso-performance lines is:

779:

780: \begin{equation}

781:   \label{eq:slope}

782:   m_{ec} = \frac{c(\YES,\NEG)p(\NEG)}{c(\NO,\POS)p(\POS)}

783: \end{equation}
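
This slope can be recovered from Equation~\ref{eq:expected_cost} in two steps; the sketch below assumes $FP_1 \ne FP_2$ and $c(\NO,\POS)\,p(\POS) > 0$.  Two points $(FP_1,TP_1)$ and $(FP_2,TP_2)$ have equal expected cost exactly when
\[
p(\POS) \cdot c(\NO,\POS) \cdot (TP_2 - TP_1) \; = \; p(\NEG) \cdot c(\YES,\NEG) \cdot (FP_2 - FP_1),
\]
which rearranges to
\[
\frac{TP_2 - TP_1}{FP_2 - FP_1} \; = \; \frac{c(\YES,\NEG)p(\NEG)}{c(\NO,\POS)p(\POS)} \; = \; m_{ec}.
\]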

784:

785: Every set of conditions will define an $m_{ec} \ge 0$.  We now can

786: show that the {\sc rocch}-hybrid is robust for problems where the

787: ``best'' classifier is the classifier with the minimum expected cost.

788:

789: The slope of the {\sc rocch} is an important tool in our argument.  The {\sc

790:   rocch} is a piecewise-linear, concave-down ``curve.''  Therefore, as $x$

791: increases, the slope of the {\sc rocch} is monotonically non-increasing with

792: $k-1$ discrete values, where $k$ is the number of {\sc rocch} component

793: classifiers, including the degenerate classifiers that define the {\sc rocch}

794: endpoints.  Where there will be no confusion, we use phrases such as ``points

795: in ROC space'' as a shorthand for the more cumbersome ``classifiers

796: corresponding to points in ROC space.'' For this subsection, unless otherwise

797: noted, ``points on the

798: {\sc rocch}'' refer to vertices of the {\sc rocch}.

799:

800: \begin{definition}

801:   \label{def:slope-of-rocch}

802:   For any real number $m \ge 0$, the \textbf{point where the slope of the

803:     \textsc{rocch}\ is $\mathbf{m}$} is one of the (arbitrarily chosen)

804:   endpoints of the segment of the {\sc rocch} with slope $m$, if such a

805:   segment exists.  Otherwise, it is the vertex for which the left adjacent

806:   segment has slope greater than $m$ and the right adjacent segment has slope

807:   less than $m$.

808: \end{definition}

809:

810: For completeness, the leftmost endpoint of the {\sc rocch} is considered to be

811: attached to a segment with infinite slope and the rightmost endpoint of the

812: {\sc rocch} is considered to be attached to a segment with zero slope.  Note

813: that every $m \ge 0$ defines at least one point on the {\sc rocch}.

814:

815: \begin{lemma}

816:   For any set of cost and class distributions, there is a point on the \rocch\

817:   with minimum expected cost.\\

818:   \textbf{Proof:} (by contradiction) Assume that for some conditions

819:   there exists a point \textbf{C} with smaller expected cost than any

820:   point on the {\sc rocch}.  By equations~\ref{eq:expected_cost} and

821:   \ref{eq:slope}, a point ($FP_2$,$TP_2$) has the same expected cost

822:   as a point ($FP_1$,$TP_1$) if \[ \frac{TP_2 - TP_1}{FP_2 - FP_1} =

823:   m_{ec} \] Therefore, for conditions corresponding to $m_{ec}$, all

824:   points with equal expected cost form an iso-performance line in ROC

825:   space with slope $m_{ec}$.  Also by equations~\ref{eq:expected_cost}

826:   and~\ref{eq:slope}, points on lines with larger y-intercept have

827:   lower expected cost.  Now, point \textbf{C} is not on the {\sc

828:   rocch}, so it is either above the curve or below the curve.  If it

829:   is above the curve, then the {\sc rocch} is not a convex set

830:   enclosing all points, which is a contradiction.  If it is below the

831:   curve, then the iso-performance line through \textbf{C} also

832:   contains a point \textbf{P} that is on the {\sc rocch} (not

833:   necessarily a vertex).  If this iso-performance line intersects no

834:   {\sc rocch} vertex, then consider the vertices at the endpoints of

835:   the {\sc rocch} segment containing \textbf{P}; one of these vertices

836:   must lie on a better iso-performance line than does \textbf{C}.

837:   In either case, since all points on an iso-performance line have the

838:   same expected cost, point \textbf{C} does not have smaller expected

839:   cost than all points on the {\sc rocch}, which is also a

840:   contradiction.  \EndProof

841: \end{lemma}

842:

843: Although it is not necessary for our purposes here, it can be shown

844: that \textit{all} of the minimum expected-cost classifiers are

845: \textit{on} the {\sc rocch}.

846:

847: \begin{definition}

848:   \label{def:m_iso_perf_line}

849:   An iso-performance line with slope $m$ is an \textbf{m-iso-performance

850:     line}.

851: \end{definition}

852:

853: \begin{lemma}

854:   For any cost and class distributions that translate to $m_{ec}$, a point on

855:   the {\sc rocch} has minimum expected cost only if the slope

856:   of the {\sc rocch} at that point is $m_{ec}$.\\

857:   \textbf{Proof:} (by contradiction) Suppose that there is a point \textbf{D}

858:   on the {\sc rocch} where the slope is \emph{not} $m_{ec}$, but the point

859:   does have minimum expected cost.  By Definition~\ref{def:slope-of-rocch},

860:   either (a) the segment to the left of \textbf{D} has slope less than

861:   $m_{ec}$, or (b) the segment to the right of \textbf{D} has slope greater

862:   than $m_{ec}$.  For case (a), consider point \textbf{N}, the vertex of the

863:   {\sc rocch} that neighbors \textbf{D} to the left, and consider the

864:   (parallel) $m_{ec}$-iso-performance lines $l_D$ and $l_N$ through \textbf{D}

865:   and \textbf{N}.  Because \textbf{N} is to the left of \textbf{D} and the

866:   line connecting them has slope less than $m_{ec}$, the y-intercept of $l_N$

867:   will be greater than the y-intercept of $l_D$.  This means that \textbf{N}

868:   will have lower expected cost than \textbf{D}, which is a contradiction.

869:   The argument for (b) is analogous (symmetric). \EndProof

870: \end{lemma}
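
The intercept comparison in case (a) can be made explicit.  Writing $s$ for the slope of the segment joining \textbf{N} and \textbf{D} (a symbol introduced here only for this sketch), an $m_{ec}$-iso-performance line through a point $(FP,TP)$ has y-intercept $TP - m_{ec} \cdot FP$.  Since $TP_D - TP_N = s \cdot (FP_D - FP_N)$,
\[
(TP_N - m_{ec} FP_N) - (TP_D - m_{ec} FP_D) \; = \; (FP_D - FP_N)(m_{ec} - s) \; > \; 0,
\]
because $FP_D > FP_N$ and $s < m_{ec}$.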

871:

872: \begin{lemma}

873:   If the slope of the {\sc rocch} at a point is $m_{ec}$, then the point has

874:   minimum expected cost.\\

875:   \textbf{Proof:} If this point is the only point where the slope of the {\sc

876:     rocch} is $m_{ec}$, then the proof follows directly from Lemma 1 and

877:   Lemma 2.  If there are multiple such points, then by definition they are

878:   connected by an $m_{ec}$-iso-performance line, so they have the same

879:   expected cost, and once again the proof follows directly from Lemma 1 and

880:   Lemma 2. \EndProof

881: \end{lemma}

882:

883: It is straightforward now to show that the {\sc rocch}-hybrid is robust for the

884: problem of minimizing expected cost.

885:

886: \begin{theorem}

887:   The {\sc rocch}-hybrid minimizes expected cost for any cost distribution

888:   and any class distribution.\\

889:   \textbf{Proof:} Because the {\sc rocch}-hybrid is composed of the

890:   classifiers corresponding to the points on the {\sc rocch}, this follows

891:   directly from Lemmas 1, 2, and 3. \EndProof

892: \end{theorem}

893:

894: Now we have shown that the {\sc rocch}-hybrid is robust when the goal

895: is to provide the minimum expected-cost classification.  This result

896: is important even for accuracy maximization, because the preferred

897: classifier may be different for different target class distributions.

898: This rarely is taken into account in experimental comparisons of

899: classifiers.

900:

901: \begin{corollary}

902:   The {\sc rocch}-hybrid minimizes error rate (maximizes accuracy) for any

903:   target class distribution.\\

904:   \textbf{Proof:} Error rate minimization is cost minimization with uniform

905:   error costs. \EndProof

906: \end{corollary}
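
Concretely, setting $c(\NO,\POS) = c(\YES,\NEG) = 1$ in Equation~\ref{eq:expected_cost} gives the error rate, and Equation~\ref{eq:slope} reduces to the ratio of the class priors:
\[
ec(FP,TP) \; = \; p(\POS) \cdot (1-TP) \; + \; p(\NEG) \cdot FP, \qquad
m_{ec} = \frac{p(\NEG)}{p(\POS)}.
\]
For instance, if negative examples are 5 times as prevalent as positive ones, accuracy is maximized at the point where the slope of the \rocch\ is 5.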

907:

908: \subsection{Robust classification for other common metrics}

909:

910: Showing that the \rocch-hybrid is robust not only helps us with understanding

911: the \rocch\ method generally, it also shows us how the \rocch-hybrid will pick

912: the best classifier in order to produce the best classifications, which we

913: will return to later.  If we ignore the need to specify how to pick the best

914: component classifier, we can show that the \rocch\ applies more generally.

915:

916: \begin{theorem}

917:   \label{theorem:general-rocch}

918:   For any classifier evaluation metric $f(FP,TP)$, if\\

919:   $\Partial{f}{TP}~\ge~0$ and $\Partial{f}{FP} \le 0$ then there exists a

920:   point on the \rocch\ with an $f$-value at least

921:   as high as that of any known classifier.\\

922:   \textbf{Proof:} (by contradiction) Assume that there exists a classifier

923:   $\mathcal{C}_o$, not on the \rocch, with an $f$-value higher than that of

924:   any point on the \rocch.  $\mathcal{C}_o$ is either (i) above or (ii) below

925:   the \rocch.  In case (i), the \rocch\ is not a convex set enclosing all the

926:   points, which is a contradiction.  In case (ii), let $\mathcal{C}_o$ be

927:   represented in ROC-space by $(FP_o,TP_o)$.  Because $\mathcal{C}_o$ is below

928:   the \rocch\ there exist points, call one $(FP_p,TP_p)$, on the \rocch\ with

929:   $TP_p > TP_o$ and $FP_p < FP_o$.  However, by the restriction on the partial

930:   derivatives, for any such point $f(FP_p,TP_p) \ge f(FP_o,TP_o)$, which again

931:   is a contradiction.  \EndProof

932: \end{theorem}
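
As one illustration, take $f = -ec$, the negated expected cost of Equation~\ref{eq:expected_cost}; this choice of $f$ is made here only to show how the condition is checked.  Then
\[
\Partial{f}{TP} = p(\POS) \cdot c(\NO,\POS) \ge 0, \qquad
\Partial{f}{FP} = -p(\NEG) \cdot c(\YES,\NEG) \le 0,
\]
provided the error costs are non-negative, so the theorem covers expected-cost minimization as a special case.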

933:

934: There are two complications to the more general use of the \rocch,

935: both of which are illustrated by the decision criterion from our very

936: first example.  Recall that the Neyman-Pearson criterion specifies a

937: maximum acceptable $FP$ rate.  Standard ROC analysis uses ROC curves

938: to select a single, parameterized classification model; the parameter

939: allows the user to select the ``operating point'' for a

940: decision-making task, usually a threshold on a probabilistic output

941: that will allow for optimal decision making.  Under the Neyman-Pearson

942: criterion, selecting the single best model from a set is easy: plot

943: the ROC curves, draw a vertical line at the desired maximum $FP$, and

944: pick the model whose curve has the largest $TP$ at the intersection

945: with this line.

946:

947: \begin{figure}[tb]

948:   \begin{center}

949:     \epsfig{file=ROC-NP.eps,height=3.1in,width=3in}

950:     \caption{The ROC Convex Hull used to select a classifier under the

951:       Neyman-Pearson criterion}

952:     \label{fig:ROC-NP}

953:   \end{center}

954: \end{figure}

955:

956: With the \rocch-hybrid, making the best classifications under

957: the Neyman-Pearson criterion is not so straightforward.

958: For minimizing expected cost it was sufficient for the {\sc rocch}-hybrid to

959: choose a \textit{vertex} from the {\sc rocch} for any $m_{ec}$ value.  For

960: problem formulations such as the Neyman-Pearson criterion, the performance

961: statistics at a non-vertex point on the {\sc rocch} may be preferable (see

962: figure~\ref{fig:ROC-NP}).  Fortunately, with a slight extension, the {\sc

963:   rocch}-hybrid can yield a classifier with these performance statistics.

964:

965: \begin{theorem}

966:   \label{theorem:rocch-achieves-any-tradeoff} An {\sc rocch}-hybrid

967:   can achieve the $TP$:$FP$ tradeoff represented by any point on the

968:   {\sc rocch}, not just the vertices.\\ \textbf{Proof:} (by

969:   construction) Extend $\mu(I,x,\mathcal{C})$ to non-vertex points as

970:   follows.  Pick the point $P$ on the {\sc rocch} with $FP=x$ (there

971:   is exactly one).  Let $TP_x$ be the $TP$ value of this point.  If

972:   ($x$, $TP_x$) is an {\sc rocch} vertex, use the corresponding

973:   classifier.  If it is not a vertex, call the left endpoint of the

974:   hull segment on which $P$ lies $C_l$, and the right endpoint $C_r$.

975:   Let $d$ be the distance between $C_l$ and $C_r$, and let $p$ be the

976:   distance between $C_l$ and $P$.  Make classifications as follows.

977:   For each input instance flip a weighted coin and choose the answer

978:   given by classifier $C_r$ with probability $\frac{p}{d}$ and that

979:   given by classifier $C_l$ with probability $1-\frac{p}{d}$.  It is

980:   straightforward to show that $FP$ and $TP$ for this classifier will

981:   be $x$ and $TP_x$. \EndProof

982: \end{theorem}
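
The final step of the proof can be checked directly.  Writing $\lambda = p/d$ (a shorthand used only here), the point $P$ lies a fraction $\lambda$ of the way from $C_l = (FP_l,TP_l)$ to $C_r = (FP_r,TP_r)$, so
\[
x = (1-\lambda) FP_l + \lambda FP_r, \qquad
TP_x = (1-\lambda) TP_l + \lambda TP_r.
\]
The randomized classifier labels a negative instance \YES\ with probability $(1-\lambda) FP_l + \lambda FP_r = x$, and a positive instance \YES\ with probability $(1-\lambda) TP_l + \lambda TP_r = TP_x$.  As a hypothetical example, with $C_l = (0.2, 0.6)$, $C_r = (0.4, 0.8)$ and $x = 0.25$, we get $\lambda = 0.25$ and $TP_x = 0.65$.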

983:

984: The second complication is that, as illustrated by the Neyman-Pearson

985: criterion, many practical classifier comparison frameworks include

986: \textit{constrained} optimization problems (below we will discuss other

987: frameworks).  Arbitrarily constrained optimizations are problematic for the

988: \rocch-hybrid.  Given total freedom, it is possible to devise constraints on

989: classifier performance such that, even with the restriction on the partial

990: derivatives, an interior point scores higher than any \textit{acceptable}

991: point on the hull.  For example, two linear constraints can enclose a subset

992: of the interior and exclude \textit{the entire} \rocch---there will be no

993: acceptable points on the \rocch.  However, many realistic constraints do not

994: thwart the optimality of the \rocch-hybrid.

995:

996: \begin{theorem}

997:   \label{theorem:general-rocch-hybrid}

998:   For any classifier evaluation metric $f(FP,TP)$, if \\

999:   $\Partial{f}{TP}\ge~0$ and $\Partial{f}{FP}\le~0$ and no constraint on

1000:   classifier performance eliminates any point on the \rocch\ without also

1001:   eliminating all higher-scoring interior points, then the \rocch-hybrid can

1002:   perform at least as well as any known classifier.

1003:   \\

1004:   \textbf{Proof:} Follows directly from Theorem~\ref{theorem:general-rocch}

1005:   and Theorem~\ref{theorem:rocch-achieves-any-tradeoff}.  \EndProof

1006: \end{theorem}

1007:

1008: Linear constraints on classifiers' $FP:TP$ performance are common

1009: for real-world problems, so the following is

1010: useful.

1011:

1012: \begin{corollary}

1013:   \label{corollary:linear-constraints}

1014:   For any classifier evaluation metric $f(FP,TP)$, if\\

1015:   $\Partial{f}{TP} \ge 0$ and $\Partial{f}{FP} \le 0$

1016:   and there is a single constraint on classifier performance

1017:   of the form $a \cdot TP + b \cdot FP \le c$, with $a$ and $b$

1018:   non-negative,

1019:   then

1020:   the \rocch-hybrid can perform at least as well as any known

1021:   classifier.

1022:   \\

1023:   \textbf{Proof:}

1024:   The single constraint eliminates from contention all points (classifiers)

1025:   that do not fall to the left of, or below, a line with non-positive

1026:   slope.  By the restriction on the partial derivatives, such a constraint

1027:   will not eliminate a point on the \rocch\  without also eliminating

1028:   all interior points with higher $f$-values.

1029:   Thus, the proof follows directly from Theorem~\ref{theorem:general-rocch-hybrid}.

1030:   \EndProof

1031: \end{corollary}

1032:

1033: So, finally, we have the following:

1034:

1035: \begin{corollary}

1036:   \label{cor:rocch-maximizes-NP}

1037:   For the Neyman-Pearson criterion, the {\sc rocch}-hybrid can perform at

1038:   least as well as any known

1039:   classifier.\\

1040:   \textbf{Proof:} For the Neyman-Pearson criterion, the evaluation metric is

1041:   $f(FP,TP)=TP$, that is, a higher $TP$ is better.  The constraint on

1042:   classifier performance is $FP \le FP_{max}$. These satisfy the conditions

1043:   for Corollary~\ref{corollary:linear-constraints}, and therefore this

1044:   corollary follows.  \EndProof

1045: \end{corollary}

1046:

1047: All the foregoing effort may seem misplaced for a simple

1048: criterion like Neyman-Pearson.  However, there are

1049: many other realistic problem formulations.

1050: For example, consider

1051: the decision-support problem of optimizing \textit{workforce utilization}, in

1052: which a workforce is available that can process a fixed number of cases.  Too few

1053: cases will under-utilize the workforce, but too many cases will leave some

1054: cases unattended (expanding the workforce usually is not a short-term

1055: solution).  If the workforce can handle $K$ cases, the system should present

1056: the best possible set of $K$ cases.  This is similar to the Neyman-Pearson

1057: criterion, but with an absolute cutoff ($K$) instead of a percentage cutoff

1058: ($FP$).

1059:

1060:

1061: \begin{theorem}

1062:   \label{the:rocch_best}

1063:   For workforce utilization, the {\sc rocch}-hybrid will provide the best set

1064:   of $K$ cases, for any choice of $K$.\\

1065:  \textbf{Proof:} (by construction) The decision criterion is to maximize $TP$

1066:   subject to the constraint:

1067:   \[

1068:   TP \cdot P + FP \cdot N \le K

1069:   \]

1070:   The theorem therefore follows from Corollary~\ref{corollary:linear-constraints}. \EndProof

1071: \end{theorem}

1072:

1073: In fact, many screening problems, such as are found in marketing and

1074: information retrieval, use exactly this linear constraint.  It follows that

1075: for maximizing lift \cite{BerryLinoff:97}, precision, or recall, subject to

1076: absolute or percentage cutoffs on case presentation, the {\sc rocch}-hybrid

1077: will provide the best set of cases.

1078:

1079: As with minimizing expected cost, imprecision in the environment

1080: forces us to favor a \textit{robust} solution for these other

1081: comparison frameworks.  For many real-world problems, the precise

1082: desired cutoff will be unknown or will change (\eg because of

1083: fundamental uncertainty, variability in case difficulty, or competing

1084: responsibilities).  What is worse, for a fixed (absolute) cutoff

1085: merely changing the size of the universe of cases (e.g., the size of

1086: a document corpus) may change the preferred classifier, because it

1087: will change the constraint line.  The {\sc rocch}-hybrid provides a

1088: robust solution because it gives the optimal subset of cases for any

1089: constraint line.  For example, for document retrieval the {\sc

1090: rocch}-hybrid will yield the best $N$ documents for any $N$, for any

1091: prior class distribution (in the target corpus), and for any target

1092: corpus size.

1093:

1094: \subsection{Ranking cases}

1095: \label{sect:ranking-cases}

1096:

1097: An apparent solution to the problem of robust classification is to use a model

1098: that ranks cases, and just work down the ranked list.  This approach appears

1099: to sidestep the brittleness demonstrated with binary classifiers, since the

1100: choice of a cutoff point can be deferred to classification time.  However,

1101: choosing the best ranking model is still problematic.  For most practical

1102: situations, choosing the best ranking model is equivalent to choosing which

1103: classifier is best \emph{for the cutoff that will be used}.

1104:

1105: An example will illustrate this.  Consider two ranking functions, $R_a$ and

1106: $R_b$, applied to a class-balanced set of 100 cases.  Assume $R_a$ is able to

1107: recognize a common aspect unique to positive cases that occurs in 20\% of the

1108: population, and it ranks these highest.  Assume $R_b$ is able to recognize a

1109: common aspect unique to negative cases occurring in 20\% of the population, and it

1110: ranks these lowest.  So $R_a$ ranks the highest 20\% correctly and performs

1111: randomly on the remainder, while $R_b$ ranks the lowest 20\% correctly and

1112: performs randomly on the remainder.  Which model is better?  The answer

1113: depends entirely upon how far down the list the system will go before it

1114: stops; that is, upon what cutoff will be used.  If fewer than 50 cases are to

1115: be selected then $R_a$ should be used, whereas $R_b$ is better if more than 50

1116: cases will be selected.  Figure~\ref{fig:Ranking-models} shows the ROC curves

1117: corresponding to these two classifiers, and the point corresponding to $N=50$

1118: where the curves cross in ROC space.
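
The crossover at $N=50$ can be verified under the assumption that ``performs randomly'' means the remaining cases are ordered uniformly at random, so that expected counts apply.  With 50 positive and 50 negative cases, the expected number of positives among the top $n$ cases, for $20 \le n \le 80$, is
\[
R_a:\; 20 + \frac{30}{80}(n-20), \qquad
R_b:\; \frac{50}{80}\,n.
\]
These are equal when $n = 50$, where each model is expected to select 31.25 positives and 18.75 negatives, that is, $TP = 0.625$ and $FP = 0.375$.  Within this range the first expression is larger for $n < 50$ and the second is larger for $n > 50$.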

1119:

1120: \begin{figure}[tb]

1121:   \begin{center}

1122:     \epsfig{file=Ranking-models.eps ,height=3in}

1123:     \caption{The ROC curves of the two ranking classifiers, $R_a$ and $R_b$,

1124:       described in Section~\ref{sect:ranking-cases}.}

1125:     \label{fig:Ranking-models}

1126:   \end{center}

1127: \end{figure}

1128:

1129: The \rocch\ method can be used to organize such ranking models, as we have

1130: seen.  Recall that ROC curves are formed from case rankings by moving the

1131: cutoff from one extreme to the other (Table~\ref{tab:ROC-alg} shows an

1132: algorithm for calculating the ROC curve from such rankings).  The {\sc

1133:   rocch}-hybrid comprises the ranking models that are best for all possible

1134: conditions.

1135:

1136: \subsection{Whole-curve metrics}

1137:

1138: In situations where either the target cost distribution or class distribution

1139: is \emph{completely} unknown, some researchers advocate choosing the

1140: classifier that maximizes a single-number metric representing the average

1141: performance over the entire curve.  A common whole-curve metric is ``AUC'',

1142: the Area Under the (ROC) Curve \cite{Bradley:97}.  The AUC is equivalent to

1143: the probability that a randomly chosen positive instance will be rated higher

1144: than a randomly chosen negative instance, and thereby is also estimated by the Wilcoxon test

1145: of ranks \cite{HanleyMcNeil:82}.  A criticism of AUC is that for specific

1146: target conditions the classifier with the maximum AUC may be suboptimal

1147: \cite{ProvostFawcettKohavi:98}.  Indeed, this criticism may be made of any

1148: single-number metric.  Fortunately, not only is the \textsc{rocch}-hybrid

1149: optimal for any specific target conditions, it has the maximum

1150: AUC: there is no classifier with AUC larger than that of the {\sc rocch}-hybrid.
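
Because the \rocch\ is piecewise linear, its AUC can be computed with the trapezoid rule.  As a sketch, label the vertices $(FP_1,TP_1),\ldots,(FP_k,TP_k)$ in order of increasing $FP$ (an indexing introduced here), with $(FP_1,TP_1)=(0,0)$ and $(FP_k,TP_k)=(1,1)$:
\[
AUC \; = \; \sum_{i=1}^{k-1} (FP_{i+1} - FP_i) \cdot \frac{TP_i + TP_{i+1}}{2}.
\]
For a hypothetical hull with vertices $(0,0)$, $(0.2,0.6)$ and $(1,1)$, this gives $0.2 \cdot 0.3 + 0.8 \cdot 0.8 = 0.70$.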

1151:

1152: \subsection{Using the ROCCH-hybrid}

1153:

1154: To use the \textsc{rocch}-hybrid for classification, we need to translate

1155: environmental conditions to $x$ values to plug into $\mu(I,x,\mathcal{C})$.

1156: For minimizing expected cost, Equation~\ref{eq:slope} shows how to translate

1157: conditions to $m_{ec}$.  For any $m_{ec}$, by Lemma~3 we want the $FP$ value

1158: of the point where the slope of the {\sc rocch} is $m_{ec}$, which is

1159: straightforward to calculate.  For the Neyman-Pearson criterion the conditions

1160: are defined as $FP$ values.  For workforce utilization with conditions

1161: corresponding to a cutoff $K$, the $FP$ value is found by intersecting the line

1162: $TP \cdot P + FP \cdot N = K$ with the {\sc rocch}.
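
The workforce-utilization case can be worked through explicitly.  As a sketch,
let the relevant hull segment run from $(FP_1, TP_1)$ to $(FP_2, TP_2)$ with
slope $m = (TP_2 - TP_1)/(FP_2 - FP_1)$, so that on this segment
$TP = TP_1 + m\,(FP - FP_1)$.  Substituting into $TP \cdot P + FP \cdot N = K$
and solving for $FP$ gives
\[
  FP \;=\; \frac{K - P\,(TP_1 - m\,FP_1)}{N + m\,P} ,
\]
which is the desired $x$ value provided that $FP_1 \le FP \le FP_2$; otherwise
an adjacent segment must be tried.  With purely illustrative numbers (not taken
from any experiment reported here), $P = 100$, $N = 900$, $K = 100$, and a
segment from $(0.05, 0.50)$ to $(0.15, 0.80)$, we have $m = 3$ and
\[
  FP = \frac{100 - 100\,(0.50 - 0.15)}{900 + 300} \approx 0.054 ,
  \qquad
  TP = 0.50 + 3\,(0.054 - 0.05) \approx 0.51 .
\]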

1163:

1164: We have argued that target conditions (misclassification costs and

1165: class distribution) are rarely known.  It may be confusing that

1166: we now seem to require exact knowledge of these conditions.  The

1167: \textsc{rocch}-hybrid gives us two important capabilities.  First, the

1168: need for precise knowledge of target conditions is deferred until

1169: run time.  Second, in the absence of precise knowledge even at

1170: run time, the system can be optimized easily with minimal feedback.

1171:

1172: By using the \textsc{rocch}-hybrid, information on target conditions is not

1173: needed to train and compare classifiers.  This is important because of

1174: imprecision caused by temporal,

1175: geographic, or other differences that may exist between training and use.

1176: For example, building

1177: a system for a real-world problem introduces a non-trivial delay between the

1178: time data are gathered and the time the learned models will be used.  The

1179: problem is exacerbated in domains where error costs or class distributions

1180: change over time; even with slow drift, a brittle model may become suboptimal

1181: quickly.  In many such scenarios, costs and class distributions can be specified

1182: (or respecified) at run time with reasonable precision by sampling from the

1183: current population, and used to ensure that the {\sc rocch}-hybrid always

1184: performs optimally.

1185:

1186:

1187: In some cases, even at run time these quantities are not known

1188: exactly.  A further benefit of the \textsc{rocch}-hybrid is that it

1189: can be tuned easily to yield optimal performance with only minimal

1190: feedback from the environment.  Conceptually, the {\sc rocch}-hybrid

1191: has one ``knob'' that varies $x$ in $\mu(I,x,\mathcal{C})$ from one

1192: extreme to the other.  For any knob setting, the {\sc rocch}-hybrid

1193: will give the optimal $TP$:$FP$ tradeoff for the target conditions

1194: corresponding to that setting.  Turning the knob to the right

1195: increases $TP$; turning the knob to the left decreases $FP$.  Because

1196: of the monotonicity of the \textsc{rocch}-hybrid, simple hill-climbing

1197: can guarantee optimal performance.  For example, if the system

1198: produces too many false alarms, turn the knob to the left; if the

1199: system is presenting too few cases, turn the knob to the right.

1200:

1201: \subsection{Beating the component classifiers}

1202: \label{sect:beating-the-components}

1203:

1204: Perhaps surprisingly, in many realistic situations an {\sc

1205: rocch}-hybrid system can do \emph{better} than any of its component

1206: classifiers.  Consider the Neyman-Pearson decision criterion.  The

1207: {\sc rocch} may intersect the $FP$-line \textit{above} the highest

1208: component ROC curve.  This occurs when the $FP$-line intersects the

1209: {\sc rocch} between vertices; therefore, there is no component

1210: classifier that actually produces these particular ($FP$,$TP$)

1211: statistics, as in Figure~\ref{fig:ROC-NP}.  By

1212: Theorem~\ref{theorem:rocch-achieves-any-tradeoff}, the {\sc

1213: rocch}-hybrid can achieve any $TP$ on the hull.  Only a small number

1214: of $FP$ values correspond to hull vertices.

1215: The same holds for other common problem formulations, such as workforce

1216: utilization, lift maximization, precision maximization, and recall

1217: maximization.
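
To make the interpolation concrete, consider two adjacent hull vertices
$(FP_a, TP_a)$ and $(FP_b, TP_b)$ and a Neyman-Pearson constraint $FP = x$ with
$FP_a \le x \le FP_b$.  Assuming the hybrid realizes intermediate points by
randomly choosing between the two vertex classifiers (the subscripts $a$ and
$b$, the mixing fraction $k$, and the numbers below are illustrative only), it
applies classifier $b$ to a fraction
\[
  k \;=\; \frac{x - FP_a}{FP_b - FP_a}
\]
of the cases and classifier $a$ to the rest, which yields
\[
  FP = FP_a + k\,(FP_b - FP_a) = x ,
  \qquad
  TP = TP_a + k\,(TP_b - TP_a) .
\]
For vertices $(0.10, 0.60)$ and $(0.30, 0.90)$ and $x = 0.15$, this gives
$k = 0.25$ and $TP = 0.675$, whereas the vertex classifier $a$ itself offers
only $TP = 0.60$ at $FP = 0.10$.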

1218:

1219: \subsection{Time and space efficiency}

1220:

1221: We have argued that the {\sc rocch}-hybrid is robust for a wide variety of

1222: problem formulations.  It is also efficient to build, to store, and to update.

1223:

1224: The time efficiency of building the {\sc rocch}-hybrid depends first

1225: on the efficiency of building the component models, which varies

1226: widely by model type.  Some models built by machine learning methods

1227: can be built in seconds (once data are available).  Hand-built models

1228: can take years to build.  However, we presume that this is work that

1229: would be done anyway.  The {\sc rocch}-hybrid can be built with

1230: whatever methods are available, be there two or two thousand. As

1231: described below, as new classifiers become available, the {\sc

1232: rocch}-hybrid can be updated incrementally.  The time efficiency

1233: depends also on the efficiency of the experimental evaluation of the

1234: classifiers.  Once again, we presume that this is work that would be

1235: done anyway.  Finally, the time efficiency of the {\sc rocch}-hybrid

1236: depends on the efficiency of building the {\sc rocch}, which can be

1237: done in $O(N \log N)$ time using the QuickHull algorithm

1238: \cite{quickhull:96}, where $N$ here is the number of classifiers rather than the number of negative instances.

1239:

1240: The {\sc rocch} is space efficient, too, because it comprises only

1241: classifiers that might be optimal under some target conditions (which

1242: follows directly from Lemmas 1--3 and Definitions 3 and 4).  The

1243: number of classifiers that must be stored can be reduced if bounds can

1244: be placed on the potential target conditions.  As described above,

1245: ranges of conditions define segments of the {\sc rocch}.  Thus, the

1246: {\sc rocch}-hybrid may need only a subset of $\mathcal{C}$.

1247:

1248: Adding new classifiers to the {\sc rocch}-hybrid also is efficient.  Adding a

1249: classifier to the \textsc{rocch} will either (i) extend the hull, adding to

1250: (and possibly subtracting from) the {\sc rocch}-hybrid, or (ii) show that

1251: the new classifier is not superior to the existing classifiers in any

1252: portion of ROC space, in which case it can be discarded.
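
The test behind these two cases is simple.  As a sketch for a classifier
summarized by a single ROC point $(FP_c, TP_c)$, let $(FP_a, TP_a)$ and
$(FP_b, TP_b)$ be the adjacent hull vertices with $FP_a \le FP_c \le FP_b$
(this notation is introduced here only for illustration).  The new point lies
on or below the hull, and can be discarded, exactly when
\[
  TP_c \;\le\; TP_a + \frac{TP_b - TP_a}{FP_b - FP_a}\,(FP_c - FP_a) ;
\]
otherwise it becomes a new vertex, and any existing vertices that fall below
the enlarged hull are removed.  For example, with hull vertices $(0.10, 0.60)$
and $(0.30, 0.90)$, the hull passes through $TP = 0.75$ at $FP = 0.20$, so a
new point $(0.20, 0.70)$ is discarded while a new point $(0.20, 0.80)$ extends
the hull.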

1253:

1254: The run-time (classification) complexity of the {\sc rocch}-hybrid is never

1255: worse than that of the component classifiers.  In situations where run-time

1256: complexity is crucial, the {\sc rocch} should be constructed without

1257: prohibitively expensive classification models.  It then will find the best

1258: subset of the computationally efficient models.

1259:

1260: \section{Empirical demonstration of need}

1261:

1262: Robust classification is of fundamental interest because it

1263: weakens two very strong assumptions: the

1264: availability of precise knowledge of costs and

1265: of class distributions.

1266: However, might it not be that existing classifiers already are robust?

1267: For example, if a given classifier is optimal under one set of

1268: conditions, might it not be optimal under all?

1269:

1270: It is beyond the scope of this paper to offer an in-depth experimental study

1271: answering this question.  However, we can provide solid evidence that the

1272: answer is ``no'' by referring to the results of two prior studies.  One is a

1273: comprehensive ROC analysis of medical domains recently conducted by Andrew

1274: Bradley \citeyear{Bradley:97}.\footnote{Bradley's purpose was not to answer

1275:   this question; fortunately, his published results do anyway.}  The other is a

1276: published ROC analysis of UCI database domains that we undertook last year

1277: with Ron Kohavi \cite{ProvostFawcettKohavi:98}.

1278:

1279: Note that a classifier \textit{dominates} if its ROC curve completely

1280: defines the {\sc rocch} (which means dominating classifiers are robust

1281: and vice versa).  Therefore, if there exist more than a trivially few

1282: domains where no single classifier dominates, then techniques like the {\sc

1283: rocch}-hybrid are essential if robust classifiers are desired.

1284:

1285:

1286: \subsection{Bradley's study}

1287:

1288: Bradley studied six

1289: medical data sets, noting that ``unfortunately, we rarely know what the

1290: individual misclassification costs are.''  He plotted the ROC curves of six

1291: classifier learning algorithms (two neural nets, two decision trees and two

1292: statistical techniques).

1293:

1294:

1295: \begin{figure}[tb]

1296:   \begin{center}

1297:     \epsfig{file=Bradley-HB.eps,height=3in,width=3in}

1298:     \caption{Bradley's classifier results for the heart bleeding data.}

1299:     \label{fig:Bradley-HB}

1300:   \end{center}

1301: \end{figure}

1302:

1303: On \textit{not one} of these data sets was there a dominating

1304: classifier.  This means that for each domain, there exist different

1305: sets of conditions for which different classifiers are preferable.  In

1306: fact, the running example in the present article is based on the three

1307: best classifiers from Bradley's results on the heart bleeding data;

1308: his results for the full set of six classifiers can be found in

1309: Figure~\ref{fig:Bradley-HB}.  Classifiers constructed for the

1310: Cleveland heart disease data are shown in

1311: Figure~\ref{fig:Bradley-Cleveland}.

1312:

1313: Bradley's results show clearly that for many domains the classifier that

1314: maximizes any single metric---be it accuracy, cost, or the area under the ROC

1315: curve---will be the best for some cost and class distributions and will not be

1316: the best for others.  We have shown that the {\sc

1317:   rocch}-hybrid will be the best for all.

1318:

1319: \begin{figure}[tb]

1320:   \begin{center}

1321:     \epsfig{file=Bradley-Cleveland.eps,height=3in,width=3in}

1322:     \caption{Bradley's classifier results for the Cleveland heart disease data}

1323:     \label{fig:Bradley-Cleveland}

1324:   \end{center}

1325: \end{figure}

1326:

1327: \subsection{Our study}

1328: \label{sect:our-study}

1329:

1330: In the study we performed with Ron Kohavi, we chose ten datasets from the UCI

1331: repository, each of which contains at least 250 instances, but for which the

1332: accuracy for decision trees was less than 95\%.  For each domain, we induced

1333: classifiers for the minority class (for Road, we chose the class Grass).  We

1334: selected several induction algorithms from \mlc\ \cite{mlc-new-intro-j}: a

1335: decision tree learner (MC4), Naive Bayes with discretization (NB), $k$-nearest

1336: neighbor for several $k$ values (IB$k$), and Bagged-MC4

1337: \cite{breiman-bagging}.  MC4 is similar to C4.5 \cite{quinlan-c45};

1338: probabilistic predictions are made by using a Laplace correction at the

1339: leaves.  NB discretizes the data based on entropy minimization

1340: \cite{dougherty-kohavi-sahami-disc} and then builds the Naive-Bayes model

1341: \cite{domingos-pazzani-simple-bayes}.  IB$k$ votes the closest $k$ neighbors;

1342: each neighbor votes with a weight equal to one over its distance from the test

1343: instance.

1344:

1345: Some of the ROC curves are shown in Figure~\ref{fig:UCI-ROCs}.  For \emph{only

1346:   one} of these ten domains (Vehicle) was there an absolute dominator.  In

1347: general, very few of the 100 runs performed (on 10 data sets, using 10

1348: cross-validation folds each) had dominating classifiers.  Some cases are very

1349: close, for example Adult and Waveform-21.  In other cases a curve that

1350: dominates in one area of ROC space is dominated in another.  These results

1351: also support the need for methods like the \rocch-hybrid, which produce

1352: robust classifiers.

1353:

1354: \begin{figure}[tb]

1355:   \centerline{%

1356:     \begin{tabular}{c@{\hspace{3pc}}c}

1357:       \epsfig{file=vehicle.eps,height=2.7in,width=2.7in} &

1358:       \epsfig{file=crx.eps,    height=2.7in,width=2.7in}\\

1359:       a.~~Vehicle                        &

1360:       b.~~CRX \\

1361:       \\

1362:       \epsfig{file=roadGrass.eps,height=2.7in,width=2.7in} &

1363:       \epsfig{file=satimage.eps, height=2.7in,width=2.7in}\\

1364:       c.~~RoadGrass                        &

1365:       d.~~Satimage

1366:     \end{tabular}

1367:     }

1368:   \caption{Smoothed ROC curves from UCI database domains}

1369:   \label{fig:UCI-ROCs}

1370: \end{figure}

1371:

1372: \begin{table}[tb]

1373:   \caption{Locally dominating classifiers for four UCI domains}

1374:   \label{tab:convex-hulls}

1375:   \normalsize

1376:   \begin{tabular*}{3.5in}{lll}

1377:       \textbf{Domain} & \textbf{Slope range} & \textbf{Dominator} \\ \hline

1378:       Vehicle         & [0, $\infty$)       & Bagged-MC4\\ \hline

1379:       Road (Grass)    & [0, 0.38]           & NB\\

1380:                       & [0.38, $\infty$)    & Bagged-MC4\\ \hline

1381:       CRX             & [0, 0.03]           & Bagged-MC4\\

1382:                       & [0.03, 0.06]        & NB\\

1383:                       & [0.06, 2.06]        & Bagged-MC4\\

1384:                       & [2.06, $\infty$)    & NB\\ \hline

1385:       Satimage        & [0, 0.05]           & NB \\

1386:                       & [0.05, 0.22]        & Bagged-MC4 \\

1387:                       & [0.22, 2.60]        & IB5 \\

1388:                       & [2.60, 3.11]        & IB3 \\

1389:                       & [3.11, 7.54]        & IB5 \\

1390:                       & [7.54, 31.14]       & IB3 \\

1391:                       & [31.14, $\infty$)   & Bagged-MC4 \\ \hline

1392:   \end{tabular*}

1393: \end{table}

1394:

1395: As examples of what expected-cost-minimizing \textsc{rocch}-hybrids would look

1396: like internally, Table~\ref{tab:convex-hulls} shows the component classifiers

1397: that make up the \rocch\ for the four UCI domains of

1398: Figure~\ref{fig:UCI-ROCs}.  For example, in the Road domain (see

1399: Figure~\ref{fig:UCI-ROCs} and Table~\ref{tab:convex-hulls}), Naive Bayes would

1400: be chosen for any target conditions corresponding to a slope less than $0.38$,

1401: and Bagged-MC4 would be chosen for slopes greater than $0.38$.  They perform

1402: equally well at $0.38$.
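
To connect Table~\ref{tab:convex-hulls} to concrete conditions, assume that
Equation~\ref{eq:slope} takes the usual form of a ratio of expected error
costs,
\[
  m_{ec} \;=\; \frac{c(\YES,\NEG)\,p(\NEG)}{c(\NO,\POS)\,p(\POS)} ,
\]
and consider purely illustrative target conditions (not drawn from the
experiments) in which positives make up 20\% of the population and a false
negative costs ten times as much as a false positive.  Then
\[
  m_{ec} \;=\; \frac{1 \times 0.8}{10 \times 0.2} \;=\; 0.4 .
\]
Reading the table at slope $0.4$, the expected-cost-minimizing hybrid would use
Bagged-MC4 for Vehicle, Road, and CRX, and IB5 for Satimage.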

1403:

1404: \section{Limitations and future work}

1405:

1406: There are limitations to the {\sc rocch} method as we have presented it here.

1407: We have defined it here only for two-class problems.  Srinivasan

1408: \citeyear{Srinivasan:99} shows that it can be extended to multiple dimensions.

1409: It should be noted that the dimensionality of the ``ROC-hyperspace'' grows

1410: quadratically in the number of classes, so both efficiency and visualization

1411: capability are called into question.
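
A quick count shows where the quadratic growth comes from, under the common
convention (assumed here) that each type of misclassification receives its own
axis.  With $n$ classes the confusion matrix has $n^2$ cells, of which $n$ are
correct classifications, leaving
\[
  n^2 - n \;=\; n\,(n-1)
\]
error types: $2$ for $n = 2$, $6$ for $n = 3$, and $90$ for $n = 10$.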

1412:

1413: We have assumed constant error costs for a given \textit{type} of

1414: error, e.g., all false positives cost the same.  For some problems,

1415: different errors of the same type have different costs.  In many

1416: cases, such a problem can be transformed for evaluation into an

1417: equivalent problem with uniform intra-type error costs by duplicating

1418: instances in proportion to their costs (or by simply modifying the

1419: counting procedure accordingly).
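
As one possible reading of the modified counting procedure (the per-instance
cost weights $w_i$ and the sets $\mathcal{N}$ and $\mathcal{F}$ are notation
introduced only for this sketch), let $\mathcal{N}$ be the set of negative
test instances and $\mathcal{F} \subseteq \mathcal{N}$ the negatives that are
classified as positive.  The false positive rate is then computed as a
cost-weighted fraction,
\[
  FP \;=\; \frac{\sum_{i \in \mathcal{F}} w_i}{\sum_{i \in \mathcal{N}} w_i} ,
\]
with an analogous weighted fraction over the positive instances giving $TP$.
With unit weights this reduces to the usual false positive rate.  For example,
for three negatives with weights $1$, $1$, and $4$, of which only the last is
misclassified, $FP = 4/6 \approx 0.67$ rather than $1/3$, which is the same
value obtained by duplicating that instance four times.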

1420:

1421: We also have assumed for this paper that the estimates of the classifiers'

1422: performance statistics ($FP$ and $TP$) are very good.  As mentioned above, much

1423: work has addressed the production of good estimates for simple performance

1424: statistics such as error rate.  Much less work has addressed the production of

1425: good ROC curve estimates.  As with simpler statistics, care should be taken to

1426: avoid over-fitting the training data and to ensure that differences between ROC

1427: curves are meaningful.  One solution is to use cross-validation with averaging

1428: of ROC curves \cite{ProvostFawcettKohavi:98}, which is the procedure used to

1429: produce the ROC curves in Section~\ref{sect:our-study}.  To our knowledge, how

1430: best to produce confidence bands appropriate to a particular problem remains

1431: an open issue.  Those shown in Section~\ref{sect:our-study} are

1432: appropriate for the Neyman-Pearson decision criterion (i.e., they show

1433: confidence on $TP$ for various values of $FP$).

1434:

1435: Also, we have addressed predictive performance and computational

1436: performance.  These are not the only concerns in choosing a

1437: classification model.  What if comprehensibility is important?  The

1438: easy answer is that for any particular setting, the {\sc rocch}-hybrid

1439: is as comprehensible as the underlying model it is using.  However,

1440: this answer falls short if the {\sc rocch}-hybrid is interpolating

1441: between two models or if one wants to understand the

1442: ``multiple-model'' system as a whole.

1443:

1444: Although ROC analysis and the \rocch\ method were specifically designed for

1445: classification domains, we have extended them to \emph{activity monitoring}

1446: domains \cite{FawcettProvost:99}.  Such domains involve monitoring the

1447: behavior of a population of entities for interesting events requiring action.

1448: These problems are substantially different from standard classification because

1449: timeliness of classification is important and dependencies exist among

1450: instances; both factors complicate evaluation.

1451:

1452: This work is fundamentally different from other recent machine

1453: learning work on combining multiple models \cite{AliPazzani:96}.  That work

1454: combines models in order to boost performance for a fixed cost and class

1455: distribution.  The {\sc rocch}-hybrid combines models for robustness across

1456: different cost and class distributions.  In principle, these methods should be

1457: independent---multiple-model classifiers are candidates for extending the {\sc

1458:   rocch}.  However, it may be that some multiple-model classifiers achieve

1459: increased performance for a specific set of conditions by (in effect)

1460: interpolating along edges of the {\sc rocch}.

1461: Cherikh \citeyear{Cherikh-thesis} uses

1462: ROC analysis to study decision making where the decisions of

1463: multiple models are present.  Unlike our work, the goal is to

1464: find optimal combinations of models for specific conditions.

1465: However, it seems that the two methods may be combined profitably:

1466: well-chosen combinations of models

1467: should extend the \rocch, yielding a better robust classifier.

1468:

1469: The \rocch\ method also complements research on cost-sensitive learning

1470: \cite{Turney-cost-bib}.  Existing cost-sensitive learning methods are brittle

1471: with respect to imprecise cost knowledge.  Thus, the \rocch\ is an essential

1472: evaluation tool.  Furthermore, cost-sensitive learning may be used to find

1473: better components for the \rocch-hybrid, by searching explicitly for

1474: classifiers that extend the \rocch.

1475:

1476:

1477:

1478:

1479: \section{Conclusion}

1480:

1481: The ROC convex hull method is a robust, efficient solution to the

1482: problem of comparing multiple classifiers in imprecise and changing

1483: environments.  It is intuitive, can compare classifiers both in general

1484: and under specific distribution assumptions, and provides crisp

1485: visualizations.  It minimizes the management of classifier performance

1486: data, by selecting exactly those classifiers that are potentially

1487: optimal; thus, only these need to be saved in preparation for

1488: changing conditions.  Moreover, due to its incremental nature, new

1489: classifiers can be incorporated easily, \eg when trying a new parameter

1490: setting.

1491:

1492: The {\sc rocch}-hybrid performs optimally under any target conditions

1493: for many realistic problem formulations, including the optimization of

1494: metrics such as accuracy, expected cost, lift, precision, recall, and

1495: workforce utilization.  It is efficient to build in terms of time and

1496: space, and can be updated incrementally.  Furthermore, it can

1497: sometimes classify better than any (other) known model.  Therefore, we

1498: conclude that it is an elegant, robust classification system.

1499:

1500: We believe that this work has important implications for both machine learning

1501: applications and machine learning research \cite{ProvostFawcettKohavi:98}.  For

1502: applications, it helps free system designers from the need to choose (sometimes

1503: arbitrary) comparison metrics before precise knowledge of key evaluation

1504: parameters is available.  Indeed, such knowledge may never be available, yet

1505: robust systems still can be built.

1506:

1507: For machine learning research, it frees researchers from the need to

1508: have precise class and cost distribution information in order to study

1509: important related phenomena.  In particular, work on cost-sensitive

1510: learning has been impeded by the difficulty of specifying costs, and

1511: by the tenuous nature of conclusions based on a single cost metric.

1512: Researchers need not be held back by either.  Cost-sensitive learning

1513: can be studied generally without specifying costs precisely.  The same

1514: goes for research on learning with highly skewed distributions.  Which

1515: methods are effective for which levels of distribution skew?  The

1516: \rocch\ will provide a detailed answer.

1517:

1518: Recently, Drummond and Holte \citeyear{drummondholtekdd:00} have

1519: demonstrated an intriguing dual to the \rocch.  Their ``cost curves''

1520: represent expected costs explicitly, rather than as slopes of

1521: iso-performance lines, and thereby provide an insightful alternative

1522: perspective for visualization.

1523:

1524: Note: An implementation of the \rocch\ method in Perl is publicly available.

1525: The code and related papers may be found at:

1526: \url{http://www.hpl.hp.com/personal/Tom_Fawcett/ROCCH/}.

1527:

1528: \section{Acknowledgments}

1529:

1530: Much of this work was done while the authors were employed at the Bell

1531: Atlantic Science and Technology Center.  We thank the many with whom we have

1532: discussed ROC analysis and classifier comparison, especially Rob Holte, George

1533: John, Ron Kohavi, Ron Rymon, and Peter Turney.  We thank Andrew Bradley for

1534: supplying data from his analysis.

1535:

1536: \bibliographystyle{theapa}

1537: \bibliography{final}

1538: \end{document}

1539: