1: \documentclass[namedreferences]{article}
2: \usepackage{theapa}
3: \usepackage{epsfig} 
4: \usepackage{psfig}
5: \usepackage{xspace}
6: \usepackage{url}
7: 
8: \usepackage{latexsym}           % This gives us the $\Box$ symbol
9: \usepackage{endnotes}           % For notes.
10:      
11: \newcommand{\POS}{\texttt{\bf p}}
12: \newcommand{\NEG}{\texttt{\bf n}}
13: \newcommand{\YES}{\texttt{\bf Y}}
14: \newcommand{\NO}{\texttt{\bf N}}
15: 
16: \newcommand{\rocch}{\textsc{rocch}}
17: 
18: \newcommand{\IF}{\textbf{if~}}
19: \newcommand{\THEN}{\textbf{then~}}
20: \newcommand{\ELSE}{\textbf{else~}}
21: \newcommand{\ENDIF}{\textbf{end if}}
22: \newcommand{\ENDFOR}{\textbf{end for}}
23: \newcommand{\ENDWHILE}{\textbf{end while}}
24: \newcommand{\FOR}{\textbf{for~}}
25: \newcommand{\WHILE}{\textbf{while~}}
26: \newcommand{\DO}{\textbf{do~}}
27: \newcommand{\END}{\textbf{end~}}
28: 
29: \newcommand{\EndProof}{$\Box$}
30: 
31: \newtheorem{theorem}{Theorem}
32: \newtheorem{lemma}[theorem]{Lemma}
33: \newtheorem{corollary}[theorem]{Corollary}
34: \newtheorem{definition}{Definition}
35: 
36: \newcommand{\about}{\symbol{126}}
37: \newcommand{\rem}[1]{\marginpar{\scriptsize $\rightarrow$ \raggedright #1}}
38: 
39: \newcommand{\Partial}[2]{\frac{\partial #1}{\partial #2}}
40: 
41: \newcommand{\mlc}{\ensuremath{\mathcal{MLC\hspace{-.05em}\raisebox{.4ex}{\tiny\bf ++}}}}
42: \def\CC{\mbox{C\hspace{-.05em}\raisebox{.4ex}{\tiny\bf ++}}}
43: 
44: \graphicspath{
45:   {./}
46:   {./Figs/}
47:   }
48: 
49: \setlength{\textwidth}{6.0in}
50: \setlength{\textheight}{9.2in}
51: \setlength{\oddsidemargin}{0.25in}
52: \setlength{\evensidemargin}{0.25in}
53: \setlength{\marginparwidth}{0in}
54: \setlength{\topmargin}{0in}
55: \addtolength{\voffset}{0.0in}
56: \setlength{\hoffset}{-0.25truein}
57: 
58: \newcommand{\eg}{{e.g.},\xspace} 
59: \newcommand{\ie}{{i.e.},\xspace}
60: \newcommand{\etal}{et al.\@\xspace}
61: \newcommand{\legit}{{\footnotesize \  }}
62: \newcommand{\fraud}{{\footnotesize \textsc{bandit}}}
63: 
64: \begin{document}
65: 
66: \centerline{\textbf{\Large Robust Classification for Imprecise Environments}}
67: \vspace{1ex}
68: 
69: \begin{flushleft}
70:   Foster Provost \hfill \texttt{provost@acm.org}\\
71:   \hspace*{.1in}\textit{New York University, New York, NY 10012}\\
72:   Tom Fawcett \hfill \texttt{tfawcett@acm.org}\\
73:   \hspace*{.1in}\textit{Hewlett-Packard Laboratories, Palo Alto, CA 94304}\\
74:   \vspace*{.2in}
75: \end{flushleft}
76: 
77: \begin{abstract}
78:   In real-world environments it usually is difficult to specify target
79:   operating conditions precisely, for example, target misclassification costs.
80:   This uncertainty makes building robust classification systems problematic.
81:   We show that it is possible to build a hybrid classifier that will perform
82:   at least as well as the best available classifier for any target conditions.
83:   In some cases, the performance of the hybrid actually can surpass that of
84:   the best known classifier.  This robust performance extends across a wide
85:   variety of comparison frameworks, including the optimization of metrics such
86:   as accuracy, expected cost, lift, precision, recall, and workforce
87:   utilization.  The hybrid also is efficient to build, to store, and to
88:   update.  The hybrid is based on a method for the comparison of classifier
89:   performance that is robust to imprecise class distributions and
90:   misclassification costs.  The ROC convex hull (\rocch) method combines
91:   techniques from ROC analysis, decision analysis and computational geometry,
92:   and adapts them to the particulars of analyzing learned classifiers.  The
93:   method is efficient and incremental, minimizes the management of classifier
94:   performance data, and allows for clear visual comparisons and sensitivity
95:   analyses.  Finally, we point to empirical evidence that a robust hybrid
96:   classifier indeed is needed for many real-world problems.
97: \end{abstract}  
98: 
99: \begin{flushleft}
100:   \textbf{Keywords:} classification, learning, uncertainty, evaluation,
101:   comparison, multiple models, cost-sensitive learning, skewed distributions\\
102: 
103:   \vspace*{.1in}
104:   \textbf{\large To appear in \emph{Machine Learning Journal}}
105: 
106: \end{flushleft}
107: 
108: \vspace{.1in}
109: 
110: \section{Introduction}
111: 
112: Traditionally, classification systems have been built by experimenting with
113: many different classifiers, comparing their performance and choosing the best.
114: Experimenting with different induction algorithms, parameter settings, and
115: training regimes yields a large number of classifiers to be evaluated and
116: compared.  Unfortunately, comparison often is difficult in real-world
117: environments because key parameters of the target environment are not known.
118: The optimal cost/benefit tradeoffs and the target class priors seldom are
119: known precisely, and often are subject to change
120: \cite{ZahaviLevin:1997:issues_probl_applying_neural_comput,FriedmanWyatt:97,KlinkenbergJoachims:2000}.
121: For example, in fraud detection we cannot ignore misclassification costs or
122: the skewed class distribution, nor can we assume that our estimates are
123: precise or static \cite{FawcettProvost:97}.  We need a method for the
124: management, comparison, and application of multiple classifiers that is robust
125: in imprecise and changing environments.
126: 
127: We describe the \textit{ROC convex hull} (\rocch) method, which combines
128: techniques from ROC analysis, decision analysis and computational geometry.
129: The ROC convex hull decouples classifier performance from specific class and
130: cost distributions, and may be used to specify the subset of methods that are
131: potentially optimal under any combination of cost assumptions and class
132: distribution assumptions.  The \rocch\ method is efficient, so it facilitates
133: the comparison of a large number of classifiers.  It minimizes the management
134: of classifier performance data because it can specify exactly those
135: classifiers that are potentially optimal, and it is incremental, easily
136: incorporating new and varied classifiers without having to reevaluate all
137: prior classifiers.
138: 
139: We demonstrate that it is possible and desirable to avoid complete commitment
140: to a single best classifier during system construction.  Instead, the \rocch\ 
141: can be used to build from the available classifiers a hybrid classification
142: system that will perform best under any target cost/benefit and class
143: distributions.  Target conditions can then be specified at run time.
144: Moreover, in cases where precise information is still unavailable when the
145: system is run (or if the conditions change dynamically during operation), the
146: hybrid system can be tuned easily (and optimally) based on feedback from its
147: actual performance.
148: 
149: The paper is structured as follows.  First we sketch briefly the traditional
150: approach to building such systems, in order to demonstrate that it is brittle
151: under the types of imprecision common in real-world problems.  We then
152: introduce and describe the \rocch\ and its properties for comparing and
153: visualizing classifier performance in imprecise environments.  In the
154: following sections we formalize the notion of a robust classification system,
155: and show that the \rocch\ is an elegant method for constructing one
156: automatically.  The solution is elegant because the resulting hybrid
157: classifier is robust for a wide variety of problem formulations, including the
158: optimization of metrics such as accuracy, expected cost, lift, precision,
159: recall, and workforce utilization, and it is efficient to build, to store, and
160: to update.  We then show that the hybrid actually can do better than the best
161: known classifier in certain situations.  Finally, by citing results from
162: empirical studies, we provide evidence that this type of system indeed is
163: needed.
164: 
165: \subsection{An example}
166: 
167: A systems-building team wants to create a system that will take a
168: large number of instances and identify those for which an action
169: should be taken.  The instances could be potential cases of fraudulent
170: account behavior, of faulty equipment, of responsive customers, of
171: interesting science, etc.  We consider problems for which the best
172: method for classifying or ranking instances is not well defined, so
173: the system builders may consider machine learning methods, neural
174: networks, case-based systems, and hand-crafted knowledge bases as
175: potential classification models.  Ignoring for the moment issues of
176: efficiency, the foremost question facing the system builders is: which
177: of the available models performs ``best'' at classification?
178: 
179: Traditionally, an experimental approach has been taken to answer this question,
180: because the distribution of instances can be sampled if it is not known a
181: priori.  The standard approach is to estimate the error rate of each model
182: statistically and then to choose the model with the lowest error rate.  This
183: strategy is common in machine learning, pattern recognition, data mining,
184: expert systems and medical diagnosis.  In some cases, other measures such as
185: cost or benefit are used as well.  Applied statistics provides methods such as
186: cross-validation and the bootstrap for estimating model error rates, and recent
187: studies have compared the effectiveness of different methods
188: \cite{Dietterich:98,kohavi-accest,Salzberg:97}.
189: 
190: Unfortunately, this experimental approach is brittle under two types
191: of imprecision that are common in real-world environments.
192: Specifically, costs and benefits usually are not known precisely, and
193: target (prior) class distributions often are known only approximately
194: as well.  This observation has been made by many authors
195: \cite{Bradley:97,Catlett:95,ProvostFawcett:97}, and is in fact the
196: concern of a large subfield of decision analysis
197: \cite{WeinsteinFineberg:80}.  Imprecision also arises because the
198: environment may change between the time the system is conceived and
199: the time it is used, and even as it is used.  For example, levels of
200: fraud and levels of customer responsiveness change continually over
201: time and from place to place.
202: 
203: \subsection{Basic terminology}
204: 
205: \begin{figure}[tb]
206:   \begin{center}
207:     \epsfig{file=NeymanPearson.eps,height=3in}
208:     \caption{Three classifiers under three different Neyman-Pearson decision
209:       criteria} 
210:     \label{fig:NP}
211:   \end{center}
212: \end{figure}
213: 
214: In this paper we address two-class problems.  Formally, each instance
215: $I$ is mapped to one element of the set $\{\POS,\NEG\}$ of (correct)
216: positive and negative classes.  A \emph{classification model} (or
217: \emph{classifier}) is a mapping from instances to predicted classes.
218: Some classification models produce a continuous output (\eg an
219: estimate of an instance's class membership probability) to which
220: different thresholds may be applied to predict class membership.  To
221: distinguish between the actual class and the predicted class of an
222: instance, we will use the labels $\{\YES,\NO\}$ for the
223: classifications produced by a model.  For our discussion, let
224: $c(\textit{classification}, \textit{class})$ be a two-place error cost
225: function where $c(\YES,\NEG)$ is the cost of a false positive error
226: and $c(\NO,\POS)$ is the cost of a false negative error.\footnote{For
227: this paper, we consider error costs to include benefits not realized,
228: and ignore the costs of correct classifications.}
229: We represent class distributions by the classes' prior probabilities
230: $p(\POS)$ and $p(\NEG) = 1 - p(\POS)$.
231: 
232: 
233: The true positive rate, or hit rate, of a classifier is:
234: \begin{displaymath}
235:   TP = p(\YES|\POS) \approx \frac{\rm positives\: correctly\: classified}
236:                    {\rm total\: positives}
237: \end{displaymath}
238: The false positive rate, or false alarm rate, of a classifier is:
239: \begin{displaymath}
240:   FP = p(\YES|\NEG) \approx \frac{\rm negatives\: incorrectly\: classified}
241:                    {\rm total\: negatives}
242: \end{displaymath}
243: 
244: 
245: The traditional experimental approach is brittle because it chooses
246: one model as ``best'' with respect to a specific set of cost functions
247: and class distribution.  If the target conditions change, this system
248: may no longer perform optimally, or even acceptably.  As an example,
249: assume that we have a maximum false positive rate, $FP$, that must not
250: be exceeded.  We want to find the classifier with the highest possible
251: true positive rate, $TP$, that does not exceed the $FP$ limit.  This
252: is the Neyman-Pearson decision criterion \cite{Egan:75}.  Three
253: classifiers, under three such $FP$ limits, are shown in
254: figure~\ref{fig:NP}.  A different classifier is best for each $FP$
255: limit; any system built with a single ``best'' classifier is brittle
256: if the $FP$ requirement can change.
257: 
258: \section{Evaluating and visualizing classifier performance}
259: 
260: \subsection{Classifier comparison: decision analysis and ROC analysis}
261: 
262: Most prior work on building classifiers uses classification accuracy (or,
263: equivalently, undifferentiated error rate) as the primary evaluation metric.
264: The use of accuracy assumes that the class priors in the target environment
265: will be \textit{constant and relatively balanced}.  In the real world this
266: rarely is the case.  Classifiers often are used to sift through a large
267: population of normal or uninteresting entities in order to find a relatively
268: small number of unusual ones; for example, looking for defrauded accounts
269: among a large population of customers, screening medical tests for rare
270: diseases, and checking an assembly line for defective parts.  Because the
271: unusual or interesting class is rare among the general population, the class
272: distribution is very skewed
273: \cite{EzawaEtal:96,FawcettProvost:96,FawcettProvost:97,KubatHolteMatwin:98,SaittaNeri:98}.
274: 
275: As the class distribution becomes more skewed, evaluation based on accuracy
276: breaks down.  Consider a domain where the classes appear in a 999:1 ratio.  A
277: simple rule---always classify as the maximum likelihood class---gives a 99.9\%
278: accuracy.  This accuracy may be quite difficult for an induction algorithm
279: to beat, though the simple rule presumably is unacceptable if a non-trivial
280: solution is sought.  Skews of $10^2$ are common in fraud detection and skews
281: exceeding $10^6$ have been reported in other applications
282: \cite{ClearwaterStern:91}.
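
To make the arithmetic explicit (a worked illustration in the notation of
this paper, assuming that the rare class is the positive class), accuracy can
be written in terms of the class priors and the two rates:
\[
\textrm{Accuracy} = p(\POS)\cdot TP \; + \; p(\NEG)\cdot (1 - FP)
\]
The simple rule never predicts the positive class, so $TP = 0$ and $FP = 0$,
and with a 999:1 ratio
\[
\textrm{Accuracy} = \frac{1}{1000}\cdot 0 \; + \; \frac{999}{1000}\cdot 1 = 0.999
\]
even though the rule identifies no positive instances at all.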
283: 
284: Evaluation by classification accuracy also assumes \textit{equal error costs}:
285: $c(\YES,\NEG)=c(\NO,\POS)$.  In the real world classifications lead to
286: actions, which have consequences.  Actions can be as diverse as denying a
287: credit charge, discarding a manufactured part, moving a control surface on an
288: airplane, or informing a patient of a cancer diagnosis.  The consequences may
289: be grave, and performing an incorrect action may be very costly.  Rarely are
290: the costs of mistakes equivalent.  In mushroom classification, for example,
291: judging a poisonous mushroom to be edible is far worse than judging an edible
292: mushroom to be poisonous.  Indeed, it is hard to imagine a domain in which a
293: classification system may be indifferent to whether it makes a false positive
294: or a false negative error.  In such cases, accuracy maximization should be
295: replaced with cost minimization.
296: 
297: The problems of unequal error costs and uneven class distributions are
298: related.  It has been suggested that, for training, high-cost
299: instances can be compensated for by increasing their prevalence in an
300: instance set \cite{bre84}.  Unfortunately, little work has been
301: published on either problem.  There exist several dozen articles in
302: which techniques for cost-sensitive learning are suggested
303: \cite{Turney-cost-bib}, but few studies evaluate and compare them
304: \cite{Domingos:99,pazzani-cost:94,ProvostFawcettKohavi:98}.  The
305: literature provides even less guidance in situations where
306: distributions are imprecise or can change.
307: 
308: \begin{figure}[tb]
309:   \begin{center}
310:     \epsfig{file=ROC-curves.eps,height=3in,width=3.2in}
311:     \caption{ROC graph of three classifiers}
312:     \label{fig:ROC-curves}
313:   \end{center}
314: \end{figure}
315: 
316: Given an estimate of $p(\POS|I)$, the posterior probability of an instance's
317: class membership, decision analysis gives us a way to produce cost-sensitive
318: classifications \cite{WeinsteinFineberg:80}.  Classifier error frequencies can
319: be used to approximate such probabilities \cite{pazzani-cost:94}.  For an
320: instance $I$, the decision to emit a positive classification from a particular
321: classifier is:
322: 
323: \[
324: [1-p(\POS|I)] \cdot c(\YES,\NEG) \; < \; p(\POS|I) \cdot c(\NO,\POS)
325: \]
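
Rearranging this inequality gives an equivalent threshold on the posterior (a
restatement added here for illustration; it assumes that both error costs are
positive):
\[
p(\POS|I) \; > \; \frac{c(\YES,\NEG)}{c(\YES,\NEG) + c(\NO,\POS)}
\]
For instance, if a false negative were 25 times as costly as a false
positive, the threshold would be $\frac{1}{26} \approx 0.038$, so even
instances with a small estimated probability of being positive would be
classified as positive.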
326: 
327: Regardless of whether a classifier produces probabilistic or binary
328: classifications, its normalized cost on a test set can be evaluated 
329: empirically as:
330: \[
331: \textrm{Cost} = FP\cdot c(\YES,\NEG) + (1 - TP)\cdot c(\NO,\POS)
332: \]
333: Most published work on cost-sensitive classification uses an equation such as
334: this to rank classifiers.  Given a set of classifiers, a set of examples, and a
335: precise cost function, each classifier's cost is computed and the minimum-cost
336: classifier is chosen.  However, as discussed above, such analyses assume that
337: the distributions are precisely known and static.
338:   
339: More general comparisons can be made with Receiver Operating Characteristic
340: (ROC) analysis, a classic methodology from signal detection theory that is
341: common in medical diagnosis and has recently begun to be used more generally
342: in AI classifier work
343: \cite{Beck-Schultz:86,Egan:75,Swets:88,FriedmanWyatt:97}.  ROC graphs depict
344: tradeoffs between hit rate and false alarm rate.
345: 
346: We use the term \textit{ROC space} to denote the coordinate system used for
347: visualizing classifier performance.  In ROC space, $TP$ is represented on the Y
348: axis and $FP$ is represented on the X axis.  Each classifier is represented by
349: the point in ROC space corresponding to its $(FP,TP)$ pair.  For models that
350: produce a continuous output, e.g., posterior probabilities, $TP$ and $FP$ vary
351: together as a threshold on the output is varied between its extremes (each
352: threshold defines a classifier); the resulting curve is called the ROC curve.
353: An ROC curve illustrates the error tradeoffs available with a given model.
354: Figure~\ref{fig:ROC-curves} shows a graph of three typical ROC curves; in fact,
355: these are the complete ROC curves of the classifiers shown in
356: figure~\ref{fig:NP}.
357: 
358: 
359: For orientation, several points on an ROC graph should be noted.  The lower
360: left point $(0,0)$ represents the strategy of never alarming, the upper right
361: point $(1,1)$ represents the strategy of always alarming, the point $(0,1)$
362: represents perfect classification, and the line $y=x$ (not shown) represents
363: the strategy of randomly guessing the class.  Informally, one point in ROC
364: space is better than another if it is to the northwest ($TP$ is higher, $FP$ is
365: lower, or both).  An ROC graph allows an informal visual comparison of a set of
366: classifiers.  
367: 
368: 
369: 
370: 
371: ROC graphs illustrate the behavior of a classifier \emph{without
372: regard to class distribution or error cost}, and so they decouple
373: classification performance from these factors.  Unfortunately, while
374: an ROC graph is a valuable visualization technique, it does a poor job
375: of aiding the choice of classifiers.  Only when one classifier clearly
376: dominates another over the entire performance space can it be declared
377: better.  
378: 
379: 
380: \subsection{The ROC Convex Hull method}
381: 
382: In this section we combine decision analysis with ROC analysis and adapt them
383: for comparing the performance of a set of learned classifiers.  The method is
384: based on three high-level principles.  First, ROC space is used to separate
385: classification performance from class and cost distribution information.
386: Second, decision-analytic information is projected onto the ROC space.  Third,
387: the convex hull in ROC space is used to identify the subset of classifiers
388: that are potentially optimal.
389: 
390: 
391: \begin{figure}[tb]
392:   \centering
393:   \epsfig{file=ROC2.eps}
394:   \caption{The ROC convex hull identifies potentially optimal classifiers.}
395:   \label{fig:ROC-hull}
396: \end{figure}
397: 
398: \subsubsection{Iso-performance lines}
399: 
400: By separating classification performance from class and cost distribution
401: assumptions, the decision goal can be projected onto ROC space for a neat
402: visualization.  Specifically, the expected cost of applying the classifier
403: represented by a point ($FP$,$TP$) in ROC space is:
404: 
405: 
406: \[
407: p(\POS)\cdot (1-TP)\cdot c(\NO,\POS) \; + \; p(\NEG)\cdot FP \cdot c(\YES,\NEG)
408: \]
409: 
410: Therefore, two points, ($FP_1$,$TP_1$) and ($FP_2$,$TP_2$),
411: have the same performance if
412: 
413: \[
414: \frac{TP_2 - TP_1}{FP_2 - FP_1} 
415: = 
416: \frac{c(\YES,\NEG)p(\NEG)}{c(\NO,\POS)p(\POS)}
417: \]
418: 
419: 
420: This equation defines the slope of an \textit{iso-performance line}.
421: That is, all classifiers corresponding to points on the line have the
422: same expected cost.  Each set of class and cost distributions defines
423: a family of iso-performance lines.  Lines ``more northwest'' (having a
424: larger $TP$-intercept) are better because they correspond to
425: classifiers with lower expected cost.
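
The slope follows from setting the difference in expected cost between the
two points to zero (a short derivation, included for completeness; the
symbols $m$ and $b$ below are our own shorthand):
\[
p(\POS)\, c(\NO,\POS)\, [(1-TP_1) - (1-TP_2)] \; + \;
p(\NEG)\, c(\YES,\NEG)\, (FP_1 - FP_2) = 0
\]
Collecting terms gives
\[
p(\POS)\, c(\NO,\POS)\, (TP_2 - TP_1) = p(\NEG)\, c(\YES,\NEG)\, (FP_2 - FP_1)
\]
and dividing both sides by $p(\POS)\, c(\NO,\POS)\, (FP_2 - FP_1)$ yields the
slope above.  Writing that slope as $m$ and substituting a line
$TP = m \cdot FP + b$ into the expected cost expression gives
\[
p(\POS)\, c(\NO,\POS)\, (1 - b)
\]
which depends only on the intercept $b$ and decreases as $b$ grows.  This is
why lines with a larger $TP$-intercept correspond to lower expected cost.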
426: 
427: \subsubsection{The ROC convex hull}
428: 
429: Because in most real-world cases the target distributions are not known
430: precisely, it is valuable to be able to identify those classifiers that
431: potentially are optimal.  Each possible set of distributions defines a family
432: of iso-performance lines, and for a given family, the optimal methods are
433: those that lie on the ``most-northwest'' iso-performance line.  Thus, a
434: classifier is optimal for some conditions if and only if it lies on the
435: northwest boundary (\ie above the line $y=x$) of the convex hull
436: \cite{quickhull:96} of the set of points in ROC space.\footnote{The convex
437:   hull of a set of points is the smallest convex set that contains the
438:   points.}  We discuss this in detail in Section~\ref{sect:rocch-hybrid}.
439: 
440: 
441: \begin{figure}[tb]
442:   \centering
443:   \epsfig{file=ROC3.eps}
444:   \caption{Lines $\alpha$ and $\beta$ show the optimal classifier under
445:     different sets of conditions.}
446:   \label{fig:ROC-hull2}
447: \end{figure}
448: 
449: We call the convex hull of the set of points in ROC space the \textit{ROC
450: convex hull} (\rocch) of the corresponding set of classifiers.
451: Figure~\ref{fig:ROC-hull} shows four ROC curves with the ROC convex hull drawn
452: as the border between the shaded and unshaded areas.  $\mathsf{D}$ is clearly
453: not optimal.  Perhaps surprisingly, $\mathsf{B}$ can never be optimal either
454: because none of the points of its ROC curve lies on the convex hull.  We can
455: also remove from consideration any points of $\mathsf{A}$ and $\mathsf{C}$
456: that do not lie on the hull.
457: 
458: Consider these classifiers under two distribution scenarios.  In each, negative
459: examples outnumber positives by 5:1.  In scenario $\mathcal{A}$, false
460: positive and false negative errors have equal cost.  In scenario $\mathcal{B}$,
461: a false negative is 25 times as expensive as a false positive (\eg missing a
462: case of fraud is much worse than a false alarm).  Each scenario defines a
463: family of iso-performance lines.  The lines corresponding to scenario
464: $\mathcal{A}$ have slope 5; those for $\mathcal{B}$ have slope $\frac{1}{5}$.
465: Figure~\ref{fig:ROC-hull2} shows the convex hull and two iso-performance
466: lines, $\alpha$ and $\beta$.  Line $\alpha$ is the ``best'' line
467: with slope $5$ that intersects the convex hull; line $\beta$ is the best line
468: with slope $\frac{1}{5}$ that intersects the convex hull.  Each line
469: identifies the optimal classifier under the given distribution.
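
Substituting into the slope formula makes these values explicit (the symbols
$m_{\mathcal{A}}$ and $m_{\mathcal{B}}$ are our shorthand for the two
slopes).  With $p(\NEG)/p(\POS) = 5$ in both scenarios:
\[
m_{\mathcal{A}} = \frac{c(\YES,\NEG)}{c(\NO,\POS)} \cdot
                  \frac{p(\NEG)}{p(\POS)} = 1 \cdot 5 = 5,
\qquad
m_{\mathcal{B}} = \frac{1}{25} \cdot 5 = \frac{1}{5}
\]
The steeper lines of scenario $\mathcal{A}$ touch the hull at a point with a
lower false positive rate, whereas the shallower lines of scenario
$\mathcal{B}$ touch it further toward the upper right, where the true
positive rate is higher.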
470: 
471: \begin{figure}[tb]
472:   \begin{center}
473:     \epsfig{file=ROC-hull.eps,height=3in,width=3.2in}
474:     \caption{ROC curves with convex hull}
475:     \label{fig:ROCCH}
476:   \end{center}
477: \end{figure}
478: 
479: 
480: Figure~\ref{fig:ROCCH} shows the three ROC curves from our initial
481: example, with the convex hull drawn.
482: 
483: 
484: \subsubsection{Generating the ROC Convex Hull}
485: 
486: The \textit{ROC convex hull method} selects the potentially optimal classifiers
487: based on the ROC convex hull and iso-performance lines.
488: 
489: \begin{table}[tb]
490:   \caption{Algorithm for generating an ROC curve from a set of 
491:     ranked examples.}
492:   \begin{center}
493:     \rule{\textwidth}{.01in}
494:     \begin{tabbing}
495:       \textbf{\rmfamily Given:}~~ \=E: \= List of \=tuples
496:       $\langle I, p \rangle$ where:\\
497:       \>\>\>$I$: labeled example\\
498:       \>\>\>$p$: numeric ranking assigned to $I$ by the classifier \\
499:       \>$P, N$: count of positive and negative examples in E, respectively.\\
500:       \textbf{\rmfamily Output:}  R: List of points on the ROC curve.\\
501:       \vspace*{1ex}\\
502:       xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=\kill
503:       $Tcount = 0$; \>\>\>\>\>\>{\it /* current TP tally */ }\\
504:       $Fcount = 0$; \>\>\>\>\>\>{\it /* current FP tally */ }\\
505:       $plast = -\infty$; \>\>\>\>\>\>{\it /* last score seen */ }\\
506:       $R = \langle \rangle$; \>\>\>\>\>\>{\it /* list of ROC points */ }\\
507:       sort $E$ in decreasing order by $p$ values;\\
508:       \WHILE (E $\neq \emptyset$) \DO \\
509:       \>remove tuple $\langle I, p \rangle$ from head of E;\\
510:       \>\IF ($p \neq plast$) \THEN\\
511:       \>\>add point ($\frac{Fcount}{N}$, $\frac{Tcount}{P}$) to end of R;\\
512:       \>\>$plast = p$;\\
513:       \>\ENDIF\\
514:       \>\IF ($I$ is a positive example) \THEN\\
515:       \>\>$Tcount = Tcount + 1$;\\
516:       \>\ELSE \>\>\>\>\>{\it /* I is a negative example */}\\
517:       \>\>$Fcount = Fcount + 1$;\\
518:       \>\ENDIF\\
519:       \ENDWHILE\\
520:       add point ($\frac{Fcount}{N}$, $\frac{Tcount}{P}$) to end of R;\\
521:     \end{tabbing}
522:     \rule{\textwidth}{.01in}
523:   \end{center}
524:   \label{tab:ROC-alg}
525: \end{table}
526: 
527: \begin{enumerate}
528:   
529: \item For each classifier, plot $TP$ and $FP$ in ROC space.  For
530: continuous-output classifiers, vary a threshold over the output range
531: and plot the ROC curve.  Table~\ref{tab:ROC-alg} shows an algorithm
532: for producing such an ROC curve in a single pass.\footnote{There is a
533: subtle complication to producing ROC curves from ranked test-set data,
534: which is reflected in the algorithm shown in Table~\ref{tab:ROC-alg}.
535: Specifically, consecutive examples with the same score can give overly
536: optimistic or overly pessimistic ROC curves, depending on the ordering
537: of positive and negative examples.  The ROC curve generating algorithm
538: shown here waits until all examples with the same score have been
539: tallied before computing the next point of the ROC curve.  The result
540: is a segment that bisects the area that would have resulted from the
541: most optimistic and most pessimistic orderings.}
542:   
543: \item Find the convex hull of the set of points representing the predictive
544:   behavior of all classifiers of interest, for example by using the QuickHull
545:   algorithm \cite{quickhull:96}.
546:   
547: \item For each set of class and cost distributions of interest, find the slope
548:   (or range of slopes) of the corresponding iso-performance lines.
549:   
550: \item For each set of class and cost distributions, the optimal classifier will
551:   be the point on the convex hull that intersects the iso-performance line with
552:   largest $TP$-intercept.  Ranges of slopes specify hull segments.
553: 
554: \end{enumerate}
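
As an illustration of the algorithm in Table~\ref{tab:ROC-alg}, consider a
hypothetical test set (our own, chosen only to show the handling of tied
scores) with $P = 2$ and $N = 2$.  Sorted by decreasing score, the examples
are a positive with score $0.9$, a positive with score $0.8$, a negative with
score $0.8$, and a negative with score $0.6$.  The first score differs from
the initial $plast$, so the point $(0,0)$ is added before $Tcount$ becomes 1.
The score $0.8$ differs from $0.9$, so the point $(0,\frac{1}{2})$ is added
before $Tcount$ becomes 2.  The third example has the same score as the
second, so no point is added and $Fcount$ becomes 1.  The score $0.6$ differs
from $0.8$, so the point $(\frac{1}{2},1)$ is added before $Fcount$ becomes 2.
After the loop the final point $(1,1)$ is added, giving
\[
R = \langle (0,0),\; (0,{\textstyle\frac{1}{2}}),\;
            ({\textstyle\frac{1}{2}},1),\; (1,1) \rangle
\]
The two tied examples produce a single diagonal segment from
$(0,\frac{1}{2})$ to $(\frac{1}{2},1)$.  Had each tied example produced its
own point, the most optimistic ordering would have passed through $(0,1)$ and
the most pessimistic through $(\frac{1}{2},\frac{1}{2})$.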
555: 
556: 
557: Figures~\ref{fig:ROC-hull} and \ref{fig:ROC-hull2} demonstrate how the
558: subset of classifiers that are potentially optimal can be identified
559: and how classifiers can be compared under different cost and class
560: distributions.  
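
Step 4 can be made concrete with a small numerical sketch.  The hull vertices
below are hypothetical (they are not read from the figures), and $m$ and $b$
are our shorthand for the slope and $TP$-intercept of an iso-performance
line.  The line of slope $m$ through a point $(FP,TP)$ has intercept
$b = TP - m \cdot FP$, so the optimal vertex is the one that maximizes $b$:
\[
\begin{array}{c|cccc}
(FP,TP) & (0,0) & (0.1,0.6) & (0.4,0.9) & (1,1) \\ \hline
b \mbox{ for } m = 5 & 0 & 0.1 & -1.1 & -4 \\
b \mbox{ for } m = \frac{1}{5} & 0 & 0.58 & 0.82 & 0.8
\end{array}
\]
For slope $5$ the vertex $(0.1,0.6)$ is optimal, and for slope $\frac{1}{5}$
the vertex $(0.4,0.9)$ is optimal.  More generally, a vertex is optimal for
every slope between the slopes of its two adjacent hull segments; here
$(0.1,0.6)$ is optimal for slopes between $1$ and $6$, and $(0.4,0.9)$ for
slopes between $\frac{1}{6}$ and $1$.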
561: 
562: \subsubsection{Comparing a variety of classifiers}
563: 
564: The ROC convex hull method accommodates both binary and continuous
565: classifiers.  Binary classifiers are represented by individual points in ROC
566: space.  Continuous classifiers produce numeric outputs to which thresholds can
567: be applied, yielding a series of $(FP, TP)$ pairs forming an ROC curve.  Each
568: point may or may not contribute to the ROC convex hull.
569: Figure~\ref{fig:Adding-EFG} depicts the binary classifiers $\mathsf{E}$,
570: $\mathsf{F}$ and $\mathsf{G}$ added to the previous hull.  $\mathsf{E}$ may be
571: optimal under some circumstances because it extends the convex hull.
572: Classifiers $\mathsf{F}$ and $\mathsf{G}$ never will be optimal because they
573: do not extend the hull.
574: 
575: \begin{figure}[tb]
576:   \centering \epsfig{file=Adding-classifiers.eps,height=3in}
577:   \caption{Classifier $\mathsf{E}$ may be optimal for some conditions because
578:     it extends the ROC convex hull.  $\mathsf{F}$ and $\mathsf{G}$ cannot be
579:     optimal because they are not on the hull, nor do they extend it.}
580:   \label{fig:Adding-EFG}
581: \end{figure}
582: 
583: New classifiers can be added incrementally to an \rocch\ analysis, as
584: demonstrated in figure~\ref{fig:Adding-EFG} by the addition of classifiers
585: $\mathsf{E}$, $\mathsf{F}$, and $\mathsf{G}$.  Each new classifier either
586: extends the existing hull or it does not.  In the former case the hull must be
587: updated accordingly, but in the latter case the new classifier can be ignored.
588: Therefore, the method does not require saving every classifier (or saving
589: statistics on every classifier) for re-analysis under different
590: conditions---only those points on the convex hull.  Recall that each point is
591: a classifier and might take up considerable space.  Further, the management of
592: knowledge about many classifiers and their statistics from many different runs
593: of learning programs (e.g., with different algorithms or parameter settings)
594: can be a substantial undertaking.  Classifiers not on the \rocch\ can never be
595: optimal, so they need not be saved.  Every classifier that \emph{does} lie on
596: the convex hull must be saved.  In Section~\ref{sect:our-study} we demonstrate
597: the \rocch\ in use, managing the results of many learning experiments.
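
The incremental test can be sketched numerically.  Assume, purely for
illustration, a hull with vertices $(0,0)$, $(0.1,0.6)$, $(0.4,0.9)$ and
$(1,1)$, where the trivial classifiers $(0,0)$ and $(1,1)$ are taken to be
part of the hull.  A new point extends the hull exactly when it lies strictly
above the hull boundary at its own $FP$ value.  The segment joining
$(0.1,0.6)$ and $(0.4,0.9)$ has slope $1$, so at $FP = 0.2$ the boundary is at
\[
TP = 0.6 + 1 \cdot (0.2 - 0.1) = 0.7
\]
A new classifier at $(0.2,0.65)$ lies below this value and can be discarded,
whereas one at $(0.2,0.8)$ lies above it and becomes a new hull vertex.  In
the latter case the segment slopes become $6$, $2$, $\frac{1}{2}$ and
$\frac{1}{6}$, which are still decreasing, so no existing vertex is
displaced.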
598: 
599: \subsubsection{Changing distributions and costs}
600: 
601: Class and cost distributions that change over time necessitate the reevaluation
602: of classifier choice.  In fraud detection, costs change based on workforce and
603: reimbursement issues; the amount of fraud changes monthly.  With the ROC convex
604: hull method, comparing under a new distribution involves only calculating the
605: slope(s) of the corresponding iso-performance lines and intersecting them with
606: the hull, as shown in figure~\ref{fig:ROC-hull2}.
607: 
608: The ROC convex hull method scales gracefully to any degree of
609: precision in specifying the cost and class distributions.  If nothing
610: is known about a distribution, the ROC convex hull shows all
611: classifiers that may be optimal under any conditions.
612: Figure~\ref{fig:ROC-hull} showed that, given classifiers $\mathsf{A}$,
613: $\mathsf{B}$, $\mathsf{C}$ and $\mathsf{D}$, only $\mathsf{A}$ and
614: $\mathsf{C}$ can ever be optimal.  With complete information, the
615: method identifies the optimal classifier(s).  In
616: figure~\ref{fig:ROC-hull2} we saw that classifier $\mathsf{A}$ (with a
617: particular threshold value) is optimal under scenario $\mathcal{A}$
618: and classifier $\mathsf{C}$ is optimal under scenario $\mathcal{B}$.
619: Next we will see that with less precise information, the ROC convex
620: hull can show the subset of possibly optimal classifiers.
621: 
622: \subsubsection{Sensitivity analysis}
623: 
624: 
625: \begin{figure}[tb]
626:   \begin{center}
627:     %
628:     %
629:     \epsfig{file=Sensitivity-1.eps,height=2.7in,width=2.6in} \\
630:     a.~~Low sensitivity\\
631:     \vspace*{.2in}
632:     \epsfig{file=Sensitivity-2.eps,height=2.5in,width=2.5in}\\
633:     b.~~High sensitivity\\
634:   \end{center}
635:   \caption{Sensitivity analysis using the ROC convex hull:  (a) low
636:     sensitivity (only C can be optimal), (b) high sensitivity (A, E, or C can
637:     be optimal)}
638:   \label{fig:sensitive}
639: \end{figure}
640: 
641: 
642: Imprecise distribution information defines a \emph{range} of slopes for
643: iso-performance lines.  This range of slopes intersects a segment of the ROC
644: convex hull, which facilitates sensitivity analysis.  For example, if the
645: segment defined by a range of slopes corresponds to a single point in ROC
646: space or a small threshold range for a single classifier, then there is no
647: sensitivity to the distribution assumptions in question.  Consider a scenario
648: similar to $\mathcal{A}$ and $\mathcal{B}$ in that negative examples are 5
649: times as prevalent as positive ones.  In this scenario, consider the cost of
650: dealing with a false alarm to be between \$10 and \$20, and the cost of
651: missing a positive example to be between \$200 and \$250.  These conditions
652: define a range of slopes for iso-performance lines: $\frac{1}{5}\le m \le
653: \frac{1}{2}$.  Figure~\ref{fig:sensitive}a depicts this range of slopes and
654: the corresponding segment of the ROC convex hull.  The figure shows that the
655: choice of classifier is insensitive to changes within this range (and only
656: fine tuning of the classifier's threshold will be necessary).
657: Figure~\ref{fig:sensitive}b depicts a scenario with a wider range of slopes:
658: $\frac{1}{2} \le m \le 3$.  The figure shows that under this scenario the
659: choice of classifier is very sensitive to the distribution.  Classifiers
660: $\mathsf{A}$, $\mathsf{C}$ and $\mathsf{E}$ each are optimal for some
661: subrange.
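
As a check on the range used for figure~\ref{fig:sensitive}a, the bounds
follow from the iso-performance slope (the cost of a false alarm times the
prevalence of negatives, divided by the cost of a miss times the prevalence of
positives) with $p(\NEG)/p(\POS) = 5$:
\[
  m_{\min} = \frac{10 \cdot 5}{250} = \frac{1}{5}, \qquad
  m_{\max} = \frac{20 \cdot 5}{200} = \frac{1}{2}.
\]
The lower bound pairs the cheapest false alarm with the most expensive miss;
the upper bound pairs the most expensive false alarm with the cheapest miss.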
662: 
663: \section{Building robust classifiers}
664: \label{sect:rocch-hybrid}
665: 
666: Up to this point, we have concentrated on the use of the \rocch\ for
667: visualizing and evaluating sets of classifiers.  The \rocch\ helps to
668: delay classifier selection as long as possible, yet provides a rich
669: performance comparison.  However, once system building incorporates a
670: particular classifier, the problem of brittleness resurfaces.  This is
671: important because the delay between system building and deployment may
672: be large, and because many systems must survive for years.  In fact,
673: in many domains a precise, static specification of future costs and
674: class distributions is not just unlikely, it is impossible
675: \cite{ProvostFawcettKohavi:98}.
676: 
677: We address this brittleness by using the \rocch\ to produce
678: \textbf{robust classifiers}, defined as satisfying the following.
679: \emph{Under any target cost and class distributions, a robust
680: classifier will perform at least as well as the best classifier for
681: those conditions.}  Our statements about optimality are practical: the
682: ``best'' classifier may not be the Bayes-optimal classifier, but it is
683: at least as good as any known classifier.
684: Srinivasan \citeyear{Srinivasan:99} calls this ``FAPP-optimal''
685: (optimal for all practical purposes).  Stating that a classifier is
686: robust is stronger than stating that it is optimal for a specific set
687: of conditions.  A robust classifier is optimal under all possible
688: conditions.
689: 
690: In principle, classification brittleness could be overcome by saving
691: all possible classifiers (neural nets, decision trees, expert systems,
692: probabilistic models, etc.)  and then performing an automated run-time
693: comparison under the desired target conditions.  However, such a
694: system is not feasible because of time and space limitations---there
695: are myriad possible classification models, arising from the many
696: different learning methods under their many different parameter
settings.  Neither storing all the classifiers nor tuning the system by
comparing classifiers on the fly under different conditions is feasible.
Fortunately, doing so is not necessary.
700: Moreover, we will show that it is sometimes possible to do \textit{better} than
701: any of these classifiers.
702: 
703: \subsection{ROCCH-hybrid classifiers}
704: 
705: We now show that robust hybrid classifiers can be built using the \rocch.
706: 
707: \begin{definition}
708:   Let $\mathbf{I}$ be the space of possible instances and let $\mathbf{C}$ be
709:   the space of sets of classification models.  Let a
710:   \mathversion{bold}$\mu$\mathversion{normal}\textbf{-hybrid classifier}
711:   comprise a set of classification models $\mathcal{C} \in \mathbf{C}$ and a
712:   function
713:   \[
714:   \mu: \mathbf{I} \times \Re \times \mathbf{C} \rightarrow \{\YES,\NO\}.
715:   \]
716:   A $\mu$-hybrid classifier takes as input an instance $I \in \mathbf{I}$ for
717:   classification and a number $x \in \Re$.  As output, it produces the
718:   classification produced by $\mu(I,x,\mathcal{C})$.
719: \end{definition}
720: 
721: Things will get more involved later, but for the time being consider that each
722: set of cost and class distributions defines a value for $x$, which is used to
723: select the (predetermined) best classifier for those conditions.  To build a
724: $\mu$-hybrid classifier, we must define $\mu$ and the set $\mathcal{C}$.  We
725: would like $\mathcal{C}$ to include only those models that perform optimally
726: under some conditions (class and cost distributions), since these will be
727: stored by the system, and we would like $\mu$ to be general enough to apply to
728: a variety of problem formulations.
729: 
730: The models comprising the {\sc rocch} can be combined to form a
731: $\mu$-hybrid classifier that is an elegant, robust classifier.
732: 
733: \begin{definition}
734:   The \textbf{{\sc \textbf{rocch}}-hybrid} is a $\mu$-hybrid classifier where
735:   $\mathcal{C}$ is the set of classifiers that form the {\sc rocch} and $\mu$
736:   makes classifications using the classifier on the {\sc rocch} with $FP=x$.
737: \end{definition}
738: Note that for the moment the {\sc rocch}-hybrid is defined only for $FP$
739: values corresponding to {\sc rocch} vertices.
740: 
741: \subsection{Robust classification}
742: 
743: Our definition of robust classifiers was intentionally vague about
744: what it means for one classifier to be better than another, because
745: different situations call for different comparison frameworks.  We now
746: continue with minimizing expected cost, because the process of proving
747: that the {\sc rocch}-hybrid minimizes expected cost for any cost and
748: class distributions provides a deep understanding of why and how the
749: {\sc rocch}-hybrid works.
750: Later we generalize to a wide variety of
751: comparison frameworks.
752: 
753: The \rocch-hybrid can be seen as an application of multi-criteria
754: optimization to classifier design and construction.  The classifiers on the
755: \rocch\ are Edgeworth-Pareto optimal\footnote{Edgeworth-Pareto optimality is
756:   the century-old notion that in a multidimensional space of criteria, optimal
757:   performance is the frontier of achievable performance in this space.  In
758:   cases where performance is a linear combination of the criteria, the
759:   optimality frontier is the convex hull.} \cite{Stadler-book} with respect to
760: TP, FP, and the objective functions we discuss.  Multi-criteria optimization
761: was used previously in machine learning by Tcheng, Lambert, Lu and Rendell
762: \shortcite{TchengEtAl:89} for the selection of inductive bias.
763: Alternatively, the \rocch\ can be seen as an application of the theory of
764: games and statistical decisions, for which convex sets (and the convex hull)
765: represent optimal strategies \cite{BlackwellGirshick:54}.
766: 
767: \subsubsection{Minimizing expected cost}
768: 
769: From above, the expected cost of applying a classifier is:
770: 
771: \begin{equation}
772:   \label{eq:expected_cost}
773:   ec(FP,TP) \; = \; p(\POS)  \cdot  (1-TP)\cdot c(\NO,\POS) \;  + 
774:   \; p(\NEG)  \cdot  FP \cdot c(\YES,\NEG)
775: \end{equation}
776: 
777: For a particular set of cost and class distributions, the
778: slope of the corresponding iso-performance lines is: 
779: 
780: \begin{equation}
781:   \label{eq:slope}
782:   m_{ec} = \frac{c(\YES,\NEG)p(\NEG)}{c(\NO,\POS)p(\POS)}
783: \end{equation}
784: 
785: Every set of conditions will define an $m_{ec} \ge 0$.  We now can
786: show that the {\sc rocch}-hybrid is robust for problems where the
787: ``best'' classifier is the classifier with the minimum expected cost.
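
To see where equation~\ref{eq:slope} comes from, hold the expected cost in
equation~\ref{eq:expected_cost} fixed at some value $k$ (a symbol introduced
here only for illustration) and, assuming $p(\POS) \cdot c(\NO,\POS) > 0$,
solve for $TP$:
\[
  TP \; = \; \frac{c(\YES,\NEG)\,p(\NEG)}{c(\NO,\POS)\,p(\POS)} \cdot FP
  \; + \; \left( 1 - \frac{k}{c(\NO,\POS)\,p(\POS)} \right)
\]
All points with expected cost $k$ therefore lie on a line with slope $m_{ec}$,
and a smaller $k$ corresponds to a larger y-intercept.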
788: 
789: The slope of the {\sc rocch} is an important tool in our argument.  The {\sc
790:   rocch} is a piecewise-linear, concave-down ``curve.''  Therefore, as $x$
791: increases, the slope of the {\sc rocch} is monotonically non-increasing with
792: $k-1$ discrete values, where $k$ is the number of {\sc rocch} component
793: classifiers, including the degenerate classifiers that define the {\sc rocch}
794: endpoints.  Where there will be no confusion, we use phrases such as ``points
795: in ROC space'' as a shorthand for the more cumbersome ``classifiers
796: corresponding to points in ROC space.'' For this subsection, unless otherwise
797: noted, ``points on the
798: {\sc rocch}'' refer to vertices of the {\sc rocch}.
799: 
800: \begin{definition}
801:   \label{def:slope-of-rocch}
802:   For any real number $m \ge 0$, the \textbf{point where the slope of the
803:     \textsc{rocch}\ is $\mathbf{m}$} is one of the (arbitrarily chosen)
804:   endpoints of the segment of the {\sc rocch} with slope $m$, if such a
805:   segment exists.  Otherwise, it is the vertex for which the left adjacent
806:   segment has slope greater than $m$ and the right adjacent segment has slope
807:   less than $m$.
808: \end{definition}
809: 
810: For completeness, the leftmost endpoint of the {\sc rocch} is considered to be
811: attached to a segment with infinite slope and the rightmost endpoint of the
812: {\sc rocch} is considered to be attached to a segment with zero slope.  Note
813: that every $m \ge 0$ defines at least one point on the {\sc rocch}.
814: 
815: \begin{lemma}
816:   For any set of cost and class distributions, there is a point on the \rocch\ 
817:   with minimum expected cost.\\
818:   \textbf{Proof:} (by contradiction) Assume that for some conditions
819:   there exists a point \textbf{C} with smaller expected cost than any
820:   point on the {\sc rocch}.  By equations~\ref{eq:expected_cost} and
821:   \ref{eq:slope}, a point ($FP_2$,$TP_2$) has the same expected cost
822:   as a point ($FP_1$,$TP_1$) if \[ \frac{TP_2 - TP_1}{FP_2 - FP_1} =
823:   m_{ec} \] Therefore, for conditions corresponding to $m_{ec}$, all
824:   points with equal expected cost form an iso-performance line in ROC
  space with slope $m_{ec}$.  Also by equations~\ref{eq:expected_cost}
826:   and~\ref{eq:slope}, points on lines with larger y-intercept have
827:   lower expected cost.  Now, point \textbf{C} is not on the {\sc
828:   rocch}, so it is either above the curve or below the curve.  If it
829:   is above the curve, then the {\sc rocch} is not a convex set
830:   enclosing all points, which is a contradiction.  If it is below the
831:   curve, then the iso-performance line through \textbf{C} also
832:   contains a point \textbf{P} that is on the {\sc rocch} (not
833:   necessarily a vertex).  If this iso-performance line intersects no
834:   {\sc rocch} vertex, then consider the vertices at the endpoints of
835:   the {\sc rocch} segment containing \textbf{P}; one of these vertices
836:   must intersect a better iso-performance line than does \textbf{C}.
837:   In either case, since all points on an iso-performance line have the
838:   same expected cost, point \textbf{C} does not have smaller expected
839:   cost than all points on the {\sc rocch}, which is also a
840:   contradiction.  \EndProof
841: \end{lemma}
842: 
843: Although it is not necessary for our purposes here, it can be shown
844: that \textit{all} of the minimum expected-cost classifiers are
845: \textit{on} the {\sc rocch}.
846: 
847: \begin{definition}
848:   \label{def:m_iso_perf_line}
849:   An iso-performance line with slope $m$ is an \textbf{m-iso-performance
850:     line}.
851: \end{definition}
852: 
853: \begin{lemma}
854:   For any cost and class distributions that translate to $m_{ec}$, a point on
855:   the {\sc rocch} has minimum expected cost only if the slope
856:   of the {\sc rocch} at that point is $m_{ec}$.\\
857:   \textbf{Proof:} (by contradiction) Suppose that there is a point \textbf{D}
858:   on the {\sc rocch} where the slope is \emph{not} $m_{ec}$, but the point
859:   does have minimum expected cost.  By Definition~\ref{def:slope-of-rocch},
860:   either (a) the segment to the left of \textbf{D} has slope less than
861:   $m_{ec}$, or (b) the segment to the right of \textbf{D} has slope greater
862:   than $m_{ec}$.  For case (a), consider point \textbf{N}, the vertex of the
863:   {\sc rocch} that neighbors \textbf{D} to the left, and consider the
864:   (parallel) $m_{ec}$-iso-performance lines $l_D$ and $l_N$ through \textbf{D}
865:   and \textbf{N}.  Because \textbf{N} is to the left of \textbf{D} and the
866:   line connecting them has slope less than $m_{ec}$, the y-intercept of $l_N$
867:   will be greater than the y-intercept of $l_D$.  This means that \textbf{N}
868:   will have lower expected cost than \textbf{D}, which is a contradiction.
869:   The argument for (b) is analogous (symmetric). \EndProof
870: \end{lemma}
871: 
872: \begin{lemma}
873:   If the slope of the {\sc rocch} at a point is $m_{ec}$, then the point has
874:   minimum expected cost.\\
875:   \textbf{Proof:} If this point is the only point where the slope of the {\sc
876:     rocch} is $m_{ec}$, then the proof follows directly from Lemma 1 and
877:   Lemma 2.  If there are multiple such points, then by definition they are
878:   connected by an $m_{ec}$-iso-performance line, so they have the same
879:   expected cost, and once again the proof follows directly from Lemma 1 and
880:   Lemma 2. \EndProof
881: \end{lemma}
882: 
883: It is straightforward now to show that the {\sc rocch}-hybrid is robust for the
884: problem of minimizing expected cost.
885: 
886: \begin{theorem}
887:   The {\sc rocch}-hybrid minimizes expected cost for any cost distribution
888:   and any class distribution.\\
889:   \textbf{Proof:} Because the {\sc rocch}-hybrid is composed of the
890:   classifiers corresponding to the points on the {\sc rocch}, this follows
891:   directly from Lemmas 1, 2, and 3. \EndProof
892: \end{theorem}
893: 
894: Now we have shown that the {\sc rocch}-hybrid is robust when the goal
895: is to provide the minimum expected-cost classification.  This result
896: is important even for accuracy maximization, because the preferred
897: classifier may be different for different target class distributions.
898: This rarely is taken into account in experimental comparisons of
899: classifiers.
900: 
901: \begin{corollary}
902:   The {\sc rocch}-hybrid minimizes error rate (maximizes accuracy) for any
903:   target class distribution.\\
904:   \textbf{Proof:} Error rate minimization is cost minimization with uniform
905:   error costs. \EndProof
906: \end{corollary}
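
The effect of the target class distribution on the preferred vertex can be
illustrated with a short Python sketch.  It evaluates
equation~\ref{eq:expected_cost} at every hull vertex instead of looking up the
vertex by slope, which is equivalent by the lemmas above but simpler to write.
The names \texttt{expected\_cost} and \texttt{best\_vertex} and the hull
coordinates are assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (FP rate, TP rate)


def expected_cost(pt: Point, p_pos: float, cost_fn: float, cost_fp: float) -> float:
    """Expected cost ec(FP, TP) of one point in ROC space."""
    fp, tp = pt
    return p_pos * (1.0 - tp) * cost_fn + (1.0 - p_pos) * fp * cost_fp


def best_vertex(hull: List[Point], p_pos: float, cost_fn: float, cost_fp: float) -> Point:
    """Hull vertex with minimum expected cost under the given conditions."""
    return min(hull, key=lambda pt: expected_cost(pt, p_pos, cost_fn, cost_fp))


# Hypothetical hull; uniform error costs, so this is error-rate minimization.
hull = [(0.0, 0.0), (0.1, 0.6), (0.4, 0.8), (1.0, 1.0)]
balanced = best_vertex(hull, p_pos=0.5, cost_fn=1.0, cost_fp=1.0)
skewed = best_vertex(hull, p_pos=0.7, cost_fn=1.0, cost_fp=1.0)
```

With these illustrative numbers the most accurate vertex changes when the
proportion of positives moves from 0.5 to 0.7, even though the error costs
stay uniform.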
907: 
908: \subsection{Robust classification for other common metrics}
909: 
910: Showing that the \rocch-hybrid is robust not only helps us with understanding
911: the \rocch\ method generally, it also shows us how the \rocch-hybrid will pick
912: the best classifier in order to produce the best classifications, which we
913: will return to later.  If we ignore the need to specify how to pick the best
914: component classifier, we can show that the \rocch\ applies more generally.
915: 
916: \begin{theorem}
917:   \label{theorem:general-rocch}
918:   For any classifier evaluation metric $f(FP,TP)$, if\\
919:   $\Partial{f}{TP}~\ge~0$ and $\Partial{f}{FP} \le 0$ then there exists a
920:   point on the \rocch\ with an $f$-value at least
921:   as high as that of any known classifier.\\
922:   \textbf{Proof:} (by contradiction) Assume that there exists a classifier
923:   $\mathcal{C}_o$, not on the \rocch, with an $f$-value higher than that of
924:   any point on the \rocch.  $\mathcal{C}_o$ is either (i) above or (ii) below
925:   the \rocch.  In case (i), the \rocch\ is not a convex set enclosing all the
926:   points, which is a contradiction.  In case (ii), let $\mathcal{C}_o$ be
927:   represented in ROC-space by $(FP_o,TP_o)$.  Because $\mathcal{C}_o$ is below
928:   the \rocch\ there exist points, call one $(FP_p,TP_p)$, on the \rocch\ with
929:   $TP_p > TP_o$ and $FP_p < FP_o$.  However, by the restriction on the partial
930:   derivatives, for any such point $f(FP_p,TP_p) \ge f(FP_o,TP_o)$, which again
931:   is a contradiction.  \EndProof
932: \end{theorem}
933: 
934: There are two complications to the more general use of the \rocch,
935: both of which are illustrated by the decision criterion from our very
936: first example.  Recall that the Neyman-Pearson criterion specifies a
937: maximum acceptable $FP$ rate.  Standard ROC analysis uses ROC curves
938: to select a single, parameterized classification model; the parameter
939: allows the user to select the ``operating point'' for a
940: decision-making task, usually a threshold on a probabilistic output
941: that will allow for optimal decision making.  Under the Neyman-Pearson
942: criterion, selecting the single best model from a set is easy: plot
943: the ROC curves, draw a vertical line at the desired maximum $FP$, and
944: pick the model whose curve has the largest $TP$ at the intersection
945: with this line.
946: 
947: \begin{figure}[tb]
948:   \begin{center}
949:     \epsfig{file=ROC-NP.eps,height=3.1in,width=3in}
950:     \caption{The ROC Convex Hull used to select a classifier under the
951:       Neyman-Pearson criterion}
952:     \label{fig:ROC-NP}
953:   \end{center}
954: \end{figure}
955: 
956: With the \rocch-hybrid, making the best classifications under
957: the Neyman-Pearson criterion is not so straightforward.
958: For minimizing expected cost it was sufficient for the {\sc rocch}-hybrid to
959: choose a \textit{vertex} from the {\sc rocch} for any $m_{ec}$ value.  For
960: problem formulations such as the Neyman-Pearson criterion, the performance
961: statistics at a non-vertex point on the {\sc rocch} may be preferable (see
962: figure~\ref{fig:ROC-NP}).  Fortunately, with a slight extension, the {\sc
963:   rocch}-hybrid can yield a classifier with these performance statistics.
964: 
965: \begin{theorem}
966:   \label{theorem:rocch-achieves-any-tradeoff} An {\sc rocch}-hybrid
967:   can achieve the $TP$:$FP$ tradeoff represented by any point on the
968:   {\sc rocch}, not just the vertices.\\ \textbf{Proof:} (by
969:   construction) Extend $\mu(I,x,\mathcal{C})$ to non-vertex points as
970:   follows.  Pick the point $P$ on the {\sc rocch} with $FP=x$ (there
971:   is exactly one).  Let $TP_x$ be the $TP$ value of this point.  If
972:   ($x$, $TP_x$) is an {\sc rocch} vertex, use the corresponding
973:   classifier.  If it is not a vertex, call the left endpoint of the
974:   hull segment on which $P$ lies $C_l$, and the right endpoint $C_r$.
975:   Let $d$ be the distance between $C_l$ and $C_r$, and let $p$ be the
976:   distance between $C_l$ and $P$.  Make classifications as follows.
977:   For each input instance flip a weighted coin and choose the answer
978:   given by classifier $C_r$ with probability $\frac{p}{d}$ and that
979:   given by classifier $C_l$ with probability $1-\frac{p}{d}$.  It is
980:   straightforward to show that $FP$ and $TP$ for this classifier will
981:   be $x$ and $TP_x$. \EndProof
982: \end{theorem}
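
The final step of the proof can be spelled out.  Write
$C_l = (FP_l, TP_l)$ and $C_r = (FP_r, TP_r)$ (coordinate names introduced
here for illustration).  Because each instance is classified by $C_r$ with
probability $\frac{p}{d}$ and by $C_l$ otherwise, the rates of the randomized
classifier are the corresponding weighted averages:
\[
  FP = \left(1-\frac{p}{d}\right) FP_l + \frac{p}{d}\, FP_r ,
  \qquad
  TP = \left(1-\frac{p}{d}\right) TP_l + \frac{p}{d}\, TP_r .
\]
Since $P$ lies on the segment from $C_l$ to $C_r$ at distance $p$ from $C_l$,
we have $P = C_l + \frac{p}{d}\,(C_r - C_l)$, so these values are exactly $x$
and $TP_x$.  When $FP_r \ne FP_l$, the weight can equivalently be computed as
$\frac{p}{d} = \frac{x - FP_l}{FP_r - FP_l}$.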
983: 
984: The second complication is that, as illustrated by the Neyman-Pearson
985: criterion, many practical classifier comparison frameworks include
986: \textit{constrained} optimization problems (below we will discuss other
987: frameworks).  Arbitrarily constrained optimizations are problematic for the
988: \rocch-hybrid.  Given total freedom, it is possible to devise constraints on
989: classifier performance such that, even with the restriction on the partial
990: derivatives, an interior point scores higher than any \textit{acceptable}
991: point on the hull.  For example, two linear constraints can enclose a subset
992: of the interior and exclude \textit{the entire} \rocch---there will be no
993: acceptable points on the \rocch.  However, many realistic constraints do not
994: thwart the optimality of the \rocch-hybrid.
995: 
996: \begin{theorem}
997:   \label{theorem:general-rocch-hybrid}
998:   For any classifier evaluation metric $f(FP,TP)$, if \\
999:   $\Partial{f}{TP}\ge~0$ and $\Partial{f}{FP}\le~0$ and no constraint on
1000:   classifier performance eliminates any point on the \rocch\ without also
1001:   eliminating all higher-scoring interior points, then the \rocch-hybrid can
1002:   perform at least as well as any known classifier.
1003:   \\
1004:   \textbf{Proof:} Follows directly from Theorem~\ref{theorem:general-rocch}
1005:   and Theorem~\ref{theorem:rocch-achieves-any-tradeoff}.  \EndProof
1006: \end{theorem}
1007: 
1008: Linear constraints on classifiers' $FP:TP$ performance are common
1009: for real-world problems, so the following is
1010: useful.
1011: 
1012: \begin{corollary}
1013:   \label{corollary:linear-constraints}
1014:   For any classifier evaluation metric $f(FP,TP)$, if\\
1015:   $\Partial{f}{TP} \ge 0$ and $\Partial{f}{FP} \le 0$
1016:   and there is a single constraint on classifier performance
1017:   of the form $a \cdot TP + b \cdot FP \le c$, with $a$ and $b$
1018:   non-negative,
1019:   then
1020:   the \rocch-hybrid can perform at least as well as any known
1021:   classifier.
1022:   \\
1023:   \textbf{Proof:}
1024:   The single constraint eliminates from contention all points (classifiers)
1025:   that do not fall to the left of, or below, a line with non-positive
1026:   slope.  By the restriction on the partial derivatives, such a constraint
1027:   will not eliminate a point on the \rocch\  without also eliminating
1028:   all interior points with higher $f$-values.
1029:   Thus, the proof follows directly from Theorem~\ref{theorem:general-rocch-hybrid}.
1030:   \EndProof
1031: \end{corollary}
1032: 
1033: So, finally, we have the following:
1034: 
1035: \begin{corollary}
1036:   \label{cor:rocch-maximizes-NP}
  For the Neyman-Pearson criterion, the {\sc rocch}-hybrid can perform at
  least as well as any known classifier.\\
1040:   \textbf{Proof:} For the Neyman-Pearson criterion, the evaluation metric is
1041:   $f(FP,TP)=TP$, that is, a higher $TP$ is better.  The constraint on
1042:   classifier performance is $FP \le FP_{max}$. These satisfy the conditions
1043:   for Corollary~\ref{corollary:linear-constraints}, and therefore this
1044:   corollary follows.  \EndProof
1045: \end{corollary}
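
A possible realization of this selection is sketched below in Python.  Given a
maximum acceptable $FP$ rate, it locates the hull segment containing that
value and returns the two endpoint classifiers together with the mixing weight
from Theorem~\ref{theorem:rocch-achieves-any-tradeoff}.  The function name
\texttt{neyman\_pearson\_mix} and the hull coordinates are assumptions made
for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (FP rate, TP rate)


def neyman_pearson_mix(hull: List[Point], fp_max: float) -> Tuple[Point, Point, float, float]:
    """Return (C_l, C_r, weight on C_r, achieved TP) for the hull point with FP = fp_max.

    `hull` must be sorted by FP and run from (0, 0) to (1, 1).
    """
    for left, right in zip(hull, hull[1:]):
        # Skip vertical segments; they cannot be interpolated in FP.
        if left[0] <= fp_max <= right[0] and right[0] > left[0]:
            w = (fp_max - left[0]) / (right[0] - left[0])
            tp = left[1] + w * (right[1] - left[1])
            return left, right, w, tp
    raise ValueError("fp_max must lie in [0, 1]")


# Hypothetical hull and constraint FP <= 0.25.
hull = [(0.0, 0.0), (0.1, 0.6), (0.4, 0.8), (1.0, 1.0)]
c_l, c_r, weight, tp = neyman_pearson_mix(hull, 0.25)
```

In this example the constraint falls halfway along the middle segment, so each
instance would be passed to $C_r$ or $C_l$ with roughly equal probability.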
1046: 
1047: All the foregoing effort may seem misplaced for a simple
1048: criterion like Neyman-Pearson.  However, there are
1049: many other realistic problem formulations.  
1050: For example, consider
1051: the decision-support problem of optimizing \textit{workforce utilization}, in
1052: which a workforce is available that can process a fixed number of cases.  Too few
1053: cases will under-utilize the workforce, but too many cases will leave some
1054: cases unattended (expanding the workforce usually is not a short-term
1055: solution).  If the workforce can handle $K$ cases, the system should present
1056: the best possible set of $K$ cases.  This is similar to the Neyman-Pearson
1057: criterion, but with an absolute cutoff ($K$) instead of a percentage cutoff
1058: ($FP$).
1059: 
1060: 
1061: \begin{theorem}
1062:   \label{the:rocch_best}
1063:   For workforce utilization, the {\sc rocch}-hybrid will provide the best set
1064:   of $K$ cases, for any choice of $K$.\\ 
 \textbf{Proof:} (by construction) The decision criterion is to maximize $TP$
  subject to the constraint:
  \[
  TP \cdot P + FP \cdot N \le K
  \]
  where $P$ and $N$ are the numbers of positive and negative cases.
  The theorem therefore follows from Corollary~\ref{corollary:linear-constraints}. \EndProof
1071: \end{theorem}
1072: 
1073: In fact, many screening problems, such as are found in marketing and
1074: information retrieval, use exactly this linear constraint.  It follows that
1075: for maximizing lift \cite{BerryLinoff:97}, precision, or recall, subject to
1076: absolute or percentage cutoffs on case presentation, the {\sc rocch}-hybrid
1077: will provide the best set of cases.
1078: 
1079: As with minimizing expected cost, imprecision in the environment
1080: forces us to favor a \textit{robust} solution for these other
1081: comparison frameworks.  For many real-world problems, the precise
1082: desired cutoff will be unknown or will change (\eg because of
1083: fundamental uncertainty, variability in case difficulty, or competing
1084: responsibilities).  What is worse, for a fixed (absolute) cutoff
1085: merely changing the size of the universe of cases (e.g., the size of
1086: a document corpus) may change the preferred classifier, because it
1087: will change the constraint line.  The {\sc rocch}-hybrid provides a
1088: robust solution because it gives the optimal subset of cases for any
1089: constraint line.  For example, for document retrieval the {\sc
1090: rocch}-hybrid will yield the best $N$ documents for any $N$, for any
1091: prior class distribution (in the target corpus), and for any target
1092: corpus size.
1093: 
1094: \subsection{Ranking cases}
1095: \label{sect:ranking-cases}
1096: 
1097: An apparent solution to the problem of robust classification is to use a model
1098: that ranks cases, and just work down the ranked list.  This approach appears
1099: to sidestep the brittleness demonstrated with binary classifiers, since the
1100: choice of a cutoff point can be deferred to classification time.  However,
1101: choosing the best ranking model is still problematic.  For most practical
1102: situations, choosing the best ranking model is equivalent to choosing which
1103: classifier is best \emph{for the cutoff that will be used}.
1104: 
1105: An example will illustrate this.  Consider two ranking functions, $R_a$ and
1106: $R_b$, applied to a class-balanced set of 100 cases.  Assume $R_a$ is able to
1107: recognize a common aspect unique to positive cases that occurs in 20\% of the
1108: population, and it ranks these highest.  Assume $R_b$ is able to recognize a
1109: common aspect unique to negative cases occurring in 20\% of the population, and it
1110: ranks these lowest.  So $R_a$ ranks the highest 20\% correctly and performs
1111: randomly on the remainder, while $R_b$ ranks the lowest 20\% correctly and
1112: performs randomly on the remainder.  Which model is better?  The answer
1113: depends entirely upon how far down the list the system will go before it
1114: stops; that is, upon what cutoff will be used.  If fewer than 50 cases are to
1115: be selected then $R_a$ should be used, whereas $R_b$ is better if more than 50
1116: cases will be selected.  Figure~\ref{fig:Ranking-models} shows the ROC curves
1117: corresponding to these two classifiers, and the point corresponding to $N=50$
1118: where the curves cross in ROC space.
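
The crossover can be verified by counting the expected number of positives
among the top $N$ cases.  With 50 positives and 50 negatives, $R_a$ places 20
positives first and orders the remaining 30 positives and 50 negatives
randomly, so for $20 \le N \le 100$ it presents $20 + \frac{3}{8}(N-20)$
positives in expectation.  $R_b$ places 20 negatives last and orders the
remaining 50 positives and 30 negatives randomly, so for $N \le 80$ it
presents $\frac{5}{8}N$ positives in expectation.  Setting the two equal
gives
\[
  20 + \frac{3}{8}(N - 20) = \frac{5}{8}N
  \quad \Longrightarrow \quad N = 50 ,
\]
with $R_a$ ahead for smaller $N$ and $R_b$ ahead for larger $N$.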
1119: 
1120: \begin{figure}[tb]
1121:   \begin{center}
1122:     \epsfig{file=Ranking-models.eps ,height=3in}
1123:     \caption{The ROC curves of the two ranking classifiers, $R_a$ and $R_b$,
1124:       described in Section~\ref{sect:ranking-cases}.}
1125:     \label{fig:Ranking-models}
1126:   \end{center}
1127: \end{figure}
1128: 
1129: The \rocch\ method can be used to organize such ranking models, as we have
1130: seen.  Recall that ROC curves are formed from case rankings by moving the
1131: cutoff from one extreme to the other (Table~\ref{tab:ROC-alg} shows an
1132: algorithm for calculating the ROC curve from such rankings).  The {\sc
1133:   rocch}-hybrid comprises the ranking models that are best for all possible
1134: conditions.
1135: 
1136: \subsection{Whole-curve metrics}
1137: 
1138: In situations where either the target cost distribution or class distribution
1139: is \emph{completely} unknown, some researchers advocate choosing the
1140: classifier that maximizes a single-number metric representing the average
1141: performance over the entire curve.  A common whole-curve metric is ``AUC'',
1142: the Area Under the (ROC) Curve \cite{Bradley:97}.  The AUC is equivalent to
1143: the probability that a randomly chosen positive instance will be rated higher
than a randomly chosen negative instance, and thereby is also estimated by the
Wilcoxon test of ranks \cite{HanleyMcNeil:82}.  A criticism of AUC is that for specific
1146: target conditions the classifier with the maximum AUC may be suboptimal
1147: \cite{ProvostFawcettKohavi:98}.  Indeed, this criticism may be made of any
1148: single-number metric.  Fortunately, not only is the \textsc{rocch}-hybrid
optimal for any specific target conditions, it also has the maximum AUC: there
is no classifier with AUC larger than that of the {\sc rocch}-hybrid.
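
The AUC comparison can be illustrated numerically with the trapezoid rule
applied to piecewise-linear curves.  The sketch below is in Python; the name
\texttt{auc} and all coordinates are assumptions made for this example, and
each component classifier is represented by the three-point curve through its
single $(FP,TP)$ point.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (FP rate, TP rate)


def auc(curve: List[Point]) -> float:
    """Area under a piecewise-linear ROC curve sorted by FP (trapezoid rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


# Hypothetical hull built from two component classifiers.
hull = [(0.0, 0.0), (0.1, 0.6), (0.4, 0.8), (1.0, 1.0)]
first = [(0.0, 0.0), (0.1, 0.6), (1.0, 1.0)]
second = [(0.0, 0.0), (0.4, 0.8), (1.0, 1.0)]
areas = (auc(hull), auc(first), auc(second))
```

Because the hull lies on or above every component curve at each $FP$ value,
its area cannot be smaller than that of any component.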
1151: 
1152: \subsection{Using the ROCCH-hybrid}
1153: 
1154: To use the \textsc{rocch}-hybrid for classification, we need to translate
1155: environmental conditions to $x$ values to plug into $\mu(I,x,\mathcal{C})$.
1156: For minimizing expected cost, Equation~\ref{eq:slope} shows how to translate
1157: conditions to $m_{ec}$.  For any $m_{ec}$, by Lemma~3 we want the $FP$ value
1158: of the point where the slope of the {\sc rocch} is $m_{ec}$, which is
1159: straightforward to calculate.  For the Neyman-Pearson criterion the conditions
1160: are defined as $FP$ values.  For workforce utilization with conditions
1161: corresponding to a cutoff $K$, the $FP$ value is found by intersecting the line
1162: $TP \cdot P + FP \cdot N = K$ with the {\sc rocch}.
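
The intersection for workforce utilization can be computed segment by segment,
because the number of cases presented does not decrease as one moves along the
hull from $(0,0)$ to $(1,1)$.  The Python sketch below is one way to do this;
the name \texttt{workforce\_point}, the parameters \texttt{n\_pos} and
\texttt{n\_neg} (standing for $P$ and $N$), and the numbers used are
assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (FP rate, TP rate)


def workforce_point(hull: List[Point], n_pos: float, n_neg: float, k: float) -> Point:
    """Hull point where TP * n_pos + FP * n_neg = k.

    `hull` is sorted by FP from (0, 0) to (1, 1); n_pos and n_neg are the
    (estimated) numbers of positive and negative cases.
    """
    def cases(pt: Point) -> float:
        return pt[1] * n_pos + pt[0] * n_neg  # cases presented at this point

    for left, right in zip(hull, hull[1:]):
        lo, hi = cases(left), cases(right)
        if lo <= k <= hi and hi > lo:
            t = (k - lo) / (hi - lo)  # fraction of the way along the segment
            return (left[0] + t * (right[0] - left[0]),
                    left[1] + t * (right[1] - left[1]))
    raise ValueError("k must lie between 0 and n_pos + n_neg")


# Hypothetical hull, 100 positives, 900 negatives, capacity for 300 cases.
hull = [(0.0, 0.0), (0.1, 0.6), (0.4, 0.8), (1.0, 1.0)]
fp, tp = workforce_point(hull, n_pos=100, n_neg=900, k=300)
```

The returned $FP$ value is the $x$ to pass to $\mu(I,x,\mathcal{C})$; if it
falls between two vertices, the randomized combination from
Theorem~\ref{theorem:rocch-achieves-any-tradeoff} applies.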
1163: 
1164: We have argued that target conditions (misclassification costs and
1165: class distribution) are rarely known.  It may be confusing that
1166: we now seem to require exact knowledge of these conditions.  The
1167: \textsc{rocch}-hybrid gives us two important capabilities.  First, the
1168: need for precise knowledge of target conditions is deferred until
1169: run time.  Second, in the absence of precise knowledge even at
1170: run time, the system can be optimized easily with minimal feedback.
1171: 
1172: By using the \textsc{rocch}-hybrid, information on target conditions is not
1173: needed to train and compare classifiers.  This is important because of 
1174: imprecision caused by temporal,
1175: geographic, or other differences that may exist between training and use.  
1176: For example, building
1177: a system for a real-world problem introduces a non-trivial delay between the
1178: time data are gathered and the time the learned models will be used.  The
1179: problem is exacerbated in domains where error costs or class distributions
1180: change over time; even with slow drift, a brittle model may become suboptimal
1181: quickly.  In many such scenarios, costs and class distributions can be specified
1182: (or respecified) at run time with reasonable precision by sampling from the
1183: current population, and used to ensure that the {\sc rocch}-hybrid always
1184: performs optimally.
1185: 
1186: 
1187: In some cases, even at run time these quantities are not known
1188: exactly.  A further benefit of the \textsc{rocch}-hybrid is that it
1189: can be tuned easily to yield optimal performance with only minimal
1190: feedback from the environment.  Conceptually, the {\sc rocch}-hybrid
1191: has one ``knob'' that varies $x$ in $\mu(I,x,\mathcal{C})$ from one
1192: extreme to the other.  For any knob setting, the {\sc rocch}-hybrid
1193: will give the optimal $TP$:$FP$ tradeoff for the target conditions
1194: corresponding to that setting.  Turning the knob to the right
1195: increases $TP$; turning the knob to the left decreases $FP$.  Because
1196: of the monotonicity of the \textsc{rocch}-hybrid, simple hill-climbing
1197: can guarantee optimal performance.  For example, if the system
1198: produces too many false alarms, turn the knob to the left; if the
1199: system is presenting too few cases, turn the knob to the right.
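
The knob-turning procedure can be sketched as a simple feedback loop.  The
following is a minimal sketch, assuming the knob setting $x$ is normalized to
$[0,1]$ and that the environment reports only whether there are too many false
alarms or too few cases; the function \texttt{tune\_knob}, its
\texttt{feedback} callback, and the step-halving schedule are all hypothetical
choices made for illustration.

```python
from typing import Callable


def tune_knob(feedback: Callable[[float], int],
              x: float = 0.5, step: float = 0.25, rounds: int = 20) -> float:
    """Adjust the knob setting x in [0, 1] from coarse environmental feedback.

    feedback(x) is a hypothetical callback returning
      -1 if the system produces too many false alarms at setting x,
      +1 if it presents too few cases, and
       0 if the current setting is acceptable.
    """
    for _ in range(rounds):
        signal = feedback(x)
        if signal == 0:
            break
        # Turn left (decrease x) or right (increase x), then halve the step
        # so the search settles instead of oscillating.
        x = min(1.0, max(0.0, x + signal * step))
        step /= 2.0
    return x
```

Because performance varies monotonically with the knob, such a search cannot
get stuck at a setting that is acceptable only locally.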
1200: 
1201: \subsection{Beating the component classifiers}
1202: \label{sect:beating-the-components}
1203: 
1204: Perhaps surprisingly, in many realistic situations an {\sc
1205: rocch}-hybrid system can do \emph{better} than any of its component
1206: classifiers.  Consider the Neyman-Pearson decision criterion.  The
1207: {\sc rocch} may intersect the $FP$-line \textit{above} the highest
1208: component ROC curve.  This occurs when the $FP$-line intersects the
1209: {\sc rocch} between vertices; therefore, there is no component
1210: classifier that actually produces these particular ($FP$,$TP$)
statistics, as in Figure~\ref{fig:ROC-NP}.  Only a small number of
$FP$ values correspond to hull vertices; for every other $FP$ value, by
Theorem~\ref{theorem:rocch-achieves-any-tradeoff}, the {\sc
rocch}-hybrid can still achieve the corresponding $TP$ on the hull.
1215: The same holds for other common problem formulations, such as workforce
1216: utilization, lift maximization, precision maximization, and recall
1217: maximization.
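
As a worked illustration with hypothetical numbers, suppose two adjacent hull
vertices are produced by classifiers $A$ at $(FP_A, TP_A) = (0.1, 0.6)$ and
$B$ at $(FP_B, TP_B) = (0.4, 0.9)$, and the Neyman-Pearson constraint is
$FP \le 0.2$.  One way to realize the hull point at $FP = 0.2$ is to classify
each instance with $B$ with probability $t$ and with $A$ otherwise, where
\[
  t = \frac{FP - FP_A}{FP_B - FP_A} = \frac{0.2 - 0.1}{0.4 - 0.1} = \frac{1}{3},
  \qquad
  TP = TP_A + t\,(TP_B - TP_A) = 0.6 + \frac{1}{3}\,(0.3) = 0.7 .
\]
Used alone, $A$ satisfies the constraint with $TP = 0.6$ and $B$ violates it;
the mixture attains $TP = 0.7$ in expectation while keeping the expected $FP$
at $0.2$.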
1218: 
1219: \subsection{Time and space efficiency}
1220: 
1221: We have argued that the {\sc rocch}-hybrid is robust for a wide variety of
1222: problem formulations.  It is also efficient to build, to store, and to update.
1223: 
1224: The time efficiency of building the {\sc rocch}-hybrid depends first
1225: on the efficiency of building the component models, which varies
1226: widely by model type.  Some models built by machine learning methods
1227: can be built in seconds (once data are available).  Hand-built models
1228: can take years to build.  However, we presume that this is work that
1229: would be done anyway.  The {\sc rocch}-hybrid can be built with
1230: whatever methods are available, be there two or two thousand. As
1231: described below, as new classifiers become available, the {\sc
1232: rocch}-hybrid can be updated incrementally.  The time efficiency
1233: depends also on the efficiency of the experimental evaluation of the
1234: classifiers.  Once again, we presume that this is work that would be
1235: done anyway.  Finally, the time efficiency of the {\sc rocch}-hybrid
1236: depends on the efficiency of building the {\sc rocch}, which can be
done in $O(N \log N)$ time using the QuickHull algorithm
\cite{quickhull:96}, where $N$ here denotes the number of classifiers.
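
The hull construction itself is short.  The sketch below is an illustration
only: it uses Andrew's monotone chain algorithm rather than QuickHull (both
run in $O(N \log N)$ here, the former because of the initial sort), it assumes
one $(FP, TP)$ point per classifier or threshold, and the function name
\texttt{roc\_convex\_hull} is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (FP rate, TP rate)


def cross(o: Point, a: Point, b: Point) -> float:
    """Cross product of vectors o->a and o->b (negative for a right turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points: List[Point]) -> List[Point]:
    """Upper convex hull of ROC points, from (0, 0) to (1, 1)."""
    # The trivial classifiers "never alarm" and "always alarm" are always
    # available, so their points are added before building the hull.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull: List[Point] = []
    for p in pts:
        # Drop the last vertex while it lies on or below the line from the
        # vertex before it to p, i.e. while the turn is not strictly right.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull
```

Classifiers whose points do not appear in the returned list are not optimal
under any target conditions and need not be stored.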
1239: 
1240: The {\sc rocch} is space efficient, too, because it comprises only
1241: classifiers that might be optimal under some target conditions (which
1242: follows directly from Lemmas 1--3 and Definitions 3 and 4).  The
1243: number of classifiers that must be stored can be reduced if bounds can
1244: be placed on the potential target conditions.  As described above,
1245: ranges of conditions define segments of the {\sc rocch}.  Thus, the
1246: {\sc rocch}-hybrid may need only a subset of $\mathcal{C}$.
1247: 
Adding new classifiers to the {\sc rocch}-hybrid also is efficient.  Adding a
classifier to the \textsc{rocch} will either (i) extend the hull, adding to
(and possibly subtracting from) the {\sc rocch}-hybrid, or (ii) show that
the new classifier is not superior to the existing classifiers in any
portion of ROC space, in which case it can be discarded.
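
The test that separates cases (i) and (ii) can be sketched as follows.  This
is a minimal sketch, assuming the current hull is a list of $(FP, TP)$
vertices sorted by increasing $FP$ and running from $(0,0)$ to $(1,1)$; the
helper names \texttt{hull\_tp\_at} and \texttt{extends\_hull} and the
tolerance \texttt{eps} are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (FP rate, TP rate)


def hull_tp_at(hull: List[Point], fp: float) -> float:
    """TP value of the hull at a given FP, by linear interpolation."""
    for (fp1, tp1), (fp2, tp2) in zip(hull, hull[1:]):
        if fp1 <= fp <= fp2:
            if fp2 == fp1:  # vertical edge: take its top
                return max(tp1, tp2)
            t = (fp - fp1) / (fp2 - fp1)
            return tp1 + t * (tp2 - tp1)
    raise ValueError("fp must lie in [0, 1]")


def extends_hull(hull: List[Point], candidate: Point,
                 eps: float = 1e-12) -> bool:
    """True if the candidate lies strictly above the current hull."""
    fp, tp = candidate
    return tp > hull_tp_at(hull, fp) + eps
```

A candidate for which \texttt{extends\_hull} is false can be discarded
immediately; otherwise the hull is rebuilt with the new point included, which
may also remove existing vertices.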
1253: 
1254: The run-time (classification) complexity of the {\sc rocch}-hybrid is never
1255: worse than that of the component classifiers.  In situations where run-time
1256: complexity is crucial, the {\sc rocch} should be constructed without
1257: prohibitively expensive classification models.  It then will find the best
1258: subset of the computationally efficient models.
1259: 
1260: \section{Empirical demonstration of need}
1261: 
1262: Robust classification is of fundamental interest because it
1263: weakens two very strong assumptions: the
1264: availability of precise knowledge of costs and 
1265: of class distributions.
1266: However, might it not be that existing classifiers already are robust?
1267: For example, if a given classifier is optimal under one set of
1268: conditions, might it not be optimal under all?
1269: 
1270: It is beyond the scope of this paper to offer an in-depth experimental study
1271: answering this question.  However, we can provide solid evidence that the
1272: answer is ``no'' by referring to the results of two prior studies.  One is a
1273: comprehensive ROC analysis of medical domains recently conducted by Andrew
1274: Bradley \citeyear{Bradley:97}.\footnote{Bradley's purpose was not to answer
1275:   this question; fortunately, his published results do anyway.}  The other is a
1276: published ROC analysis of UCI database domains that we undertook last year
1277: with Ron Kohavi \cite{ProvostFawcettKohavi:98}.
1278: 
1279: Note that a classifier \textit{dominates} if its ROC curve completely
1280: defines the {\sc rocch} (which means dominating classifiers are robust
1281: and vice versa).  Therefore, if there exist more than a trivially few
1282: domains where no single classifier dominates, then techniques like the {\sc
1283: rocch}-hybrid are essential if robust classifiers are desired.
1284: 
1285: 
1286: \subsection{Bradley's study}
1287: 
1288: Bradley studied six
1289: medical data sets, noting that ``unfortunately, we rarely know what the
1290: individual misclassification costs are.''  He plotted the ROC curves of six
1291: classifier learning algorithms (two neural nets, two decision trees and two
1292: statistical techniques).
1293: 
1294: 
1295: \begin{figure}[tb]
1296:   \begin{center}
1297:     \epsfig{file=Bradley-HB.eps,height=3in,width=3in}
1298:     \caption{Bradley's classifier results for the heart bleeding data.}
1299:     \label{fig:Bradley-HB}
1300:   \end{center}
1301: \end{figure}
1302: 
1303: On \textit{not one} of these data sets was there a dominating
1304: classifier.  This means that for each domain, there exist different
1305: sets of conditions for which different classifiers are preferable.  In
1306: fact, the running example in the present article is based on the three
1307: best classifiers from Bradley's results on the heart bleeding data;
1308: his results for the full set of six classifiers can be found in
Figure~\ref{fig:Bradley-HB}.  Classifiers constructed for the
Cleveland heart disease data are shown in
Figure~\ref{fig:Bradley-Cleveland}.
1312: 
Bradley's results show clearly that for many domains the classifier that
optimizes any single metric (be it accuracy, cost, or the area under the ROC
curve) will be the best for some cost and class distributions and will not be
the best for others.  We have shown that the {\sc rocch}-hybrid will be the
best for all.
1318: 
1319: \begin{figure}[tb]
1320:   \begin{center}
1321:     \epsfig{file=Bradley-Cleveland.eps,height=3in,width=3in}
1322:     \caption{Bradley's classifier results for the Cleveland heart disease data}
1323:     \label{fig:Bradley-Cleveland}
1324:   \end{center}
1325: \end{figure}
1326: 
1327: \subsection{Our study}
1328: \label{sect:our-study}
1329: 
1330: In the study we performed with Ron Kohavi, we chose ten datasets from the UCI
1331: repository, each of which contains at least 250 instances, but for which the
1332: accuracy for decision trees was less than 95\%.  For each domain, we induced
1333: classifiers for the minority class (for Road, we chose the class Grass).  We
1334: selected several induction algorithms from \mlc\ \cite{mlc-new-intro-j}: a
1335: decision tree learner (MC4), Naive Bayes with discretization (NB), $k$-nearest
1336: neighbor for several $k$ values (IB$k$), and Bagged-MC4
1337: \cite{breiman-bagging}.  MC4 is similar to C4.5 \cite{quinlan-c45};
1338: probabilistic predictions are made by using a Laplace correction at the
1339: leaves.  NB discretizes the data based on entropy minimization
1340: \cite{dougherty-kohavi-sahami-disc} and then builds the Naive-Bayes model
\cite{domingos-pazzani-simple-bayes}.  IB$k$ takes a vote among the $k$
closest neighbors; each neighbor votes with a weight equal to one over its
distance from the test instance.
1344: 
1345: Some of the ROC curves are shown in Figure~\ref{fig:UCI-ROCs}.  For \emph{only
1346:   one} of these ten domains (Vehicle) was there an absolute dominator.  In
1347: general, very few of the 100 runs performed (on 10 data sets, using 10
1348: cross-validation folds each) had dominating classifiers.  Some cases are very
1349: close, for example Adult and Waveform-21.  In other cases a curve that
1350: dominates in one area of ROC space is dominated in another.  These results
also support the need for methods like the \rocch-hybrid, which produce
1352: robust classifiers.
1353: 
1354: \begin{figure}[tb]
1355:   \centerline{%
1356:     \begin{tabular}{c@{\hspace{3pc}}c}
1357:       \epsfig{file=vehicle.eps,height=2.7in,width=2.7in} & 
1358:       \epsfig{file=crx.eps,    height=2.7in,width=2.7in}\\
1359:       a.~~Vehicle                        & 
1360:       b.~~CRX \\
1361:       \\
1362:       \epsfig{file=roadGrass.eps,height=2.7in,width=2.7in} & 
1363:       \epsfig{file=satimage.eps, height=2.7in,width=2.7in}\\
1364:       c.~~RoadGrass                        & 
1365:       d.~~Satimage
1366:     \end{tabular}
1367:     }
1368:   \caption{Smoothed ROC curves from UCI database domains}
1369:   \label{fig:UCI-ROCs}
1370: \end{figure}
1371: 
1372: \begin{table}[tb]
1373:   \caption{Locally dominating classifiers for four UCI domains}
1374:   \label{tab:convex-hulls}
1375:   \normalsize
1376:   \begin{tabular*}{3.5in}{lll}
1377:       \textbf{Domain} & \textbf{Slope range} & \textbf{Dominator} \\ \hline
1378:       Vehicle         & [0, $\infty$)       & Bagged-MC4\\ \hline
1379:       Road (Grass)    & [0, 0.38]           & NB\\
1380:                       & [0.38, $\infty$)    & Bagged-MC4\\ \hline
1381:       CRX             & [0, 0.03]           & Bagged-MC4\\
1382:                       & [0.03, 0.06]        & NB\\
1383:                       & [0.06, 2.06]        & Bagged-MC4\\
1384:                       & [2.06, $\infty$)    & NB\\ \hline
1385:       Satimage        & [0, 0.05]           & NB \\
1386:                       & [0.05, 0.22]        & Bagged-MC4 \\            
1387:                       & [0.22, 2.60]        & IB5 \\
1388:                       & [2.60, 3.11]        & IB3 \\ 
1389:                       & [3.11, 7.54]        & IB5 \\                       
1390:                       & [7.54, 31.14]       & IB3 \\
1391:                       & [31.14, $\infty$)   & Bagged-MC4 \\ \hline
1392:   \end{tabular*}
1393: \end{table}
1394: 
1395: As examples of what expected-cost-minimizing \textsc{rocch}-hybrids would look
1396: like internally, Table~\ref{tab:convex-hulls} shows the component classifiers
1397: that make up the \rocch\ for the four UCI domains of
Figure~\ref{fig:UCI-ROCs}.  For example, in the Road domain (see
Figure~\ref{fig:UCI-ROCs} and Table~\ref{tab:convex-hulls}), Naive Bayes would
1400: be chosen for any target conditions corresponding to a slope less than $0.38$,
1401: and Bagged-MC4 would be chosen for slopes greater than $0.38$.  They perform
1402: equally well at $0.38$.
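
The lookup that Table~\ref{tab:convex-hulls} supports can be sketched in
Python.  This is an illustration under stated assumptions: the slope ranges
are copied from the table, the slope formula is assumed to have the usual
iso-performance form of Equation~\ref{eq:slope} (ratio of the prior-weighted
false positive cost to the prior-weighted false negative cost), and the names
\texttt{iso\_performance\_slope} and \texttt{dominators} are hypothetical.

```python
import math
from typing import List, Tuple

# (lower slope, upper slope, classifier), copied from the table above.
ROAD_GRASS: List[Tuple[float, float, str]] = [
    (0.0, 0.38, "NB"),
    (0.38, math.inf, "Bagged-MC4"),
]
CRX: List[Tuple[float, float, str]] = [
    (0.0, 0.03, "Bagged-MC4"),
    (0.03, 0.06, "NB"),
    (0.06, 2.06, "Bagged-MC4"),
    (2.06, math.inf, "NB"),
]


def iso_performance_slope(p_neg: float, cost_fp: float,
                          p_pos: float, cost_fn: float) -> float:
    """Assumed form of the slope: p(n) * c(Y,n) / (p(p) * c(N,p))."""
    return (p_neg * cost_fp) / (p_pos * cost_fn)


def dominators(table: List[Tuple[float, float, str]],
               slope: float) -> List[str]:
    """Classifiers that are optimal for the given slope.

    Two names are returned at a boundary slope, where the adjacent
    classifiers perform equally well.
    """
    return [name for lo, hi, name in table if lo <= slope <= hi]
```

For instance, with $10\%$ positives and a false negative ten times as costly
as a false positive, the assumed formula gives a slope of $0.9$, for which
the Road (Grass) entry selects Bagged-MC4.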
1403: 
1404: \section{Limitations and future work}
1405: 
There are limitations to the {\sc rocch} method as we have presented it here.
We have defined it only for two-class problems.  Srinivasan
\citeyear{Srinivasan:99} shows that it can be extended to multiple dimensions.
However, the dimensionality of the ``ROC-hyperspace'' grows
quadratically in the number of classes, so both efficiency and visualization
capability are called into question.
1412: 
1413: We have assumed constant error costs for a given \textit{type} of
1414: error, e.g., all false positives cost the same.  For some problems,
1415: different errors of the same type have different costs.  In many
1416: cases, such a problem can be transformed for evaluation into an
1417: equivalent problem with uniform intra-type error costs by duplicating
1418: instances in proportion to their costs (or by simply modifying the
1419: counting procedure accordingly).
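
The modified counting procedure can be made concrete as follows.  As a sketch
(the per-instance cost $c_i$ and predicted label $\hat{y}_i$ are notation
assumed here for illustration), weight each instance by the cost of
misclassifying it and compute the rates over weights rather than counts:
\[
  FP_w = \frac{\sum_{i \in \mathrm{neg}} c_i\, \mathbf{1}[\hat{y}_i = \mathrm{Y}]}
              {\sum_{i \in \mathrm{neg}} c_i},
  \qquad
  TP_w = \frac{\sum_{i \in \mathrm{pos}} c_i\, \mathbf{1}[\hat{y}_i = \mathrm{Y}]}
              {\sum_{i \in \mathrm{pos}} c_i}.
\]
With integer costs this matches duplicating instance $i$ exactly $c_i$ times.
For example, if three negative instances have costs $1$, $1$, and $3$ and only
the cost-$3$ instance is classified positive, then $FP_w = 3/5$, whereas the
unweighted rate is $1/3$.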
1420: 
1421: We also have assumed for this paper that the estimates of the classifiers'
1422: performance statistics ($FP$ and $TP$) are very good.  As mentioned above, much
1423: work has addressed the production of good estimates for simple performance
1424: statistics such as error rate.  Much less work has addressed the production of
1425: good ROC curve estimates.  As with simpler statistics, care should be taken to
1426: avoid over-fitting the training data and to ensure that differences between ROC
1427: curves are meaningful.  One solution is to use cross-validation with averaging
1428: of ROC curves \cite{ProvostFawcettKohavi:98}, which is the procedure used to
produce the ROC curves in Section~\ref{sect:our-study}.  To our knowledge, how
best to produce confidence bands appropriate to a particular problem remains
an open issue.  Those shown in Section~\ref{sect:our-study} are
1432: appropriate for the Neyman-Pearson decision criterion (i.e., they show
1433: confidence on $TP$ for various values of $FP$).
1434: 
1435: Also, we have addressed predictive performance and computational
1436: performance.  These are not the only concerns in choosing a
1437: classification model.  What if comprehensibility is important?  The
1438: easy answer is that for any particular setting, the {\sc rocch}-hybrid
1439: is as comprehensible as the underlying model it is using.  However,
1440: this answer falls short if the {\sc rocch}-hybrid is interpolating
1441: between two models or if one wants to understand the
1442: ``multiple-model'' system as a whole.
1443: 
Although ROC analysis and the {\sc rocch} method were specifically designed for
1445: classification domains, we have extended them to \emph{activity monitoring}
1446: domains \cite{FawcettProvost:99}.  Such domains involve monitoring the
1447: behavior of a population of entities for interesting events requiring action.
1448: These problems are substantially different from standard classification because
1449: timeliness of classification is important and dependencies exist among
1450: instances; both factors complicate evaluation.
1451: 
1452: This work is fundamentally different from other recent machine
1453: learning work on combining multiple models \cite{AliPazzani:96}.  That work
1454: combines models in order to boost performance for a fixed cost and class
1455: distribution.  The {\sc rocch}-hybrid combines models for robustness across
1456: different cost and class distributions.  In principle, these methods should be
1457: independent---multiple-model classifiers are candidates for extending the {\sc
1458:   rocch}.  However, it may be that some multiple-model classifiers achieve
1459: increased performance for a specific set of conditions by (in effect)
1460: interpolating along edges of the {\sc rocch}.
Cherikh \citeyear{Cherikh-thesis} uses ROC analysis to study decision making
where the decisions of multiple models are present.  Unlike our work, the goal
is to find optimal combinations of models for specific conditions.  However,
it seems that the two methods may be combined profitably: well-chosen
combinations of models should extend the {\sc rocch}, yielding a better
robust classifier.
1468: 
1469: The \rocch\ method also complements research on cost-sensitive learning
1470: \cite{Turney-cost-bib}.  Existing cost-sensitive learning methods are brittle
1471: with respect to imprecise cost knowledge.  Thus, the \rocch\ is an essential
1472: evaluation tool.  Furthermore, cost-sensitive learning may be used to find
1473: better components for the \rocch-hybrid, by searching explicitly for
1474: classifiers that extend the \rocch.
1475: 
1476: 
1477: 
1478: 
1479: \section{Conclusion}
1480: 
1481: The ROC convex hull method is a robust, efficient solution to the
1482: problem of comparing multiple classifiers in imprecise and changing
1483: environments.  It is intuitive, can compare classifiers both in general
1484: and under specific distribution assumptions, and provides crisp
1485: visualizations.  It minimizes the management of classifier performance
1486: data, by selecting exactly those classifiers that are potentially
1487: optimal; thus, only these need to be saved in preparation for
1488: changing conditions.  Moreover, due to its incremental nature, new
1489: classifiers can be incorporated easily, \eg when trying a new parameter
1490: setting.
1491: 
1492: The {\sc rocch}-hybrid performs optimally under any target conditions
1493: for many realistic problem formulations, including the optimization of
1494: metrics such as accuracy, expected cost, lift, precision, recall, and
1495: workforce utilization.  It is efficient to build in terms of time and
1496: space, and can be updated incrementally.  Furthermore, it can
1497: sometimes classify better than any (other) known model.  Therefore, we
1498: conclude that it is an elegant, robust classification system.
1499: 
1500: We believe that this work has important implications for both machine learning
1501: applications and machine learning research \cite{ProvostFawcettKohavi:98}.  For
1502: applications, it helps free system designers from the need to choose (sometimes
1503: arbitrary) comparison metrics before precise knowledge of key evaluation
1504: parameters is available.  Indeed, such knowledge may never be available, yet
1505: robust systems still can be built.
1506: 
1507: For machine learning research, it frees researchers from the need to
1508: have precise class and cost distribution information in order to study
1509: important related phenomena.  In particular, work on cost-sensitive
1510: learning has been impeded by the difficulty of specifying costs, and
1511: by the tenuous nature of conclusions based on a single cost metric.
1512: Researchers need not be held back by either.  Cost-sensitive learning
1513: can be studied generally without specifying costs precisely.  The same
1514: goes for research on learning with highly skewed distributions.  Which
1515: methods are effective for which levels of distribution skew?  The
1516: \rocch\ will provide a detailed answer.  
1517: 
Recently, Drummond and Holte \citeyear{drummondholtekdd:00} have
1519: demonstrated an intriguing dual to the \rocch.  Their ``cost curves''
1520: represent expected costs explicitly, rather than as slopes of
1521: iso-performance lines, and thereby provide an insightful alternative
1522: perspective for visualization.
1523: 
1524: Note: An implementation of the \rocch\ method in Perl is publicly available.
1525: The code and related papers may be found at:
1526: \url{http://www.hpl.hp.com/personal/Tom_Fawcett/ROCCH/}.
1527: 
1528: \section{Acknowledgments}
1529: 
1530: Much of this work was done while the authors were employed at the Bell
1531: Atlantic Science and Technology Center.  We thank the many with whom we have
1532: discussed ROC analysis and classifier comparison, especially Rob Holte, George
1533: John, Ron Kohavi, Ron Rymon, and Peter Turney.  We thank Andrew Bradley for
1534: supplying data from his analysis.
1535: 
1536: \bibliographystyle{theapa}
1537: \bibliography{final}
1538: \end{document}
1539: