1: \documentclass[namedreferences]{article}
2: \usepackage{theapa}
3: \usepackage{epsfig}
4: \usepackage{psfig}
5: \usepackage{xspace}
6: \usepackage{url}
7:
8: \usepackage{latexsym} % This gives us the $\Box$ symbol
9: \usepackage{endnotes} % For notes.
10:
11: \newcommand{\POS}{\texttt{\bf p}}
12: \newcommand{\NEG}{\texttt{\bf n}}
13: \newcommand{\YES}{\texttt{\bf Y}}
14: \newcommand{\NO}{\texttt{\bf N}}
15:
16: \newcommand{\rocch}{\textsc{rocch}}
17:
18: \newcommand{\IF}{\textbf{if~}}
19: \newcommand{\THEN}{\textbf{then~}}
20: \newcommand{\ELSE}{\textbf{else~}}
21: \newcommand{\ENDIF}{\textbf{end if}}
22: \newcommand{\ENDFOR}{\textbf{end for}}
23: \newcommand{\ENDWHILE}{\textbf{end while}}
24: \newcommand{\FOR}{\textbf{for~}}
25: \newcommand{\WHILE}{\textbf{while~}}
26: \newcommand{\DO}{\textbf{do~}}
27: \newcommand{\END}{\textbf{end~}}
28:
29: \newcommand{\EndProof}{$\Box$}
30:
31: \newtheorem{theorem}{Theorem}
32: \newtheorem{lemma}[theorem]{Lemma}
33: \newtheorem{corollary}[theorem]{Corollary}
34: \newtheorem{definition}{Definition}
35:
36: \newcommand{\about}{\symbol{126}}
37: \newcommand{\rem}[1]{\marginpar{\scriptsize $\rightarrow$ \raggedright #1}}
38:
39: \newcommand{\Partial}[2]{\frac{\partial #1}{\partial #2}}
40:
41: \newcommand{\mlc}{\ensuremath{\mathcal{MLC\hspace{-.05em}\raisebox{.4ex}{\tiny\bf ++}}}}
42: \def\CC{\mbox{C\hspace{-.05em}\raisebox{.4ex}{\tiny\bf ++}}}
43:
44: \graphicspath{
45: {./}
46: {./Figs/}
47: }
48:
49: \setlength{\textwidth}{6.0in}
50: \setlength{\textheight}{9.2in}
51: \setlength{\oddsidemargin}{0.25in}
52: \setlength{\evensidemargin}{0.25in}
53: \setlength{\marginparwidth}{0in}
54: \setlength{\topmargin}{0in}
55: \addtolength{\voffset}{0.0in}
56: \setlength{\hoffset}{-0.25truein}
57:
58: \newcommand{\eg}{{e.g.},\xspace}
59: \newcommand{\ie}{{i.e.},\xspace}
60: \newcommand{\etal}{et al.\@\xspace}
61: \newcommand{\legit}{{\footnotesize \ }}
62: \newcommand{\fraud}{{\footnotesize \textsc{bandit}}}
63:
64: \begin{document}
65:
66: \centerline{\textbf{\Large Robust Classification for Imprecise Environments}}
67: \vspace{1ex}
68:
69: \begin{flushleft}
70: Foster Provost \hfill \texttt{provost@acm.org}\\
71: \hspace*{.1in}\textit{New York University, New York, NY 10012}\\
72: Tom Fawcett \hfill \texttt{tfawcett@acm.org}\\
73: \hspace*{.1in}\textit{Hewlett-Packard Laboratories, Palo Alto, CA 94304}\\
74: \vspace*{.2in}
75: \end{flushleft}
76:
77: \begin{abstract}
78: In real-world environments it usually is difficult to specify target
79: operating conditions precisely, for example, target misclassification costs.
80: This uncertainty makes building robust classification systems problematic.
81: We show that it is possible to build a hybrid classifier that will perform
82: at least as well as the best available classifier for any target conditions.
83: In some cases, the performance of the hybrid actually can surpass that of
84: the best known classifier. This robust performance extends across a wide
85: variety of comparison frameworks, including the optimization of metrics such
86: as accuracy, expected cost, lift, precision, recall, and workforce
87: utilization. The hybrid also is efficient to build, to store, and to
88: update. The hybrid is based on a method for the comparison of classifier
89: performance that is robust to imprecise class distributions and
90: misclassification costs. The ROC convex hull (\rocch) method combines
91: techniques from ROC analysis, decision analysis and computational geometry,
92: and adapts them to the particulars of analyzing learned classifiers. The
93: method is efficient and incremental, minimizes the management of classifier
94: performance data, and allows for clear visual comparisons and sensitivity
95: analyses. Finally, we point to empirical evidence that a robust hybrid
96: classifier indeed is needed for many real-world problems.
97: \end{abstract}
98:
99: \begin{flushleft}
100: \textbf{Keywords:} classification, learning, uncertainty, evaluation,
101: comparison, multiple models, cost-sensitive learning, skewed distributions\\
102:
103: \vspace*{.1in}
104: \textbf{\large To appear in \emph{Machine Learning Journal}}
105:
106: \end{flushleft}
107:
108: \vspace{.1in}
109:
110: \section{Introduction}
111:
112: Traditionally, classification systems have been built by experimenting with
113: many different classifiers, comparing their performance and choosing the best.
114: Experimenting with different induction algorithms, parameter settings, and
115: training regimes yields a large number of classifiers to be evaluated and
116: compared. Unfortunately, comparison often is difficult in real-world
117: environments because key parameters of the target environment are not known.
118: The optimal cost/benefit tradeoffs and the target class priors seldom are
119: known precisely, and often are subject to change
120: \cite{ZahaviLevin:1997:issues_probl_applying_neural_comput,FriedmanWyatt:97,KlinkenbergJoachims:2000}.
121: For example, in fraud detection we cannot ignore misclassification costs or
122: the skewed class distribution, nor can we assume that our estimates are
123: precise or static \cite{FawcettProvost:97}. We need a method for the
124: management, comparison, and application of multiple classifiers that is robust
125: in imprecise and changing environments.
126:
127: We describe the \textit{ROC convex hull} (\rocch) method, which combines
128: techniques from ROC analysis, decision analysis and computational geometry.
129: The ROC convex hull decouples classifier performance from specific class and
130: cost distributions, and may be used to specify the subset of methods that are
131: potentially optimal under any combination of cost assumptions and class
132: distribution assumptions. The \rocch\ method is efficient, so it facilitates
133: the comparison of a large number of classifiers. It minimizes the management
134: of classifier performance data because it can specify exactly those
135: classifiers that are potentially optimal, and it is incremental, easily
136: incorporating new and varied classifiers without having to reevaluate all
137: prior classifiers.
138:
139: We demonstrate that it is possible and desirable to avoid complete commitment
140: to a single best classifier during system construction. Instead, the \rocch\
141: can be used to build from the available classifiers a hybrid classification
142: system that will perform best under any target cost/benefit and class
143: distributions. Target conditions can then be specified at run time.
144: Moreover, in cases where precise information is still unavailable when the
145: system is run (or if the conditions change dynamically during operation), the
146: hybrid system can be tuned easily (and optimally) based on feedback from its
147: actual performance.
148:
149: The paper is structured as follows. First we sketch briefly the traditional
150: approach to building such systems, in order to demonstrate that it is brittle
151: under the types of imprecision common in real-world problems. We then
152: introduce and describe the \rocch\ and its properties for comparing and
153: visualizing classifier performance in imprecise environments. In the
154: following sections we formalize the notion of a robust classification system,
155: and show that the \rocch\ is an elegant method for constructing one
156: automatically. The solution is elegant because the resulting hybrid
157: classifier is robust for a wide variety of problem formulations, including the
158: optimization of metrics such as accuracy, expected cost, lift, precision,
159: recall, and workforce utilization, and it is efficient to build, to store, and
160: to update. We then show that the hybrid actually can do better than the best
161: known classifier in certain situations. Finally, by citing results from
162: empirical studies, we provide evidence that this type of system indeed is
163: needed.
164:
165: \subsection{An example}
166:
167: A systems-building team wants to create a system that will take a
168: large number of instances and identify those for which an action
169: should be taken. The instances could be potential cases of fraudulent
170: account behavior, of faulty equipment, of responsive customers, of
171: interesting science, etc. We consider problems for which the best
172: method for classifying or ranking instances is not well defined, so
173: the system builders may consider machine learning methods, neural
174: networks, case-based systems, and hand-crafted knowledge bases as
175: potential classification models. Ignoring for the moment issues of
176: efficiency, the foremost question facing the system builders is: which
177: of the available models performs ``best'' at classification?
178:
179: Traditionally, an experimental approach has been taken to answer this question,
180: because the distribution of instances can be sampled if it is not known a
181: priori. The standard approach is to estimate the error rate of each model
182: statistically and then to choose the model with the lowest error rate. This
183: strategy is common in machine learning, pattern recognition, data mining,
184: expert systems and medical diagnosis. In some cases, other measures such as
185: cost or benefit are used as well. Applied statistics provides methods such as
cross-validation and the bootstrap for estimating model error rates, and recent
studies have compared the effectiveness of different methods
188: \cite{Dietterich:98,kohavi-accest,Salzberg:97}.
189:
190: Unfortunately, this experimental approach is brittle under two types
191: of imprecision that are common in real-world environments.
192: Specifically, costs and benefits usually are not known precisely, and
193: target (prior) class distributions often are known only approximately
194: as well. This observation has been made by many authors
195: \cite{Bradley:97,Catlett:95,ProvostFawcett:97}, and is in fact the
196: concern of a large subfield of decision analysis
197: \cite{WeinsteinFineberg:80}. Imprecision also arises because the
198: environment may change between the time the system is conceived and
199: the time it is used, and even as it is used. For example, levels of
200: fraud and levels of customer responsiveness change continually over
201: time and from place to place.
202:
203: \subsection{Basic terminology}
204:
205: \begin{figure}[tb]
206: \begin{center}
207: \epsfig{file=NeymanPearson.eps,height=3in}
208: \caption{Three classifiers under three different Neyman-Pearson decision
209: criteria}
210: \label{fig:NP}
211: \end{center}
212: \end{figure}
213:
214: In this paper we address two-class problems. Formally, each instance
215: $I$ is mapped to one element of the set $\{\POS,\NEG\}$ of (correct)
216: positive and negative classes. A \emph{classification model} (or
217: \emph{classifier}) is a mapping from instances to predicted classes.
218: Some classification models produce a continuous output (\eg an
219: estimate of an instance's class membership probability) to which
220: different thresholds may be applied to predict class membership. To
221: distinguish between the actual class and the predicted class of an
222: instance, we will use the labels $\{\YES,\NO\}$ for the
223: classifications produced by a model. For our discussion, let
224: $c(\textit{classification}, \textit{class})$ be a two-place error cost
225: function where $c(\YES,\NEG)$ is the cost of a false positive error
226: and $c(\NO,\POS)$ is the cost of a false negative error.\footnote{For
227: this paper, we consider error costs to include benefits not realized,
228: and ignore the costs of correct classifications.}
229: We represent class distributions by the classes' prior probabilities
230: $p(\POS)$ and $p(\NEG) = 1 - p(\POS)$.
231:
232:
233: The true positive rate, or hit rate, of a classifier is:
234: \begin{displaymath}
235: TP = p(\YES|\POS) \approx \frac{\rm positives\: correctly\: classified}
236: {\rm total\: positives}
237: \end{displaymath}
238: The false positive rate, or false alarm rate, of a classifier is:
239: \begin{displaymath}
240: FP = p(\YES|\NEG) \approx \frac{\rm negatives\: incorrectly\: classified}
241: {\rm total\: negatives}
242: \end{displaymath}
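
As a purely illustrative calculation (the counts are hypothetical and are not
drawn from any experiment reported here), suppose a test set contains 100
positive and 900 negative instances, and a classifier outputs \YES\ for 80 of
the positives and for 90 of the negatives. Then
\begin{displaymath}
TP \approx \frac{80}{100} = 0.8, \qquad FP \approx \frac{90}{900} = 0.1 ,
\end{displaymath}
so the classifier's performance is summarized by the pair $(FP,TP) = (0.1, 0.8)$.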
243:
244:
245: The traditional experimental approach is brittle because it chooses
246: one model as ``best'' with respect to a specific set of cost functions
247: and class distribution. If the target conditions change, this system
248: may no longer perform optimally, or even acceptably. As an example,
assume that we have a maximum false positive rate that must not be
exceeded. We want to find the classifier with the highest possible
true positive rate, $TP$, whose false positive rate does not exceed this $FP$ limit. This
252: is the Neyman-Pearson decision criterion \cite{Egan:75}. Three
253: classifiers, under three such $FP$ limits, are shown in
254: figure~\ref{fig:NP}. A different classifier is best for each $FP$
255: limit; any system built with a single ``best'' classifier is brittle
256: if the $FP$ requirement can change.
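
Stated symbolically (a sketch in notation introduced only for this
illustration: $FP_{\max}$ denotes the limit and $(FP_i, TP_i)$ the rates of
the $i$-th candidate classifier), the criterion selects
\begin{displaymath}
i^{*} = \arg\max_{i \,:\, FP_i \le FP_{\max}} TP_i .
\end{displaymath}
For instance, with two hypothetical candidates at $(0.05, 0.60)$ and
$(0.20, 0.90)$, a limit of $FP_{\max} = 0.1$ admits only the first, whereas a
limit of $FP_{\max} = 0.25$ admits both and selects the second.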
257:
258: \section{Evaluating and visualizing classifier performance}
259:
260: \subsection{Classifier comparison: decision analysis and ROC analysis}
261:
262: Most prior work on building classifiers uses classification accuracy (or,
263: equivalently, undifferentiated error rate) as the primary evaluation metric.
264: The use of accuracy assumes that the class priors in the target environment
265: will be \textit{constant and relatively balanced}. In the real world this
266: rarely is the case. Classifiers often are used to sift through a large
267: population of normal or uninteresting entities in order to find a relatively
268: small number of unusual ones; for example, looking for defrauded accounts
269: among a large population of customers, screening medical tests for rare
270: diseases, and checking an assembly line for defective parts. Because the
271: unusual or interesting class is rare among the general population, the class
272: distribution is very skewed
273: \cite{EzawaEtal:96,FawcettProvost:96,FawcettProvost:97,KubatHolteMatwin:98,SaittaNeri:98}.
274:
275: As the class distribution becomes more skewed, evaluation based on accuracy
276: breaks down. Consider a domain where the classes appear in a 999:1 ratio. A
277: simple rule---always classify as the maximum likelihood class---gives a 99.9\%
278: accuracy. This accuracy may be quite difficult for an induction algorithm
279: to beat, though the simple rule presumably is unacceptable if a non-trivial
280: solution is sought. Skews of $10^2$ are common in fraud detection and skews
281: exceeding $10^6$ have been reported in other applications
282: \cite{ClearwaterStern:91}.
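
The effect can be checked by writing accuracy in terms of the rates defined
earlier (a short worked calculation that uses only the 999:1 example above):
\begin{displaymath}
\textrm{Accuracy} = p(\POS)\cdot TP + p(\NEG)\cdot (1 - FP) .
\end{displaymath}
The rule that always outputs \NO\ has $TP = 0$ and $FP = 0$. If the rare class
is taken to be \POS, so that $p(\NEG) = 0.999$, its accuracy is
$0.001\cdot 0 + 0.999\cdot 1 = 0.999$, although it never identifies a single
positive instance.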
283:
284: Evaluation by classification accuracy also assumes \textit{equal error costs}:
285: $c(\YES,\NEG)=c(\NO,\POS)$. In the real world classifications lead to
286: actions, which have consequences. Actions can be as diverse as denying a
287: credit charge, discarding a manufactured part, moving a control surface on an
288: airplane, or informing a patient of a cancer diagnosis. The consequences may
289: be grave, and performing an incorrect action may be very costly. Rarely are
290: the costs of mistakes equivalent. In mushroom classification, for example,
291: judging a poisonous mushroom to be edible is far worse than judging an edible
292: mushroom to be poisonous. Indeed, it is hard to imagine a domain in which a
293: classification system may be indifferent to whether it makes a false positive
or a false negative error. When error costs differ, accuracy maximization should be
replaced with cost minimization.
296:
297: The problems of unequal error costs and uneven class distributions are
298: related. It has been suggested that, for training, high-cost
299: instances can be compensated for by increasing their prevalence in an
300: instance set \cite{bre84}. Unfortunately, little work has been
301: published on either problem. There exist several dozen articles in
302: which techniques for cost-sensitive learning are suggested
303: \cite{Turney-cost-bib}, but few studies evaluate and compare them
304: \cite{Domingos:99,pazzani-cost:94,ProvostFawcettKohavi:98}. The
305: literature provides even less guidance in situations where
306: distributions are imprecise or can change.
307:
308: \begin{figure}[tb]
309: \begin{center}
310: \epsfig{file=ROC-curves.eps,height=3in,width=3.2in}
311: \caption{ROC graph of three classifiers}
312: \label{fig:ROC-curves}
313: \end{center}
314: \end{figure}
315:
316: Given an estimate of $p(\POS|I)$, the posterior probability of an instance's
317: class membership, decision analysis gives us a way to produce cost-sensitive
318: classifications \cite{WeinsteinFineberg:80}. Classifier error frequencies can
319: be used to approximate such probabilities \cite{pazzani-cost:94}. For an
320: instance $I$, the decision to emit a positive classification from a particular
321: classifier is:
322:
323: \[
324: [1-p(\POS|I)] \cdot c(\YES,\NEG) \; < \; p(\POS|I) \cdot c(\NO,\POS)
325: \]
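
Solving this inequality for $p(\POS|I)$ gives an equivalent threshold form (a
rearrangement added here for illustration; it assumes that both error costs
are positive):
\[
p(\POS|I) \; > \; \frac{c(\YES,\NEG)}{c(\YES,\NEG) + c(\NO,\POS)}
\]
For example, with hypothetical costs $c(\YES,\NEG) = 1$ and $c(\NO,\POS) = 25$,
a positive classification is emitted whenever
$p(\POS|I) > \frac{1}{26} \approx 0.038$; with equal costs the threshold is
$0.5$.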
326:
327: Regardless of whether a classifier produces probabilistic or binary
328: classifications, its normalized cost on a test set can be evaluated
empirically, taking $p(\POS)$ and $p(\NEG)$ to be the class proportions in the test set, as:
\[
\textrm{Cost} = p(\NEG)\cdot FP\cdot c(\YES,\NEG) + p(\POS)\cdot (1 - TP)\cdot c(\NO,\POS)
\]
333: Most published work on cost-sensitive classification uses an equation such as
334: this to rank classifiers. Given a set of classifiers, a set of examples, and a
335: precise cost function, each classifier's cost is computed and the minimum-cost
336: classifier is chosen. However, as discussed above, such analyses assume that
337: the distributions are precisely known and static.
338:
339: More general comparisons can be made with Receiver Operating Characteristic
340: (ROC) analysis, a classic methodology from signal detection theory that is
341: common in medical diagnosis and has recently begun to be used more generally
342: in AI classifier work
343: \cite{Beck-Schultz:86,Egan:75,Swets:88,FriedmanWyatt:97}. ROC graphs depict
344: tradeoffs between hit rate and false alarm rate.
345:
346: We use the term \textit{ROC space} to denote the coordinate system used for
347: visualizing classifier performance. In ROC space, $TP$ is represented on the Y
348: axis and $FP$ is represented on the X axis. Each classifier is represented by
349: the point in ROC space corresponding to its $(FP,TP)$ pair. For models that
350: produce a continuous output, e.g., posterior probabilities, $TP$ and $FP$ vary
351: together as a threshold on the output is varied between its extremes (each
352: threshold defines a classifier); the resulting curve is called the ROC curve.
353: An ROC curve illustrates the error tradeoffs available with a given model.
354: Figure~\ref{fig:ROC-curves} shows a graph of three typical ROC curves; in fact,
355: these are the complete ROC curves of the classifiers shown in
356: figure~\ref{fig:NP}.
357:
358:
359: For orientation, several points on an ROC graph should be noted. The lower
360: left point $(0,0)$ represents the strategy of never alarming, the upper right
361: point $(1,1)$ represents the strategy of always alarming, the point $(0,1)$
362: represents perfect classification, and the line $y=x$ (not shown) represents
363: the strategy of randomly guessing the class. Informally, one point in ROC
364: space is better than another if it is to the northwest ($TP$ is higher, $FP$ is
365: lower, or both). An ROC graph allows an informal visual comparison of a set of
366: classifiers.
367:
368:
369:
370:
371: ROC graphs illustrate the behavior of a classifier \emph{without
372: regard to class distribution or error cost}, and so they decouple
373: classification performance from these factors. Unfortunately, while
374: an ROC graph is a valuable visualization technique, it does a poor job
375: of aiding the choice of classifiers. Only when one classifier clearly
376: dominates another over the entire performance space can it be declared
377: better.
378:
379:
380: \subsection{The ROC Convex Hull method}
381:
382: In this section we combine decision analysis with ROC analysis and adapt them
383: for comparing the performance of a set of learned classifiers. The method is
384: based on three high-level principles. First, ROC space is used to separate
385: classification performance from class and cost distribution information.
386: Second, decision-analytic information is projected onto the ROC space. Third,
387: the convex hull in ROC space is used to identify the subset of classifiers
388: that are potentially optimal.
389:
390:
391: \begin{figure}[tb]
392: \centering
393: \epsfig{file=ROC2.eps}
394: \caption{The ROC convex hull identifies potentially optimal classifiers.}
395: \label{fig:ROC-hull}
396: \end{figure}
397:
398: \subsubsection{Iso-performance lines}
399:
400: By separating classification performance from class and cost distribution
401: assumptions, the decision goal can be projected onto ROC space for a neat
402: visualization. Specifically, the expected cost of applying the classifier
403: represented by a point ($FP$,$TP$) in ROC space is:
404:
405:
406: \[
407: p(\POS)\cdot (1-TP)\cdot c(\NO,\POS) \; + \; p(\NEG)\cdot FP \cdot c(\YES,\NEG)
408: \]
409:
410: Therefore, two points, ($FP_1$,$TP_1$) and ($FP_2$,$TP_2$),
411: have the same performance if
412:
413: \[
414: \frac{TP_2 - TP_1}{FP_2 - FP_1}
415: =
416: \frac{c(\YES,\NEG)p(\NEG)}{c(\NO,\POS)p(\POS)}
417: \]
418:
419:
420: This equation defines the slope of an \textit{iso-performance line}.
421: That is, all classifiers corresponding to points on the line have the
422: same expected cost. Each set of class and cost distributions defines
423: a family of iso-performance lines. Lines ``more northwest'' (having a
424: larger $TP$-intercept) are better because they correspond to
425: classifiers with lower expected cost.
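
The slope condition follows in one step (a derivation sketch; the symbols $m$
and $b$ are introduced only for this illustration). Equating the expected
costs of the two points and cancelling the constant term
$p(\POS)\cdot c(\NO,\POS)$ that appears on both sides leaves
\[
p(\POS)\cdot c(\NO,\POS)\cdot (TP_2 - TP_1)
=
p(\NEG)\cdot c(\YES,\NEG)\cdot (FP_2 - FP_1),
\]
which rearranges to the ratio above whenever $FP_1 \neq FP_2$. Writing an
iso-performance line as $TP = m\cdot FP + b$, with $m$ the slope given above
and $b$ its $TP$-intercept, and substituting into the expected cost gives
\[
p(\POS)\cdot c(\NO,\POS)\cdot (1 - b),
\]
which does not depend on $FP$ and decreases as $b$ grows. This is why a larger
$TP$-intercept corresponds to a lower expected cost.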
426:
427: \subsubsection{The ROC convex hull}
428:
429: Because in most real-world cases the target distributions are not known
430: precisely, it is valuable to be able to identify those classifiers that
431: potentially are optimal. Each possible set of distributions defines a family
432: of iso-performance lines, and for a given family, the optimal methods are
433: those that lie on the ``most-northwest'' iso-performance line. Thus, a
434: classifier is optimal for some conditions if and only if it lies on the
435: northwest boundary (\ie above the line $y=x$) of the convex hull
436: \cite{quickhull:96} of the set of points in ROC space.\footnote{The convex
437: hull of a set of points is the smallest convex set that contains the
438: points.} We discuss this in detail in Section~\ref{sect:rocch-hybrid}.
439:
440:
441: \begin{figure}[tb]
442: \centering
443: \epsfig{file=ROC3.eps}
444: \caption{Lines $\alpha$ and $\beta$ show the optimal classifier under
445: different sets of conditions.}
446: \label{fig:ROC-hull2}
447: \end{figure}
448:
449: We call the convex hull of the set of points in ROC space the \textit{ROC
450: convex hull} (\rocch) of the corresponding set of classifiers.
451: Figure~\ref{fig:ROC-hull} shows four ROC curves with the ROC convex hull drawn
452: as the border between the shaded and unshaded areas. $\mathsf{D}$ is clearly
453: not optimal. Perhaps surprisingly, $\mathsf{B}$ can never be optimal either
454: because none of the points of its ROC curve lies on the convex hull. We can
455: also remove from consideration any points of $\mathsf{A}$ and $\mathsf{C}$
456: that do not lie on the hull.
457:
458: Consider these classifiers under two distribution scenarios. In each, negative
459: examples outnumber positives by 5:1. In scenario $\mathcal{A}$, false
460: positive and false negative errors have equal cost. In scenario $\mathcal{B}$,
461: a false negative is 25 times as expensive as a false positive (\eg missing a
462: case of fraud is much worse than a false alarm). Each scenario defines a
463: family of iso-performance lines. The lines corresponding to scenario
464: $\mathcal{A}$ have slope 5; those for $\mathcal{B}$ have slope $\frac{1}{5}$.
465: Figure~\ref{fig:ROC-hull2} shows the convex hull and two iso-performance
466: lines, $\alpha$ and $\beta$. Line $\alpha$ is the ``best'' line
467: with slope $5$ that intersects the convex hull; line $\beta$ is the best line
468: with slope $\frac{1}{5}$ that intersects the convex hull. Each line
469: identifies the optimal classifier under the given distribution.
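
As a check on these slopes (a worked calculation that uses only the scenario
values stated above, with costs expressed in units of one false positive), a
5:1 class ratio gives $p(\NEG)/p(\POS) = 5$, so
\[
m_{\mathcal{A}} = \frac{c(\YES,\NEG)\,p(\NEG)}{c(\NO,\POS)\,p(\POS)}
= \frac{1}{1}\cdot 5 = 5,
\qquad
m_{\mathcal{B}} = \frac{1}{25}\cdot 5 = \frac{1}{5} .
\]
The steep lines of scenario $\mathcal{A}$ favor points toward the lower left
of ROC space (few false alarms), while the shallow lines of scenario
$\mathcal{B}$ favor points toward the upper right (few missed positives).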
470:
471: \begin{figure}[tb]
472: \begin{center}
473: \epsfig{file=ROC-hull.eps,height=3in,width=3.2in}
474: \caption{ROC curves with convex hull}
475: \label{fig:ROCCH}
476: \end{center}
477: \end{figure}
478:
479:
480: Figure~\ref{fig:ROCCH} shows the three ROC curves from our initial
481: example, with the convex hull drawn.
482:
483:
484: \subsubsection{Generating the ROC Convex Hull}
485:
The \textit{ROC convex hull method} selects the potentially optimal classifiers
487: based on the ROC convex hull and iso-performance lines.
488:
489: \begin{table}[tb]
490: \caption{Algorithm for generating an ROC curve from a set of
491: ranked examples.}
492: \begin{center}
493: \rule{\textwidth}{.01in}
494: \begin{tabbing}
495: \textbf{\rmfamily Given:}~~ \=E: \= List of \=tuples
496: $\langle I, p \rangle$ where:\\
497: \>\>\>$I$: labeled example\\
498: \>\>\>$p$: numeric ranking assigned to $I$ by the classifier \\
499: \>$P, N$: count of positive and negative examples in E, respectively.\\
500: \textbf{\rmfamily Output:} R: List of points on the ROC curve.\\
501: \vspace*{1ex}\\
502: xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=xxxxx\=\kill
503: $Tcount = 0$; \>\>\>\>\>\>{\it /* current TP tally */ }\\
504: $Fcount = 0$; \>\>\>\>\>\>{\it /* current FP tally */ }\\
505: $plast = -\infty$; \>\>\>\>\>\>{\it /* last score seen */ }\\
506: $R = \langle \rangle$; \>\>\>\>\>\>{\it /* list of ROC points */ }\\
507: sort $E$ in decreasing order by $p$ values;\\
508: \WHILE (E $\neq \emptyset$) \DO \\
509: \>remove tuple $\langle I, p \rangle$ from head of E;\\
510: \>\IF ($p \neq plast$) \THEN\\
511: \>\>add point ($\frac{Fcount}{N}$, $\frac{Tcount}{P}$) to end of R;\\
512: \>\>$plast = p$;\\
513: \>\ENDIF\\
514: \>\IF ($I$ is a positive example) \THEN\\
515: \>\>$Tcount = Tcount + 1$;\\
516: \>\ELSE \>\>\>\>\>{\it /* I is a negative example */}\\
517: \>\>$Fcount = Fcount + 1$;\\
518: \>\ENDIF\\
519: \ENDWHILE\\
520: add point ($\frac{Fcount}{N}$, $\frac{Tcount}{P}$) to end of R;\\
521: \end{tabbing}
522: \rule{\textwidth}{.01in}
523: \end{center}
524: \label{tab:ROC-alg}
525: \end{table}
526:
527: \begin{enumerate}
528:
529: \item For each classifier, plot $TP$ and $FP$ in ROC space. For
530: continuous-output classifiers, vary a threshold over the output range
531: and plot the ROC curve. Table~\ref{tab:ROC-alg} shows an algorithm
532: for producing such an ROC curve in a single pass.\footnote{There is a
533: subtle complication to producing ROC curves from ranked test-set data,
534: which is reflected in the algorithm shown in Table~\ref{tab:ROC-alg}.
535: Specifically, consecutive examples with the same score can give overly
536: optimistic or overly pessimistic ROC curves, depending on the ordering
537: of positive and negative examples. The ROC curve generating algorithm
538: shown here waits until all examples with the same score have been
539: tallied before computing the next point of the ROC curve. The result
540: is a segment that bisects the area that would have resulted from the
541: most optimistic and most pessimistic orderings.}
542:
543: \item Find the convex hull of the set of points representing the predictive
544: behavior of all classifiers of interest, for example by using the QuickHull
545: algorithm \cite{quickhull:96}.
546:
547: \item For each set of class and cost distributions of interest, find the slope
548: (or range of slopes) of the corresponding iso-performance lines.
549:
550: \item For each set of class and cost distributions, the optimal classifier will
551: be the point on the convex hull that intersects the iso-performance line with
552: largest $TP$-intercept. Ranges of slopes specify hull segments.
553:
554: \end{enumerate}
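
The following trace illustrates the four steps on a small hypothetical example
(the scores and labels are invented for illustration and do not come from any
experiment reported here). Suppose a classifier ranks six test examples, with
$P = 3$ and $N = 3$, as
\[
\langle \POS, 0.9 \rangle,\;
\langle \POS, 0.8 \rangle,\;
\langle \NEG, 0.8 \rangle,\;
\langle \NEG, 0.6 \rangle,\;
\langle \POS, 0.5 \rangle,\;
\langle \NEG, 0.3 \rangle .
\]
Applied to this list, the algorithm of Table~\ref{tab:ROC-alg} produces
\[
R = \left\langle (0,0),\; (0,\frac{1}{3}),\; (\frac{1}{3},\frac{2}{3}),\;
(\frac{2}{3},\frac{2}{3}),\; (\frac{2}{3},1),\; (1,1) \right\rangle .
\]
No point is emitted between the two examples tied at score $0.8$, so the curve
moves diagonally from $(0,\frac{1}{3})$ to $(\frac{1}{3},\frac{2}{3})$. The
northwest boundary of the convex hull of these points has vertices $(0,0)$,
$(0,\frac{1}{3})$, $(\frac{2}{3},1)$ and $(1,1)$; the point
$(\frac{1}{3},\frac{2}{3})$ lies on the edge joining $(0,\frac{1}{3})$ and
$(\frac{2}{3},1)$, and $(\frac{2}{3},\frac{2}{3})$ lies below that edge. For
iso-performance lines of slope $m$, the $TP$-intercept through a point is
$b = TP - m\cdot FP$. With $m = 5$ the four vertices give
$b = 0,\ \frac{1}{3},\ -\frac{7}{3},\ -4$, so $(0,\frac{1}{3})$ is optimal
(classify as \YES\ only when the score is at least $0.9$). With
$m = \frac{1}{5}$ they give
$b = 0,\ \frac{1}{3},\ \frac{13}{15},\ \frac{4}{5}$, so $(\frac{2}{3},1)$ is
optimal (classify as \YES\ when the score is at least $0.5$).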
555:
556:
557: Figures~\ref{fig:ROC-hull} and \ref{fig:ROC-hull2} demonstrate how the
558: subset of classifiers that are potentially optimal can be identified
559: and how classifiers can be compared under different cost and class
560: distributions.
561:
562: \subsubsection{Comparing a variety of classifiers}
563:
564: The ROC convex hull method accommodates both binary and continuous
565: classifiers. Binary classifiers are represented by individual points in ROC
566: space. Continuous classifiers produce numeric outputs to which thresholds can
567: be applied, yielding a series of $(FP, TP)$ pairs forming an ROC curve. Each
568: point may or may not contribute to the ROC convex hull.
569: Figure~\ref{fig:Adding-EFG} depicts the binary classifiers $\mathsf{E}$,
570: $\mathsf{F}$ and $\mathsf{G}$ added to the previous hull. $\mathsf{E}$ may be
571: optimal under some circumstances because it extends the convex hull.
572: Classifiers $\mathsf{F}$ and $\mathsf{G}$ never will be optimal because they
573: do not extend the hull.
574:
575: \begin{figure}[tb]
576: \centering \epsfig{file=Adding-classifiers.eps,height=3in}
577: \caption{Classifier $\mathsf{E}$ may be optimal for some conditions because
578: it extends the ROC convex hull. $\mathsf{F}$ and $\mathsf{G}$ cannot be
optimal because they are not on the hull, nor do they extend it.}
580: \label{fig:Adding-EFG}
581: \end{figure}
582:
583: New classifiers can be added incrementally to an \rocch\ analysis, as
584: demonstrated in figure~\ref{fig:Adding-EFG} by the addition of classifiers
$\mathsf{E}$, $\mathsf{F}$, and $\mathsf{G}$. Each new classifier either
586: extends the existing hull or it does not. In the former case the hull must be
587: updated accordingly, but in the latter case the new classifier can be ignored.
588: Therefore, the method does not require saving every classifier (or saving
589: statistics on every classifier) for re-analysis under different
590: conditions---only those points on the convex hull. Recall that each point is
591: a classifier and might take up considerable space. Further, the management of
592: knowledge about many classifiers and their statistics from many different runs
593: of learning programs (e.g., with different algorithms or parameter settings)
594: can be a substantial undertaking. Classifiers not on the \rocch\ can never be
595: optimal, so they need not be saved. Every classifier that \emph{does} lie on
596: the convex hull must be saved. In Section~\ref{sect:our-study} we demonstrate
597: the \rocch\ in use, managing the results of many learning experiments.
598:
599: \subsubsection{Changing distributions and costs}
600:
601: Class and cost distributions that change over time necessitate the reevaluation
602: of classifier choice. In fraud detection, costs change based on workforce and
603: reimbursement issues; the amount of fraud changes monthly. With the ROC convex
604: hull method, comparing under a new distribution involves only calculating the
605: slope(s) of the corresponding iso-performance lines and intersecting them with
606: the hull, as shown in figure~\ref{fig:ROC-hull2}.
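
As a hypothetical illustration (the shift described here is invented for the
example), suppose the costs of scenario $\mathcal{B}$ remain fixed while the
ratio of negative to positive examples moves from 5:1 to 10:1. The slope of
the iso-performance lines changes from $\frac{1}{5}$ to
\[
\frac{1}{25}\cdot 10 = \frac{2}{5},
\]
and only lines of this new slope need to be intersected with the stored hull;
no classifier has to be retrained or re-evaluated on test data.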
607:
608: The ROC convex hull method scales gracefully to any degree of
609: precision in specifying the cost and class distributions. If nothing
610: is known about a distribution, the ROC convex hull shows all
611: classifiers that may be optimal under any conditions.
612: Figure~\ref{fig:ROC-hull} showed that, given classifiers $\mathsf{A}$,
613: $\mathsf{B}$, $\mathsf{C}$ and $\mathsf{D}$, only $\mathsf{A}$ and
614: $\mathsf{C}$ can ever be optimal. With complete information, the
615: method identifies the optimal classifier(s). In
616: figure~\ref{fig:ROC-hull2} we saw that classifier $\mathsf{A}$ (with a
617: particular threshold value) is optimal under scenario $\mathcal{A}$
618: and classifier $\mathsf{C}$ is optimal under scenario $\mathcal{B}$.
619: Next we will see that with less precise information, the ROC convex
620: hull can show the subset of possibly optimal classifiers.
621:
622: \subsubsection{Sensitivity analysis}
623:
624:
625: \begin{figure}[tb]
626: \begin{center}
627: %
628: %
629: \epsfig{file=Sensitivity-1.eps,height=2.7in,width=2.6in} \\
630: a.~~Low sensitivity\\
631: \vspace*{.2in}
632: \epsfig{file=Sensitivity-2.eps,height=2.5in,width=2.5in}\\
633: b.~~High sensitivity\\
634: \end{center}
635: \caption{Sensitivity analysis using the ROC convex hull: (a) low
636: sensitivity (only C can be optimal), (b) high sensitivity (A, E, or C can
637: be optimal)}
638: \label{fig:sensitive}
639: \end{figure}
640:
641:
642: Imprecise distribution information defines a \emph{range} of slopes for
643: iso-performance lines. This range of slopes intersects a segment of the ROC
644: convex hull, which facilitates sensitivity analysis. For example, if the
645: segment defined by a range of slopes corresponds to a single point in ROC
646: space or a small threshold range for a single classifier, then there is no
647: sensitivity to the distribution assumptions in question. Consider a scenario
648: similar to $\mathcal{A}$ and $\mathcal{B}$ in that negative examples are 5
649: times as prevalent as positive ones. In this scenario, consider the cost of
650: dealing with a false alarm to be between \$10 and \$20, and the cost of
651: missing a positive example to be between \$200 and \$250. These conditions
652: define a range of slopes for iso-performance lines: $\frac{1}{5}\le m \le
653: \frac{1}{2}$. Figure~\ref{fig:sensitive}a depicts this range of slopes and
654: the corresponding segment of the ROC convex hull. The figure shows that the
655: choice of classifier is insensitive to changes within this range (and only
656: fine tuning of the classifier's threshold will be necessary).
657: Figure~\ref{fig:sensitive}b depicts a scenario with a wider range of slopes:
658: $\frac{1}{2} \le m \le 3$. The figure shows that under this scenario the
659: choice of classifier is very sensitive to the distribution. Classifiers
660: $\mathsf{A}$, $\mathsf{C}$ and $\mathsf{E}$ each are optimal for some
661: subrange.
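
As a check on the first range, the extreme slopes can be worked out by
hand. The calculation below is only a sketch: it takes the class ratio
$p(\NEG)/p(\POS)=5$ and the two cost intervals from the scenario above,
and pairs the extremes of those intervals to obtain the smallest and
largest slopes.
\[
m \; = \; \frac{c(\YES,\NEG)\,p(\NEG)}{c(\NO,\POS)\,p(\POS)}, \qquad
m_{min} = \frac{10 \cdot 5}{250} = \frac{1}{5}, \qquad
m_{max} = \frac{20 \cdot 5}{200} = \frac{1}{2}
\]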
662:
663: \section{Building robust classifiers}
664: \label{sect:rocch-hybrid}
665:
666: Up to this point, we have concentrated on the use of the \rocch\ for
667: visualizing and evaluating sets of classifiers. The \rocch\ helps to
668: delay classifier selection as long as possible, yet provides a rich
669: performance comparison. However, once system building incorporates a
670: particular classifier, the problem of brittleness resurfaces. This is
671: important because the delay between system building and deployment may
672: be large, and because many systems must survive for years. In fact,
673: in many domains a precise, static specification of future costs and
674: class distributions is not just unlikely, it is impossible
675: \cite{ProvostFawcettKohavi:98}.
676:
677: We address this brittleness by using the \rocch\ to produce
678: \textbf{robust classifiers}, defined as satisfying the following.
679: \emph{Under any target cost and class distributions, a robust
680: classifier will perform at least as well as the best classifier for
681: those conditions.} Our statements about optimality are practical: the
682: ``best'' classifier may not be the Bayes-optimal classifier, but it is
683: at least as good as any known classifier.
684: Srinivasan \citeyear{Srinivasan:99} calls this ``FAPP-optimal''
685: (optimal for all practical purposes). Stating that a classifier is
686: robust is stronger than stating that it is optimal for a specific set
687: of conditions. A robust classifier is optimal under all possible
688: conditions.
689:
690: In principle, classification brittleness could be overcome by saving
691: all possible classifiers (neural nets, decision trees, expert systems,
692: probabilistic models, etc.) and then performing an automated run-time
693: comparison under the desired target conditions. However, such a
694: system is not feasible because of time and space limitations---there
695: are myriad possible classification models, arising from the many
696: different learning methods under their many different parameter
settings. Neither storing all the classifiers nor tuning
the system by comparing classifiers on the fly under different
conditions is feasible. Fortunately, neither is necessary.
700: Moreover, we will show that it is sometimes possible to do \textit{better} than
701: any of these classifiers.
702:
703: \subsection{ROCCH-hybrid classifiers}
704:
705: We now show that robust hybrid classifiers can be built using the \rocch.
706:
707: \begin{definition}
708: Let $\mathbf{I}$ be the space of possible instances and let $\mathbf{C}$ be
709: the space of sets of classification models. Let a
710: \mathversion{bold}$\mu$\mathversion{normal}\textbf{-hybrid classifier}
711: comprise a set of classification models $\mathcal{C} \in \mathbf{C}$ and a
712: function
713: \[
714: \mu: \mathbf{I} \times \Re \times \mathbf{C} \rightarrow \{\YES,\NO\}.
715: \]
716: A $\mu$-hybrid classifier takes as input an instance $I \in \mathbf{I}$ for
717: classification and a number $x \in \Re$. As output, it produces the
718: classification produced by $\mu(I,x,\mathcal{C})$.
719: \end{definition}
720:
We will extend this later, but for the time being consider that each
722: set of cost and class distributions defines a value for $x$, which is used to
723: select the (predetermined) best classifier for those conditions. To build a
724: $\mu$-hybrid classifier, we must define $\mu$ and the set $\mathcal{C}$. We
725: would like $\mathcal{C}$ to include only those models that perform optimally
726: under some conditions (class and cost distributions), since these will be
727: stored by the system, and we would like $\mu$ to be general enough to apply to
728: a variety of problem formulations.
729:
730: The models comprising the {\sc rocch} can be combined to form a
731: $\mu$-hybrid classifier that is an elegant, robust classifier.
732:
733: \begin{definition}
734: The \textbf{{\sc \textbf{rocch}}-hybrid} is a $\mu$-hybrid classifier where
735: $\mathcal{C}$ is the set of classifiers that form the {\sc rocch} and $\mu$
736: makes classifications using the classifier on the {\sc rocch} with $FP=x$.
737: \end{definition}
738: Note that for the moment the {\sc rocch}-hybrid is defined only for $FP$
739: values corresponding to {\sc rocch} vertices.
740:
741: \subsection{Robust classification}
742:
743: Our definition of robust classifiers was intentionally vague about
744: what it means for one classifier to be better than another, because
745: different situations call for different comparison frameworks. We now
746: continue with minimizing expected cost, because the process of proving
747: that the {\sc rocch}-hybrid minimizes expected cost for any cost and
748: class distributions provides a deep understanding of why and how the
749: {\sc rocch}-hybrid works.
750: Later we generalize to a wide variety of
751: comparison frameworks.
752:
753: The \rocch-hybrid can be seen as an application of multi-criteria
754: optimization to classifier design and construction. The classifiers on the
755: \rocch\ are Edgeworth-Pareto optimal\footnote{Edgeworth-Pareto optimality is
756: the century-old notion that in a multidimensional space of criteria, optimal
757: performance is the frontier of achievable performance in this space. In
758: cases where performance is a linear combination of the criteria, the
759: optimality frontier is the convex hull.} \cite{Stadler-book} with respect to
760: TP, FP, and the objective functions we discuss. Multi-criteria optimization
761: was used previously in machine learning by Tcheng, Lambert, Lu and Rendell
762: \shortcite{TchengEtAl:89} for the selection of inductive bias.
763: Alternatively, the \rocch\ can be seen as an application of the theory of
764: games and statistical decisions, for which convex sets (and the convex hull)
765: represent optimal strategies \cite{BlackwellGirshick:54}.
766:
767: \subsubsection{Minimizing expected cost}
768:
769: From above, the expected cost of applying a classifier is:
770:
771: \begin{equation}
772: \label{eq:expected_cost}
773: ec(FP,TP) \; = \; p(\POS) \cdot (1-TP)\cdot c(\NO,\POS) \; +
774: \; p(\NEG) \cdot FP \cdot c(\YES,\NEG)
775: \end{equation}
776:
777: For a particular set of cost and class distributions, the
778: slope of the corresponding iso-performance lines is:
779:
780: \begin{equation}
781: \label{eq:slope}
782: m_{ec} = \frac{c(\YES,\NEG)p(\NEG)}{c(\NO,\POS)p(\POS)}
783: \end{equation}
784:
785: Every set of conditions will define an $m_{ec} \ge 0$. We now can
786: show that the {\sc rocch}-hybrid is robust for problems where the
787: ``best'' classifier is the classifier with the minimum expected cost.
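
To make the link between Equations~\ref{eq:expected_cost}
and~\ref{eq:slope} explicit, the following short derivation may help. It
is a sketch that assumes $p(\POS) \cdot c(\NO,\POS) > 0$; the symbol $b$
is introduced here only for illustration and denotes the y-intercept
$b = TP - m_{ec} \cdot FP$ of the line of slope $m_{ec}$ through
($FP$,$TP$).
\[
ec(FP,TP) \; = \; p(\POS) \cdot c(\NO,\POS) \cdot
\left( 1 - TP + m_{ec} \cdot FP \right)
\; = \; p(\POS) \cdot c(\NO,\POS) \cdot (1 - b)
\]
Two points therefore have equal expected cost exactly when they lie on a
common line of slope $m_{ec}$, and a larger intercept $b$ means a lower
expected cost.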
788:
789: The slope of the {\sc rocch} is an important tool in our argument. The {\sc
790: rocch} is a piecewise-linear, concave-down ``curve.'' Therefore, as $x$
791: increases, the slope of the {\sc rocch} is monotonically non-increasing with
792: $k-1$ discrete values, where $k$ is the number of {\sc rocch} component
793: classifiers, including the degenerate classifiers that define the {\sc rocch}
794: endpoints. Where there will be no confusion, we use phrases such as ``points
795: in ROC space'' as a shorthand for the more cumbersome ``classifiers
796: corresponding to points in ROC space.'' For this subsection, unless otherwise
797: noted, ``points on the
798: {\sc rocch}'' refer to vertices of the {\sc rocch}.
799:
800: \begin{definition}
801: \label{def:slope-of-rocch}
802: For any real number $m \ge 0$, the \textbf{point where the slope of the
803: \textsc{rocch}\ is $\mathbf{m}$} is one of the (arbitrarily chosen)
804: endpoints of the segment of the {\sc rocch} with slope $m$, if such a
805: segment exists. Otherwise, it is the vertex for which the left adjacent
806: segment has slope greater than $m$ and the right adjacent segment has slope
807: less than $m$.
808: \end{definition}
809:
810: For completeness, the leftmost endpoint of the {\sc rocch} is considered to be
811: attached to a segment with infinite slope and the rightmost endpoint of the
812: {\sc rocch} is considered to be attached to a segment with zero slope. Note
813: that every $m \ge 0$ defines at least one point on the {\sc rocch}.
814:
815: \begin{lemma}
816: For any set of cost and class distributions, there is a point on the \rocch\
817: with minimum expected cost.\\
818: \textbf{Proof:} (by contradiction) Assume that for some conditions
819: there exists a point \textbf{C} with smaller expected cost than any
820: point on the {\sc rocch}. By equations~\ref{eq:expected_cost} and
821: \ref{eq:slope}, a point ($FP_2$,$TP_2$) has the same expected cost
822: as a point ($FP_1$,$TP_1$) if \[ \frac{TP_2 - TP_1}{FP_2 - FP_1} =
823: m_{ec} \] Therefore, for conditions corresponding to $m_{ec}$, all
824: points with equal expected cost form an iso-performance line in ROC
825: space with slope $m_{ec}$. Also by~\ref{eq:expected_cost}
826: and~\ref{eq:slope}, points on lines with larger y-intercept have
827: lower expected cost. Now, point \textbf{C} is not on the {\sc
828: rocch}, so it is either above the curve or below the curve. If it
829: is above the curve, then the {\sc rocch} is not a convex set
830: enclosing all points, which is a contradiction. If it is below the
831: curve, then the iso-performance line through \textbf{C} also
832: contains a point \textbf{P} that is on the {\sc rocch} (not
833: necessarily a vertex). If this iso-performance line intersects no
834: {\sc rocch} vertex, then consider the vertices at the endpoints of
835: the {\sc rocch} segment containing \textbf{P}; one of these vertices
836: must intersect a better iso-performance line than does \textbf{C}.
837: In either case, since all points on an iso-performance line have the
838: same expected cost, point \textbf{C} does not have smaller expected
839: cost than all points on the {\sc rocch}, which is also a
840: contradiction. \EndProof
841: \end{lemma}
842:
843: Although it is not necessary for our purposes here, it can be shown
844: that \textit{all} of the minimum expected-cost classifiers are
845: \textit{on} the {\sc rocch}.
846:
847: \begin{definition}
848: \label{def:m_iso_perf_line}
849: An iso-performance line with slope $m$ is an \textbf{m-iso-performance
850: line}.
851: \end{definition}
852:
853: \begin{lemma}
854: For any cost and class distributions that translate to $m_{ec}$, a point on
855: the {\sc rocch} has minimum expected cost only if the slope
856: of the {\sc rocch} at that point is $m_{ec}$.\\
857: \textbf{Proof:} (by contradiction) Suppose that there is a point \textbf{D}
858: on the {\sc rocch} where the slope is \emph{not} $m_{ec}$, but the point
859: does have minimum expected cost. By Definition~\ref{def:slope-of-rocch},
860: either (a) the segment to the left of \textbf{D} has slope less than
861: $m_{ec}$, or (b) the segment to the right of \textbf{D} has slope greater
862: than $m_{ec}$. For case (a), consider point \textbf{N}, the vertex of the
863: {\sc rocch} that neighbors \textbf{D} to the left, and consider the
864: (parallel) $m_{ec}$-iso-performance lines $l_D$ and $l_N$ through \textbf{D}
865: and \textbf{N}. Because \textbf{N} is to the left of \textbf{D} and the
866: line connecting them has slope less than $m_{ec}$, the y-intercept of $l_N$
867: will be greater than the y-intercept of $l_D$. This means that \textbf{N}
868: will have lower expected cost than \textbf{D}, which is a contradiction.
869: The argument for (b) is analogous (symmetric). \EndProof
870: \end{lemma}
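
The intercept comparison in case (a) can be written out as follows. This
is a sketch; the symbols $s$, $b_N$ and $b_D$ are introduced here only for
illustration. Let $s < m_{ec}$ be the slope of the segment joining
\textbf{N} at ($FP_N$,$TP_N$) to \textbf{D} at ($FP_D$,$TP_D$), with
$FP_N < FP_D$, and let $b_N$ and $b_D$ be the y-intercepts of $l_N$ and
$l_D$. Since $TP_D - TP_N = s \cdot (FP_D - FP_N)$,
\[
b_N - b_D \; = \; (TP_N - m_{ec} \cdot FP_N) - (TP_D - m_{ec} \cdot FP_D)
\; = \; (m_{ec} - s) \cdot (FP_D - FP_N) \; > \; 0
\]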
871:
872: \begin{lemma}
873: If the slope of the {\sc rocch} at a point is $m_{ec}$, then the point has
874: minimum expected cost.\\
875: \textbf{Proof:} If this point is the only point where the slope of the {\sc
876: rocch} is $m_{ec}$, then the proof follows directly from Lemma 1 and
877: Lemma 2. If there are multiple such points, then by definition they are
878: connected by an $m_{ec}$-iso-performance line, so they have the same
879: expected cost, and once again the proof follows directly from Lemma 1 and
880: Lemma 2. \EndProof
881: \end{lemma}
882:
883: It is straightforward now to show that the {\sc rocch}-hybrid is robust for the
884: problem of minimizing expected cost.
885:
886: \begin{theorem}
887: The {\sc rocch}-hybrid minimizes expected cost for any cost distribution
888: and any class distribution.\\
889: \textbf{Proof:} Because the {\sc rocch}-hybrid is composed of the
890: classifiers corresponding to the points on the {\sc rocch}, this follows
891: directly from Lemmas 1, 2, and 3. \EndProof
892: \end{theorem}
893:
894: Now we have shown that the {\sc rocch}-hybrid is robust when the goal
895: is to provide the minimum expected-cost classification. This result
896: is important even for accuracy maximization, because the preferred
897: classifier may be different for different target class distributions.
898: This rarely is taken into account in experimental comparisons of
899: classifiers.
900:
901: \begin{corollary}
902: The {\sc rocch}-hybrid minimizes error rate (maximizes accuracy) for any
903: target class distribution.\\
904: \textbf{Proof:} Error rate minimization is cost minimization with uniform
905: error costs. \EndProof
906: \end{corollary}
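
To spell out the reduction (a sketch, with the common error cost
normalized to 1), setting $c(\NO,\POS) = c(\YES,\NEG) = 1$ in
Equation~\ref{eq:expected_cost} gives the error rate, and
Equation~\ref{eq:slope} reduces to the class ratio:
\[
ec(FP,TP) \; = \; p(\POS) \cdot (1-TP) \; + \; p(\NEG) \cdot FP , \qquad
m_{ec} = \frac{p(\NEG)}{p(\POS)}
\]
For instance, when negative examples are 5 times as prevalent as positive
ones, accuracy is maximized at the point where the slope of the
{\sc rocch} is 5.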
907:
908: \subsection{Robust classification for other common metrics}
909:
910: Showing that the \rocch-hybrid is robust not only helps us with understanding
911: the \rocch\ method generally, it also shows us how the \rocch-hybrid will pick
912: the best classifier in order to produce the best classifications, which we
913: will return to later. If we ignore the need to specify how to pick the best
914: component classifier, we can show that the \rocch\ applies more generally.
915:
916: \begin{theorem}
917: \label{theorem:general-rocch}
918: For any classifier evaluation metric $f(FP,TP)$, if\\
919: $\Partial{f}{TP}~\ge~0$ and $\Partial{f}{FP} \le 0$ then there exists a
920: point on the \rocch\ with an $f$-value at least
921: as high as that of any known classifier.\\
922: \textbf{Proof:} (by contradiction) Assume that there exists a classifier
923: $\mathcal{C}_o$, not on the \rocch, with an $f$-value higher than that of
924: any point on the \rocch. $\mathcal{C}_o$ is either (i) above or (ii) below
925: the \rocch. In case (i), the \rocch\ is not a convex set enclosing all the
926: points, which is a contradiction. In case (ii), let $\mathcal{C}_o$ be
927: represented in ROC-space by $(FP_o,TP_o)$. Because $\mathcal{C}_o$ is below
928: the \rocch\ there exist points, call one $(FP_p,TP_p)$, on the \rocch\ with
929: $TP_p > TP_o$ and $FP_p < FP_o$. However, by the restriction on the partial
930: derivatives, for any such point $f(FP_p,TP_p) \ge f(FP_o,TP_o)$, which again
931: is a contradiction. \EndProof
932: \end{theorem}
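
Many common metrics satisfy the condition on the partial derivatives. As
a sketch, accuracy and precision can be written with the class priors
used in Equation~\ref{eq:expected_cost}. The expressions below are the
standard definitions of these metrics, not formulas taken from the
preceding text, and the symbols $f_{acc}$, $f_{prec}$ and $D$ are
introduced here only for illustration.
\[
f_{acc} = p(\POS) \cdot TP + p(\NEG) \cdot (1 - FP) , \qquad
f_{prec} = \frac{p(\POS) \cdot TP}{D} , \qquad
D = p(\POS) \cdot TP + p(\NEG) \cdot FP
\]
For $D > 0$ the partial derivatives have the required signs:
\[
\Partial{f_{acc}}{TP} = p(\POS) , \qquad
\Partial{f_{acc}}{FP} = -p(\NEG)
\]
\[
\Partial{f_{prec}}{TP} = \frac{p(\POS) \, p(\NEG) \cdot FP}{D^2} \ge 0 , \qquad
\Partial{f_{prec}}{FP} = -\frac{p(\POS) \, p(\NEG) \cdot TP}{D^2} \le 0
\]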
933:
934: There are two complications to the more general use of the \rocch,
935: both of which are illustrated by the decision criterion from our very
936: first example. Recall that the Neyman-Pearson criterion specifies a
937: maximum acceptable $FP$ rate. Standard ROC analysis uses ROC curves
938: to select a single, parameterized classification model; the parameter
939: allows the user to select the ``operating point'' for a
940: decision-making task, usually a threshold on a probabilistic output
941: that will allow for optimal decision making. Under the Neyman-Pearson
942: criterion, selecting the single best model from a set is easy: plot
943: the ROC curves, draw a vertical line at the desired maximum $FP$, and
944: pick the model whose curve has the largest $TP$ at the intersection
945: with this line.
946:
947: \begin{figure}[tb]
948: \begin{center}
949: \epsfig{file=ROC-NP.eps,height=3.1in,width=3in}
950: \caption{The ROC Convex Hull used to select a classifier under the
951: Neyman-Pearson criterion}
952: \label{fig:ROC-NP}
953: \end{center}
954: \end{figure}
955:
956: With the \rocch-hybrid, making the best classifications under
957: the Neyman-Pearson criterion is not so straightforward.
958: For minimizing expected cost it was sufficient for the {\sc rocch}-hybrid to
959: choose a \textit{vertex} from the {\sc rocch} for any $m_{ec}$ value. For
960: problem formulations such as the Neyman-Pearson criterion, the performance
961: statistics at a non-vertex point on the {\sc rocch} may be preferable (see
962: figure~\ref{fig:ROC-NP}). Fortunately, with a slight extension, the {\sc
963: rocch}-hybrid can yield a classifier with these performance statistics.
964:
965: \begin{theorem}
966: \label{theorem:rocch-achieves-any-tradeoff} An {\sc rocch}-hybrid
967: can achieve the $TP$:$FP$ tradeoff represented by any point on the
968: {\sc rocch}, not just the vertices.\\ \textbf{Proof:} (by
969: construction) Extend $\mu(I,x,\mathcal{C})$ to non-vertex points as
970: follows. Pick the point $P$ on the {\sc rocch} with $FP=x$ (there
971: is exactly one). Let $TP_x$ be the $TP$ value of this point. If
972: ($x$, $TP_x$) is an {\sc rocch} vertex, use the corresponding
973: classifier. If it is not a vertex, call the left endpoint of the
974: hull segment on which $P$ lies $C_l$, and the right endpoint $C_r$.
975: Let $d$ be the distance between $C_l$ and $C_r$, and let $p$ be the
976: distance between $C_l$ and $P$. Make classifications as follows.
977: For each input instance flip a weighted coin and choose the answer
978: given by classifier $C_r$ with probability $\frac{p}{d}$ and that
979: given by classifier $C_l$ with probability $1-\frac{p}{d}$. It is
980: straightforward to show that $FP$ and $TP$ for this classifier will
981: be $x$ and $TP_x$. \EndProof
982: \end{theorem}
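
The last step of the proof can be sketched as follows. The symbol
$\lambda$ and the numerical vertices are hypothetical, chosen only for
illustration, and the coin is assumed to be independent of the instance.
Write $\lambda = \frac{p}{d}$, and let $C_l$ and $C_r$ lie at
($FP_l$,$TP_l$) and ($FP_r$,$TP_r$). A negative instance is labeled \YES\
with probability $FP_r$ when the coin selects $C_r$ and with probability
$FP_l$ otherwise, and similarly for positive instances, so the randomized
classifier has
\[
FP = (1-\lambda) \cdot FP_l + \lambda \cdot FP_r , \qquad
TP = (1-\lambda) \cdot TP_l + \lambda \cdot TP_r
\]
which is the point a fraction $\lambda$ of the way along the segment from
$C_l$ to $C_r$, that is, $P$. For example, with $C_l$ at $(0.1, 0.5)$,
$C_r$ at $(0.3, 0.8)$ and $x = 0.15$, we obtain
$\lambda = (0.15 - 0.1)/(0.3 - 0.1) = 0.25$ and
$TP_x = 0.75 \cdot 0.5 + 0.25 \cdot 0.8 = 0.575$.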
983:
984: The second complication is that, as illustrated by the Neyman-Pearson
985: criterion, many practical classifier comparison frameworks include
986: \textit{constrained} optimization problems (below we will discuss other
987: frameworks). Arbitrarily constrained optimizations are problematic for the
988: \rocch-hybrid. Given total freedom, it is possible to devise constraints on
989: classifier performance such that, even with the restriction on the partial
990: derivatives, an interior point scores higher than any \textit{acceptable}
991: point on the hull. For example, two linear constraints can enclose a subset
992: of the interior and exclude \textit{the entire} \rocch---there will be no
993: acceptable points on the \rocch. However, many realistic constraints do not
994: thwart the optimality of the \rocch-hybrid.
995:
996: \begin{theorem}
997: \label{theorem:general-rocch-hybrid}
998: For any classifier evaluation metric $f(FP,TP)$, if \\
999: $\Partial{f}{TP}\ge~0$ and $\Partial{f}{FP}\le~0$ and no constraint on
1000: classifier performance eliminates any point on the \rocch\ without also
1001: eliminating all higher-scoring interior points, then the \rocch-hybrid can
1002: perform at least as well as any known classifier.
1003: \\
1004: \textbf{Proof:} Follows directly from Theorem~\ref{theorem:general-rocch}
1005: and Theorem~\ref{theorem:rocch-achieves-any-tradeoff}. \EndProof
1006: \end{theorem}
1007:
Linear constraints on classifiers' $TP$:$FP$ performance are common
for real-world problems, so the following is useful.
1011:
1012: \begin{corollary}
1013: \label{corollary:linear-constraints}
1014: For any classifier evaluation metric $f(FP,TP)$, if\\
1015: $\Partial{f}{TP} \ge 0$ and $\Partial{f}{FP} \le 0$
1016: and there is a single constraint on classifier performance
1017: of the form $a \cdot TP + b \cdot FP \le c$, with $a$ and $b$
1018: non-negative,
1019: then
1020: the \rocch-hybrid can perform at least as well as any known
1021: classifier.
1022: \\
1023: \textbf{Proof:}
1024: The single constraint eliminates from contention all points (classifiers)
1025: that do not fall to the left of, or below, a line with non-positive
1026: slope. By the restriction on the partial derivatives, such a constraint
1027: will not eliminate a point on the \rocch\ without also eliminating
1028: all interior points with higher $f$-values.
1029: Thus, the proof follows directly from Theorem~\ref{theorem:general-rocch-hybrid}.
1030: \EndProof
1031: \end{corollary}
1032:
1033: So, finally, we have the following:
1034:
1035: \begin{corollary}
1036: \label{cor:rocch-maximizes-NP}
For the Neyman-Pearson criterion, the {\sc rocch}-hybrid can perform at
least as well as any known classifier.\\
\textbf{Proof:} For the Neyman-Pearson criterion, the evaluation metric is
$f(FP,TP)=TP$, that is, a higher $TP$ is better. The constraint on
classifier performance is $FP \le FP_{max}$, which has the form required by
Corollary~\ref{corollary:linear-constraints} with $a=0$, $b=1$ and
$c=FP_{max}$. The conditions of that corollary are therefore satisfied, and
this corollary follows. \EndProof
1045: \end{corollary}
1046:
1047: All the foregoing effort may seem misplaced for a simple
1048: criterion like Neyman-Pearson. However, there are
1049: many other realistic problem formulations.
1050: For example, consider
1051: the decision-support problem of optimizing \textit{workforce utilization}, in
1052: which a workforce is available that can process a fixed number of cases. Too few
1053: cases will under-utilize the workforce, but too many cases will leave some
1054: cases unattended (expanding the workforce usually is not a short-term
1055: solution). If the workforce can handle $K$ cases, the system should present
1056: the best possible set of $K$ cases. This is similar to the Neyman-Pearson
1057: criterion, but with an absolute cutoff ($K$) instead of a percentage cutoff
1058: ($FP$).
1059:
1060:
1061: \begin{theorem}
1062: \label{the:rocch_best}
1063: For workforce utilization, the {\sc rocch}-hybrid will provide the best set
1064: of $K$ cases, for any choice of $K$.\\
\textbf{Proof:} (by construction) The decision criterion is to maximize $TP$
subject to the constraint:
\[
TP \cdot P + FP \cdot N \le K
\]
where $P$ and $N$ are the numbers of positive and negative cases.
The theorem therefore follows from Corollary~\ref{corollary:linear-constraints}. \EndProof
1071: \end{theorem}
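
In ROC space the workforce constraint is a half-plane bounded by a line of
non-positive slope, which is why
Corollary~\ref{corollary:linear-constraints} applies with $a=P$, $b=N$ and
$c=K$. As a sketch, consider a hypothetical population with $P=100$
positive and $N=900$ negative cases and a capacity of $K=245$ (all three
values are chosen only for illustration):
\[
TP \; \le \; \frac{K}{P} - \frac{N}{P} \cdot FP
\; = \; 2.45 - 9 \cdot FP
\]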
1072:
1073: In fact, many screening problems, such as are found in marketing and
1074: information retrieval, use exactly this linear constraint. It follows that
1075: for maximizing lift \cite{BerryLinoff:97}, precision, or recall, subject to
1076: absolute or percentage cutoffs on case presentation, the {\sc rocch}-hybrid
1077: will provide the best set of cases.
1078:
1079: As with minimizing expected cost, imprecision in the environment
1080: forces us to favor a \textit{robust} solution for these other
1081: comparison frameworks. For many real-world problems, the precise
1082: desired cutoff will be unknown or will change (\eg because of
1083: fundamental uncertainty, variability in case difficulty, or competing
1084: responsibilities). What is worse, for a fixed (absolute) cutoff
1085: merely changing the size of the universe of cases (e.g., the size of
1086: a document corpus) may change the preferred classifier, because it
1087: will change the constraint line. The {\sc rocch}-hybrid provides a
1088: robust solution because it gives the optimal subset of cases for any
1089: constraint line. For example, for document retrieval the {\sc
1090: rocch}-hybrid will yield the best $N$ documents for any $N$, for any
1091: prior class distribution (in the target corpus), and for any target
1092: corpus size.
1093:
1094: \subsection{Ranking cases}
1095: \label{sect:ranking-cases}
1096:
1097: An apparent solution to the problem of robust classification is to use a model
1098: that ranks cases, and just work down the ranked list. This approach appears
1099: to sidestep the brittleness demonstrated with binary classifiers, since the
1100: choice of a cutoff point can be deferred to classification time. However,
1101: choosing the best ranking model is still problematic. For most practical
1102: situations, choosing the best ranking model is equivalent to choosing which
1103: classifier is best \emph{for the cutoff that will be used}.
1104:
1105: An example will illustrate this. Consider two ranking functions, $R_a$ and
1106: $R_b$, applied to a class-balanced set of 100 cases. Assume $R_a$ is able to
1107: recognize a common aspect unique to positive cases that occurs in 20\% of the
1108: population, and it ranks these highest. Assume $R_b$ is able to recognize a
1109: common aspect unique to negative cases occurring in 20\% of the population, and it
1110: ranks these lowest. So $R_a$ ranks the highest 20\% correctly and performs
1111: randomly on the remainder, while $R_b$ ranks the lowest 20\% correctly and
1112: performs randomly on the remainder. Which model is better? The answer
1113: depends entirely upon how far down the list the system will go before it
1114: stops; that is, upon what cutoff will be used. If fewer than 50 cases are to
1115: be selected then $R_a$ should be used, whereas $R_b$ is better if more than 50
1116: cases will be selected. Figure~\ref{fig:Ranking-models} shows the ROC curves
1117: corresponding to these two classifiers, and the point corresponding to $N=50$
1118: where the curves cross in ROC space.
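
The crossing point can be checked with a short calculation (a sketch,
using expected values for the randomly ranked portions). The set contains
50 positive and 50 negative cases. The curve for $R_a$ rises vertically
from $(0,0)$ to $(0,0.4)$, since its top 20 cases are all positive, and
then runs straight to $(1,1)$. The curve for $R_b$ runs straight from
$(0,0)$ to $(0.6,1)$, since its top 80 cases hold all 50 positives and 30
of the negatives, and then horizontally to $(1,1)$. On the sloped
segments,
\[
R_a: \; TP = 0.4 + 0.6 \cdot FP , \qquad
R_b: \; TP = \frac{FP}{0.6}
\]
which intersect at $FP = 0.375$, $TP = 0.625$. The number of cases
selected at that point is $50 \cdot 0.625 + 50 \cdot 0.375 = 50$.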
1119:
1120: \begin{figure}[tb]
1121: \begin{center}
\epsfig{file=Ranking-models.eps,height=3in}
1123: \caption{The ROC curves of the two ranking classifiers, $R_a$ and $R_b$,
1124: described in Section~\ref{sect:ranking-cases}.}
1125: \label{fig:Ranking-models}
1126: \end{center}
1127: \end{figure}
1128:
1129: The \rocch\ method can be used to organize such ranking models, as we have
1130: seen. Recall that ROC curves are formed from case rankings by moving the
1131: cutoff from one extreme to the other (Table~\ref{tab:ROC-alg} shows an
1132: algorithm for calculating the ROC curve from such rankings). The {\sc
1133: rocch}-hybrid comprises the ranking models that are best for all possible
1134: conditions.
1135:
1136: \subsection{Whole-curve metrics}
1137:
1138: In situations where either the target cost distribution or class distribution
1139: is \emph{completely} unknown, some researchers advocate choosing the
1140: classifier that maximizes a single-number metric representing the average
1141: performance over the entire curve. A common whole-curve metric is ``AUC'',
1142: the Area Under the (ROC) Curve \cite{Bradley:97}. The AUC is equivalent to
the probability that a randomly chosen positive instance will be rated higher
than a randomly chosen negative instance, and thereby is also estimated by the Wilcoxon test
1145: of ranks \cite{HanleyMcNeil:82}. A criticism of AUC is that for specific
1146: target conditions the classifier with the maximum AUC may be suboptimal
1147: \cite{ProvostFawcettKohavi:98}. Indeed, this criticism may be made of any
1148: single-number metric. Fortunately, not only is the \textsc{rocch}-hybrid
optimal for any specific target conditions, it also has the maximum
AUC: no classifier has an AUC larger than that of the {\sc rocch}-hybrid.
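
For a piecewise-linear curve the AUC is a sum of trapezoid areas. As a
sketch (the index notation is introduced here only for illustration), let
the vertices be ($FP_i$,$TP_i$), $i = 0, \ldots, n$, ordered by increasing
$FP$:
\[
AUC \; = \; \sum_{i=1}^{n} (FP_i - FP_{i-1}) \cdot \frac{TP_i + TP_{i-1}}{2}
\]
For example, the two ranking models of Section~\ref{sect:ranking-cases}
have expected ROC curves with vertices $(0,0)$, $(0,0.4)$, $(1,1)$ for
$R_a$ and $(0,0)$, $(0.6,1)$, $(1,1)$ for $R_b$, as derived from their
descriptions. Each has an AUC of $0.7$, whereas their convex hull, with
vertices $(0,0)$, $(0,0.4)$, $(0.6,1)$, $(1,1)$, has an AUC of
$0.42 + 0.4 = 0.82$.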
1151:
1152: \subsection{Using the ROCCH-hybrid}
1153:
1154: To use the \textsc{rocch}-hybrid for classification, we need to translate
1155: environmental conditions to $x$ values to plug into $\mu(I,x,\mathcal{C})$.
1156: For minimizing expected cost, Equation~\ref{eq:slope} shows how to translate
1157: conditions to $m_{ec}$. For any $m_{ec}$, by Lemma~3 we want the $FP$ value
1158: of the point where the slope of the {\sc rocch} is $m_{ec}$, which is
1159: straightforward to calculate. For the Neyman-Pearson criterion the conditions
1160: are defined as $FP$ values. For workforce utilization with conditions
1161: corresponding to a cutoff $K$, the $FP$ value is found by intersecting the line
1162: $TP \cdot P + FP \cdot N = K$ with the {\sc rocch}.
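
These translations can be sketched for a small hypothetical hull; the
vertices, counts and the symbols $FP_l$, $TP_l$ and $s$ below are chosen
only for illustration. Let the hull have vertices $(0,0)$, $(0.1,0.5)$,
$(0.3,0.8)$ and $(1,1)$, so that its segments have slopes $5$, $1.5$ and
$\frac{2}{7}$. For expected cost, conditions with $m_{ec}=1$ select the
vertex $(0.3,0.8)$, because $1.5 > 1 > \frac{2}{7}$, so $x=0.3$. For
workforce utilization, consider the hull segment with left endpoint
($FP_l$,$TP_l$) and slope $s$. Substituting
$TP = TP_l + s \cdot (FP - FP_l)$ into the constraint line gives
\[
x \; = \; \frac{K - P \cdot (TP_l - s \cdot FP_l)}{P \cdot s + N}
\]
which is valid when the resulting $x$ falls within that segment. With
$P=100$, $N=900$ and $K=245$, the segment starting at $(0.1,0.5)$ has
$s=1.5$, giving $x = (245 - 35)/1050 = 0.2$ and $TP = 0.65$.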
1163:
1164: We have argued that target conditions (misclassification costs and
1165: class distribution) are rarely known. It may be confusing that
1166: we now seem to require exact knowledge of these conditions. The
1167: \textsc{rocch}-hybrid gives us two important capabilities. First, the
1168: need for precise knowledge of target conditions is deferred until
1169: run time. Second, in the absence of precise knowledge even at
1170: run time, the system can be optimized easily with minimal feedback.
1171:
1172: By using the \textsc{rocch}-hybrid, information on target conditions is not
1173: needed to train and compare classifiers. This is important because of
1174: imprecision caused by temporal,
1175: geographic, or other differences that may exist between training and use.
1176: For example, building
1177: a system for a real-world problem introduces a non-trivial delay between the
1178: time data are gathered and the time the learned models will be used. The
1179: problem is exacerbated in domains where error costs or class distributions
1180: change over time; even with slow drift, a brittle model may become suboptimal
1181: quickly. In many such scenarios, costs and class distributions can be specified
1182: (or respecified) at run time with reasonable precision by sampling from the
1183: current population, and used to ensure that the {\sc rocch}-hybrid always
1184: performs optimally.
1185:
1186:
1187: In some cases, even at run time these quantities are not known
1188: exactly. A further benefit of the \textsc{rocch}-hybrid is that it
1189: can be tuned easily to yield optimal performance with only minimal
1190: feedback from the environment. Conceptually, the {\sc rocch}-hybrid
1191: has one ``knob'' that varies $x$ in $\mu(I,x,\mathcal{C})$ from one
1192: extreme to the other. For any knob setting, the {\sc rocch}-hybrid
1193: will give the optimal $TP$:$FP$ tradeoff for the target conditions
1194: corresponding to that setting. Turning the knob to the right
1195: increases $TP$; turning the knob to the left decreases $FP$. Because
1196: of the monotonicity of the \textsc{rocch}-hybrid, simple hill-climbing
1197: can guarantee optimal performance. For example, if the system
1198: produces too many false alarms, turn the knob to the left; if the
1199: system is presenting too few cases, turn the knob to the right.
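
The feedback loop just described might look like the following sketch. The callback, its three signal names, and the rule of halving the step after a reversal are assumptions made for illustration; the text specifies only the direction in which to turn the knob.

```python
def tune_knob(x, feedback, step=0.1, max_rounds=100):
    """Hill-climb the knob setting x in [0, 1] from coarse feedback.

    feedback(x) is a hypothetical callback that returns "too_many_alarms",
    "too_few_cases", or "ok" for the current setting.
    """
    last = None
    for _ in range(max_rounds):
        signal = feedback(x)
        if signal == "ok":
            break
        if last is not None and signal != last:
            step /= 2  # direction reversed: we overshot, so take smaller steps
        if signal == "too_many_alarms":
            x = max(0.0, x - step)  # turn the knob to the left
        else:
            x = min(1.0, x + step)  # turn the knob to the right
        last = signal
    return x


def example_feedback(x):
    # Stand-in environment: operators are satisfied for x in [0.32, 0.34].
    if x > 0.34:
        return "too_many_alarms"
    if x < 0.32:
        return "too_few_cases"
    return "ok"


setting = tune_knob(0.9, example_feedback)
```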
1200:
1201: \subsection{Beating the component classifiers}
1202: \label{sect:beating-the-components}
1203:
1204: Perhaps surprisingly, in many realistic situations an {\sc
1205: rocch}-hybrid system can do \emph{better} than any of its component
1206: classifiers. Consider the Neyman-Pearson decision criterion. The
1207: {\sc rocch} may intersect the $FP$-line \textit{above} the highest
1208: component ROC curve. This occurs when the $FP$-line intersects the
1209: {\sc rocch} between vertices; therefore, there is no component
1210: classifier that actually produces these particular ($FP$,$TP$)
1211: statistics, as in figure~\ref{fig:ROC-NP}. By
1212: Theorem~\ref{theorem:rocch-achieves-any-tradeoff}, the {\sc
1213: rocch}-hybrid can achieve any $TP$ on the hull. Only a small number
1214: of $FP$ values correspond to hull vertices.
1215: The same holds for other common problem formulations, such as workforce
1216: utilization, lift maximization, precision maximization, and recall
1217: maximization.
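
To make the interpolation concrete, consider a worked example with hypothetical numbers. Suppose two adjacent hull vertices are $A = (0.1, 0.6)$ and $B = (0.4, 0.9)$, written as $(FP, TP)$, and the Neyman-Pearson constraint is $FP = 0.2$. Assuming the hybrid realizes points on the edge by applying $B$ to a randomly chosen fraction $\lambda$ of the instances and $A$ to the rest, then
\[
\lambda = \frac{0.2 - 0.1}{0.4 - 0.1} = \frac{1}{3}, \qquad
TP = 0.6 + \lambda\,(0.9 - 0.6) = 0.7 .
\]
Classifier $A$ alone satisfies the constraint but reaches only $TP = 0.6$, and $B$ alone violates it, so in this example the interpolated point $(0.2, 0.7)$ is better than either vertex classifier under the constraint.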
1218:
1219: \subsection{Time and space efficiency}
1220:
1221: We have argued that the {\sc rocch}-hybrid is robust for a wide variety of
1222: problem formulations. It is also efficient to build, to store, and to update.
1223:
1224: The time efficiency of building the {\sc rocch}-hybrid depends first
1225: on the efficiency of building the component models, which varies
1226: widely by model type. Some models built by machine learning methods
1227: can be built in seconds (once data are available). Hand-built models
1228: can take years to build. However, we presume that this is work that
1229: would be done anyway. The {\sc rocch}-hybrid can be built with
1230: whatever methods are available, be there two or two thousand. As
1231: described below, as new classifiers become available, the {\sc
1232: rocch}-hybrid can be updated incrementally. The time efficiency
1233: depends also on the efficiency of the experimental evaluation of the
1234: classifiers. Once again, we presume that this is work that would be
1235: done anyway. Finally, the time efficiency of the {\sc rocch}-hybrid
1236: depends on the efficiency of building the {\sc rocch}, which can be
done in $O(n \log n)$ time using the QuickHull algorithm
\cite{quickhull:96}, where $n$ is the number of classifiers.
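
As a minimal sketch of hull construction, the code below computes the upper convex hull of a set of $(FP,TP)$ points. Note that it uses a sort followed by a monotone-chain scan (Andrew's algorithm) rather than QuickHull, simply because it is shorter to write; the sort dominates its running time. The function names and sample points are hypothetical.

```python
def _cross(o, a, b):
    """Positive when the path o -> a -> b turns counterclockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def rocch(points):
    """Upper convex hull of (FP, TP) points, running from (0, 0) to (1, 1)."""
    # The trivial classifiers (0, 0) and (1, 1) are always available.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last vertex while it lies on or below the chord to p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


# Hypothetical classifier points, for illustration only.
vertices = rocch([(0.1, 0.6), (0.3, 0.5), (0.4, 0.9), (0.6, 0.3)])
```

Points that fall below the hull, such as $(0.3, 0.5)$ and $(0.6, 0.3)$ here, are dropped during the scan.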
1239:
1240: The {\sc rocch} is space efficient, too, because it comprises only
1241: classifiers that might be optimal under some target conditions (which
1242: follows directly from Lemmas 1--3 and Definitions 3 and 4). The
1243: number of classifiers that must be stored can be reduced if bounds can
1244: be placed on the potential target conditions. As described above,
1245: ranges of conditions define segments of the {\sc rocch}. Thus, the
1246: {\sc rocch}-hybrid may need only a subset of $\mathcal{C}$.
1247:
Adding new classifiers to the {\sc rocch}-hybrid also is efficient. A new
classifier will either (i) extend the hull, adding to (and possibly
subtracting from) the {\sc rocch}-hybrid, or (ii) prove not to be superior
to the existing classifiers in any portion of ROC space, in which case it
can be discarded.
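
The two cases can be sketched as follows, assuming the hull is kept as a list of $(FP,TP)$ vertices sorted by increasing $FP$. The function names and sample points are hypothetical, and the sketch treats each classifier as a single ROC point.

```python
def _cross(o, a, b):
    """Positive when the path o -> a -> b turns counterclockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_tp(hull, fp):
    """TP of the hull at a given FP, by linear interpolation along its edges."""
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x0 <= fp <= x1:
            if x1 == x0:
                return max(y0, y1)
            return y0 + (y1 - y0) * (fp - x0) / (x1 - x0)
    raise ValueError("fp must lie in [0, 1]")


def add_classifier(hull, point):
    """Return (hull, accepted) after offering one new (FP, TP) point."""
    if point[1] <= hull_tp(hull, point[0]):
        return hull, False  # case (ii): nowhere superior, so discard it
    # Case (i): insert the point, then drop vertices it makes redundant.
    new_hull = []
    for p in sorted(hull + [point]):
        while len(new_hull) >= 2 and _cross(new_hull[-2], new_hull[-1], p) >= 0:
            new_hull.pop()
        new_hull.append(p)
    return new_hull, True


hull = [(0.0, 0.0), (0.1, 0.6), (0.4, 0.9), (1.0, 1.0)]
hull_a, kept_a = add_classifier(hull, (0.3, 0.7))   # lies below the hull
hull_b, kept_b = add_classifier(hull, (0.2, 0.95))  # lies above the hull
```

In the second call the new point both joins the hull and displaces the old vertex $(0.4, 0.9)$, which illustrates the ``possibly subtracting from'' remark.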
1253:
1254: The run-time (classification) complexity of the {\sc rocch}-hybrid is never
1255: worse than that of the component classifiers. In situations where run-time
1256: complexity is crucial, the {\sc rocch} should be constructed without
1257: prohibitively expensive classification models. It then will find the best
1258: subset of the computationally efficient models.
1259:
1260: \section{Empirical demonstration of need}
1261:
1262: Robust classification is of fundamental interest because it
1263: weakens two very strong assumptions: the
1264: availability of precise knowledge of costs and
1265: of class distributions.
1266: However, might it not be that existing classifiers already are robust?
1267: For example, if a given classifier is optimal under one set of
1268: conditions, might it not be optimal under all?
1269:
1270: It is beyond the scope of this paper to offer an in-depth experimental study
1271: answering this question. However, we can provide solid evidence that the
1272: answer is ``no'' by referring to the results of two prior studies. One is a
1273: comprehensive ROC analysis of medical domains recently conducted by Andrew
1274: Bradley \citeyear{Bradley:97}.\footnote{Bradley's purpose was not to answer
1275: this question; fortunately, his published results do anyway.} The other is a
1276: published ROC analysis of UCI database domains that we undertook last year
1277: with Ron Kohavi \cite{ProvostFawcettKohavi:98}.
1278:
1279: Note that a classifier \textit{dominates} if its ROC curve completely
1280: defines the {\sc rocch} (which means dominating classifiers are robust
1281: and vice versa). Therefore, if there exist more than a trivially few
1282: domains where no single classifier dominates, then techniques like the {\sc
1283: rocch}-hybrid are essential if robust classifiers are desired.
1284:
1285:
1286: \subsection{Bradley's study}
1287:
1288: Bradley studied six
1289: medical data sets, noting that ``unfortunately, we rarely know what the
1290: individual misclassification costs are.'' He plotted the ROC curves of six
1291: classifier learning algorithms (two neural nets, two decision trees and two
1292: statistical techniques).
1293:
1294:
1295: \begin{figure}[tb]
1296: \begin{center}
1297: \epsfig{file=Bradley-HB.eps,height=3in,width=3in}
1298: \caption{Bradley's classifier results for the heart bleeding data.}
1299: \label{fig:Bradley-HB}
1300: \end{center}
1301: \end{figure}
1302:
1303: On \textit{not one} of these data sets was there a dominating
1304: classifier. This means that for each domain, there exist different
1305: sets of conditions for which different classifiers are preferable. In
1306: fact, the running example in the present article is based on the three
1307: best classifiers from Bradley's results on the heart bleeding data;
1308: his results for the full set of six classifiers can be found in
1309: figure~\ref{fig:Bradley-HB}. Classifiers constructed for the
1310: Cleveland heart disease data are shown in
1311: figure~\ref{fig:Bradley-Cleveland}.
1312:
1313: Bradley's results show clearly that for many domains the classifier that
1314: maximizes any single metric---be it accuracy, cost, or the area under the ROC
1315: curve---will be the best for some cost and class distributions and will not be
1316: the best for others. We have shown that the {\sc
1317: rocch}-hybrid will be the best for all.
1318:
1319: \begin{figure}[tb]
1320: \begin{center}
1321: \epsfig{file=Bradley-Cleveland.eps,height=3in,width=3in}
1322: \caption{Bradley's classifier results for the Cleveland heart disease data}
1323: \label{fig:Bradley-Cleveland}
1324: \end{center}
1325: \end{figure}
1326:
1327: \subsection{Our study}
1328: \label{sect:our-study}
1329:
In the study we performed with Ron Kohavi, we chose ten data sets from the UCI
1331: repository, each of which contains at least 250 instances, but for which the
1332: accuracy for decision trees was less than 95\%. For each domain, we induced
1333: classifiers for the minority class (for Road, we chose the class Grass). We
1334: selected several induction algorithms from \mlc\ \cite{mlc-new-intro-j}: a
1335: decision tree learner (MC4), Naive Bayes with discretization (NB), $k$-nearest
1336: neighbor for several $k$ values (IB$k$), and Bagged-MC4
1337: \cite{breiman-bagging}. MC4 is similar to C4.5 \cite{quinlan-c45};
1338: probabilistic predictions are made by using a Laplace correction at the
1339: leaves. NB discretizes the data based on entropy minimization
1340: \cite{dougherty-kohavi-sahami-disc} and then builds the Naive-Bayes model
\cite{domingos-pazzani-simple-bayes}. IB$k$ takes a vote among the $k$
closest neighbors; each neighbor votes with a weight equal to one over its
distance from the test instance.
1344:
Some of the ROC curves are shown in figure~\ref{fig:UCI-ROCs}. For \emph{only
1346: one} of these ten domains (Vehicle) was there an absolute dominator. In
1347: general, very few of the 100 runs performed (on 10 data sets, using 10
1348: cross-validation folds each) had dominating classifiers. Some cases are very
1349: close, for example Adult and Waveform-21. In other cases a curve that
1350: dominates in one area of ROC space is dominated in another. These results
1351: also support the need for methods like the \rocch -hybrid, which produce
1352: robust classifiers.
1353:
1354: \begin{figure}[tb]
1355: \centerline{%
1356: \begin{tabular}{c@{\hspace{3pc}}c}
1357: \epsfig{file=vehicle.eps,height=2.7in,width=2.7in} &
1358: \epsfig{file=crx.eps, height=2.7in,width=2.7in}\\
1359: a.~~Vehicle &
1360: b.~~CRX \\
1361: \\
1362: \epsfig{file=roadGrass.eps,height=2.7in,width=2.7in} &
1363: \epsfig{file=satimage.eps, height=2.7in,width=2.7in}\\
1364: c.~~RoadGrass &
1365: d.~~Satimage
1366: \end{tabular}
1367: }
1368: \caption{Smoothed ROC curves from UCI database domains}
1369: \label{fig:UCI-ROCs}
1370: \end{figure}
1371:
1372: \begin{table}[tb]
1373: \caption{Locally dominating classifiers for four UCI domains}
1374: \label{tab:convex-hulls}
1375: \normalsize
1376: \begin{tabular*}{3.5in}{lll}
1377: \textbf{Domain} & \textbf{Slope range} & \textbf{Dominator} \\ \hline
1378: Vehicle & [0, $\infty$) & Bagged-MC4\\ \hline
1379: Road (Grass) & [0, 0.38] & NB\\
1380: & [0.38, $\infty$) & Bagged-MC4\\ \hline
1381: CRX & [0, 0.03] & Bagged-MC4\\
1382: & [0.03, 0.06] & NB\\
1383: & [0.06, 2.06] & Bagged-MC4\\
1384: & [2.06, $\infty$) & NB\\ \hline
1385: Satimage & [0, 0.05] & NB \\
1386: & [0.05, 0.22] & Bagged-MC4 \\
1387: & [0.22, 2.60] & IB5 \\
1388: & [2.60, 3.11] & IB3 \\
1389: & [3.11, 7.54] & IB5 \\
1390: & [7.54, 31.14] & IB3 \\
1391: & [31.14, $\infty$) & Bagged-MC4 \\ \hline
1392: \end{tabular*}
1393: \end{table}
1394:
1395: As examples of what expected-cost-minimizing \textsc{rocch}-hybrids would look
1396: like internally, Table~\ref{tab:convex-hulls} shows the component classifiers
1397: that make up the \rocch\ for the four UCI domains of
1398: figure~\ref{fig:UCI-ROCs}. For example, in the Road domain (see
1399: figure~\ref{fig:UCI-ROCs} and Table~\ref{tab:convex-hulls}), Naive Bayes would
1400: be chosen for any target conditions corresponding to a slope less than $0.38$,
1401: and Bagged-MC4 would be chosen for slopes greater than $0.38$. They perform
1402: equally well at $0.38$.
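
A lookup of this kind can be sketched as follows. The slope ranges and dominators are copied from the Road (Grass) and CRX rows of Table~\ref{tab:convex-hulls}. The slope formula is an assumption here, taken to be the usual iso-performance slope (false positive cost times the prior of the negative class, divided by false negative cost times the prior of the positive class), and the function names are hypothetical.

```python
import bisect

# (upper end of slope range, dominator), copied from the table.
HULLS = {
    "Road (Grass)": [(0.38, "NB"), (float("inf"), "Bagged-MC4")],
    "CRX": [(0.03, "Bagged-MC4"), (0.06, "NB"),
            (2.06, "Bagged-MC4"), (float("inf"), "NB")],
}


def iso_performance_slope(cost_fp, cost_fn, p_pos):
    """Assumed slope definition: (p(neg) * cost_fp) / (p(pos) * cost_fn)."""
    return ((1.0 - p_pos) * cost_fp) / (p_pos * cost_fn)


def dominator(domain, slope):
    """Classifier listed for the slope range that contains `slope`."""
    uppers = [upper for upper, _ in HULLS[domain]]
    # At a boundary both neighbors perform equally; this picks the lower range.
    return HULLS[domain][bisect.bisect_left(uppers, slope)][1]


# Hypothetical target conditions: equal error costs, 20% positives.
slope = iso_performance_slope(cost_fp=1.0, cost_fn=1.0, p_pos=0.2)
choice = dominator("Road (Grass)", slope)
```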
1403:
1404: \section{Limitations and future work}
1405:
There are limitations to the {\sc rocch} method as we have presented it.
We have defined it only for two-class problems. Srinivasan
\citeyear{Srinivasan:99} shows that it can be extended to multiple dimensions.
However, the dimensionality of the ``ROC-hyperspace'' grows
quadratically in the number of classes, so both efficiency and visualization
capability are called into question.
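
One way to see the quadratic growth, assuming one axis per type of error: an $n$-class confusion matrix has $n^2$ cells, of which $n$ are correct classifications, leaving
\[
n^2 - n = n(n-1)
\]
off-diagonal cells. This gives $2$ for the two-class case (matching the two axes of ROC space, since the $TP$ rate is determined by the false negative rate), $6$ for three classes, and $90$ for ten classes.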
1412:
1413: We have assumed constant error costs for a given \textit{type} of
1414: error, e.g., all false positives cost the same. For some problems,
1415: different errors of the same type have different costs. In many
1416: cases, such a problem can be transformed for evaluation into an
1417: equivalent problem with uniform intra-type error costs by duplicating
1418: instances in proportion to their costs (or by simply modifying the
1419: counting procedure accordingly).
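
The modified counting procedure might be sketched as follows. The encoding of each example as an (actual, predicted, weight) triple is an assumption made for illustration, with the weight standing for the relative cost of misclassifying that instance.

```python
def weighted_rates(examples):
    """Cost-weighted (FP rate, TP rate).

    examples: iterable of (actual, predicted, weight) with boolean labels.
    Adding `weight` instead of 1 to each count has the same effect as
    duplicating instances in proportion to their costs.
    """
    pos = neg = tp = fp = 0.0
    for actual, predicted, weight in examples:
        if actual:
            pos += weight
            if predicted:
                tp += weight
        else:
            neg += weight
            if predicted:
                fp += weight
    return fp / neg, tp / pos


# Hypothetical data: the missed positive is three times as costly as the hit.
data = [(True, True, 1), (True, False, 3), (False, True, 2), (False, False, 2)]
fp_rate, tp_rate = weighted_rates(data)
```

Unweighted, this classifier would have $TP = 0.5$; with the weights above it drops to $0.25$ because the missed positive carries more cost.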
1420:
1421: We also have assumed for this paper that the estimates of the classifiers'
1422: performance statistics ($FP$ and $TP$) are very good. As mentioned above, much
1423: work has addressed the production of good estimates for simple performance
1424: statistics such as error rate. Much less work has addressed the production of
1425: good ROC curve estimates. As with simpler statistics, care should be taken to
1426: avoid over-fitting the training data and to ensure that differences between ROC
1427: curves are meaningful. One solution is to use cross-validation with averaging
1428: of ROC curves \cite{ProvostFawcettKohavi:98}, which is the procedure used to
produce the ROC curves in Section~\ref{sect:our-study}. To our knowledge,
how best to produce confidence bands appropriate to a particular problem
remains an open issue. Those shown in Section~\ref{sect:our-study} are
1432: appropriate for the Neyman-Pearson decision criterion (i.e., they show
1433: confidence on $TP$ for various values of $FP$).
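
One simple way to average ROC curves across cross-validation folds is to average $TP$ at a fixed grid of $FP$ values, which matches the Neyman-Pearson reading above. The sketch below illustrates that idea only; it is not claimed to reproduce the exact procedure of the cited study, and the function names, grid size, and fold curves are hypothetical.

```python
def tp_at(curve, fp):
    """TP of a piecewise-linear ROC curve at a given FP."""
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= fp <= x1:
            if x1 == x0:
                return max(y0, y1)
            return y0 + (y1 - y0) * (fp - x0) / (x1 - x0)
    raise ValueError("fp must lie in [0, 1]")


def average_roc(curves, n_samples=101):
    """Average several ROC curves vertically on an evenly spaced FP grid."""
    grid = [i / (n_samples - 1) for i in range(n_samples)]
    return [(fp, sum(tp_at(c, fp) for c in curves) / len(curves))
            for fp in grid]


# Two hypothetical per-fold curves, each running from (0, 0) to (1, 1).
fold_curves = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.5, 1.0), (1.0, 1.0)],
]
averaged = average_roc(fold_curves, n_samples=3)
```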
1434:
1435: Also, we have addressed predictive performance and computational
1436: performance. These are not the only concerns in choosing a
1437: classification model. What if comprehensibility is important? The
1438: easy answer is that for any particular setting, the {\sc rocch}-hybrid
1439: is as comprehensible as the underlying model it is using. However,
1440: this answer falls short if the {\sc rocch}-hybrid is interpolating
1441: between two models or if one wants to understand the
1442: ``multiple-model'' system as a whole.
1443:
Although ROC analysis and the {\sc rocch} method were specifically designed for
1445: classification domains, we have extended them to \emph{activity monitoring}
1446: domains \cite{FawcettProvost:99}. Such domains involve monitoring the
1447: behavior of a population of entities for interesting events requiring action.
1448: These problems are substantially different from standard classification because
1449: timeliness of classification is important and dependencies exist among
1450: instances; both factors complicate evaluation.
1451:
1452: This work is fundamentally different from other recent machine
1453: learning work on combining multiple models \cite{AliPazzani:96}. That work
1454: combines models in order to boost performance for a fixed cost and class
1455: distribution. The {\sc rocch}-hybrid combines models for robustness across
1456: different cost and class distributions. In principle, these methods should be
1457: independent---multiple-model classifiers are candidates for extending the {\sc
1458: rocch}. However, it may be that some multiple-model classifiers achieve
1459: increased performance for a specific set of conditions by (in effect)
1460: interpolating along edges of the {\sc rocch}.
Cherikh \cite{Cherikh-thesis} uses ROC analysis to study decision making
where the decisions of multiple models are present. Unlike in our work, the
goal there is to find optimal combinations of models for specific conditions.
However, it seems that the two methods may be combined profitably:
well-chosen combinations of models should extend the {\sc rocch}, yielding a
better robust classifier.
1468:
1469: The \rocch\ method also complements research on cost-sensitive learning
1470: \cite{Turney-cost-bib}. Existing cost-sensitive learning methods are brittle
1471: with respect to imprecise cost knowledge. Thus, the \rocch\ is an essential
1472: evaluation tool. Furthermore, cost-sensitive learning may be used to find
1473: better components for the \rocch-hybrid, by searching explicitly for
1474: classifiers that extend the \rocch.
1475:
1476:
1477:
1478:
1479: \section{Conclusion}
1480:
1481: The ROC convex hull method is a robust, efficient solution to the
1482: problem of comparing multiple classifiers in imprecise and changing
1483: environments. It is intuitive, can compare classifiers both in general
1484: and under specific distribution assumptions, and provides crisp
1485: visualizations. It minimizes the management of classifier performance
1486: data, by selecting exactly those classifiers that are potentially
1487: optimal; thus, only these need to be saved in preparation for
1488: changing conditions. Moreover, due to its incremental nature, new
1489: classifiers can be incorporated easily, \eg when trying a new parameter
1490: setting.
1491:
1492: The {\sc rocch}-hybrid performs optimally under any target conditions
1493: for many realistic problem formulations, including the optimization of
1494: metrics such as accuracy, expected cost, lift, precision, recall, and
1495: workforce utilization. It is efficient to build in terms of time and
1496: space, and can be updated incrementally. Furthermore, it can
1497: sometimes classify better than any (other) known model. Therefore, we
1498: conclude that it is an elegant, robust classification system.
1499:
1500: We believe that this work has important implications for both machine learning
1501: applications and machine learning research \cite{ProvostFawcettKohavi:98}. For
1502: applications, it helps free system designers from the need to choose (sometimes
1503: arbitrary) comparison metrics before precise knowledge of key evaluation
1504: parameters is available. Indeed, such knowledge may never be available, yet
1505: robust systems still can be built.
1506:
1507: For machine learning research, it frees researchers from the need to
1508: have precise class and cost distribution information in order to study
1509: important related phenomena. In particular, work on cost-sensitive
1510: learning has been impeded by the difficulty of specifying costs, and
1511: by the tenuous nature of conclusions based on a single cost metric.
1512: Researchers need not be held back by either. Cost-sensitive learning
1513: can be studied generally without specifying costs precisely. The same
1514: goes for research on learning with highly skewed distributions. Which
1515: methods are effective for which levels of distribution skew? The
1516: \rocch\ will provide a detailed answer.
1517:
1518: Recently, Drummond and Holte \cite{drummondholtekdd:00} have
1519: demonstrated an intriguing dual to the \rocch. Their ``cost curves''
1520: represent expected costs explicitly, rather than as slopes of
1521: iso-performance lines, and thereby provide an insightful alternative
1522: perspective for visualization.
1523:
1524: Note: An implementation of the \rocch\ method in Perl is publicly available.
1525: The code and related papers may be found at:
1526: \url{http://www.hpl.hp.com/personal/Tom_Fawcett/ROCCH/}.
1527:
1528: \section{Acknowledgments}
1529:
1530: Much of this work was done while the authors were employed at the Bell
1531: Atlantic Science and Technology Center. We thank the many with whom we have
1532: discussed ROC analysis and classifier comparison, especially Rob Holte, George
1533: John, Ron Kohavi, Ron Rymon, and Peter Turney. We thank Andrew Bradley for
1534: supplying data from his analysis.
1535:
1536: \bibliographystyle{theapa}
1537: \bibliography{final}
1538: \end{document}
1539: