1: 
2: %- File    : ms.tex
3: %- ----------------
4: %- Created : Sun Dec 11 22:32:32 2005
5: %- Authors : SNfactory
6: %- 
7: 
8: %- Start of Preamble.
9: 
10:    \documentclass[12pt,preprint]{aastex}
11: %  \documentclass[manuscript,letterpaper]{aastex}
12: %  \documentclass[preprint2]{aastex}
13: %  \documentclass[preprint2,longabstract]{aastex}
14: 
15: \usepackage{bm}
16: 
17: \begin{document}
18: 
19: \title{%
20:     How to Find More Supernovae with Less Work:
21:     Object Classification Techniques for Difference Imaging
22: }
23: 
24: \author{%
25:    S.~Bailey,\altaffilmark{1,4}
26:    C.~Aragon,\altaffilmark{1}
27:    R.~Romano,\altaffilmark{1,2}
28:    R.~C.~Thomas,\altaffilmark{1}
29:    B.~A.~Weaver,\altaffilmark{1,3}
30:    D.~Wong\altaffilmark{1}
31: }
32: 
33: \altaffiltext{1}{Lawrence Berkeley National
34: Laboratory, 1 Cyclotron Road, Berkeley, CA 94720}
35: \altaffiltext{2}{Luis W. Alvarez Fellow, National Energy Research Scientific Computing Center, 1 Cyclotron Road, Berkeley, CA 94720} 
36: \altaffiltext{3}{University of California, Space Sciences Laboratory,
37: Berkeley, CA 94720}
38: \altaffiltext{4}{Corresponding author: sjbailey@lbl.gov}
39: 
40: 
41: \begin{abstract}
42: We present the results of applying new object classification techniques
43: to difference images in the context of the Nearby Supernova Factory
44: supernova search.
45: Most current supernova searches subtract reference images from new images,
46: identify objects in these difference images, and apply simple threshold cuts on
47: parameters such as statistical significance, shape, and motion
48: to reject objects such as cosmic rays, asteroids, and subtraction
49: artifacts.  
50: Although most static objects subtract cleanly, even a very low false
51: positive detection rate can lead to hundreds of
52: non-supernova candidates which
53: must be vetted by human inspection before triggering additional followup.
54: In comparison to simple threshold cuts, more sophisticated methods such as
55: Boosted Decision Trees, Random Forests, and Support Vector Machines
56: provide dramatically better object discrimination.
57: At the Nearby Supernova Factory, we reduced the number of non-supernova
58: candidates by a factor of 10 while increasing our supernova identification
59: efficiency.  Methods such as these will be crucial for maintaining
60: a reasonable false positive rate in the automated transient alert
61: pipelines of upcoming projects such as PanSTARRS and LSST.
62: \end{abstract}
63: 
64: \keywords
65: {
66: methods: data analysis ---
67: methods: statistical ---
68: supernovae: general ---
69: techniques: image processing
70: }
71: 
72: \section{Introduction}
73: 
74: Future large-scale survey projects such as
75: PanSTARRS\footnote{http://pan-starrs.ifa.hawaii.edu}
76: and LSST\footnote{http://lsst.org} are expected
77: to generate automated, rapid-turnaround transient alerts for objects
78: such as supernovae, active galactic nuclei, asteroids, Kuiper belt objects,
79: and variable stars.
80: They will do this by comparing new images to coadded stacks of reference
81: images taken previously.  Repeat observations of the same field will
82: occur over timescales of minutes, hours, days, months, and years.
83: Robust rejection of spurious non-astrophysical objects
84: will be crucial to avoid excessive false positive alerts.
85: 
86: A major difficulty of current optical transient programs is the huge
87: number of false positive objects which are difficult to reject while
88: maintaining high selection
89: efficiency for the real objects of interest.  For example, the 2005 Sloan
90: Digital Sky Survey II (SDSS-II) supernova program \citep{sdss:becker}
91: required objects to be detected within 0.6 arcsec in at least two filters
92: with signal-to-noise greater than 3, yet this generated $\sim$4,000 objects
93: per night which needed to be visually checked by humans for verification.
94: Their 2006 search drastically reduced this scanning load by requiring that
95: all but the brightest objects be identified at the same location
96: on multiple nights before they are passed to a human for verification.
97: Although this reduced their scanning load, this method is not
98: applicable to the real-time transient alert pipelines of PanSTARRS and
99: LSST.  A ``60-second transient alert'' would be meaningless if it really
100: meant ``$N$ days plus 60 seconds after the first positive identification.''
101: Although PanSTARRS and LSST will have multiple exposures of a field
102: in the same night, this is equivalent to the multiple-filter requirement
103: of the SDSS 2005 program which was still swamped by false positives.
104: 
105: This problem of false positives
106: is not unique to nearby transient searches; it arises whenever a large
107: number of objects are imaged, either from a wide-field survey or a
108: deep narrow survey.  The ESSENCE \citep{essence}
109: and SNLS \citep{snls} Canadian pipeline supernova
110: searches both result in 100--200 objects to
111: scan per night of data.\footnote{W.~M.~Wood-Vasey (ESSENCE) and
112: D.~Balam (SNLS), private communication.}
113: Although this is a manageable load for a current experiment,
114: it would not scale to future surveys which will image thousands
115: of square degrees per night.\footnote{For comparison,
116:     the SNLS supernova survey covers $\sim1$ square degree per night,
117:     SDSS covers 150 square degrees per night,
118:     and SNfactory covers 350 to 850 square degrees per night.}
119: If current methods were used, the projects would need to drastically
120: reduce their signal efficiency in order to maintain a manageable
121: false-positive rate.
122: The SNLS French pipeline uses multi-night data and an artificial
123: neural net to select candidates for verification, but as noted above,
124: using multi-night information is not applicable to rapid turnaround
125: transient alert pipelines which intend to produce alerts within
126: a minute of the first positive detection.
127: 
128: False positives arise from a variety of sources including diffraction spikes,
129: saturated stars, optical ghosts, star halos, cosmic rays, satellite trails,
130: CCD amplifier glow,
131: other CCD artifacts, and image processing artifacts.  In principle all of these
132: effects are best identified and either fixed or masked at the image level.
133: In practice there will always be effects which produce spurious detections.
134: This problem is especially bad at the start of a search when covering new
135: areas of sky, before consistent problems can be identified and masked.
136: The goal of a classifier is to identify the real
137: candidates of interest (signal events) while rejecting the spurious objects
138: (background events).
139: 
140: In some cases, real astrophysical variable objects are the background events
141: for other analyses.  For example, asteroids, variable stars,
142: and active galactic nuclei form a background for nearby
143: supernova searches, yet they are the core science for other programs.
144: This paper is written from the context of a nearby supernova search
145: and thus these other
146: astrophysical events are treated as background to reject,
147: but the methods presented here are generally applicable to
148: many object classification problems.
149: 
150: This paper presents the results of applying modern machine learning
151: techniques to the supernova search pipeline of the
152: Nearby Supernova Factory \citep{snfactory}.
153: \S \ref{sec:methods} presents a variety of machine learning techniques.
154: \S \ref{sec:SNfactory} describes the Nearby Supernova Factory search,
155: and \S\S~\ref{sec:training} and \ref{sec:software} present
156: the training data and classification software used.
157: \S \ref{sec:comparison} compares the various methods.
158: We find that methods such as
159: Boosted Trees, Random Forests, and Support Vector Machines perform
160: dramatically better than the threshold cuts which are typically used
161: by supernova search programs.
162: 
163: \section{Classification Methods}
164: \label{sec:methods}
165: 
166: Classification methods identify signal {\it vs.}~background
167: events based upon a set of features (also called variables, attributes,
168: or scores)
169: which describe the events.  For example, objects in photometric images
170: can be described by their magnitude, signal-to-noise, and shape parameters
171: such as width and ellipticity.  These features can be used to distinguish
172: stars from galaxies, cosmic rays, or imaging artifacts.
173: 
174: The optimum separation of two classes of events is application dependent,
175: depending upon the desired tradeoff between purity (the fraction of
176: selected events which are real signal), completeness (the fraction of
177: real signal events which are selected), and the total sample size selected.
178: For example, a measurement
179: which depends upon a statistical fit to both signal and background events
180: might optimize the signal-to-noise ratio $\sim S/\sqrt{S+B}$
181: where $S$ and $B$ are the number
182: of real signal and background events which the classifier selects for the fit.
183: A supernova search algorithm, on the other hand,
184: might maximize the purity with the constraint that the completeness
185: remain above 90\%.
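
As an illustrative calculation (the event counts here are hypothetical and
are not taken from the SNfactory data), suppose a sample contains 100 real
signal events and a classifier selects $S = 90$ of them together with
$B = 30$ background events.  Then
\begin{eqnarray}
{\rm purity}         & = & {S \over S+B} = {90 \over 120} = 0.75 \\
{\rm completeness}   & = & {90 \over 100} = 0.90 \\
{S \over \sqrt{S+B}} & = & {90 \over \sqrt{120}} \approx 8.2
\end{eqnarray}
A tighter selection would typically raise the purity at the cost of
completeness; which operating point is preferable depends upon the figure
of merit chosen for the application.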
186: 
187: Some methods, such as threshold cuts (\S \ref{sec:cuts}),
188: produce a boolean signal/background
189: decision and the cuts themselves must be adjusted to optimize the separation.
190: Other methods have an automated training procedure and
191: produce a single statistic which rates how signal-like or
192: background-like a new event is.  The user may then cut on that statistic
193: to optimize the desired figure of merit.
194: 
195: Classifier parameters are tuned using a training dataset of known
196: signal and background events to optimize
197: the separation power.  Since the results are influenced by the particular
198: statistical fluctuations of the training dataset, the separation power
199: on the training data itself cannot be used as
200: a fair measure of the power of a classifier.  Instead, a separate validation
201: set is used to assess the performance.  If enough training data are available,
202: one uses one dataset to train a variety of classifiers with
203: different parameters, a second dataset to select which set of parameters
204: produces the best classifier, and a third dataset to validate the
205: final performance.
206: Ideally one trains and validates
207: using real data; in practice simulated data are often used for training
208: and validation before applying the classifier to real data.
209: It is important to note that the quality and power of any classifier
210: will be affected by the accuracy of the training sample.  One must be
211: careful to minimize and measure any biases introduced through a simulated
212: training sample which does not completely reflect real data.
213: 
214: \subsection{Threshold Cuts}
215: \label{sec:cuts}
216: 
217: Automated supernova searches have typically operated by applying
218: simple threshold cuts to the features describing objects.
219: For example, a supernova search might
220: keep objects which have a signal-to-noise ratio $S/N > 5$,
221: astrometric positions that agree to within 1 arcsec on 2 or more images,
222: and a width consistent with stars on the images ({\it e.g.}, within a
223: factor of 2 of the median width of stars).
224: If an object fails any of these cuts, it is rejected.
225: These cuts are easy to understand but do not reflect
226: the subtleties of a multidimensional space.
227: An object which just barely
228: fails one of the cuts is still rejected the same as an object which
229: fails many cuts.  Threshold cuts also do not naturally handle correlations between
230: the variables, {\it e.g.,} between the $S/N$ and the astrometric
231: accuracy.\footnote{In a simple case such as this, one could combine $S/N$ and
232:     astrometric positions to form an uncorrelated variable; accomplishing this
233:     in the general case for a large number of variables is non-trivial.}
234: To use threshold cuts, one must find uncorrelated variables without
235: significant outliers such that every cut maintains a high signal efficiency
236: while rejecting background.
237: 
238: Compared to curved boundary selections, threshold cuts are also
239: an inefficient way to select a subset of a hyperspace as the number
240: of dimensions grows large \citep{koeppen}, even for as few as 5 dimensions.
241: For example, in 3 dimensions the volume ratio of a cube to its inscribed
242: sphere is 1.9; in 5 dimensions it is 6.1, and in 10 dimensions
243: it is 401.5.
244: This ratio goes to infinity as the number of dimensions increases.
245: Thus if
246: a set of signal events is distributed as an ellipsoid in some feature
247: space, an ellipsoid shaped selection contains much less volume
248: (and thus likely much less background) than the equivalently dimensioned
249: hypercube.
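
The ratios quoted above can be reproduced from the standard formula for the
volume of a $d$-dimensional sphere.  As a sketch, assuming a hypercube of
side $2r$ and its inscribed sphere of radius $r$ (the symbols $d$ and $r$
are introduced here only for illustration), the volume ratio is
\begin{equation}
{V_{\rm cube} \over V_{\rm sphere}}
  = {(2r)^d \over \pi^{d/2} r^d / \Gamma(d/2+1)}
  = {2^d \, \Gamma(d/2+1) \over \pi^{d/2}},
\end{equation}
which gives $8 / (4\pi/3) \approx 1.9$ for $d=3$,
$32 / (8\pi^2/15) \approx 6.1$ for $d=5$, and
$1024 / (\pi^5/120) \approx 401.5$ for $d=10$.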
250: 
251: Although commonly used in supernova searches,
252: threshold cuts are widely recognized
253: as being a non-optimal method for signal/background separation problems.
254: The following sections describe a variety of more powerful techniques for
255: identifying supernovae in difference images.
256: 
257: \subsection{Multi-dimensional Probability Measures}
258: 
259: A more sophisticated approach models the probability distribution
260: function (PDF) for the signal and the background for each of the 
261: features.  The combined probability of all of the feature values
262: for an object is used to make the signal/background decision.  This
263: improves over threshold cuts by eliminating rejections based upon a
264: slightly marginal value of a single feature, but it requires a detailed
265: modeling of the PDF of each feature, including all correlations and
266: outliers in the distributions.  This suffers from
267: the ``curse of dimensionality'' \citep{bellman}: since the volume
268: of a hyperspace grows exponentially with the number of dimensions,
269: the size of a training sample must also grow exponentially to
270: adequately determine the PDFs.
271: 
272: If the signal and background features are Gaussian distributed with
273: only linear correlations, Fisher Discriminant Analysis \citep{fisher}
274: finds the best linear combination of
275: features to maximize the separation of the two classes.
276: Figure \ref{fig:fisher} shows a toy example of
277: data which would be well separated using Fisher
278: Discriminant Analysis.  The two classes of events (blue triangles
279: and red squares) are not well separated by either feature $A$
280: or $B$, but their correlation is such that the combination
281: $A + B$ provides very good separation of the two classes.
282: 
283: More generally, if a set of events $\{{\bf x}\}$ in some feature space
284: have means ${\bm \mu}_{0,1}$ and covariances
285: $\Sigma_{0,1}$ for classes 0 and 1, then a linear combination
286: ${\bf w} \cdot {\bf x}$
287: will have means ${\bf w} \cdot {\bm \mu}_{0,1}$ and variances
288: ${\bf w}^T \Sigma_{0,1} {\bf w}$, where ${\bf w}$ is a set of
289: coefficients defining a linear combination of the features.
290: The separation of the two classes may be defined as
291: \begin{equation}
292: \Delta = {({\bf w} \cdot {\bm \mu}_0 - {\bf w} \cdot {\bm \mu}_1)^2 \over 
293:     {\bf w}^T \Sigma_0 {\bf w} + {\bf w}^T \Sigma_1 {\bf w} },
294: \end{equation}
295: {\it i.e.}, the separation of the means is measured in units of the
296: variances.
297: Fisher showed that the maximum separation is achieved when
298: \begin{equation}
299: {\bf w} = (\Sigma_0 + \Sigma_1)^{-1} ({\bm \mu}_1 - {\bm \mu}_0).
300: \end{equation}
301: The means and covariances of the signal (1) and background (0) classes
302: may be estimated from a training sample, and thus the calculation of
303: the best linear combination for separating the classes
304: is simply a matrix inversion.  This method
305: breaks down when there are non-linear correlations or when there are
306: significant outliers or otherwise non-Gaussian variances such that a
307: simple mean
308: and covariance is not a good descriptor of the feature distributions
309: for the two classes.  In practice, Fisher Discriminant Analysis is
310: most often used to combine several linearly correlated features
311: into a single feature to reduce the dimensionality of a problem before
312: applying another classification method.
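
As a worked illustration in the spirit of the toy example above (the numbers
below are chosen for convenience and are not taken from
Figure \ref{fig:fisher}), consider two features $(A, B)$ with
\begin{equation}
{\bm \mu}_0 = \left( \begin{array}{c} 0 \\ 0 \end{array} \right), \quad
{\bm \mu}_1 = \left( \begin{array}{c} 1 \\ 1 \end{array} \right), \quad
\Sigma_0 = \Sigma_1 =
  \left( \begin{array}{cc} 1 & -0.8 \\ -0.8 & 1 \end{array} \right).
\end{equation}
Then
\begin{equation}
{\bf w} = (\Sigma_0 + \Sigma_1)^{-1} ({\bm \mu}_1 - {\bm \mu}_0)
  = {1 \over 1.44}
    \left( \begin{array}{cc} 2 & 1.6 \\ 1.6 & 2 \end{array} \right)
    \left( \begin{array}{c} 1 \\ 1 \end{array} \right)
  = \left( \begin{array}{c} 2.5 \\ 2.5 \end{array} \right),
\end{equation}
{\it i.e.}, the combination $A + B$ up to an overall scale, which does not
affect $\Delta$.  For this direction $\Delta = 2^2 / (0.4 + 0.4) = 5$,
whereas using feature $A$ alone (${\bf w} = (1, 0)$) gives only
$\Delta = 1^2 / (1 + 1) = 0.5$.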
313:     
314: \subsection{Decision Trees}
315: 
316: Decision trees \citep{breiman}
317: separate signal from background events by making a
318: cascading set of event splits as shown in Figure \ref{fig:decisiontree}.
319: This forms a generalization of threshold cuts by
320: selecting many hypercubes in the multi-dimensional feature space
321: rather than a single hypercube of cuts.  The training procedure
322: described below automatically selects the features and cut values
323: to generate a tree with maximal separation of signal and background
324: events.
325: 
326: The training procedure begins with a sample of training events and
327: considers all features and cut values to form
328: two subsets with the best separation of signal and background.
329: The procedure is recursively applied to each of the subsets to form
330: further branches.  The recursion is stopped when some condition is
331: met, {\it e.g.}, the subset is entirely signal or background, or
332: the subset has reached a minimum allowed size
333: (a minimum size requirement prevents
334: overtraining on statistical fluctuations of small samples).
335: The terminal nodes which are not further split are called leaves,
336: and are assigned as either signal or background leaves depending
337: upon the training events which ended up on those leaves.
338: 
339: There are a variety of ways to define the best separation at each
340: split; for this study we used the Gini parameter \citep{gini, breiman},
341: which is widely used and provides robust performance.
342: Define the purity of a sample of training events as
343: \begin{equation}
344: P = {\sum_S w_S \over \sum_S w_S + \sum_B w_B}
345: \end{equation}
346: where the sums are over the signal events $S$ and background events $B$
347: and $w_i$ are a set of event weights.  Typically all of the weights
348: are the same and their absolute normalization is arbitrary.
349: If needed, relative weights may be used to increase
350: the influence of an underrepresented subsample of the training data.
351: The role of weights will be more important in the Boosted Trees
352: method described in \S \ref{sec:boostedtrees}.
353: Note that $P=1$ for a sample of pure signal events, $P=0$ for a sample of pure
354: background events, and $P(1-P)=0$ for a sample which is either purely
355: signal or purely background.
356: 
357: Define
358: \begin{equation}
359: {\rm Gini} = P(1-P) \sum_{i=1}^n w_i 
360: \end{equation}
361: where the sum is over all events in that sample.
362: At each node, the training procedure considers all possible
363: features and cut values to minimize the quantity
364: \begin{equation}
365: {\rm Gini}_{\rm left\ child} + 
366: {\rm Gini}_{\rm right\ child}
367: \end{equation}
368: to find the best separation of events.
369: If this split would not increase
370: the overall quality of the tree, {\it i.e.},
371: \begin{equation}
372: {\rm Gini}_{\rm parent} <
373: {\rm Gini}_{\rm left\ child} + 
374: {\rm Gini}_{\rm right\ child}
375: \end{equation}
376: then the node is left as a leaf node, assigning it as a signal leaf
377: if $P>0.5$ and a background leaf otherwise.  If the split would increase
378: the overall quality of the tree, the events are split into two nodes
379: and the procedure is recursively applied to each of those nodes until
380: the stopping conditions are met ({\it e.g.}, minimum leaf sizes) or no
381: splits can be found which would improve the overall quality of the tree.
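
As a numerical illustration of the splitting criterion (hypothetical event
counts with unit weights), consider a parent node containing 10 signal and
10 background events, so that $P = 0.5$ and
\begin{equation}
{\rm Gini}_{\rm parent} = 0.5 \times 0.5 \times 20 = 5.
\end{equation}
A candidate cut which sends 8 signal and 2 background events to the left
child and the remaining 2 signal and 8 background events to the right
child gives
\begin{eqnarray}
{\rm Gini}_{\rm left\ child}  & = & 0.8 \times 0.2 \times 10 = 1.6 \\
{\rm Gini}_{\rm right\ child} & = & 0.2 \times 0.8 \times 10 = 1.6
\end{eqnarray}
Since $1.6 + 1.6 = 3.2 < 5$, the split improves the tree and is accepted.
If neither child were split further, the left child ($P = 0.8$) would become
a signal leaf and the right child ($P = 0.2$) a background leaf.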
382: 
383: Decision trees are a generalization of threshold cuts and thus have more
384: flexibility to optimally select a set of signal events within a feature space.
385: However, single decision trees tend to be unstably dependent upon
386: the details of the
387: training set.  A small change in the training set can produce a
388: considerably different tree and thus a considerably different performance
389: on the validation set.
390: 
391: \subsubsection{Boosted Trees}
392: \label{sec:boostedtrees}
393: 
394: Boosting algorithms improve the performance of a classifier by
395: giving greater weight to events which are hardest to classify.  In the
396: case of decision trees, a tree is trained on a set of data,
397: misclassified events are identified and their weights are increased, and the
398: process is repeated to form new trees.  This iteratively produces
399: a set of increasing quality decision trees.
400: The final classifier uses the weighted ensemble average
401: of all of the trees to make a classification decision.  The boosting
402: provides decision trees with better separation power, and the ensemble
403: average washes out the training instabilities associated with single
404: decision trees.  In applications with $\sim$20 or more input features,
405: Boosted Decision Trees can provide significantly better results than
406: Artificial Neural Networks \citep{miniboone}; see also \S \ref{sec:ann}.
407: 
408: There are a variety of boosting algorithms used to increase the weights
409: of misclassified events
410: \citep{bdt:freund96, bdt:friedman01, bdt:friedman00}.
411: We describe here the commonly used
412: Discrete AdaBoost method \citep{bdt:freund96}.
413: Define the error rate for tree $m$ as
414: \begin{equation}
415: {\rm err}_m = {\sum_{i=1}^{N} w_i I_i \over \sum_{i=1}^N w_i}
416: \end{equation}
417: where $I_i = 0$ if event $i$ is correctly classified and $I_i = 1$
418: if it is incorrectly classified.  Typically the first tree is trained
419: with the same weight for all events.
420: Then adjust each of the event weights using
421: \begin{eqnarray}
422: \alpha_m    & = & \beta \times \ln[ (1-{\rm err}_m) / {\rm err}_m ]     \\
423: w_i & \to & w_i \times e^{\alpha_m I_i}
424: \end{eqnarray}
425: This increases the weights of misclassified events; the weights are
426: increased more when the tree has a low error rate.
427: These new weights are then used to generate a new decision tree.
428: The standard AdaBoost algorithm uses $\beta=1$ but this can be adjusted to
429: vary how quickly the weights are updated with each iteration.
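
As a worked illustration (hypothetical numbers), suppose a tree is trained
on $N = 10$ events with unit weights and misclassifies 2 of them.  With
$\beta = 1$,
\begin{eqnarray}
{\rm err}_m & = & {2 \over 10} = 0.2 \\
\alpha_m    & = & \ln[ (1 - 0.2) / 0.2 ] = \ln 4 \approx 1.39 \\
w_i         & \to & 1 \times e^{\alpha_m} = 4
                    \quad \mbox{(misclassified events only)}
\end{eqnarray}
After the update, the two misclassified events carry a total weight of 8
out of $8 \times 1 + 2 \times 4 = 16$.  This is a general property of the
$\beta = 1$ update: the total weight of the misclassified events is
multiplied by $(1-{\rm err}_m)/{\rm err}_m$ and so becomes equal to the
total weight of the correctly classified events.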
430: 
431: After generating $M$ individual trees with weights $\alpha_m$,
432: the final classifier answer for an event described by
433: a set of features ${\bf x}$ is
434: \begin{equation}
435: T({\bf x}) = \sum_{m=1}^M \alpha_m T_m({\bf x})
436: \end{equation}
437: where $T_m({\bf x})$ is the result for tree $m$: 
438: 0 if ${\bf x}$ lands
439: on a background leaf and +1 for a signal leaf.
440: The absolute normalization of $T({\bf x})$ is arbitrary; we
441: chose to renormalize the $\alpha_m$ weights such that
442: $0 \le T({\bf x}) \le 1$.
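
One simple normalization which satisfies this bound, shown here as an
assumption for illustration rather than as a description of the exact
convention used in our software, is
$\alpha_m \to \alpha_m / \sum_{k=1}^M \alpha_k$.
For example, three trees with hypothetical weights
$\alpha = (2, 1, 1)$ are renormalized to $(0.5, 0.25, 0.25)$.  An event
which lands on signal leaves of the first and third trees and on a
background leaf of the second then receives
\begin{equation}
T({\bf x}) = 0.5 \times 1 + 0.25 \times 0 + 0.25 \times 1 = 0.75 .
\end{equation}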
443: 
444: 
445: \subsubsection{Random Forests}
446: 
447: Random Forests \citep{rf:breiman}
448: also generate multiple decision trees for a given
449: training set and use a weighted average of the trees as the final
450: decision metric.  When training a tree, at each branch the
451: training cycle only considers a random subset of the possible features
452: available to use.  This has the effect of washing out
453: the typical training instabilities of decision trees and produces
454: a classifier which is fast to train and robust against outliers.
455: 
456: \subsection{Support Vector Machines}
457: 
458: The Support Vector Machine (SVM) algorithm is a classification
459: method that has successfully been
460: applied to many pattern recognition problems and is founded on
461: principles of statistical learning theory~\citep{svm:va98, svm:chen05}.
462: It nonlinearly
463: maps data points from the original input space to a higher-dimensional
464: feature space in which an optimal hyperplane parameterized by a normal
465: vector ${\bf w}$ and offset $b$ is computed such that the separation between
466: events in different classes is maximized. The linear decision boundary
467: is defined by $f({\bf x}) = {\bf w} \cdot {\bf \phi}({\bf x}) + b = 0$,
468: where ${\bf x}$ is the vector of input features which describes an object and
469: ${\bf \phi}$ is a mapping which embeds the problem into a
470: higher-dimensional space in which classes are more easily
471: separable than in the original feature space.
472: 
473: An optimization problem is
474: constructed to find the unknown hyperplane parameters, and
475: the optimal hyperplane normal ${\bf w}$ is found to be entirely
476: determined by the subset of events nearest to the optimal
477: decision boundary (also called support vectors, ${\bf x}_i$) as
478: follows: 
479: ${\bf w} = \sum_i c_i \phi({\bf x}_i)$,
480: where the coefficients $c_i$ are 
481: the Lagrange multipliers used in solving the nonlinear
482: optimization and are a byproduct of the optimization.
483: 
484: %%% The subset of events nearest to the decision boundary
485: %%% (the ``support vectors'' ${\bf x}_i$) determine the normal vector
486: %%% ${\bf w} = \sum_i c_i \phi({\bf x}_i)$.
487: 
488: The hyperplane parameters are solved by maximizing the margin
489: (the distance between the hyperplane and the nearest example events in each class),
490: which is formulated as a nonlinear
491: constrained optimization problem, where the constraints
492: enforce that examples from different classes lie on opposite
493: sides of the hyperplane.
494: The objective function to be minimized is convex, {\it i.e.},
495: any local minimum is also the global minimum.
496: The linear decision boundary corresponds to a nonlinear
497: (and possibly disjoint)
498: decision boundary in the original feature space. Once the hyperplane is
499: found, a set of features ${\bf x}$ is typically classified into one of
500: the two classes by applying a threshold cut to $f({\bf x})$.
501: 
502: Rather than calculating $\phi({\bf x})$ explicitly while
503: evaluating
504: ${\bf w} \cdot \phi({\bf x}) =
505: \sum_i c_i \phi({\bf x}_i) \cdot \phi({\bf x})$,
506: the actual embedding is
507: achieved through a kernel function defining an inner product in the
508: embedding space,
509: $k({\bf x}_1,{\bf x}_2) = {\bf \phi}({\bf x}_1) \cdot
510: {\bf \phi}({\bf x}_2)$.
511: This ``kernel trick'' makes class prediction easy to implement
512: and fast to compute.  Several common kernel mappings are
513: given in \cite{svm:chen05}.
514: In practice, the kernel function is typically chosen empirically
515: via training and testing, and the simplest function giving the
516: desired performance is used.  The Gaussian kernel used in this analysis
517: \begin{equation}
518: k({\bf x}_1,{\bf x}_2) = \exp( -||{\bf x}_1 - {\bf x}_2||^2 / 2 \sigma^2)
519: \end{equation}
520: is commonly used because it only has one free parameter to be
521: tuned ($\sigma$)
522: and empirically performs as well as, if not better than, more
523: complex kernels which may overfit the data.
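
Combining the expansion of ${\bf w}$ in terms of the support vectors with
the kernel definition, the decision function can be written without any
explicit reference to $\phi$:
\begin{equation}
f({\bf x}) = \sum_i c_i \, k({\bf x}_i, {\bf x}) + b .
\end{equation}
For the Gaussian kernel, $k = 1$ when the two points coincide,
$k = e^{-1/2} \approx 0.61$ when they are separated by a distance $\sigma$,
and $k = e^{-9/2} \approx 0.011$ at a separation of $3\sigma$.
Support vectors more than a few $\sigma$ away from ${\bf x}$ therefore
contribute little to $f({\bf x})$, so $\sigma$ sets the length scale in
feature space over which the decision boundary can vary.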
524: 
525: %%% The most commonly used kernel, the Gaussian kernel,
526: %%% corresponds to a $\phi({\bf x})$ transformation into an infinite
527: %%% dimensional space.
528: 
529: For this analysis we used a soft-boundary SVM method
530: called $C$-SVM, which handles noisy data
531: with high class overlap by adding a regularization term to the objective
532: function.  This term allows but penalizes training
533: points lying on the wrong side
534: of the decision boundary. The regularization parameter, $C$,
535: controls the trade-off between maximizing the separation
536: and allowing some amount of training error while finding the
537: hyperplane which maximally separates signal from background.
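
For reference, the standard textbook form of the $C$-SVM optimization is
sketched below.  The class labels $y_i = \pm 1$ and the slack variables
$\xi_i$ are notation introduced here for illustration:
\begin{equation}
\min_{{\bf w},\, b,\, \xi} \ {1 \over 2} ||{\bf w}||^2 + C \sum_i \xi_i
\quad {\rm subject\ to} \quad
y_i \, ( {\bf w} \cdot \phi({\bf x}_i) + b ) \ge 1 - \xi_i ,
\quad \xi_i \ge 0 .
\end{equation}
A training point with $\xi_i > 1$ lies on the wrong side of the decision
boundary.  A large value of $C$ penalizes such points heavily and yields a
narrower margin, while a small value of $C$ tolerates more training error
in exchange for a wider margin.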
538: 
539: The advantages of SVMs include the existence of a unique solution, the
540: simple geometric interpretation of the margin maximization function, the
541: capacity to compute arbitrary nonlinear decision boundaries while
542: controlling over-fitting with soft margins, the low number of parameters
543: to be tuned (as few as two, depending on the choice of kernel), and the
544: dependence of the solution on only a small number of data points
545: (the support vectors)
546: which define the boundary of the class separation hypersurface.
547: For SVM implementation details, see \cite{svm:va98}.
548: 
549: \subsection{Artificial Neural Networks}
550: \label{sec:ann}
551: 
552: Artificial Neural Networks (ANNs) are a broad category of
553: classification methods
554: originally inspired by the interconnected structure of neurons
555: and synapses in the brain.
556: These methods map a set of input variables to one or more output results
557: via one or more ``hidden layers'' of intermediate nodes.
558: % as shown in Figure \ref{fig:ann}.
559: For a supernova search, the inputs would be the features describing
560: each object and the desired output would be 1~(0) for signal~(background).
561: For an overview of these methods, see \cite{ann:bishop}.
562: 
563: %%% A commonly used form of an ANN is a Multilayer Perceptron, whose nodes
564: %%% map their inputs $x_i$ into their outputs $y_j$
565: %%% based upon a set of weights $w_{ij}$ and a step-like function such as
566: %%% $y_j = 1 / (1 + e^{-\sum w_{ij} x_i})$.
567: %%% These intermediate results are then mapped to the final output (optionally
568: %%% via additional hidden layers) using
569: %%% another set of weights $v_j$: $O = \sum v_j y_j$.
570: %%% ANNs are iteratively trained on multiple training
571: %%% sets, updating the weights with each cycle to minimize the error of
572: %%% the output value.
573: %%% The performance is evaluated at each cycle on an independent verification
574: %%% sample and the training is stopped when the performance begins to get
575: %%% worse, indicating that the ANN is becoming overtrained.
576: 
577: ANNs can be powerful classifiers and have been used in many applications,
578: though they are slow to train and require some experimentation to
579: optimize the number of hidden layers and nodes to match a given problem.
580: They also do not scale well with an increasing number of input
581: features, and their results become unstable when there are significant
582: outliers or otherwise irrelevant input data.
583: For these reasons,
584: ANNs were not deemed to be an appropriate classification method for
585: our dataset and this method was not
586: pursued for this study.
587: 
588: \section{Nearby Supernova Factory Search}
589: \label{sec:SNfactory}
590: 
591: The Nearby Supernova Factory \citep{snfactory} search uses
592: data from the Near Earth
593: Asteroid Tracking (NEAT) program\footnote{http://neat.jpl.nasa.gov}
594: and the Palomar QUEST consortium\footnote{http://hepwww.physics.yale.edu/quest/palomar.html}
595: using the 112-CCD QUEST-II camera \citep{questcamera}
596: on the Palomar Oschin 1.2-m telescope.
597: The NEAT observing pattern obtains triplets of 60-second exposures spread
598: over a time period of $\sim$1 hour using a single RG610 filter,
599: which is a long pass filter redward of 610 nm.
600: This allows the search to distinguish between asteroids, whose motion is
601: typically detectable
602: on that timescale, and spatially static objects such as supernovae.
603: The QUEST data are obtained in 4 filters in drift-scan mode; our search
604: uses the two filters
605: which cover the best quality CCDs
606: (either Bessell $R$ and $I$ or Gunn $r$ and $i$
607: depending upon camera configuration).
608: The QUEST data cover less area and tend to be
609: cosmetically cleaner than the NEAT data, resulting in fewer spurious
610: detections overall.
Since false-positive background
events are far more numerous in the NEAT data, our study of alternative
classification methods has focused on the NEAT dataset.
614: 
615: Coadded stacks of images taken from 2000 to 2003 are used as references.
616: The new and reference
617: images are convolved to match their point-spread-functions (PSFs),
618: the fluxes are normalized by
619: matching stars, and the reference is subtracted from the new
620: images.  Objects in the subtraction are identified based upon contiguous
621: pixels with $S/N > 3$ with at least one pixel with $S/N > 5$.
622: Objects are described by features such as position,
full width at half maximum (FWHM) in $x$ and $y$,
624: aperture photometry and associated uncertainties in 3 apertures,
625: distance to nearest object
626: in the reference coadd, and measures of the roundness and irregularity
627: of the object contour
628: based upon Fourier descriptors \citep{zahn}.
629: Additional features are formed as combinations of features from
630: the same object observed on multiple images.
631: Combined features include
632: the object motion between two images and the consistency of the statistical
633: significance of the measurements in different images.  The features are
634: used by a classification method (originally threshold cuts, more recently
635: Boosted Decision Trees)
636: to select supernova candidates of interest which are then visually
637: scanned by humans to select the best candidates for spectroscopic
638: confirmation and followup by the SuperNova Integral Field Spectrometer
639: (SNIFS) \citep{snfactory}, on the University of Hawaii 2.2-m telescope on
640: Mauna Kea.
641: 
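As a compact restatement of the detection criterion above (the symbols
are shorthand introduced here for illustration, not pipeline variable
names), let $s_p$ denote the signal-to-noise ratio of pixel $p$ in the
subtraction.  A contiguous set of pixels $R$ is kept as an object when
\begin{displaymath}
s_p > 3 \;\; {\rm for~all} \;\; p \in R
\qquad {\rm and} \qquad
\max_{p \in R} s_p > 5 .
\end{displaymath}
In a two-threshold scheme of this kind, the lower threshold sets the
extent of an object while the higher threshold requires at least one
clearly significant pixel.
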
642: \section{Training Dataset}
643: \label{sec:training}
644: 
645: To generate signal events for training, fake supernovae were
646: introduced into the images by moving real stars of a desired magnitude
647: to locations distributed about known galaxies on the SNfactory search
648: images.  By using real stars from the same images as the galaxies,
649: we realistically model the point-spread-function, noise, and possible
650: image artifacts present in that image.  These images with fake supernovae
651: were processed with the same data analysis pipeline as real images
652: to identify objects and measure their features for classification.
653: 
654: The stars are sampled in a circular region of 20 pixels in diameter.
655: The typical FWHM of stars on these images is about 3 pixels,
656: so this samples the PSF out to $\sim8\sigma$.
657: The average sky level of the image is subtracted
658: from the sampled pixels before they are added in the new location, which
implicitly assumes that the sky level is uniform over the image.
660: This is valid in
661: most cases, and it is simple to reject fakes created from cases that
662: violate this assumption.  Typically these fakes will have a FWHM that
663: differs very significantly from the average ($\sim$20 pixels vs. $\sim$3).
664: Most stars are sufficiently isolated that they do not bring along portions of
665: other objects, and we identify and reject cases where this does
666: happen.  The spatial variation of the PSF is minimal compared to the
night-to-night variations which must be addressed by the image subtraction
668: pipeline, thus this fake supernova generation procedure does not attempt
669: to correct for the small spatial variations of the PSF across the CCD.
670: %%% Since the chosen star could come from anywhere on the image,
671: %%% the average distance shifted is $\sim$850 pixels with a large variance.
672: %%% For comparison, the image size is $2400 \times 598$ pixels or
673: %%% $34.8' \times 8.7'$.
674: 
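As a rough check of the $\sim8\sigma$ figure above, assuming a Gaussian
PSF profile (an approximation made here only for this estimate), the
typical FWHM corresponds to
\begin{displaymath}
\sigma = \frac{{\rm FWHM}}{2\sqrt{2\ln 2}}
       \approx \frac{3~{\rm pixels}}{2.355}
       \approx 1.3~{\rm pixels},
\qquad
\frac{r}{\sigma} \approx \frac{10~{\rm pixels}}{1.3~{\rm pixels}}
                 \approx 8 ,
\end{displaymath}
where $r = 10$ pixels is the radius of the 20-pixel-diameter sampling
region.
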
675: Background events were randomly selected from 4.1 million
676: other objects identified on the subtractions with fake supernovae.
677: These subtractions covered a month of data taking including bright
678: and dark times and a variety of seeing conditions.
679: Objects within 20 pixels of a fake supernova were excluded to
680: avoid any artifacts which might be introduced through an ill-formed fake.
These background events form a randomly selected subset of the genuine
backgrounds faced by the supernova search in real data, and thus
preserve the true relative fractions of each type of background event.
684: 
685: The signal and background samples were split into training and validation
686: subsets.  Several training sets were formed with 5,000 signal and 5,000
687: background events each.  The final validation was performed using 20,000
688: signal and 200,000 background events.  The training dataset for the
689: Support Vector Machine method was augmented with real supernova
690: discoveries in an attempt to improve its overall performance.
691: The original training set used 19 features; an additional 13 features
692: were then added which improved the performance of the Boosted Trees
693: and Random Forests but decreased the performance of the SVM.
694: The results shown in \S \ref{sec:comparison} are for the best performance
695: achieved for each classifier ({\it i.e.}, using 19 features for SVM
696:         and 32 features for the other methods).
697: 
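For orientation on the statistical precision of the validation samples,
a standard binomial estimate may be useful (the notation $f$, $N$, and
$\sigma_f$ is introduced here for illustration).  A selection fraction
$f$ measured on $N$ independent events has an uncertainty of approximately
\begin{displaymath}
\sigma_f \approx \sqrt{\frac{f(1-f)}{N}} .
\end{displaymath}
For the $N = 200{,}000$ background validation events, a false positive
fraction of $f = 0.01$ (a value chosen purely as an example, not a
measured operating point) would be determined to
$\sigma_f \approx 2.2 \times 10^{-4}$, or about 2\% of its value.
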
698: \section{Classification software}
699: \label{sec:software}
700: 
701: For Fisher Discriminant Analysis, Boosted Trees, and Random Forests,
702: we used the open-source C++ software
703: package StatPatternRecognition.\footnote{
704: http://sourceforge.net/projects/statpatrec}
705: Training a set of 200 boosted trees
706: using 10,000 training events with 19
707: features each with a minimum leaf size of 15
708: events\footnote{The selection of the number of trees and minimum leaf
709:     size are described in section \ref{sec:paramchoice}.}
710: takes $\sim$3 minutes (wallclock) on a 2 GHz AMD Opteron CPU.
711: Training with 32 features takes $\sim$45 minutes.
712: Once the trees are trained, it takes approximately 0.6 ms (wallclock)
713: to evaluate the results for an object.
714: Random Forests took less than 2 minutes to train on the dataset
715: with 32 features, using the same parameters as above.  Evaluation
716: of a new event takes approximately 0.2 ms.
717: Fisher Discriminant analysis took 1 second to train on the
718: 32 feature dataset and 0.07 ms to evaluate results for a new object.
719: 
720: For SVM, we used the LIBSVM C++ package.\footnote{
721: http://www.csie.ntu.edu.tw/$\sim$cjlin/libsvm}
722: Training a $C$-SVM
723: using 10,000 training events with 19
724: features each takes from 5 to 15 seconds (wallclock)
725: on a 2 GHz AMD Opteron CPU,
726: depending on the settings of the two parameters used (if the parameters
727: overfit the data, more support vectors are needed so training time
728: increases).  Evaluating the SVM on a new data point takes approximately
729: 0.6 ms (wallclock).
730: 
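To put the per-object evaluation times above in context, a simple
estimate (assuming a single CPU of the type quoted and taking the
4.1 million objects of \S \ref{sec:training} as a representative month
of detections, both assumptions made here for illustration) multiplies
the number of objects $N_{\rm obj}$ by the evaluation time $\tau$ per
object:
\begin{displaymath}
T = N_{\rm obj} \, \tau \approx 4.1 \times 10^{6} \times 0.6~{\rm ms}
  \approx 2500~{\rm s} \approx 41~{\rm minutes}
\end{displaymath}
for Boosted Trees or SVM, compared with $\sim$14 minutes for Random
Forests ($\tau = 0.2$ ms) and $\sim$5 minutes for Fisher Discriminant
Analysis ($\tau = 0.07$ ms).
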
731: \section{Comparison of Methods}
732: \label{sec:comparison}
733: 
734: In the end, most classification methods produce a single classification
735: statistic with arbitrary normalization which
736: rates how signal-like or how background-like the candidate is.
737: A threshold cut on this statistic can be used to select a subsample of
738: events with desired signal {\it vs.}~background purity.
739: A useful way to visualize the power of a classifier is
740: to plot the fraction of false positives ({\it i.e.}, background
741: events incorrectly classified as signal) {\it vs}.~fraction of true positives
742: ({\it i.e.}, signal events correctly identified)
743: for various selection values on the classification statistic.
744: 
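In symbols (the notation is introduced here for illustration), let $s$
be the classification statistic and $t$ the selection value, and assume
the convention that larger $s$ is more signal-like.  The true positive
and false positive fractions are then
\begin{displaymath}
\epsilon_S(t) = \frac{N_S(s > t)}{N_S} ,
\qquad
\epsilon_B(t) = \frac{N_B(s > t)}{N_B} ,
\end{displaymath}
where $N_S$ and $N_B$ are the total numbers of signal and background
events in the validation sample.  Varying $t$ traces out a curve of
$\epsilon_B$ {\it vs.}~$\epsilon_S$; an ideal classifier would reach
$\epsilon_S = 1$ at $\epsilon_B = 0$.
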
745: Figure \ref{fig:eff} shows the performance of several classification
methods applied to the SNfactory dataset.  The red diamond shows the
performance of the original threshold cuts, which we set out
to improve.  The curves show that SVM, Random Forests, and Boosted Trees
749: all performed dramatically better than the threshold cuts
750: across a wide range of signal and background efficiencies.
751: The object features in our data have significant outliers which
752: prevented Fisher Discriminant Analysis
753: from being a useful classification method.
754: 
755: The overall best performance was obtained using Boosted Decision Trees,
756: with Random Forests providing nearly as good performance
757: with faster training and evaluation times.
758: Although SVM performed considerably better than threshold cuts, it
759: was not as successful as Random Forests or Boosted Trees.
760: A possible explanation is that the SVM proved to be more sensitive to
761: signal events that lie close to background events in the feature
762: space,
763: and could not strike a balance between modeling such
764: events {\it vs}.~overfitting to noise.
765: For example, a dim young supernova on a bright galaxy can be very similar
766: to a statistical fluctuation or a modest subtraction error in the images.
767: Robustness against overfitting is a known strength of boosted
768: classifiers \citep{bdt:freund99}, and the issue of overfitting
769: noisy data is an area of active research within the machine
770: learning community.
771: Further details of applying SVM to the
772: SNfactory dataset are described in \cite{svm:romano}.
773: 
774: Boosted Trees, Random Forests, and SVM successfully reduced the
775: false-positive rate for all types of background events in our data.
776: The three most common remaining background types are
777: faint optical ghosts from scattered light,
778: fluctuations in charge trails from bright stars due to CCD charge
779: transfer inefficiency,
780: and leftover dipoles from subtracting astrometrically misaligned objects.
781: The optical ghosts can be genuinely difficult to distinguish from
782: dim supernovae near our detection threshold.
783: The charge trails and dipoles are easy to distinguish by eye
784: but we currently do not have specific features which directly
785: address these two backgrounds, thus all classification methods
786: have difficulty with them, given the input features currently available.
787: These backgrounds are somewhat described by a roundness
788: feature and comparison of the flux in small {\it vs}.~large apertures.
789: Adding features to directly address these
790: backgrounds would improve the power of any classification method.
791: Projects with higher quality CCDs and optical designs to minimize
792: scattered light will naturally have fewer backgrounds as well.
793: 
794: The optimum selection criterion is dependent upon the tradeoff between
795: true positive selection efficiency (horizontal axis) and the false
796: positive selection efficiency (vertical axis).
797: At the SNfactory we seek
798: to maximize our signal efficiency within the realistic constraints of
799: the personnel and telescope time available to vet false positives.  For
800: the Fall 2006 search, we used Boosted Decision Trees and
chose a point with 10 times less background
802: than we had previously faced.  This corresponds to an average efficiency
803: of $\sim$78\% for identification of a supernova with a single filter
804: and one night of imaging.
805: 
806: \subsection{Optimizing Boosted Tree Parameters}
807: \label{sec:paramchoice}
808: 
809: Since Boosted Trees provided the best classification performance,
810: we describe here how the performance changed with various input
811: parameters to the Boosted Tree training.
812: The performance of Boosted Trees depends upon the number of trees generated
813: and the amount of branching which is done before finishing each tree.
814: 
815: For controlling the amount of branching per tree, the StatPatternRecognition
816: package has an adjustable limit on minimum number of events per leaf in
817: the final tree.
818: Our training
819: sample contained 5,000 signal and 5,000 background events.
820: We found the best performance with a minimum leaf size of 15 events,
821: which results in a set of boosted trees with 250--300 leaves each.
822: Figure \ref{fig:nperleaf} shows the relative performance of 200 trees with
a minimum leaf size of $n=5, 25, 50, 100$ relative to the $n=15$ case.
824: 
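Some simple arithmetic, added here for orientation, relates these
numbers.  Assuming each tree is grown on all 10,000 training events,
a minimum leaf size of 15 events allows at most
\begin{displaymath}
\frac{10{,}000}{15} \approx 667
\end{displaymath}
leaves per tree, while the observed 250--300 leaves correspond to a
mean of $10{,}000/300 \approx 33$ to $10{,}000/250 = 40$ events per leaf.
The typical leaf is thus populated at roughly twice the minimum size.
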
825: Figure \ref{fig:ntree} shows the relative performance of
826: $N = 25, 50, 100, 200$ trees in comparison to the $N=400$ case, using
827: a minimum leaf size of 50 events.
828: For the Fall 2006 SNfactory supernova search
we chose to use 200 trees with a minimum of 15 events per leaf out of the
830: 10,000 training events.
831: 
832: \section{Combining Methods}
833: 
834: As expected, the various methods provided correlated output;
835: {\it i.e.}, events ranked highly by one classifier tended to be ranked
highly by another.  But even though Boosted Trees provided the best
classification performance overall, there were good signal events
found by SVM which were missed by the Boosted Trees
839: (for a given set of thresholds on the SVM and Boosted Tree outputs).
840: We attempted
841: to recover these events by combining the output of these two methods.
842: Several combinations were tried:
843: \begin{itemize}
844: \item Keep events which passed thresholds for either classifier.
845: \item Perform Fisher Discriminant Analysis on the output of the two
846:     classifiers.
847: \item Split the SVM {\it vs.}~Boosted Tree output space into sub-regions
848:     of signal and background.  This is conceptually similar to forming
849:     a decision tree in this 2D space and accounts for non-linear correlations
850:     in the SVM {\it vs.}~Boosted Tree output.
851: \item Use the output of SVM as an additional feature input for 
852:     Boosted Trees.
853: \end{itemize}
854: None of these methods produced results which outperformed the Boosted
855: Trees alone.  Although the combined classifiers could identify signal
856: events which would have been missed by just one classifier, the combination
857: also brought in an increased number of background events such that the
858: overall performance was the same or worse than the Boosted Trees alone.
859: This result is not generally true, however.  In other contexts, multiple
860: classifiers have been successfully combined to produce overall more
861: powerful results \citep{dietterich, kittler}.
862: 
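The first combination listed above illustrates why the background
tends to grow.  For two selections $A$ and $B$ (generic labels used
here for illustration) combined with a logical OR, the efficiency for
any class of events is
\begin{displaymath}
\epsilon_{A \cup B} = \epsilon_A + \epsilon_B - \epsilon_{A \cap B}
\ge \max(\epsilon_A, \epsilon_B) ,
\end{displaymath}
where $\epsilon_{A \cap B}$ is the fraction of events passing both
selections.  The relation holds for signal and background alike, so
the recovered signal events generally come at the price of additional
background events that passed only one of the two classifiers.
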
863: \section{Discussion and Conclusion}
864: 
865: This work has shown a variety of object classification methods which provide
866: significantly better performance than is possible with the
867: method of threshold cuts used by most current supernova searches.
868: The implementations studied here used common defaults, such as
869: using the Gini parameter for optimizing Boosted Decision Trees
870: and the Gaussian kernel for SVM.  There are many variations of
871: these methods, some of which might provide further improved
872: performance.  But even these ``out of the box'' implementations
873: with minimal tuning provided much better performance than threshold cuts.
874: 
875: Any classifier will be limited by the quality and power of the input
876: features provided.  In practice,
877: after an initial round of training and validation,
878: one should study the misclassified events and introduce additional
879: features that distinguish these.  Such iterations are helpful
880: regardless of which classification method is used;
the main purpose of this work is to highlight
new classification methods which maximize the classification
power possible given a set of features.
884: 
885: As with any analysis, there is no substitute for clean data and a
886: well understood detector.  Problems which arise from false-positive
887: detections should first be addressed at the level of the detector
888: and data processing pipeline.  Future projects will hopefully have
889: the resources to address spurious detections at this level to make
890: the process easier for their object classifiers.
891: But even high quality, well
892: understood detectors and advanced image processing pipelines
893: such as SDSS will face signal {\it vs}.~background
894: classification problems, and this is where the methods described
895: in this paper come into play.
896: 
897: In addition to improved background rejection power, 
898: these new methods also have the advantage of generating
899: a single number which ranks the quality of an object rather than
900: a boolean pass/fail decision.  One may then adjust a threshold cut 
901: on that single number to tune the desired tradeoff between purity
902: and completeness.  Future surveys may publish transient alerts
903: using relatively loose quality requirements; subscribers to these
904: alerts can then place their own cuts on this quality rank to adjust
905: the purity, completeness, and input data rate as needed.
906: 
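In terms of selection efficiencies (notation introduced here for
illustration), if a cut on the quality rank keeps a fraction
$\epsilon_S$ of the $N_S$ true signal objects and a fraction
$\epsilon_B$ of the $N_B$ background objects, then the completeness is
$\epsilon_S$ and the purity is
\begin{displaymath}
P = \frac{\epsilon_S N_S}{\epsilon_S N_S + \epsilon_B N_B} .
\end{displaymath}
When $N_B \gg N_S$, as is the case when background objects greatly
outnumber real supernovae, small changes in $\epsilon_B$ can change the
purity substantially.
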
907: Boosted Trees, Random Forests, and Support Vector Machines all provide
908: much better object classification performance than traditional threshold
909: cuts.  When applied to the SNfactory supernova search pipeline, Boosted
910: Trees enabled us to find more supernovae with less work: Our efficiency
911: for finding real supernovae increased while our workload for scanning
912: non-supernova objects dramatically decreased.
913: Methods such as these will
914: be crucial for maintaining reasonable false positive rates at
915: the automated transient alert pipelines
916: of upcoming projects such as PanSTARRS and LSST.
917: 
918: \acknowledgements
919: 
920: We would like to thank
921: G.~Aldering,
922: S.~Bongard,
923: M.J.~Childress,
924: P.~Nugent,
925: and R.~Scalzo for useful conversations and 
926: assistance with scanning our supernova candidates.
927: We also thank the entire Nearby Supernova Factory collaboration for
928: confirmation and followup spectra of our selected candidates and
929: for the use of the search images for this study.
930: The anonymous referee provided many useful comments for which we
931: are grateful.
932: 
933: We are grateful to the technical and scientific staff of the Palomar
934: Oschin telescope, where our supernova search data are obtained.
935: The High Performance Wireless Research and
936: Education Network (HPWREN)\footnote{http://hpwren.ucsd.edu},
937: funded by the National Science Foundation grants 0087344 and 0426879,
938: has provided a consistently reliable
939: network for transferring our large amount of data from Mt. Palomar in
940: a timely manner.
941: 
942: This work was supported in part by the Director,
943: Office of Science, Office of High Energy and Nuclear Physics, of the
944: U.S. Department of Energy under Contract No. DE-FG02-92ER40704,
945: by a grant from the Gordon \& Betty Moore Foundation,
and by National Science Foundation Grant Number AST-0407297.
947: This research used resources of the
948: National Energy Research Scientific Computing Center, which is supported
949: by the Office of Science of the U.S. Department of Energy under
950: Contract No. DE-AC02-05CH11231.
951: 
952: SB would especially like to thank the organizers and hosts of the
953: Statistical Inference Problems in High Energy Physics and Astronomy Workshop
954: held at the Banff International Research Station (BIRS), which is supported by
955: the U.S. National Science Foundation, the Natural Science and
956: Engineering Research Council of Canada, Alberta Innovation, and
957: Mexico's National Council for Science and Technology (CONACYT).
958: 
959: {\it Facilities:}
960: \facility{PO:1.2m (QUEST-II)}
961: 
962: \begin{thebibliography}{}
963: 
964: \bibitem[SNfactory, Aldering et al.(2002)]{snfactory}
965: Aldering, G. et al. (SNfactory) 2002, 
966: \procspie, 4836, 61
967: 
968: \bibitem[Astier et al.(2006)]{snls}
969: Astier, P. et al. (SNLS) 2006,
970: \aap, 447, 31-48
971: 
972: \bibitem[Baltay et al.(2007)]{questcamera}
973: Baltay, C. et al. 2007,
974: (astro-ph/0702590)
975: 
\bibitem[Becker et al.(2007)]{sdss:becker}
977: Becker, A. et al. (SDSS-II Supernova Survey) 2007,
978: %% ``Overview of the SDSS-II Supernova Survey: The First Two Seasons,''
979: \baas~(Seattle, WA)
980: %% presented at the 2007 AAS Meeting, Seattle WA, 7 January 2007
981: 
982: \bibitem[Bellman(1961)]{bellman}
983: Bellman, R.E. 1961, Adaptive Control Processes
984: (Princeton, NJ: Princeton University Press)
985: 
986: \bibitem[Bishop(1996)]{ann:bishop}
Bishop, C.M. 1996, Neural Networks for Pattern Recognition
988: (Oxford University Press)
989: 
990: \bibitem[Breiman(2001)]{rf:breiman}
991: Breiman, L. 2001,
992: ``Random Forests,''
993: University of California, Berkeley, technical report
994: 
995: \bibitem[Breiman et al.(1984)]{breiman}
996: %%% Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J.,
997: Breiman, L., et al. 1984,
998: Classification and Regression Trees
999: (Belmont, CA: Wadsworth International Group)
1000: 
1001: \bibitem[Chen et al.(2005)]{svm:chen05}
Chen, P.-H., Lin, C.-J., and Sch\"olkopf, B. 2005,
1003: %%% ``A tutorial on $\nu$-support vector machines: Research Articles,''
1004: Applied Stochastic Models in Business and Industry, 21(2), 111-136
1005: 
1006: \bibitem[Dietterich(2002)]{dietterich}
1007: Dietterich, T.G. 2002, Ensemble Learning, in
1008: The Handbook of Brain Theory and Neural Networks, Second Edition
1009: (M.A. Arbib, Ed.) (Cambridge, MA: The MIT Press)
1010: 
1011: \bibitem[Fisher(1936)]{fisher}
1012: Fisher, R.A. 1936,
1013: %%% ``The Use of Multiple Measurements in Taxonomic Problems,''
1014: Annals of Eugenics, 7, 179-188
1015: 
1016: \bibitem[Freund \& Schapire(1996)]{bdt:freund96}
1017: Freund, Y., and Schapire, R.E. 1996, 
1018: %%% ``Experiments with a new boosting algorithm,''
1019: Proc COLT, 209-217 (New York: ACM Press)
1020: 
1021: \bibitem[Freund \& Schapire(1999)]{bdt:freund99}
1022: Freund, Y., and Schapire, R.E. 1999, 
1023: %%% ``A Short Introduction to Boosting,''
1024: J. Japan. Soc. for Artif. Intel. 14(5), 771-780
1025: 
1026: \bibitem[Friedman(2001)]{bdt:friedman01}
1027: Friedman, J. 2001,
1028: %%% ``Greedy function approximation: a gradient boosting machine,''
1029: Annals of Statistics 29(5), 1189-1232
1030: 
1031: \bibitem[Friedman, Hastie, \& Tibshirani(2000)]{bdt:friedman00}
1032: Friedman, J., Hastie, T., Tibshirani, R. 2000,
1033: %%% ``Additive Logistic Regression: a Statistical View of Boosting,''
1034: Annals of Statistics, 28(2), 337-407
1035: 
1036: \bibitem[Gini(1921)]{gini}
1037: Gini, C. 1921,
1038: %%% ``Measurement of Inequality and Incomes,''
1039: The Economic Journal 31, 124-126
1040: 
1041: \bibitem[Kittler et al.(1998)]{kittler}
1042: Kittler, J. et al. 1998,
1043: %%% On Combining Classifiers,
1044: IEEE Trans. Pattern Analysis and Machine Intelligence, 20(3), 226
1045: 
1046: \bibitem[K\"oppen(2000)]{koeppen}
1047: K\"oppen, M. 2000,
1048: %%% ``The Curse of Dimensionality,''
1049: 5th Online World Conference on Soft Computing in Industrial Applications
1050: (WSC5), held on the Internet,
1051: %%% September 4-18, 2000.
1052: https://www.npt.nuwc.navy.mil/Csf/papers/hidim.pdf
1053: 
1054: \bibitem[Miknaitis et al.(2007)]{essence}
Miknaitis, G. et al. (ESSENCE) 2007,
1056: (astro-ph/0701043), \apj, submitted
1057: 
1058: \bibitem[Roe et al.(2005)]{miniboone}
1059: Roe, B.P. et al. 2005,
1060: %%% Boosted Decision Trees as an Alternative to 
1061: %%% Artificial Neural Networks for Particle Identification
Nucl. Instrum. Meth. A543, 577-584
1063: 
1064: \bibitem[Romano et al.(2006)]{svm:romano}
1065: Romano, R., Aragon, C., Ding, C. 2006,
1066: %%% ``Supernova Recognition using Support Vector Machines,''
1067: in Proceedings of the 5th International Conference
1068: of Machine Learning Applications (Orlando, FL: IEEE)
1069: %%% http://icmla.cs.csub.edu/icmla06/
1070: %%% http://vis.lbl.gov/$\sim$romano/pubs/sne-svm-icmla06.pdf
1071: 
1072: \bibitem[Vapnik(1998)]{svm:va98}
1073: Vapnik, V. 1998,
1074: Statistical Learning Theory
1075: (Wiley)
1076: 
1077: \bibitem[Zahn \& Roskies(1972)]{zahn}
1078: Zahn, C.T., and Roskies, R.Z. 1972,
1079: %%% ``Fourier descriptors for plane closed curves,''
1080: IEEE Trans. Computers, 21, 269-281
1081: 
1082: 
1083: \end{thebibliography}
1084: 
1085: \clearpage
1086: 
1087: %- Fisher example
1088: \begin{figure}
1089: \centering
1090: \includegraphics{f1.eps}
1091: \figcaption{
1092: Example data which would be well separated using Fisher
1093: Discriminant Analysis.  The two classes of events (open and filled
1094: circles) are not well separated by either feature $A$
1095: or $B$, but their correlation is such that the combination
1096: $A + B$ provides very good separation of the two classes.
1097: \label{fig:fisher}
1098: }
1099: \end{figure}
1100: 
1101: \clearpage
1102: 
1103: %- Artificial Neural Net example
1104: %%% \begin{figure}
1105: %%% \centering
1106: %%% \includegraphics{f2.eps}
1107: %%% \figcaption{
1108: %%% Example structure of an Artificial Neural Network.  The inputs
1109: %%% $x_1 \ldots x_n$
1110: %%% are mapped to an output $O$ via a hidden layer of nodes
1111: %%% $y_1 \ldots y_m$.  The
1112: %%% output of each node is a function of the inputs $x_i$ and a set of
1113: %%% weights $w_{ij}$.  Similarly, the output $O$ is a function of
1114: %%% $y_j$ and another set of weights.
1115: %%% In general there may be multiple hidden layers.
1116: %%% The weights are tuned through iterative training.
1117: %%% }
1118: %%% \label{fig:ann}
1119: %%% \end{figure}
1120: 
1121: %%% \clearpage
1122: 
1123: %- SVM example
1124: \begin{figure}
1125: \centering
1126: \includegraphics{f2.eps}
1127: \figcaption{
1128: Support Vector Machines map an input space of features into a
1129:     higher dimensional space where the separation of classes becomes
1130:     easier.  The separation boundary in the original space
1131:     may be quite complex, even disjoint.  In the higher dimensional
1132:     space, the separation surface is a hyperplane whose parameters are
1133:     entirely determined by the subset of events (the support vectors)
1134:     nearest to the boundary.
1135: \label{fig:ann}
1136: }
1137: \end{figure}
1138: 
1139: \clearpage
1140: 
1141: %- Decision Tree example
1142: \begin{figure}
1143: \centering
1144: \includegraphics{f3.eps}
1145: \figcaption{
1146: Example decision tree which would treat high signal-to-noise objects
1147: differently than low signal-to-noise objects.  In practice, a real
1148: decision tree has many more branches and the same variable can be
1149: used to branch at many different locations with different cut values.
1150: \label{fig:decisiontree}
1151: }
1152: \end{figure}
1153: 
1154: \clearpage
1155: 
1156: %- Method comparison
1157: \begin{figure}
1158: \centering
1159: \includegraphics{f4.eps}
1160: \figcaption{Comparison of Boosted Trees (cyan solid line),
1161:    Random Forest (blue dashed line), SVM (green dotted line),
1162:    and threshold cuts (red dash-dotted line)
1163:    for false positive identification fraction {\it vs.}~true
1164:    positive identification
1165: fraction.  For the threshold cuts, the signal-to-noise ratio,
1166: motion, and shape cuts were varied to adjust signal and background rates.
1167: The red diamond shows the performance of the threshold cuts used
1168: during the SNfactory Summer 2006 search;
1169: the cyan square shows the performance achieved with Boosted Trees
1170: which were used for the Fall 2006 SNfactory search.
1171: The lower right corner of the plot represents ideal performance.
1172: \label{fig:eff}
1173: }
1174: \end{figure}
1175: 
1176: \clearpage
1177: 
1178: %- Leaf size comparison
1179: \begin{figure}
1180: \centering
1181: \includegraphics{f5.eps}
1182: \figcaption{
1183:     A comparison of the performance of 200 boosted trees with varying
1184:     leaf sizes.  10,000 training events were used; the plot shows the
    comparison of leaves with a minimum of $n=5,25,50,100$ events relative
    to the performance of the $n=15$ case.
1187: \label{fig:nperleaf}
1188: }
1189: \end{figure}
1190: 
1191: \clearpage
1192: 
1193: %- N trees comparison
1194: \begin{figure}
1195: \centering
1196: \includegraphics{f6.eps}
1197: \figcaption{
1198:     A comparison of the performance of $N_{\rm tree}=25, 50, 100, 200$
1199:     boosted trees
1200:     with  a minimum of 50 events per leaf (out of 10,000 training events) in
1201:     comparison to the $N_{\rm tree}=400$ case.
1202: \label{fig:ntree}
1203: }
1204: \end{figure}
1205: 
1206: \end{document}
1207: