0705:0705.0493/ms.tex

1:

2: %- File    : ms.tex

3: %- ----------------

4: %- Created : Sun Dec 11 22:32:32 2005

5: %- Authors : SNfactory

6: %-

7:

8: %- Start of Preamble.

9:

10:    \documentclass[12pt,preprint]{aastex}

11: %  \documentclass[manuscript,letterpaper]{aastex}

12: %  \documentclass[preprint2]{aastex}

13: %  \documentclass[preprint2,longabstract]{aastex}

14:

15: \usepackage{bm}

16:

17: \begin{document}

18:

19: \title{%

20:     How to Find More Supernovae with Less Work:

21:     Object Classification Techniques for Difference Imaging

22: }

23:

24: \author{%

25:    S.~Bailey,\altaffilmark{1,4}

26:    C.~Aragon,\altaffilmark{1}

27:    R.~Romano,\altaffilmark{1,2}

28:    R.~C.~Thomas,\altaffilmark{1}

29:    B.~A.~Weaver,\altaffilmark{1,3}

30:    D.~Wong\altaffilmark{1}

31: }

32:

33: \altaffiltext{1}{Lawrence Berkeley National

34: Laboratory, 1 Cyclotron Road, Berkeley, CA 94720}

35: \altaffiltext{2}{Luis W. Alvarez Fellow, National Energy Research Scientific Computing Center, 1 Cyclotron Road, Berkeley, CA 94720}

36: \altaffiltext{3}{University of California, Space Sciences Laboratory,

37: Berkeley, CA 94720}

38: \altaffiltext{4}{Corresponding author: sjbailey@lbl.gov}

39:

40:

41: \begin{abstract}

42: We present the results of applying new object classification techniques

43: to difference images in the context of the Nearby Supernova Factory

44: supernova search.

45: Most current supernova searches subtract reference images from new images,

46: identify objects in these difference images, and apply simple threshold cuts on

47: parameters such as statistical significance, shape, and motion

48: to reject objects such as cosmic rays, asteroids, and subtraction

49: artifacts.

50: Although most static objects subtract cleanly, even a very low false

51: positive detection rate can lead to hundreds of

52: non-supernova candidates which

53: must be vetted by human inspection before triggering additional followup.

54: In comparison to simple threshold cuts, more sophisticated methods such as

55: Boosted Decision Trees, Random Forests, and Support Vector Machines

56: provide dramatically better object discrimination.

57: At the Nearby Supernova Factory, we reduced the number of non-supernova

58: candidates by a factor of 10 while increasing our supernova identification

59: efficiency.  Methods such as these will be crucial for maintaining

60: a reasonable false positive rate in the automated transient alert

61: pipelines of upcoming projects such as PanSTARRS and LSST.

62: \end{abstract}

63:

64: \keywords

65: {

66: methods: data analysis ---

67: methods: statistical ---

68: supernovae: general ---

69: techniques: image processing

70: }

71:

72: \section{Introduction}

73:

74: Future large scale survey projects such as

75: PanSTARRS\footnote{http://pan-starrs.ifa.hawaii.edu}

76: and LSST\footnote{http://lsst.org} are expected

77: to generate automated rapid turnaround transient alerts for objects

78: such as supernovae, active galactic nuclei, asteroids, Kuiper belt objects,

79: and variable stars.

80: They will do this by comparing new images to coadded stacks of reference

81: images taken previously.  Repeat observations of the same field will

82: occur over timescales of minutes, hours, days, months, and years.

83: Robust rejection of spurious non-astrophysical objects

84: will be crucial to avoid excessive false positive alerts.

85:

86: A major difficulty of current optical transient programs is the huge

87: number of false positive objects which are difficult to reject while

88: maintaining high selection

89: efficiency for the real objects of interest.  For example, the 2005 Sloan

90: Digital Sky Survey II (SDSS-II) supernova program \citep{sdss:becker}

91: required objects to be detected within 0.6 arcsec in at least two filters

92: with signal-to-noise greater than 3, yet this generated $\sim$4,000 objects

93: per night which needed to be visually checked by humans for verification.

94: Their 2006 search drastically reduced this scanning load by requiring that

95: all but the brightest objects be identified at the same location

96: on multiple nights before they are passed to a human for verification.

97: Although this reduced their scanning load, this method is not

98: applicable to the real-time transient alert pipelines of PanSTARRS and

99: LSST.  A ``60-second transient alert'' would be meaningless if it really

100: meant ``$N$ days plus 60 seconds after the first positive identification.''

101: Although PanSTARRS and LSST will have multiple exposures of a field

102: in the same night, this is equivalent to the multiple-filter requirement

103: of the 2005 SDSS-II program, which was still swamped by false positives.

104:

105: This problem of false positives

106: is not unique to nearby transient searches; it arises whenever a large

107: number of objects are imaged, either from a wide-field survey or a

108: deep narrow survey.  The ESSENCE \citep{essence}

109: and SNLS \citep{snls} Canadian pipeline supernova

110: searches both result in 100--200 objects to

111: scan per night of data.\footnote{Private communication with

112: W.~M.~Wood-Vasey (ESSENCE) and D.~Balam (SNLS).}

113: Although this is a manageable load for a current experiment,

114: it would not scale to future surveys which will image thousands

115: of square degrees per night.\footnote{For comparison,

116:     the SNLS supernova survey covers $\sim1$ square degree per night,

117:     SDSS covers 150 square degrees per night,

118:     and SNfactory covers 350 to 850 square degrees per night.}

119: If current methods were used, the projects would need to drastically

120: reduce their signal efficiency in order to maintain a manageable

121: false-positive rate.

122: The SNLS French pipeline uses multi-night data and an artificial

123: neural net to select candidates for verification, but as noted above,

124: using multi-night information is not applicable to rapid turnaround

125: transient alert pipelines which intend to produce alerts within

126: a minute of the first positive detection.

127:

128: False positives arise from a variety of sources including diffraction spikes,

129: saturated stars, optical ghosts, star halos, cosmic rays, satellite trails,

130: CCD amplifier glow,

131: other CCD artifacts, and image processing artifacts.  In principle all of these

132: effects are best identified and either fixed or masked at the image level.

133: In practice there will always be effects which produce spurious detections.

134: This problem is especially bad at the start of a search when covering new

135: areas of sky, before consistent problems can be identified and masked.

136: The goal of a classifier is to identify the real

137: candidates of interest (signal events) while rejecting the spurious objects

138: (background events).

139:

140: In some cases, real astrophysical variable objects are the background events

141: for other analyses.  For example, asteroids, variable stars,

142: and active galactic nuclei form a background for nearby

143: supernova searches, yet they are the core science for other programs.

144: This paper is written from the context of a nearby supernova search

145: and thus these other

146: astrophysical events are treated as background to reject,

147: but the methods presented here are generally applicable to

148: many object classification problems.

149:

150: This paper presents the results of applying modern machine learning

151: techniques to the supernova search pipeline of the

152: Nearby Supernova Factory \citep{snfactory}.

153: \S \ref{sec:methods} presents a variety of machine learning techniques.

154: \S \ref{sec:SNfactory} describes the Nearby Supernova Factory search,

155: and \S\S~\ref{sec:training} and \ref{sec:software} present

156: the training data and classification software used.

157: \S \ref{sec:comparison} compares the various methods.

158: We find that methods such as

159: Boosted Trees, Random Forests, and Support Vector Machines perform

160: dramatically better than the threshold cuts which are typically used

161: by supernova search programs.

162:

163: \section{Classification Methods}

164: \label{sec:methods}

165:

166: Classification methods identify signal {\it vs.}~background

167: events based upon a set of features (also called variables, attributes,

168: or scores)

169: which describe the events.  For example, objects in photometric images

170: can be described by their magnitude, signal-to-noise, and shape parameters

171: such as width and ellipticity.  These features can be used to distinguish

172: stars from galaxies, cosmic rays, or imaging artifacts.

173:

174: The optimum separation of two classes of events is application dependent,

175: depending upon the desired tradeoff between purity (the fraction of

176: selected events which are real signal), completeness (the fraction of

177: real signal events which are selected), and the total sample size selected.

178: For example, a measurement

179: which depends upon a statistical fit to both signal and background events

180: might optimize the signal-to-noise ratio $\sim S/\sqrt{S+B}$

181: where $S$ and $B$ are the number

182: of real signal and background events which the classifier selects for the fit.

183: A supernova search algorithm, on the other hand,

184: might maximize the purity with the constraint that the completeness

185: remain above 90\%.

186:
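As an illustration of these figures of merit (the numbers below are
hypothetical and are not taken from the SNfactory data; the symbol
$S_{\rm tot}$ is introduced here only for this example), let $S_{\rm tot}$
be the total number of real signal events in a sample, so that
\begin{displaymath}
{\rm purity} = {S \over S + B}, \qquad
{\rm completeness} = {S \over S_{\rm tot}} .
\end{displaymath}
Suppose a sample contains $S_{\rm tot} = 100$ real supernovae and a classifier
selects $S = 95$ of them along with $B = 905$ background objects.
The completeness is 95\%, but the purity is only $95/1000 = 9.5\%$ and
$S/\sqrt{S+B} \approx 3.0$.
A classifier which rejects roughly ten times more background ($B = 90$)
at the same completeness would raise the purity to $95/185 \approx 51\%$
and $S/\sqrt{S+B}$ to $\approx 7.0$.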

187: Some methods, such as threshold cuts (\S \ref{sec:cuts}),

188: produce a boolean signal/background

189: decision and the cuts themselves must be adjusted to optimize the separation.

190: Other methods have an automated training procedure and

191: produce a single statistic which rates how signal-like or

192: background-like a new event is.  The user may then cut on that statistic

193: to optimize the desired figure of merit.

194:

195: Classifier parameters are tuned using a training dataset of known

196: signal and background events to optimize

197: the separation power.  Since the results are influenced by the particular

198: statistical fluctuations of the training dataset, the separation power

199: on the training data itself cannot be used as

200: a fair measure of the power of a classifier.  Instead, a separate validation

201: set is used to assess the performance.  If enough training data are available,

202: one uses one dataset to train a variety of classifiers with

203: different parameters, a second dataset to select which set of parameters

204: produces the best classifier, and a third dataset to validate the

205: final performance.

206: Ideally one trains and validates

207: using real data; in practice simulated data are often used for training

208: and validation before applying the classifier to real data.

209: It is important to note that the quality and power of any classifier

210: will be affected by the accuracy of the training sample.  One must be

211: careful to minimize and measure any biases introduced through a simulated

212: training sample which does not completely reflect real data.

213:

214: \subsection{Threshold Cuts}

215: \label{sec:cuts}

216:

217: Automated supernova searches have typically operated by applying

218: simple threshold cuts to the features describing objects.

219: For example, a supernova search might

220: keep objects which have a signal-to-noise ratio $S/N > 5$,

221: astrometric positions that agree to within 1 arcsec on 2 or more images,

222: and a width consistent with stars on the images ({\it e.g.}, within a

223: factor of 2 of the median width of stars).

224: If an object fails any of these cuts, it is rejected.

225: These cuts are easy to understand but do not reflect

226: the subtleties of a multidimensional space.

227: An object which just barely

228: fails one of the cuts is still rejected the same as an object which

229: fails many cuts.  Threshold cuts also do not naturally handle correlations between

230: the variables, {\it e.g.,} between the $S/N$ and the astrometric

231: accuracy.\footnote{In a simple case such as this, one could combine $S/N$ and

232:     astrometric positions to form an uncorrelated variable; accomplishing this

233:     in the general case for a large number of variables is non-trivial.}

234: To use threshold cuts, one must find uncorrelated variables without

235: significant outliers such that every cut maintains a high signal efficiency

236: while rejecting background.

237:

238: Compared to curved boundary selections, threshold cuts also become

239: an inefficient way to select a subset of a hyperspace as the number

240: of dimensions grows \citep{koeppen}, even for as few as 5 dimensions.

241: For example, in 3 dimensions the volume ratio of a cube to its inscribed

242: sphere is 1.9; in 5 dimensions it is 6.1, and in 10 dimensions

243: it is 401.5.

244: This ratio grows without bound as the number of dimensions increases.

245: Thus if

246: a set of signal events is distributed as an ellipsoid in some feature

247: space, an ellipsoid shaped selection contains much less volume

248: (and thus likely much less background) than the equivalently dimensioned

249: hypercube.

250:
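The quoted ratios can be checked with a short calculation.  This is a sketch
using the standard formula for the volume of an $n$-dimensional ball, which
is not given in the text above.  A hypercube of side $2r$ has volume $(2r)^n$
and its inscribed hypersphere has volume $\pi^{n/2} r^n / \Gamma(n/2+1)$,
so their ratio is
\begin{displaymath}
R_n = {2^n \, \Gamma(n/2 + 1) \over \pi^{n/2}} ,
\end{displaymath}
which gives $R_3 = 6/\pi \approx 1.9$, $R_5 = 60/\pi^2 \approx 6.1$, and
$R_{10} = 122880/\pi^5 \approx 401.5$.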

251: Although commonly used in supernova searches,

252: threshold cuts are widely recognized

253: as being a non-optimal method for signal/background separation problems.

254: The following sections describe a variety of more powerful techniques for

255: identifying supernovae in difference images.

256:

257: \subsection{Multi-dimensional Probability Measures}

258:

259: A more sophisticated approach models the probability distribution

260: function (PDF) for the signal and the background for each of the

261: features.  The combined probability of all of the feature values

262: for an object is used to make the signal/background decision.  This

263: improves over threshold cuts by eliminating rejections based upon a

264: slightly marginal value of a single feature, but it requires a detailed

265: modeling of the PDF of each feature, including all correlations and

266: outliers in the distributions.  This suffers from

267: the ``curse of dimensionality'' \citep{bellman}: since the volume

268: of a hyperspace grows exponentially with the number of dimensions,

269: the size of a training sample must also grow exponentially to

270: adequately determine the PDFs.

271:

272: If the signal and background features are Gaussian distributed with

273: only linear correlations, Fisher Discriminant Analysis \citep{fisher}

274: finds the best linear combination of

275: features to maximize the separation of the two classes.

276: Figure \ref{fig:fisher} shows a toy example of

277: data which would be well separated using Fisher

278: Discriminant Analysis.  The two classes of events (blue triangles

279: and red squares) are not well separated by either feature $A$

280: or $B$, but their correlation is such that the combination

281: $A + B$ provides very good separation of the two classes.

282:

283: More generally, if a set of events $\{{\bf x}\}$ in some feature space

284: have means ${\bm \mu}_{0,1}$ and covariances

285: $\Sigma_{0,1}$ for classes 0 and 1, then a linear combination

286: ${\bf w} \cdot {\bf x}$

287: will have means ${\bf w} \cdot {\bm \mu}_{0,1}$ and variances

288: ${\bf w}^T \Sigma_{0,1} {\bf w}$, where ${\bf w}$ is a set of

289: coefficients defining a linear combination of the features.

290: The separation of the two classes may be defined as

291: \begin{equation}

292: \Delta = {({\bf w} \cdot {\bm \mu}_0 - {\bf w} \cdot {\bm \mu}_1)^2 \over

293:     {\bf w}^T \Sigma_0 {\bf w} + {\bf w}^T \Sigma_1 {\bf w} },

294: \end{equation}

295: {\it i.e.}, the squared separation of the projected means is measured in

296: units of the summed projected variances.

297: Fisher showed that the maximum separation is achieved when

298: \begin{equation}

299: {\bf w} = (\Sigma_0 + \Sigma_1)^{-1} ({\bm \mu}_1 - {\bm \mu}_0).

300: \end{equation}

301: The means and covariances of the signal (1) and background (0) classes

302: may be estimated from a training sample, and thus the calculation of

303: the best linear combination for separating the classes

304: is simply a matrix inversion.  This method

305: breaks down when there are non-linear correlations or when there are

306: significant outliers or otherwise non-Gaussian variances such that a

307: simple mean

308: and covariance is not a good descriptor of the feature distributions

309: for the two classes.  In practice, Fisher Discriminant Analysis is

310: most often used to combine several linearly correlated features

311: into a single feature to reduce the dimensionality of a problem before

312: applying another classification method.

313:
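As a worked illustration of the Fisher solution (the numbers are
hypothetical, chosen to resemble the toy data described above, and are not
read off the figure), consider two anticorrelated features $A$ and $B$ with
\begin{displaymath}
{\bm \mu}_0 = \left( \begin{array}{c} 0 \\ 0 \end{array} \right), \quad
{\bm \mu}_1 = \left( \begin{array}{c} 1 \\ 1 \end{array} \right), \quad
\Sigma_0 = \Sigma_1 =
\left( \begin{array}{cc} 1 & -0.9 \\ -0.9 & 1 \end{array} \right) .
\end{displaymath}
Then
\begin{displaymath}
{\bf w} = (\Sigma_0 + \Sigma_1)^{-1} ({\bm \mu}_1 - {\bm \mu}_0)
= {1 \over 0.76}
\left( \begin{array}{cc} 2 & 1.8 \\ 1.8 & 2 \end{array} \right)
\left( \begin{array}{c} 1 \\ 1 \end{array} \right)
= \left( \begin{array}{c} 5 \\ 5 \end{array} \right) ,
\end{displaymath}
{\it i.e.}, ${\bf w} \propto (1,1)$, which is the combination $A + B$.
Since $\Delta$ is unchanged when ${\bf w}$ is rescaled, only the direction
of ${\bf w}$ matters.
Using feature $A$ alone, ${\bf w} = (1,0)$, gives
$\Delta = 1^2/(1+1) = 0.5$, while ${\bf w} = (1,1)$ gives
$\Delta = 2^2/(0.2+0.2) = 10$.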

314: \subsection{Decision Trees}

315:

316: Decision trees \citep{breiman}

317: separate signal from background events by making a

318: cascading set of event splits as shown in Figure \ref{fig:decisiontree}.

319: This forms a generalization of threshold cuts by

320: selecting many hypercubes in the multi-dimensional feature space

321: rather than a single hypercube of cuts.  The training procedure

322: described below automatically selects the features and cut values

323: to generate a tree with maximal separation of signal and background

324: events.

325:

326: The training procedure begins with a sample of training events and

327: considers all features and cut values to form

328: two subsets with the best separation of signal and background.

329: The procedure is recursively applied to each of the subsets to form

330: further branches.  The recursion is stopped when some condition is

331: met, {\it e.g.}, the subset is entirely signal or background, or

332: the subset has reached a minimum allowed size

333: (a minimum size requirement prevents

334: overtraining on statistical fluctuations of small samples).

335: The terminal nodes which are not further split are called leaves,

336: and are assigned as either signal or background leaves depending

337: upon the training events which ended up on those leaves.

338:

339: There are a variety of ways to define the best separation at each

340: split; for this study we used the Gini parameter \citep{gini, breiman},

341: which is widely used and provides robust performance.

342: Define the purity of a sample of training events as

343: \begin{equation}

344: P = {\sum_S w_S \over \sum_S w_S + \sum_B w_B}

345: \end{equation}

346: where the sums are over the signal events $S$ and background events $B$

347: and $w_i$ are a set of event weights.  Typically all of the weights

348: are the same and their absolute normalization is arbitrary.

349: If needed, relative weights may be used to increase

350: the influence of an underrepresented subsample of the training data.

351: The role of weights will be more important in the Boosted Trees

352: method described in \S \ref{sec:boostedtrees}.

353: Note that $P=1$ for a sample of pure signal events, $P=0$ for a sample of pure

354: background events, and $P(1-P)=0$ for a sample which is either purely

355: signal or purely background.

356:

357: Define

358: \begin{equation}

359: {\rm Gini} = P(1-P) \sum_{i=1}^n w_i

360: \end{equation}

361: where the sum is over all events in that sample.

362: At each node, the training procedure considers all possible

363: features and cut values to minimize the quantity

364: \begin{equation}

365: {\rm Gini}_{\rm left\ child} +

366: {\rm Gini}_{\rm right\ child}

367: \end{equation}

368: to find the best separation of events.

369: If this split would not increase

370: the overall quality of the tree, {\it i.e.},

371: \begin{equation}

372: {\rm Gini}_{\rm parent} <

373: {\rm Gini}_{\rm left\ child} +

374: {\rm Gini}_{\rm right\ child}

375: \end{equation}

376: then the node is left as a leaf node, assigning it as a signal leaf

377: if $P>0.5$ and a background leaf otherwise.  If the split would increase

378: the overall quality of the tree, the events are split into two nodes

379: and the procedure is recursively applied to each of those nodes until

380: the stopping conditions are met ({\it e.g.}, minimum leaf sizes) or no

381: splits can be found which would improve the overall quality of the tree.

382:
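A short numerical illustration of the split criterion may be useful
(the event counts are hypothetical).
With unit weights, a node containing $S$ signal and $B$ background events
has $P = S/(S+B)$ and thus ${\rm Gini} = SB/(S+B)$.
A parent node with $S = B = 50$ has ${\rm Gini}_{\rm parent} = 25$.
A cut which sends 40 signal and 10 background events to the left child
and the remaining 10 signal and 40 background events to the right child gives
\begin{displaymath}
{\rm Gini}_{\rm left\ child} + {\rm Gini}_{\rm right\ child}
= {40 \times 10 \over 50} + {10 \times 40 \over 50}
= 16 < {\rm Gini}_{\rm parent} = 25 ,
\end{displaymath}
so the split improves the tree and is kept.  If no further splits were made,
the left child would become a signal leaf ($P = 0.8$) and the right child
a background leaf ($P = 0.2$).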

383: Decision trees are a generalization of threshold cuts and thus have more

384: flexibility to optimally select a set of signal events within a feature space.

385: However, single decision trees tend to be unstably dependent upon

386: the details of the

387: training set.  A small change in the training set can produce a

388: considerably different tree and thus a considerably different performance

389: on the validation set.

390:

391: \subsubsection{Boosted Trees}

392: \label{sec:boostedtrees}

393:

394: Boosting algorithms improve the performance of a classifier by

395: giving greater weight to events which are hardest to classify.  In the

396: case of decision trees, a tree is trained on a set of data,

397: misclassified events are identified and their weights are increased, and the

398: process is repeated to form new trees.  This iteratively produces

399: a set of decision trees of increasing quality.

400: The final classifier uses the weighted ensemble average

401: of all of the trees to make a classification decision.  The boosting

402: provides decision trees with better separation power, and the ensemble

403: average washes out the training instabilities associated with single

404: decision trees.  In applications with $\sim$20 or more input features,

405: Boosted Decision Trees can provide significantly better results than

406: Artificial Neural Networks \citep{miniboone}; see also \S \ref{sec:ann}.

407:

408: There are a variety of boosting algorithms used to increase the weights

409: of misclassified events

410: \citep{bdt:freund96, bdt:friedman01, bdt:friedman00}.

411: We describe here the commonly used

412: Discrete AdaBoost method \citep{bdt:freund96}.

413: Define the error rate for tree $m$ as

414: \begin{equation}

415: {\rm err}_m = {\sum_{i=1}^{N} w_i I_i \over \sum_{i=1}^N w_i}

416: \end{equation}

417: where $I_i = 0$ if event $i$ is correctly classified and $I_i = 1$

418: if it is incorrectly classified.  Typically the first tree is trained

419: with the same weight for all events.

420: Then adjust each of the event weights using

421: \begin{eqnarray}

422: \alpha_m    & = & \beta \times \ln[ (1-{\rm err}_m) / {\rm err}_m ]     \\

423: w_i & \to & w_i \times e^{\alpha_m I_i}

424: \end{eqnarray}

425: This increases the weights of misclassified events; the weights are

426: increased more when the tree has a low error rate.

427: These new weights are then used to generate a new decision tree.

428: The standard AdaBoost algorithm uses $\beta=1$ but this can be adjusted to

429: vary how quickly the weights are updated with each iteration.

430:
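As a numerical illustration of this update (hypothetical numbers, with
$\beta = 1$), suppose a tree trained on $N = 100$ events of unit weight
misclassifies 20 of them.  Then
\begin{eqnarray*}
{\rm err}_m & = & 20/100 = 0.2 \\
\alpha_m    & = & \ln(0.8/0.2) = \ln 4 \approx 1.39 \\
w_i & \to & w_i \times e^{\alpha_m} = 4 w_i
    \quad \mbox{(misclassified events only)}
\end{eqnarray*}
The 20 misclassified events now carry a total weight of 80, equal to that
of the 80 correctly classified events.
This holds in general for $\beta = 1$: the misclassified weight fraction
${\rm err}_m$ is multiplied by $(1-{\rm err}_m)/{\rm err}_m$, so the next tree
is trained on a sample in which the previously misclassified events carry
half of the total weight.
With $\beta = 0.5$ the same tree would give $e^{\alpha_m} = 2$, and the
misclassified events would carry one third of the total weight.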

431: After generating $M$ individual trees with weights $\alpha_m$,

432: the final classifier answer for an event described by

433: a set of features ${\bf x}$ is

434: \begin{equation}

435: T({\bf x}) = \sum_{m=1}^M \alpha_m T_m({\bf x})

436: \end{equation}

437: where $T_m({\bf x})$ is the result for tree $m$:

438: 0 if ${\bf x}$ lands

439: on a background leaf and +1 for a signal leaf.

440: The absolute normalization of $T({\bf x})$ is arbitrary; we

441: chose to renormalize the $\alpha_m$ weights such that

442: $0 \le T({\bf x}) \le 1$.

443:
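One way to implement this renormalization, shown here only as a sketch
since the explicit form is not given above, is to divide by the sum of
the tree weights,
\begin{displaymath}
T({\bf x}) = {\sum_{m=1}^M \alpha_m T_m({\bf x}) \over \sum_{m=1}^M \alpha_m} ,
\end{displaymath}
which lies between 0 and 1 provided every $\alpha_m > 0$, {\it i.e.},
every tree has ${\rm err}_m < 0.5$.
For example, three hypothetical trees with error rates 0.2, 0.3, and 0.4
and $\beta = 1$ have $\alpha_m \approx 1.39$, 0.85, and 0.41.
An event which lands on signal leaves in the first and third trees and on
a background leaf in the second receives
$T({\bf x}) \approx (1.39 + 0.41)/2.65 \approx 0.68$.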

444:

445: \subsubsection{Random Forests}

446:

447: Random Forests \citep{rf:breiman}

448: also generate multiple decision trees for a given

449: training set and use a weighted average of the trees as the final

450: decision metric.  When training a tree, at each branch the

451: training cycle only considers a random subset of the possible features

452: available to use.  This has the effect of washing out

453: the typical training instabilities of decision trees and produces

454: a classifier which is fast to train and robust against outliers.

455:

456: \subsection{Support Vector Machines}

457:

458: The Support Vector Machine (SVM) algorithm is a classification

459: method that has successfully been

460: applied to many pattern recognition problems and is founded on

461: principles of statistical learning theory~\citep{svm:va98, svm:chen05}.

462: It nonlinearly

463: maps data points from the original feature space to a higher-dimensional

464: embedding space in which an optimal hyperplane parameterized by a normal

465: vector ${\bf w}$ and offset $b$ is computed such that the separation between

466: events in different classes is maximized. The linear decision boundary

467: is defined as $ f({\bf x}) = {\bf w} \cdot {\bf \phi}({\bf x}) + b$,

468: where ${\bf x}$ is a vector in the feature space which describes objects and

469: ${\bf \phi}$ is a mapping which embeds the problem into a

470: higher-dimensional space in which classes are more easily

471: separable than in the original feature space.

472:

473: An optimization problem is

474: constructed to find the unknown hyperplane parameters, and

475: the optimal hyperplane normal ${\bf w}$ is found to be entirely

476: determined by the subset of events nearest to the optimal

477: decision boundary (also called support vectors, ${\bf x}_i$) as

478: follows:

479: ${\bf w} = \sum_i c_i \phi({\bf x}_i)$,

480: where the coefficients $c_i$ are

481: the Lagrange multipliers used in solving the nonlinear

482: optimization and are a byproduct of the optimization.

483:

484: %%% The subset of events nearest to the decision boundary

485: %%% (the ``support vectors'' ${\bf x}_i$) determine the normal vector

486: %%% ${\bf w} = \sum_i c_i \phi({\bf x}_i)$.

487:

488: The hyperplane parameters are solved by maximizing the margin

489: (the distance between the hyperplane and the example events in each class),

490: which is formulated as a nonlinear

491: constrained optimization problem, where the constraints

492: enforce that examples from different classes lie on opposite

493: sides of the hyperplane.

494: The objective function to be minimized is convex, {\it i.e.},

495: any minimum found is the global minimum and there are no spurious local minima.

496: The linear decision boundary corresponds to a nonlinear

497: (and possibly disjoint)

498: decision boundary in the original feature space. Once the hyperplane is

499: found, a set of features ${\bf x}$ is typically classified into one of

500: the two classes by applying a threshold cut to $f({\bf x})$.

501:

502: Rather than calculating $\phi({\bf x})$ explicitly while

503: evaluating

504: ${\bf w} \cdot \phi({\bf x}) =

505: \sum_i c_i \phi({\bf x}_i) \cdot \phi({\bf x})$,

506: the actual embedding is

507: achieved through a kernel function defining an inner product in the

508: embedding space,

509: $k({\bf x}_1,{\bf x}_2) = {\bf \phi}({\bf x}_1) \cdot

510: {\bf \phi}({\bf x}_2)$.

511: This ``kernel trick'' makes class prediction easy to implement

512: and fast to compute.  Several common kernel mappings are

513: given in \cite{svm:chen05}.

514: In practice, the kernel function is typically chosen empirically

515: via training and testing, and the simplest function giving the

516: desired performance is used.  The Gaussian kernel used in this analysis

517: \begin{equation}

518: k({\bf x}_1,{\bf x}_2) = \exp( -||{\bf x}_1 - {\bf x}_2||^2 / 2 \sigma^2)

519: \end{equation}

520: is commonly used because it only has one free parameter to be

521: tuned ($\sigma$)

522: and empirically performs as well as, if not better than, more

523: complex kernels which may overfit the data.

524:
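Combining the expressions above, the decision function can be written
entirely in terms of the kernel, without evaluating $\phi$:
\begin{displaymath}
f({\bf x}) = \sum_i c_i \, k({\bf x}_i, {\bf x}) + b
= \sum_i c_i \exp( -||{\bf x}_i - {\bf x}||^2 / 2 \sigma^2) + b ,
\end{displaymath}
where the sum runs over the support vectors and the second form assumes the
Gaussian kernel.
As a rough guide to the role of $\sigma$, each support vector contributes
with full weight $c_i$ at zero distance, with weight
$e^{-1/2} c_i \approx 0.61\, c_i$ at a distance $\sigma$, and with weight
$e^{-2} c_i \approx 0.14\, c_i$ at a distance $2\sigma$.
Thus $\sigma$ sets the scale in the input features over which a training
event influences the classification of a new event.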

525: %%% The most commonly used kernel, the Gaussian kernel,

526: %%% corresponds to a $\phi({\bf x})$ transformation into an infinite

527: %%% dimensional space.

528:

529: For this analysis we used a soft-boundary SVM method

530: called $C$-SVM, which handles noisy data

531: with high class overlap by adding a regularization term to the objective

532: function.  This term allows but penalizes training

533: points lying on the wrong side

534: of the decision boundary. The regularization parameter, $C$,

535: controls the trade-off between maximizing the separation

536: and allowing some amount of training error while finding the

537: hyperplane which maximally separates signal from background.

538:

539: The advantages of SVMs include the existence of a unique solution, the

540: simple geometric interpretation of the margin maximization function, the

541: capacity to compute arbitrary nonlinear decision boundaries while

542: controlling over-fitting with soft margins, the low number of parameters

543: to be tuned (as few as two, depending on the choice of kernel), and the

544: dependence of the solution on only a small number of data points

545: (the support vectors)

546: which define the boundary of the class separation hypersurface.

547: For SVM implementation details, see \cite{svm:va98}.

548:

549: \subsection{Artificial Neural Networks}

550: \label{sec:ann}

551:

552: Artificial Neural Networks (ANNs) are a broad category of

553: classification methods

554: originally inspired by the interconnected structure of neurons

555: and synapses in the brain.

556: These methods map a set of input variables to one or more output results

557: via one or more ``hidden layers'' of intermediate nodes.

558: % as shown in Figure \ref{fig:ann}.

559: For a supernova search, the inputs would be the features describing

560: each object and the desired output would be 1~(0) for signal~(background).

561: For an overview of these methods, see \cite{ann:bishop}.

562:

563: %%% A commonly used form of an ANN is a Multilayer Perceptron, whose nodes

564: %%% map their inputs $x_i$ into their outputs $y_j$

565: %%% based upon a set of weights $w_{ij}$ and a step-like function such as

566: %%% $y_j = 1 / (1 + e^{-\sum w_{ij} x_i})$.

567: %%% These intermediate results are then mapped to the final output (optionally

568: %%% via additional hidden layers) using

569: %%% another set of weights $v_j$: $O = \sum v_j y_j$.

570: %%% ANNs are iteratively trained on multiple training

571: %%% sets, updating the weights with each cycle to minimize the error of

572: %%% the output value.

573: %%% The performance is evaluated at each cycle on an independent verification

574: %%% sample and the training is stopped when the performance begins to get

575: %%% worse, indicating that the ANN is becoming overtrained.

576:

577: ANNs can be powerful classifiers and have been used in many applications,

578: though they are slow to train and require some experimentation to

579: optimize the number of hidden layers and nodes to match a given problem.

580: They also do not scale well with an increasing number of input

581: features, and their results become unstable when there are significant

582: outliers or otherwise irrelevant input data.

583: For these reasons,

584: ANNs were not deemed to be an appropriate classification method for

585: our dataset and this method was not

586: pursued for this study.

587:

588: \section{Nearby Supernova Factory Search}

589: \label{sec:SNfactory}

590:

591: The Nearby Supernova Factory \citep{snfactory} search uses

592: data from the Near Earth

593: Asteroid Tracking (NEAT) program\footnote{http://neat.jpl.nasa.gov}

594: and the Palomar QUEST consortium\footnote{http://hepwww.physics.yale.edu/quest/palomar.html}

595: using the 112-CCD QUEST-II camera \citep{questcamera}

596: on the Palomar Oschin 1.2-m telescope.

597: The NEAT observing pattern obtains triplets of 60-second exposures spread

598: over a time period of $\sim$1 hour using a single RG610 filter,

599: which is a long pass filter redward of 610 nm.

600: This allows the search to distinguish between asteroids, whose motion is

601: typically detectable

602: on that timescale, and spatially static objects such as supernovae.

603: The QUEST data are obtained in 4 filters in driftscan mode; our search

604: uses the two filters

605: which cover the best quality CCDs

606: (either Bessell $R$ and $I$ or Gunn $r$ and $i$

607: depending upon camera configuration).

608: The QUEST data cover less area and tend to be

609: cosmetically cleaner than the NEAT data, resulting in fewer spurious

610: detections overall.

611: Since false positive background

612: events are much more numerous in the NEAT data, our study of alternative

613: classification methods has focused on the NEAT dataset.

614:

615: Coadded stacks of images taken from 2000 to 2003 are used as references.

616: The new and reference

617: images are convolved to match their point-spread-functions (PSFs),

618: the fluxes are normalized by

619: matching stars, and the reference is subtracted from the new

620: images.  Objects in the subtraction are identified based upon contiguous

621: pixels with $S/N > 3$ with at least one pixel with $S/N > 5$.

622: Objects are described by features such as position,

623: full width at half maximum (FWHM) in $x$ and $y$,

624: aperture photometry and associated uncertainties in 3 apertures,

625: distance to nearest object

626: in the reference coadd, and measures of the roundness and irregularity

627: of the object contour

628: based upon Fourier descriptors \citep{zahn}.

629: Additional features are formed as combinations of features from

630: the same object observed on multiple images.

631: Combined features include

632: the object motion between two images and the consistency of the statistical

633: significance of the measurements in different images.  The features are

634: used by a classification method (originally threshold cuts, more recently

635: Boosted Decision Trees)

636: to select supernova candidates of interest which are then visually

637: scanned by humans to select the best candidates for spectroscopic

638: confirmation and followup by the SuperNova Integral Field Spectrometer

639: (SNIFS) \citep{snfactory}, on the University of Hawaii 2.2-m telescope on

640: Mauna Kea.

641:
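As an illustrative restatement of the detection criterion described above
(the symbols $\mathcal{C}$ and $(S/N)_p$ are notation introduced here and are
not taken from the search pipeline), an object is a set $\mathcal{C}$ of
contiguous pixels $p$ in the subtraction satisfying
\begin{displaymath}
(S/N)_p > 3 \quad \mbox{for all } p \in \mathcal{C},
\qquad
\max_{p \in \mathcal{C}} \, (S/N)_p > 5 .
\end{displaymath}
The lower threshold sets the extent of the object, while the higher threshold
requires at least one significant pixel before the object is recorded.
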

642: \section{Training Dataset}

643: \label{sec:training}

644:

645: To generate signal events for training, fake supernovae were

646: introduced into the images by moving real stars of a desired magnitude

647: to locations distributed about known galaxies on the SNfactory search

648: images.  By using real stars from the same images as the galaxies,

649: we realistically model the point-spread-function, noise, and possible

650: image artifacts present in that image.  These images with fake supernovae

651: were processed with the same data analysis pipeline as real images

652: to identify objects and measure their features for classification.

653:

654: The stars are sampled in a circular region of 20 pixels in diameter.

655: The typical FWHM of stars on these images is about 3 pixels,

656: so this samples the PSF out to $\sim8\sigma$.

657: The average sky level of the image is subtracted

658: from the sampled pixels before they are added in the new location, which

659: implicitly assumes that the sky level is uniform over the image.

660: This is valid in

661: most cases, and it is simple to reject fakes created from cases that

662: violate this assumption.  Typically these fakes will have a FWHM that

663: differs very significantly from the average ($\sim$20 pixels vs. $\sim$3).

664: Most stars are sufficiently isolated that they do not bring along portions of

665: other objects, and we identify and reject cases where this does

666: happen.  The spatial variation of the PSF is minimal compared to the

667: night-to-night variations which must be addressed by the image subtraction

668: pipeline, thus this fake supernova generation procedure does not attempt

669: to correct for the small spatial variations of the PSF across the CCD.

670: %%% Since the chosen star could come from anywhere on the image,

671: %%% the average distance shifted is $\sim$850 pixels with a large variance.

672: %%% For comparison, the image size is $2400 \times 598$ pixels or

673: %%% $34.8' \times 8.7'$.

674:
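The $\sim8\sigma$ figure quoted above follows from a short calculation if the
PSF is approximated as a circular Gaussian (an assumption made here for
illustration only; real PSFs typically have broader wings):
\begin{displaymath}
\sigma = \frac{\mathrm{FWHM}}{2\sqrt{2\ln 2}}
       \approx \frac{3~\mathrm{pixels}}{2.355}
       \approx 1.27~\mathrm{pixels},
\qquad
\frac{r}{\sigma} \approx \frac{10~\mathrm{pixels}}{1.27~\mathrm{pixels}}
       \approx 7.8 ,
\end{displaymath}
where $r = 10$ pixels is the radius of the sampling region of 20 pixels in
diameter.
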

675: Background events were randomly selected from 4.1 million

676: other objects identified on the subtractions with fake supernovae.

677: These subtractions covered a month of data taking including bright

678: and dark times and a variety of seeing conditions.

679: Objects within 20 pixels of a fake supernova were excluded to

680: avoid any artifacts which might be introduced through an ill-formed fake.

681: These background events form a randomly selected subset of the genuine

682: backgrounds faced by the supernova search in the real data, and thus

683: represent the real fractions of each type of background event faced.

684:

685: The signal and background samples were split into training and validation

686: subsets.  Several training sets were formed with 5,000 signal and 5,000

687: background events each.  The final validation was performed using 20,000

688: signal and 200,000 background events.  The training dataset for the

689: Support Vector Machine method was augmented with real supernova

690: discoveries in an attempt to improve its overall performance.

691: The original training set used 19 features; an additional 13 features

692: were then added which improved the performance of the Boosted Trees

693: and Random Forests but decreased the performance of the SVM.

694: The results shown in \S \ref{sec:comparison} are for the best performance

695: achieved for each classifier ({\it i.e.}, using 19 features for SVM

696:         and 32 features for the other methods).

697:

698: \section{Classification software}

699: \label{sec:software}

700:

701: For Fisher Discriminant Analysis, Boosted Trees, and Random Forests,

702: we used the open-source C++ software

703: package StatPatternRecognition.\footnote{

704: http://sourceforge.net/projects/statpatrec}

705: Training a set of 200 boosted trees

706: using 10,000 training events with 19

707: features each with a minimum leaf size of 15

708: events\footnote{The selection of the number of trees and minimum leaf

709:     size is described in \S \ref{sec:paramchoice}.}

710: takes $\sim$3 minutes (wallclock) on a 2 GHz AMD Opteron CPU.

711: Training with 32 features takes $\sim$45 minutes.

712: Once the trees are trained, it takes approximately 0.6 ms (wallclock)

713: to evaluate the results for an object.

714: Random Forests took less than 2 minutes to train on the dataset

715: with 32 features, using the same parameters as above.  Evaluation

716: of a new event takes approximately 0.2 ms.

717: Fisher Discriminant Analysis took 1 second to train on the

718: 32 feature dataset and 0.07 ms to evaluate results for a new object.

719:

720: For SVM, we used the LIBSVM C++ package.\footnote{

721: http://www.csie.ntu.edu.tw/$\sim$cjlin/libsvm}

722: Training a $C$-SVM

723: using 10,000 training events with 19

724: features each takes from 5 to 15 seconds (wallclock)

725: on a 2 GHz AMD Opteron CPU,

726: depending on the settings of the two parameters used (if the parameters

727: overfit the data, more support vectors are needed so training time

728: increases).  Evaluating the SVM on a new data point takes approximately

729: 0.6 ms (wallclock).

730:
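For scale, the per-object evaluation times quoted above can be converted into
a rough total for a month of data.  This is a sketch under stated assumptions:
serial evaluation on a single CPU of the kind quoted above, with the
4.1 million objects of \S \ref{sec:training} taken as a representative
monthly object count $N_{\rm obj}$ (a symbol introduced here).
For Boosted Trees or SVM at 0.6 ms per object,
\begin{displaymath}
N_{\rm obj} \, t_{\rm eval}
  \approx (4.1 \times 10^{6}) \times (0.6 \times 10^{-3}~\mathrm{s})
  \approx 2.5 \times 10^{3}~\mathrm{s}
  \approx 41~\mathrm{minutes}.
\end{displaymath}
The same estimate gives $\sim$14 minutes for Random Forests (0.2 ms per
object) and $\sim$5 minutes for Fisher Discriminant Analysis (0.07 ms per
object), all of which are small compared with a month of data taking.
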

731: \section{Comparison of Methods}

732: \label{sec:comparison}

733:

734: In the end, most classification methods produce a single classification

735: statistic with arbitrary normalization which

736: rates how signal-like or how background-like the candidate is.

737: A threshold cut on this statistic can be used to select a subsample of

738: events with desired signal {\it vs.}~background purity.

739: A useful way to visualize the power of a classifier is

740: to plot the fraction of false positives ({\it i.e.}, background

741: events incorrectly classified as signal) {\it vs}.~fraction of true positives

742: ({\it i.e.}, signal events correctly identified)

743: for various selection values on the classification statistic.

744:
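In symbols (notation introduced here for illustration), let $s$ be the
classification statistic, with larger values taken to be more signal-like,
and let $t$ be the selection threshold.  For validation samples containing
$N_S$ signal and $N_B$ background events, the true positive and false
positive fractions are
\begin{displaymath}
\epsilon_S(t) = \frac{N_S(s > t)}{N_S},
\qquad
\epsilon_B(t) = \frac{N_B(s > t)}{N_B},
\end{displaymath}
where $N_S(s > t)$ and $N_B(s > t)$ count the events passing the cut.
Varying $t$ traces a curve in the $(\epsilon_S, \epsilon_B)$ plane, and each
point on the curve is one possible operating point for the search.
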

745: Figure \ref{fig:eff} shows the performance of several classification

746: methods applied to the SNfactory dataset.  The red diamond shows the

747: performance of the original threshold cuts upon which we were working

748: to improve.  The curves show that SVM, Random Forests, and Boosted Trees

749: all performed dramatically better than the threshold cuts

750: across a wide range of signal and background efficiencies.

751: The object features in our data have significant outliers which

752: prevented Fisher Discriminant Analysis

753: from being a useful classification method.

754:

755: The overall best performance was obtained using Boosted Decision Trees,

756: with Random Forests providing nearly as good performance

757: with faster training and evaluation times.

758: Although SVM performed considerably better than threshold cuts, it

759: was not as successful as Random Forests or Boosted Trees.

760: A possible explanation is that the SVM proved to be more sensitive to

761: signal events that lie close to background events in the feature

762: space,

763: and could not strike a balance between modeling such

764: events {\it vs}.~overfitting to noise.

765: For example, a dim young supernova on a bright galaxy can be very similar

766: to a statistical fluctuation or a modest subtraction error in the images.

767: Robustness against overfitting is a known strength of boosted

768: classifiers \citep{bdt:freund99}, and the issue of overfitting

769: noisy data is an area of active research within the machine

770: learning community.

771: Further details of applying SVM to the

772: SNfactory dataset are described in \cite{svm:romano}.

773:

774: Boosted Trees, Random Forests, and SVM successfully reduced the

775: false-positive rate for all types of background events in our data.

776: The three most common remaining background types are

777: faint optical ghosts from scattered light,

778: fluctuations in charge trails from bright stars due to CCD charge

779: transfer inefficiency,

780: and leftover dipoles from subtracting astrometrically misaligned objects.

781: The optical ghosts can be genuinely difficult to distinguish from

782: dim supernovae near our detection threshold.

783: The charge trails and dipoles are easy to distinguish by eye

784: but we currently do not have specific features which directly

785: address these two backgrounds, thus all classification methods

786: have difficulty with them, given the input features currently available.

787: These backgrounds are somewhat described by a roundness

788: feature and comparison of the flux in small {\it vs}.~large apertures.

789: Adding features to directly address these

790: backgrounds would improve the power of any classification method.

791: Projects with higher quality CCDs and optical designs to minimize

792: scattered light will naturally have fewer backgrounds as well.

793:

794: The optimum selection criterion is dependent upon the tradeoff between

795: true positive selection efficiency (horizontal axis) and the false

796: positive selection efficiency (vertical axis).

797: At the SNfactory we seek

798: to maximize our signal efficiency within the realistic constraints of

799: the personnel and telescope time available to vet false positives.  For

800: the Fall 2006 search, we used Boosted Decision Trees and

801: chose a point with 10 times less background

802: than we had previously faced.  This corresponds to an average efficiency

803: of $\sim$78\% for identification of a supernova with a single filter

804: and one night of imaging.

805:
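The effect of the operating point on the scanning workload can be sketched as
follows (illustrative notation, not taken from the SNfactory pipeline).
If a search period contains $N_S$ detections of real supernovae and $N_B$
background objects, and the selection passes fractions $\epsilon_S$ and
$\epsilon_B$ of them respectively, the number of candidates to be scanned is
\begin{displaymath}
N_{\rm scan} = \epsilon_S N_S + \epsilon_B N_B .
\end{displaymath}
When $\epsilon_B N_B \gg \epsilon_S N_S$, {\it i.e.}, when non-supernova
candidates dominate the scanning load, reducing $\epsilon_B$ by a factor of
10 at fixed $N_B$ reduces $N_{\rm scan}$ by nearly the same factor.
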

806: \subsection{Optimizing Boosted Tree Parameters}

807: \label{sec:paramchoice}

808:

809: Since Boosted Trees provided the best classification performance,

810: we describe here how the performance changed with various input

811: parameters to the Boosted Tree training.

812: The performance of Boosted Trees depends upon the number of trees generated

813: and the amount of branching which is done before finishing each tree.

814:

815: For controlling the amount of branching per tree, the StatPatternRecognition

816: package has an adjustable limit on minimum number of events per leaf in

817: the final tree.

818: Our training

819: sample contained 5,000 signal and 5,000 background events.

820: We found the best performance with a minimum leaf size of 15 events,

821: which results in a set of boosted trees with 250--300 leaves each.

822: Figure \ref{fig:nperleaf} shows the relative performance of 200 trees with

823: a minimum leaf size of $n=5, 25, 50, 100$ relative to the $n=15$ case.

824:

825: Figure \ref{fig:ntree} shows the relative performance of

826: $N = 25, 50, 100, 200$ trees in comparison to the $N=400$ case, using

827: a minimum leaf size of 50 events.

828: For the Fall 2006 SNfactory supernova search

829: we chose to use 200 trees with a minimum of 15 events per leaf out of the

830: 10,000 training events.

831:
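As a rough consistency check on the leaf counts quoted above (simple
arithmetic on those values, assuming every training event ends in exactly one
leaf of each tree), a minimum leaf size of 15 events allows at most
\begin{displaymath}
\left\lfloor \frac{10{,}000}{15} \right\rfloor = 666
\end{displaymath}
leaves per tree, while the observed 250--300 leaves per tree correspond to an
average occupancy between $10{,}000/300 \approx 33$ and $10{,}000/250 = 40$
events per leaf.  Under this assumption the trees are therefore not grown all
the way to the minimum leaf size on every branch.
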

832: \section{Combining Methods}

833:

834: As expected, the various methods provided correlated output;

835: {\it i.e.}, events ranked highly by one classifier tended to be ranked

836: highly by another.  But even though Boosted Trees provided the best

837: classification performance overall, there were good signal events which

838: were found by SVM which were missed by the Boosted Trees

839: (for a given set of thresholds on the SVM and Boosted Tree outputs).

840: We attempted

841: to recover these events by combining the output of these two methods.

842: Several combinations were tried:

843: \begin{itemize}

844: \item Keep events which passed thresholds for either classifier.

845: \item Perform Fisher Discriminant Analysis on the output of the two

846:     classifiers.

847: \item Split the SVM {\it vs.}~Boosted Tree output space into sub-regions

848:     of signal and background.  This is conceptually similar to forming

849:     a decision tree in this 2D space and accounts for non-linear correlations

850:     in the SVM {\it vs.}~Boosted Tree output.

851: \item Use the output of SVM as an additional feature input for

852:     Boosted Trees.

853: \end{itemize}

854: None of these methods produced results which outperformed the Boosted

855: Trees alone.  Although the combined classifiers could identify signal

856: events which would have been missed by just one classifier, the combination

857: also brought in an increased number of background events such that the

858: overall performance was the same or worse than the Boosted Trees alone.

859: This result is not generally true, however.  In other contexts, multiple

860: classifiers have been successfully combined to produce overall more

861: powerful results \citep{dietterich, kittler}.

862:
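The behavior of the first combination listed above (keeping events which pass
the threshold of either classifier) can be made explicit with
inclusion-exclusion; the notation is introduced here for illustration.
For either the signal or the background sample, let $\epsilon_{\rm BT}$ and
$\epsilon_{\rm SVM}$ be the fractions of events passing each classifier's
threshold and $\epsilon_{\rm both}$ the fraction passing both.  Then
\begin{displaymath}
\epsilon_{\rm either}
  = \epsilon_{\rm BT} + \epsilon_{\rm SVM} - \epsilon_{\rm both}
  \geq \max(\epsilon_{\rm BT}, \epsilon_{\rm SVM}) .
\end{displaymath}
The same relation holds for signal and background, so any signal recovered by
this combination comes with a false positive fraction at least as large as
that of either classifier alone.  Whether the combination is a net gain
depends on comparing it with a single classifier whose threshold has been
loosened to reach the same signal efficiency.
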

863: \section{Discussion and Conclusion}

864:

865: This work has shown a variety of object classification methods which provide

866: significantly better performance than is possible with the

867: method of threshold cuts used by most current supernova searches.

868: The implementations studied here used common defaults, such as

869: using the Gini parameter for optimizing Boosted Decision Trees

870: and the Gaussian kernel for SVM.  There are many variations of

871: these methods, some of which might provide further improved

872: performance.  But even these ``out of the box'' implementations

873: with minimal tuning provided much better performance than threshold cuts.

874:
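For reference, the two defaults named above are commonly written as follows.
These are standard textbook forms; normalization conventions vary between
software packages, and the symbols $p$ and $\gamma$ are notation used here.
For a tree node in which a weighted fraction $p$ of the events is signal, the
Gini index is
\begin{displaymath}
G = p \, (1 - p) ,
\end{displaymath}
which vanishes for a pure node ($p = 0$ or $p = 1$) and is largest at
$p = 1/2$.  The Gaussian kernel between two feature vectors $\bm{x}$ and
$\bm{x}'$ is
\begin{displaymath}
K(\bm{x}, \bm{x}') = \exp\left( -\gamma \, |\bm{x} - \bm{x}'|^{2} \right) ,
\end{displaymath}
with $\gamma > 0$ setting the kernel width.
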

875: Any classifier will be limited by the quality and power of the input

876: features provided.  In practice,

877: after an initial round of training and validation,

878: one should study the misclassified events and introduce additional

879: features that distinguish these.  Such iterations are helpful

880: regardless of which classification method is used;

881: the main purpose of this work is to highlight

882: new classification methods which maximize the classification

883: power achievable with a given set of features.

884:

885: As with any analysis, there is no substitute for clean data and a

886: well understood detector.  Problems which arise from false-positive

887: detections should first be addressed at the level of the detector

888: and data processing pipeline.  Future projects will hopefully have

889: the resources to address spurious detections at this level to make

890: the process easier for their object classifiers.

891: But even high quality, well

892: understood detectors and advanced image processing pipelines

893: such as SDSS will face signal {\it vs}.~background

894: classification problems, and this is where the methods described

895: in this paper come into play.

896:

897: In addition to improved background rejection power,

898: these new methods also have the advantage of generating

899: a single number which ranks the quality of an object rather than

900: a boolean pass/fail decision.  One may then adjust a threshold cut

901: on that single number to tune the desired tradeoff between purity

902: and completeness.  Future surveys may publish transient alerts

903: using relatively loose quality requirements; subscribers to these

904: alerts can then place their own cuts on this quality rank to adjust

905: the purity, completeness, and input data rate as needed.

906:

907: Boosted Trees, Random Forests, and Support Vector Machines all provide

908: much better object classification performance than traditional threshold

909: cuts.  When applied to the SNfactory supernova search pipeline, Boosted

910: Trees enabled us to find more supernovae with less work: Our efficiency

911: for finding real supernovae increased while our workload for scanning

912: non-supernova objects dramatically decreased.

913: Methods such as these will

914: be crucial for maintaining reasonable false positive rates at

915: the automated transient alert pipelines

916: of upcoming projects such as PanSTARRS and LSST.

917:

918: \acknowledgements

919:

920: We would like to thank

921: G.~Aldering,

922: S.~Bongard,

923: M.J.~Childress,

924: P.~Nugent,

925: and R.~Scalzo for useful conversations and

926: assistance with scanning our supernova candidates.

927: We also thank the entire Nearby Supernova Factory collaboration for

928: confirmation and followup spectra of our selected candidates and

929: for the use of the search images for this study.

930: The anonymous referee provided many useful comments for which we

931: are grateful.

932:

933: We are grateful to the technical and scientific staff of the Palomar

934: Oschin telescope, where our supernova search data are obtained.

935: The High Performance Wireless Research and

936: Education Network (HPWREN)\footnote{http://hpwren.ucsd.edu},

937: funded by the National Science Foundation grants 0087344 and 0426879,

938: has provided a consistently reliable

939: network for transferring our large amount of data from Mt. Palomar in

940: a timely manner.

941:

942: This work was supported in part by the Director,

943: Office of Science, Office of High Energy and Nuclear Physics, of the

944: U.S. Department of Energy under Contract No. DE-FG02-92ER40704,

945: by a grant from the Gordon \& Betty Moore Foundation,

946: and by National Science Foundation Grant Number AST-0407297.

947: This research used resources of the

948: National Energy Research Scientific Computing Center, which is supported

949: by the Office of Science of the U.S. Department of Energy under

950: Contract No. DE-AC02-05CH11231.

951:

952: SB would especially like to thank the organizers and hosts of the

953: Statistical Inference Problems in High Energy Physics and Astronomy Workshop

954: held at the Banff International Research Station (BIRS), which is supported by

955: the U.S. National Science Foundation, the Natural Science and

956: Engineering Research Council of Canada, Alberta Innovation, and

957: Mexico's National Council for Science and Technology (CONACYT).

958:

959: {\it Facilities:}

960: \facility{PO:1.2m (QUEST-II)}

961:

962: \begin{thebibliography}{}

963:

964: \bibitem[SNfactory, Aldering et al.(2002)]{snfactory}

965: Aldering, G. et al. (SNfactory) 2002,

966: \procspie, 4836, 61

967:

968: \bibitem[Astier et al.(2006)]{snls}

969: Astier, P. et al. (SNLS) 2006,

970: \aap, 447, 31-48

971:

972: \bibitem[Baltay et al.(2007)]{questcamera}

973: Baltay, C. et al. 2007,

974: (astro-ph/0702590)

975:

976: \bibitem[Becker et al.(2007)]{sdss:becker}

977: Becker, A. et al. (SDSS-II Supernova Survey) 2007,

978: %% ``Overview of the SDSS-II Supernova Survey: The First Two Seasons,''

979: \baas~(Seattle, WA)

980: %% presented at the 2007 AAS Meeting, Seattle WA, 7 January 2007

981:

982: \bibitem[Bellman(1961)]{bellman}

983: Bellman, R.E. 1961, Adaptive Control Processes

984: (Princeton, NJ: Princeton University Press)

985:

986: \bibitem[Bishop(1996)]{ann:bishop}

987: Bishop C.M. 1996, Neural Networks for Pattern Recognition

988: (Oxford University Press)

989:

990: \bibitem[Breiman(2001)]{rf:breiman}

991: Breiman, L. 2001,

992: ``Random Forests,''

993: University of California, Berkeley, technical report

994:

995: \bibitem[Breiman et al.(1984)]{breiman}

996: %%% Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J.,

997: Breiman, L., et al. 1984,

998: Classification and Regression Trees

999: (Belmont, CA: Wadsworth International Group)

1000:

1001: \bibitem[Chen et al.(2005)]{svm:chen05}

1002: Chen, P.-H., Lin, C.-J., and Sch\"olkopf, B. 2005,

1003: %%% ``A tutorial on $\nu$-support vector machines: Research Articles,''

1004: Applied Stochastic Models in Business and Industry, 21(2), 111-136

1005:

1006: \bibitem[Dietterich(2002)]{dietterich}

1007: Dietterich, T.G. 2002, Ensemble Learning, in

1008: The Handbook of Brain Theory and Neural Networks, Second Edition

1009: (M.A. Arbib, Ed.) (Cambridge, MA: The MIT Press)

1010:

1011: \bibitem[Fisher(1936)]{fisher}

1012: Fisher, R.A. 1936,

1013: %%% ``The Use of Multiple Measurements in Taxonomic Problems,''

1014: Annals of Eugenics, 7, 179-188

1015:

1016: \bibitem[Freund \& Schapire(1996)]{bdt:freund96}

1017: Freund, Y., and Schapire, R.E. 1996,

1018: %%% ``Experiments with a new boosting algorithm,''

1019: Proc COLT, 209-217 (New York: ACM Press)

1020:

1021: \bibitem[Freund \& Schapire(1999)]{bdt:freund99}

1022: Freund, Y., and Schapire, R.E. 1999,

1023: %%% ``A Short Introduction to Boosting,''

1024: J. Japan. Soc. for Artif. Intel. 14(5), 771-780

1025:

1026: \bibitem[Friedman(2001)]{bdt:friedman01}

1027: Friedman, J. 2001,

1028: %%% ``Greedy function approximation: a gradient boosting machine,''

1029: Annals of Statistics 29(5), 1189-1232

1030:

1031: \bibitem[Friedman, Hastie, \& Tibshirani(2000)]{bdt:friedman00}

1032: Friedman, J., Hastie, T., Tibshirani, R. 2000,

1033: %%% ``Additive Logistic Regression: a Statistical View of Boosting,''

1034: Annals of Statistics, 28(2), 337-407

1035:

1036: \bibitem[Gini(1921)]{gini}

1037: Gini, C. 1921,

1038: %%% ``Measurement of Inequality and Incomes,''

1039: The Economic Journal 31, 124-126

1040:

1041: \bibitem[Kittler et al.(1998)]{kittler}

1042: Kittler, J. et al. 1998,

1043: %%% On Combining Classifiers,

1044: IEEE Trans. Pattern Analysis and Machine Intelligence, 20(3), 226

1045:

1046: \bibitem[K\"oppen(2000)]{koeppen}

1047: K\"oppen, M. 2000,

1048: %%% ``The Curse of Dimensionality,''

1049: 5th Online World Conference on Soft Computing in Industrial Applications

1050: (WSC5), held on the Internet,

1051: %%% September 4-18, 2000.

1052: https://www.npt.nuwc.navy.mil/Csf/papers/hidim.pdf

1053:

1054: \bibitem[Miknaitis et al.(2007)]{essence}

1055: Miknaitis et al. (ESSENCE) 2007,

1056: (astro-ph/0701043), \apj, submitted

1057:

1058: \bibitem[Roe et al.(2005)]{miniboone}

1059: Roe, B.P. et al. 2005,

1060: %%% Boosted Decision Trees as an Alternative to

1061: %%% Artificial Neural Networks for Particle Identification

1062: Nucl. Instrum. Meth. A543, 577-584

1063:

1064: \bibitem[Romano et al.(2006)]{svm:romano}

1065: Romano, R., Aragon, C., Ding, C. 2006,

1066: %%% ``Supernova Recognition using Support Vector Machines,''

1067: in Proceedings of the 5th International Conference

1068: of Machine Learning Applications (Orlando, FL: IEEE)

1069: %%% http://icmla.cs.csub.edu/icmla06/

1070: %%% http://vis.lbl.gov/$\sim$romano/pubs/sne-svm-icmla06.pdf

1071:

1072: \bibitem[Vapnik(1998)]{svm:va98}

1073: Vapnik, V. 1998,

1074: Statistical Learning Theory

1075: (Wiley)

1076:

1077: \bibitem[Zahn \& Roskies(1972)]{zahn}

1078: Zahn, C.T., and Roskies, R.Z. 1972,

1079: %%% ``Fourier descriptors for plane closed curves,''

1080: IEEE Trans. Computers, 21, 269-281

1081:

1082:

1083: \end{thebibliography}

1084:

1085: \clearpage

1086:

1087: %- Fisher example

1088: \begin{figure}

1089: \centering

1090: \includegraphics{f1.eps}

1091: \figcaption{

1092: Example data which would be well separated using Fisher

1093: Discriminant Analysis.  The two classes of events (open and filled

1094: circles) are not well separated by either feature $A$

1095: or $B$, but their correlation is such that the combination

1096: $A + B$ provides very good separation of the two classes.

1097: \label{fig:fisher}

1098: }

1099: \end{figure}

1100:

1101: \clearpage

1102:

1103: %- Artificial Neural Net example

1104: %%% \begin{figure}

1105: %%% \centering

1106: %%% \includegraphics{f2.eps}

1107: %%% \figcaption{

1108: %%% Example structure of an Artificial Neural Network.  The inputs

1109: %%% $x_1 \ldots x_n$

1110: %%% are mapped to an output $O$ via a hidden layer of nodes

1111: %%% $y_1 \ldots y_m$.  The

1112: %%% output of each node is a function of the inputs $x_i$ and a set of

1113: %%% weights $w_{ij}$.  Similarly, the output $O$ is a function of

1114: %%% $y_j$ and another set of weights.

1115: %%% In general there may be multiple hidden layers.

1116: %%% The weights are tuned through iterative training.

1117: %%% }

1118: %%% \label{fig:ann}

1119: %%% \end{figure}

1120:

1121: %%% \clearpage

1122:

1123: %- SVM example

1124: \begin{figure}

1125: \centering

1126: \includegraphics{f2.eps}

1127: \figcaption{

1128: Support Vector Machines map an input space of features into a

1129:     higher dimensional space where the separation of classes becomes

1130:     easier.  The separation boundary in the original space

1131:     may be quite complex, even disjoint.  In the higher dimensional

1132:     space, the separation surface is a hyperplane whose parameters are

1133:     entirely determined by the subset of events (the support vectors)

1134:     nearest to the boundary.

1135: \label{fig:ann}

1136: }

1137: \end{figure}

1138:

1139: \clearpage

1140:

1141: %- Decision Tree example

1142: \begin{figure}

1143: \centering

1144: \includegraphics{f3.eps}

1145: \figcaption{

1146: Example decision tree which would treat high signal-to-noise objects

1147: differently than low signal-to-noise objects.  In practice, a real

1148: decision tree has many more branches and the same variable can be

1149: used to branch at many different locations with different cut values.

1150: \label{fig:decisiontree}

1151: }

1152: \end{figure}

1153:

1154: \clearpage

1155:

1156: %- Method comparison

1157: \begin{figure}

1158: \centering

1159: \includegraphics{f4.eps}

1160: \figcaption{Comparison of Boosted Trees (cyan solid line),

1161:    Random Forest (blue dashed line), SVM (green dotted line),

1162:    and threshold cuts (red dash-dotted line)

1163:    for false positive identification fraction {\it vs.}~true

1164:    positive identification

1165: fraction.  For the threshold cuts, the signal-to-noise ratio,

1166: motion, and shape cuts were varied to adjust signal and background rates.

1167: The red diamond shows the performance of the threshold cuts used

1168: during the SNfactory Summer 2006 search;

1169: the cyan square shows the performance achieved with Boosted Trees

1170: which were used for the Fall 2006 SNfactory search.

1171: The lower right corner of the plot represents ideal performance.

1172: \label{fig:eff}

1173: }

1174: \end{figure}

1175:

1176: \clearpage

1177:

1178: %- Leaf size comparison

1179: \begin{figure}

1180: \centering

1181: \includegraphics{f5.eps}

1182: \figcaption{

1183:     A comparison of the performance of 200 boosted trees with varying

1184:     leaf sizes.  10,000 training events were used; the plot shows the

1185:     comparison of leaves with a minimum of $N=5,25,50,100$ events in comparison

1186:     to the performance of the $N=15$ case.

1187: \label{fig:nperleaf}

1188: }

1189: \end{figure}

1190:

1191: \clearpage

1192:

1193: %- N trees comparison

1194: \begin{figure}

1195: \centering

1196: \includegraphics{f6.eps}

1197: \figcaption{

1198:     A comparison of the performance of $N_{\rm tree}=25, 50, 100, 200$

1199:     boosted trees

1200:     with  a minimum of 50 events per leaf (out of 10,000 training events) in

1201:     comparison to the $N_{\rm tree}=400$ case.

1202: \label{fig:ntree}

1203: }

1204: \end{figure}

1205:

1206: \end{document}

1207: