
1: \documentclass[11pt]{article}

2:

3: %\usepackage{ijcai01}

4: %\usepackage{fullpage,palatino}

5: \usepackage{fullpage}

6: \usepackage{amsfonts}

7:

8: \usepackage{alltt}

9: \setlength{\oddsidemargin}{-0.25in}

10: \setlength{\evensidemargin}{-0.25in}

11: \setlength{\topmargin}{0.5in}

12: \setlength{\headheight}{0pt}

13: \def\R{\mathbb{R}}

14: \setlength{\headsep}{0pt}

15: \setlength{\footskip}{30pt}

16: \setlength{\textheight}{8.75in}

17: \setlength{\textwidth}{7in}

18: \setlength{\marginparwidth}{0in}

19: \setlength{\marginparsep}{0in}

20: \newsavebox{\savepar}

21: \newenvironment{boxit}{\begin{lrbox}{\savepar}

22:  \begin{minipage}[b]{5.1in}}

23: {\end{minipage}\end{lrbox}\fbox{\usebox{\savepar}}}

24: \newenvironment{descit}[1]{\begin{quote} \textit{#1}}{\end{quote}}

25: \newenvironment{closeitemize}{\begin{list}{-}{\topsep=0in\itemsep=0in\parsep=0in}}{\end{list}}

26:

27: \input{psfig-dvips}

28:

29: \newif\ifpdf

30: \ifx\pdfoutput\undefined

31:   \pdffalse

32: \else

33:   \pdfoutput=1

34:   \pdftrue

35: \fi

36:

37: \ifpdf

38:   \usepackage[pdftex]{graphicx}

39:   \usepackage[pdftex]{color}

40:   \DeclareGraphicsExtensions{.pdf,.png,.jpg}

41: \else

42:   \usepackage[dvips]{graphicx}

43:   \usepackage[dvips]{color}

44:   \DeclareGraphicsExtensions{.eps,.epsi,.ps}

45: \fi

46:

47: \usepackage{times}

48: %\usepackage{fancyheadings}

49:

50: %\pagestyle{plain}

51: %\thispagestyle{empty}

52: %\pagestyle{empty}

53:

54: \def\midv{\mathop{\,|\,}}

55: \newtheorem{defn}{Definition}

56: \long\def\cbk#1{{\color{red}[CBK: #1]}}

57: \newlength\colwidth \setlength\colwidth{3.25in}

58:

59: \title{Sampling Strategies for Mining in Data-Scarce Domains}

60:

61: \author{Naren Ramakrishnan \\

62: Department of Computer Science\\

63: Virginia Tech, VA 24061\\

64: Tel: (540) 231-8451\\

65: Email: naren@cs.vt.edu

66: \and

67: Chris Bailey-Kellogg\\

68: Department of Computer Sciences\\

69: Purdue University, IN 47907\\

70: Tel: (765) 494-9025\\

71: Email: cbk@cs.purdue.edu}

72:

73: \date{}

74: \begin{document}

75:

76: \maketitle

77: \begin{abstract}

78: \noindent

79: Data mining has traditionally focused on the task of drawing

80: inferences from large datasets.  However, many scientific and

81: engineering domains, such as fluid dynamics and aircraft design, are

82: characterized by {\em scarce} data, due to the expense and

83: complexity of associated experiments and simulations.  In such

84: data-scarce domains, it is advantageous to focus the data collection

85: effort on only those regions deemed most important to support a

86: particular data mining objective.  This paper describes a mechanism

87: that interleaves bottom-up data mining, to uncover multi-level

88: structures in spatial data, with top-down sampling, to clarify

89: difficult decisions in the mining process.  The mechanism exploits

90: relevant physical properties, such as continuity, correspondence, and

91: locality, in a unified framework. This leads to effective mining and

92: sampling decisions that are explainable in terms of domain knowledge

93: and data characteristics.  This approach is demonstrated in two

94: diverse applications --- mining pockets in spatial data, and

95: qualitative determination of Jordan forms of matrices.

96: \end{abstract}

97:

98: %\thispagestyle{empty}

99: %\vspace{-0.2in}

100: \section{Introduction}

101: %\vspace{-0.1in}

102: A number of important scientific and engineering applications, such as

103: fluid dynamics simulation and aircraft design, require analysis of

104: spatially-distributed data from expensive experiments

105: and/or complex simulations demanding days, weeks, or even years on

106: petaflops-class computing systems.  For example,

107: consider the conceptual design of a high-speed civil transport (HSCT),

108: which involves the disciplines of aerodynamics, structures,

109: controls (mission-related), and propulsion. 80\% of the

110: aircraft lifecycle cost is determined at this stage.

111: Fig.~\ref{fig:aircraft} shows a cross-section of the design space for

112: such a problem involving 29 design

113: variables with 68 constraints~\cite{vizcraft}.

114: Frequently, the engineer will change some aspect of

115: a nominal design point, and run a simulation to see how the change

116: affects the objective function and various constraints dealing with

117: aircraft geometry and performance/aerodynamics. Alternatively, the design

118: process is made configurable, so the engineer can concentrate

119: on accurately modeling some aspect (e.g., the interaction between the

120: wing root and the fuselage) while replacing the remainder of the design

121: with fixed boundary conditions surrounding the focal area. Both these

122: approaches are inadequate for exploring such large high-dimensional design spaces,

123: even at low fidelity.  Ideally, the design engineer would like a

124: high-level mining system to identify the {\it pockets} that contain

125: good designs and merit further consideration; traditional tools

126: from optimization and approximation theory can then be applied to

127: fine-tune such preliminary analyses.

128: %Fig.~\ref{fig:aircraft} depicts one such

129: %pocket containing two optimal configurations of aircraft designs.

130:

131: Three important characteristics distinguish such applications.

132: First, they are characterized not by an abundance of data, but

133: rather by a scarcity of data (owing to the cost and time involved in

134: conducting simulations). Second, the

135: computational scientist has complete control over the data acquisition

136: process (e.g.\ regions of the design space where data can be

137: collected), especially via computer simulations.

138: And finally, there exists significant domain knowledge

139: in the form of physical properties such as continuity, correspondence, and

140: locality. It is natural therefore to use such information to focus data

141: collection for data mining. In this paper, we are interested in

142: the question: `Given a simulation code, knowledge of physical properties, and a data

143: mining goal, at what points should data be collected?'

144:

145: By suitably formulating an objective function and constraints around this question, we can

146: pose it as a problem of minimizing the number of samples needed for data mining.

147: %If the cost of data collection is non-uniform across the design space, then the expense

148: %of data samples can also be included in the formulation. This would mean trading off

149: %the cost of collecting new data with the expected improvement in performance w.r.t.

150: %a data mining objective.

151: Such a combination of \{data-scarcity + control over data collection +

152: need to exploit domain knowledge\} characterizes many important

153: computational science applications.

154: Data mining is now recognized as a key solution approach

155: for such applications, supporting analysis, visualization,

156: and design tasks~\cite{naren-ayg-advances}. It serves a primary

157: role in many domains (e.g., microarray bioinformatics) and a complementary role in

158: others, by augmenting traditional techniques from numerical analysis,

159: statistics, and machine learning.

160:

161: \begin{figure}

162: \begin{center}

163: \includegraphics[width=4.5in]{slice}

164: \end{center}

165: \vspace*{-\baselineskip}

166: \caption{A pocket in an aircraft design space viewed as a slice

167: through three design points~\cite{vizcraft} (courtesy Layne T. Watson).}

168: \label{fig:aircraft}

169: \end{figure}

170:

171: The goal of this paper is to describe focused sampling strategies

172: for mining scientific data. Our approach is based on the spatial aggregation

173: language (SAL)~\cite{bailey-kellogg96}, which supports construction

174: of data interpretation and control design applications for

175: spatially-distributed physical systems. Used as a basis for describing data mining

176: algorithms, SAL programs also help exploit knowledge of

177: physical properties such as continuity and locality in data fields.

178: %, based on specified

179: %metrics, adjacency relations, and equivalence predicates.

180: They work in a bottom-up manner to uncover regions of uniformity in

181: spatially distributed data. In conjunction with this process, we introduce

182: a top-down sampling strategy that focuses data collection in only those

183: regions that are deemed most important to support a data mining

184: objective. Together, they help define a methodology for mining in data-scarce

185: domains. We describe

186: this methodology at a high level and devote most of the paper to

187: two applications that employ it.

188:

189: \section{A Methodology for Mining in Data-Scarce Domains}

190: It is possible to study the problem of sampling for targeted data mining activities, such

191: as clustering, finding association rules, and decision tree construction~\cite{ganti-ieee}. This is

192: the approach taken by work such as~\cite{mannila}. In this paper, however, we are interested in a

193: general framework or language that can express data mining operations on datasets and

194: can be used to study the design of data collection and sampling strategies. The spatial

195: aggregation language (SAL)~\cite{bailey-kellogg96,yip96a} is such a framework.

196: %Using SAL, we can implement a variety

197: %of data mining algorithms as repeated aggregation operations on data fields.

198: %In addition, SAL allows the exploitation of physical properties that hold in the data.

199: %As mentioned earlier, such physical properties

200: %allow us to be more intelligent about deciding where to collect data next.

201:

202: \subsection{SAL: The Spatial Aggregation Language}

203: %\vspace{-0.1in}

204: As a data mining framework, SAL

205: is based on successive manipulations of data fields by a uniform vocabulary of

206: aggregation, classification, and abstraction operators. Programming in SAL follows

207: a philosophy of building a multi-layer hierarchy of aggregations of data. These

208: increasingly abstract descriptions of data are built using explicit representations

209: of physical knowledge, expressed as metrics, adjacency relations, and

210: equivalence predicates. This allows a SAL program to uncover and exploit structures in

211: physical data.

212:

213: SAL programs employ what has been called an {\em imagistic reasoning} style~\cite{yip95b}.

214: They employ vision-like routines to manipulate multi-layer geometric and

215: topological structures in spatially distributed data.  SAL adopts a

216: {\em field ontology}, in which the input is a {\em field} mapping from

217: one continuum to another (e.g.\ 2-D temperature field: $\R^2

218: \rightarrow \R^1$; 3-D fluid flow field: $\R^3 \rightarrow \R^3$).

219: Multi-layer structures arise from continuities in fields at multiple

220: scales.  Due to continuity, fields exhibit regions of uniformity, and

221: these regions of uniformity can be abstracted as higher-level

222: structures which in turn exhibit their own continuities.

223: Task-specific domain knowledge specifies how to uncover such regions

224: of uniformity, defining metrics for closeness of both field objects

225: and their features.  For example, isothermal contours are connected

226: curves of nearby points with equal (or similar enough) temperature.

227:

228: The identification of structures in a field is a form of data

229: reduction: a relatively information-rich field representation is

230: abstracted into a more concise structural representation (e.g.\

231: pressure data points into isobar curves or pressure cells; isobar

232: curve segments into troughs).  Navigating the mapping from field to

233: abstract description through multiple layers rather than in one giant

234: step allows the construction of more modular programs with more

235: manageable pieces that can use similar processing techniques at

236: different levels of abstraction.  The multi-level mapping also allows

237: higher-level layers to use global properties of lower-level objects as

238: local properties of the higher-level objects.  For example, the

239: average temperature in a region is a global property when considered

240: with respect to the temperature data points, but a local property when

241: considered with respect to a more abstract region description.  As

242: this paper demonstrates, analysis of higher-level structures in such a

243: hierarchy can guide interpretation of lower-level data.

244:

245:

246: \begin{figure}

247: \begin{center}

248: \includegraphics[width=3.5in]{SA}

249: \end{center}

250: %\vspace*{-\baselineskip}

251: \caption{SAL multi-layer spatial aggregates, uncovered by a uniform

252: vocabulary of operators utilizing domain knowledge. A variety of scientific data mining

253: tasks, such as vector field bundling, contour aggregation, correspondence abstraction, clustering,

254: and

255: uncovering regions of uniformity can be expressed as multi-level computations with SAL

256: aggregates.}

257: \label{fig:sa}

258: \end{figure}

259: \begin{figure}

260: \begin{center}

261: \begin{tabular}{cccc}

262: \includegraphics[width=1.5in]{vect1.eps} &

263: \includegraphics[width=1.5in]{vect2.eps} &

264: \includegraphics[width=1.5in]{vect3.eps} &

265: \includegraphics[width=1.5in]{vect4.eps} \\

266: (a) & (b) & (c) & (d) \\

267: \includegraphics[width=1.5in]{vect5.eps} &

268: \includegraphics[width=1.5in]{vect6.eps} &

269: \includegraphics[width=1.5in]{vect7.eps} &

270: \includegraphics[width=1.5in]{vect8.eps} \\

271: (e) & (f) & (g) & (h) \\

272: \end{tabular}

273: \end{center}

274: \caption{Example steps in SAL implementation of vector field

275: analysis application.  (a) Input vector field. (b) 8-adjacency

276: neighborhood graph.  (c) Forward neighbors. (d) Best forward

277: neighbors. (e) Ngraph transposed from best forward neighbors. (f) Best

278: backward neighbors. (g) Resulting adjacencies redescribed as

279: curves. (h) Higher-level aggregation and classification of curves

280: whose flows converge.}

281: \label{fig:vect}

282: \end{figure}

283:

284: \begin{figure}

285: %\framebox{

286: \begin{tabular}{|lc|} \hline

287:  & \\

288: \begin{minipage}{\textwidth}

289: \small

290: \begin{alltt}

291: // (a) Read vector field.

292: vect_field = read_point_point_field(\emph{infile});

293: points = domain_space(vect_field);

294:

295: // (b) Aggregate with 8-adjacency (i.e. within 1.5 units).

296: point_ngraph = aggregate(points, make_ngraph_near(1.5));

297:

298: // (c) Compare vector directions with node-neighbor direction.

299: angle = function (p1, p2) \{

300:   dot(normalize(mean(feature(vect_field, p1), feature(vect_field, p2))),

301:       normalize(subtract(p2, p1)))

302: \}

303: forward_ngraph = filter_ngraph(adj in point_ngraph, \{

304:  angle(from(adj), to(adj)) > \emph{angle\_similarity}

305: \})

306: // (d) Find best forward neighbor, comparing vector direction

307: // with ngraph edge direction and penalizing for distance.

308: forward_metric = function (adj) \{

309:   angle(from(adj), to(adj)) - \emph{distance\_penalty} * distance(from(adj),to(adj))

310: \}

311: best_forward_ngraph = best_neighbors_ngraph(forward_ngraph, forward_metric);

312:

313: // (e) Find backward neighbors by transposing best forward neighbors.

314: backward_ngraph = transpose_ngraph(best_forward_ngraph);

315:

316: // (f) At junctions, keep best backward neighbor using metric

317: // similar to that for best forward neighbors.

318: backward_metric = function (adj) \{

319:   angle(to(adj), from(adj)) - \emph{distance\_penalty}*distance(from(adj),to(adj))

320: \}

321: best_backward_ngraph = best_neighbors_ngraph(backward_ngraph, backward_metric);

322:

323: // (g) Move to a higher abstraction level by forming equivalence classes

324: // from remaining groups and redescribing them as curves.

325: final_ngraph = symmetric_ngraph(best_backward_ngraph, extend=true);

326: point_classes = classify(points, make_classifier_transitive(final_ngraph));

327:

328: points_to_curves = redescribe(classes(point_classes),

329:    make_redescribe_op_path_nline(final_ngraph));

330: trajs = high_level_objects(points_to_curves);

331: \end{alltt}

332: \end{minipage}

333:  & \\

334:  & \\

335: \hline

336: \end{tabular}

337: %}

338: \caption{SAL data mining program for the vector field analysis application of Fig.~\ref{fig:vect}.}

339: \label{samplecode}

340: \end{figure}

341:

342: SAL supports structure discovery through a small set of generic

343: operators, parameterized with domain-specific knowledge, on uniform

344: data types.  These operators and data types mediate increasingly

345: abstract descriptions of the input data (see Fig.~\ref{fig:sa}) to

346: form higher-level abstractions and mine patterns.  The {\em

347: primitives} in SAL are contiguous regions of space called {\em spatial

348: objects}; the {\em compounds} are (possibly structured) collections of

349: spatial objects; the {\em abstraction mechanisms} connect collections

350: at one level of abstraction with single objects at a higher level.

351:

352: SAL is currently available as a C++ library\footnote{The SAL implementation can be

353: downloaded from http://www.cis.ohio-state.edu/insight/sal-code.html.} providing access to a

354: large set of data type implementations and operations.  In addition,

355: an interpreted, interactive environment layered over the library

356: supports rapid prototyping of data mining applications.  It allows

357: users to inspect data and structures, test the effects of different

358: predicates, and graphically interact with representations of the

359: structures.

360: %SAL applications ranging from weather data analysis to

361: %diffusion-reaction system analysis to dynamical systems analysis to

362: %mechanical mechanism analysis all use the same set of generic

363: %operators parameterized by different domain knowledge.

364:

365: To illustrate SAL programming style, consider the task of bundling

366: vectors in a given vector field (e.g.\ wind velocity or temperature gradient)

367: into a set of streamlines (paths through the field following the

368: vector directions). This process can be depicted as shown in

369: Fig.~\ref{fig:vect} and the corresponding SAL data mining program is shown

370: in Fig.~\ref{samplecode}.

371: The steps

372: in this program are as follows:

373: (a) Establish a {\em field} mapping points (locations) to points

374: (vector directions, assumed here to be normalized).  (b) Localize

375: computation with a {\em neighborhood graph}, so that only spatially

376: proximate points are compared.

377: (c)--(f) Use a series of local computations on this representation to

378: find {\em equivalence classes} of neighboring vectors with respect to

379: vector direction (systematically eliminate all edges but those whose

380: directions best match the vector direction at both endpoints).

381: (g) {\em Redescribe} equivalence classes of vectors into more abstract

382: streamline curves.  (h) Aggregate and classify these curves into

383: groups with similar flow behavior, {\em using the exact same operators

384: but with different metrics} (code not shown).  As this example

385: illustrates, SAL provides a vocabulary for expressing the knowledge

386: required (e.g., distance metrics and similarity metrics)

387: for uncovering multi-level structures in spatial datasets.  It has been

388: applied to applications ranging from decentralized control

389: design~\cite{bailey-kellogg01}

390: %to weather data analysis~\cite{huang99}

391: to analysis of diffusion-reaction morphogenesis~\cite{ordonez00}.
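
As a small worked check of steps (b)--(d), assume that the points lie on a unit grid and that \texttt{mean} in Fig.~\ref{samplecode} is the arithmetic mean of the two vectors (both are assumptions made here for illustration). Grid neighbors lie at distances $1$ and $\sqrt{2} \approx 1.414$, while the next-nearest points lie at distance $2$, so the threshold of 1.5 units selects exactly the 8-adjacent points. For an edge from $p_1$ to $p_2$ with field vectors $v_1$ and $v_2$, the quantity computed by \texttt{angle} is
\[
\cos\theta = \frac{v_1 + v_2}{\|v_1 + v_2\|} \cdot \frac{p_2 - p_1}{\|p_2 - p_1\|} ,
\]
since normalizing the mean $(v_1+v_2)/2$ gives the same unit vector as normalizing $v_1+v_2$. Take $p_1=(0,0)$ with $v_1=(1,0)$. For the diagonal neighbor $p_2=(1,1)$ with $v_2=(0,1)$, the averaged direction $(1,1)/\sqrt{2}$ coincides with the edge direction, so $\cos\theta = 1$. For a horizontal neighbor $p_2=(1,0)$ carrying the same vector $v_2=(0,1)$, the edge direction is $(1,0)$ and $\cos\theta = 1/\sqrt{2} \approx 0.707$. With a hypothetical distance penalty of $0.1$, the forward metric of step (d) evaluates to $1 - 0.1\sqrt{2} \approx 0.859$ for the diagonal edge and $0.707 - 0.1 \approx 0.607$ for the horizontal edge, so in this example the diagonal neighbor would be kept as the best forward neighbor.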

392:

393: \subsection{Data Collection and Sampling}

394: %\vspace{-0.1in}

395: The above example illustrated the use of SAL in a data-rich domain. The exploitation

396: of physical properties is a central tenet of SAL since it drives the computation of

397: multi-level spatial aggregates. Many important physical properties can be expressed as

398: SAL computations by suitably defining adjacency relations and aggregation metrics.

399: To extend the use of SAL to data-scarce settings, we

400: present the sampling methodology outlined in Fig.~\ref{sampling-meth}.

401:

402: Once again, it is easy to understand the methodology in the context of the vector-field bundling

403: application (Fig.~\ref{fig:vect}). Assume that we apply the SAL data mining program of Fig.~\ref{samplecode}

404: with a small dataset and have navigated up to the highest level of the hierarchy (streamlines bundled with

405: convergent flows).

406: The SAL program computes different streamline aggregations from a neighborhood graph and chooses

407: one based on how well its curvature matches the direction of the vectors it aggregates. If data

408: is scarce, it is likely that some of these classification decisions will be {\it ambiguous}, i.e.,

409: there may exist multiple streamline aggregations. {\bf In such a case, we would like to choose a new data sample

410: that reduces the ambiguity and clarifies what the correct classification should be.}

411:

412: This is the essence of our sampling methodology: using SAL aggregates, we identify an information-theoretic measure

413: (here, ambiguity) that can be used to drive stages of future data collection. For instance, the

414: ambiguous streamline classifications can be summarized as a 2D ambiguity distribution that has a spike

415: for every location where an ambiguity was detected.

416: Reduction of ambiguity can be posed as the problem of minimizing (or maximizing, as the case may be)

417: a functional involving the (computed) ambiguity. The functional could be the entropy in the underlying

418: data field, as revealed by the ambiguity distribution.

419: Such a minimization leads us to select one or more data points that clarify the distribution of

420: streamlines, and hence make more effective use of data for data mining purposes.  The net effect of this methodology is

421: that we are able to capture the desirability of a particular design (data layout) in terms

422: of computations involving SAL aggregates. Thus, sampling is conducted for the express purpose of improving the

423: quality and efficacy of data mining. The dataset is updated with the newly collected value and the process is repeated

424: until a desired stopping criterion is met. For instance, we could terminate if the

425: functional is within accepted bounds, or

426: when there is no improvement in confidence of data mining results between successive rounds of data collection.

427: In our case, we terminate when there is no further ambiguity.
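
One way to make the ambiguity functional concrete is sketched here. The notation ($a_j$, $q_j$, $H$) and the greedy selection rule are illustrative assumptions, not a prescription of the methodology. Partition the domain into cells $j=1,\ldots,m$ and let $a_j \geq 0$ count the ambiguous classification decisions detected in cell $j$. When $\sum_l a_l > 0$, normalize to $q_j = a_j / \sum_l a_l$ and measure the spread of the ambiguity by the entropy
\[
H(q) = -\sum_{j:\, q_j > 0} q_j \log_2 q_j .
\]
For example, counts $(a_1,a_2,a_3,a_4) = (2,1,1,0)$ give $q = (0.5, 0.25, 0.25, 0)$ and $H = 0.5 + 0.5 + 0.5 = 1.5$ bits. A greedy rule would place the next sample in the cell with the largest count, here cell 1. If that sample resolves both ambiguities in cell 1, the counts become $(0,1,1,0)$, so that $q = (0, 0.5, 0.5, 0)$ and $H = 1$ bit. Under this sketch, the loop stops once every $a_j$ is zero.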

428:

429: This idea of sampling to satisfy particular design criteria has been studied in various

430: contexts, especially spatial statistics~\cite{Easterling, journel, dace}.

431: Many of these approaches (including ours) rely on

432: capturing properties of a desirable design in terms of a novel

433: objective function.  The distinguishing feature of our work is that it

434: uses {\em spatial} information gleaned from a higher level of

435: abstraction to focus data collection at the field/simulation code

436: layer.

437: % While flavors of the {\it consistent labeling} problem in

438: %mobile vision have this feature, they are more attuned to transferring

439: %information across two {\it successive} abstraction levels.

440: The applications presented here are also novel in that they span and connect

441: arbitrary levels of abstraction, thus suggesting new ways to integrate

442: qualitative and quantitative simulation~\cite{berleant-kuipers}.

443:

444: %The local nature of SAL computations means that interpolatory

445: %models (e.g., kriging) are particularly appropriate as surrogates since they give exact responses at the known

446: %data locations, and estimate values at other locations by minimizing a suitable

447: %error criterion (e.g., MSE). Global, least-squares techniques are not

448: %applicable because measurements at all locations are equally considered to uncover

449: %trends and patterns in a particular region.

450: %

451: \begin{figure}

452: \begin{center}

453: \includegraphics[height=3in]{methodology}

454: \end{center}

455: %\vspace*{-\baselineskip}

456: \caption{The sampling methodology for SAL mining in data-scarce domains.}

457: \label{sampling-meth}

458: \end{figure}

459:

460: We present concrete realizations of the above methodology in the next section.

461: %As an example,

462: %Fig.~\ref{sampling-meth2} shows how the various stages in Fig.~\ref{sampling-meth} can be instantiated to help mine

463: %streamlines in vector fields.

464: But before we proceed, it is

465: pertinent to note an optional step in our methodology. The newly collected data value can be used to improve a {\it

466: surrogate} model which then generates a dense data field for mining.

467: A surrogate function is used in lieu of the real

468: data source, so as to generate sufficient data for mining purposes. This is often

469: more advantageous than working directly with sparse data. Surrogate models are widely used in

470: engineering design, optimization, and in response surface approximations~\cite{ltw,response-book}.

471:

472: %The process can be terminated when the functional is within acceptable bounds or if we

473: %have exceeded a cost assumed for data collection.

474: %The instantiation of the methodology thus depends on $\mathcal{M}$, the

475: %functional involving $\mathcal{M}$, the way to

476: %optimize the functional, and the stopping criteria. The `right' choice

477: %is heavily dependent on the particular application and, if done wisely, can

478: %lead to substantial improvements in efficiency of data mining.

479: %This is best illustrated by examples, which we proceed to do with three applications.

480: %

481: %The common theme in all of these applications is that the formulation of $\mathcal{M}$

482: %reflects our prior knowledge of physical properties. Since computations in SAL

483: %are done in a bottom-up manner, difficulties encountered in decision-making (e.g.,

484: %ambiguity in classifying a streamline) are captured in $\mathcal{M}$ and used to

485: %drive top-down selection of sample locations. Table~\ref{compare3} compares the organization

486: %of the data mining methodology for each

487: %of the three applications. As we will show, this approach leads

488: %to highly effective sampling decisions that are also explainable in terms of

489: %problem structures and domain knowledge.

490:

491: Together, SAL and our focused sampling methodology address the main issues raised in

492: the beginning of the paper: SAL's uniform use of fields and abstraction operators allows

493: us to exploit prior knowledge in a bottom-up manner. Discrepancies as suggested by our

494: knowledge of physical properties (e.g., ambiguities) are used in a top-down manner by

495: the sampling methodology. Alternating these two stages leads to a closed-loop

496: data mining solution for data-scarce domains.

497:

498: \section{Example Applications}

499: %\vspace{-0.1in}

500: \subsection{Mining Pockets in Spatial Data}

501: %\vspace{-0.1in}

502: Our first application is motivated by the aircraft design problem and is meant

503: to illustrate the basic idea of our methodology. Here, we are given a spatial vector field

504: and we wish to identify {\it pockets} underlying the gradient. In a weather map, this might

505: mean identifying pressure troughs, for instance. The question is: `where should data be

506: collected so that we are able to mine the pockets with high confidence?' We begin by presenting

507: a mathematical function that gives rise to pockets in spatial fields. This function will

508: be used to validate and test our data mining and sampling methodology.

509:

510: \begin{figure}

511: \begin{center}

512: \begin{tabular}{cc}

513: \includegraphics[width=2.5in]{pocket}

514: \end{tabular}

515: \end{center}

516: \caption{A 2D pocket function.}

517: \label{fig:pocket-diag}

518: \end{figure}

519:

520: \subsubsection*{de Boor's function}

521: Carl de Boor invented a pocket function that exploits containment properties of the

522: $n$-sphere of radius 1 centered at the origin ($\sum_i x_i^2 \leq 1$) with respect to the

523: $n$-dimensional hypercube defined by $x_i \in [-1, 1], i=1,\ldots,n$. Even though the

524: sphere is embedded inside the cube, notice that the ratio of the volume of the cube ($2^n$) to that of the sphere

525: ($\pi^{n/2} / (n/2)!$) grows unboundedly with $n$. This means that the

526: volume of a high-dimensional cube is concentrated in its corners (a

527: counterintuitive notion at first). de Boor exploited this

528: property to design a difficult-to-optimize function which assumes a

529: {\it pocket} in each corner of the cube (Fig.~\ref{fig:pocket-diag}), that

530: is just outside the sphere.  Formally, it can be

531: defined as:

532: \begin{eqnarray}

533: \alpha({\mathbf X}) & = & \cos \left( \sum_{i=1}^n 2^i \left( 1 + \frac{x_i}{|x_i|}\right) \right) - 2 \\

534: \delta({\mathbf X})     & = & \| {\mathbf{X}} - 0.5 {\mathbf{I}}\| \\

535: p({\mathbf X}) & = & \alpha({\mathbf X}) ( 1 - \delta^2({\mathbf X})

536: (3 - 2\delta({\mathbf X}))) + 1

537: \end{eqnarray}

538: where ${\mathbf X}$ is the $n$-dimensional point $(x_1,x_2,\ldots,x_n)$

539: at which the pocket function $p$ is evaluated, ${\mathbf I}$ is the

540: identity $n$-vector, and $\|\cdot\|$ is the $L_2$ norm.
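
A few worked values illustrate these definitions; they follow directly from the formulas as stated. The cube-to-sphere volume ratio grows quickly with $n$:
\[
\frac{2^n}{\pi^{n/2}/(n/2)!} \approx 1.27, \quad 4.02 \times 10^{2}, \quad 4.06 \times 10^{7}
\qquad \mbox{for } n = 2,\ 10,\ 20 .
\]
For the pocket function itself, write $s(\delta) = 1 - \delta^2(3 - 2\delta)$, so that $p = \alpha\, s(\delta) + 1$. Then $s(0) = 1$, $s(1) = 0$, and $s'(\delta) = -6\delta(1-\delta)$ vanishes at both $\delta = 0$ and $\delta = 1$. The factor $\alpha$ depends only on the signs of the $x_i$, so it is constant on each orthant, and it always lies in $[-3,-1]$. For $n = 2$ the four orthants give
\begin{eqnarray*}
x_1 > 0,\ x_2 > 0: & & \alpha = \cos(12) - 2 \approx -1.16 \\
x_1 > 0,\ x_2 < 0: & & \alpha = \cos(4) - 2 \approx -2.65 \\
x_1 < 0,\ x_2 > 0: & & \alpha = \cos(8) - 2 \approx -2.15 \\
x_1 < 0,\ x_2 < 0: & & \alpha = \cos(0) - 2 = -1
\end{eqnarray*}
Hence at any point with $\delta = 0$ the value is $p = \alpha + 1 \leq 0$, while at $\delta = 1$ the value is $p = 1$ regardless of $\alpha$.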

541:

542: It is easily seen that $p$ has $2^n$ pockets (local minima); if $n$ is large

543: (say, 30, which means it will take more than a billion points to

544: just represent the corners of the $n$-cube!), naive global

545: optimization algorithms will require an unreasonable number of

546: function evaluations to find the pockets. Our goal for data mining here is to obtain a

547: qualitative indication of the existence, number, and locations of pockets, using

548: low-fidelity models and/or as few data points as possible.  The results can then be

549: used to seed higher-fidelity calculations.  This is also fundamentally

550: different from DACE~\cite{dace}, polynomial response surface

551: approximations~\cite{ltw}, and other approaches in geo-statistics

552: where the goal is accuracy of functional prediction at untested data

553: points.  Here, accuracy of estimation is traded for the ability to

554: mine pockets.

555:

556: \subsubsection*{Surrogate Function}

557: In this study, we use the SAL vector-field bundling code presented earlier along with

558: a surrogate model as the basis for generating a dense field

559: of data. Surrogate theory is an established area in engineering optimization and

560: there are several ways in which we can build a surrogate.

561: However, the local nature of SAL computations means that we can be selective about

562: our choice of surrogate representation.

563: For example, global, least-squares type

564: approximations are inappropriate since they weigh measurements at all locations

565: equally when uncovering trends and patterns in a particular

566: region.  We advocate the use of kriging-type

567: interpolators~\cite{dace}, which are local modeling methods with roots

568: in Bayesian statistics.  Kriging can handle situations with multiple

569: local extrema (for example, in weather data, remote sensing data,

570: etc.) and can easily exploit anisotropies and trends. Given $k$

571: observations, the interpolated model gives exact responses at these

572: $k$ sites and estimates values at other sites by minimizing the mean

573: squared error (MSE), assuming a random data process with zero mean and

574: a known covariance function.

575:

576: Formally (for two dimensions), the true function $p$ is assumed to be

577: the realization of a random process such as:

578: \begin{equation}

579: p(x,y) = \beta + Z(x,y)

580: \end{equation}

581: where $\beta$ is typically a uniform random variate, estimated based

582: on the known $k$ values of $p$, and $Z$ is a zero-mean random process characterized by a correlation function.

583: Kriging then estimates a model $p'$ of the same form, based on the

584: $k$ observations:

585: \begin{equation}

586: p'(x_i,y_i) = E(p(x_i,y_i) \midv p(x_1,y_1), \cdots, p(x_k,y_k))

587: \end{equation}

588: and minimizing mean squared error between $p'$ and $p$:

589: \begin{equation}\label{eq:MSE}

590: MSE = E\left[(p'(x,y) - p(x,y))^2\right]

591: \end{equation}

592: A typical choice for the covariance of $Z$ in $p'$ is $\sigma^2 R$, where scalar

593: $\sigma^2$ is the {\it estimated} variance, and correlation matrix $R$

594: encodes domain-specific constraints and reflects the current fidelity

595: of data.  We use an exponential function for entries in $R$, with two

596: parameters $C_1$ and $C_2$:

597: \begin{equation}\label{eq:R}

598: R_{ij} = e^{-C_1|x_i-x_j|^2 - C_2|y_i-y_j|^2}

599: \end{equation}

600: Intuitively, values at closer points should be more highly correlated.
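
As an illustration of Eq.~\ref{eq:R}, take the hypothetical values $C_1 = C_2 = 1$ (chosen only to keep the arithmetic simple). Two sites separated by $0.5$ along $x$ and coincident in $y$ have
$$R_{ij} = e^{-(0.5)^2} = e^{-0.25} \approx 0.78,$$
whereas two sites separated by $2$ along $x$ have
$$R_{ij} = e^{-2^2} = e^{-4} \approx 0.018.$$
Each site is perfectly correlated with itself ($R_{ii} = e^{0} = 1$), and a larger $C_1$ or $C_2$ makes the correlation decay faster along the corresponding axis.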

601:

602: The estimator minimizing mean squared error is then obtained by

603: multi-dimensional optimization (the derivation from Eqs.~\ref{eq:MSE}

604: and~\ref{eq:R} is beyond the scope of this paper):

605: \begin{equation}\label{eq:optim1}

606: \max_C \; -\frac{1}{2}\left(k\ln\sigma^2 + \ln |R|\right)

607: \end{equation}

608: This expression satisfies the conditions that there is no error

609: between the model and the true values at the chosen $k$ sites, and

610: that all variability in the model arises from the design of $Z$.  The

611: multi-dimensional optimization is often performed by gradient descent

612: or pattern search methods.  More details are available in~\cite{dace},

613: which demonstrates this methodology in the context of the design and

614: analysis of computer experiments.
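
For reference, the resulting predictor can be sketched in the notation above. The symbols $\mathbf{p}$, $\mathbf{1}$, and $\mathbf{r}$ are introduced here only for this illustration, and the forms shown are the standard ones for a constant-mean model as in~\cite{dace}. Let $\mathbf{p}$ be the vector of the $k$ observed values, $\mathbf{1}$ the $k$-vector of ones, and $\mathbf{r}(x,y)$ the vector whose $i$th entry is the correlation (Eq.~\ref{eq:R}) between $(x,y)$ and the $i$th site. Then
$$\hat{\beta} = \frac{\mathbf{1}^T R^{-1} \mathbf{p}}{\mathbf{1}^T R^{-1} \mathbf{1}}, \qquad
\hat{\sigma}^2 = \frac{1}{k} (\mathbf{p} - \hat{\beta}\mathbf{1})^T R^{-1} (\mathbf{p} - \hat{\beta}\mathbf{1}),$$
$$p'(x,y) = \hat{\beta} + \mathbf{r}(x,y)^T R^{-1} (\mathbf{p} - \hat{\beta}\mathbf{1}).$$
At a sampled site $(x_i,y_i)$, the vector $\mathbf{r}(x_i,y_i)$ is the $i$th column of $R$, so $\mathbf{r}^T R^{-1}$ is the $i$th unit vector and $p'(x_i,y_i) = \hat{\beta} + (p(x_i,y_i) - \hat{\beta}) = p(x_i,y_i)$. This is the exact-interpolation property at the $k$ sites.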

615:

616: \subsubsection*{Data Mining and Sampling Methodology}

617: The bottom-up computation of SAL aggregates from the surrogate model's outputs

618: will possibly lead to some ambiguous streamline classifications, as discussed earlier.

619: Ambiguity can reflect the desirability of acquiring data at or near a

620: specified point, to clarify the correct classification and to serve as

621: a mathematical criterion of information content.

622: There are several ways in which we can use information about ambiguity to drive

623: data collection. In this study, we express the ambiguities as a distribution describing

624: the number of possible good neighbors (for a streamline).

625: This {\it ambiguity distribution} provides a novel mechanism to include

626: qualitative information --- streamlines that agree will generally

627: contribute less to data mining, for information purposes. The information-theoretic measure

628: $M$ (ref. Fig.~\ref{sampling-meth}) was thus defined to be the ambiguity distribution $\wp$.

629:

630: The functional was defined as the posterior entropy $E(-\log d)$, where $d$ is the conditional

631: density of $\wp$ over the design space {\it not covered}

632: by the current data values. By a reduction argument, minimizing this posterior entropy can be

633: shown to be maximizing the prior entropy over the {\it unsampled} design space~\cite{dace}.

634: In turn, this means that the amount of information obtained from an experiment (additional data

635: collection) is maximized. In addition, we also incorporated $\wp$ as an indicator covariance term in

636: our surrogate model (this is a conventional method

637: for including qualitative information in an interpolatory model~\cite{journel}).

638:

639: \subsubsection*{Experimental Results}

640: The initial experimental configuration used a face-centered design ($4$ points in the 2D case).  A

641: surrogate model by kriging interpolation then generated data on a $41^n$-point grid.

642: de Boor's function was used as the source for data values; we also employed pseudorandom perturbations

643: of it that shift the pockets from the corners in a somewhat unpredictable

644: way (see~\cite{ambig} for details). In total, we experimented with 100 perturbed

645: variations (each) of the 2D and 3D pocket functions. For each of these cases, data collection was organized

646: in rounds of one extra sample each (that minimizes the above functional). The number of samples needed

647: to mine all the pockets by SAL was recorded. We also compared our results with those obtained

648: from a pure DACE/kriging approach (i.e., where sampling was directed at improving accuracy of function estimation). In other words, we used the DACE methodology to suggest

649: new locations for data collection and determined how these choices fared with respect

650: to mining the pockets.

651:

652: \begin{figure}

653: \begin{center}

654: \includegraphics[width=3in]{pocket-results}

655: \end{center}

656: %\vspace*{-\baselineskip}

657: \caption{Pocket-finding results (2D) show that focused sampling using a measure

658: of ambiguity always requires fewer total samples (7--15) than conventional kriging (17--23).}

659: \label{fig:pocket-bar}

660: \end{figure}

661:

662: \begin{figure}

663: \begin{center}

664: \begin{tabular}{cc}

665: \includegraphics[width=3in]{design} &

666: \includegraphics[width=2.5in]{results2sal}

667: \end{tabular}

668: \end{center}

669: \caption{Mining pockets in 2D from only 7 sample points.

670: (left)

671: The chosen sample locations: 4 initial

672: face-centered samples (marked as blue circles) plus 3

673: samples selected by our methodology (marked as red diamonds). Note that no additional sample is required in

674: the lower-left quadrant.  (right)

675: SAL structures in surrogate

676: model data, confirming the existence of four pockets.}

677: \label{fig:pocket}

678: \end{figure}

679:

680: Fig.~\ref{fig:pocket-bar} shows the distributions of total number of data samples

681: required to mine the four pockets for the 2D case. We were thus able to mine the 2D pockets

682: using 3 to 11 additional samples, whereas the conventional kriging approach required

683: 13 to 19 additional samples. The results were more striking in the 3D case:

684: at most 42 additional samples for focused sampling and up to 151 points for conventional

685: kriging. This shows that our focused sampling methodology performs 40--75\% better

686: than sampling by conventional kriging.
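
As a rough check on the quoted range (an illustrative calculation that simply pairs the extremes of the reported counts of additional samples, and not necessarily how the figures were aggregated over the individual test functions), the relative savings are
$$1 - \frac{11}{19} \approx 0.42, \qquad 1 - \frac{3}{13} \approx 0.77, \qquad 1 - \frac{42}{151} \approx 0.72,$$
obtained by pairing the largest 2D counts, the smallest 2D counts, and the largest 3D counts, respectively. These values are broadly in line with the range stated above.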

687:

688: Fig.~\ref{fig:pocket} (left)

689: describes a 2D design involving only $7$ total data points that is able to mine the four pockets.

690: Counterintuitively, no additional sample is required in the lower left quadrant! While this

691: will lead to a highly sub-optimal design (from the traditional viewpoint

692: of minimizing variance in predicted values), it is nevertheless an appropriate design

693: for data mining purposes. In particular, this means that neighborhood

694: calculations involving the other three quadrants are enough to uncover

695: the pocket in the fourth quadrant.  Since the kriging interpolator

696: uses local modeling and since pockets in 2D effectively occupy the

697: quadrants, obtaining measurements at ambiguous locations serves to

698: capture the relatively narrow regime of each dip, which in turn helps

699: to distinguish the pocket in the neighboring quadrant. This effect is hard

700: to achieve without exploiting knowledge of physical properties, in this case,

701: locality of the dips.

702:

703: \subsection{Qualitative Jordan Form Determination}

704: %\vspace{-0.1in}

705: In our second  application, we use our methodology to identify the most probable

706: Jordan form of a given matrix. This is a good application for data mining

707: since the direct computation of the Jordan form leads to a numerically

708: unstable algorithm.

709:

710: \subsubsection*{Jordan forms}

711: A matrix $\mathcal{A}$ (real or complex) that has $r$ independent eigenvectors has a Jordan form that

712: consists of $r$ {\it blocks}. Each of these blocks is an upper triangular

713: matrix that is associated with one of the eigenvectors of

714: $\mathcal{A}$ and whose size describes the multiplicity of the

715: corresponding eigenvalue. For the given matrix $\mathcal{A}$,

716: the diagonalization thus posits a nonsingular matrix $\mathcal{B}$ such that:

717: \begin{equation}

718: \mathcal{B}^{-1} \mathcal{A} \mathcal{B} = \left[ \begin{array}{cccc}

719:             {\mathcal{J}}_1 & & & \\

720:             & {\mathcal{J}}_2 & & \\

721:             & & \ddots & \\

722:             & & & {\mathcal{J}}_r\\

723:             \end{array}

724: \right]

725: \end{equation}

726: where

727: \begin{equation}

728: {\mathcal{J}}_i = \left[ \begin{array}{cccc}

729:                           \lambda_i & 1     &   & \\

730:                                     & \ddots & \ddots & \\

731:                                     &       & \ddots  & 1\\

732:                                     &       &   & \lambda_i\\

733: \end{array}

734: \right]

735: \end{equation}

736: and $\lambda_i$ is the eigenvalue revealed by the $i$th Jordan block ($\mathcal{J}_i$).

737: The Jordan form is most easily explained by looking at how eigenvectors are

738: distributed for a given eigenvalue. Consider, for example, the matrix

739: $$ \left[ \begin{array}{crr}

740: 1 & 1 & -1 \\

741: 0 & 0 & 2 \\

742: 0 & -1 & 3 \\

743: \end{array}

744: \right]$$

745: that has eigenvalues at 1, 1, and 2. This matrix has only two

746: eigenvectors, as revealed by the two-block structure of its Jordan form:

747: $$\left[ \begin{array}{cr|r}

748: 1 & 1 & 0 \\

749: 0 & 1 & 0 \\ \hline

750: 0 & 0 & 2 \\

751: \end{array}

752: \right]$$

753: The Jordan form is unique modulo shufflings of the blocks and, in this case,

754: shows that there is one eigenvalue ($1$) of multiplicity $2$ and one eigenvalue

755: ($2$) of multiplicity $1$. We say that the matrix has the

756: Jordan structure given by

757: $(1)^2 (2)^1$. In contrast, the matrix

758: $$ \left[ \begin{array}{crr}

759: 1 & 0 & 0 \\

760: 0 & 2 & 0 \\

761: 0 & 0 & 1 \\

762: \end{array}

763: \right]$$

764: has the same eigenvalues but a three-block Jordan structure:

765: $$\left[ \begin{array}{c|r|r}

766: 1 & 0 & 0 \\ \hline

767: 0 & 1 & 0 \\ \hline

768: 0 & 0 & 2 \\

769: \end{array}

770: \right]$$

771: This is because there are three independent eigenvectors (the unit vectors,

772: actually). The diagonalizing matrix is thus the identity matrix and the

773: Jordan form has three permutations. The Jordan structure is therefore

774: given by $(1)^1

775: (1)^1 (2)^1$. These two examples show that

776: a given eigenvalue's multiplicity could be distributed across one, many, or

777: all Jordan blocks. Correlating the eigenvalue with the block structure is

778: an important problem in numerical analysis.
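
The eigenvector count in the first example can be verified by hand (a short check added for illustration, with $A$ denoting the first example matrix). The characteristic polynomial is
$$\det(A - \lambda I) = (1-\lambda)\left[(-\lambda)(3-\lambda) + 2\right] = -(\lambda-1)^2(\lambda-2),$$
giving the eigenvalues 1, 1, and 2. For $\lambda = 1$,
$$A - I = \left[ \begin{array}{crr}
0 & 1 & -1 \\
0 & -1 & 2 \\
0 & -1 & 2 \\
\end{array}
\right]$$
has rank 2 (its last two rows coincide and are independent of the first), so its null space is one-dimensional and is spanned by $(1,0,0)^T$. For $\lambda = 2$, the null space of $A - 2I$ is spanned by $(0,1,1)^T$, since $A\,(0,1,1)^T = (0,2,2)^T$. There are thus only two independent eigenvectors, and the double eigenvalue 1 must occupy a single block of size 2.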

779:

780: The typical approach to computing the Jordan form is to `follow the staircase'

781: pattern of the structure and perform rank determinations in conjunction

782: with ascertaining the eigenvalues. One of the more serious

783: caveats with such an approach involves mistaking an eigenvalue of multiplicity

784: $> 1$ for multiple eigenvalues~\cite{staircase}.

785: In the first example matrix

786: above, this might lead to inferring that the Jordan form has three

787: blocks.

788: The extra care needed to safeguard staircase algorithms usually

789: involves more complexity than the original computation to be performed!

790: The ill-conditioned nature of this computation has thus

791: traditionally prompted numerical analysts to favor other, more stable,

792: decompositions.

793:

794: \begin{figure}

795: \begin{center}

796: \begin{tabular}{cc}

797: \includegraphics[width=0.25\textwidth]{example-jordan1} &

798: \includegraphics[width=0.25\textwidth]{example-jordan2} \\

799: \end{tabular}

800: \end{center}

801: \caption{Superimposed spectra for assessing the Jordan form

802: of the Brunet matrix. Two Jordan blocks of multiplicity 3 are

803: observed for eigenvalue 7, at different (left, right) perturbation levels.}

804: \label{fig:jordan}

805: \end{figure}

806:

807: \subsubsection*{Qualitative assessment of Jordan forms}

808: A recent development has been the acceptance of a qualitative approach

809: to Jordan structure determination, proposed by Chaitin-Chatelin and

810: Frayss\'{e}~\cite{precise}. This approach does not employ the staircase

811: idea and, instead, exploits a semantics of eigenvalue perturbations to

812: infer multiplicity. This leads to a geometrically intuitive algorithm that

813: can be implemented using SAL.

814:

815: Consider a matrix that has eigenvalues $\lambda_1, \lambda_2,

816: \cdots, \lambda_n$ with multiplicities $\rho_1, \rho_2, \cdots, \rho_n$

817: (resp). Any attempt at finding the eigenvalues (e.g., determining

818: the roots of the characteristic polynomial) is intrinsically subject

819: to the numerical analysis dogma: the problem being solved will

820: actually be a {\it perturbed} version of the original problem. This allows

821: the expression of the {\it computed} eigenvalues in terms of perturbations

822: on the actual eigenvalues. It can be easily seen that the computed

823: eigenvalue corresponding to any $\lambda_k$ will be distributed on

824: the complex plane as:

825: $$\lambda_k + |\Delta|^{1\over{\rho_k}} e^{{i\phi}\over{\rho_k}}$$

826: where the phase $\phi$ of the perturbation $\Delta$ ranges over \{$2\pi, 4\pi,

827: \ldots, 2\rho_k \pi$\} if $\Delta$ is positive and

828: over \{$3\pi, 5\pi, \ldots, (2\rho_k+1) \pi$\} if $\Delta$ is negative. The

829: insight

830: in~\cite{precise} is to {\it superimpose} numerous such perturbed

831: calculations graphically so that the aggregate picture reveals the $\rho_k$ of

832: the eigenvalue $\lambda_k$. Notice that the phase variations

833: imply that the computed eigenvalues will be lying on the

834: vertices of a regular polygon centered on the {\it actual} eigenvalue

835: and where the number of sides is {\it two times} the multiplicity of the

836: considered eigenvalue (this takes into account both positive and

837: negative $\Delta$). Since the diameter of the polygon is influenced

838: by $\Delta$, iterating this process over many $\Delta$ will lead to a

839: `sticks' depiction of the Jordan form.
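
As a worked instance of the expression above, take $\rho_k = 3$. For positive $\Delta$ the angles $\phi/\rho_k$ are $2\pi/3$, $4\pi/3$, and $2\pi$; for negative $\Delta$ the phases $3\pi$, $5\pi$, and $7\pi$ give the angles $\pi$, $5\pi/3$, and $7\pi/3$. Reduced modulo $2\pi$ and sorted, the six directions are
$$0, \;\; \frac{\pi}{3}, \;\; \frac{2\pi}{3}, \;\; \pi, \;\; \frac{4\pi}{3}, \;\; \frac{5\pi}{3},$$
i.e., the vertices of a regular hexagon. The radius scales as $|\Delta|^{1/3}$, so that, for example, $|\Delta| = 2^{-45}$ yields a radius of $2^{-15} \approx 3.1 \times 10^{-5}$ (ignoring problem-dependent constants, which this sketch omits). A defective eigenvalue is therefore displaced by far more than the size of the perturbation itself.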

840:

841: To illustrate, we choose a matrix whose computations will

842: be more prone to finite precision errors. Perturbations on

843: the 8-by-8 Brunet matrix~\cite{precise} with Jordan structure

844: $(-1)^1 (-2)^1 (7)^3 (7)^3$ induce the superimposed structures shown in

845: Fig.~\ref{fig:jordan}. The left part of Fig.~\ref{fig:jordan} depicts

846: normwise relative perturbations in the scale

847: of $[2^{-50},2^{-40}]$. The six sticks around the eigenvalue at 7

848: clearly reveal that its Jordan block is of size 3. The

849: other Jordan block, also centered at 7, is revealed if we conduct

850: our exploration at a finer perturbation level. The right part of Fig.~\ref{fig:jordan}

851: reveals the second Jordan block using perturbations in the

852: range $[2^{-53},2^{-50}]$. The noise in both pictures is a consequence

853: of (i) having two Jordan blocks with the same size, and (ii)

854: a `ring' phenomenon studied in~\cite{edelman-ma}; we do

855: not attempt to capture these effects in this paper.

856:

857: \begin{figure}[t]

858: \begin{center}

859: \includegraphics[width=2in]{star-trig2a}\hspace*{0.25in}\includegraphics[width=2in]{star-trig4a} \\

860: \vspace*{0.1in}

861: \includegraphics[width=2in]{star-corr2a}\hspace*{0.25in}\includegraphics[width=2in]{star-corr4a} \\

862: \end{center}

863: \vspace*{-\baselineskip}

864: \caption{Mining Jordan forms from (left) a small sample set, and (right)

865: large sample set. (top) Approximately congruent triangles. (bottom)

866: Evaluation

867: of correspondence of rotated triangles in terms of match

868: %

869: %, for (top)

870: %small sample set; (middle) larger sample set; (bottom) larger sample

871: %set but lower-scoring model.  (left) Approximately-congruent

872: %triangles.  (right) Evaluation of correspondence in terms of match

873: between original (red dots) and rotated (green circles) samples.}

874: \label{fig:star-demo}

875: %\vspace*{-\baselineskip}

876: \end{figure}

877:

878: \subsubsection*{Data Mining and Sampling Methodology}

879: For this study, we collect data by random normwise perturbations

880: in a given region and a SAL program determines

881: multiplicity by detecting symmetry correspondence in the samples.  The first

882: aggregation level collects the samples for a given perturbation into

883: triangles.  The second aggregation level finds congruent triangles via

884: geometric hashing~\cite{hash}, and uses congruence to establish

885: an analogy relation among triangle vertices.  This relation is then abstracted

886: into a rotation about a point (the eigenvalue), and evaluated for whether

887: each point rotates onto another and whether matches define regular

888: polygons. A third level then compares rotations across

889: different perturbations, re-visiting perturbations or choosing new

890: perturbations in order to disambiguate (see Fig.~\ref{fig:star-demo}).

891: The end result of this

892: analysis is a confidence measure on models of possible Jordan forms.

893: Each model is defined by its estimate of $\lambda$ and $\rho$ (notice that

894: we are working only within one region at a time). The measure

895: $M$ was defined to be the joint probability distribution over the space of $\lambda$ and $\rho$.
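
The rotation test at the second level can be written compactly by identifying samples with points $z_j$ in the complex plane. This is only a sketch: the tolerance $\tau$ and the acceptance rule below are assumptions made for illustration, not a specification of the SAL implementation. For a candidate model $(\lambda, \rho)$, the $\rho$ computed eigenvalues arising from a single perturbation should form a regular $\rho$-gon about $\lambda$ (a triangle for $\rho = 3$), so rotating by $2\pi/\rho$ about $\lambda$ should map the sample set onto itself:
$$\min_{l} \left| \lambda + e^{2\pi i/\rho}(z_j - \lambda) - z_l \right| \leq \tau \quad \mbox{for every sample } z_j.$$
Samples that violate this condition would count as evidence against the candidate model.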

896:

897: \subsubsection*{Experimental Results}

898: Since our Jordan form computation treats multiple perturbations (irrespective of level)

899: as {\it independent} estimates of eigenstructure, the idea of sampling

900: here is not `where to collect,' but `how much to collect.' The goal of

901: data mining is hence to improve our confidence in model evaluation.

902: We organized data collection into rounds of 6-8 samples each,

903: varied a tolerance parameter for triangle

904: congruence from 0.1 to 0.5 (effectively increasing the number of

905: models posited), and determined the number of rounds needed to

906: determine the Jordan form. As test cases, we used the set of matrices

907: studied in~\cite{precise}.

908: On average, our focused sampling approach required 1 round of data collection

909: at a tolerance of 0.1 and up to 2.7 rounds at 0.5.  Even with a large number

910: of models posited, additional data quickly weeded out bad models.

911: Fig.~\ref{fig:star-demo} demonstrates this mechanism on the Brunet matrix

912: discussed above for two sets of sample points.

913: To the best of our knowledge, this is the only known

914: focused sampling methodology for this domain; we hence are unable to

915: present any comparisons. However, it is clear that by harnessing domain knowledge

916: about correspondences, we have arrived at an intelligent sampling methodology that

917: resembles what a human would obtain by visual inspection.

918:

919: \section{Discussion}

920: The presented methodology for mining in data-scarce domains has several

921: intrinsic benefits. First, it is based on a uniform vocabulary of operators that

922: can be exploited for a rich diversity of applications. Second, it

923: demonstrates a novel factorization to the problem of mining when data is scarce,

924: namely, formulating an experiment design methodology to clarify, disambiguate,

925: and improve confidences in higher-level aggregates of data.

926: This allows us to bridge qualitative and quantitative

927: information in a unified framework. SAL programs thus uncover bottom-up structures in data

928: systematically and use difficulties encountered in this process (ambiguities,

929: lack of correspondences) to guide top-down selection of additional

930: data samples. By using knowledge of physical properties explicitly, our

931: approach can provide more holistic and explainable results than off-the-shelf data

932: mining algorithms. Third, our methodology can co-exist with

933: more traditional approaches to problem solving (numerical analysis, optimization)

934: and is not meant to be a replacement or a contrasting approach. This is amply

935: demonstrated in each of the two applications above, where connections with various

936: traditional methodologies have been carefully established.

937:

938: The methodology makes several intrinsic assumptions which we only briefly mention

939: here. All of our applications have been such that the cause, formation, and

940: effect of the relevant physical properties are well understood. This is precisely what

941: allows us to act decisively based on higher-level information from SAL aggregates,

942: through the measure $M$. It also assumes that the problems that will be

943: encountered by the mining algorithm are the same as the problems for which

944: it was designed. This is an inheritance from Bayesian inductive inference and

945: leads to fundamental limitations on what can be done in such a setting. For instance, if

946: new data does not help clarify an ambiguity,

947: does the fault lie with the model (SAL higher-level aggregate) or with the

948: data? We can summarize this problem by saying that the approach requires strong {\it a priori} information about

949: what is possible and what is not.

950:

951: Nevertheless, by advocating targeted use of domain specific knowledge and

952: aiding qualitative model selection, our methodology is more efficient at determining

953: high level models from empirical data. Together, SAL and our information-theoretic measure $M$

954: encapsulate knowledge about physical properties and this is what makes our

955: methodology a viable one for data mining purposes.

956: In the future, we aim to characterize more formally

957: the particular forms of domain knowledge that help overcome sparsity and

958: noise in scientific datasets.

959:

960: It should be mentioned that while the two studies formulate their sampling objectives differently, they

961: are naturally supported by the SAL framework:

962: \begin{itemize}

963: \item (pockets) Where should I collect data in order to mine the pockets with high

964: confidence?

965: \item (Jordan forms) How much data should I collect in order to determine the right

966: Jordan form with high confidence?

967: \end{itemize}

968: One could imagine extending our framework to also take into account the expense

969: of data samples. If the cost of data collection is non-uniform across the domain, then

970: including this in the design of our functional will allow us to trade off the cost

971: of gathering information against the expected improvement

972: in problem solving performance. This area of data mining is referred to as {\it active

973: learning.}
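
One simple form that such a cost-aware extension could take (a hypothetical sketch; the symbols $U$, $c$, and $\gamma$ are not part of the present framework) is to select the next sample location by
$$x^{*} = \arg\max_{x} \left[ U(x) - \gamma\, c(x) \right],$$
where $U(x)$ is the expected information gain from sampling at $x$ (e.g., the reduction in the entropy functional), $c(x)$ is the cost of acquiring a sample there, and $\gamma \geq 0$ sets the exchange rate between information and cost. Setting $\gamma = 0$ recovers purely information-driven sampling.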

974:

975: Data mining can sometimes be a controversial term in a discipline that is used to

976: mathematical rigor; this is because it is often used synonymously with `lack of a hypothesis

977: or theory.' We hope to have convinced the reader that this need not be the case and

978: that data mining can indeed be sensitive to knowledge about the domain, especially

979: physical properties of the kind we have harnessed here. As data mining

980: applications become more prevalent in science, the need to incorporate {\it a priori}

981: domain knowledge will only become more important.

982:

983: \begin{thebibliography}{10}

984:

985: \bibitem{ambig}

986: C.~Bailey-Kellogg and N.~Ramakrishnan.

987: \newblock {Ambiguity-Directed Sampling for Qualitative Analysis of Sparse Data

988:   from Spatially Distributed Physical Systems}.

989: \newblock In {\em Proceedings of the Seventeenth International Joint Conference

990:   on Artificial Intelligence (IJCAI-01)}, pages 43--50, 2001.

991:

992: \bibitem{bailey-kellogg01}

993: C.~Bailey-Kellogg and F.~Zhao.

994: \newblock {Influence-Based Model Decomposition for Reasoning about Spatially

995:   Distributed Physical Systems}.

996: \newblock {\em Artificial Intelligence}, Vol. 130(2):pages 125--166, 2001.

997:

998: \bibitem{bailey-kellogg96}

999: C.~Bailey-Kellogg, F.~Zhao, and K.~Yip.

1000: \newblock {Spatial Aggregation: Language and Applications}.

1001: \newblock In {\em Proceedings of the Thirteenth National Conference on

1002:   Artificial Intelligence (AAAI'96)}, pages 517--522, 1996.

1003:

1004: \bibitem{berleant-kuipers}

1005: D.~Berleant and B.~Kuipers.

1006: \newblock {Qualitative and Quantitative Simulation: Bridging the Gap}.

1007: \newblock {\em Artificial Intelligence}, Vol. 95(2):pages 215--255, 1998.

1008:

1009: \bibitem{precise}

1010: F.~Chaitin-Chatelin and V.~Frayss\'{e}.

1011: \newblock {\em Lectures on Finite Precision Computations}.

1012: \newblock SIAM Monographs, 1996.

1013:

1014: \bibitem{Easterling}

1015: R.G. Easterling.

1016: \newblock {Comment on `Design and Analysis of Computer Experiments'}.

1017: \newblock {\em {Statistical Science}}, Vol. 4(4):pages 425--427, {1989}.

1018:

1019: \bibitem{edelman-ma}

1020: A.~Edelman and Y.~Ma.

1021: \newblock {Non-Generic Eigenvalue Perturbations of {Jordan} Blocks}.

1022: \newblock {\em Linear Algebra \& Applications}, Vol. 273(1-3):pages 45--63,

1023:   1998.

1024:

1025: \bibitem{staircase}

1026: A.~Edelman and Y.~Ma.

1027: \newblock {Staircase Failures Explained by Orthogonal Versal Forms}.

1028: \newblock {\em SIAM Journal on Matrix Analysis and Applications}, Vol.

1029:   21(3):pages 1004--1025, 2000.

1030:

1031: \bibitem{ganti-ieee}

1032: V.~Ganti, J.~Gehrke, and R.~Ramakrishnan.

1033: \newblock {Mining Very Large Databases}.

1034: \newblock {\em IEEE Computer}, Vol. 32(8):pages 38--45, August 1999.

1035:

1036: \bibitem{vizcraft}

1037: A.~Goel, C.A. Baker, C.A. Shaffer, B.~Grossman, W.H. Mason, L.T. Watson, and

1038:   R.T. Haftka.

1039: \newblock {VizCraft: A Problem-Solving Environment for Aircraft Configuration

1040:   Design}.

1041: \newblock {\em IEEE/AIP Computing in Science and Engineering}, Vol. 3(1):pages

1042:   56--66, 2001.

1043:

1044: \bibitem{journel}

1045: A.~Journel.

1046: \newblock {Constrained Interpolation and Qualitative Information - The Soft

1047:   Kriging Approach}.

1048: \newblock {\em {Mathematical Geology}}, Vol. 18(2):pages 269--286, November

1049:   {1986}.

1050:

1051: \bibitem{mannila}

1052: J.~Kivinen and H.~Mannila.

1053: \newblock {The Use of Sampling in Knowledge Discovery}.

1054: \newblock In {\em Proceedings of the Thirteenth ACM Symposium on Principles of

1055:   Database Systems}, pages 77--85, 1994.

1056:

1057: \bibitem{ltw}

1058: D.L. Knill, A.A. Giunta, C.A. Baker, B.~Grossman, W.H. Mason, R.T. Haftka, and

1059:   L.T. Watson.

1060: \newblock {Response Surface Models Combining Linear and Euler Aerodynamics for

1061:   Supersonic Transport Design}.

1062: \newblock {\em Journal of Aircraft}, Vol. 36(1):pages 75--86, 1999.

1063:

1064: \bibitem{hash}

1065: Y.~Lamdan and H.~Wolfson.

1066: \newblock {Geometric Hashing: A General and Efficient Model-Based Recognition

1067:   Scheme}.

1068: \newblock In {\em Proceedings of the Second International Conference on

1069:   Computer Vision (ICCV)}, pages 238--249, 1988.

1070:

1071: \bibitem{response-book}

1072: R.H. Myers and D.C. Montgomery.

1073: \newblock {\em Response Surface Methodology: Process and Product Optimization

1074:   using Designed Experiments}.

1075: \newblock Wiley, Jan 2002.

1076:

1077: \bibitem{ordonez00}

1078: I.~{Ord\'{o}\~{n}ez} and F.~Zhao.

1079: \newblock {{STA}: Spatio-Temporal Aggregation with Applications to Analysis of

1080:   Diffusion-Reaction Phenomena}.

1081: \newblock In {\em Proceedings of the Seventeenth National Conference on

1082:   Artificial Intelligence (AAAI'00)}, pages 517--523, 2000.

1083:

1084: \bibitem{naren-ayg-advances}

1085: N.~Ramakrishnan and A.Y. Grama.

1086: \newblock {Mining Scientific Data}.

1087: \newblock {\em Advances in Computers}, Vol. 55:pages 119--169, Sep 2001.

1088:

1089: \bibitem{dace}

1090: J.~Sacks, W.J. Welch, T.J. Mitchell, and H.P. Wynn.

1091: \newblock {Design and Analysis of Computer Experiments}.

1092: \newblock {\em {Statistical Science}}, Vol. 4(4):pages 409--435, {1989}.

1093:

1094: \bibitem{yip96a}

1095: K.M. Yip and F.~Zhao.

1096: \newblock {Spatial Aggregation: Theory and Applications}.

1097: \newblock {\em Journal of Artificial Intelligence Research}, Vol. 5:pages

1098:   1--26, 1996.

1099:

1100: \bibitem{yip95b}

1101: K.M. Yip, F.~Zhao, and E.~Sacks.

1102: \newblock {Imagistic Reasoning}.

1103: \newblock {\em ACM Computing Surveys}, Vol. 27(3):pages 363--365, 1995.

1104:

1105: \end{thebibliography}

1106: \end{document}

1107:

1108: