1: \documentclass[11pt]{article}
2:
3: %\usepackage{ijcai01}
4: %\usepackage{fullpage,palatino}
5: \usepackage{fullpage}
6: \usepackage{amsfonts}
7:
8: \usepackage{alltt}
9: \setlength{\oddsidemargin}{-0.25in}
10: \setlength{\evensidemargin}{-0.25in}
11: \setlength{\topmargin}{0.5in}
12: \setlength{\headheight}{0pt}
13: \def\R{\mathbb{R}}
14: \setlength{\headsep}{0pt}
15: \setlength{\footskip}{30pt}
16: \setlength{\textheight}{8.75in}
17: \setlength{\textwidth}{7in}
18: \setlength{\marginparwidth}{0in}
19: \setlength{\marginparsep}{0in}
20: \newsavebox{\savepar}
21: \newenvironment{boxit}{\begin{lrbox}{\savepar}
22: \begin{minipage}[b]{5.1in}}
23: {\end{minipage}\end{lrbox}\fbox{\usebox{\savepar}}}
24: \newenvironment{descit}[1]{\begin{quote} \textit{#1}}{\end{quote}}
25: \newenvironment{closeitemize}{\begin{list}{-}{\topsep=0in\itemsep=0in\parsep=0in}}{\end{list}}
26:
27: \input{psfig-dvips}
28:
29: \newif\ifpdf
30: \ifx\pdfoutput\undefined
31: \pdffalse
32: \else
33: \pdfoutput=1
34: \pdftrue
35: \fi
36:
37: \ifpdf
38: \usepackage[pdftex]{graphicx}
39: \usepackage[pdftex]{color}
40: \DeclareGraphicsExtensions{.pdf,.png,.jpg}
41: \else
42: \usepackage[dvips]{graphicx}
43: \usepackage[dvips]{color}
44: \DeclareGraphicsExtensions{.eps,.epsi,.ps}
45: \fi
46:
47: \usepackage{times}
48: %\usepackage{fancyheadings}
49:
50: %\pagestyle{plain}
51: %\thispagestyle{empty}
52: %\pagestyle{empty}
53:
54: \def\midv{\mathop{\,|\,}}
55: \newtheorem{defn}{Definition}
56: \long\def\cbk#1{{\color{red}[CBK: #1]}}
57: \newlength\colwidth \setlength\colwidth{3.25in}
58:
59: \title{Sampling Strategies for Mining in Data-Scarce Domains}
60:
61: \author{Naren Ramakrishnan \\
62: Department of Computer Science\\
63: Virginia Tech, VA 24061\\
64: Tel: (540) 231-8451\\
65: Email: naren@cs.vt.edu
66: \and
67: Chris Bailey-Kellogg\\
68: Department of Computer Sciences\\
69: Purdue University, IN 47907\\
70: Tel: (765) 494-9025\\
71: Email: cbk@cs.purdue.edu}
72:
73: \date{}
74: \begin{document}
75:
76: \maketitle
77: \begin{abstract}
78: \noindent
79: Data mining has traditionally focused on the task of drawing
80: inferences from large datasets. However, many scientific and
81: engineering domains, such as fluid dynamics and aircraft design, are
82: characterized by {\em scarce} data, due to the expense and
83: complexity of associated experiments and simulations. In such
84: data-scarce domains, it is advantageous to focus the data collection
85: effort on only those regions deemed most important to support a
86: particular data mining objective. This paper describes a mechanism
87: that interleaves bottom-up data mining, to uncover multi-level
88: structures in spatial data, with top-down sampling, to clarify
89: difficult decisions in the mining process. The mechanism exploits
90: relevant physical properties, such as continuity, correspondence, and
91: locality, in a unified framework. This leads to effective mining and
92: sampling decisions that are explainable in terms of domain knowledge
93: and data characteristics. This approach is demonstrated in two
94: diverse applications --- mining pockets in spatial data, and
95: qualitative determination of Jordan forms of matrices.
96: \end{abstract}
97:
98: %\thispagestyle{empty}
99: %\vspace{-0.2in}
100: \section{Introduction}
101: %\vspace{-0.1in}
102: A number of important scientific and engineering applications, such as
103: fluid dynamics simulation and aircraft design, require analysis of
104: spatially-distributed data from expensive experiments
105: and/or complex simulations demanding days, weeks, or even years on
106: petaflops-class computing systems. For example,
107: consider the conceptual design of a high-speed civil transport (HSCT),
108: which involves the disciplines of aerodynamics, structures,
controls (mission-related), and propulsion; 80\% of the
aircraft lifecycle cost is determined at this stage.
111: Fig.~\ref{fig:aircraft} shows a cross-section of the design space for
112: such a problem involving 29 design
113: variables with 68 constraints~\cite{vizcraft}.
114: Frequently, the engineer will change some aspect of
115: a nominal design point, and run a simulation to see how the change
116: affects the objective function and various constraints dealing with
aircraft geometry and performance/aerodynamics. Alternatively, the design
process is made configurable, so the engineer can concentrate
on accurately modeling some aspect (e.g., the interaction between the
wing root and the fuselage) while replacing the remainder of the design
with fixed boundary conditions surrounding the focal area. Both of these
approaches are inadequate for exploring such large high-dimensional design spaces,
even at low fidelity. Ideally, the design engineer would like a
high-level mining system to identify the {\it pockets} that contain
good designs and merit further consideration; traditional tools
126: from optimization and approximation theory can then be applied to
127: fine-tune such preliminary analyses.
128: %Fig.~\ref{fig:aircraft} depicts one such
129: %pocket containing two optimal configurations of aircraft designs.
130:
131: Three important characteristics distinguish such applications.
132: First, they are characterized not by an abundance of data, but
133: rather by a scarcity of data (owing to the cost and time involved in
134: conducting simulations). Second, the
135: computational scientist has complete control over the data acquisition
136: process (e.g.\ regions of the design space where data can be
137: collected), especially via computer simulations.
138: And finally, there exists significant domain knowledge
139: in the form of physical properties such as continuity, correspondence, and
140: locality. It is natural therefore to use such information to focus data
141: collection for data mining. In this paper, we are interested in
142: the question: `Given a simulation code, knowledge of physical properties, and a data
143: mining goal, at what points should data be collected?'
144:
145: By suitably formulating an objective function and constraints around this question, we can
146: pose it as a problem of minimizing the number of samples needed for data mining.
147: %If the cost of data collection is non-uniform across the design space, then the expense
148: %of data samples can also be included in the formulation. This would mean trading off
149: %the cost of collecting new data with the expected improvement in performance w.r.t.
150: %a data mining objective.
151: Such a combination of \{data-scarcity + control over data collection +
152: need to exploit domain knowledge\} characterizes many important
153: computational science applications.
154: Data mining is now recognized as a key solution approach
155: for such applications, supporting analysis, visualization,
156: and design tasks~\cite{naren-ayg-advances}. It serves a primary
157: role in many domains (e.g., microarray bioinformatics) and a complementary role in
158: others, by augmenting traditional techniques from numerical analysis,
159: statistics, and machine learning.
160:
161: \begin{figure}
162: \begin{center}
163: \includegraphics[width=4.5in]{slice}
164: \end{center}
165: \vspace*{-\baselineskip}
166: \caption{A pocket in an aircraft design space viewed as a slice
167: through three design points~\cite{vizcraft} (courtesy Layne T. Watson).}
168: \label{fig:aircraft}
169: \end{figure}
170:
171: The goal of this paper is to describe focused sampling strategies
172: for mining scientific data. Our approach is based on the spatial aggregation
173: language (SAL)~\cite{bailey-kellogg96}, which supports construction
174: of data interpretation and control design applications for
175: spatially-distributed physical systems. Used as a basis for describing data mining
176: algorithms, SAL programs also help exploit knowledge of
177: physical properties such as continuity and locality in data fields.
178: %, based on specified
179: %metrics, adjacency relations, and equivalence predicates.
180: They work in a bottom-up manner to uncover regions of uniformity in
181: spatially distributed data. In conjunction with this process, we introduce
182: a top-down sampling strategy that focuses data collection in only those
183: regions that are deemed most important to support a data mining
184: objective. Together, they help define a methodology for mining in data-scarce
domains. We describe
this methodology at a high level and devote the major part of the paper to
two applications that employ it.
188:
189: \section{A Methodology for Mining in Data-Scarce Domains}
One can study the problem of sampling for targeted data mining activities, such
as clustering, finding association rules, and decision tree construction~\cite{ganti-ieee}; this is
the approach taken by work such as~\cite{mannila}. In this paper, however, we are interested in a
general framework or language that can express data mining operations on datasets and
that can also be used to study the design of data collection and sampling strategies. The spatial
aggregation language (SAL)~\cite{bailey-kellogg96,yip96a} is such a framework.
196: %Using SAL, we can implement a variety
197: %of data mining algorithms as repeated aggregation operations on data fields.
198: %In addition, SAL allows the exploitation of physical properties that hold in the data.
199: %As mentioned earlier, such physical properties
200: %allow us to be more intelligent about deciding where to collect data next.
201:
202: \subsection{SAL: The Spatial Aggregation Language}
203: %\vspace{-0.1in}
204: As a data mining framework, SAL
205: is based on successive manipulations of data fields by a uniform vocabulary of
206: aggregation, classification, and abstraction operators. Programming in SAL follows
207: a philosophy of building a multi-layer hierarchy of aggregations of data. These
208: increasingly abstract descriptions of data are built using explicit representations
209: of physical knowledge, expressed as metrics, adjacency relations, and
210: equivalence predicates. This allows a SAL program to uncover and exploit structures in
211: physical data.
212:
SAL programs employ what has been called an {\em imagistic reasoning} style~\cite{yip95b}.
They use vision-like routines to manipulate multi-layer geometric and
215: topological structures in spatially distributed data. SAL adopts a
216: {\em field ontology}, in which the input is a {\em field} mapping from
217: one continuum to another (e.g.\ 2-D temperature field: $\R^2
218: \rightarrow \R^1$; 3-D fluid flow field: $\R^3 \rightarrow \R^3$).
219: Multi-layer structures arise from continuities in fields at multiple
220: scales. Due to continuity, fields exhibit regions of uniformity, and
221: these regions of uniformity can be abstracted as higher-level
222: structures which in turn exhibit their own continuities.
223: Task-specific domain knowledge specifies how to uncover such regions
224: of uniformity, defining metrics for closeness of both field objects
225: and their features. For example, isothermal contours are connected
226: curves of nearby points with equal (or similar enough) temperature.
227:
228: The identification of structures in a field is a form of data
229: reduction: a relatively information-rich field representation is
230: abstracted into a more concise structural representation (e.g.\
231: pressure data points into isobar curves or pressure cells; isobar
232: curve segments into troughs). Navigating the mapping from field to
233: abstract description through multiple layers rather than in one giant
234: step allows the construction of more modular programs with more
235: manageable pieces that can use similar processing techniques at
236: different levels of abstraction. The multi-level mapping also allows
237: higher-level layers to use global properties of lower-level objects as
238: local properties of the higher-level objects. For example, the
239: average temperature in a region is a global property when considered
240: with respect to the temperature data points, but a local property when
241: considered with respect to a more abstract region description. As
242: this paper demonstrates, analysis of higher-level structures in such a
243: hierarchy can guide interpretation of lower-level data.
244:
245:
246: \begin{figure}
247: \begin{center}
248: \includegraphics[width=3.5in]{SA}
249: \end{center}
250: %\vspace*{-\baselineskip}
251: \caption{SAL multi-layer spatial aggregates, uncovered by a uniform
252: vocabulary of operators utilizing domain knowledge. A variety of scientific data mining
253: tasks, such as vector field bundling, contour aggregation, correspondence abstraction, clustering,
254: and
255: uncovering regions of uniformity can be expressed as multi-level computations with SAL
256: aggregates.}
257: \label{fig:sa}
258: \end{figure}
259: \begin{figure}
260: \begin{center}
261: \begin{tabular}{cccc}
262: \includegraphics[width=1.5in]{vect1.eps} &
263: \includegraphics[width=1.5in]{vect2.eps} &
264: \includegraphics[width=1.5in]{vect3.eps} &
265: \includegraphics[width=1.5in]{vect4.eps} \\
266: (a) & (b) & (c) & (d) \\
267: \includegraphics[width=1.5in]{vect5.eps} &
268: \includegraphics[width=1.5in]{vect6.eps} &
269: \includegraphics[width=1.5in]{vect7.eps} &
270: \includegraphics[width=1.5in]{vect8.eps} \\
271: (e) & (f) & (g) & (h) \\
272: \end{tabular}
273: \end{center}
274: \caption{Example steps in SAL implementation of vector field
275: analysis application. (a) Input vector field. (b) 8-adjacency
276: neighborhood graph. (c) Forward neighbors. (d) Best forward
277: neighbors. (e) Ngraph transposed from best forward neighbors. (f) Best
278: backward neighbors. (g) Resulting adjacencies redescribed as
279: curves. (h) Higher-level aggregation and classification of curves
280: whose flows converge.}
281: \label{fig:vect}
282: \end{figure}
283:
284: \begin{figure}
285: %\framebox{
286: \begin{tabular}{|lc|} \hline
287: & \\
288: \begin{minipage}{\textwidth}
289: \small
290: \begin{alltt}
291: // (a) Read vector field.
292: vect_field = read_point_point_field(\emph{infile});
293: points = domain_space(vect_field);
294:
295: // (b) Aggregate with 8-adjacency (i.e. within 1.5 units).
296: point_ngraph = aggregate(points, make_ngraph_near(1.5));
297:
298: // (c) Compare vector directions with node-neighbor direction.
299: angle = function (p1, p2) \{
300: dot(normalize(mean(feature(vect_field, p1), feature(vect_field, p2))),
301: normalize(subtract(p2, p1)))
302: \}
303: forward_ngraph = filter_ngraph(adj in point_ngraph, \{
304: angle(from(adj), to(adj)) > \emph{angle\_similarity}
305: \})
306: // (d) Find best forward neighbor, comparing vector direction
307: // with ngraph edge direction and penalizing for distance.
308: forward_metric = function (adj) \{
309: angle(from(adj), to(adj)) - \emph{distance\_penalty} * distance(from(adj),to(adj))
310: \}
311: best_forward_ngraph = best_neighbors_ngraph(forward_ngraph, forward_metric);
312:
313: // (e) Find backward neighbors by transposing best forward neighbors.
314: backward_ngraph = transpose_ngraph(best_forward_ngraph);
315:
316: // (f) At junctions, keep best backward neighbor using metric
317: // similar to that for best forward neighbors.
318: backward_metric = function (adj) \{
319: angle(to(adj), from(adj)) - \emph{distance\_penalty}*distance(from(adj),to(adj))
320: \}
321: best_backward_ngraph = best_neighbors_ngraph(backward_ngraph, backward_metric);
322:
323: // (g) Move to a higher abstraction level by forming equivalence classes
324: // from remaining groups and redescribing them as curves.
325: final_ngraph = symmetric_ngraph(best_backward_ngraph, extend=true);
326: point_classes = classify(points, make_classifier_transitive(final_ngraph));
327:
328: points_to_curves = redescribe(classes(point_classes),
329: make_redescribe_op_path_nline(final_ngraph));
330: trajs = high_level_objects(points_to_curves);
331: \end{alltt}
332: \end{minipage}
333: & \\
334: & \\
335: \hline
336: \end{tabular}
337: %}
338: \caption{SAL data mining program for the vector field analysis application of Fig.~\ref{fig:vect}.}
339: \label{samplecode}
340: \end{figure}
341:
342: SAL supports structure discovery through a small set of generic
343: operators, parameterized with domain-specific knowledge, on uniform
344: data types. These operators and data types mediate increasingly
345: abstract descriptions of the input data (see Fig.~\ref{fig:sa}) to
346: form higher-level abstractions and mine patterns. The {\em
347: primitives} in SAL are contiguous regions of space called {\em spatial
348: objects}; the {\em compounds} are (possibly structured) collections of
349: spatial objects; the {\em abstraction mechanisms} connect collections
350: at one level of abstraction with single objects at a higher level.
351:
SAL is currently available as a C++ library\footnote{The SAL implementation can be
downloaded from \texttt{http://www.cis.ohio-state.edu/insight/sal-code.html}.} providing access to a
large set of data type implementations and operations. In addition,
an interpreted, interactive environment layered over the library
356: supports rapid prototyping of data mining applications. It allows
357: users to inspect data and structures, test the effects of different
358: predicates, and graphically interact with representations of the
359: structures.
360: %SAL applications ranging from weather data analysis to
361: %diffusion-reaction system analysis to dynamical systems analysis to
362: %mechanical mechanism analysis all use the same set of generic
363: %operators parameterized by different domain knowledge.
364:
365: To illustrate SAL programming style, consider the task of bundling
366: vectors in a given vector field (e.g.\ wind velocity or temperature gradient)
367: into a set of streamlines (paths through the field following the
368: vector directions). This process can be depicted as shown in
369: Fig.~\ref{fig:vect} and the corresponding SAL data mining program is shown
370: in Fig.~\ref{samplecode}.
371: The steps
372: in this program are as follows:
373: (a) Establish a {\em field} mapping points (locations) to points
374: (vector directions, assumed here to be normalized). (b) Localize
375: computation with a {\em neighborhood graph}, so that only spatially
376: proximate points are compared.
377: (c)--(f) Use a series of local computations on this representation to
378: find {\em equivalence classes} of neighboring vectors with respect to
379: vector direction (systematically eliminate all edges but those whose
380: directions best match the vector direction at both endpoints).
381: (g) {\em Redescribe} equivalence classes of vectors into more abstract
382: streamline curves. (h) Aggregate and classify these curves into
383: groups with similar flow behavior, {\em using the exact same operators
384: but with different metrics} (code not shown). As this example
385: illustrates, SAL provides a vocabulary for expressing the knowledge
386: required (e.g., distance metrics and similarity metrics)
387: for uncovering multi-level structures in spatial datasets. It has been
388: applied to applications ranging from decentralized control
389: design~\cite{bailey-kellogg01}
390: %to weather data analysis~\cite{huang99}
391: to analysis of diffusion-reaction morphogenesis~\cite{ordonez00}.
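
As a rough illustration of steps (a)--(d), the following Python sketch reimplements the forward-neighbor selection on a toy field. It is not SAL code: the function names, the dictionary-based field representation, and the parameter values (\texttt{angle\_similarity = 0.7}, \texttt{distance\_penalty = 0.1}) are assumptions made for this example only.

\begin{verbatim}
```python
import math

def normalize(v):
    """Scale a 2-D vector to unit length (zero vectors are returned as-is)."""
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n) if n > 0 else (0.0, 0.0)

def angle(field, p1, p2):
    """Cosine between the mean vector at p1, p2 and the edge direction p1 -> p2."""
    (u1, v1), (u2, v2) = field[p1], field[p2]
    mean_dir = normalize(((u1 + u2) / 2, (v1 + v2) / 2))
    edge_dir = normalize((p2[0] - p1[0], p2[1] - p1[1]))
    return mean_dir[0] * edge_dir[0] + mean_dir[1] * edge_dir[1]

def best_forward_neighbors(field, radius=1.5,
                           angle_similarity=0.7, distance_penalty=0.1):
    """Map each point to its best forward neighbor, if it has one."""
    points = list(field)
    best = {}
    for p in points:
        candidates = []
        for q in points:
            d = math.dist(p, q)
            if q != p and d <= radius:            # (b) 8-adjacency
                a = angle(field, p, q)
                if a > angle_similarity:          # (c) forward neighbors
                    candidates.append((a - distance_penalty * d, q))
        if candidates:
            best[p] = max(candidates)[1]          # (d) best forward neighbor
    return best

# (a) Toy input: a 3x3 grid on which every vector points in the +x direction.
field = {(x, y): (1.0, 0.0) for x in range(3) for y in range(3)}
best = best_forward_neighbors(field)
```
\end{verbatim}

On this $3 \times 3$ grid, each point in the first two columns links to its right-hand neighbor, while the points in the last column have no forward neighbor; following the links yields three horizontal streamlines. Steps (e)--(g) would then prune junctions and redescribe the chains as curves.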
392:
393: \subsection{Data Collection and Sampling}
394: %\vspace{-0.1in}
395: The above example illustrated the use of SAL in a data-rich domain. The exploitation
396: of physical properties is a central tenet of SAL since it drives the computation of
397: multi-level spatial aggregates. Many important physical properties can be expressed as
398: SAL computations by suitably defining adjacency relations and aggregation metrics.
399: To extend the use of SAL to data-scarce settings, we
400: present the sampling methodology outlined in Fig.~\ref{sampling-meth}.
401:
Once again, the methodology is easiest to understand in the context of the vector-field bundling
application (Fig.~\ref{fig:vect}). Assume that we apply the SAL data mining program of Fig.~\ref{samplecode}
to a small dataset and have navigated up to the highest level of the hierarchy (streamlines bundled with
convergent flows).
The SAL program computes different streamline aggregations from a neighborhood graph and chooses
one based on how well its curvature matches the direction of the vectors it aggregates. If data
is scarce, it is likely that some of these classification decisions will be {\it ambiguous}, i.e.,
there may exist multiple candidate streamline aggregations. {\bf In such a case, we would like to choose a new data sample
that reduces the ambiguity and clarifies what the correct classification should be.}
411:
412: This is the essence of our sampling methodology: using SAL aggregates, we identify an information-theoretic measure
413: (here, ambiguity) that can be used to drive stages of future data collection. For instance, the
414: ambiguous streamline classifications can be summarized as a 2D ambiguity distribution that has a spike
415: for every location where an ambiguity was detected.
Reduction of ambiguity can be posed as the problem of minimizing (or maximizing, as the case may be)
a functional involving the (computed) ambiguity. The functional could be the entropy in the underlying
data field, as revealed by the ambiguity distribution.
Such a minimization will lead us to select one or more data points that clarify the distribution of
streamlines, and hence make more effective use of data for data mining purposes. The net effect of this methodology is
that we are able to capture the desirability of a particular design (data layout) in terms
of computations involving SAL aggregates. Thus, sampling is conducted for the express purpose of improving the
quality and efficacy of data mining. The dataset is updated with the newly collected value and the process is repeated
until a desired stopping criterion is met. For instance, we could terminate when the
functional is within accepted bounds, or
when there is no improvement in confidence of data mining results between successive rounds of data collection.
In our case, we terminate when there is no further ambiguity.
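
The closed loop described above can be sketched in a few lines of Python. This is a deliberately simplified one-dimensional toy, not the SAL implementation: the ambiguity measure (distance to the nearest of two disagreeing neighboring samples), the class labels, and all function names are hypothetical choices made only to show the control flow.

\begin{verbatim}
```python
import math

def entropy(weights):
    """Shannon entropy (bits) of nonnegative weights, normalized to sum to 1."""
    weights = list(weights)
    total = sum(weights)
    if total == 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)

def ambiguity_at(c, samples):
    """Toy ambiguity: nonzero only between neighboring samples that disagree."""
    left = max(s for s in samples if s < c)
    right = min(s for s in samples if s > c)
    if samples[left] == samples[right]:
        return 0
    return min(c - left, right - c)

def focused_sampling(candidates, samples, collect, max_rounds=100):
    """Sample where ambiguity peaks until no ambiguity remains."""
    samples = dict(samples)
    order, history = [], []
    for _ in range(max_rounds):
        scores = {c: ambiguity_at(c, samples)
                  for c in candidates if c not in samples}
        if not scores or max(scores.values()) == 0:
            break                                  # stopping criterion
        history.append(entropy(scores.values()))  # functional of the ambiguity
        target = max(scores, key=scores.get)       # most ambiguous location
        samples[target] = collect(target)          # run the "simulation"
        order.append(target)
    return samples, order, history

# Hidden ground truth: class "A" left of 4.5, class "B" right of it.
collect = lambda x: "A" if x < 4.5 else "B"
samples, order, history = focused_sampling(
    candidates=range(1, 8), samples={0: "A", 8: "B"}, collect=collect)
```
\end{verbatim}

In this toy run the loop collects only three new samples (at 4, 6, and 5) rather than all seven candidates, and the entropy of the ambiguity distribution falls from about 2.66 bits to 1.5 bits to 0 as the class boundary is pinned down.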
428:
429: This idea of sampling to satisfy particular design criteria has been studied in various
430: contexts, especially spatial statistics~\cite{Easterling, journel, dace}.
431: Many of these approaches (including ours) rely on
432: capturing properties of a desirable design in terms of a novel
433: objective function. The distinguishing feature of our work is that it
434: uses {\em spatial} information gleaned from a higher level of
435: abstraction to focus data collection at the field/simulation code
436: layer.
437: % While flavors of the {\it consistent labeling} problem in
438: %mobile vision have this feature, they are more attuned to transferring
439: %information across two {\it successive} abstraction levels.
440: The applications presented here are also novel in that they span and connect
441: arbitrary levels of abstraction, thus suggesting new ways to integrate
442: qualitative and quantitative simulation~\cite{berleant-kuipers}.
443:
444: %The local nature of SAL computations means that interpolatory
445: %models (e.g., kriging) are particularly appropriate as surrogates since they give exact responses at the known
446: %data locations, and estimate values at other locations by minimizing a suitable
447: %error criterion (e.g., MSE). Global, least-squares techniques are not
448: %applicable because measurements at all locations are equally considered to uncover
449: %trends and patterns in a particular region.
450: %
451: \begin{figure}
452: \begin{center}
453: \includegraphics[height=3in]{methodology}
454: \end{center}
455: %\vspace*{-\baselineskip}
456: \caption{The sampling methodology for SAL mining in data-scarce domains.}
457: \label{sampling-meth}
458: \end{figure}
459:
460: We present concrete realizations of the above methodology in the next section.
461: %As an example,
462: %Fig.~\ref{sampling-meth2} shows how the various stages in Fig.~\ref{sampling-meth} can be instantiated to help mine
463: %streamlines in vector fields.
But before we proceed, it is
pertinent to note an optional step in our methodology. The newly collected data value can be used to improve a {\it
surrogate} model, which then generates a dense data field for mining.
A surrogate function is used in lieu of the real
data source to generate sufficient data for mining purposes. This is often
more advantageous than working directly with sparse data. Surrogate models are widely used in
engineering design, optimization, and response surface approximations~\cite{ltw,response-book}.
471:
472: %The process can be terminated when the functional is within acceptable bounds or if we
473: %have exceeded a cost assumed for data collection.
474: %The instantiation of the methodology thus depends on $\mathcal{M}$, the
475: %functional involving $\mathcal{M}$, the way to
476: %optimize the functional, and the stopping criteria. The `right' choice
477: %is heavily dependent on the particular application and, if done wisely, can
478: %lead to substantial improvements in efficiency of data mining.
479: %This is best illustrated by examples, which we proceed to do with three applications.
480: %
481: %The common theme in all of these applications is that the formulation of $\mathcal{M}$
482: %reflects our prior knowledge of physical properties. Since computations in SAL
483: %are done in a bottom-up manner, difficulties encountered in decision-making (e.g.,
484: %ambiguity in classifying a streamline) are captured in $\mathcal{M}$ and used to
485: %drive top-down selection of sample locations. Table~\ref{compare3} compares the organization
486: %of the data mining methodology for each
487: %of the three applications. As we will show, this approach leads
488: %to highly effective sampling decisions that are also explainable in terms of
489: %problem structures and domain knowledge.
490:
Together, SAL and our focused sampling methodology address the main issues raised at
the beginning of the paper: SAL's uniform use of fields and abstraction operators allows
us to exploit prior knowledge in a bottom-up manner. Discrepancies as suggested by our
knowledge of physical properties (e.g., ambiguities) are used in a top-down manner by
the sampling methodology. Alternating these two stages leads to a closed-loop
data mining solution for data-scarce domains.
497:
498: \section{Example Applications}
499: %\vspace{-0.1in}
500: \subsection{Mining Pockets in Spatial Data}
501: %\vspace{-0.1in}
502: Our first application is motivated by the aircraft design problem and is meant
503: to illustrate the basic idea of our methodology. Here, we are given a spatial vector field
504: and we wish to identify {\it pockets} underlying the gradient. In a weather map, this might
505: mean identifying pressure troughs, for instance. The question is: `where should data be
506: collected so that we are able to mine the pockets with high confidence?' We begin by presenting
507: a mathematical function that gives rise to pockets in spatial fields. This function will
508: be used to validate and test our data mining and sampling methodology.
509:
510: \begin{figure}
511: \begin{center}
512: \begin{tabular}{cc}
513: \includegraphics[width=2.5in]{pocket}
514: \end{tabular}
515: \end{center}
516: \caption{A 2D pocket function.}
517: \label{fig:pocket-diag}
518: \end{figure}
519:
520: \subsubsection*{de Boor's function}
Carl de Boor invented a pocket function that exploits containment properties of the
$n$-sphere of radius 1 centered at the origin ($\sum_{i} x_i^2 \leq 1$) with respect to the
$n$-dimensional hypercube defined by $x_i \in [-1, 1]$, $i = 1, \ldots, n$. Even though the
sphere is embedded inside the cube, notice that the ratio of the volume of the cube ($2^n$) to that of the sphere
($\pi^{n/2} / (n/2)!$) grows unboundedly with $n$. This means that the
volume of a high-dimensional cube is concentrated in its corners (a
counterintuitive notion at first). de Boor exploited this
property to design a difficult-to-optimize function that has a
{\it pocket} in each corner of the cube (Fig.~\ref{fig:pocket-diag}),
just outside the sphere. Formally, it can be
defined as:
532: \begin{eqnarray}
\alpha({\mathbf X}) & = & \cos \left( \sum_{i=1}^n 2^i \left( 1 + \frac{x_i}{|x_i|} \right) \right) - 2 \\
534: \delta({\mathbf X}) & = & \| {\mathbf{X}} - 0.5 {\mathbf{I}}\| \\
535: p({\mathbf X}) & = & \alpha({\mathbf X}) ( 1 - \delta^2({\mathbf X})
536: (3 - 2\delta({\mathbf X}))) + 1
537: \end{eqnarray}
where ${\mathbf X}$ is the $n$-dimensional point $(x_1,x_2,\ldots,x_n)$
at which the pocket function $p$ is evaluated, ${\mathbf I}$ is the
identity $n$-vector, and $\|\cdot\|$ is the $L_2$ norm.
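
For readers who wish to experiment, the definition above translates directly into a short Python sketch. Two points are assumptions of this sketch rather than statements from the definition: ${\mathbf I}$ is read as the vector of all ones, and a coordinate equal to zero (where $x_i/|x_i|$ is undefined) is rejected with an error. The helper \texttt{cube\_to\_ball\_volume\_ratio} is likewise a hypothetical name; it uses the Gamma function to generalize $(n/2)!$ to odd $n$.

\begin{verbatim}
```python
import math

def pocket(x):
    """de Boor's pocket function p(X), following the three equations above."""
    if any(xi == 0 for xi in x):
        raise ValueError("x_i / |x_i| is undefined when a coordinate is zero")
    phase = sum(2 ** i * (1 + xi / abs(xi)) for i, xi in enumerate(x, start=1))
    alpha = math.cos(phase) - 2
    # Assumption: I is the all-ones vector, so 0.5 * I = (0.5, ..., 0.5).
    delta = math.sqrt(sum((xi - 0.5) ** 2 for xi in x))
    return alpha * (1 - delta ** 2 * (3 - 2 * delta)) + 1

def cube_to_ball_volume_ratio(n):
    """Volume of the cube [-1, 1]^n divided by the volume of the unit n-ball."""
    ball = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
    return 2 ** n / ball

value = pocket((0.5, 0.5))              # delta = 0, so p = cos(12) - 1
ratio = cube_to_ball_volume_ratio(30)   # on the order of 10^13
```
\end{verbatim}

Under these assumptions, $p(0.5, 0.5) = \cos(12) - 1 \approx -0.156$, and the cube-to-sphere volume ratio is $4/\pi \approx 1.27$ for $n = 2$ but roughly $4.9 \times 10^{13}$ for $n = 30$, which illustrates how quickly the corners come to dominate.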
541:
It is easily seen that $p$ has $2^n$ pockets (local minima); if $n$ is large
(say, 30, which means it will take more than a billion points,
$2^{30} \approx 1.07 \times 10^9$, to
just represent the corners of the $n$-cube!), naive global
545: optimization algorithms will require an unreasonable number of
546: function evaluations to find the pockets. Our goal for data mining here is to obtain a
547: qualitative indication of the existence, number, and locations of pockets, using
548: low-fidelity models and/or as few data points as possible. The results can then be
used to seed higher-fidelity calculations. This goal is fundamentally
different from that of DACE~\cite{dace}, polynomial response surface
approximations~\cite{ltw}, and other approaches in geostatistics,
where the goal is accuracy of functional prediction at untested data
points. Here, accuracy of estimation is traded for the ability to
mine pockets.
555:
556: \subsubsection*{Surrogate Function}
557: In this study, we use the SAL vector-field bundling code presented earlier along with
558: a surrogate model as the basis for generating a dense field
559: of data. Surrogate theory is an established area in engineering optimization and
560: there are several ways in which we can build a surrogate.
561: However, the local nature of SAL computations means that we can be selective about
562: our choice of surrogate representation.
For example, global, least-squares type
approximations are inappropriate since they weigh measurements at all locations
equally when uncovering trends and patterns in a particular
region. We advocate the use of kriging-type
567: interpolators~\cite{dace}, which are local modeling methods with roots
568: in Bayesian statistics. Kriging can handle situations with multiple
569: local extrema (for example, in weather data, remote sensing data,
570: etc.) and can easily exploit anisotropies and trends. Given $k$
571: observations, the interpolated model gives exact responses at these
572: $k$ sites and estimates values at other sites by minimizing the mean
573: squared error (MSE), assuming a random data process with zero mean and
574: a known covariance function.
575:
576: Formally (for two dimensions), the true function $p$ is assumed to be
577: the realization of a random process such as:
578: \begin{equation}
579: p(x,y) = \beta + Z(x,y)
580: \end{equation}
where $\beta$ is typically a uniform random variate, estimated based
on the known $k$ values of $p$, and $Z$ is a zero-mean random process
whose correlation function models the dependence between sites.
583: Kriging then estimates a model $p'$ of the same form, based on the
584: $k$ observations:
585: \begin{equation}
p'(x,y) = E(p(x,y) \midv p(x_1,y_1), \ldots, p(x_k,y_k))
587: \end{equation}
588: and minimizing mean squared error between $p'$ and $p$:
589: \begin{equation}\label{eq:MSE}
MSE = E\left[(p'(x,y) - p(x,y))^2\right]
591: \end{equation}
A typical choice for the covariance of $Z$ in $p'$ is $\sigma^2 R$, where scalar
593: $\sigma^2$ is the {\it estimated} variance, and correlation matrix $R$
594: encodes domain-specific constraints and reflects the current fidelity
595: of data. We use an exponential function for entries in $R$, with two
596: parameters $C_1$ and $C_2$:
597: \begin{equation}\label{eq:R}
598: R_{ij} = e^{-C_1|x_i-x_j|^2 - C_2|y_i-y_j|^2}
599: \end{equation}
600: Intuitively, values at closer points should be more highly correlated.
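As a small worked illustration of Eq.~\ref{eq:R}, consider three sites $(0,0)$, $(1,0)$, and
$(1,1)$ with $C_1 = C_2 = 1$ (both the sites and the parameter values are assumptions chosen
purely for illustration). The squared coordinate differences give
$$ R = \left[ \begin{array}{ccc}
1 & e^{-1} & e^{-2} \\
e^{-1} & 1 & e^{-1} \\
e^{-2} & e^{-1} & 1 \\
\end{array} \right]
\approx \left[ \begin{array}{ccc}
1 & 0.368 & 0.135 \\
0.368 & 1 & 0.368 \\
0.135 & 0.368 & 1 \\
\end{array} \right] $$
so the pairs of sites that differ in only one coordinate are more strongly correlated than the
diagonal pair. Choosing $C_1 \neq C_2$ would make the correlation decay at different rates along
the two axes, which is one way such a model can express anisotropy.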
601:
602: The estimator minimizing mean squared error is then obtained by
603: multi-dimensional optimization (the derivation from Eqs.~\ref{eq:MSE}
604: and~\ref{eq:R} is beyond the scope of this paper):
605: \begin{equation}\label{eq:optim1}
606: \max_C {\frac{-k}{2}}(\ln\sigma^2 + \ln |R|)
607: \end{equation}
608: This expression satisfies the conditions that there is no error
609: between the model and the true values at the chosen $k$ sites, and
610: that all variability in the model arises from the design of $Z$. The
611: multi-dimensional optimization is often performed by gradient descent
612: or pattern search methods. More details are available in~\cite{dace},
613: which demonstrates this methodology in the context of the design and
614: analysis of computer experiments.
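For concreteness, the resulting predictor can be sketched in the standard form found in the
DACE literature~\cite{dace}. The notation $\mathbf{p}$, $\mathbf{1}$, and $\mathbf{r}$ is
introduced here only for illustration, and that this is the exact variant intended is an
assumption. Let $\mathbf{p}$ be the vector of the $k$ observed values, $\mathbf{1}$ the
$k$-vector of ones, and $\mathbf{r}(x,y)$ the vector whose $i$th entry is the correlation
(in the sense of Eq.~\ref{eq:R}) between $(x,y)$ and $(x_i,y_i)$. Then
$$ \hat{\beta} = \frac{\mathbf{1}^T R^{-1} \mathbf{p}}{\mathbf{1}^T R^{-1} \mathbf{1}}, \qquad
\hat{\sigma}^2 = \frac{1}{k} (\mathbf{p} - \hat{\beta}\mathbf{1})^T R^{-1} (\mathbf{p} - \hat{\beta}\mathbf{1}), $$
$$ p'(x,y) = \hat{\beta} + \mathbf{r}(x,y)^T R^{-1} (\mathbf{p} - \hat{\beta}\mathbf{1}). $$
At an observed site $(x_i,y_i)$, $\mathbf{r}$ coincides with the $i$th column of the symmetric
matrix $R$, so $\mathbf{r}^T R^{-1}$ is the $i$th unit vector and $p'(x_i,y_i) = p(x_i,y_i)$.
This is the exact-interpolation property at the $k$ chosen sites.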
615:
616: \subsubsection*{Data Mining and Sampling Methodology}
617: The bottom-up computation of SAL aggregates from the surrogate model's outputs
618: will possibly lead to some ambiguous streamline classifications, as discussed earlier.
619: Ambiguity can reflect the desirability of acquiring data at or near a
620: specified point, to clarify the correct classification and to serve as
621: a mathematical criterion of information content.
622: There are several ways in which we can use information about ambiguity to drive
623: data collection. In this study, we express the ambiguities as a distribution describing
624: the number of possible good neighbors (for a streamline).
625: This {\it ambiguity distribution} provides a novel mechanism to include
626: qualitative information --- streamlines that agree will generally
627: contribute less to data mining, for information purposes. The information-theoretic measure
628: $M$ (ref. Fig.~\ref{sampling-meth}) was thus defined to be the ambiguity distribution $\wp$.
629:
630: The functional was defined as the posterior entropy $E(-\log d)$, where $d$ is the conditional
631: density of $\wp$ over the design space {\it not covered}
632: by the current data values. By a reduction argument, minimizing this posterior entropy can be
633: shown to be maximizing the prior entropy over the {\it unsampled} design space~\cite{dace}.
634: In turn, this means that the amount of information obtained from an experiment (additional data
635: collection) is maximized. In addition, we also incorporated $\wp$ as an indicator covariance term in
636: our surrogate model (this is a conventional method
637: for including qualitative information in an interpolatory model~\cite{journel}).
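To illustrate how an entropy functional of this kind behaves, consider a generic discrete
sketch in which the four-cell partition and the masses $d_j$ are hypothetical. Suppose the
unsampled design space is divided into four cells and $d_j$ is the probability mass that $d$
assigns to cell $j$. Then
$$ E(-\log d) = -\sum_{j=1}^{4} d_j \ln d_j , $$
which equals $\ln 4 \approx 1.386$ for the uniform case $d = (1/4,1/4,1/4,1/4)$,
$\ln 2 \approx 0.693$ for $d = (1/2,1/2,0,0)$, and $0$ for $d = (1,0,0,0)$ (with the
convention $0 \ln 0 = 0$). The entropy is largest when the mass is spread evenly and
vanishes when it is concentrated in a single cell.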
638:
639: \subsubsection*{Experimental Results}
640: The initial experimental configuration used a face-centered design ($4$ points in the 2D case). A
641: surrogate model by kriging interpolation then generated data on a $41^n$-point grid.
642: de Boor's function was used as the source for data values; we also employed pseudorandom perturbations
643: of it that shift the pockets from the corners in a somewhat unpredictable
644: way (see~\cite{ambig} for details). In total, we experimented with 100 perturbed
645: variations (each) of the 2D and 3D pocket functions. For each of these cases, data collection was organized
646: in rounds of one extra sample each (that minimizes the above functional). The number of samples needed
647: to mine all the pockets by SAL was recorded. We also compared our results with those obtained
648: from a pure DACE/kriging approach (i.e., where sampling was directed at improving accuracy of function estimation). In other words, we used the DACE methodology to suggest
649: new locations for data collection and determined how these choices fared with respect
650: to mining the pockets.
651:
652: \begin{figure}
653: \begin{center}
654: \includegraphics[width=3in]{pocket-results}
655: \end{center}
656: %\vspace*{-\baselineskip}
657: \caption{Pocket-finding results (2D) show that focused sampling using a measure
of ambiguity always requires fewer total samples (7--15) than conventional kriging (17--23).}
659: \label{fig:pocket-bar}
660: \end{figure}
661:
662: \begin{figure}
663: \begin{center}
664: \begin{tabular}{cc}
665: \includegraphics[width=3in]{design} &
666: \includegraphics[width=2.5in]{results2sal}
667: \end{tabular}
668: \end{center}
669: \caption{Mining pockets in 2D from only 7 sample points.
670: (left)
671: The chosen sample locations: 4 initial
672: face-centered samples (marked as blue circles) plus 3
673: samples selected by our methodology (marked as red diamonds). Note that no additional sample is required in
674: the lower-left quadrant. (right)
675: SAL structures in surrogate
676: model data, confirming the existence of four pockets.}
677: \label{fig:pocket}
678: \end{figure}
679:
680: Fig.~\ref{fig:pocket-bar} shows the distributions of total number of data samples
681: required to mine the four pockets for the 2D case. We were thus able to mine the 2D pockets
682: using 3 to 11 additional samples, whereas the conventional kriging approach required
13 to 19 additional samples. The results were more striking in the 3D case:
at most 42 additional samples for focused sampling and up to 151 points for conventional
kriging. This shows that our focused sampling methodology performs 40--75\% better
than sampling by conventional kriging.
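As a rough arithmetic check on these figures (pairing the worst cases of the two methods is an
assumption about how the comparison is made), the relative reduction in additional samples is
$$ 1 - \frac{11}{19} \approx 0.42 \quad \mbox{(2D)}, \qquad 1 - \frac{42}{151} \approx 0.72 \quad \mbox{(3D)}, $$
both of which fall within the quoted range.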
687:
688: Fig.~\ref{fig:pocket} (left)
689: describes a 2D design involving only $7$ total data points that is able to mine the four pockets.
690: Counterintuitively, no additional sample is required in the lower left quadrant! While this
691: will lead to a highly sub-optimal design (from the traditional viewpoint
692: of minimizing variance in predicted values), it is nevertheless an appropriate design
693: for data mining purposes. In particular, this means that neighborhood
694: calculations involving the other three quadrants are enough to uncover
695: the pocket in the fourth quadrant. Since the kriging interpolator
696: uses local modeling and since pockets in 2D effectively occupy the
697: quadrants, obtaining measurements at ambiguous locations serves to
698: capture the relatively narrow regime of each dip, which in turn helps
699: to distinguish the pocket in the neighboring quadrant. This effect is hard
700: to achieve without exploiting knowledge of physical properties, in this case,
701: locality of the dips.
702:
703: \subsection{Qualitative Jordan Form Determination}
704: %\vspace{-0.1in}
705: In our second application, we use our methodology to identify the most probable
706: Jordan form of a given matrix. This is a good application for data mining
707: since the direct computation of the Jordan form leads to a numerically
708: unstable algorithm.
709:
710: \subsubsection*{Jordan forms}
711: A matrix $\mathcal{A}$ (real or complex) that has $r$ independent eigenvectors has a Jordan form that
712: consists of $r$ {\it blocks}. Each of these blocks is an upper triangular
713: matrix that is associated with one of the eigenvectors of
714: $\mathcal{A}$ and whose size describes the multiplicity of the
715: corresponding eigenvalue. For the given matrix $\mathcal{A}$,
716: the diagonalization thus posits a nonsingular matrix $\mathcal{B}$ such that:
717: \begin{equation}
\mathcal{B}^{-1} \mathcal{A} \mathcal{B} = \left[ \begin{array}{cccc}
{\mathcal{J}}_1 & & & \\
& {\mathcal{J}}_2 & & \\
& & \ddots & \\
& & & {\mathcal{J}}_r\\
\end{array}
\right]
725: \end{equation}
726: where
727: \begin{equation}
{\mathcal{J}}_i = \left[ \begin{array}{cccc}
\lambda_i & 1 & & \\
& \ddots & \ddots & \\
& & \lambda_i & 1\\
& & & \lambda_i\\
\end{array}
\right]
735: \end{equation}
736: and $\lambda_i$ is the eigenvalue revealed by the $i$th Jordan block ($\mathcal{J}_i$).
737: The Jordan form is most easily explained by looking at how eigenvectors are
738: distributed for a given eigenvalue. Consider, for example, the matrix
739: $$ \left[ \begin{array}{crr}
740: 1 & 1 & -1 \\
741: 0 & 0 & 2 \\
742: 0 & -1 & 3 \\
743: \end{array}
744: \right]$$
745: that has eigenvalues at 1, 1, and 2. This matrix has only two
746: eigenvectors, as revealed by the two-block structure of its Jordan form:
747: $$\left[ \begin{array}{cr|r}
748: 1 & 1 & 0 \\
749: 0 & 1 & 0 \\ \hline
750: 0 & 0 & 2 \\
751: \end{array}
752: \right]$$
753: The Jordan form is unique modulo shufflings of the blocks and, in this case,
754: shows that there is one eigenvalue ($1$) of multiplicity $2$ and one eigenvalue
($2$) of multiplicity $1$. We say that the matrix has the
756: Jordan structure given by
757: $(1)^2 (2)^1$. In contrast, the matrix
758: $$ \left[ \begin{array}{crr}
759: 1 & 0 & 0 \\
760: 0 & 2 & 0 \\
761: 0 & 0 & 1 \\
762: \end{array}
763: \right]$$
764: has the same eigenvalues but a three-block Jordan structure:
765: $$\left[ \begin{array}{c|r|r}
766: 1 & 0 & 0 \\ \hline
767: 0 & 1 & 0 \\ \hline
768: 0 & 0 & 2 \\
769: \end{array}
770: \right]$$
771: This is because there are three independent eigenvectors (the unit vectors,
772: actually). The diagonalizing matrix is thus the identity matrix and the
773: Jordan form has three permutations. The Jordan structure is therefore
774: given by $(1)^1
775: (1)^1 (2)^1$. These two examples show that
776: a given eigenvalue's multiplicity could be distributed across one, many, or
777: all Jordan blocks. Correlating the eigenvalue with the block structure is
778: an important problem in numerical analysis.
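The structure of the first example matrix above can be verified by hand; the following is a
routine check, included only as a worked illustration. Writing $\mathcal{A}$ for that matrix
and $I$ for the identity, expansion along the first column gives
$$ \det(\mathcal{A} - \lambda I) = (1-\lambda)\left[(-\lambda)(3-\lambda) + 2\right]
= (1-\lambda)(\lambda-1)(\lambda-2) = -(\lambda-1)^2(\lambda-2), $$
so the eigenvalues are $1$, $1$, and $2$. For $\lambda = 1$,
$$ \mathcal{A} - I = \left[ \begin{array}{crr}
0 & 1 & -1 \\
0 & -1 & 2 \\
0 & -1 & 2 \\
\end{array} \right] $$
has rank $2$, so its null space is one-dimensional and is spanned by $(1,0,0)^T$. The double
eigenvalue $1$ therefore contributes a single eigenvector and hence a single $2 \times 2$ block.
Together with the eigenvector $(0,1,1)^T$ for $\lambda = 2$, this accounts for the two blocks.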
779:
780: The typical approach to computing the Jordan form is to `follow the staircase'
781: pattern of the structure and perform rank determinations in conjunction
782: with ascertaining the eigenvalues. One of the more serious
783: caveats with such an approach involves mistaking an eigenvalue of multiplicity
784: $> 1$ for multiple eigenvalues~\cite{staircase}.
785: In the first example matrix
786: above, this might lead to inferring that the Jordan form has three
787: blocks.
788: The extra care needed to safeguard staircase algorithms usually
789: involves more complexity than the original computation to be performed!
790: The ill-conditioned nature of this computation has thus
791: traditionally prompted numerical analysts to favor other, more stable,
792: decompositions.
793:
794: \begin{figure}
795: \begin{center}
796: \begin{tabular}{cc}
797: \includegraphics[width=0.25\textwidth]{example-jordan1} &
798: \includegraphics[width=0.25\textwidth]{example-jordan2} \\
799: \end{tabular}
800: \end{center}
801: \caption{Superimposed spectra for assessing the Jordan form
802: of the Brunet matrix. Two Jordan blocks of multiplicity 3 are
803: observed for eigenvalue 7, at different (left, right) perturbation levels.}
804: \label{fig:jordan}
805: \end{figure}
806:
807: \subsubsection*{Qualitative assessment of Jordan forms}
808: A recent development has been the acceptance of a qualitative approach
809: to Jordan structure determination, proposed by Chaitin-Chatelin and
810: Frayss\'{e}~\cite{precise}. This approach does not employ the staircase
811: idea and, instead, exploits a semantics of eigenvalue perturbations to
812: infer multiplicity. This leads to a geometrically intuitive algorithm that
813: can be implemented using SAL.
814:
815: Consider a matrix that has eigenvalues $\lambda_1, \lambda_2,
816: \cdots, \lambda_n$ with multiplicities $\rho_1, \rho_2, \cdots, \rho_n$
817: (resp). Any attempt at finding the eigenvalues (e.g., determining
818: the roots of the characteristic polynomial) is intrinsically subject
819: to the numerical analysis dogma: the problem being solved will
820: actually be a {\it perturbed} version of the original problem. This allows
821: the expression of the {\it computed} eigenvalues in terms of perturbations
822: on the actual eigenvalues. It can be easily seen that the computed
823: eigenvalue corresponding to any $\lambda_k$ will be distributed on
824: the complex plane as:
$$\lambda_k + |\Delta|^{1/\rho_k} e^{i\phi/\rho_k}$$
where the phase $\phi$ of the perturbation $\Delta$ ranges over $\{2\pi, 4\pi,
\ldots, 2\rho_k \pi\}$ if $\Delta$ is positive and
over $\{3\pi, 5\pi, \ldots, (2\rho_k+1) \pi\}$ if $\Delta$ is negative. The
829: insight
830: in~\cite{precise} is to {\it superimpose} numerous such perturbed
831: calculations graphically so that the aggregate picture reveals the $\rho_k$ of
832: the eigenvalue $\lambda_k$. Notice that the phase variations
833: imply that the computed eigenvalues will be lying on the
834: vertices of a regular polygon centered on the {\it actual} eigenvalue
835: and where the number of sides is {\it two times} the multiplicity of the
836: considered eigenvalue (this takes into account both positive and
837: negative $\Delta$). Since the diameter of the polygon is influenced
838: by $\Delta$, iterating this process over many $\Delta$ will lead to a
839: `sticks' depiction of the Jordan form.
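The simplest instance of this behavior can be worked out directly. This is a textbook sketch
under the assumption that a real perturbation $\Delta$ enters only the lower-left entry of a
single $\rho \times \rho$ Jordan block with eigenvalue $\lambda$; general perturbations need not
take exactly this form. The characteristic equation of the perturbed block is
$$ (z - \lambda)^{\rho} = \Delta , $$
whose roots are $z = \lambda + |\Delta|^{1/\rho} e^{i\phi/\rho}$, with $\phi$ an even multiple
of $\pi$ for $\Delta > 0$ and an odd multiple of $\pi$ for $\Delta < 0$. For $\rho = 3$, the
phases $2\pi, 4\pi, 6\pi$ give the angles $2\pi/3$, $4\pi/3$, $0$, and the phases
$3\pi, 5\pi, 7\pi$ give the angles $\pi$, $5\pi/3$, $\pi/3$. Together these are the six vertices
of a regular hexagon. With $|\Delta| = 2^{-45}$ (an illustrative value), the radius is
$2^{-15} \approx 3.1 \times 10^{-5}$, far larger than the perturbation itself, which is why
such small perturbations produce visible structure.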
840:
841: To illustrate, we choose a matrix whose computations will
842: be more prone to finite precision errors. Perturbations on
843: the 8-by-8 Brunet matrix~\cite{precise} with Jordan structure
844: $(-1)^1 (-2)^1 (7)^3 (7)^3$ induce the superimposed structures shown in
Fig.~\ref{fig:jordan}. The left part of Fig.~\ref{fig:jordan} depicts
normwise relative perturbations in the range
$[2^{-50},2^{-40}]$. The six sticks around the eigenvalue at 7
clearly reveal that its Jordan block is of size 3. The
other Jordan block, also centered at 7, is revealed if we conduct
our exploration at a finer perturbation level. The right part of Fig.~\ref{fig:jordan}
reveals the second Jordan block using perturbations in the
852: range $[2^{-53},2^{-50}]$. The noise in both pictures is a consequence
853: of (i) having two Jordan blocks with the same size, and (ii)
854: a `ring' phenomenon studied in~\cite{edelman-ma}; we do
855: not attempt to capture these effects in this paper.
856:
857: \begin{figure}[t]
858: \begin{center}
859: \includegraphics[width=2in]{star-trig2a}\hspace*{0.25in}\includegraphics[width=2in]{star-trig4a} \\
860: \vspace*{0.1in}
861: \includegraphics[width=2in]{star-corr2a}\hspace*{0.25in}\includegraphics[width=2in]{star-corr4a} \\
862: \end{center}
863: \vspace*{-\baselineskip}
\caption{Mining Jordan forms from (left) a small sample set, and (right)
a large sample set. (top) Approximately congruent triangles. (bottom)
Evaluation
of correspondence of rotated triangles in terms of match
between original (red dots) and rotated (green circles) samples.}
874: \label{fig:star-demo}
875: %\vspace*{-\baselineskip}
876: \end{figure}
877:
878: \subsubsection*{Data Mining and Sampling Methodology}
879: For this study, we collect data by random normwise perturbations
880: in a given region and a SAL program determines
881: multiplicity by detecting symmetry correspondence in the samples. The first
882: aggregation level collects the samples for a given perturbation into
883: triangles. The second aggregation level finds congruent triangles via
884: geometric hashing~\cite{hash}, and uses congruence to establish
885: an analogy relation among triangle vertices. This relation is then abstracted
886: into a rotation about a point (the eigenvalue), and evaluated for whether
887: each point rotates onto another and whether matches define regular
888: polygons. A third level then compares rotations across
889: different perturbations, re-visiting perturbations or choosing new
890: perturbations in order to disambiguate (see Fig.~\ref{fig:star-demo}).
891: The end result of this
892: analysis is a confidence measure on models of possible Jordan forms.
893: Each model is defined by its estimate of $\lambda$ and $\rho$ (notice that
894: we are working only within one region at a time). The measure
895: $M$ was defined to be the joint probability distribution over the space of $\lambda$ and $\rho$.
896:
897: \subsubsection*{Experimental Results}
Since our Jordan form computation treats multiple perturbations (irrespective of level)
899: as {\it independent} estimates of eigenstructure, the idea of sampling
900: here is not `where to collect,' but `how much to collect.' The goal of
901: data mining is hence to improve our confidence in model evaluation.
902: We organized data collection into rounds of 6-8 samples each,
903: varied a tolerance parameter for triangle
904: congruence from 0.1 to 0.5 (effectively increasing the number of
905: models posited), and determined the number of rounds needed to
906: determine the Jordan form. As test cases, we used the set of matrices
907: studied in~\cite{precise}.
908: On average, our focused sampling approach required 1 round of data collection
909: at a tolerance of 0.1 and up to 2.7 rounds at 0.5. Even with a large number
910: of models posited, additional data quickly weeded out bad models.
911: Fig.~\ref{fig:star-demo} demonstrates this mechanism on the Brunet matrix
912: discussed above for two sets of sample points.
To the best of our knowledge, this is the only
known focused sampling methodology for this domain; we hence are unable to
915: present any comparisons. However, it is clear that by harnessing domain knowledge
916: about correspondences, we have arrived at an intelligent sampling methodology that
917: resembles what a human would obtain by visual inspection.
918:
919: \section{Discussion}
920: The presented methodology for mining in data-scarce domains has several
921: intrinsic benefits. First, it is based on a uniform vocabulary of operators that
922: can be exploited for a rich diversity of applications. Second, it
demonstrates a novel factorization of the problem of mining when data is scarce,
924: namely, formulating an experiment design methodology to clarify, disambiguate,
925: and improve confidences in higher-level aggregates of data.
926: This allows us to bridge qualitative and quantitative
927: information in a unified framework. SAL programs thus uncover bottom-up structures in data
928: systematically and use difficulties encountered in this process (ambiguities,
929: lack of correspondences) to guide top-down selection of additional
930: data samples. By using knowledge of physical properties explicitly, our
931: approach can provide more holistic and explainable results than off-the-shelf data
932: mining algorithms. Third, our methodology can co-exist with
933: more traditional approaches to problem solving (numerical analysis, optimization)
934: and is not meant to be a replacement or a contrasting approach. This is amply
935: demonstrated in each of the two applications above, where connections with various
936: traditional methodologies have been carefully established.
937:
938: The methodology makes several intrinsic assumptions which we only briefly mention
939: here. All of our applications have been such that the cause, formation, and
940: effect of the relevant physical properties are well understood. This is precisely what
941: allows us to act decisively based on higher-level information from SAL aggregates,
942: through the measure $M$. It also assumes that the problems that will be
943: encountered by the mining algorithm are the same as the problems for which
944: it was designed. This is an inheritance from Bayesian inductive inference and
945: leads to fundamental limitations on what can be done in such a setting. For instance, if
946: new data does not help clarify an ambiguity,
947: does the fault lie with the model (SAL higher-level aggregate) or with the
948: data? We can summarize this problem by saying that the approach requires strong {\it a priori} information about
949: what is possible and what is not.
950:
Nevertheless, by advocating targeted use of domain-specific knowledge and
aiding qualitative model selection, our methodology is more efficient at determining
high-level models from empirical data. Together, SAL and our information-theoretic measure $M$
encapsulate knowledge about physical properties and this is what makes our
methodology a viable one for data mining purposes.
In the future, we aim to characterize more formally
957: the particular forms of domain knowledge that help overcome sparsity and
958: noise in scientific datasets.
959:
960: It should be mentioned that while the two studies formulate their sampling objectives differently, they
961: are naturally supported by the SAL framework:
962: \begin{itemize}
963: \item (pockets) Where should I collect data in order to mine the pockets with high
964: confidence?
965: \item (Jordan forms) How much data should I collect in order to determine the right
966: Jordan form with high confidence?
967: \end{itemize}
968: One could imagine extending our framework to also take into account the expense
969: of data samples. If the cost of data collection is non-uniform across the domain, then
including this in the design of our functional will allow us to trade off the cost
of gathering information against the expected improvement
972: in problem solving performance. This area of data mining is referred to as {\it active
973: learning.}
974:
975: Data mining can sometimes be a controversial term in a discipline that is used to
mathematical rigor; this is because it is often used synonymously with `lack of a hypothesis
977: or theory.' We hope to have convinced the reader that this need not be the case and
978: that data mining can indeed be sensitive to knowledge about the domain, especially
979: physical properties of the kind we have harnessed here. As data mining
980: applications become more prevalent in science, the need to incorporate {\it a priori}
981: domain knowledge will only become more important.
982:
983: \begin{thebibliography}{10}
984:
985: \bibitem{ambig}
986: C.~Bailey-Kellogg and N.~Ramakrishnan.
987: \newblock {Ambiguity-Directed Sampling for Qualitative Analysis of Sparse Data
988: from Spatially Distributed Physical Systems}.
989: \newblock In {\em Proceedings of the Seventeenth International Joint Conference
990: on Artificial Intelligence (IJCAI-01)}, pages 43--50, 2001.
991:
992: \bibitem{bailey-kellogg01}
993: C.~Bailey-Kellogg and F.~Zhao.
994: \newblock {Influence-Based Model Decomposition for Reasoning about Spatially
995: Distributed Physical Systems}.
996: \newblock {\em Artificial Intelligence}, Vol. 130(2):pages 125--166, 2001.
997:
998: \bibitem{bailey-kellogg96}
999: C.~Bailey-Kellogg, F.~Zhao, and K.~Yip.
1000: \newblock {Spatial Aggregation: Language and Applications}.
1001: \newblock In {\em Proceedings of the Thirteenth National Conference on
1002: Artificial Intelligence (AAAI'96)}, pages 517--522, 1996.
1003:
1004: \bibitem{berleant-kuipers}
1005: D.~Berleant and B.~Kuipers.
1006: \newblock {Qualitative and Quantitative Simulation: Bridging the Gap}.
1007: \newblock {\em Artificial Intelligence}, Vol. 95(2):pages 215--255, 1998.
1008:
1009: \bibitem{precise}
1010: F.~Chaitin-Chatelin and V.~Frayss\'{e}.
1011: \newblock {\em Lectures on Finite Precision Computations}.
1012: \newblock SIAM Monographs, 1996.
1013:
1014: \bibitem{Easterling}
1015: R.G. Easterling.
1016: \newblock {Comment on `Design and Analysis of Computer Experiments'}.
1017: \newblock {\em {Statistical Science}}, Vol. 4(4):pages 425--427, {1989}.
1018:
1019: \bibitem{edelman-ma}
1020: A.~Edelman and Y.~Ma.
1021: \newblock {Non-Generic Eigenvalue Perturbations of {Jordan} Blocks}.
1022: \newblock {\em Linear Algebra \& Applications}, Vol. 273(1-3):pages 45--63,
1023: 1998.
1024:
1025: \bibitem{staircase}
1026: A.~Edelman and Y.~Ma.
1027: \newblock {Staircase Failures Explained by Orthogonal Versal Forms}.
1028: \newblock {\em SIAM Journal on Matrix Analysis and Applications}, Vol.
1029: 21(3):pages 1004--1025, 2000.
1030:
1031: \bibitem{ganti-ieee}
1032: V.~Ganti, J.~Gehrke, and R.~Ramakrishnan.
1033: \newblock {Mining Very Large Databases}.
1034: \newblock {\em IEEE Computer}, Vol. 32(8):pages 38--45, August 1999.
1035:
1036: \bibitem{vizcraft}
1037: A.~Goel, C.A. Baker, C.A. Shaffer, B.~Grossman, W.H. Mason, L.T. Watson, and
1038: R.T. Haftka.
1039: \newblock {VizCraft: A Problem-Solving Environment for Aircraft Configuration
1040: Design}.
1041: \newblock {\em IEEE/AIP Computing in Science and Engineering}, Vol. 3(1):pages
1042: 56--66, 2001.
1043:
1044: \bibitem{journel}
1045: A.~Journel.
\newblock {Constrained Interpolation and Qualitative Information - The Soft
1047: Kriging Approach}.
1048: \newblock {\em {Mathematical Geology}}, Vol. 18(2):pages 269--286, November
1049: {1986}.
1050:
1051: \bibitem{mannila}
1052: J.~Kivinen and H.~Mannila.
1053: \newblock {The Use of Sampling in Knowledge Discovery}.
1054: \newblock In {\em Proceedings of the Thirteenth ACM Symposium on Principles of
1055: Database Systems}, pages 77--85, 1994.
1056:
1057: \bibitem{ltw}
1058: D.L. Knill, A.A. Giunta, C.A. Baker, B.~Grossman, W.H. Mason, R.T. Haftka, and
1059: L.T. Watson.
1060: \newblock {Response Surface Models Combining Linear and Euler Aerodynamics for
1061: Supersonic Transport Design}.
\newblock {\em Journal of Aircraft}, Vol. 36(1):pages 75--86, 1999.
1063:
1064: \bibitem{hash}
1065: Y.~Lamdan and H.~Wolfson.
1066: \newblock {Geometric Hashing: A General and Efficient Model-Based Recognition
1067: Scheme}.
1068: \newblock In {\em Proceedings of the Second International Conference on
1069: Computer Vision (ICCV)}, pages 238--249, 1988.
1070:
1071: \bibitem{response-book}
1072: R.H. Myers and D.C. Montgomery.
1073: \newblock {\em Response Surface Methodology: Process and Product Optimization
1074: using Designed Experiments}.
1075: \newblock Wiley, Jan 2002.
1076:
1077: \bibitem{ordonez00}
1078: I.~{Ord\'{o}\~{n}ez} and F.~Zhao.
1079: \newblock {{STA}: Spatio-Temporal Aggregation with Applications to Analysis of
1080: Diffusion-Reaction Phenomena}.
1081: \newblock In {\em Proceedings of the Seventeenth National Conference on
1082: Artificial Intelligence (AAAI'00)}, pages 517--523, 2000.
1083:
1084: \bibitem{naren-ayg-advances}
1085: N.~Ramakrishnan and A.Y. Grama.
1086: \newblock {Mining Scientific Data}.
1087: \newblock {\em Advances in Computers}, Vol. 55:pages 119--169, Sep 2001.
1088:
1089: \bibitem{dace}
1090: J.~Sacks, W.J. Welch, T.J. Mitchell, and H.P. Wynn.
1091: \newblock {Design and Analysis of Computer Experiments}.
1092: \newblock {\em {Statistical Science}}, Vol. 4(4):pages 409--435, {1989}.
1093:
1094: \bibitem{yip96a}
1095: K.M. Yip and F.~Zhao.
1096: \newblock {Spatial Aggregation: Theory and Applications}.
1097: \newblock {\em Journal of Artificial Intelligence Research}, Vol. 5:pages
1098: 1--26, 1996.
1099:
1100: \bibitem{yip95b}
1101: K.M. Yip, F.~Zhao, and E.~Sacks.
1102: \newblock {Imagistic Reasoning}.
1103: \newblock {\em ACM Computing Surveys}, Vol. 27(3):pages 363--365, 1995.
1104:
1105: \end{thebibliography}
1106: \end{document}
1107:
1108: