1: % Template article for preprint document class `elsart'
2: % SP 2001/01/05
3: 
4: \documentclass{elsart}
5: 
6: % Use the option doublespacing or reviewcopy to obtain double line spacing
7: % \documentclass[doublespacing]{elsart}
8: 
9: % if you use PostScript figures in your article
10: % use the graphics package for simple commands
11: % \usepackage{graphics}
12: % or use the graphicx package for more complicated commands
13: \usepackage{graphicx}
14: % or use the epsfig package if you prefer to use the old commands
15: % \usepackage{epsfig}
16: 
17: % The amssymb package provides various useful mathematical symbols
18: \usepackage{amssymb}
19: 
20: \begin{document}
21: 
22: \begin{frontmatter}
23: 
24: % Title, authors and addresses
25: 
26: % use the thanksref command within \title, \author or \address for footnotes;
27: % use the corauthref command within \author for corresponding author footnotes;
28: % use the ead command for the email address,
29: % and the form \ead[url] for the home page:
30: % \title{Title\thanksref{label1}}
31: % \thanks[label1]{}
32: % \author{Name\corauthref{cor1}\thanksref{label2}}
33: % \ead{email address}
34: % \ead[url]{home page}
35: % \thanks[label2]{}
36: % \corauth[cor1]{}
37: % \address{Address\thanksref{label3}}
38: % \thanks[label3]{}
39: 
40: \title{Perspectives in astrophysical databases}
41: 
42: % use optional labels to link authors explicitly to addresses:
43: % \author[label1,label2]{}
44: % \address[label1]{}
45: % \address[label2]{}
46: 
47: \author[addr1]{Marco Frailis}
48: \author[addr2]{Alessandro De Angelis}
49: \author[addr3]{Vito Roberto}
50: 
51: \address[addr1]{Dipartimento di Fisica, Universit\`a di Udine, via delle Scienze 208, 33100 Udine, Italy}
52: \address[addr2]{INFN, Sezione di Trieste, Gruppo Collegato di Udine, via delle Scienze 208, 33100 Udine, Italy}
53: \address[addr3]{Dipartimento di Matematica e Informatica, Universit\`a di Udine, via delle Scienze 208, 33100 Udine, Italy}
54: 
55: 
56: \begin{abstract}
57: % Text of abstract
58:   Astrophysics has become a domain extremely rich in scientific data. Data
59:   mining tools are needed for information extraction from such large datasets.
60:   This calls for an approach to data management emphasizing the efficiency and
61:   simplicity of data access; efficiency is obtained using multidimensional
62:   access methods and simplicity is achieved by properly handling metadata.
63:   Moreover, clustering and classification techniques on large datasets pose
64:   additional requirements in terms of computation and memory scalability and
65:   interpretability of results. In this study we review some possible solutions.
66: \end{abstract}
67: 
68: \begin{keyword}
69: % keywords here, in the form: keyword \sep keyword
70:   Multidimensional Indexing, Data Mining, Astrophysical Databases, Data Warehousing
71: % PACS codes here, in the form: \PACS code \sep code
72: \PACS 
73: \end{keyword}
74: \end{frontmatter}
75: 
76: % main text
77: \section{Introduction}
78: \label{introduction}
79: 
80: At present, astrophysics is a discipline in which the exponential growth and
81: heterogeneity of data require the use of data mining techniques. The primary
82: sources of astronomical data are the systematic sky surveys over a wide energy
83: range (from $10^{-7}$ eV to $10^{13}$ eV). Large archives and digital sky
84: surveys with dimensions of $10^{12}$ bytes currently exist, while in the near
85: future they will reach sizes of the order of $10^{15}$ bytes.  Numerical
86: simulations are also producing comparable volumes of information.
87: 
88: Several research fields require analysis across multiple energy bands and,
89: consequently, data from different missions. Therefore, the use of data mining
90: techniques is necessary to maximize the information extracted from such a
91: growing quantity of data. This task is made harder by several issues: the
92: heterogeneity of astronomical data, due in part to their high dimensionality
93: (including both spatial and temporal components) and in part to the
94: multiplicity of instruments and projects; and the use, to organize
95: astrophysical data, of traditional operational systems, in which the emphasis
96: is on data normalization. Data mining for multi-wavelength analysis therefore
97: requires an informational system, or data warehouse, as a model for data
98: management, and the definition of a common set of metadata to guarantee the
99: interoperability between different archives and a more efficient data
100: exploration.
101: 
102: \section{Towards a data warehouse}
103: Most of the online resources available to the astrophysics community are
104: simple data archives containing observational parameters (detector, type of
105: observation, coordinates, astronomical object, exposure time, etc.).  Many
106: astronomical catalogs can be accessed online, but it is still difficult to
107: correlate objects in different archives or access multiple catalogs
108: simultaneously. Some progress in this direction has been made by
109: projects like Vizier, Aladin and SkyView \cite{Viz,Sky}.
110: 
111: With an ideal astrophysical database, users should be able to perform
112: queries based on scientific parameters (magnitude, redshift, spectral indices,
113: morphological type of galaxies, etc.), easily discover the object types
114: contained in the archive and the properties available for each type, and define
115: the set of objects they are interested in by constraining the values of
116: their scientific properties along with the desired level of detail~\cite{DSZDG00}.
117: 
118: These requirements can be satisfied by organizing data in a data
119: warehouse. A data warehouse can be defined as a \emph{subject-oriented},
120: \emph{integrated}, \emph{time-variant} and \emph{non-volatile} data collection
121: \cite{Inm97}. In a data warehouse, data are arranged in a structure that can be easily
122: explored and queried, with fewer tables and keys than the equivalent relational
123: model. One starts from a relational model, but some restrictions are introduced by
124: using \emph{facts}, \emph{dimensions}, \emph{hierarchies} and \emph{measures} in
125: a characteristic star-shaped structure called a \emph{star schema}~\cite{Pet94}. The
126: central table is called the ``fact'' table and it is the highest-dimensional table of
127: the schema. It can represent a particular phenomenon that we want to study. This
128: table is surrounded by a number of tables, called ``dimensions'', which
129: represent entities related to the phenomenon to be studied and are connected to the
130: central table, forming the points of the star. Within the dimensions, attributes
131: are arranged in hierarchies, determining the ``drill-down'' and ``roll-up''
132: operations available on each dimension: the result is a tree that the user can
133: visit from the root to the leaves, refining the query (drill-down) or
134: generalizing it (roll-up).
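
As a rough illustration of the star schema and of the roll-up and drill-down
operations, the following Python sketch builds a toy warehouse with the
standard \texttt{sqlite3} module. All table names, column names and values
(\texttt{fact\_source}, \texttt{dim\_band}, \texttt{dim\_instrument}, the
mission and band labels, the function \texttt{summarize}) are hypothetical and
chosen only for illustration; they are not taken from any existing archive.

\begin{verbatim}
import sqlite3

# Hypothetical star schema: one fact table surrounded by two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_instrument (
    instrument_id INTEGER PRIMARY KEY, mission TEXT, detector TEXT);
CREATE TABLE dim_band (
    band_id INTEGER PRIMARY KEY, regime TEXT, band TEXT);  -- hierarchy: regime > band
CREATE TABLE fact_source (
    source_id INTEGER PRIMARY KEY,
    instrument_id INTEGER REFERENCES dim_instrument(instrument_id),
    band_id INTEGER REFERENCES dim_band(band_id),
    flux REAL);                                            -- the measure
INSERT INTO dim_instrument VALUES (1, 'MissionA', 'DetX'), (2, 'MissionB', 'DetY');
INSERT INTO dim_band VALUES
    (1, 'X-ray', 'soft'), (2, 'X-ray', 'hard'), (3, 'gamma-ray', 'GeV');
INSERT INTO fact_source VALUES
    (1, 1, 1, 2.0), (2, 1, 2, 3.0), (3, 2, 3, 5.0), (4, 1, 1, 4.0);
""")

# Levels of the band hierarchy, from coarse to fine.
LEVELS = {"regime": "b.regime", "band": "b.regime, b.band"}

def summarize(level):
    """Count sources and sum their flux at one level of the band hierarchy."""
    cols = LEVELS[level]
    query = (f"SELECT {cols}, COUNT(*), SUM(f.flux) "
             f"FROM fact_source AS f JOIN dim_band AS b ON f.band_id = b.band_id "
             f"GROUP BY {cols}")
    return sorted(conn.execute(query).fetchall())

rolled_up = summarize("regime")     # coarse view of the facts
drilled_down = summarize("band")    # finer view of the same facts
\end{verbatim}

Moving from \texttt{rolled\_up} to \texttt{drilled\_down} only changes the
grouping columns taken from the dimension table; the fact table is left
untouched, which is what keeps such queries simple to write.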
135: 
136: Metadata play an important role. A researcher needs information about the
137: environment in which the data were gathered (for example, the date and method
138: of acquisition, internal or external error estimates, and the purpose of the
139: data) in order to judge whether they meet the project requirements. Computing
140: systems have to access metadata to merge or compare data from different
141: sources.  For instance, units must be expressed unambiguously to allow
142: comparisons between data with different units.
143: 
144: The astrophysics community, in addition to using the FITS (Flexible Image
145: Transport System) exchange format, is currently considering alternatives like
146: XML to improve interoperability. Some attempts to define a common standard
147: are XSIL (eXtensible Scientific Interchange Language), XDF (eXtensible Data
148: Format) and VOTable \cite{Sta02}.
149: 
150: \section{Multidimensional access methods}
151: In the astroparticle and astrophysical fields, data mostly take the form of
152: multidimensional arrays. For instance, in X-ray and gamma-ray astronomy, the
153: data gathered by detectors are lists of detected photons whose properties
154: include position (RA, DEC), arrival time, energy, error measures both for the
155: position and the energy estimates (dependent on the instrument response), and
156: quality measures of the events. Source catalogs, produced by the analysis of
157: the raw data, are lists of point and extended sources characterized by
158: coordinates, magnitude, spectral indices, flux, etc.
159: 
160: Such multidimensional (spatial) data tend to be large (sky maps can reach sizes
161: of terabytes), requiring the use of secondary storage, and there is no
162: total ordering on spatial objects that preserves spatial proximity~\cite{GG98}. This
163: characteristic makes it difficult to use traditional indexing methods, like B-trees
164: or linear hashing.
165: 
166: \emph{Data mining} applied to multidimensional data analyzes the relationships
167: between the attributes of a multidimensional object stored in the database and
168: the attributes of the neighboring ones. Typical queries required by this kind of
169: analysis are: \emph{point queries}, to find all objects overlapping the query
170: point; \emph{range queries}, to find all objects having at least one common
171: point with a query window; and \emph{nearest neighbor queries}, to find all objects
172: that have a minimum distance from the query object. Another important operation
173: is the \emph{spatial join}, needed to search multiple source catalogs and
174: cross-identify sources from different wavebands. Some of the following indexing
175: methods can be used to improve query efficiency.
176: 
177: {\bf HTM}. Data gathered by all-sky surveys are distributed on an imaginary
178: sphere. The Hierarchical Triangular Mesh (HTM)~\cite{KST01} indexing method maps triangular regions of the
179: sphere to unique identifiers while keeping a certain degree of locality.  The
180: technique for subdividing the sphere into spherical triangles is a recursive
181: process.  The starting point is a spherical octahedron which identifies 8
182: spherical triangles of equal size. In a recursion step, a triangle is further
183: subdivided into 4 triangles by connecting the side midpoints. At each level of
184: the recursion, the area of the resulting triangles is roughly the same and each
185: of the four new triangles is identified, within its parent, by a 2-bit value. This method has been used to
186: index the Sloan Digital Sky Survey, a catalog of 200 M objects in a
187: multi-terabyte archive. A level-5 HTM index is used to partition the bulk data.
188: A database for each level-5 leaf node of the HTM (defining the database file
189: name) has been built. Each database, containing tuples in a 5-dimensional color
190: space, is indexed by a KD-tree.
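
The subdivision rule fixes the size of the mesh at each level. As a
back-of-the-envelope check (a counting argument only: the symbols $N_d$, $A_d$
and $b_d$ are introduced here for illustration, and the bit count ignores any
extra bits that the actual HTM identifier encoding may reserve), the number of
triangles after $d$ recursion steps, their mean area and the number of bits
needed to tell them apart are
\[
  N_d = 8 \cdot 4^{d}, \qquad
  A_d = \frac{4\pi}{N_d}\ \mathrm{sr}, \qquad
  b_d = 3 + 2d .
\]
For $d = 5$ this gives $N_5 = 8192$ triangles, each covering about
$41253/8192 \approx 5.0$ square degrees, which can be distinguished with
$b_5 = 13$ bits.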
191: 
192: {\bf KD-tree and its variants}. The KD-tree~\cite{Bent75} is a binary tree that
193: stores points of a $k$-dimensional space. In each internal node, the KD-tree
194: divides the $k$-dimensional space into two parts with a $(k-1)$-dimensional
195: hyperplane. The direction of the hyperplane, that is, the dimension along which the
196: division is performed, cycles through the $k$ possibilities from one tree
197: level to the next. The subdivision process is recursive and terminates
198: when the size of a node (its longest side) or the number of points contained in
199: it is below a certain threshold. Given $N$ data points, the average cost of an
200: insertion operation is $O(\log_2 N)$. The tree structure and the resulting
201: hierarchical division of the space depend on the \emph{splitting rule}. A
202: drawback of KD-trees is that they have to be held entirely in main
203: memory. With large datasets this is not feasible.  KD-B-trees~\cite{Rob81} and
204: hB-trees~\cite{LS90} combine properties of KD-trees and B-trees to overcome this
205: problem.
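
The following Python sketch shows one simple in-memory KD-tree and a range
query on it. It assumes a particular splitting rule (the median along the
cycling axis, with one point stored in every node), which is only one of the
possible choices; the names \texttt{Node}, \texttt{build} and
\texttt{range\_query} are ours and do not refer to any existing library.

\begin{verbatim}
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    point: Point                      # point stored at this node
    axis: int                         # dimension used to split at this node
    left: Optional["Node"] = None     # points with coordinate <= split value
    right: Optional["Node"] = None    # points with coordinate >= split value

def build(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Build a KD-tree, cycling through the k axes and splitting at the median."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))

def range_query(node, lo, hi, out=None):
    """Collect all points p with lo[i] <= p[i] <= hi[i] on every axis i."""
    if out is None:
        out = []
    if node is None:
        return out
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    split = node.point[node.axis]
    # Descend only into the half-spaces that can intersect the query window.
    if lo[node.axis] <= split:
        range_query(node.left, lo, hi, out)
    if hi[node.axis] >= split:
        range_query(node.right, lo, hi, out)
    return out

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
inside = sorted(range_query(tree, lo=(3, 1), hi=(8, 5)))
\end{verbatim}

The pruning test in \texttt{range\_query} is what makes the index useful: a
subtree is skipped whenever the query window lies entirely on the other side
of the splitting hyperplane.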
206: 
207: {\bf R-tree and its variants}. R-trees~\cite{Gutt84} are hierarchical
208: dynamic data structures meant to efficiently index multidimensional objects with
209: a spatial extent. They store not the real objects but their minimum
210: bounding boxes (MBBs). Each node of the R-tree corresponds to a disk page. Like
211: B-trees, R-trees are balanced and guarantee efficient memory
212: usage. Because the MBBs of sibling nodes can overlap, in an R-tree a
213: range query can require more than one search path to be traversed. Search
214: performance depends on the insertion algorithm. Some variants have been
215: proposed to improve the disjointness among regions: the R$^*$-tree \cite{BKSS90}, which
216: uses a new insertion policy, the SR-tree \cite{KS97}, which uses the intersection of
217: bounding spheres and bounding rectangles to keep the diameters and volumes
218: of the regions small, and the A-tree \cite{SYUK02}, which improves the fanout of the nodes
219: using an approximation of the MBBs.
220: 
221: 
222: Usually, the analysis of astrophysical data is performed on a static dataset.
223: In this case, an optimized index (in terms of memory and query performance) can
224: be built using a priori information on the dataset. Several bulk loading
225: techniques have been proposed in the literature. We have followed a top-down
226: construction method called the VAMSplit algorithm, described in \cite{WJ96}, to
227: build an optimized R-tree. The main idea is to find a split strategy that
228: minimizes the number of buckets used and provides good query performance. This
229: is achieved by recursively splitting the dataset on a near-median element along
230: the dimension with maximum variance. To adapt it to a large dataset, we had to
231: implement an external selection algorithm. The implementation uses a sampling
232: method suggested by \cite{MR01} to find a good pivot value and reduce the number
233: of I/O operations; a caching strategy explained in \cite{BK99} has been adopted
234: to partition the data in secondary memory.
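
The core of the splitting strategy can be sketched in a few lines of Python.
This is a simplified, in-memory illustration under our own assumptions, not
the implementation described above: it splits at the plain median rather than
at the near-median position chosen by VAMSplit, it sorts instead of using an
external selection algorithm, and the names \texttt{max\_variance\_axis} and
\texttt{bulk\_partition} are hypothetical.

\begin{verbatim}
from statistics import pvariance

def max_variance_axis(points):
    """Return the index of the dimension along which the points vary most."""
    k = len(points[0])
    return max(range(k), key=lambda d: pvariance([p[d] for p in points]))

def bulk_partition(points, capacity):
    """Recursively split on the median of the maximum-variance dimension
    until every bucket holds at most `capacity` points."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    if len(points) <= capacity:
        return [list(points)]
    axis = max_variance_axis(points)
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2          # simplification: plain median split
    return (bulk_partition(ordered[:mid], capacity)
            + bulk_partition(ordered[mid:], capacity))

# Toy data: the first coordinate spreads far more than the second.
sample = [(10.0 * i, 0.1 * i) for i in range(8)]
buckets = bulk_partition(sample, capacity=2)
\end{verbatim}

On a dataset that does not fit in main memory, the call to \texttt{sorted}
is the step that would be replaced by a selection algorithm working on
secondary storage.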
235: 
236: 
237: \section{Clustering algorithms on large datasets}
238: Clustering algorithms have to locate regions of interest in which to perform more
239: detailed analysis and point out correlations between objects. An important
240: issue, in large datasets, is the efficiency and scalability of the clustering
241: algorithms with respect to the dataset size.
242: 
243: Many scalable algorithms have been proposed in the last ten years, including
244: BIRCH, CURE and CLIQUE~\cite{Ber02}.
245: 
246: In particular, BIRCH is a hierarchical clustering algorithm. The main idea
247: behind the algorithm is to compress data into small subclusters and then to
248: perform a standard partitional clustering on the subclusters. Each subcluster is
249: represented by a \emph{clustering feature} which is a triplet summarizing
250: information about the group of data objects, namely the number of points
251: contained in the subcluster and the linear sum and the square sum of the data
252: points. This algorithm has a linear cost with respect to the number of data
253: points.
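
Written out, and assuming the usual definition in which the square sum is a
scalar (the symbols $N$, $\mathbf{LS}$, $SS$, $\mathbf{c}$ and $R$ are our
notation), the clustering feature of a subcluster with points
$\mathbf{x}_1, \dots, \mathbf{x}_N$ is
\[
  CF = (N, \mathbf{LS}, SS), \qquad
  \mathbf{LS} = \sum_{i=1}^{N} \mathbf{x}_i, \qquad
  SS = \sum_{i=1}^{N} \|\mathbf{x}_i\|^2 .
\]
Clustering features are additive, so two subclusters can be merged without
revisiting their points:
\[
  CF_1 + CF_2 = (N_1 + N_2,\ \mathbf{LS}_1 + \mathbf{LS}_2,\ SS_1 + SS_2) .
\]
The centroid and the radius of a subcluster follow directly from its triplet:
\[
  \mathbf{c} = \frac{\mathbf{LS}}{N}, \qquad
  R = \sqrt{\frac{SS}{N} - \|\mathbf{c}\|^2} .
\]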
254: 
255: CURE is a hierarchical agglomerative algorithm. Instead of using a single
256: centroid or object, it selects a fixed number of well-scattered objects to
257: represent each cluster. The distance between two clusters is defined as the
258: distance between the closest pair of representative points, and at each step of the
259: algorithm the two closest clusters are merged. The algorithm terminates when
260: the desired number of clusters is obtained. To reduce the computational cost of
261: the algorithm, these steps are performed on a data sample (using suitable
262: sampling techniques). Its computational cost is no worse than that of BIRCH.
263: 
264: CLIQUE has been designed to locate clusters in subspaces of high-dimensional
265: data. This is useful because, in high-dimensional spaces, data are generally
266: sparse. CLIQUE partitions the space into a grid of disjoint rectangular units
267: of equal size. The algorithm is made up of three phases: first, it finds
268: subspaces containing clusters of dense units, then it identifies the clusters,
269: and finally it generates a minimal description for each cluster. This
270: algorithm also scales linearly with the database size.
271: 
272: \section{Novelty detection: Support Vector Clustering}
273: Support Vector Machines and the related kernel methods are becoming popular for
274: data mining tasks. In many real problems, the task is not classification but
275: the detection of novelties or anomalies. In astrophysics, possible applications are the
276: search for anomalous events or new astronomical sources. One approach is to find
277: the \emph{support} of a distribution (rather than estimating the density
278: function of the data), thus avoiding the need for an a priori parameterized model
279: of the distribution. One method to solve this problem is the
280: Support Vector Clustering (SVC) algorithm~\cite{BHSV01}, in which data are
281: mapped to a higher-dimensional space by means of a Gaussian kernel function. In
282: the new space, the algorithm finds the minimum sphere enclosing the data. The
283: mapping of the sphere back to the original input space generates a set of contours
284: enclosing the data and corresponding to the support of the
285: distribution. Outliers are the Bounding Support Vectors (BSVs), which lie outside the sphere.
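
In formulas, and following our reading of the formulation in~\cite{BHSV01}
(the symbols $\Phi$, $\mathbf{a}$, $R$, $\xi_j$, $C$, $q$ and $\beta_j$ are
introduced here and should be checked against that reference), the algorithm
solves
\[
  \min_{R,\,\mathbf{a},\,\xi} \; R^2 + C \sum_j \xi_j
  \qquad \mbox{subject to} \qquad
  \|\Phi(\mathbf{x}_j) - \mathbf{a}\|^2 \le R^2 + \xi_j, \quad \xi_j \ge 0,
\]
where $\Phi$ is the map induced by the Gaussian kernel
$K(\mathbf{x}, \mathbf{y}) = \exp(-q \|\mathbf{x} - \mathbf{y}\|^2)$. With
the Lagrange multipliers $\beta_j$ of the dual problem, the distance of the
image of a point $\mathbf{x}$ from the centre of the sphere is
\[
  R^2(\mathbf{x}) = K(\mathbf{x}, \mathbf{x})
  - 2 \sum_j \beta_j K(\mathbf{x}_j, \mathbf{x})
  + \sum_{i,j} \beta_i \beta_j K(\mathbf{x}_i, \mathbf{x}_j) ,
\]
and the contours in input space are the set
$\{\mathbf{x} : R(\mathbf{x}) = R\}$. Points with $0 < \beta_j < C$ lie on
the sphere, while points with $\beta_j = C$ lie outside it and are the
bounding support vectors.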
286: 
287: \section{Conclusions}
288: In this work we have studied some data management and mining issues related to
289: astrophysical data, aiming at a complete data mining framework. In particular,
290: we have justified the need for a data warehousing approach to handle
291: astrophysical data and we have focused on multidimensional access methods to
292: efficiently index spatial and multidimensional data. A second issue concerns
293: clustering techniques on large datasets, and we have discussed some
294: scalable algorithms with linear computational complexity. Finally, we have
295: outlined the usefulness of non-parametric clustering algorithms, like the SVC,
296: for novelty detection.
297: 
298: 
299: % The Appendices part is started with the command \appendix;
300: % appendix sections are then done as normal sections
301: % \appendix
302: 
303: % \section{}
304: % \label{}
305: 
306: \begin{thebibliography}{00}
307: 
308: % \bibitem{label}
309: % Text of bibliographic item
310: 
311: \bibitem{Viz} \emph{CDS}, \texttt{http://cdsweb.u-strasbg.fr}
312: \bibitem{Sky} \emph{SkyView - The Internet's Virtual Telescope},
313:   \texttt{http://skyview.gsfc.nasa.gov}
314: \bibitem{DSZDG00} P. Dowler, D. Schade, R. Zingle, D. Durand, S. Gaudet,
315:   \emph{Scientific Data Mining}, ASP Conf. Ser., Vol. 216, Astronomical Data
316:   Analysis Software and Systems IX, eds. N. Manset, C. Veillet, D. Crabtree (San
317:   Francisco: ASP), 211 (2000)
318: \bibitem{Inm97} W. H. Inmon, \emph{What is a Data Warehouse?}, Prism Tech
319:   Topics 1(1) (1997)
320: \bibitem{Pet94} S. Peterson, \emph{Stars: A Pattern Language for Query
321:     Optimized Schema}, Portland Pattern Repository (1999)
322: \bibitem{Sta02} R. Stamper, \emph{XML for STP data}, Tech. Report (2002)
323: \bibitem{GG98} V. Gaede, O. G\"unther, \emph{Multidimensional access
324:     methods}, ACM Comput. Surv., 30:170-231 (1998)
325: \bibitem{KST01} P. Z. Kunszt, A. S. Szalay, A. R. Thakar, \emph{The
326:     Hierarchical Triangular Mesh}, Proc. of the MPA/ESO/MPE workshop in Mining
327:     the sky, 631-637, Springer Verlag (2001)
328: \bibitem{Bent75} J. L. Bentley, \emph{Multidimensional binary search
329:     trees used for associative searching}, Communications of the ACM, 18(9),
330:     509-517 (1975)
331: \bibitem{Rob81} J. T. Robinson, \emph{The KD-B-tree: a search structure
332:     for large multidimensional Dynamic Indexes}, Proc. ACM SIGMOD Int. Conf. on
333:     Management of Data, 10-18 (1981) 
334: \bibitem{LS90} D. B. Lomet, B. Salzberg, \emph{A Multiattribute Indexing
335:     Method with Good Guaranteed Performance}, ACM Trans. on Database Systems
336:     (1990)
337: \bibitem{Gutt84} A. Guttman, \emph{R-trees: A Dynamic index Structure
338:     for Spatial Searching}, Proc. ACM SIGMOD Int. Conf. on Management of Data,
339:     47-57 (1984) 
340: \bibitem{BKSS90} N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger,
341:   \emph{The R$^*$-tree: An Efficient and Robust Access Method for Points and
342:   Rectangles}, Proc. ACM SIGMOD Int. Conf. on Management of Data, 322-331 (1990)
343: \bibitem{KS97} N.~Katayama, S.~Satoh, \emph{An Index Structure for
344:     High-Dimensional Nearest Neighbor Queries}, Proc. ACM SIGMOD Int. Conf. on
345:   Management of Data, 369-380 (1997)
346: \bibitem{SYUK02} Y.~Sakurai, M.~Yoshikawa, S.~Uemura, H.~Kojima, \emph{Spatial
347:     Indexing of High-Dimensional Data Based on Relative Approximation}, VLDB
348:   Journal, 11(2):93-108 (2002)
349: \bibitem{WJ96} D.A. White, R. Jain, \emph{Similarity Indexing: Algorithms and
350:     Performance}, Proc. SPIE, 2670:62-73, San Diego, USA (1996) 
351: \bibitem{MR01} C.~Mart\'inez, S.~Roura, \emph{Optimal Sampling Strategies in
352:     Quicksort and Quickselect}, SIAM J. on Comput., 31:683-705 (2001)
353: \bibitem{BK99} C.~B\"ohm, H.-P.~Kriegel, \emph{Efficient Bulk Loading of Large
354:     High-Dimensional Indexes}, DaWaK '99, 31:251-260 (1999)
355: \bibitem{Ber02} P. Berkhin, \emph{Survey Of Clustering Data Mining Techniques},
356:   Tech Report (2002)
357: \bibitem{BHSV01} A. Ben-Hur, D. Horn, H. T. Siegelmann, V. Vapnik,
358:   \emph{Support Vector Clustering}, Journal of Machine Learning Research 2(Dec),
359:   125-137 (2001)
360: % notes:
361: % \bibitem{label} \note
362: 
363: % subbibitems:
364: % \begin{subbibitems}{label}
365: % \bibitem{label1}
366: % \bibitem{label2}
367: % If there is a note, it should come last:
368: % \bibitem{label3} \note
369: % \end{subbibitems}
370: 
371: \end{thebibliography}
372: 
373: \end{document}
374: 
375: