
% Template article for preprint document class `elsart'
% SP 2001/01/05

\documentclass{elsart}

% Use the option doublespacing or reviewcopy to obtain double line spacing
% \documentclass[doublespacing]{elsart}

% if you use PostScript figures in your article
% use the graphics package for simple commands
% \usepackage{graphics}
% or use the graphicx package for more complicated commands
\usepackage{graphicx}
% or use the epsfig package if you prefer to use the old commands
% \usepackage{epsfig}

% The amssymb package provides various useful mathematical symbols
\usepackage{amssymb}

\begin{document}

\begin{frontmatter}

% Title, authors and addresses

% use the thanksref command within \title, \author or \address for footnotes;
% use the corauthref command within \author for corresponding author footnotes;
% use the ead command for the email address,
% and the form \ead[url] for the home page:
% \title{Title\thanksref{label1}}
% \thanks[label1]{}
% \author{Name\corauthref{cor1}\thanksref{label2}}
% \ead{email address}
% \ead[url]{home page}
% \thanks[label2]{}
% \corauth[cor1]{}
% \address{Address\thanksref{label3}}
% \thanks[label3]{}

\title{Perspects in astrophysical databases}

% use optional labels to link authors explicitly to addresses:
% \author[label1,label2]{}
% \address[label1]{}
% \address[label2]{}

\author[addr1]{Marco Frailis}
\author[addr2]{Alessandro De Angelis}
\author[addr3]{Vito Roberto}

\address[addr1]{Dipartimento di Fisica, Universit\`a di Udine, via delle Scienze 208, 33100 Udine, Italy}
\address[addr2]{INFN, Sezione di Trieste, Gruppo Collegato di Udine, via delle Scienze 208, 33100 Udine, Italy}
\address[addr3]{Dipartimento di Matematica e Informatica, Universit\`a di Udine, via delle Scienze 208, 33100 Udine, Italy}

\begin{abstract}
% Text of abstract
  Astrophysics has become a domain extremely rich in scientific data. Data
  mining tools are needed to extract information from such large datasets.
  This calls for an approach to data management that emphasizes the efficiency
  and simplicity of data access; efficiency is obtained using multidimensional
  access methods, and simplicity is achieved by properly handling metadata.
  Moreover, clustering and classification techniques on large datasets pose
  additional requirements in terms of computational and memory scalability and
  interpretability of results. In this study we review some possible solutions.
\end{abstract}

\begin{keyword}
% keywords here, in the form: keyword \sep keyword
  Multidimensional Indexing \sep Data Mining \sep Astrophysical Databases \sep Data Warehousing
% PACS codes here, in the form: \PACS code \sep code
\PACS
\end{keyword}
\end{frontmatter}

% main text
\section{Introduction}
\label{introduction}

At present, astrophysics is a discipline in which the exponential growth and
heterogeneity of data require the use of data mining techniques. The primary
sources of astronomical data are the systematic sky surveys over a wide energy
range (from $10^{-7}$ eV to $10^{13}$ eV). Large archives and digital sky
surveys with sizes of the order of $10^{12}$ bytes currently exist, while in
the near future they will reach sizes of the order of $10^{15}$ bytes.
Numerical simulations are also producing comparable volumes of information.

Several research fields require analyses that span multiple energy bands and
therefore need data from different missions. Data mining techniques are thus
necessary to maximize the information extracted from such a growing quantity
of data. This task is made harder by several issues: the heterogeneity of
astronomical data, due in part to their high dimensionality (including both
spatial and temporal components) and in part to the multiplicity of
instruments and projects; and the use of traditional operational systems, in
which the emphasis is on data normalization, to organize astrophysical data.
Data mining for multi-wavelength analysis requires an informational system, or
data warehouse, as the model for data management, together with the definition
of a common set of metadata to guarantee interoperability between different
archives and more efficient data exploration.

\section{Towards a data warehouse}
Most of the online resources available to the astrophysics community are
simple data archives containing observational parameters (detector, type of
observation, coordinates, astronomical object, exposure time, etc.). Many
astronomical catalogs can be accessed online, but it is still difficult to
correlate objects in different archives or to access multiple catalogs
simultaneously. Some advances in this direction have been made by projects
like VizieR, Aladin and SkyView \cite{Viz,Sky}.

With an ideal astrophysical database, users should be able to perform queries
based on scientific parameters (magnitude, redshift, spectral indexes,
morphological type of galaxies, etc.), easily discover the object types
contained in the archive and the properties available for each type, and
define the set of objects they are interested in by constraining the values of
their scientific properties along with the desired level of
detail~\cite{DSZDG00}.

These requirements can be satisfied by organizing data in a data warehouse. A
data warehouse can be defined as a \emph{subject-oriented}, \emph{integrated},
\emph{time-varying} and \emph{non-volatile} data collection \cite{Inm97}. In a
data warehouse, data are arranged in a structure that can be easily explored
and queried, with fewer tables and keys than the equivalent relational model.
The starting point is a relational model, on which some restrictions are
imposed by using \emph{facts}, \emph{dimensions}, \emph{hierarchies} and
\emph{measures} in a characteristic star-shaped structure called a \emph{star
schema}~\cite{Pet94}. The central table is called the ``fact'' table and is
the table with the highest dimensionality in the schema. It can represent a
particular phenomenon that we want to study. This table is surrounded by a
number of tables, called ``dimensions'', which represent entities related to
the phenomenon under study; they are connected to the central table and form
the points of the star. Within the dimensions, attributes are arranged in
hierarchies, which determine the ``drill-down'' and ``roll-up'' operations
available on each dimension: the result is a tree that the user can visit from
the root to the leaves, refining the query (drill-down) or generalizing it
(roll-up).
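
The star schema described above can be illustrated with a minimal sketch using
an in-memory SQLite database. All table and column names
(\texttt{detection}, \texttt{instrument}, \texttt{sky\_region}, \texttt{flux})
and the sample rows are hypothetical and chosen only for illustration: one
fact table holds the measure, and the \texttt{sky\_region} dimension carries a
two-level hierarchy (hemisphere, constellation) on which roll-up and
drill-down are performed.

\begin{verbatim}
```python
import sqlite3

# Hypothetical star schema: one fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE instrument (instrument_id INTEGER PRIMARY KEY,
                         mission TEXT, detector TEXT);
CREATE TABLE sky_region (region_id INTEGER PRIMARY KEY,
                         hemisphere TEXT, constellation TEXT);
CREATE TABLE detection (               -- fact table
    detection_id  INTEGER PRIMARY KEY,
    instrument_id INTEGER REFERENCES instrument(instrument_id),
    region_id     INTEGER REFERENCES sky_region(region_id),
    flux          REAL                 -- the measure
);
""")
conn.executemany("INSERT INTO instrument VALUES (?, ?, ?)",
                 [(1, "MissionA", "det1"), (2, "MissionB", "det2")])
conn.executemany("INSERT INTO sky_region VALUES (?, ?, ?)",
                 [(1, "north", "Cygnus"), (2, "north", "Lyra"),
                  (3, "south", "Crux")])
conn.executemany("INSERT INTO detection VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 2.0), (2, 1, 2, 3.0),
                  (3, 2, 3, 5.0), (4, 2, 1, 1.0)])

def total_flux_by(level):
    """Aggregate the measure at one level of the sky_region hierarchy."""
    if level not in ("hemisphere", "constellation"):
        raise ValueError("unknown hierarchy level")
    query = (f"SELECT r.{level}, SUM(d.flux) FROM detection d "
             f"JOIN sky_region r ON d.region_id = r.region_id "
             f"GROUP BY r.{level} ORDER BY r.{level}")
    return conn.execute(query).fetchall()

print(total_flux_by("hemisphere"))     # roll-up
print(total_flux_by("constellation"))  # drill-down
```
\end{verbatim}

Moving from the \texttt{hemisphere} level to the \texttt{constellation} level
corresponds to a drill-down on the dimension, and the opposite move to a
roll-up; the fact table itself is never restructured.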

Metadata play an important role: a researcher has to obtain information about
the environment in which the data were gathered (such as the acquisition date
and method, internal or external error estimates, and the purpose of the data)
in order to judge whether the data meet the project requirements. Computing
systems have to access metadata to merge or compare data from different
sources. For instance, units must be expressed unambiguously to allow
comparisons between data expressed in different units.

The astrophysics community, in addition to using the FITS (Flexible Image
Transport System) exchange format, is currently considering XML-based
alternatives to improve interoperability. Some attempts to define a common
standard are XSIL (eXtensible Scientific Interchange Language), XDF
(eXtensible Data Format) and VOTable \cite{Sta02}.

\section{Multidimensional access methods}
In astroparticle physics and astrophysics, data are mostly organized as
multidimensional arrays. For instance, in X-ray and gamma-ray astronomy, the
data gathered by detectors are lists of detected photons whose properties
include position (RA, DEC), arrival time, energy, error measures for both the
position and the energy estimates (dependent on the instrument response), and
quality measures of the events. Source catalogs, produced by the analysis of
the raw data, are lists of point and extended sources characterized by
coordinates, magnitude, spectral indexes, flux, etc.

Such multidimensional (spatial) data tend to be large (sky maps can reach
sizes of terabytes), requiring the use of secondary storage, and there is no
total ordering on spatial objects that preserves spatial
proximity~\cite{GG98}. This characteristic makes it difficult to use
traditional indexing methods, like B-trees or linear hashing.

\emph{Data mining} applied to multidimensional data analyzes the relationships
between the attributes of a multidimensional object stored in the database and
the attributes of the neighboring ones. Typical queries required by this kind
of analysis are: \emph{point queries}, to find all objects overlapping the
query point; \emph{range queries}, to find all objects having at least one
point in common with a query window; \emph{nearest neighbor queries}, to find
all objects that have a minimum distance from the query object. Another
important operation is the \emph{spatial join}, needed to search multiple
source catalogs and cross-identify sources from different wavebands. Some of
the following indexing methods can be used to improve query efficiency.
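
As a reference for what these queries compute, the following minimal sketch
answers a range query and a nearest neighbor query by linear scan over a list
of points. The function names and the sample coordinates are hypothetical, and
plain Euclidean distance on the coordinate tuples is an assumption made for
brevity (it ignores spherical geometry). The access methods discussed below
aim to return the same answers without scanning the whole dataset.

\begin{verbatim}
```python
import math

def range_query(points, lower, upper):
    """Return the points inside the axis-aligned window [lower, upper]."""
    return [p for p in points
            if all(lo <= c <= hi for c, lo, hi in zip(p, lower, upper))]

def nearest_neighbors(points, query, k=1):
    """Return the k points closest to `query` (Euclidean distance)."""
    return sorted(points, key=lambda p: math.dist(p, query))[:k]

# Hypothetical (RA, DEC) pairs in degrees.
sources = [(10.0, -5.0), (10.2, -5.1), (50.0, 20.0)]
print(range_query(sources, (9.0, -6.0), (11.0, -4.0)))
print(nearest_neighbors(sources, (49.0, 19.0)))
```
\end{verbatim}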

{\bf HTM}. Data gathered by all-sky surveys are distributed on an imaginary
sphere. The Hierarchical Triangular Mesh (HTM)~\cite{KST01} indexing method
maps triangular regions of the sphere to unique identifiers while keeping a
certain degree of locality. The sphere is subdivided into spherical triangles
by a recursive process. The starting point is a spherical octahedron, which
identifies 8 spherical triangles of equal size. In a recursion step, a
triangle is further subdivided into 4 triangles by connecting the midpoints of
its sides. At each level of the recursion, the areas of the resulting
triangles are roughly the same, and each of the four child triangles is
identified within its parent by a 2-bit value. This method has been used to
index the Sloan Digital Sky Survey, a catalog of 200 M objects in a
multi-terabyte archive. A level-5 HTM index is used to partition the bulk
data: a separate database has been built for each level-5 leaf node of the
HTM, whose identifier defines the database file name. Each database,
containing tuples in a 5-dimensional color space, is indexed by a KD-tree.
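
The recursive subdivision can be sketched as follows. This is a simplified
illustration and not the reference HTM implementation: the function
\texttt{htm\_id}, the ordering of the root triangles and of the children, and
the integer encoding (root index followed by one base-4 digit, i.e. 2 bits,
per level) are assumptions made for this example.

\begin{verbatim}
```python
import math

def _norm(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def _mid(a, b):
    # Midpoint of an arc: normalized sum of the two unit vectors.
    return _norm(tuple(x + y for x, y in zip(a, b)))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _inside(p, tri, eps=1e-12):
    # p lies in a counterclockwise spherical triangle if it is on the
    # inner side of the three great circles through its edges.
    a, b, c = tri
    return (_dot(_cross(a, b), p) >= -eps and
            _dot(_cross(b, c), p) >= -eps and
            _dot(_cross(c, a), p) >= -eps)

# Octahedron vertices and its 8 faces (counterclockwise from outside).
V = [(0, 0, 1), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0), (0, 0, -1)]
ROOTS = [(V[1], V[5], V[2]), (V[2], V[5], V[3]),
         (V[3], V[5], V[4]), (V[4], V[5], V[1]),
         (V[1], V[0], V[4]), (V[4], V[0], V[3]),
         (V[3], V[0], V[2]), (V[2], V[0], V[1])]

def htm_id(ra_deg, dec_deg, depth):
    """Illustrative identifier of the depth-`depth` triangle containing
    the point (ra, dec); points on an edge go to the first match."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    p = (math.cos(dec) * math.cos(ra),
         math.cos(dec) * math.sin(ra),
         math.sin(dec))
    index, tri = next((i, t) for i, t in enumerate(ROOTS) if _inside(p, t))
    for _ in range(depth):
        v0, v1, v2 = tri
        w0, w1, w2 = _mid(v1, v2), _mid(v0, v2), _mid(v0, v1)
        children = [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]
        child, tri = next((i, t) for i, t in enumerate(children)
                          if _inside(p, t))
        index = index * 4 + child   # append 2 bits per level
    return index
```
\end{verbatim}

With this encoding, the identifier of a triangle at a given level is a prefix
of the identifiers of all its descendants, so that a partition at level 5 can
be obtained by truncating deeper identifiers.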

{\bf KD-tree and its variants}. The KD-tree~\cite{Bent75} is a binary tree
that stores points of a $k$-dimensional space. In each internal node, the
KD-tree divides the $k$-dimensional space into two parts with a
$(k-1)$-dimensional hyperplane. The direction of the hyperplane, that is, the
dimension on which the division is performed, alternates among the $k$
possibilities from one tree level to the next. The subdivision process is
recursive and terminates when the size of a node (its longest side) or the
number of points it contains falls below a certain threshold. Given $N$ data
points, the average cost of an insertion operation is $O(\log_2 N)$. The tree
structure and the resulting hierarchical division of the space depend on the
\emph{splitting rule}. A drawback of KD-trees is that they have to be
completely contained in main memory, which is not feasible with large
datasets. KD-B-trees~\cite{Rob81} and hB-trees~\cite{LS90} combine properties
of KD-trees and B-trees to overcome this problem.
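
A minimal sketch of the alternating-dimension rule is given below. It assumes
the simplest variant, in which every node stores one point and the splitting
hyperplane passes through that point; the names \texttt{KDNode},
\texttt{kd\_insert} and \texttt{kd\_contains} are hypothetical, and the
threshold-based termination mentioned above is not modeled.

\begin{verbatim}
```python
class KDNode:
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None

def kd_insert(root, point, depth=0):
    """Insert `point`; the splitting dimension cycles with the depth."""
    if root is None:
        return KDNode(point)
    axis = depth % len(point)
    if point[axis] < root.point[axis]:
        root.left = kd_insert(root.left, point, depth + 1)
    else:
        root.right = kd_insert(root.right, point, depth + 1)
    return root

def kd_contains(root, point, depth=0):
    """Exact-match search following a single root-to-leaf path."""
    if root is None:
        return False
    if root.point == point:
        return True
    axis = depth % len(point)
    child = root.left if point[axis] < root.point[axis] else root.right
    return kd_contains(child, point, depth + 1)

root = None
for p in [(5, 5), (2, 7), (8, 1), (3, 3)]:
    root = kd_insert(root, p)
```
\end{verbatim}

Both operations follow one path from the root, which is the origin of the
logarithmic average cost when the tree is reasonably balanced.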

{\bf R-tree and its variants}. R-trees~\cite{Gutt84} are hierarchical dynamic
data structures meant to efficiently index multidimensional objects with a
spatial extent. They store not the real objects but their minimum bounding
boxes (MBBs). Each node of the R-tree corresponds to a disk page. Like
B-trees, R-trees are balanced and guarantee efficient memory usage. Because
the MBBs of sibling nodes can overlap, a range query in an R-tree may require
more than one search path to be traversed. Search performance depends on the
insertion algorithm. Some variants have been proposed to improve the
disjointness among regions: the R$^*$-tree \cite{BKSS90}, which uses a new
insertion policy; the SR-tree \cite{KS97}, which uses the intersection of
bounding spheres and bounding rectangles to keep the diameters and volumes of
the regions small; and the A-tree \cite{SYUK02}, which improves the fanout of
the nodes by using an approximation of the MBBs.
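
The effect of overlapping bounding boxes on a range query can be seen in the
following minimal sketch. It is not an R-tree implementation: the helper names
(\texttt{mbb}, \texttt{intersects}, \texttt{children\_to\_visit}) and the
sample boxes are hypothetical, and only the test that decides which child
nodes a query must descend into is shown.

\begin{verbatim}
```python
def mbb(points):
    """Minimum bounding box of a set of points: (lower, upper) corners."""
    lows = tuple(min(c) for c in zip(*points))
    highs = tuple(max(c) for c in zip(*points))
    return lows, highs

def intersects(a, b):
    """True if two boxes share at least one point."""
    (alo, ahi), (blo, bhi) = a, b
    return all(l1 <= h2 and l2 <= h1
               for l1, h1, l2, h2 in zip(alo, ahi, blo, bhi))

def children_to_visit(child_boxes, window):
    """Indexes of the child nodes whose box intersects the query window."""
    return [i for i, box in enumerate(child_boxes)
            if intersects(box, window)]

boxes = [mbb([(0, 0), (4, 4)]),
         mbb([(3, 3), (8, 8)]),
         mbb([(10, 10), (12, 12)])]
window = ((3.5, 3.5), (3.8, 3.8))
print(children_to_visit(boxes, window))
```
\end{verbatim}

Here the small query window falls in the region shared by the first two
boxes, so both subtrees have to be searched; reducing this overlap is the goal
of the variants listed above.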

Usually, the analysis of astrophysical data is performed on a static dataset.
In this case, an optimized index (in terms of memory and query performance)
can be built using a priori information on the dataset. Several bulk loading
techniques have been proposed in the literature. We have followed a top-down
construction method, the VAMSplit algorithm described in \cite{WJ96}, to build
an optimized R-tree. The main idea is to find a split strategy that minimizes
the number of buckets used and provides good query performance. This is
achieved by recursively splitting the dataset on a near-median element along
the dimension with maximum variance. To adapt it to a large dataset, we had to
implement an external selection algorithm. The implementation uses a sampling
method suggested by \cite{MR01} to find a good pivot value and reduce the
number of I/O operations; a caching strategy explained in \cite{BK99} has been
adopted to partition the data in secondary memory.
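
The recursive split on the dimension of maximum variance can be sketched as
follows. This is an in-memory simplification and not the implementation
described above: the function \texttt{vam\_split} and the parameter
\texttt{capacity} are hypothetical, the whole dataset is sorted instead of
using an external selection algorithm, and the rule that rounds the
near-median position to a multiple of the bucket capacity is an assumption
made here to keep the buckets full.

\begin{verbatim}
```python
from statistics import pvariance

def vam_split(points, capacity):
    """Partition `points` into buckets of at most `capacity` points."""
    if len(points) <= capacity:
        return [points]
    dims = range(len(points[0]))
    # Split along the dimension with maximum variance.
    axis = max(dims, key=lambda d: pvariance([p[d] for p in points]))
    ordered = sorted(points, key=lambda p: p[axis])
    # Near-median split position, rounded down to a multiple of the
    # bucket capacity (assumed rule) so that the left side fills buckets.
    half = len(ordered) // 2
    split = max(capacity, (half // capacity) * capacity)
    return (vam_split(ordered[:split], capacity) +
            vam_split(ordered[split:], capacity))

data = [(i, (i * 7) % 5) for i in range(10)]
buckets = vam_split(data, capacity=3)
print([len(b) for b in buckets])
```
\end{verbatim}

With 10 points and a capacity of 3, the sketch produces 4 buckets, the minimum
possible number for this input.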

\section{Clustering algorithms on large datasets}
Clustering algorithms are used to locate regions of interest in which to
perform more detailed analysis and to point out correlations between objects.
An important issue with large datasets is the efficiency and scalability of
the clustering algorithms with respect to the dataset size.

Many scalable algorithms have been proposed in the last ten years, including
BIRCH, CURE and CLIQUE~\cite{Ber02}.

In particular, BIRCH is a hierarchical clustering algorithm. The main idea
behind the algorithm is to compress data into small subclusters and then to
perform a standard partitional clustering on the subclusters. Each subcluster
is represented by a \emph{clustering feature}, a triplet summarizing
information about the group of data objects: the number of points contained in
the cluster, the linear sum of the data points and their square sum. This
algorithm has a linear cost with respect to the number of data points.
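
The clustering feature triplet can be sketched as below. The class and method
names are hypothetical; the sketch only shows that the triplet is additive
(two subclusters are merged by adding their features) and that the centroid
and radius of a subcluster can be derived from it without revisiting the data
points.

\begin{verbatim}
```python
import math

class ClusteringFeature:
    """Triplet (N, LS, SS): count, linear sum, sum of squared norms."""

    def __init__(self, n, linear_sum, square_sum):
        self.n = n
        self.ls = tuple(linear_sum)
        self.ss = square_sum

    @classmethod
    def from_point(cls, p):
        return cls(1, p, sum(c * c for c in p))

    def merge(self, other):
        # Additivity: the feature of a union is the sum of the features.
        return ClusteringFeature(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss)

    def centroid(self):
        return tuple(c / self.n for c in self.ls)

    def radius(self):
        # Root mean squared distance of the points from the centroid.
        c = self.centroid()
        return math.sqrt(max(0.0, self.ss / self.n - sum(x * x for x in c)))

cf = ClusteringFeature.from_point((0.0, 0.0)).merge(
    ClusteringFeature.from_point((2.0, 0.0)))
print(cf.n, cf.ls, cf.ss, cf.centroid(), cf.radius())
```
\end{verbatim}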

CURE is a hierarchical agglomerative algorithm. Instead of using a single
centroid or object, it selects a fixed number of well-scattered objects to
represent each cluster. The distance between two clusters is defined as the
distance between the closest pair of representative points, and at each step
of the algorithm the two closest clusters are merged. The algorithm terminates
when the desired number of clusters is obtained. To reduce the computational
cost of the algorithm, these steps are performed on a data sample (using
suitable sampling techniques). Its computational cost is no worse than that of
BIRCH.
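
The inter-cluster distance and the choice of the pair to merge can be sketched
as follows. The function names and the sample clusters are hypothetical, and
the sketch covers only one merge decision: the selection of the well-scattered
representatives and the sampling step are not shown.

\begin{verbatim}
```python
import math
from itertools import product

def cluster_distance(reps_a, reps_b):
    """Distance between the closest pair of representative points."""
    return min(math.dist(p, q) for p, q in product(reps_a, reps_b))

def closest_pair(clusters):
    """Return the identifiers of the two clusters to merge next.

    `clusters` maps a cluster identifier to its representative points.
    """
    ids = sorted(clusters)
    pairs = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]]
    return min(pairs, key=lambda ab: cluster_distance(clusters[ab[0]],
                                                      clusters[ab[1]]))

clusters = {"a": [(0, 0), (1, 0)],
            "b": [(3, 0), (4, 0)],
            "c": [(10, 0)]}
print(closest_pair(clusters))
```
\end{verbatim}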

CLIQUE has been designed to locate clusters in subspaces of high-dimensional
data. This is useful because, in high-dimensional spaces, data are generally
scattered. CLIQUE partitions the space into a grid of disjoint rectangular
units of equal size. The algorithm is made up of three phases: first, it finds
the subspaces containing clusters of dense units; then it identifies the
clusters; and finally it generates a minimal description for each cluster.
This algorithm also scales linearly with the database size.
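
The notion of a dense unit in a given subspace can be sketched as below. The
function \texttt{dense\_units} and its parameters are hypothetical, and the
density threshold is expressed as an absolute point count, which is an
assumption of this example. Only the counting step for one chosen subspace is
shown, not the search over subspaces nor the later phases.

\begin{verbatim}
```python
from collections import Counter

def dense_units(points, dims, cell_size, min_points):
    """Grid units of the subspace `dims` holding >= min_points points."""
    counts = Counter(tuple(int(p[d] // cell_size) for d in dims)
                     for p in points)
    return {unit for unit, n in counts.items() if n >= min_points}

points = [(0.1, 0.2, 9.0), (0.3, 0.4, 1.0),
          (0.2, 0.1, 5.0), (2.5, 2.5, 5.0)]
# Dense in the subspace of the first two dimensions...
print(dense_units(points, dims=(0, 1), cell_size=1.0, min_points=3))
# ...but not in the full three-dimensional space.
print(dense_units(points, dims=(0, 1, 2), cell_size=1.0, min_points=3))
```
\end{verbatim}

The three points that are close in the first two coordinates are spread along
the third one, so the group is visible only in the two-dimensional subspace.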

\section{Novelty detection: Support Vector Clustering}
Support Vector Machines and the related kernel methods are becoming popular
for data mining tasks. In many real problems, the task is not classification
but the detection of novelties or anomalies. In astrophysics, possible
applications are the search for anomalous events or new astronomical sources.
One approach is to find the \emph{support} of a distribution (rather than
estimating the density function of the data), thus avoiding the need for an a
priori parameterized model of the distribution. A method to solve this problem
is the Support Vector Clustering (SVC) algorithm~\cite{BHSV01}, in which data
are mapped to a higher-dimensional space by means of a Gaussian kernel
function. In the new space, the algorithm finds the minimum sphere enclosing
the data. Mapping the sphere back to the original input space generates a set
of contours enclosing the data and corresponding to the support of the
distribution. Outliers are defined as the Bounding Support Vectors (BSV).
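
As a brief sketch of the formulation, following the notation of
\cite{BHSV01} (the symbols $\Phi$, $a$, $R$, $\xi_j$, $\beta_j$, $C$ and $q$
are introduced here only for this purpose), let $\Phi$ be the map induced by
the Gaussian kernel
\[
  K(x_i, x_j) = e^{-q \| x_i - x_j \|^2} .
\]
The enclosing sphere, with center $a$ and radius $R$, is found by minimizing
$R^2 + C \sum_j \xi_j$ subject to
\[
  \| \Phi(x_j) - a \|^2 \le R^2 + \xi_j , \qquad \xi_j \ge 0 ,
\]
where the slack variables $\xi_j$ allow some points to lie outside the sphere.
The distance of the image of a point $x$ from the center can be written in
terms of the kernel alone,
\[
  R^2(x) = K(x,x) - 2 \sum_j \beta_j K(x_j, x)
           + \sum_{i,j} \beta_i \beta_j K(x_i, x_j) ,
\]
where the $\beta_j$ are the Lagrange multipliers of the constraints. The
contours in input space are the set $\{ x \mid R(x) = R \}$, and the points
with $\beta_j = C$ lie outside the sphere: these are the BSVs.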

\section{Conclusions}
In this work we have studied some data management and mining issues related to
astrophysical data, aiming at a complete data mining framework. In particular,
we have justified the need for a data warehousing approach to handle
astrophysical data, and we have focused on multidimensional access methods to
efficiently index spatial and multidimensional data. A second issue concerns
clustering techniques on large datasets, for which we have discussed some
scalable algorithms with linear computational complexity. Finally, we have
outlined the usefulness of non-parametric clustering algorithms, like SVC, for
novelty detection.

% The Appendices part is started with the command \appendix;
% appendix sections are then done as normal sections
% \appendix

% \section{}
% \label{}

\begin{thebibliography}{00}

% \bibitem{label}
% Text of bibliographic item

\bibitem{Viz} \emph{CDS}, \texttt{http://cdsweb.u-strasbg.fr}
\bibitem{Sky} \emph{SkyView - The Internet's Virtual Telescope},
  \texttt{http://skyview.gsfc.nasa.gov}
\bibitem{DSZDG00} P. Dowler, D. Schade, R. Zingle, D. Durand, S. Gaudet,
  \emph{Scientific Data Mining}, ASP Conf. Ser., Vol. 216, Astronomical Data
  Analysis Software and Systems IX, eds. N. Manset, C. Veillet, D. Crabtree (San
  Francisco: ASP), 211 (2000)
\bibitem{Inm97} W. H. Inmon, \emph{What is a Data Warehouse?}, Prism Tech
  Topics 1(1) (1997)
\bibitem{Pet94} S. Peterson, \emph{Stars: A Pattern Language for Query
    Optimized Schema}, Portland Pattern Repository (1999)
\bibitem{Sta02} R. Stamper, \emph{XML for STP data}, Tech. Report (2002)
\bibitem{GG98} V. Gaede, O. G\"unther, \emph{Multidimensional access
    methods}, ACM Comput. Surv., 30:170-231 (1998)
\bibitem{KST01} P. Z. Kunszt, A. S. Szalay, A. R. Thakar, \emph{The
    Hierarchical Triangular Mesh}, Proc. of the MPA/ESO/MPE workshop in Mining
    the sky, 631-637, Springer Verlag (2001)
\bibitem{Bent75} J. L. Bentley, \emph{Multidimensional binary search
    trees used for associative searching}, Communications of the ACM, 18(9),
    509-517 (1975)
\bibitem{Rob81} J. T. Robinson, \emph{The KD-B-tree: a search structure
    for large multidimensional Dynamic Indexes}, Proc. ACM SIGMOD Int. Conf. on
    Management of Data, 10-18 (1981)
\bibitem{LS90} D. B. Lomet, B. Salzberg, \emph{A Multiattribute Indexing
    Method with Good Guaranteed Performance}, ACM Trans. on Database Systems
    (1990)
\bibitem{Gutt84} A. Guttman, \emph{R-trees: A Dynamic Index Structure
    for Spatial Searching}, Proc. ACM SIGMOD Int. Conf. on Management of Data,
    47-57 (1984)
\bibitem{BKSS90} N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger,
  \emph{The R$^*$-tree: An Efficient and Robust Access Method for Points and
  Rectangles}, Proc. ACM SIGMOD Int. Conf. on Management of Data, 322-331 (1990)
\bibitem{KS97} N.~Katayama, S.~Satoh, \emph{An Index Structure for
    High-Dimensional Nearest Neighbor Queries}, Proc. ACM SIGMOD Int. Conf. on
  Management of Data, 369-380 (1997)
\bibitem{SYUK02} Y.~Sakurai, M.~Yoshikawa, S.~Uemura, H.~Kojima, \emph{Spatial
    Indexing of High-Dimensional Data Based on Relative Approximation}, VLDB
  Journal, 11(2):93-108 (2002)
\bibitem{WJ96} D.~A.~White, R.~Jain, \emph{Similarity Indexing: Algorithms and
    Performance}, Proc. SPIE, 2670:62-73, San Diego, USA (1996)
\bibitem{MR01} C.~Mart\'inez, S.~Roura, \emph{Optimal Sampling Strategies in
    Quicksort and Quickselect}, SIAM J. on Comput., 31:683-705 (2001)
\bibitem{BK99} C.~B\"ohm, H.-P.~Kriegel, \emph{Efficient Bulk Loading of Large
    High-Dimensional Indexes}, DaWaK '99, 31:251-260 (1999)
\bibitem{Ber02} P. Berkhin, \emph{Survey Of Clustering Data Mining Techniques},
  Tech Report (2002)
\bibitem{BHSV01} A. Ben-Hur, D. Horn, H. T. Siegelmann, V. Vapnik,
  \emph{Support Vector Clustering}, Journal of Machine Learning Research 2(Dec),
  125-137 (2001)
% notes:
% \bibitem{label} \note

% subbibitems:
% \begin{subbibitems}{label}
% \bibitem{label1}
% \bibitem{label2}
% If there is a note, it should come last:
% \bibitem{label3} \note
% \end{subbibitems}

\end{thebibliography}

\end{document}

375: