1: %
2: % $Description: Author guidelines and sample document in LaTeX 2.09/2e$
3: %
4: % $Author: ienne $
5: % $Date: 1995/09/15 15:20:59 $
6: % $Revision: 1.4 $
7: %
8:
9: % \documentstyle[times,twocolumn,latex8]{article} % LaTeX 2.09
10:
11: \documentclass[10pt,twocolumn]{article} % LaTeX 2e
12: \usepackage{latex8} % LaTeX 2e
13: \usepackage{times} % LaTeX 2e
14: \usepackage{psfig}
15:
16: %-------------------------------------------------------------------------
17: % take the % away on next line to produce the final camera-ready version
18: %\pagestyle{empty}
19:
20: %-------------------------------------------------------------------------
21: \begin{document}
22:
23: \title{A Scientific Data Management System for Irregular
24: Applications\thanks{This work was supported in part by the
25: Mathematical, Information, and Computational Sciences Division
26: subprogram of the Office of Advanced Scientific Computing Research,
27: U.S.\ Department of Energy, under Contract W-31-109-Eng-38, and in part
28: by a Work-for-Others Subaward No.\ 751 with the University of
29: Illinois, under NSF Cooperative Agreement \#ACI-9619019.}}
30:
31:
32: \author{
33: Jaechun~No$^\dag$ ~~ Rajeev~Thakur$^\dag$ ~~ Dinesh~Kaushik$^\dag$ ~~
34: Lori~Freitag$^\dag$ ~~ Alok~Choudhary$^\ddag$ \\ \\
35: \begin{tabular}{c c}
36: \hskip .05in \dag Math.\ and Computer Science Division &
37: \hskip .05in \ddag Dept.\ of Elec. and Computer Eng. \\
38: \hskip .05in Argonne National Laboratory & \hskip .05in Northwestern
39: University\\
40: \hskip .05in Argonne, IL 60439 & \hskip .05in Evanston, IL 60208\\
41: {\tt \{jano,thakur,kaushik,freitag\}@mcs.anl.gov} & \hskip .05in {\tt choudhar@ece.nwu.edu}
42: \end{tabular}
43: }
44:
45: \maketitle
46: \thispagestyle{empty}
47:
48: \begin{abstract}
49: Many scientific applications are I/O intensive and generate large data
50: sets, spanning hundreds or thousands of ``files.'' Management,
51: storage, efficient access, and analysis of this data present an
52: extremely challenging task. We have developed a software system, called
53: Scientific Data Manager (SDM), that uses a combination of parallel
54: file I/O and database support for high-performance scientific data
55: management. SDM provides a high-level API to the user and, internally,
56: uses a parallel file system to store real data and a database to store
57: application-related metadata.
58: In this paper, we describe how we designed and implemented SDM to
59: support irregular applications. SDM can efficiently handle the reading
60: and writing of data in an irregular mesh, as well as the distribution of
61: index values. We describe the SDM user interface and how we have
62: implemented it to achieve high performance. SDM makes extensive use of
63: MPI-IO's noncontiguous collective I/O functions. SDM also uses the
64: concept of a {\em history file} to optimize the cost of the index
65: distribution using the metadata stored in database. We present
66: performance results with two irregular applications, a CFD code called
67: FUN3D and a Rayleigh-Taylor instability code, on the SGI Origin2000
68: at Argonne National Laboratory.
69: \end{abstract}
70:
71: %-------------------------------------------------------------------------
72: \Section{Introduction \label{sec:intro}}
73: Many large-scale scientific applications are I/O intensive and
74: generate large amounts of data (on the order of several hundred
75: gigabytes to terabytes)~\cite{delrosario:prospects,poole:sio-survey}.
76: Many of these applications perform their computation and I/O on an
77: irregularly discretized mesh. The data accesses in those applications
78: make extensive use of arrays, called indirection
79: array~\cite{D94,ravi95a} or map array~\cite{grop99a}, in which each
80: value of the array denotes the corresponding data position in memory
81: or in the file.
82:
83: The data distribution in irregular applications can be done either
84: by using compiler directives with the support of runtime
85: preprocessing~\cite{rein:fortran,HPFF93} or by using a runtime
86: library~\cite{D94,ravi95a}. Most of the previous work in the area of
87: unstructured-grid applications focuses mainly on computation and
88: communication in such applications, not on I/O.
89:
90: We have developed a software system for large-scale scientific data
91: management, called Scientific Data Manager (SDM)~\cite{jano:sc00},
92: that combines the good features of both file I/O and
93: databases. SDM provides a high-level, user-friendly
94: interface. Internally, SDM interacts with a database to store
95: application-related metadata and uses MPI-IO to store the real data on
96: a high-performance parallel file system. SDM takes advantage of
97: various I/O optimizations available in MPI-IO, such as collective I/O
98: and noncontiguous requests, in a manner that is transparent to the
99: user. As a result, users can access data with the performance of
100: parallel file I/O, without having to bother with the details of file
101: I/O.
102:
103: In a previous paper~\cite{jano:sc00}, we described the use of SDM for
104: regular applications. In this paper, we describe the API, design, and
105: implementation of SDM for irregular applications. SDM can efficiently
106: handle the reading and writing of data in an irregular mesh, as well as
107: the distribution of index values. SDM also uses the concept of a {\em
108: history file} to optimize the cost of the index distribution using the
109: metadata stored in database. We present performance results with two
110: irregular applications, a CFD code called FUN3D and a Rayleigh-Taylor
111: instability code, on the SGI Origin2000 at Argonne National
112: Laboratory.
113:
114: The rest of this paper is organized as follows. In
115: Section~\ref{sec:obj} we discuss our goals in developing SDM for
116: irregular problems. In Section~\ref{sec:impl} we present a typical
117: irregular problem and describe the detailed implementation issues of
118: SDM to solve the problem. Performance results on the SGI Origin2000
119: at Argonne National Laboratory are presented in
120: Section~\ref{sec:perf}. We discuss related work in
121: Section~\ref{sec:related} and conclude in Section~\ref{sec:conc}.
122:
123: \Section{Design Objectives \label{sec:obj}}
124:
125: Our main objectives in designing SDM for irregular applications were
126: to achieve high-performance parallel I/O, to provide
127: a convenient high-level API,
128: and to optimize the execution cost of irregular applications.
129:
130: \begin{itemize}
131: \item{\bf High-Performance I/O}.
132: To achieve high-performance I/O, we decided to use a parallel file-I/O
133: system to store real data and use MPI-IO to access this data. MPI-IO,
134: the I/O interface defined as part of the MPI-2
135: standard~\cite{grop99a,mpi97a}, is rapidly emerging as the standard,
136: portable API for I/O in parallel applications. MPI-IO is specifically
137: designed to enable the optimizations that are critical for
138: high-performance parallel I/O. Examples of these optimizations include
139: collective I/O, the ability to access noncontiguous data sets, and the
140: ability to pass hints to the implementation about access patterns,
141: file-striping parameters, and so forth.
142:
143: \item{\bf High-Level API}.
144: Our goal was to provide a high-level unified API for any kind of
145: application (regular or irregular) while encapsulating the details of
146: either MPI-IO or databases. With SDM, user can specify the data with a
147: high-level description, together with annotations, and use a similar
148: API for data retrieval. SDM internally translates the user's request
149: into appropriate MPI-IO calls, including creating MPI derived
150: datatypes for noncontiguous data~\cite{thak98a}. SDM also interacts
151: with the database when necessary, by using embedded SQL functions.
152:
153: \item{\bf Optimization for Irregular Applications}.
154: In irregular applications, the cost of an index distribution is
155: usually expensive, in terms of communication and computation.
156: In SDM, after partitioning the index values among processes, the local
157: index subsets of all processes are asynchronously written to a {\em
158: history file}, and the associated metadata is stored in database. When
159: the same index distribution is needed in subsequent runs, the index
160: values are read from the history file using the metadata stored
161: in database, and thereby the user can avoid repeating the
162: communication and computation for the same index distribution.
163: \end{itemize}
164:
165: \Section{Implementation \label{sec:impl}}
166:
167: We discuss the SDM API for solving a sample irregular problem and show
168: how the API is implemented.
169:
170: \SubSection{An Irregular Problem and SDM API}
171:
172: \begin{figure}[h]
173: \centerline{\psfig{figure=problem.eps,height=3.5in,width=3.5in}}
174: \caption{A sample irregular problem and its solution}
175: \label{fig:problem}
176: \end{figure}
177:
178: Figure~\ref{fig:problem} shows a typical irregular problem that sweeps
179: over the edges of an irregular mesh. In this problem, {\tt edge1} and
180: {\tt edge2} are two arrays representing nodes connected by an edge,
181: and arrays {\tt x} and {\tt y} are the actual data associated with
182: each edge and node, respectively. The partitioned arrays of {\tt
183: edge1}, {\tt edge2}, {\tt x}, and {\tt y} contain a single level of
184: ``ghost data'' beyond the boundaries to minimize remote accesses.
185: After the computation is completed, the results {\tt p} and {\tt q}
186: are written to a file in the order of global node numbers.
187:
188: Figures~\ref{fig:API1} and~\ref{fig:API2} respectively show the SDM API for
189: writing the results {\tt p} and {\tt q} and for partitioning {\tt
190: edge1}, {\tt edge2}, {\tt x}, and {\tt y} among processes to solve the
191: problem described in Figure~\ref{fig:problem}. We use the term {\em
192: import} to distinguish it from a {\em read} operation. A read
193: operation reads the data created in SDM, whereas an import operation
194: reads the data created outside of SDM.
195:
196: \begin{figure}[h]
197: \fbox{
198: \begin{minipage}[t]{3.2in}
199: \hspace*{0.1in}SDM\_initialize(nameOfApplication); \\
200: \hspace*{0.1in}result = SDM\_make\_datalist(2, \{p, q\}); \\
201: \hspace*{0.1in}result[0].data\_type = DOUBLE; \\
202: \hspace*{0.1in}SDM\_associate\_attributes(2, \&result[0]); \\
203: \hspace*{0.1in}handle = SDM\_set\_attributes(2, result); \\
204: \hspace*{0.1in}...... \\
205: \hspace*{0.1in}/* Partition edge1, edge2, x and y among processes \\
206: \hspace*{0.5in} (Figure 3) */ \\
207: \hspace*{0.1in}...... \\
208: \hspace*{0.1in}SDM\_data\_view(handle, 2, p, \&vector, \&localNodes); \\
209: \hspace*{0.1in}For (t=1; \(t<maxStep\); t++) \{ \\
210: \hspace*{0.3in}...... \\
211: \hspace*{0.3in}Do computation and produce results p and q; \\
212: \hspace*{0.3in}...... \\
213: \hspace*{0.3in}For (each checkpoint) \{ \\
214: \hspace*{0.5in}SDM\_write(handle, p, t, pBuf); \\
215: \hspace*{0.5in}SDM\_write(handle, q, t, qbuf); \\
216: \hspace*{0.3in}\} \\
217: \hspace*{0.1in}\} \\
218: \hspace*{0.1in}SDM\_finalize(handle, 2); \\
219: \end{minipage}
220: }
221: \caption{SDM API for writing results}
222: \label{fig:API1}
223: \end{figure}
224:
225: \SubSection{Implementation Details}
226:
227: The {\em partitioning vector}
228: is the one generated from a partitioning tool,
229: such as MeTis\cite{george:metis,kirk:metis}. Each value
230: of the vector denotes a processor rank where the node should be assigned.
231: In SDM, the partitioning vector should be replicated among processes.
232: Next, the {\em map array} is the one that specifies the mapping
233: of each element of the local array to the global array. This map array
234: is created in SDM after partitioning
235: the indexes using a partitioning vector,
236: or the map array can be specified by the user.
237:
238: Figure~\ref{fig:API1} shows the steps involved in initializing SDM to solve
239: the problem in Figure~\ref{fig:problem}.
240: Running the problem on SDM begins by calling
241: the {\em SDM\_initialize} to
242: establish database connection (for storing metadata).
243: Six database tables,
244: {\em run\_table}, {\em access\_pattern\_table}, {\em execution\_table},
245: {\em import\_table}, {\em index\_table}, and {\em index\_history\_table},
246: are created to store the metadata associated with the application.
247: Since two
248: data sets, {\tt p} and {\tt q}, are produced as a result of
249: computations and they have the same data type and global size,
250: these data sets are grouped in a data group to experiment different ways
251: of organizing data in files. All the metadata associated with these data sets
252: are stored in a database in the {\em SDM\_set\_attributes}.
253:
254: Figure~\ref{fig:API2} describes the steps in SDM
255: to partition the indexes and data.
256: The four arrays, {\tt edge1}, {\tt edge2}, {\tt x}, and {\tt y}, are
257: imported by creating a data group. Since these arrays
258: have been created outside of SDM, the user has no control
259: over the arrays except to read them, by specifying their data type,
260: appropriate file offset, and length. The user need not
261: create several data groups to import the arrays.
262: In the {\em SDM\_make\_importlist}, the metadata of this
263: imported data group, including a mechanism for the import
264: (partition), is stored in the {\em import\_table} for a later use.
265:
266: In order to partition {\tt edge1} and {\tt edge2}, the {\em SDM\_import}
267: is called to import the arrays with the parameters of file handle,
268: their position in the data group, file offset,
269: file length, and user buffer to hold
270: the data.
271: \begin{figure}[h]
272: \fbox{
273: \begin{minipage}[t]{3.2in}
274: \hspace*{0.1in}import = SDM\_make\_datalist(4, \{edge1, edge2, x, y\}); \\
275: \hspace*{0.1in}import[2].data\_type = DOUBLE; \\
276: \hspace*{0.1in}SDM\_associate\_attributes(2, \&import[2]); \\
277: \hspace*{0.1in}SDM\_make\_importlist(handle, 4, import); \\
278: \hspace*{0.1in} \\
279: \hspace*{0.1in}SDM\_import(handle, edge1, 0, totalEdges, tmp); \\
280: \hspace*{0.1in}SDM\_import(handle, edge2, (totalEdges*sizeof(int)), \\
281: \hspace*{0.3in}totalEdges, tmp+(totalEdges*sizeof(int))); \\
282: \hspace*{0.1in} \\
283: \hspace*{0.1in}/* Distribute edge1 and edge2 among processes */ \\
284: \hspace*{0.1in}vector = SDM\_partition\_table(handle, \\
285: \hspace*{0.5in} partitioning\_vector, totalNodes); \\
286: \hspace*{0.1in}partitioned\_edge = SDM\_partition\_index(handle, \\
287: \hspace*{0.3in}partitioning\_vector, totalNodes, \&tmp, \&vector); \\
288: \hspace*{0.1in} \\
289: \hspace*{0.1in}localEdges = SDM\_partition\_index\_size(handle); \\
290: \hspace*{0.1in}localNodes = SDM\_partition\_data\_size(handle); \\
291: \hspace*{0.1in} \\
292: \hspace*{0.1in}/* Make a history of this index distribution */ \\
293: \hspace*{0.1in}SDM\_index\_registry(handle, partitioned\_edge, vector); \\
294: \hspace*{0.1in} \\
295: \hspace*{0.1in}/* Import x */ \\
296: \hspace*{0.1in}file\_offset = 2*totalEdges*sizeof(int); \\
297: \hspace*{0.1in}SDM\_data\_view(handle, 1, x, \&partitioned\_edge, \\
298: \hspace*{0.3in}\&localEdges); \\
299: \hspace*{0.1in}SDM\_import(handle, x, file\_offset, totalEdges, xBuf); \\
300: \hspace*{0.1in} \\
301: \hspace*{0.1in}/* Import y */ \\
302: \hspace*{0.1in}file\_offset += totalEdges * sizeof(double); \\
303: \hspace*{0.1in}SDM\_data\_view(handle, 1, y, \&vector, \&localNodes); \\
304: \hspace*{0.1in}SDM\_import(handle, y, file\_offset, totalNodes, yBuf); \\
305: \hspace*{0.1in} \\
306: \hspace*{0.1in}SDM\_release\_importlist(handle, 4); \\
307: \end{minipage}
308: }
309: \caption{SDM API for partitioning indexes and data}
310: \label{fig:API2}
311: \end{figure}
312: The {\em SDM\_import} first accesses the
313: {\em index\_table} in the database
314: to see whether a history file exists with this problem size.
315: If so, the metadata, such as each process's partitioned index size and
316: the history file name, is retrieved from the {\em index\_table} and
317: {\em index\_history\_table}, and the control exits the {\em SDM\_import}.
318: Otherwise, the desired data is imported to the application.
319: Since {\tt edge1} and {\tt edge2} are being imported in a
320: contiguous way, there is no need to specify data mapping between the file
321: and processor memory. In the {\em SDM\_import},
322: the total domain (file length) is equally divided among processes, and
323: the data in the domain is contiguously imported into the application.
324: In our example, edges {\tt 0} and {\tt 1} are imported to process 0, and
325: edges {\tt 2} and {\tt 3} are imported to process 1.
326:
327: In the {\em SDM\_partition\_table},
328: the global partitioning vector,
329: {\tt partitioning\_vector} in Figure~\ref{fig:API2}, is
330: converted to the local vector, {\tt vector} in Figure~\ref{fig:API2},
331: to determine
332: which node should be assigned
333: to which process. In the example, nodes {\tt 0} and {\tt 3} are assigned
334: to process 0, and nodes {\tt 1}, {\tt 2}, and {\tt 4}
335: are assigned to process 1.
336:
337: If there is a history file for this problem size,
338: the {\em SDM\_partition\_index}
339: reads the already partitioned {\tt edge1} and {\tt edge2} from
340: the history file and converts them to the localized edges by using
341: the partitioning vector. This avoids the communication cost to
342: exchange each process's edges and the computation cost to choose the
343: edges to be assigned.
344: The disadvantage of the history file is that it cannot be used
345: if the program is run on a different number of processes from
346: when the file was created, because the edges and nodes being assigned to
347: each process dynamically change among different numbers of processes.
348: One efficient use of the history file is to create
349: it in advance for the various numbers of processes of interest.
350: As long as the user runs the application with any of those
351: numbers of processes,
352: an appropriate history can be chosen to reduce communication and
353: computation costs.
354: If there is no history file,
355: the edges in each process are distributed
356: by reading all the data in parallel and performing
357: a ring-oriented communication.
358:
359: If at least a node of an edge has been partitioned to a process, the edge is
360: assigned to the process. For example, edge {\tt 0} is assigned both to
361: process 0 and 1 because one node of the edge, {\tt edge1 0}, has been
362: partitioned to process 0 and the other node, {\tt edge2 1},
363: has been partitioned
364: to process 1. This edge is a ghost edge of both processes
365: being stored to minimize communication volumes.
366:
367: For storing the partitioned edges and nodes, including the ghost ones,
368: a certain amount of memory space is initially allocated to each process.
369: When the entire memory space is occupied by the partitioned data,
370: it is automatically doubled by adjusting the memory size.
371: This prevents the system from looking through the entire data
372: in two steps, one step to decide the size of memory space and
373: the other step to actually store the data in the memory space.
374:
375: After the edges and nodes are distributed, the edges in each process are
376: moved to the next process located at a ring network.
377: In the example, process 0 receives edges {\tt 2} and {\tt 3}, and
378: process 1 receives edges {\tt 0} and {\tt 1} to partition them
379: as described above.
380: After finishing the edge distribution, edges {\tt 0} and {\tt 2} are
381: assigned to process 0, and edges {\tt 0}, {\tt 1}, and {\tt 3} are
382: assigned to process 1. Similarly, nodes {\tt 0}, {\tt 1}, and {\tt 3} are
383: assigned to process 0, and nodes {\tt 0}, {\tt 1}, {\tt 2}, and {\tt 4} are
384: assigned to process 1.
385: In Figure~\ref{fig:API2}, {\tt partitioned\_edge} contains the edges
386: assigned to each process, and {\tt vector} contains the nodes assigned to it.
387: These are the two
388: map arrays to distribute the physical data associated with each edge
389: and node, respectively.
390:
391: If the {\em SDM\_index\_registry} was
392: executed for the first time and no history file was created earlier,
393: the metadata of the partitioned edges,
394: such as the partitioned size of each process, is stored in the
395: database tables {\em index\_table} and {\em index\_history\_table}.
396: Also, the partitioned edges are asynchronously written to a history file
397: to be retrieved in subsequent runs requiring the same edge distribution.
398: The use of the {\em SDM\_index\_registry} is optional.
399: If the user does not call the {\em SDM\_index\_registry},
400: no history file is created
401: after partitioning the edges.
402:
403: In order to import and partition data {\tt x} and {\tt y} in
404: the {\em SDM\_import}, the {\em SDM\_data\_view} must be called to
405: define the data mapping between a noncontiguous global view of the file
406: and a local view of the processor memory. Using the data mapping,
407: in the {\em SDM\_import}, the associated data is irregularly distributed
408: by calling a collective MPI-IO function.
409: In the {\em SDM\_release\_importlist},
410: the structures being used to import data
411: in the file handle are free.
412:
413: Figure~\ref{fig:API1} shows the steps to write two data sets, {\tt p} and
414: {\tt q}, after completing the computations at each checkpoint.
415: Before writing {\tt p} and {\tt q}, the data mapping
416: to write is defined in the {\em SDM\_data\_view}
417: using the map array ({\tt vector}) associated with
418: the node partition.
419:
420: SDM supports three different ways of organizing data in files. In
421: level 1, each data set generated at each time step
422: is written to a separate file.
423: This file organization is simple, but it
424: incurs the cost of a file-open, file-view to define the visible portion of
425: a file for each process and a file-close at each time step.
426: In level 2, each data set (within a group) is written to a separate
427: file, but different iterations of the same data set are appended to
428: the same file. This method results in a smaller number of files
429: and smaller file-open and file-view costs.
430: The offset in the file where data is appended is stored in the
431: {\em execution\_table}.
432: In level 3, all iterations of all data sets belonging to a group are
433: stored in a single file. As in
434: level~2, the file offset for each data set is stored in the
435: {\em execution\_table} by process~0 in the {\em SDM\_write} function.
436: The idea is that if a file system has high file-open and file-close costs, and
437: an application generates a high file-view cost, as in irregular
438: applications, SDM can generate a very small number of files.
439: However, if an application produces
440: a large number of data sets with a large problem size, level 3 file
441: organization would result in very large files, which may degrade
442: the performance.
443:
444: Figure~\ref{fig:flow} depicts the metadata storage in the database and
445: the organization of data in files in SDM for the example
446: in Figure~\ref{fig:problem}.
447:
448: \begin{figure}[h]
449: \centerline{\psfig{figure=diagram.eps,height=3.5in,width=3.2in}}
450: \caption{SDM execution flow to solve for the example in Figure~\ref{fig:problem}}
451: \label{fig:flow}
452: \end{figure}
453:
454: \Section{Performance Results \label{sec:perf}}
455:
456:
457: We obtained performance results on the SGI Origin2000 at Argonne
458: National Laboratory. The Origin2000 has 128 processors and
459: 10 Fibre Channel controllers connected to a total of 110 disks of
460: 9 GBytes capacity each. The file system on the Origin2000 is SGI's
461: XFS~\cite{XFS:Next,XFS:sweeney}. For the results, we used XFS buffered I/O
462: and MySQL~\cite{mysql-manual} to store the metadata.
463:
464: The first application template that we benchmarked was a tetrahedral
465: vertex-centered unstructured grid code developed by W. K. Anderson
466: of the NASA Langley Research Center~\cite{fun3d}. This application
467: uses a partitioning vector generated from MeTis to partition the nodes
468: and edges in a mesh.
469: To evaluate SDM ported to the application, we used
470: about 18M edges and 2M nodes.
471: At the initial stage, the application
472: imports edges, four data arrays associated with edges, and another four
473: data arrays associated with nodes. The total imported data size was
474: about 807 MBytes.
475: As a result of computations, the application wrote
476: about 21 MBytes of four data sets each
477: and 105 MBytes of a single data set.
478: Using 64 processors, we iterated the application template
479: two time steps; at each time step,
480: five data sets were written to files.
481:
482: The second application template that we ran was a Rayleigh-Taylor instability
483: application~\cite{lori:mesh} that is motivated by a joint project between the
484: University of Chicago and Argonne to study thermonuclear
485: flashes on astrophysical objects.
486: Whenever the current time reaches a certain point,
487: the application writes two data sets:
488: a single node data set associated with vertices in a mesh,
489: and a triangle data set associated with triangles on tetrahedral faces.
490: In the application template, we wrote about 36 MBytes of the node data set
491: and about 74 MBytes of the triangle data set at each time step.
492: Since we iterated the template five times, the total data size written
493: was approximately 550 MBytes.
494:
495: \SubSection{Results for FUN3D}
496:
497: \begin{figure}[h]
498: \centerline{\psfig{figure=import_time.eps,height=3in,width=3.2in}}
499: \caption{Execution time for partitioning indices and data in FUN3D}
500: \label{fig:import}
501: \end{figure}
502:
503: Figure~\ref{fig:import} shows the bandwidth to import and partition
504: 18M edges, four data sets each of 144 MBytes of data
505: associated with edges, and
506: another four data sets each of 21 MBytes of data associated with nodes.
507: The original version of the application---without using SDM---performs all
508: the I/O operations
509: by a single process (process 0), which then broadcasts
510: data to other processes. SDM performs I/O in parallel from
511: all processes using MPI-IO.
512: The bar labeled {\tt index distri.} in Figure~\ref{fig:import}
513: shows the communication and computation costs
514: to partition the edges after importing them to the application.
515: Also, the bar labeled {\tt import} shows the cost of
516: reading the edges and eight data arrays.
517:
518: The original application reads the edges in two steps: one
519: step to determine the amount of memory to
520: store the partitioned edges and the other step to actually
521: read the edges. SDM, however, extends the allocated memory
522: dynamically as needed (using C function {\tt realloc}) and
523: is therefore able to read the partitioned edges in a single step.
524: This contributes to the reduced cost of {\tt index distri.} when using
525: SDM.
526: When partitioning the edges with a history file,
527: the cost of {\tt index distri.}
528: is nothing but reading the history file of the edges in a contiguous way,
529: including the database cost to access the metadata.
530: Since the history file contains the already partitioned edges, there is
531: no need to import the edges; hence, the read cost
532: in {\tt import} is reduced.
533:
534: \begin{figure}[h]
535: \centerline{\psfig{figure=app1.eps,height=3in,width=3.2in}}
536: \caption{I/O bandwidth for reading and writing data in FUN3D}
537: \label{fig:app1}
538: \end{figure}
539:
540: Figure~\ref{fig:app1} shows the I/O bandwidth for writing and then
541: reading back the data generated from the application using 64
542: processors. The total data size was approximately 379 MBytes. In
543: level 1, each data array is written to separate files, resulting in
544: the creation of 10 different files. Each time the data array is
545: written to files, level 1 requires the cost for opening a file and
546: defining an MPI-IO {\em file view} to access the data from the portion
547: of the file pointed by the global file offset. In level 2, however,
548: each data array generated at each time step is appended in five files,
549: generating five file-open and file-view costs. This reduced number of
550: files improves the I/O performance slightly. In level
551: 3, only two files are generated, resulting in the best I/O performance
552: among the three file organizations. On the SGI Origin2000, the
553: difference between three file organizations is not significant because
554: the file-open cost is small.
555:
556: \SubSection{Results of RT Application}
557:
558: Figure~\ref{fig:app2} shows the I/O bandwidth for writing approximately
559: 550 MBytes of data. In the original application, the write operation
560: is performed sequentially. In other words, after seeking
561: the starting position in a file, processes write their local portion
562: of data one by one.
563: When we ported the application to SDM, the I/O performance increased
564: significantly because of the I/O optimizations of MPI-IO.
565:
566: In SDM, we wrote the node data set according to the global node number of the
567: partitioned nodes, and wrote the triangle data set contiguously.
568: Since two data sets are written to files separately, SDM supports
569: two different ways of file organization:
570: level 1 and level 2/3 (levels 2 and 3 are identical in this case).
571: As can be seen in Figure~\ref{fig:app2}, on the SGI Origin2000, changing the
572: file organization does not affect the I/O performance, since the cost of
573: file-open and file-view is very low.
574:
575: When the number of processors increases to write the same
576: data size, we can see the degradation of the I/O performance.
577: With 32 processors, the data size being written at each time step is
578: about 1 MByte for the node data set and 2 MBytes for the triangle data set.
579: If the number of processors goes up to 64, the buffer size
580: of each process becomes smaller, resulting in the performance reduction.
581: Clearly, there is an optimal buffer size that shows the best I/O
582: performance.
583:
584: \begin{figure}[h]
585: \centerline{\psfig{figure=app2.eps,height=3in,width=3.2in}}
586: \caption{I/O bandwidth for RT}
587: \label{fig:app2}
588: \end{figure}
589:
590: \Section{Related Work \label{sec:related}}
591:
592: Several efforts have sought to optimize I/O in parallel file
593: systems and runtime
594: libraries~\cite{benn94a,bord93b,corb96b,hube95a,kotz97a,madh96a,nieu97a,seam95b,thak96f}.
595: SRB (Storage Resource Broker)~\cite{baru98a} provides an uniform
596: interface to access various storage systems, such as file systems,
597: Unitree, HPSS and database objects. However, it does not fully support
598: the optimizations implemented in MPI-IO. Shoshani et
599: al.~\cite{shos98a,shos99a} describe an architecture for optimizing
600: access to large volumes of scientific data stored on tapes. The
601: Active Data Repository~\cite{kurc99a} and DataCutter~\cite{datacutter}
602: optimize storage, retrieval, and processing of very large
603: multidimensional datasets. The main difference between our work and
604: other efforts in I/O is that SDM aims to combine the good
605: features of parallel file I/O and databases, whereas other efforts
606: focus on either parallel I/O or data management, not both.
607:
608:
609: \Section{Summary \label{sec:conc}}
610:
611: We have described the SDM system, API, and implementation for I/O in
612: irregular applications. SDM provides an easy-to-use user interface for
613: managing large data sets and internally uses MPI-IO for
614: high-performance I/O and a database for storing metadata. We studied
615: the performance of SDM using two irregular applications: FUN3D and
616: RT. When we ported both applications to use SDM, there was a
617: significant improvement in I/O performance compared with the original
618: application. Also, we observed that using a history file for the
619: index distribution helped to reduce the computation and
620: communication costs. However, changing the SDM file organization from
621: level 1 to level 3 did not greatly affect the performance on the SGI
622: Origin2000, because of its low file-open and file-view costs.
623:
624: We plan to develop SDM further to support visualization applications
625: and to investigate whether SDM can effectively be used as a strategy
626: for implementing libraries such as HDF~\cite{hdf94a} and
627: netCDF~\cite{netc91a}.
628:
629: %-------------------------------------------------------------------------
630: \bibliographystyle{latex8}
631: \bibliography{bib1,bib2,bib3,bib4,bib5}
632: %\bibliography{/homes/jano/BIBTEX/pario-beta,/homes/jano/BIBTEX/main,/homes/jano/BIBTEX/MORE,/homes/jano/BIBTEX/SC,/homes/thakur/tex/bib/papers}
633: %\bibliography{latex8}
634:
635: \end{document}
636:
637:
638:
639:
640: