0102:cs0102016/main.tex

1: %

2: %  $Description: Author guidelines and sample document in LaTeX 2.09/2e$

3: %

4: %  $Author: ienne $

5: %  $Date: 1995/09/15 15:20:59 $

6: %  $Revision: 1.4 $

7: %

8:

9: % \documentstyle[times,twocolumn,latex8]{article} % LaTeX 2.09

10:

11: \documentclass[10pt,twocolumn]{article}          % LaTeX 2e

12: \usepackage{latex8}                              % LaTeX 2e

13: \usepackage{times}                               % LaTeX 2e

14: \usepackage{psfig}

15:

16: %-------------------------------------------------------------------------

17: % take the % away on next line to produce the final camera-ready version

18: %\pagestyle{empty}

19:

20: %-------------------------------------------------------------------------

21: \begin{document}

22:

23: \title{A Scientific Data Management System for Irregular

24: Applications\thanks{This work was supported in part by the

25: Mathematical, Information, and Computational Sciences Division

26: subprogram of the Office of Advanced Scientific Computing Research,

27: U.S.\ Department of Energy, under Contract W-31-109-Eng-38, and in part

28: by a Work-for-Others Subaward No.\ 751 with the University of

29: Illinois, under NSF Cooperative Agreement \#ACI-9619019.}}

30:

31:

32: \author{

33: Jaechun~No$^\dag$ ~~ Rajeev~Thakur$^\dag$ ~~ Dinesh~Kaushik$^\dag$ ~~

34: Lori~Freitag$^\dag$ ~~ Alok~Choudhary$^\ddag$ \\ \\

35: \begin{tabular}{c c}

36: \hskip .05in \dag Math.\ and Computer Science Division &

37:  \hskip .05in \ddag Dept.\ of Elec.\ and Computer Eng. \\

38: \hskip .05in Argonne National Laboratory & \hskip .05in Northwestern

39: University\\

40: \hskip .05in Argonne, IL 60439 & \hskip .05in Evanston, IL 60208\\

41: {\tt \{jano,thakur,kaushik,freitag\}@mcs.anl.gov} & \hskip .05in {\tt choudhar@ece.nwu.edu}

42: \end{tabular}

43: }

44:

45: \maketitle

46: \thispagestyle{empty}

47:

48: \begin{abstract}

49: Many scientific applications are I/O intensive and generate large data

50: sets, spanning hundreds or thousands of ``files.'' Management,

51: storage, efficient access, and analysis of this data present an

52: extremely challenging task. We have developed a software system, called

53: Scientific Data Manager (SDM), that uses a combination of parallel

54: file I/O and database support for high-performance scientific data

55: management. SDM provides a high-level API to the user and, internally,

56: uses a parallel file system to store real data and a database to store

57: application-related metadata.

58: In this paper, we describe how we designed and implemented SDM to

59: support irregular applications. SDM can efficiently handle the reading

60: and writing of data in an irregular mesh, as well as the distribution of

61: index values. We describe the SDM user interface and how we have

62: implemented it to achieve high performance. SDM makes extensive use of

63: MPI-IO's noncontiguous collective I/O functions.  SDM also uses the

64: concept of a {\em history file} to optimize the cost of the index

65: distribution using the metadata stored in the database. We present

66: performance results with two irregular applications, a CFD code called

67: FUN3D and a Rayleigh-Taylor instability code, on the SGI Origin2000

68: at Argonne National Laboratory.

69: \end{abstract}

70:

71: %-------------------------------------------------------------------------

72: \Section{Introduction \label{sec:intro}}

73: Many large-scale scientific applications are I/O intensive and

74: generate large amounts of data (on the order of several hundred

75: gigabytes to terabytes)~\cite{delrosario:prospects,poole:sio-survey}.

76: Many of these applications perform their computation and I/O on an

77: irregularly discretized mesh.  The data accesses in those applications

78: make extensive use of arrays, called indirection

79: arrays~\cite{D94,ravi95a} or map arrays~\cite{grop99a}, in which each

80: value of the array denotes the corresponding data position in memory

81: or in the file.

82:

83: The data distribution in irregular applications can be done either

84: by using compiler directives with the support of runtime

85: preprocessing~\cite{rein:fortran,HPFF93} or by using a runtime

86: library~\cite{D94,ravi95a}.  Most of the previous work in the area of

87: unstructured-grid applications focuses mainly on computation and

88: communication in such applications, not on I/O.

89:

90: We have developed a software system for large-scale scientific data

91: management, called Scientific Data Manager (SDM)~\cite{jano:sc00},

92: that combines the good features of both file I/O and

93: databases. SDM provides a high-level, user-friendly

94: interface. Internally, SDM interacts with a database to store

95: application-related metadata and uses MPI-IO to store the real data on

96: a high-performance parallel file system. SDM takes advantage of

97: various I/O optimizations available in MPI-IO, such as collective I/O

98: and noncontiguous requests, in a manner that is transparent to the

99: user.  As a result, users can access data with the performance of

100: parallel file I/O, without having to bother with the details of file

101: I/O.

102:

103: In a previous paper~\cite{jano:sc00}, we described the use of SDM for

104: regular applications. In this paper, we describe the API, design, and

105: implementation of SDM for irregular applications. SDM can efficiently

106: handle the reading and writing of data in an irregular mesh, as well as

107: the distribution of index values. SDM also uses the concept of a {\em

108: history file} to optimize the cost of the index distribution using the

109: metadata stored in the database. We present performance results with two

110: irregular applications, a CFD code called FUN3D and a Rayleigh-Taylor

111: instability code, on the SGI Origin2000 at Argonne National

112: Laboratory.

113:

114: The rest of this paper is organized as follows. In

115: Section~\ref{sec:obj} we discuss our goals in developing SDM for

116: irregular problems.  In Section~\ref{sec:impl} we present a typical

117: irregular problem and describe the detailed implementation issues of

118: SDM to solve the problem.  Performance results on the SGI Origin2000

119: at Argonne National Laboratory are presented in

120: Section~\ref{sec:perf}. We discuss related work in

121: Section~\ref{sec:related} and conclude in Section~\ref{sec:conc}.

122:

123: \Section{Design Objectives \label{sec:obj}}

124:

125: Our main objectives in designing SDM for irregular applications were

126: to achieve high-performance parallel I/O, to provide

127: a convenient high-level API,

128: and to optimize the execution cost of irregular applications.

129:

130: \begin{itemize}

131: \item{\bf High-Performance I/O}.

132: To achieve high-performance I/O, we decided to use a parallel file-I/O

133: system to store real data and use MPI-IO to access this data.  MPI-IO,

134: the I/O interface defined as part of the MPI-2

135: standard~\cite{grop99a,mpi97a}, is rapidly emerging as the standard,

136: portable API for I/O in parallel applications. MPI-IO is specifically

137: designed to enable the optimizations that are critical for

138: high-performance parallel I/O. Examples of these optimizations include

139: collective I/O, the ability to access noncontiguous data sets, and the

140: ability to pass hints to the implementation about access patterns,

141: file-striping parameters, and so forth.

142:

143: \item{\bf High-Level API}.

144: Our goal was to provide a high-level unified API for any kind of

145: application (regular or irregular) while encapsulating the details of

146: either MPI-IO or databases.  With SDM, the user can specify the data with a

147: high-level description, together with annotations, and use a similar

148: API for data retrieval.  SDM internally translates the user's request

149: into appropriate MPI-IO calls, including creating MPI derived

150: datatypes for noncontiguous data~\cite{thak98a}.  SDM also interacts

151: with the database when necessary, by using embedded SQL functions.

152:

153: \item{\bf Optimization for Irregular Applications}.

154: In irregular applications, index distribution is

155: usually expensive in terms of communication and computation.

156: In SDM, after partitioning the index values among processes, the local

157: index subsets of all processes are asynchronously written to a {\em

158: history file}, and the associated metadata is stored in the database.  When

159: the same index distribution is needed in subsequent runs, the index

160: values are read from the history file using the metadata stored

161: in the database, and thereby the user can avoid repeating the

162: communication and computation for the same index distribution.

163: \end{itemize}

164:

165: \Section{Implementation \label{sec:impl}}

166:

167: We discuss the SDM API for solving a sample irregular problem and show

168:  how the API is implemented.

169:

170: \SubSection{An Irregular Problem and SDM API}

171:

172: \begin{figure}[h]

173: \centerline{\psfig{figure=problem.eps,height=3.5in,width=3.5in}}

174: \caption{A sample irregular problem and its solution}

175: \label{fig:problem}

176: \end{figure}

177:

178: Figure~\ref{fig:problem} shows a typical irregular problem that sweeps

179: over the edges of an irregular mesh.  In this problem, {\tt edge1} and

180: {\tt edge2} are two arrays representing nodes connected by an edge,

181: and arrays {\tt x} and {\tt y} are the actual data associated with

182: each edge and node, respectively.  The partitioned arrays of {\tt

183: edge1}, {\tt edge2}, {\tt x}, and {\tt y} contain a single level of

184: ``ghost data'' beyond the boundaries to minimize remote accesses.

185: After the computation is completed, the results {\tt p} and {\tt q}

186: are written to a file in the order of global node numbers.
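
The access pattern described above can be sketched in C as follows. This is only an illustration of how the indirection arrays {\tt edge1} and {\tt edge2} are used: the function name {\tt edge\_sweep} and the update rule inside the loop are assumptions made for the sketch, not the kernel shown in Figure~\ref{fig:problem}.

```c
#include <assert.h>
#include <stddef.h>

/* Sweep over the edges of an irregular mesh.
 * edge1[i] and edge2[i] are the two nodes joined by edge i (the
 * indirection arrays), x[i] is per-edge data, y[n] is per-node data.
 * The arithmetic below is a placeholder chosen for illustration. */
void edge_sweep(size_t n_edges, const int *edge1, const int *edge2,
                const double *x, const double *y,
                size_t n_nodes, double *p, double *q)
{
    for (size_t n = 0; n < n_nodes; n++) {
        p[n] = 0.0;
        q[n] = 0.0;
    }
    for (size_t i = 0; i < n_edges; i++) {
        int n1 = edge1[i];
        int n2 = edge2[i];
        assert(n1 >= 0 && (size_t)n1 < n_nodes);
        assert(n2 >= 0 && (size_t)n2 < n_nodes);
        /* Node data is reached only through the indirection arrays. */
        double flux = x[i] * (y[n1] - y[n2]);   /* placeholder update */
        p[n1] += flux;
        p[n2] -= flux;
        q[n1] += x[i];
        q[n2] += x[i];
    }
}
```

Because {\tt y}, {\tt p}, and {\tt q} are indexed through {\tt edge1} and {\tt edge2}, the memory locations touched by an edge are known only at run time, which is what makes the data distribution irregular.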

187:

188: Figures~\ref{fig:API1} and~\ref{fig:API2} respectively show the SDM API for

189: writing the results {\tt p} and {\tt q} and for partitioning {\tt

190: edge1}, {\tt edge2}, {\tt x}, and {\tt y} among processes to solve the

191: problem described in Figure~\ref{fig:problem}.  We use the term {\em

192: import} to distinguish it from a {\em read} operation.  A read

193: operation reads the data created in SDM, whereas an import operation

194: reads the data created outside of SDM.

195:

196: \begin{figure}[h]

197: \fbox{

198: \begin{minipage}[t]{3.2in}

199: \hspace*{0.1in}SDM\_initialize(nameOfApplication); \\

200: \hspace*{0.1in}result = SDM\_make\_datalist(2, \{p, q\}); \\

201: \hspace*{0.1in}result[0].data\_type = DOUBLE; \\

202: \hspace*{0.1in}SDM\_associate\_attributes(2, \&result[0]); \\

203: \hspace*{0.1in}handle = SDM\_set\_attributes(2, result); \\

204: \hspace*{0.1in}...... \\

205: \hspace*{0.1in}/* Partition edge1, edge2, x and y among processes \\

206: \hspace*{0.5in} (Figure 3) */ \\

207: \hspace*{0.1in}...... \\

208: \hspace*{0.1in}SDM\_data\_view(handle, 2, p, \&vector, \&localNodes); \\

209: \hspace*{0.1in}For (t=1; \(t<maxStep\); t++) \{ \\

210: \hspace*{0.3in}...... \\

211: \hspace*{0.3in}Do computation and produce results p and q; \\

212: \hspace*{0.3in}...... \\

213: \hspace*{0.3in}For (each checkpoint) \{ \\

214: \hspace*{0.5in}SDM\_write(handle, p, t, pBuf); \\

215: \hspace*{0.5in}SDM\_write(handle, q, t, qBuf); \\

216: \hspace*{0.3in}\} \\

217: \hspace*{0.1in}\} \\

218: \hspace*{0.1in}SDM\_finalize(handle, 2); \\

219: \end{minipage}

220: }

221: \caption{SDM API for writing results}

222: \label{fig:API1}

223: \end{figure}

224:

225: \SubSection{Implementation Details}

226:

227: The {\em partitioning vector}

228:  is the one generated from a partitioning tool,

229:  such as MeTis~\cite{george:metis,kirk:metis}. Each value

230: of the vector denotes a processor rank where the node should be assigned.

231: In SDM, the partitioning vector should be replicated among processes.

232: Next, the {\em map array} is the one that specifies the mapping

233: of each element of the local array to the global array. This map array

234: is created in SDM after partitioning

235: the indexes using a partitioning vector,

236: or the map array can be specified by the user.

237:

238: Figure~\ref{fig:API1} shows the steps involved in initializing SDM to solve

239: the problem in Figure~\ref{fig:problem}.

240: Running the problem on SDM begins by calling

241: {\em SDM\_initialize} to

242: establish a database connection (for storing metadata).

243: Six database tables,

244: {\em run\_table}, {\em access\_pattern\_table}, {\em execution\_table},

245: {\em import\_table}, {\em index\_table}, and {\em index\_history\_table},

246: are created to store the metadata associated with the application.

247: Since two

248: data sets, {\tt p} and {\tt q}, are produced as a result of

249: computations and they have the same data type and global size,

250: these data sets are grouped in a data group to experiment with different ways

251: of organizing data in files. All the metadata associated with these data sets

252: is stored in the database by {\em SDM\_set\_attributes}.

253:

254: Figure~\ref{fig:API2} describes the steps in SDM

255: to partition the indexes and data.

256: The four arrays, {\tt edge1}, {\tt edge2}, {\tt x}, and {\tt y}, are

257: imported by creating a data group. Since these arrays

258: have been created outside of SDM, the user has no control

259: over the arrays except to read them, by specifying their data type,

260: appropriate file offset, and length. The user need not

261: create several data groups to import the arrays.

262: In the {\em SDM\_make\_importlist}, the metadata of this

263: imported data group, including a mechanism for the import

264: (partition), is stored in the {\em import\_table} for later use.

265:

266: In order to partition {\tt edge1} and {\tt edge2}, the {\em SDM\_import}

267: is called to import the arrays with the parameters of file handle,

268: their position in the data group, file offset,

269: file length, and user buffer to hold

270: the data.

271: \begin{figure}[h]

272: \fbox{

273: \begin{minipage}[t]{3.2in}

274: \hspace*{0.1in}import = SDM\_make\_datalist(4, \{edge1, edge2, x, y\}); \\

275: \hspace*{0.1in}import[2].data\_type = DOUBLE; \\

276: \hspace*{0.1in}SDM\_associate\_attributes(2, \&import[2]); \\

277: \hspace*{0.1in}SDM\_make\_importlist(handle, 4, import); \\

278: \hspace*{0.1in} \\

279: \hspace*{0.1in}SDM\_import(handle, edge1, 0, totalEdges, tmp); \\

280: \hspace*{0.1in}SDM\_import(handle, edge2, (totalEdges*sizeof(int)), \\

281: \hspace*{0.3in}totalEdges, tmp+(totalEdges*sizeof(int))); \\

282: \hspace*{0.1in} \\

283: \hspace*{0.1in}/* Distribute edge1 and edge2 among processes */ \\

284: \hspace*{0.1in}vector = SDM\_partition\_table(handle, \\

285: \hspace*{0.5in} partitioning\_vector, totalNodes); \\

286: \hspace*{0.1in}partitioned\_edge = SDM\_partition\_index(handle, \\

287: \hspace*{0.3in}partitioning\_vector, totalNodes, \&tmp, \&vector); \\

288: \hspace*{0.1in} \\

289: \hspace*{0.1in}localEdges = SDM\_partition\_index\_size(handle); \\

290: \hspace*{0.1in}localNodes = SDM\_partition\_data\_size(handle); \\

291: \hspace*{0.1in} \\

292: \hspace*{0.1in}/* Make a history of this index distribution */ \\

293: \hspace*{0.1in}SDM\_index\_registry(handle, partitioned\_edge, vector); \\

294: \hspace*{0.1in} \\

295: \hspace*{0.1in}/* Import x */ \\

296: \hspace*{0.1in}file\_offset = 2*totalEdges*sizeof(int); \\

297: \hspace*{0.1in}SDM\_data\_view(handle, 1, x, \&partitioned\_edge, \\

298: \hspace*{0.3in}\&localEdges); \\

299: \hspace*{0.1in}SDM\_import(handle, x, file\_offset, totalEdges, xBuf); \\

300: \hspace*{0.1in} \\

301: \hspace*{0.1in}/* Import y */ \\

302: \hspace*{0.1in}file\_offset += totalEdges * sizeof(double); \\

303: \hspace*{0.1in}SDM\_data\_view(handle, 1, y, \&vector, \&localNodes); \\

304: \hspace*{0.1in}SDM\_import(handle, y, file\_offset, totalNodes, yBuf); \\

305: \hspace*{0.1in} \\

306: \hspace*{0.1in}SDM\_release\_importlist(handle, 4); \\

307: \end{minipage}

308: }

309: \caption{SDM API for partitioning indexes and data}

310: \label{fig:API2}

311: \end{figure}

312: The {\em SDM\_import} first accesses the

313:  {\em index\_table} in the database

314: to see whether a history file exists with this problem size.

315: If so, the metadata, such as each process's partitioned index size and

316: the history file name, is retrieved from the {\em index\_table} and

317: {\em index\_history\_table}, and the control exits the {\em SDM\_import}.

318: Otherwise, the desired data is imported to the application.

319: Since {\tt edge1} and {\tt edge2} are being imported in a

320: contiguous way, there is no need to specify data mapping between the file

321: and processor memory. In the {\em SDM\_import},

322: the total domain (file length) is equally divided among processes, and

323: the data in the domain is contiguously imported into the application.

324: In our example, edges {\tt 0} and {\tt 1} are imported to process 0, and

325: edges {\tt 2} and {\tt 3} are imported to process 1.
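
The equal division of a contiguous domain can be sketched in C as below. The function name {\tt block\_range} is hypothetical, and the handling of a remainder (one extra element for each of the lowest ranks) is an assumption, since the text describes only the evenly divisible case.

```c
#include <assert.h>

/* Contiguous range [*first, *first + *count) of a domain of `total`
 * elements that is imported by process `rank` out of `nprocs`.
 * Assumption: a remainder is spread one element each over the lowest
 * ranks; the paper only describes the evenly divisible case. */
void block_range(long total, int nprocs, int rank, long *first, long *count)
{
    assert(total >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    long base = total / nprocs;
    long extra = total % nprocs;
    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

For the four edges and two processes of the example, this gives edges 0 and 1 to process 0 and edges 2 and 3 to process 1.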

326:

327: In the {\em SDM\_partition\_table},

328: the global partitioning vector,

329:  {\tt partitioning\_vector} in Figure~\ref{fig:API2}, is

330: converted to the local vector, {\tt vector} in Figure~\ref{fig:API2},

331: to determine

332: which node should be assigned

333: to which process. In the example, nodes {\tt 0} and {\tt 3} are assigned

334: to process 0, and nodes {\tt 1}, {\tt 2}, and {\tt 4}

335: are assigned to process 1.
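
A minimal C sketch of this conversion is given below. It assumes the replicated partitioning vector is a plain array of owner ranks; the function name {\tt owned\_nodes} is hypothetical and is not part of the SDM API.

```c
#include <assert.h>
#include <stdlib.h>

/* Return a newly allocated list of the global node numbers owned by
 * `rank` according to the replicated partitioning vector; *n_local
 * receives the list length. The caller frees the result. */
int *owned_nodes(const int *part_vec, int total_nodes, int rank, int *n_local)
{
    assert(part_vec != NULL && total_nodes >= 0 && n_local != NULL);
    int *local = malloc((total_nodes > 0 ? total_nodes : 1) * sizeof *local);
    if (local == NULL)
        return NULL;
    int n = 0;
    for (int node = 0; node < total_nodes; node++)
        if (part_vec[node] == rank)
            local[n++] = node;      /* local index n maps to global `node` */
    *n_local = n;
    return local;
}
```

With the partitioning vector {\tt \{0, 1, 1, 0, 1\}} implied by the example, process 0 obtains nodes 0 and 3 and process 1 obtains nodes 1, 2, and 4.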

336:

337: If there is a history file for this problem size,

338: the {\em SDM\_partition\_index}

339: reads the already partitioned {\tt edge1} and {\tt edge2} from

340: the history file and converts them to the localized edges by using

341: the partitioning vector. This avoids the communication cost to

342: exchange each process's edges and the computation cost to choose the

343: edges to be assigned.

344: The disadvantage of the history file is that it cannot be used

345: if the program is run on a different number of processes from

346: when the file was created, because the edges and nodes being assigned to

347: each process dynamically change among different numbers of processes.

348: One efficient use of the history file is to create

349: it in advance for the various numbers of processes of interest.

350: As long as the user runs the application with any of those

351:  numbers of processes,

352: an appropriate history can be chosen to reduce communication and

353: computation costs.

354: If there is no history file,

355: the edges in each process are distributed

356: by reading all the data in parallel and performing

357: a ring-oriented communication.
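
The decision of whether a history file can be reused can be sketched as a lookup keyed on both the problem size and the number of processes. The C fragment below is an in-memory stand-in for the database query: the record layout {\tt history\_entry} and the function {\tt find\_history} are hypothetical and do not reflect the actual schema of the {\em index\_table} or {\em index\_history\_table}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record describing one stored index distribution. */
struct history_entry {
    long total_edges;   /* problem size the history was created for */
    int nprocs;         /* number of processes it was created with */
    const char *file;   /* name of the history file */
};

/* Return the name of a usable history file, or NULL when none exists.
 * A history matches only if both the problem size and the number of
 * processes agree with the current run. */
const char *find_history(const struct history_entry *tab, size_t n,
                         long total_edges, int nprocs)
{
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (tab[i].total_edges == total_edges && tab[i].nprocs == nprocs)
            return tab[i].file;
    return NULL;
}
```

A NULL result corresponds to the case with no history file, in which the edges must be distributed by communication.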

358:

359: If at least one node of an edge has been partitioned to a process, the edge is

360: assigned to the process. For example, edge {\tt 0} is assigned to both

361: processes 0 and 1 because one node of the edge, {\tt edge1 0}, has been

362: partitioned to process 0 and the other node, {\tt edge2 1},

363: has been partitioned

364: to process 1. This edge is stored as a ghost edge on both processes

365: to minimize communication volume.

366:

367: For storing the partitioned edges and nodes, including the ghost ones,

368: a certain amount of memory space is initially allocated to each process.

369: When the entire memory space is occupied by the partitioned data,

370: it is automatically doubled by adjusting the memory size.

371: This prevents the system from looking through the entire data

372: in two steps, one step to decide the size of memory space and

373: the other step to actually store the data in the memory space.
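
The doubling strategy can be sketched with the C function {\tt realloc}. The structure {\tt int\_buf}, the function {\tt int\_buf\_push}, and the initial capacity of four elements are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Growable array of partitioned index values. */
struct int_buf {
    int *data;
    size_t len;   /* elements in use */
    size_t cap;   /* elements allocated */
};

/* Append one value, doubling the capacity when the buffer is full.
 * Returns 0 on success and -1 if memory could not be obtained. */
int int_buf_push(struct int_buf *b, int value)
{
    assert(b != NULL && b->len <= b->cap);
    if (b->len == b->cap) {
        /* The initial capacity of 4 is an arbitrary choice. */
        size_t new_cap = b->cap ? 2 * b->cap : 4;
        int *p = realloc(b->data, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
    }
    b->data[b->len++] = value;
    return 0;
}
```

Each element is appended in a single pass over the data, so no separate counting pass is needed to size the buffer in advance.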

374:

375: After the edges and nodes are distributed, the edges in each process are

376: moved to the next process in a ring network.

377: In the example, process 0 receives edges {\tt 2} and {\tt 3}, and

378: process 1 receives edges {\tt 0} and {\tt 1} to partition them

379:  as described above.

380: After finishing the edge distribution, edges {\tt 0} and {\tt 2} are

381: assigned to process 0, and edges {\tt 0}, {\tt 1}, and {\tt 3} are

382: assigned to process 1. Similarly, nodes {\tt 0}, {\tt 1}, and {\tt 3} are

383: assigned to process 0, and nodes {\tt 0}, {\tt 1}, {\tt 2}, and {\tt 4} are

384: assigned to process 1.

385: In Figure~\ref{fig:API2}, {\tt partitioned\_edge} contains the edges

386: assigned to each process, and {\tt vector} contains the nodes assigned to it.

387: These are the two

388: map arrays to distribute the physical data associated with each edge

389: and node, respectively.
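
The ring-based distribution for the two-process example can be simulated serially in C, as sketched below. Only the endpoints of edge {\tt 0} are given in the text; the endpoints of edges {\tt 1} to {\tt 3} are assumed values chosen to be consistent with the stated result, and {\tt ring\_distribute} is a hypothetical name. The sketch replaces message passing with index arithmetic and is not the SDM implementation.

```c
#include <assert.h>
#include <string.h>

#define NPROCS 2
#define NEDGES 4
#define NNODES 5
#define BLOCK  (NEDGES / NPROCS)   /* edges imported per process */

/* Endpoints of each edge. Edge 0 = (0,1) follows the text; the other
 * endpoints are assumed so that the outcome matches the example. */
static const int edge1[NEDGES] = {0, 1, 0, 2};
static const int edge2[NEDGES] = {1, 2, 3, 4};
/* Partitioning vector: nodes 0 and 3 -> process 0; 1, 2, 4 -> process 1. */
static const int part_vec[NNODES] = {0, 1, 1, 0, 1};

/* has_edge[p][e] and has_node[p][n] are set to 1 when process p stores
 * edge e or node n, either as owned or as ghost data. */
void ring_distribute(int has_edge[NPROCS][NEDGES],
                     int has_node[NPROCS][NNODES])
{
    memset(has_edge, 0, sizeof(int) * NPROCS * NEDGES);
    memset(has_node, 0, sizeof(int) * NPROCS * NNODES);
    for (int p = 0; p < NPROCS; p++)
        for (int n = 0; n < NNODES; n++)
            if (part_vec[n] == p)
                has_node[p][n] = 1;            /* owned nodes */
    for (int step = 0; step < NPROCS; step++) {
        for (int p = 0; p < NPROCS; p++) {
            /* After `step` shifts around the ring, process p holds the
             * block first imported by process (p - step) mod NPROCS. */
            int src = (p - step + NPROCS) % NPROCS;
            for (int e = src * BLOCK; e < (src + 1) * BLOCK; e++) {
                /* Keep the edge if at least one endpoint is owned by p. */
                if (part_vec[edge1[e]] == p || part_vec[edge2[e]] == p) {
                    has_edge[p][e] = 1;
                    has_node[p][edge1[e]] = 1;  /* may be a ghost node */
                    has_node[p][edge2[e]] = 1;
                }
            }
        }
    }
}
```

Under these assumptions, process 0 ends up with edges 0 and 2 and nodes 0, 1, and 3, and process 1 ends up with edges 0, 1, and 3 and nodes 0, 1, 2, and 4.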

390:

391: If {\em SDM\_index\_registry} is

392: executed for the first time and no history file was created earlier,

393: the metadata of the partitioned edges,

394: such as the partitioned size of each process, is stored in the

395: database tables {\em index\_table} and {\em index\_history\_table}.

396: Also, the partitioned edges are asynchronously written to a history file

397: to be retrieved in subsequent runs requiring the same edge distribution.

398: The use of the {\em SDM\_index\_registry} is optional.

399: If the user does not call the {\em SDM\_index\_registry},

400: no history file is created

401: after partitioning the edges.

402:

403: In order to import and partition data {\tt x} and {\tt y} in

404: the {\em SDM\_import}, the {\em SDM\_data\_view} must be called to

405: define the data mapping between a noncontiguous global view of the file

406: and a local view of the processor memory. Using the data mapping,

407:  in the {\em SDM\_import}, the associated data is irregularly distributed

408: by calling a collective MPI-IO function.

409: In the {\em SDM\_release\_importlist},

410: the structures being used to import data

411: in the file handle are freed.

412:

413: Figure~\ref{fig:API1} shows the steps to write two data sets, {\tt p} and

414: {\tt q}, after completing the computations at each checkpoint.

415: Before writing {\tt p} and {\tt q}, the data mapping

416: to write is defined in the {\em SDM\_data\_view}

417: using the map array ({\tt vector}) associated with

418: the node partition.
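
The effect of the data mapping can be illustrated with a serial C stand-in, in which an in-memory array plays the role of the file. The function name {\tt scatter\_by\_map} is hypothetical; in SDM the placement is performed through a collective MPI-IO call rather than by direct indexing as done here.

```c
#include <assert.h>
#include <stddef.h>

/* Place each locally owned value at its global position.
 * map[i] is the global node number of local element i (the map array).
 * `global` stands in for the file contents. */
void scatter_by_map(const int *map, const double *local, size_t n_local,
                    double *global, size_t n_global)
{
    for (size_t i = 0; i < n_local; i++) {
        assert(map[i] >= 0 && (size_t)map[i] < n_global);
        global[map[i]] = local[i];
    }
}
```

Applying this with the map array of process 0 (nodes 0 and 3) and then with that of process 1 (nodes 1, 2, and 4) fills the five entries in the order of global node numbers.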

419:

420: SDM supports three different ways of organizing data in files.  In

421: level 1, each data set generated at each time step

422: is written to a separate file.

423: This file organization is simple, but it

424: incurs the cost of a file-open, a file-view (to define the visible portion of

425: a file for each process), and a file-close at each time step.

426: In level 2, each data set (within a group) is written to a separate

427: file, but different iterations of the same data set are appended to

428: the same file. This method results in a smaller number of files

429: and smaller file-open and file-view costs.

430: The offset in the file where data is appended is stored in the

431: {\em execution\_table}.

432: In level 3, all iterations of all data sets belonging to a group are

433: stored in a single file. As in

434: level~2, the file offset for each data set is stored in the

435: {\em execution\_table} by process~0 in the {\em SDM\_write} function.

436: The idea is that if a file system has high file-open and file-close costs, and

437: an application generates a high file-view cost, as in irregular

438: applications, SDM can generate a very small number of files.

439: However, if an application produces

440: a large number of data sets with a large problem size, level 3 file

441: organization would result in very large files, which may degrade

442: the performance.
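
The number of files produced by each level can be summarized with a small C helper. The function name {\tt files\_created} is hypothetical, and the formulas are our reading of the three descriptions above rather than code taken from SDM.

```c
#include <assert.h>

/* Number of files created under each SDM file-organization level.
 * n_sets:   data sets written at each time step
 * n_steps:  number of time steps written
 * n_groups: number of data groups the data sets belong to */
long files_created(int level, long n_sets, long n_steps, long n_groups)
{
    assert(level >= 1 && level <= 3);
    assert(n_sets >= 0 && n_steps >= 0 && n_groups >= 0);
    switch (level) {
    case 1:  return n_sets * n_steps;  /* one file per data set per step */
    case 2:  return n_sets;            /* iterations appended per data set */
    default: return n_groups;          /* one file per data group */
    }
}
```

For the FUN3D runs reported in Section~\ref{sec:perf} (five data sets, two time steps, and, as inferred from the two files reported for level 3, two data groups), this gives 10, 5, and 2 files for levels 1, 2, and 3, respectively.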

443:

444: Figure~\ref{fig:flow} depicts the metadata storage in the database and

445: the organization of data in files in SDM for the example

446: in Figure~\ref{fig:problem}.

447:

448: \begin{figure}[h]

449: \centerline{\psfig{figure=diagram.eps,height=3.5in,width=3.2in}}

450: \caption{SDM execution flow for the example in Figure~\ref{fig:problem}}

451: \label{fig:flow}

452: \end{figure}

453:

454: \Section{Performance Results \label{sec:perf}}

455:

456:

457: We obtained performance results on the SGI Origin2000 at Argonne

458: National Laboratory.  The Origin2000 has 128 processors and

459: 10 Fibre Channel controllers connected to a total of 110 disks of

460: 9 GBytes capacity each.  The file system on the Origin2000 is SGI's

461: XFS~\cite{XFS:Next,XFS:sweeney}. For the results, we used XFS buffered I/O

462: and MySQL~\cite{mysql-manual} to store the metadata.

463:

464: The first application template that we benchmarked was a tetrahedral

465: vertex-centered unstructured grid code developed by W. K. Anderson

466: of the NASA Langley Research Center~\cite{fun3d}. This application

467: uses a partitioning vector generated from MeTis to partition the nodes

468: and edges in a mesh.

469: To evaluate SDM ported to the application, we used

470: about 18M edges and 2M nodes.

471: At the initial stage, the application

472: imports edges, four data arrays associated with edges, and another four

473: data arrays associated with nodes. The total imported data size was

474: about 807 MBytes.

475: As a result of computations, the application wrote

476: four data sets of about 21 MBytes each

477: and a single data set of 105 MBytes.

478: Using 64 processors, we iterated the application template

479:  two time steps; at each time step,

480: five data sets were written to files.

481:

482: The second application template that we ran was a Rayleigh-Taylor instability

483: application~\cite{lori:mesh} that is motivated by a joint project between the

484: University of Chicago and Argonne to study thermonuclear

485: flashes on astrophysical objects.

486: Whenever the current time reaches a certain point,

487: the application writes two data sets:

488: a single node data set associated with vertices in a mesh,

489: and a triangle data set associated with triangles on tetrahedral faces.

490: In the application template, we wrote about 36 MBytes of the node data set

491: and about 74 MBytes of the triangle data set at each time step.

492: Since we iterated the template five times, the total data size written

493: was approximately 550 MBytes.

494:

495: \SubSection{Results for FUN3D}

496:

497: \begin{figure}[h]

498: \centerline{\psfig{figure=import_time.eps,height=3in,width=3.2in}}

499: \caption{Execution time for partitioning indices and data in FUN3D}

500: \label{fig:import}

501: \end{figure}

502:

503: Figure~\ref{fig:import} shows the execution time to import and partition

504: 18M edges, four data sets each of 144 MBytes of data

505: associated with edges, and

506: another four data sets each of 21 MBytes of data associated with nodes.

507: The original version of the application---without using SDM---performs all

508: the I/O operations

509:  by a single process (process 0), which then broadcasts

510: data to other processes. SDM performs I/O in parallel from

511: all processes using MPI-IO.

512: The bar labeled {\tt index distri.} in Figure~\ref{fig:import}

513: shows the communication and computation costs

514: to partition the edges after importing them to the application.

515: Also, the bar labeled {\tt import} shows the cost of

516: reading the edges and eight data arrays.

517:

518: The original application reads the edges in two steps: one

519: step to determine the amount of memory to

520: store the partitioned edges and the other step to actually

521: read the edges. SDM, however, extends the allocated memory

522: dynamically as needed (using C function {\tt realloc}) and

523: is therefore able to read the partitioned edges in a single step.

524: This contributes to the reduced cost of {\tt index distri.} when using

525: SDM.

526: When partitioning the edges with a history file,

527: the cost of {\tt index distri.}

528: consists only of reading the history file of the edges contiguously,

529: including the database cost to access the metadata.

530: Since the history file contains the already partitioned edges, there is

531: no need to import the edges; hence, the read cost

532:  in {\tt import} is reduced.

533:

534: \begin{figure}[h]

535: \centerline{\psfig{figure=app1.eps,height=3in,width=3.2in}}

536: \caption{I/O bandwidth for reading and writing data in FUN3D}

537: \label{fig:app1}

538: \end{figure}

539:

540: Figure~\ref{fig:app1} shows the I/O bandwidth for writing and then

541: reading back the data generated from the application using 64

542: processors.  The total data size was approximately 379 MBytes.  In

543: level 1, each data array is written to separate files, resulting in

544: the creation of 10 different files.  Each time the data array is

545: written to files, level 1 requires the cost for opening a file and

546: defining an MPI-IO {\em file view} to access the data from the portion

547: of the file pointed to by the global file offset.  In level 2, however,

548: each data array generated at each time step is appended to one of five files,

549: generating five file-open and file-view costs. This reduced number of

550: files improves the I/O performance slightly.  In level

551: 3, only two files are generated, resulting in the best I/O performance

552: among the three file organizations.  On the SGI Origin2000, the

553: difference among the three file organizations is not significant because

554: the file-open cost is small.

555:

556: \SubSection{Results for RT Application}

557:

558: Figure~\ref{fig:app2} shows the I/O bandwidth for writing approximately

559: 550 MBytes of data. In the original application, the write operation

560: is performed sequentially. In other words, after seeking

561: the starting position in a file, processes write their local portion

562: of data one by one.

563: When we ported the application to SDM, the I/O performance increased

564: significantly because of the I/O optimizations of MPI-IO.

565:

566: In SDM, we wrote the node data set according to the global node number of the

567: partitioned nodes, and wrote the triangle data set contiguously.

568: Since two data sets are written to files separately, SDM supports

569: two different ways of file organization:

570: level 1 and level 2/3 (levels 2 and 3 are identical in this case).

571: As can be seen in Figure~\ref{fig:app2}, on the SGI Origin2000, changing the

572: file organization does not affect the I/O performance, since the cost of

573: file-open and file-view is very low.

574:

575: When the number of processors increases to write the same

576: data size, we can see the degradation of the I/O performance.

577: With 32 processors, the data size written by each process at each time step is

578: about 1 MByte for the node data set and 2 MBytes for the triangle data set.

579: If the number of processors goes up to 64, the buffer size

580: of each process becomes smaller, resulting in reduced performance.

581: Clearly, there is an optimal buffer size that shows the best I/O

582: performance.
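
As a rough check of these figures, assuming each data set is divided evenly among the processes, the per-process sizes of the node and triangle data sets at each time step are approximately
\[
\frac{36}{32} \approx 1.1, \qquad \frac{74}{32} \approx 2.3 \qquad \mbox{(MBytes, 32 processors)}
\]
\[
\frac{36}{64} \approx 0.56, \qquad \frac{74}{64} \approx 1.2 \qquad \mbox{(MBytes, 64 processors)}
\]
so doubling the number of processors halves the amount of data that each process writes.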

583:

584: \begin{figure}[h]

585: \centerline{\psfig{figure=app2.eps,height=3in,width=3.2in}}

586: \caption{I/O bandwidth for RT}

587: \label{fig:app2}

588: \end{figure}

589:

590: \Section{Related Work \label{sec:related}}

591:

592: Several efforts have sought to optimize I/O in parallel file

593: systems and runtime

594: libraries~\cite{benn94a,bord93b,corb96b,hube95a,kotz97a,madh96a,nieu97a,seam95b,thak96f}.

595: SRB (Storage Resource Broker)~\cite{baru98a} provides a uniform

596: interface to access various storage systems, such as file systems,

597: Unitree, HPSS, and database objects. However, it does not fully support

598: the optimizations implemented in MPI-IO.  Shoshani et

599: al.~\cite{shos98a,shos99a} describe an architecture for optimizing

600: access to large volumes of scientific data stored on tapes.  The

601: Active Data Repository~\cite{kurc99a} and DataCutter~\cite{datacutter}

602: optimize storage, retrieval, and processing of very large

603: multidimensional datasets.  The main difference between our work and

604: other efforts in I/O is that SDM aims to combine the good

605: features of parallel file I/O and databases, whereas other efforts

606: focus on either parallel I/O or data management, not both.

607:

608:

609: \Section{Summary \label{sec:conc}}

610:

611: We have described the SDM system, API, and implementation for I/O in

612: irregular applications. SDM provides an easy-to-use user interface for

613: managing large data sets and internally uses MPI-IO for

614: high-performance I/O and a database for storing metadata. We studied

615: the performance of SDM using two irregular applications: FUN3D and

616: RT. When we ported both applications to use SDM, there was a

617: significant improvement in I/O performance compared with the original

618: application. Also, we observed that using a history file for the

619: index distribution helped to reduce the computation and

620: communication costs.  However, changing the SDM file organization from

621: level 1 to level 3 did not greatly affect the performance on the SGI

622: Origin2000, because of its low file-open and file-view costs.

623:

624: We plan to develop SDM further to support visualization applications

625: and to investigate whether SDM can effectively be used as a strategy

626: for implementing libraries such as HDF~\cite{hdf94a} and

627: netCDF~\cite{netc91a}.

628:

629: %-------------------------------------------------------------------------

630: \bibliographystyle{latex8}

631: \bibliography{bib1,bib2,bib3,bib4,bib5}

632: %\bibliography{/homes/jano/BIBTEX/pario-beta,/homes/jano/BIBTEX/main,/homes/jano/BIBTEX/MORE,/homes/jano/BIBTEX/SC,/homes/thakur/tex/bib/papers}

633: %\bibliography{latex8}

634:

635: \end{document}

636:

637:

638:

639:

640: