0303:cs0303016/paper.tex

1: %

2: % Fast Parallel I/O on ParaStation Clusters

3: %

4: % Authors:

5: %   Thomas D�ssel, Norbert Eicker, Florin Isaila, Thomas Lippert,

6: %   Hartmut Neff, Thomas Moschny, Klaus Schilling, Walter Tichy

7: %

8: % Keywords:

9: %   Parallel I/O, Cluster, ParaStation, PVFS, ClusterFile, ...

10: %

11: % $Revision: 1.48 $

12: % $Date: 2003/03/18 14:40:18 $

13: %

14: % Time-stamp: <2003-02-27 16:23:55 CET moschny>

15: %

16: \documentclass{elsart}

17:

18: \usepackage{amsmath,amssymb}

19: \usepackage[latin1]{inputenc}

20: %\usepackage[T1]{fontenc}

21: %\usepackage{palatino}

22: \usepackage{latexsym}

23: \usepackage{graphicx}

24:

25: \newcommand{\eq}[1]{eq.~(\ref{#1})}

26: \newcommand{\eqs}[3]{Eqs. (\ref{#1},\ref{#2},\ref{#3})}

27: \newcommand{\fig}[1]{Figure~\ref{#1}}

28: \newcommand{\tab}[1]{Table~\ref{#1}}

29:

30: \makeatletter

31: \makeatother

32:

33: \hyphenation{Para-Station}

34:

35: \begin{document}

36:

37:

38: \begin{frontmatter}

39:

40:

41: \title{Fast Parallel I/O on Cluster Computers}

42:

43: \author[W]{Thomas D�ssel},

44: \author[W,P]{Norbert Eicker},

45: \author[K]{Florin Isaila},

46: \author[W]{Thomas Lippert},

47: \author[W,K]{Thomas Moschny},

48: \author[B]{Hartmut Neff},

49: \author[W]{Klaus Schilling},

50: \and\author[K]{Walter Tichy}

51: \address[W]{Department of Physics, University of Wuppertal, Gau�stra�e~20,

52:             42097~Wuppertal, Germany}

53: \address[P]{ParTec AG,  Possartstr.~20, 81679~M�nchen, Germany}

54: \address[K]{Institute for Program Structures and

55:   Data Organization (IPD),

56:   University of Karlsruhe, Postfach~6980,

57:   76128~Karlsruhe, Germany}

58: \address[B]{Physics Department, Boston University,

59: 590~Commonwealth Avenue Boston,\\ MA~02215, USA}

60:

61:

62: \begin{abstract}

63: %  This is version \verb%$Id: paper.tex,v 1.48 2003/03/18 14:40:18 lippert Exp $%.

64:   Today's cluster computers suffer from slow I/O, which slows down

65:   I/O-intensive applications. We show that fast disk I/O can be

66:   achieved by operating a parallel file system over fast networks such

67:   as Myrinet or Gigabit Ethernet.

68:

69:   In this paper, we demonstrate how the ParaStation3 communication

70:   system helps speed-up the performance of parallel I/O on clusters

71:   using the open source parallel virtual file system (PVFS) as testbed

72:   and production system.  We will describe the set-up of PVFS on the

73:   Alpha-Linux-Cluster-Engine (ALiCE) located at Wuppertal University,

74:   Germany.  Benchmarks on ALiCE achieve write-performances of up to

75:   1~GB/s from a 32-processor compute-partition to a 32-processor PVFS

76:   I/O-partition, outperforming known benchmark results for PVFS on the

77:   same network by more than a factor of 2.  Read-performance from

78:   buffer-cache reaches up to 2.2~GB/s.  Our benchmarks are giant,

79:   I/O-intensive eigenmode problems from lattice quantum

80:   chromodynamics, demonstrating stability and performance of PVFS over

81:   Parastation in large-scale production runs.\\[6pt]

82: \noindent {\em Keywords: Cluster Computing, Parallel File System, ParaStation,

83:   Lattice QCD}

84:

85: \end{abstract}

86:

87: \end{frontmatter}

88:

89: \section{Introduction}

90:

91: Within the past years commodity-off-the-shelf (COTS) clusters have

92: evolved towards cost-effective general-purpose HPC devices.  Such

93: systems, self-made and commonly denoted as Beowulf

94: computers~\cite{ridge97beowulf}, show up increasingly on the TOP500

95: list~\cite{top500}.  While gigabit network technology

96: (Gigabit-Ethernet, Myrinet) along with error-correcting

97: zero-copy-communication software~\cite{GM,SCORE,PARTEC} have boosted

98: communication-intensive number-crunching tasks, I/O-intensive

99: computations have not benefited from cluster computers to the same

100: extent.  The reason was the relatively slow I/O capability of

101: clusters.

102:

103: Today, a promising approach to achieve fast I/O on cluster computers

104: is to utilize distributed disks and their aggregate bandwidth by means

105: of a parallel file system (PFS).  A PFS is designed to make the entire

106: disk capacity of the I/O-nodes available to all the compute-nodes and

107: to allow the parallel file access of the compute nodes to be

108: translated into real parallel disk access.  Physically, files are

109: stored on a given partition of cluster nodes by distributing the data

110: of the given file, for instance in a round robin fashion.  In contrast

111: to standard network file systems, a PFS provides concurrent parallel

112: access to store or read the file from all nodes of a parallel

113: application.  In a typical implementation, a set of compute-nodes

114: reads and writes data to another set of I/O-nodes that host the

115: physical resources of the PFS\@.  The two sets of nodes may be

116: identical, may partly overlap or may even be distinct.  In principle

117: this concept allows for scalability of the I/O-rate with the number of

118: I/O-nodes, provided that \emph{(i)} the number of compute-nodes is

119: large enough to saturate the capacity of the I/O-nodes---this is

120: usually the case as soon as the number of compute-nodes equals the

121: number of I/O-nodes---and \emph{(ii)} the network delivers full

122: bi-sectional bandwidth---this is, for instance, the case for crossbar

123: or multi-stage crossbar topologies.

124:

125: Distributed file systems like NFS or AFS are not suited for concurrent

126: high-bandwidth file-access as required in I/O-intensive cluster

127: computing applications.  In these systems, parallel data accesses of

128: compute nodes are serialized by file servers. Therefore, they can not

129: be called "parallel" file systems.  There exist commercially available

130: parallel file systems (sometimes platform dependent), for example the

131: General Parallel File System GPFS (IBM~\cite{GPFS}).  Currently, the

132: only known open source parallel file system, working in a stable

133: manner, and freely available for Linux under the GNU General Public

134: License, is the Parallel Virtual File System (PVFS) developed at

135: Clemson University and Argonne National Laboratory~\cite{PVFS-Site}.

136: PVFS is devised as a truly parallel file system for use on cluster

137: computers. The communication back-end of the standard distribution of

138: PVFS is based on the TCP/IP protocol.  Therefore, PVFS can readily be

139: operated on top of any network supporting TCP/IP\@.

140:

141:

142: In this paper we consider PVFS boosted by the Myrinet communication

143: network~\cite{MYRINET} of the Alpha Linux Cluster Engine (ALiCE)

144: located at Wuppertal University, Germany~\cite{PIK:2002}.  There are

145: several error-correcting and package-loss-safe communication

146: sub-systems available, designed to drive Myrinet: e.g.\ the vendor-provided GM

147: software~\cite{MYRINET}, SCore~\cite{SCORE}, and the ParaStation

148: system~\cite{PARTEC}, developed at Karlsruhe University.  On ALiCE, we

149: are using ParaStation.

150:

151: ParaStation implements the concept of virtual nodes, operating in

152: close interaction with queuing systems like PBS~\cite{PBS}.  The

153: communication system provides safe multi-user operation and

154: outstanding stability, not to mention the comfortable

155: single-point-of-administration management by means of the

156: ParaStation-daemons. System crashes are tidily cleaned-up without any

157: user interference. Most important in our context is however the

158: communication bandwidth under ParaStation.  A special kernel module

159: routes TCP/IP via the ParaStation communication system and renders

160: Myrinet an additional IP-network with full bi-sectional bandwidth. In

161: this manner, the superior bandwidth from ParaStation as Myrinet driver

162: can be exploited.

163:

164: Combining parallel file systems like PVFS with ParaStation perfectly

165: meets the demands of an application from the post-simulation phase of

166: a large scale Monte Carlo project in lattice-quantum chromodynamics

167: (LQCD).  We are evaluating giant eigenproblems~\cite{Neff:2001zr},

168: which are very data-intensive, on cluster computers. The eigenvectors

169: are required for the construction of correlations between two quark

170: loops. Such creation and annihilation of quarks originate from the

171: quantum fluctuations of the QCD vacuum, according to Heisenberg's

172: uncertainty principle. They are considered affectual to the unusually

173: large mass of the $\eta'$-meson~\cite{Schilling:2002gm}.  In our

174: simulations, we have to compute $\mathcal O(1000)$ low eigenvectors of

175: the fermionic matrix, which describes the dynamics of the quarks.  The

176: size of each vector is $\mathcal O(10^6)$.

177:

178: Typically, about 10~GB of I/O is carried out in data-intensive

179: production steps of about 10~minutes compute time on 16 to

180: 64~processors; actually several thousands of such runs are performed.

181: Without parallel I/O, reading or writing lasts between 10 and

182: 30~minutes.  PVFS helps cut down the read and write times to about

183: 20~seconds.

184:

185: The paper is organized as follows: in Section~\ref{SETUP} we present

186: the new TCP/IP kernel module included in ParaStation and describe the

187: connectivity of ALiCE and its specific PVFS implementation.

188: Section~\ref{SPEEDS} gives benchmarks for the components of the

189: I/O-machinery, including disk-speed and TCP/IP node-to-node rates.  The

190: results of the PVFS benchmarks are shown in Section~\ref{PVFSBENCH}

191: along with a comparison with Ref.~\cite{PVFSpaper}.  We are in the

192: position to test PVFS/ParaStation within a large-scale application

193: from lattice quantum chromodynamics, described in Section~\ref{EIGEN}.

194: Finally we summarize and give a short outlook on the novel parallel

195: file system ClusterFile that is currently under development at the

196: IPD, University of Karlsruhe~\cite{IT01}.

197:

198: \newpage

199:

200: \section{Technical Background\label{SETUP}}

201:

202: For cluster applications based on IP-communication to benefit from

203: ParaStation's performance, a TCP/IP kernel module has been introduced

204: on top of the ParaStation communication system.  In this manner,

205: stable communication with gigabyte bandwidth is provided for the

206: parallel virtual file system (PVFS) as implemented on the 128-node

207: Alpha-Linux-Cluster-Engine ALiCE\@.

208:

209: \subsection{TCP/IP kernel module with ParaStation\label{MODULE}}

210:

211: Many cluster applications neither use MPI nor the low-level

212: communication API as for instance provided by ParaStation.  To benefit

213: from ParaStation's performance for such applications, an additional

214: TCP/IP module was developed at the Institute for Programs and Data

215: Structures, University of Karlsruhe.

216:

217: This module provides a network driver interface to the Linux kernel,

218: just as any other Ethernet card driver does. This way, virtually any

219: Ethernet protocol that is supported by the kernel can be transported

220: over Myrinet. In practice, most applications will use the TCP (or UDP)

221: over IP protocol\footnote{The intended use is the reason for the

222:   somewhat misleading name of the module. It should better be called

223:   Ethernet driver for ParaStation.}.

224:

225: Internally, the module uses the kernel variant of the ParaStation

226: low-level communication interface to send raw Ethernet datagrams to

227: any other host in the cluster. This interface supports a set of basic

228: communication operations, since at this level  we don't need more

229: functionality. Packet (dis)assembling for instance is done by the

230: kernel. Upon startup, a so-called \textit{kernel-context} is obtained

231: from the ParaStation module. This reserves a certain number of

232: communication buffers in the memory of the Myrinet network adapter

233: card and instructs the driver to listen for messages addressed to the

234: TCP/IP module.  Secondly, a call-back mechanism is established:

235: whenever such a message arrives, an interrupt is risen and the

236: ParaStation interrupt handler calls a method of the TCP/IP module

237: which in turn hands the message to the Linux network stack. If a

238: message is to be sent, the network stack functions call a method of

239: the module that was registered upon initialization, handing the

240: message over to the ParaStation module.

241:

242: In order to address other nodes in the cluster, the TCP/IP driver

243: module (in fact an Ethernet driver) maps the ParaStation node

244: identification number (ParaStation ID) onto Ethernet hardware

245: addresses. The administrator can set up static ARP tables that map

246: IP addresses to these hardware addresses. This is necessary, as the

247: driver does not support broadcast messages that would be required to

248: enable automatic ARP functionality. However, if IP addresses are

249: chosen such that the ParaStation ID (the node number counted from 0)

250: equals the IP address minus one, the static ARP tables can be omitted.

251: In this case, any driver module in the cluster can guess ParaStation

252: IDs from IP addresses and thus generate fake ARP reply messages.

253:

254:

255: \subsection{ALiCE setup}

256:

257: The Alpha-Linux-Cluster-Engine ALiCE is an assembly of 128 Compaq DS10

258: workstations~\cite{PIK:2002}. ALiCE, located at the University of

259: Wuppertal (Germany), is fully operational since the end of 2000. The

260: machine is equipped with Alpha 21264 EV67 processors, with 2~MB cache,

261: operating at a frequency of 616~MHz. With 256~MB ECC memory for each

262: processor, the total amount of memory is 32~GB\@. The disk space of

263: 10~GB per node adds up to about 1.3~TB in total.  The nodes are

264: interconnected by a 2 $\times$ 1.28~Gbit/s Myrinet network configured

265: as a multi-stage crossbar with full bi-sectional bandwidth.  The

266: M2M-PCI64A-2 Myrinet cards utilize a 64~bit/33~MHz PCI bus.

267:

268: \fig{MYRINET} shows the hardware plan of the inter-node connection of

269: ALiCE\@. A full crossbar is realized by three switch-stages (where

270: each octal switch actually consists of two stages) employing the

271: $2\times 4$ Myrinet M2M-oct-SW8 switches and 8~Myrinet M2M-dual-SW8

272: switches. The hardware latency is about 100~ns per switch stage, far

273: below the total latency (software and hardware) of 17.1~$\mu$s.

274:

275: \begin{figure}[!htb]

276: \begin{center}

277:   \includegraphics[width=.82\textwidth]{c14.eps}

278: \end{center}

279: \caption{Myrinet multi-stage crossbar network.\label{MYRINET}}

280: \end{figure}

281:

282: We have extended the network in order to incorporate an external file

283: and archive server as well as to provide gigabit links to external

284: machines for the purpose of fast on-line visualization.  To this end,

285: we have exchanged two of the 8 M2M-dual-SW8 switches by two M2M-SW16

286: switches, as sketched in \fig{MYRINET2}.  This way, we could avoid an

287: inhomogeneous number of hardware stages of the multi-stage crossbar

288: network.

289:

290: \begin{figure}[!htb]

291: \begin{center}

292: \includegraphics[width=\textwidth]{netz.eps}

293: \end{center}

294: \caption{Inclusion of external devices (not all connections

295:   drawn).\label{MYRINET2}}

296: \end{figure}

297:

298: \subsection{PVFS partitioning}

299:

300: On ALiCE, we run 4 different PVFS partitions with 32-nodes each. This

301: partitioning fits well with the compute-partitions, chosen such as to

302: optimize the compute performances of our applications. Each PVFS

303: partition (\texttt{/pvfs1} to \texttt{/pvfs4}) is represented by a

304: mount point on each node and on the file server.  Mounting PVFS on the

305: file server enables us to copy UNIX files with ParaStation TCP/IP

306: speed from the external RAID to PVFS\@.  For each partition, the last

307: node plays the r�le of the management node.  The entire set-up is

308: displayed in \fig{SETUPPVFS}.

309:

310: \begin{figure}[!htb]

311: \begin{center}

312: \includegraphics[width=.8\textwidth]{pvfs.eps}

313: \end{center}

314: \caption{Organization of 4 PVFS partitions on ALiCE.\label{SETUPPVFS}}

315: \end{figure}

316:

317: \section{Performance of Basic I/O-Devices\label{SPEEDS}}

318:

319: The proper interpretation of the benchmark results presented in

320: Section~\ref{PVFSBENCH} requires some background knowledge about the

321: features of ALiCE's basic I/O-components, i.e., file system

322: performance, TCP/IP and MPI data rates.

323:

324: \subsection{Local file system performances\label{DIRTY}}

325:

326: The ALiCE DS10 nodes are equipped with 10~GB Maxtor IDE

327: disks\footnote{5400~rpm versions.}.  We have carried out tests on

328: local file systems, formatted with ReiserFS, by means of

329: \texttt{bonnie++}~\cite{BONNIE}.  Write and read-performances are

330: determined for a series of file sizes, from 10~MB up to 2~GB\@.  In

331: order to reveal buffer-cache effects, we have adjusted the ``dirty

332: buffer'' parameter\footnote{The standard behavior of the Linux kernel

333:   is to start flushing buffer pages as soon as the given percentage of

334:   memory available for buffer cache is ``dirty''. A buffer page is

335:   called dirty if it changed in memory but has not yet been written to

336:   disk.} to two different values, 40~\% and 70~\%. Our results are

337: given in \fig{BONNIE}.

338:

339: \begin{figure}[!htb]

340: \begin{center}

341: \includegraphics[width=.8\textwidth]{bonnie.eps}

342: \end{center}

343: \caption{Local file system performances on ALiCE.\label{BONNIE}}

344: \end{figure}

345:

346: The \texttt{bonnie++} benchmark works as follows: the given file is

347: written and read back immediately.  We observe the write performance

348: to buffer cache of 60~MB/s to fall to about 23~MB/s at a file size of

349: 100 and~200 MB, for 40 and 70~\% dirty buffers, respectively.

350: Asymptotically, the performance is decreasing to about 19~MB/s.  Being

351: able to read from the buffer cache, the subsequent read operation

352: shows a bandwidth of more than 280~MB/s for small files.  At a file

353: size of about 200~MB, the speed drops down to 24~MB/s.

354:

355: \subsection{TCP/IP performance via ParaStation}

356:

357: On ALiCE, we can choose TCP packages to be routed via Fast Ethernet or

358: alternatively---using the new TCP/IP kernel module described in

359: Section~\ref{MODULE}---via ParaStation/Myrinet.  The \texttt{ttcp}

360: benchmark issues TCP/IP packets over a point-to-point connection to

361: determine the uni-directional TCP/IP speed, cf.\ \fig{TCPFIG}.  The

362: performance is seen to saturate at 93~MB/s.

363:

364: \begin{figure}[htb]

365: \begin{center}

366:   \includegraphics[width=.8\textwidth]{ttcp.eps}

367: \end{center}

368: \caption{Performance test by \texttt{ttcp} via ParaStation.\label{TCPFIG}}

369: \end{figure}

370:

371: It is instructive to compare these results with the outcome of the

372: Pallas \texttt{send-receive} MPI benchmark~\cite{PALLASMPI}, see

373: \fig{PALLAS}.

374:

375: \begin{figure}[!htb]

376: \begin{center}

377:   \includegraphics[width=.7\textwidth]{pmb2_sendrecv.eps}

378: \end{center}

379:  \caption{Performances of the

380:    \texttt{send-receive} Pallas MPI-benchmark on ALiCE.\label{PALLAS}}

381: \end{figure}

382:

383: \begin{figure}[htb]

384: \begin{center}

385:   \includegraphics[width=.7\textwidth]{mpi_unidir.eps}

386: \end{center}

387:  \caption{MPI-performance for uni-directional communication.\label{MPI_UNIDIR}}

388: \end{figure}

389:

390: For the \texttt{send-receive} case (i.e.\ the bi-directional

391: situation), the performance levels off at a total bandwidth of about

392: 175~MB/s (adding up the data rates of both directions).  As the PALLAS

393: benchmark does not provide uni-directional measurements, we have

394: prepared corresponding \texttt{send} and \texttt{receive} programs and

395: found a saturation of the uni-directional MPI-performance at about

396: 140~MB/s, cf.\ \fig{MPI_UNIDIR}.  The data rate via TCP/IP is only

397: about 34~\% smaller than via MPI, in spite of the overheads of the

398: full-fledged TCP/IP implementation.  Still this leaves room for

399: improvement, since on a cluster there exists a priori knowledge of the

400: paths to all IP destinations, therefore one could try to set up a slim

401: TCP/IP protocol.  Moreover, a double error checking is carried out,

402: one on the ParaStation level, and a second one on the TCP/IP level.

403:

404: \section{File System Benchmarks\label{PVFSBENCH}}

405:

406: Our benchmarks follow Ref.~\cite{PVFSpaper}, where performance results

407: on 60~nodes of the Chiba-City cluster at Argonne National Laboratory

408: have been reported.  This cluster is equipped with the same Myrinet

409: version as ALiCE\@. This will enable us to compare our results using

410: TCP/IP over ParaStation on an Alpha system with TCP/IP over the

411: Myrinet/GM software on a Pentium cluster.  Since loading and

412: discharging large amounts of data to the cluster constitute a crucial

413: bottleneck for data-intensive production runs on clusters, we include

414: performances with respect to reading from and writing to UNIX files

415: located on an external RAID\@.

416:

417: \subsection{Concurrent read/write performance}

418:

419: Our test code works as follows: a new PVFS file, common to $P$

420: compute processes, is opened on $N$ I/O-nodes. Concurrently the same

421: amount of data $S$ is written from each of the $P$ processes (virtual

422: partitioning) to disjoint parts of the file. PVFS stripes

423: the data onto the $N$ I/O-nodes (physical partitioning)

424: with a stripe size of 64~kB.

425:

426: After the data is written, we close the file and reopen it again in

427: order to reshuffle the same data back to the compute-nodes.  The

428: bandwidth for write and read operations is computed from the maximum

429: of the wall clock execution times achieved on all the $P$

430: compute-nodes.

431:

432: We vary $P$ in the range $P=1\dots 64$, and repeat each measurement

433: for $N=4$, $N=16$ and $N=32$ I/O-nodes. The amount of data $S$ written

434: and read \textit{per compute-node} is chosen proportional to the

435: number $N$ of I/O-nodes, $S/N = \textrm{const}$ (here we follow

436: closely the benchmark of Ref.~\cite{PVFSpaper}). The reasoning is that

437: although we vary the number of I/O-nodes, the buffer-cache will be

438: saturated for one and the same number of compute-nodes.  Indeed this

439: behavior is borne out in \fig{WRITE}, which shows the cumulative

440: throughput for the write operation.

441:

442: \begin{figure}[htb]

443: \begin{center}

444:   \includegraphics[width=.8\textwidth]{write.eps}

445: \end{center}

446: \caption{Concurrent write performance.\label{WRITE}}

447: \end{figure}

448:

449: We followed Ref.~\cite{PVFSpaper} and have carried out 5~measurements

450: in each case.  The smallest and the largest results were discarded and

451: the remaining ones have been averaged. Actually, the five values did

452: not differ by more than 2~\% in any of the measurements.

453:

454: As we see from \fig{WRITE}, the performance quickly reaches a plateau

455: for each I/O-partition. There is no visible impact from the number of

456: compute-nodes as long as the buffer-cache of the I/O-nodes is not

457: saturated. This occurs when the number of compute-nodes is greater

458: than 50. The amount of data written and read on every I/O-node is

459: $P\cdot S/N$, so it is greater than 100~MB for $P>50$ and

460: $S/N=2\,\textrm{MB}$\footnote{In fact the buffer-cache itself is not

461:   saturated, but at this point the amount of ``dirty buffers'' has

462:   reached 40~\% of the whole buffer-cache, see Section~\ref{DIRTY}.

463:   With no other program running at the same time, almost all of the

464:   main memory is used for the buffer-cache, and 40~\% of 256~MB yield

465:   100~MB, as observed.}.

466:

467: The write performance reaches between 29 and 35~MB/s for each

468: I/O-node, not exhausting the \texttt{bonnie++} figures of

469: Section~\ref{DIRTY} or the TCP/IP speed, see \fig{TCPFIG}.  However,

470: we achieve about 30~\% faster write performances than reported in

471: Ref.~\cite{PVFSpaper}.  Visually comparing the Figure~6 in

472: Ref.~\cite{PVFSpaper} with \fig{WRITE} we recognize the substantially

473: improved stability of our curves, a well known feature of the

474: communication sub-system ParaStation.

475:

476: After buffer-cache saturation, the performance drops down to a value

477: which is about 18~\% smaller than expected from the hard disk

478: performance benchmarks displayed in \fig{BONNIE}.

479:

480: As the read operation is carried out directly after the write, with

481: only a synchronizing barrier in between, the read process can draw the

482: data directly from buffer-cache. As explained above, dirty buffers are

483: written to disk if their size exhausts the limit of 100~MB\@. However,

484: they remain in memory and can be read back at a rate limited only by

485: the memory bandwidth. Thus, \fig{READ} shows no loss of performance

486: throughout the test range.

487:

488: \begin{figure}[htb]

489: \begin{center}

490:   \includegraphics[width=.8\textwidth]{read.eps}

491: \end{center}

492: \caption{Concurrent read performance.\label{READ}}

493: \end{figure}

494:

495: It is gratifying to find that each I/O-node can send with a speed

496: varying between 56~MB/s and 75~MB/s, since several sockets are served

497: simultaneously. This performance is just 20~\% slower than the

498: measured point-to-point performance via TCP/IP, but still about 45~\%

499: slower than the actual capabilities of ParaStation as seen in MPI

500: applications, see \fig{PALLAS}.  The maximal performance reaches more

501: than 1800~MB/s for 32 I/O-nodes.

502:

503: We should remark, that the read test seems to be rather artificial.

504: Actually, it would be more meaningful given a hard disk with a read

505: performance faster than 75~MB/s.  In general, a real application reads

506: data from disk and not from buffer-cache. In that case, one expects a

507: saturation below hard disk read performance, as demonstrated in

508: Section~\ref{EIGEN}.

509:

510: In order to test the raw throughput of the disk, i.e.\ without

511: utilizing the buffer cache, a modified benchmark was used.

512: Now, a huge amount of data (multiple times the size of the buffer

513: cache) is created and written to several files.  After writing and

514: closing the last file, it is very unlikely that any data from the

515: first files is still present in the buffer cache.  A subsequent read

516: operation will therefore be forced to read directly from hard-disk.

517: \fig{READ-DISK} shows the result of this benchmark.

518:

519: \begin{figure}[htb]

520: \begin{center}

521:   \includegraphics[width=.8\textwidth]{read-disk.eps}

522: \end{center}

523: \caption{Concurrent read performance from disk.\label{READ-DISK}}

524: \end{figure}

525:

526: Obviously the throughput for read operations from PVFS drops

527: dramatically within such realistic setup. In the case of 4~I/O-nodes

528: the read performance drops to about 13~MB/s per I/O-node.

529: Nevertheless a total read performance of 300~MB/s can be achieved if

530: all 32~I/O-nodes are utilized.

531:

532: The results presented so far have used a stripe size of 64~kB, the

533: default value of PVFS\@. Taking a stripe size as large as the amount

534: of data a given compute-node has to write, we achieve about 920~MB/s

535: writing from 32 compute-nodes to 32 I/O-nodes. The corresponding

536: read-operation achieves up to 2200~MB/s using the cache.

537: Without a full cache, performance is lower.

538:

539: A second important feature---as far as data-intensive computations on

540: clusters are concerned---is the speed for charging and discharging the

541: system. A high throughput is  is crucial for the success of

542: computer experiments that work on data sets much larger than the PVFS

543: disk space available. In this case one has to retrieve (store) data

544: from (to) an external TB-size repository.

545:

546: Distributing UNIX files onto PVFS from the archive in principle could

547: proceed with the TCP/IP performance of about 92~MB/s.  Actually, the

548: limiting factor is the disk performance of the file server. Our RAID

549: delivers or stores at about 25~MB/s.  In practice this limitation

550: poses no real problem, since data staging can be carried out

551: asynchronously to the parallel applications on the cluster.

552:

553: \section{Fast I/O for Giant Eigenproblems in Lattice Field

554: Theory\label{EIGEN}}

555:

556: In the following, we demonstrate how PVFS/ParaStation enables us to

557: compute huge eigensystems on cluster computers.  Such computational

558: problems arise in the post simulation phase of Monte Carlo simulations

559: of lattice quantum chromodynamics (LQCD).  We aim at computing

560: $\mathcal O(1000)$ low eigenvectors of the so-called fermionic matrix, which

561: describes the dynamics of quarks in the gluon background field.  The

562: size of each vector is $\mathcal O(10^6)$.  Typically about 10~GB I/O is

563: carried out in production runs of length $\approx$30~minutes on

564: 32~processors of ALiCE\@. In practice, we have to perform thousands of

565: such runs.

566:

567: \subsection{Physical problem}

568:

569: Non-perturbative lattice quantum chromodynamics (LQCD) deals with the

570: determination of hadronic properties and

571: interactions~\cite{Montvay:1994cy}.  Particularly important

572: observables are given by the mass spectrum of bound quark states, as

573: for instance the masses of hadrons like the $\pi$-Meson and the rho

574: $p$-Meson.  Among these particles, hadronic states that can be

575: classified as \textit{singlet representation} of the flavor-SU(3)

576: group play a special r\^ole.  They are characterized by contributions

577: of so-called non-valence objects. More precisely, their correlation

578: functions, $C_{\eta'}(t_1-t_2)$, the quantities which allow to extract

579: the physical properties of the hadrons by exploring their decay in

580: time, $\Delta t=t_1-t_2$, contain contributions from correlators

581: between closed virtual quark-gluon loops. These ``non-valence''

582: objects are nothing but a manifestation of quantum mechanical vacuum

583: fluctuations, which follow from Heisenberg's uncertainty principle as

584: applied to relativistic field theory.  From a physical point of view,

585: flavor singlet objects are particularly intriguing, as they are

586: sensitive to (and thus allow to explore) the topological structure of

587: the QCD vacuum.

588:

589: The reliable determination of disconnected diagrams has been a

590: long-standing issue ever since the early days of LQCD\@. It can be

591: traced back to the numerical problem of getting information about

592: functionals of the complete inverse fermionic matrix

593: $M^{-1}$.\footnote{In contrast to flavor singlet observables,

594:   non-singlet masses are far simpler to compute: they imply the

595:   solution of a few systems of linear equations of type $Mx=b$, the

596:   discrete analogue of Dirac's equation with source term.}

597:

598: First attempts in this direction started only a few years ago, using

599: the so-called stochastic estimator method (SE)~\cite{Eicker:1996gk} for the

600: computation of the trace of $M^{-1}$. This approach requires to solve

601: a linear system $Mx = \xi $ on hundreds of source vectors $\xi$, with

602: $\xi$ being noise vectors that are Z$_2$ of Gaussian distributed.

603:

604: In Ref.~\cite{Neff:2001zr}, we have shown how to estimate the mass of

605: the $\eta'$ meson just using a set of low-lying eigenmodes of $M$.

606: Strictly speaking, our approach deals with the matrix $Q$, the

607: hermitian form of $M$, the eigenvectors of which form an orthogonal

608: base~\cite{Hip:2001hc}:

609: \begin{equation}

610: Q =\gamma_5 M.

611: \end{equation}

612: The Wilson-Dirac matrix is given by

613: \begin{eqnarray}

614: M_{\mathsf{x,y}} = \mathbf{1}_{cs}\delta_{\mathsf{x,y}}

615: - \kappa \sum_{\mu=1}^4 &&

616: {(\mathbf{1}_s-\gamma_{\mu})}\otimes U_{\mu}(\sf x)\,\delta_{\sf x,\sf

617:   y-\mu}\nonumber\\

618: +&&

619: {(\mathbf{1}_s+\gamma_{\mu})}\otimes U_{\mu}^{\dagger}(\sf x-\mu)\,\delta_{\sf x,\sf y+\mu}.

620: \end{eqnarray}

621: The symbols $\gamma_{\mu}$ stand for the $4\times 4$ Dirac spin

622: matrices. The $3\times 3$ matrices $U_{\mu} \in$ color-SU(3) represent

623: the gluonic vector field, thus $\mathbf{1}_{cs}$ is a $12\times 12$

624: unit-matrix in color and spin space.  $M$ is a sparse matrix in

625: 4-dimensional Euclidean space-time with matrix valued stochastically

626: distributed coefficient functions of type $(\mathbf{1}_s\pm

627: \gamma_{\mu})\otimes U_{\mu}(\mathsf{x})$ at site $\mathsf{x}$.

628:

629: Once the low-lying modes are computed, it is possible to approximate

630: the full inverse matrix $Q^{-1}$ and those matrix functionals or

631: functions of $Q$ which are sensitive to small eigenvalues, i.e.\

632: long-range physics.

633:

634: \subsection{Numerical procedure}

635: In LQCD, hadronic masses are extracted from the large-time behavior of

636: correlation functions. The correlator of the flavor

637: \textit{non-singlet} $\pi$-meson is defined as

638: \begin{equation}

639: C_{\pi} (t=t_1-t_2) = \left<

640: \sum_{\vec x,\vec y} \mbox{Tr}  \Big[ Q^{-1} (\vec x,t_1;\vec y,t_2) Q^{-1}

641: (\vec y,t_2;\vec x,t_1) \Big]  \right>_U, \label{picorr}

642: \end{equation}

643: while the flavor \textit{singlet} $\eta'$ meson correlator is composed

644: of two terms,

645: the first

646: corresponding to the propagation of a quark-antiquark-pair from $\vec

647: x$ to $\vec y$ without annihilation in between and the second one

648: being characterized by intermediate pair annihilation:

649: \begin{eqnarray}\label{ETAPRIME}

650: C_{\eta'} (t_1-t_2) &=& C_{\pi} (t_1-t_2)\nonumber\\ &-&2 \left<

651: \sum_{\vec x,\vec y} \mbox{Tr}  \Big[ Q^{-1} (\vec x,t_1;\vec x,t_1) \Big]

652: \mbox{Tr} \Big[ Q^{-1} (\vec y,t_2;\vec y,t_2) \Big]  \right>_U.

653: \end{eqnarray}

654: The brackets $\left< \ldots \right>_U$ indicate the average over a

655: canonical ensemble of gauge field configurations. For large time

656: separations $t$, the respective correlation functions are dominated by

657: the ground state, because the higher excitations die out and therefore

658: become proportional to $\exp(-m_{0}t)$. $m_0$ is the mass of the

659: particle associated to the correlation function.

660:

661: As already mentioned, the $\pi$-correlator (\ref{picorr}), equivalent

662: to the first term of the $\eta'$-correlator, is obtained by solving the

663: linear system

664: \begin{equation}

665: M(\vec x,t_1;\vec y,t_2)\, c(\vec y,t_2) = \delta(\vec 1,1;\vec x,t_1); \label{lineq}

666: \end{equation}

667: on 12 source vectors (here located at site $(\vec 1,1)$).  Of course,

668: the statistics could be improved by averaging over many sources;

669: however this becomes prohibitively expensive as the effort increases

670: with the number of sources.

671:

672: The second term in \eq{ETAPRIME},

673: \begin{equation}

674: \sum_{\vec x,\vec y} \mbox{Tr}

675: \Big[ Q^{-1} (\vec x,t_1;\vec x,t_1) \Big] \mbox{Tr} \Big[ Q^{-1}

676: (\vec y,t_2;\vec y,t_2) \Big],

677: \end{equation}

678: depends on the diagonal elements of $Q^{-1}$. The inverse can be

679: expanded in terms of the eigenmodes weighted by the inverse

680: eigenvalues:

681: \begin{equation}

682: Q^{-1} (\vec x,t_1;\vec y,t_2)  = \sum_i\frac{1}{\lambda_i}

683: \frac{|\psi_i(\vec x,t_1)\rangle\langle\psi_i(\vec y,t_2)|}{

684: \langle\psi_i|\psi_i\rangle} ,

685: \end{equation}

686: where $\lambda_i$ and $\psi_i$ are the eigenvalues and the

687: eigenvectors of $Q$ respectively.  We found that we can approximate

688: the sum on the right hand side by restriction to $\mathcal O(300)$ lowest-lying

689: eigenvalues and their corresponding eigenvectors. Due to the factor

690: $1/\lambda_i$, one expects that the low-lying eigenmodes will tend to

691: dominate the sum. Our procedure is called truncated eigenmode

692: approximation (TEA).

693:

694: We compute the eigensystem by means of the implicitly restarted

695: Arnoldi method, a generalization of the standard Lanczos procedure.  A

696: crucial ingredient of our approach is a Chebyshev acceleration

697: technique.  The spectrum is transformed such that the Arnoldi

698: eigenvalue determination becomes uniformly accurate for the entire

699: part of the spectrum we aim for.  A comfortable parallel

700: implementation of IRAM is provided by the PARPACK package

701: \cite{PARPACK}.

702:

703: We work on a space-time lattice of size $16^3 \times 32$.  Taking into

704: account the Dirac and color indices, the Dirac matrix acts on a $12

705: \times 16^3 \times 32 = 1.572.864$ dimensional vector space. This

706: explains why the inversion of the entire Dirac matrix is not feasible

707: since this would need about 40~TB memory space, whereas the

708: determination of 300~low-lying eigenvectors leads to about 7.5~GB

709: memory space only. Our computations are based on canonical ensembles

710: of 200~field configurations with $n_f = 2$ flavors of dynamical sea

711: quarks. Such kind of ensembles have been generated at 5~different

712: dynamical quark masses in the framework of the SESAM full QCD project

713: \cite{Lippert:1999ha}.  It takes us about 3.5~Tflops-hours to solve

714: for 300~low-lying modes per ensemble for each quark mass. Altogether

715: we aim at $\mathcal O(30)$ valence mass/coupling combinations. First

716: physical results of our computations can be found in

717: Refs.~\cite{Neff:2001zr,Attig:2001ty,Neff:2002mq,Schilling:2001xd}.

718:

719: \subsection{Benchmarks}

720:

721: The typical compute partitions used for the TEA on ALiCE range from 16

722: to 64~nodes, depending on the number of eigenmodes required for the

723: approximation as well as on the memory available.  Each node reads its

724: specific portion of a given eigenmode corresponding to the sub-lattice

725: assigned to this node. In our case, a simple regular space

726: decomposition of the $16^3 \times 32$ lattice in $z$ and $t$

727: directions is applied.

728:

729: Physically, the 300~eigenmodes for each field configuration are stored

730: as a single large file in round-robin manner. In case of a stripe-size

731: that corresponds to the size of an entire time-slice of the lattice,

732: each time-slice will be assigned to exactly one processor by PVFS\@.

733:

734: In our application we use MPI-IO calls

735: (\texttt{MPI\_File\_read\_at()}) instead of standard I/O to

736: read from the PVFS\@.  In \tab{ROMIO} performance averages over four

737: measurements are presented.  The results fluctuate only marginally.

738:

739: \begin{table}

740: \begin{center}

741: \caption{\label{ROMIO} Read and write performances for three compute partitions.}

742: \begin{tabular}{l|ccc}

743: \hline

744: \# of compute nodes    &   16  & 32  & 64 \\

745: \# of eigenmodes       &   100 & 200 & 300 \\

746: \hline

747: read performance per I/O-node [MBytes/s] &   11.9& 13.3& 11.1\\

748: write performance per I/O-node [MBytes/s] &  11.9& 13.4& 13.2\\

749: \hline

750: \end{tabular}

751: \end{center}

752: \end{table}

753:

754: As to be expected from \fig{READ-DISK} the throughput for

755: read-operations is as high as 13~MB/s, which is the actual hard-disk

756: performance. The effective throughput for write-operations is about 10

757: \% less than achieved in \fig{WRITE}, most likely a remaining MPI-IO

758: overhead.

759:

760: \subsection{Gain}

761:

762: A production step includes reading eigenmodes and computing

763: observables. Typically the computation takes about 10~minutes. Another

764: 10~minutes are needed for loading all 300~modes from the local disk.

765: Loading the data from an external archive (which is the usual

766: procedure since local disks do not provide enough capacity for an

767: entire ensemble of field configurations and eigenmodes) lasts about

768: 30~minutes. This large mismatch between compute time and I/O time

769: (where the processors are idling) renders standard clusters very

770: inefficent for such type of computational problems, denoted as {\it

771:   data-intensive}.

772:

773: PVFS via ParaStation/Myrinet cuts down the I/O-times to less than

774: 20~seconds for both reading and writing. This is a substantial

775: acceleration compared to local or remote disk I/O.  With these

776: improvements we were able to enter large-scale productions with

777: thousands of read-compute-write sequences to be carried out. After

778: several months of continuous heavy duty we find the I/O system to

779: behave remarkably reliable and stable, with no failure encountered so

780: far.

781:

782: \section{Future Perspectives: ClusterFile}

783:

784: Most parallel file systems, including PVFS, distribute the files

785: stripe-wise over the I/O-nodes. However, parallel applications provide

786: their specific data structures that are often accessed in form of

787: regular patterns called the ``virtual'' or ``logical'' partitioning of

788: the data.  The access pattern studies~\cite{NK+96,SR97,SR98} have

789: shown that the performance and scalability of parallel scientific

790: applications with intensive parallel I/O-activity suffer significantly

791: from the mismatch between virtual partitioning and physical placement

792: of file data. This may result in an under-utilization of disk and

793: network bandwidths and in a decreased parallel exploitation of independent

794: disk capacity---among other effects.

795:

796: For the above mentioned reasons, it is important that a parallel file

797: system offers support for flexible physical partitioning. On one hand

798: one would wish a fully developed PFS to be able to ``translate''

799: efficiently from any physical partitioning of the files on the

800: I/O-nodes of the PFS to any virtual partitioning and vice versa, on

801: the other hand one would like to control the physical data layout on

802: the PFS\@. So far, this goal is only partly realized by PVFS\@.

803: Therefore, we decided to develop a novel parallel file system, called

804: ClusterFile~\cite{IT01}, that addresses these issues.

805:

806: In its present state, ClusterFile presents architectural similarities

807: with PVFS, in so far that several I/O-daemons are responsible

808: for storing the data on I/O-nodes with one central meta-data manager.

809: The clients must link to a library that provides transparent file

810: system access.

811:

812: Unlike PVFS, ClusterFile employs a file partitioning model, that

813: allows the arbitrary distribution of data over a cluster, in regular

814: or irregular patterns. The model is optimized for regular patterns,

815: since most frequently used data structures of parallel scientific

816: applications are multidimensional arrays~\cite{NK+96}, partitioned

817: into chunks among parallel processes.

818:

819: The file model of ClusterFile is used for both physical and logical

820: partitioning. A file is physically partitioned into sub-files. As an

821: example, it is possible to spread a file over several disks using a

822: block-cyclic distribution or any other regular and irregular

823: distribution as for instance supported by High Performance

824: Fortran~\cite{HPF}.

825:

826: A file may be logically partitioned between several processors by

827: means of views. A view is a sequential window to an eventually

828: non-contiguous subset of a file and can be used exactly in the same

829: way as a file. An important advantage of using views is that it

830: relieves the programmer from complex index computation.  Additionally,

831: it also provides hints on potential future access patterns and can be

832: used for matching the physical to the logical distribution. For

833: instance, if a file is striped in round-robin manner over the disks

834: while the parallel processes of an application set block-cyclic views,

835: it could be better for performance to convert the physical layout into

836: a conforming block-cyclic distribution.

837:

838: The processes of a parallel application access a shared file in many

839: cases at nearly equal times. This observation suggests the design of

840: collective I/O operations. Their main goal is to coalesce many small

841: I/O requests of several cooperating processes into a few large

842: requests. The two main categories of collective I/O are

843: two-phase~\cite{RB+93} and disk-directed~\cite{KD94} operations. The

844: two-phase I/O consists of a file access step that is independent of

845: the virtual partitioning, and a shuffle-phase in which the access data

846: is distributed according to the access pattern.  The MPI-IO library,

847: which is implemented on top of PVFS and ClusterFile utilizes the

848: two-phase method since it is not aware of the physical file

849: partitioning.  ClusterFile implements a version of the disk-directed

850: method, in which the requests are gathered at I/O nodes, coalesced,

851: before access is performed, and the result is returned to the compute

852: nodes. This approach has the advantage that it can exploit the

853: relationship between physical and logical partitionings, whereas the

854: two-phase method separates the operation into two distinct steps.

855:

856: In the future we plan to test ClusterFile on ALiCE and to perform a

857: detailed comparison with PVFS\@.  We expect the collective I/O

858: implementation of ClusterFile to perform better than that of MPI-IO,

859: due to the above mentioned reasons.

860:

861: Another important step will be the incorporation of cooperative caching

862: policies~\cite{DW94} that allow the buffer-caches of I/O- and compute-nodes

863: to interact. The goal is to provide a global caching policy that

864: provides a better utilization of buffer-caches and avoids unnecessary

865: disk requests.

866:

867: \section{Summary}

868: We have demonstrated that the ParaStation3 communication system speeds

869: up the performance of parallel I/O on cluster computers such as ALiCE.

870: I/O-benchmarks with PVFS using Parastation over Myrinet achieve a

871: throughput for write-operations of up to 1~GB/s from a 32-processor

872: compute-partition, given a 32-processor PVFS I/O-partition.  These

873: results out-perform known benchmark results for PVFS on 1.28~Gbit

874: Myrinet by more than a factor of 2, a fact that is mainly due to the

875: superior communication features of ParaStation.  Read-performance from

876: buffer-cache reaches up to 2.2~GB/s, while reading from hard-disk

877: saturates at the cumulative hard-disk performance. The I/O-performance

878: achieved with PVFS using ParaStation enables us to carry out extremely

879: data-intensive eigenmode computations on ALiCE in the framework of

880: lattice quantum chromodynamics. In the future the I/O-system will be

881: utilized for storing and processing mass data in high energy physics

882: data analysis on clusters.

883:

884: \section*{Acknowledgments}

885:

886: This work was supported by the Deutsche Forschungsgemeinschaft as

887: twinning project ``Alpha-Linux-Cluster'' (Ti264/6-1 \& Li701/3-1).

888: Physics related work was supported under Li701/4-1 (RESH

889: Forschergruppe FOR 240/4-1), and by the EU Research and Training

890: Network HPRN-CT-2000-00145 ``Hadron Properties from Lattice QCD''.  We

891: thank Guido Arnold and Boris Orth for their help with the cluster

892: computer ALiCE\@.

893:

894:

895: \bibliographystyle{h-elsevier}

896: \bibliography{pvfs}

897:

898: \end{document}

899:

900: \bibitem{BONNIE}    http://www.coker.com.au/bonnie++/

901: \bibitem{GPFS}      http://www-124.ibm.com/gpfs/

902: \bibitem{PVFS-site} http://parlweb.parl.clemson.edu/pvfs/

903:

904:

905: %%% Local Variables:

906: %%%   mode: latex

907: %%%   TeX-master: "paper"

908: %%%   time-stamp-line-limit: 16

909: %%%   coding: latin-1-unix

910: %%% End:

911:

912: %%% LocalWords:  Deutsche Forschungsgemeinschaft RESH Forschergruppe kB

913: %%% LocalWords:  QCD Eicker Isaila Lippert PVFS ALiCE PFS AFS GPFS TCP

914: %%% LocalWords:  IP MPI IPD ECC D�ssel HPC Pallas Moschny Schilling Tichy

915: %%% LocalWords:  partitionings Linux ARP ClusterFile ParTec Gau�stra�e eigenmodes

916: