1: %
2: % Fast Parallel I/O on ParaStation Clusters
3: %
4: % Authors:
5: % Thomas Düssel, Norbert Eicker, Florin Isaila, Thomas Lippert,
6: % Hartmut Neff, Thomas Moschny, Klaus Schilling, Walter Tichy
7: %
8: % Keywords:
9: % Parallel I/O, Cluster, ParaStation, PVFS, ClusterFile, ...
10: %
11: % $Revision: 1.48 $
12: % $Date: 2003/03/18 14:40:18 $
13: %
14: % Time-stamp: <2003-02-27 16:23:55 CET moschny>
15: %
16: \documentclass{elsart}
17:
18: \usepackage{amsmath,amssymb}
19: \usepackage[latin1]{inputenc}
20: %\usepackage[T1]{fontenc}
21: %\usepackage{palatino}
22: \usepackage{latexsym}
23: \usepackage{graphicx}
24:
25: \newcommand{\eq}[1]{eq.~(\ref{#1})}
26: \newcommand{\eqs}[3]{Eqs. (\ref{#1},\ref{#2},\ref{#3})}
27: \newcommand{\fig}[1]{Figure~\ref{#1}}
28: \newcommand{\tab}[1]{Table~\ref{#1}}
29:
30: \makeatletter
31: \makeatother
32:
33: \hyphenation{Para-Station}
34:
35: \begin{document}
36:
37:
38: \begin{frontmatter}
39:
40:
41: \title{Fast Parallel I/O on Cluster Computers}
42:
43: \author[W]{Thomas Düssel},
44: \author[W,P]{Norbert Eicker},
45: \author[K]{Florin Isaila},
46: \author[W]{Thomas Lippert},
47: \author[W,K]{Thomas Moschny},
48: \author[B]{Hartmut Neff},
49: \author[W]{Klaus Schilling},
50: \and\author[K]{Walter Tichy}
51: \address[W]{Department of Physics, University of Wuppertal, Gaußstraße~20,
52: 42097~Wuppertal, Germany}
53: \address[P]{ParTec AG, Possartstr.~20, 81679~München, Germany}
54: \address[K]{Institute for Program Structures and
55: Data Organization (IPD),
56: University of Karlsruhe, Postfach~6980,
57: 76128~Karlsruhe, Germany}
58: \address[B]{Physics Department, Boston University,
59: 590~Commonwealth Avenue Boston,\\ MA~02215, USA}
60:
61:
62: \begin{abstract}
63: % This is version \verb%$Id: paper.tex,v 1.48 2003/03/18 14:40:18 lippert Exp $%.
64: Today's cluster computers suffer from slow I/O, which slows down
65: I/O-intensive applications. We show that fast disk I/O can be
66: achieved by operating a parallel file system over fast networks such
67: as Myrinet or Gigabit Ethernet.
68:
In this paper, we demonstrate how the ParaStation3 communication
system helps speed up parallel I/O on clusters,
using the open source parallel virtual file system (PVFS) as testbed
and production system. We describe the set-up of PVFS on the
73: Alpha-Linux-Cluster-Engine (ALiCE) located at Wuppertal University,
74: Germany. Benchmarks on ALiCE achieve write-performances of up to
75: 1~GB/s from a 32-processor compute-partition to a 32-processor PVFS
76: I/O-partition, outperforming known benchmark results for PVFS on the
77: same network by more than a factor of 2. Read-performance from
78: buffer-cache reaches up to 2.2~GB/s. Our benchmarks are giant,
79: I/O-intensive eigenmode problems from lattice quantum
chromodynamics, demonstrating stability and performance of PVFS over
ParaStation in large-scale production runs.\\[6pt]
82: \noindent {\em Keywords: Cluster Computing, Parallel File System, ParaStation,
83: Lattice QCD}
84:
85: \end{abstract}
86:
87: \end{frontmatter}
88:
89: \section{Introduction}
90:
In recent years, commodity-off-the-shelf (COTS) clusters have
evolved into cost-effective general-purpose HPC devices. Such
93: systems, self-made and commonly denoted as Beowulf
94: computers~\cite{ridge97beowulf}, show up increasingly on the TOP500
95: list~\cite{top500}. While gigabit network technology
96: (Gigabit-Ethernet, Myrinet) along with error-correcting
97: zero-copy-communication software~\cite{GM,SCORE,PARTEC} have boosted
98: communication-intensive number-crunching tasks, I/O-intensive
99: computations have not benefited from cluster computers to the same
extent. The reason is the relatively slow I/O capability of
101: clusters.
102:
103: Today, a promising approach to achieve fast I/O on cluster computers
104: is to utilize distributed disks and their aggregate bandwidth by means
105: of a parallel file system (PFS). A PFS is designed to make the entire
106: disk capacity of the I/O-nodes available to all the compute-nodes and
107: to allow the parallel file access of the compute nodes to be
108: translated into real parallel disk access. Physically, files are
109: stored on a given partition of cluster nodes by distributing the data
110: of the given file, for instance in a round robin fashion. In contrast
111: to standard network file systems, a PFS provides concurrent parallel
112: access to store or read the file from all nodes of a parallel
113: application. In a typical implementation, a set of compute-nodes
114: reads and writes data to another set of I/O-nodes that host the
115: physical resources of the PFS\@. The two sets of nodes may be
116: identical, may partly overlap or may even be distinct. In principle
117: this concept allows for scalability of the I/O-rate with the number of
118: I/O-nodes, provided that \emph{(i)} the number of compute-nodes is
119: large enough to saturate the capacity of the I/O-nodes---this is
120: usually the case as soon as the number of compute-nodes equals the
121: number of I/O-nodes---and \emph{(ii)} the network delivers full
122: bi-sectional bandwidth---this is, for instance, the case for crossbar
123: or multi-stage crossbar topologies.
124:
125: Distributed file systems like NFS or AFS are not suited for concurrent
126: high-bandwidth file-access as required in I/O-intensive cluster
127: computing applications. In these systems, parallel data accesses of
compute nodes are serialized by file servers. Therefore, they cannot
be called ``parallel'' file systems. There exist commercially available
130: parallel file systems (sometimes platform dependent), for example the
131: General Parallel File System GPFS (IBM~\cite{GPFS}). Currently, the
132: only known open source parallel file system, working in a stable
133: manner, and freely available for Linux under the GNU General Public
134: License, is the Parallel Virtual File System (PVFS) developed at
135: Clemson University and Argonne National Laboratory~\cite{PVFS-Site}.
136: PVFS is devised as a truly parallel file system for use on cluster
137: computers. The communication back-end of the standard distribution of
138: PVFS is based on the TCP/IP protocol. Therefore, PVFS can readily be
139: operated on top of any network supporting TCP/IP\@.
140:
141:
142: In this paper we consider PVFS boosted by the Myrinet communication
143: network~\cite{MYRINET} of the Alpha Linux Cluster Engine (ALiCE)
144: located at Wuppertal University, Germany~\cite{PIK:2002}. There are
several error-correcting and packet-loss-safe communication
146: sub-systems available, designed to drive Myrinet: e.g.\ the vendor-provided GM
147: software~\cite{MYRINET}, SCore~\cite{SCORE}, and the ParaStation
148: system~\cite{PARTEC}, developed at Karlsruhe University. On ALiCE, we
149: are using ParaStation.
150:
151: ParaStation implements the concept of virtual nodes, operating in
152: close interaction with queuing systems like PBS~\cite{PBS}. The
153: communication system provides safe multi-user operation and
154: outstanding stability, not to mention the comfortable
155: single-point-of-administration management by means of the
156: ParaStation-daemons. System crashes are tidily cleaned-up without any
157: user interference. Most important in our context is however the
158: communication bandwidth under ParaStation. A special kernel module
159: routes TCP/IP via the ParaStation communication system and renders
160: Myrinet an additional IP-network with full bi-sectional bandwidth. In
161: this manner, the superior bandwidth from ParaStation as Myrinet driver
162: can be exploited.
163:
164: Combining parallel file systems like PVFS with ParaStation perfectly
165: meets the demands of an application from the post-simulation phase of
166: a large scale Monte Carlo project in lattice-quantum chromodynamics
167: (LQCD). We are evaluating giant eigenproblems~\cite{Neff:2001zr},
168: which are very data-intensive, on cluster computers. The eigenvectors
169: are required for the construction of correlations between two quark
170: loops. Such creation and annihilation of quarks originate from the
171: quantum fluctuations of the QCD vacuum, according to Heisenberg's
uncertainty principle. They are considered responsible for the unusually
173: large mass of the $\eta'$-meson~\cite{Schilling:2002gm}. In our
174: simulations, we have to compute $\mathcal O(1000)$ low eigenvectors of
175: the fermionic matrix, which describes the dynamics of the quarks. The
176: size of each vector is $\mathcal O(10^6)$.
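
To get a feeling for the data volume, a rough estimate can be made under
the assumption (ours, for illustration only) that each vector holds
$10^6$ complex double-precision entries of 16~bytes each:
\[
  10^3\ \textrm{vectors}\times 10^6\ \textrm{entries}\times
  16~\textrm{bytes} = 1.6\times 10^{10}~\textrm{bytes}
  \approx 16~\textrm{GB},
\]
which is of the same order as the I/O volume per production step quoted
below.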
177:
178: Typically, about 10~GB of I/O is carried out in data-intensive
179: production steps of about 10~minutes compute time on 16 to
180: 64~processors; actually several thousands of such runs are performed.
181: Without parallel I/O, reading or writing lasts between 10 and
182: 30~minutes. PVFS helps cut down the read and write times to about
183: 20~seconds.
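
As a rough cross-check of these numbers, taking 1~GB as 1000~MB, the
quoted times correspond to the following aggregate data rates:
\[
  \frac{10\,000~\textrm{MB}}{1800~\textrm{s}} \approx 5.6~\textrm{MB/s},
  \qquad
  \frac{10\,000~\textrm{MB}}{600~\textrm{s}} \approx 16.7~\textrm{MB/s},
  \qquad
  \frac{10\,000~\textrm{MB}}{20~\textrm{s}} = 500~\textrm{MB/s}.
\]
Parallel I/O thus shortens the I/O phase by a factor between 30 and 90.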
184:
185: The paper is organized as follows: in Section~\ref{SETUP} we present
186: the new TCP/IP kernel module included in ParaStation and describe the
187: connectivity of ALiCE and its specific PVFS implementation.
188: Section~\ref{SPEEDS} gives benchmarks for the components of the
189: I/O-machinery, including disk-speed and TCP/IP node-to-node rates. The
190: results of the PVFS benchmarks are shown in Section~\ref{PVFSBENCH}
along with a comparison with Ref.~\cite{PVFSpaper}. We are in a
position to test PVFS/ParaStation within a large-scale application
193: from lattice quantum chromodynamics, described in Section~\ref{EIGEN}.
194: Finally we summarize and give a short outlook on the novel parallel
195: file system ClusterFile that is currently under development at the
196: IPD, University of Karlsruhe~\cite{IT01}.
197:
198: \newpage
199:
200: \section{Technical Background\label{SETUP}}
201:
202: For cluster applications based on IP-communication to benefit from
203: ParaStation's performance, a TCP/IP kernel module has been introduced
204: on top of the ParaStation communication system. In this manner,
205: stable communication with gigabyte bandwidth is provided for the
206: parallel virtual file system (PVFS) as implemented on the 128-node
207: Alpha-Linux-Cluster-Engine ALiCE\@.
208:
209: \subsection{TCP/IP kernel module with ParaStation\label{MODULE}}
210:
Many cluster applications use neither MPI nor the low-level
communication API provided, for instance, by ParaStation. To benefit
from ParaStation's performance for such applications, an additional
TCP/IP module was developed at the Institute for Program Structures
and Data Organization (IPD), University of Karlsruhe.
216:
217: This module provides a network driver interface to the Linux kernel,
218: just as any other Ethernet card driver does. This way, virtually any
219: Ethernet protocol that is supported by the kernel can be transported
220: over Myrinet. In practice, most applications will use the TCP (or UDP)
over IP protocol\footnote{The intended use is the reason for the
  somewhat misleading name of the module. A more accurate name would be
  ``Ethernet driver for ParaStation''.}.
224:
225: Internally, the module uses the kernel variant of the ParaStation
226: low-level communication interface to send raw Ethernet datagrams to
any other host in the cluster. This interface supports only a set of basic
communication operations, since no further functionality is needed at
this level. Packet (dis)assembly, for instance, is done by the
230: kernel. Upon startup, a so-called \textit{kernel-context} is obtained
231: from the ParaStation module. This reserves a certain number of
232: communication buffers in the memory of the Myrinet network adapter
233: card and instructs the driver to listen for messages addressed to the
234: TCP/IP module. Secondly, a call-back mechanism is established:
whenever such a message arrives, an interrupt is raised and the
236: ParaStation interrupt handler calls a method of the TCP/IP module
237: which in turn hands the message to the Linux network stack. If a
238: message is to be sent, the network stack functions call a method of
239: the module that was registered upon initialization, handing the
240: message over to the ParaStation module.
241:
242: In order to address other nodes in the cluster, the TCP/IP driver
243: module (in fact an Ethernet driver) maps the ParaStation node
244: identification number (ParaStation ID) onto Ethernet hardware
245: addresses. The administrator can set up static ARP tables that map
246: IP addresses to these hardware addresses. This is necessary, as the
247: driver does not support broadcast messages that would be required to
248: enable automatic ARP functionality. However, if IP addresses are
249: chosen such that the ParaStation ID (the node number counted from 0)
250: equals the IP address minus one, the static ARP tables can be omitted.
251: In this case, any driver module in the cluster can guess ParaStation
252: IDs from IP addresses and thus generate fake ARP reply messages.
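
As an illustration of this convention, write $h$ for the host part of
a node's IP address. The subnet \texttt{192.168.1.0/24} used here is a
hypothetical example and not the actual ALiCE configuration:
\[
  \textrm{ParaStation ID} = h - 1, \qquad
  \texttt{192.168.1.1} \mapsto 0, \quad
  \texttt{192.168.1.2} \mapsto 1, \quad \dots, \quad
  \texttt{192.168.1.128} \mapsto 127 .
\]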
253:
254:
255: \subsection{ALiCE setup}
256:
257: The Alpha-Linux-Cluster-Engine ALiCE is an assembly of 128 Compaq DS10
258: workstations~\cite{PIK:2002}. ALiCE, located at the University of
Wuppertal (Germany), has been fully operational since the end of 2000. The
260: machine is equipped with Alpha 21264 EV67 processors, with 2~MB cache,
261: operating at a frequency of 616~MHz. With 256~MB ECC memory for each
262: processor, the total amount of memory is 32~GB\@. The disk space of
263: 10~GB per node adds up to about 1.3~TB in total. The nodes are
264: interconnected by a 2 $\times$ 1.28~Gbit/s Myrinet network configured
265: as a multi-stage crossbar with full bi-sectional bandwidth. The
266: M2M-PCI64A-2 Myrinet cards utilize a 64~bit/33~MHz PCI bus.
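
These figures follow from simple arithmetic. The two peak rates below
are theoretical upper bounds that ignore any protocol overhead, and we
read the factor of 2 in the link specification as one link per
direction (our interpretation):
\begin{eqnarray*}
  128 \times 256~\textrm{MB} &=& 32\,768~\textrm{MB} = 32~\textrm{GB},\\
  128 \times 10~\textrm{GB} &=& 1280~\textrm{GB} \approx 1.3~\textrm{TB},\\
  \frac{1.28~\textrm{Gbit/s}}{8~\textrm{bit/byte}} &=& 160~\textrm{MB/s}
  \quad\textrm{(Myrinet link, per direction)},\\
  8~\textrm{bytes} \times 33~\textrm{MHz} &=& 264~\textrm{MB/s}
  \quad\textrm{(PCI bus peak)}.
\end{eqnarray*}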
267:
268: \fig{MYRINET} shows the hardware plan of the inter-node connection of
269: ALiCE\@. A full crossbar is realized by three switch-stages (where
270: each octal switch actually consists of two stages) employing the
271: $2\times 4$ Myrinet M2M-oct-SW8 switches and 8~Myrinet M2M-dual-SW8
272: switches. The hardware latency is about 100~ns per switch stage, far
273: below the total latency (software and hardware) of 17.1~$\mu$s.
274:
275: \begin{figure}[!htb]
276: \begin{center}
277: \includegraphics[width=.82\textwidth]{c14.eps}
278: \end{center}
279: \caption{Myrinet multi-stage crossbar network.\label{MYRINET}}
280: \end{figure}
281:
282: We have extended the network in order to incorporate an external file
283: and archive server as well as to provide gigabit links to external
284: machines for the purpose of fast on-line visualization. To this end,
we have replaced two of the 8 M2M-dual-SW8 switches with two M2M-SW16
286: switches, as sketched in \fig{MYRINET2}. This way, we could avoid an
287: inhomogeneous number of hardware stages of the multi-stage crossbar
288: network.
289:
290: \begin{figure}[!htb]
291: \begin{center}
292: \includegraphics[width=\textwidth]{netz.eps}
293: \end{center}
294: \caption{Inclusion of external devices (not all connections
295: drawn).\label{MYRINET2}}
296: \end{figure}
297:
298: \subsection{PVFS partitioning}
299:
On ALiCE, we run 4 different PVFS partitions with 32~nodes each. This
301: partitioning fits well with the compute-partitions, chosen such as to
302: optimize the compute performances of our applications. Each PVFS
303: partition (\texttt{/pvfs1} to \texttt{/pvfs4}) is represented by a
304: mount point on each node and on the file server. Mounting PVFS on the
305: file server enables us to copy UNIX files with ParaStation TCP/IP
306: speed from the external RAID to PVFS\@. For each partition, the last
307: node plays the rôle of the management node. The entire set-up is
308: displayed in \fig{SETUPPVFS}.
309:
310: \begin{figure}[!htb]
311: \begin{center}
312: \includegraphics[width=.8\textwidth]{pvfs.eps}
313: \end{center}
314: \caption{Organization of 4 PVFS partitions on ALiCE.\label{SETUPPVFS}}
315: \end{figure}
316:
317: \section{Performance of Basic I/O-Devices\label{SPEEDS}}
318:
319: The proper interpretation of the benchmark results presented in
320: Section~\ref{PVFSBENCH} requires some background knowledge about the
321: features of ALiCE's basic I/O-components, i.e., file system
322: performance, TCP/IP and MPI data rates.
323:
324: \subsection{Local file system performances\label{DIRTY}}
325:
326: The ALiCE DS10 nodes are equipped with 10~GB Maxtor IDE
327: disks\footnote{5400~rpm versions.}. We have carried out tests on
328: local file systems, formatted with ReiserFS, by means of
329: \texttt{bonnie++}~\cite{BONNIE}. Write and read-performances are
330: determined for a series of file sizes, from 10~MB up to 2~GB\@. In
331: order to reveal buffer-cache effects, we have adjusted the ``dirty
332: buffer'' parameter\footnote{The standard behavior of the Linux kernel
333: is to start flushing buffer pages as soon as the given percentage of
334: memory available for buffer cache is ``dirty''. A buffer page is
  called dirty if it has been changed in memory but has not yet been written to
336: disk.} to two different values, 40~\% and 70~\%. Our results are
337: given in \fig{BONNIE}.
338:
339: \begin{figure}[!htb]
340: \begin{center}
341: \includegraphics[width=.8\textwidth]{bonnie.eps}
342: \end{center}
343: \caption{Local file system performances on ALiCE.\label{BONNIE}}
344: \end{figure}
345:
346: The \texttt{bonnie++} benchmark works as follows: the given file is
written and read back immediately. We observe that the write performance
to buffer cache of 60~MB/s falls to about 23~MB/s at file sizes of
100 and 200~MB for 40 and 70~\% dirty buffers, respectively.
Asymptotically, the performance decreases to about 19~MB/s. Being
351: able to read from the buffer cache, the subsequent read operation
352: shows a bandwidth of more than 280~MB/s for small files. At a file
353: size of about 200~MB, the speed drops down to 24~MB/s.
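
The positions of the drops in write performance are roughly what one
would estimate from the dirty-buffer thresholds, assuming that almost
all of the 256~MB of main memory is available as buffer cache:
\[
  0.40 \times 256~\textrm{MB} \approx 102~\textrm{MB}, \qquad
  0.70 \times 256~\textrm{MB} \approx 179~\textrm{MB}.
\]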
354:
355: \subsection{TCP/IP performance via ParaStation}
356:
On ALiCE, we can choose TCP packets to be routed via Fast Ethernet or
358: alternatively---using the new TCP/IP kernel module described in
359: Section~\ref{MODULE}---via ParaStation/Myrinet. The \texttt{ttcp}
360: benchmark issues TCP/IP packets over a point-to-point connection to
361: determine the uni-directional TCP/IP speed, cf.\ \fig{TCPFIG}. The
362: performance is seen to saturate at 93~MB/s.
363:
364: \begin{figure}[htb]
365: \begin{center}
366: \includegraphics[width=.8\textwidth]{ttcp.eps}
367: \end{center}
368: \caption{Performance test by \texttt{ttcp} via ParaStation.\label{TCPFIG}}
369: \end{figure}
370:
371: It is instructive to compare these results with the outcome of the
372: Pallas \texttt{send-receive} MPI benchmark~\cite{PALLASMPI}, see
373: \fig{PALLAS}.
374:
375: \begin{figure}[!htb]
376: \begin{center}
377: \includegraphics[width=.7\textwidth]{pmb2_sendrecv.eps}
378: \end{center}
379: \caption{Performances of the
380: \texttt{send-receive} Pallas MPI-benchmark on ALiCE.\label{PALLAS}}
381: \end{figure}
382:
383: \begin{figure}[htb]
384: \begin{center}
385: \includegraphics[width=.7\textwidth]{mpi_unidir.eps}
386: \end{center}
387: \caption{MPI-performance for uni-directional communication.\label{MPI_UNIDIR}}
388: \end{figure}
389:
390: For the \texttt{send-receive} case (i.e.\ the bi-directional
391: situation), the performance levels off at a total bandwidth of about
175~MB/s (adding up the data rates of both directions). As the Pallas
393: benchmark does not provide uni-directional measurements, we have
394: prepared corresponding \texttt{send} and \texttt{receive} programs and
395: found a saturation of the uni-directional MPI-performance at about
396: 140~MB/s, cf.\ \fig{MPI_UNIDIR}. The data rate via TCP/IP is only
397: about 34~\% smaller than via MPI, in spite of the overheads of the
full-fledged TCP/IP implementation. Still, this leaves room for
improvement: since the paths to all IP destinations on a cluster are
known a priori, one could try to set up a slim TCP/IP protocol.
Moreover, error checking is currently carried out twice, once at the
ParaStation level and once at the TCP/IP level.
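
The relative loss quoted above can be retraced from the two
uni-directional saturation rates:
\[
  \frac{140~\textrm{MB/s} - 93~\textrm{MB/s}}{140~\textrm{MB/s}}
  \approx 0.34 .
\]
For comparison, the bi-directional total of about 175~MB/s corresponds
to roughly 87.5~MB/s per direction, if both directions contribute
equally (an assumption made here for illustration).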
403:
404: \section{File System Benchmarks\label{PVFSBENCH}}
405:
406: Our benchmarks follow Ref.~\cite{PVFSpaper}, where performance results
407: on 60~nodes of the Chiba-City cluster at Argonne National Laboratory
408: have been reported. This cluster is equipped with the same Myrinet
409: version as ALiCE\@. This will enable us to compare our results using
410: TCP/IP over ParaStation on an Alpha system with TCP/IP over the
411: Myrinet/GM software on a Pentium cluster. Since loading and
412: discharging large amounts of data to the cluster constitute a crucial
413: bottleneck for data-intensive production runs on clusters, we include
414: performances with respect to reading from and writing to UNIX files
415: located on an external RAID\@.
416:
417: \subsection{Concurrent read/write performance}
418:
419: Our test code works as follows: a new PVFS file, common to $P$
420: compute processes, is opened on $N$ I/O-nodes. Concurrently the same
421: amount of data $S$ is written from each of the $P$ processes (virtual
422: partitioning) to disjoint parts of the file. PVFS stripes
423: the data onto the $N$ I/O-nodes (physical partitioning)
424: with a stripe size of 64~kB.
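
Such a round-robin distribution can be sketched as follows. We assume
here, for illustration only, that striping starts at I/O-node~0; the
symbols $o$ (byte offset within the file), $s$ (stripe size) and $n(o)$
(index of the I/O-node holding that byte) are our notation and not
taken from PVFS:
\[
  n(o) = \left\lfloor \frac{o}{s} \right\rfloor \bmod N .
\]
For example, with $s=64$~kB and $N=4$, the byte at offset $o=300$~kB
lies in stripe $\lfloor 300/64\rfloor = 4$ and is therefore stored on
I/O-node $4 \bmod 4 = 0$.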
425:
426: After the data is written, we close the file and reopen it again in
427: order to reshuffle the same data back to the compute-nodes. The
428: bandwidth for write and read operations is computed from the maximum
429: of the wall clock execution times achieved on all the $P$
430: compute-nodes.
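
In formulas, with $t_i$ denoting the wall clock time measured on
compute-node $i$ (our notation), this reading of the procedure gives
the aggregate bandwidth
\[
  B = \frac{P \cdot S}{\max_{i=1,\dots,P} t_i} .
\]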
431:
432: We vary $P$ in the range $P=1\dots 64$, and repeat each measurement
433: for $N=4$, $N=16$ and $N=32$ I/O-nodes. The amount of data $S$ written
434: and read \textit{per compute-node} is chosen proportional to the
435: number $N$ of I/O-nodes, $S/N = \textrm{const}$ (here we follow
436: closely the benchmark of Ref.~\cite{PVFSpaper}). The reasoning is that
437: although we vary the number of I/O-nodes, the buffer-cache will be
438: saturated for one and the same number of compute-nodes. Indeed this
439: behavior is borne out in \fig{WRITE}, which shows the cumulative
440: throughput for the write operation.
441:
442: \begin{figure}[htb]
443: \begin{center}
444: \includegraphics[width=.8\textwidth]{write.eps}
445: \end{center}
446: \caption{Concurrent write performance.\label{WRITE}}
447: \end{figure}
448:
Following Ref.~\cite{PVFSpaper}, we carried out 5~measurements
in each case. The smallest and the largest results were discarded and
the remaining ones were averaged. In fact, the five values did
not differ by more than 2~\% in any of the measurements.
453:
454: As we see from \fig{WRITE}, the performance quickly reaches a plateau
455: for each I/O-partition. There is no visible impact from the number of
456: compute-nodes as long as the buffer-cache of the I/O-nodes is not
457: saturated. This occurs when the number of compute-nodes is greater
458: than 50. The amount of data written and read on every I/O-node is
459: $P\cdot S/N$, so it is greater than 100~MB for $P>50$ and
460: $S/N=2\,\textrm{MB}$\footnote{In fact the buffer-cache itself is not
461: saturated, but at this point the amount of ``dirty buffers'' has
462: reached 40~\% of the whole buffer-cache, see Section~\ref{DIRTY}.
463: With no other program running at the same time, almost all of the
464: main memory is used for the buffer-cache, and 40~\% of 256~MB yield
465: 100~MB, as observed.}.
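
The threshold can be retraced in one line. With $S/N = 2~\textrm{MB}$,
each I/O-node receives
\[
  \frac{P\cdot S}{N} = P \times 2~\textrm{MB} > 100~\textrm{MB}
  \quad\Longleftrightarrow\quad P > 50 .
\]
For $N=32$, for instance, each compute-node writes
$S = 32 \times 2~\textrm{MB} = 64~\textrm{MB}$.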
466:
467: The write performance reaches between 29 and 35~MB/s for each
468: I/O-node, not exhausting the \texttt{bonnie++} figures of
469: Section~\ref{DIRTY} or the TCP/IP speed, see \fig{TCPFIG}. However,
470: we achieve about 30~\% faster write performances than reported in
Ref.~\cite{PVFSpaper}. Comparing Figure~6 of
Ref.~\cite{PVFSpaper} with \fig{WRITE}, we recognize the substantially
improved stability of our curves, a well-known feature of the
communication sub-system ParaStation.
475:
476: After buffer-cache saturation, the performance drops down to a value
477: which is about 18~\% smaller than expected from the hard disk
478: performance benchmarks displayed in \fig{BONNIE}.
479:
480: As the read operation is carried out directly after the write, with
481: only a synchronizing barrier in between, the read process can draw the
482: data directly from buffer-cache. As explained above, dirty buffers are
483: written to disk if their size exhausts the limit of 100~MB\@. However,
484: they remain in memory and can be read back at a rate limited only by
485: the memory bandwidth. Thus, \fig{READ} shows no loss of performance
486: throughout the test range.
487:
488: \begin{figure}[htb]
489: \begin{center}
490: \includegraphics[width=.8\textwidth]{read.eps}
491: \end{center}
492: \caption{Concurrent read performance.\label{READ}}
493: \end{figure}
494:
495: It is gratifying to find that each I/O-node can send with a speed
496: varying between 56~MB/s and 75~MB/s, since several sockets are served
497: simultaneously. This performance is just 20~\% slower than the
498: measured point-to-point performance via TCP/IP, but still about 45~\%
499: slower than the actual capabilities of ParaStation as seen in MPI
500: applications, see \fig{PALLAS}. The maximal performance reaches more
501: than 1800~MB/s for 32 I/O-nodes.
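
These percentages can be retraced from the saturation rates of
Section~\ref{SPEEDS}, taking the upper value of 75~MB/s per I/O-node:
\[
  1 - \frac{75}{93} \approx 0.19, \qquad
  1 - \frac{75}{140} \approx 0.46 .
\]
Likewise, the lower value of 56~MB/s per I/O-node, if taken to apply to
the largest partition (our assumption), gives
$32 \times 56~\textrm{MB/s} \approx 1800~\textrm{MB/s}$ in total.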
502:
We should remark that the read test is rather artificial.
It would be more meaningful given a hard disk with a read
505: performance faster than 75~MB/s. In general, a real application reads
506: data from disk and not from buffer-cache. In that case, one expects a
507: saturation below hard disk read performance, as demonstrated in
508: Section~\ref{EIGEN}.
509:
510: In order to test the raw throughput of the disk, i.e.\ without
511: utilizing the buffer cache, a modified benchmark was used.
512: Now, a huge amount of data (multiple times the size of the buffer
513: cache) is created and written to several files. After writing and
514: closing the last file, it is very unlikely that any data from the
515: first files is still present in the buffer cache. A subsequent read
516: operation will therefore be forced to read directly from hard-disk.
517: \fig{READ-DISK} shows the result of this benchmark.
518:
519: \begin{figure}[htb]
520: \begin{center}
521: \includegraphics[width=.8\textwidth]{read-disk.eps}
522: \end{center}
523: \caption{Concurrent read performance from disk.\label{READ-DISK}}
524: \end{figure}
525:
The throughput for read operations from PVFS drops
dramatically in such a realistic setup. In the case of 4~I/O-nodes
528: the read performance drops to about 13~MB/s per I/O-node.
529: Nevertheless a total read performance of 300~MB/s can be achieved if
530: all 32~I/O-nodes are utilized.
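
Expressed per I/O-node and in total, these numbers read
\[
  \frac{300~\textrm{MB/s}}{32} \approx 9.4~\textrm{MB/s per I/O-node}
  \quad (N=32), \qquad
  4 \times 13~\textrm{MB/s} = 52~\textrm{MB/s in total}
  \quad (N=4),
\]
so the per-node rate of roughly 9 to 13~MB/s stays well below the local
read rate of about 24~MB/s measured with \texttt{bonnie++} in
Section~\ref{DIRTY}.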
531:
532: The results presented so far have used a stripe size of 64~kB, the
533: default value of PVFS\@. Taking a stripe size as large as the amount
534: of data a given compute-node has to write, we achieve about 920~MB/s
535: writing from 32 compute-nodes to 32 I/O-nodes. The corresponding
read operation achieves up to 2200~MB/s when served from the buffer cache.
When the data has to be read from disk instead, the rate is limited by
the disk throughput, cf.\ \fig{READ-DISK}.
538:
539: A second important feature---as far as data-intensive computations on
540: clusters are concerned---is the speed for charging and discharging the
system. A high throughput is crucial for the success of
542: computer experiments that work on data sets much larger than the PVFS
543: disk space available. In this case one has to retrieve (store) data
544: from (to) an external TB-size repository.
545:
546: Distributing UNIX files onto PVFS from the archive in principle could
proceed with the TCP/IP performance of about 93~MB/s. Actually, the
548: limiting factor is the disk performance of the file server. Our RAID
549: delivers or stores at about 25~MB/s. In practice this limitation
550: poses no real problem, since data staging can be carried out
551: asynchronously to the parallel applications on the cluster.
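
To give an idea of the staging times involved, consider a data set of
10~GB, the typical I/O volume of one production run (taking 1~GB as
1000~MB):
\[
  \frac{10\,000~\textrm{MB}}{25~\textrm{MB/s}} = 400~\textrm{s}
  \approx 6.7~\textrm{min}, \qquad
  \frac{10\,000~\textrm{MB}}{93~\textrm{MB/s}} \approx 108~\textrm{s}.
\]
The first value is limited by the RAID; the second would be reached if
the TCP/IP rate of Section~\ref{SPEEDS} were the only limit.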
552:
553: \section{Fast I/O for Giant Eigenproblems in Lattice Field
554: Theory\label{EIGEN}}
555:
556: In the following, we demonstrate how PVFS/ParaStation enables us to
557: compute huge eigensystems on cluster computers. Such computational
558: problems arise in the post simulation phase of Monte Carlo simulations
559: of lattice quantum chromodynamics (LQCD). We aim at computing
560: $\mathcal O(1000)$ low eigenvectors of the so-called fermionic matrix, which
561: describes the dynamics of quarks in the gluon background field. The
562: size of each vector is $\mathcal O(10^6)$. Typically about 10~GB I/O is
563: carried out in production runs of length $\approx$30~minutes on
564: 32~processors of ALiCE\@. In practice, we have to perform thousands of
565: such runs.
566:
567: \subsection{Physical problem}
568:
569: Non-perturbative lattice quantum chromodynamics (LQCD) deals with the
570: determination of hadronic properties and
571: interactions~\cite{Montvay:1994cy}. Particularly important
572: observables are given by the mass spectrum of bound quark states, as
for instance the masses of hadrons like the $\pi$-meson and the
$\rho$-meson. Among these particles, hadronic states that can be
575: classified as \textit{singlet representation} of the flavor-SU(3)
576: group play a special r\^ole. They are characterized by contributions
577: of so-called non-valence objects. More precisely, their correlation
functions, $C_{\eta'}(t_1-t_2)$, the quantities which allow one to extract
579: the physical properties of the hadrons by exploring their decay in
580: time, $\Delta t=t_1-t_2$, contain contributions from correlators
581: between closed virtual quark-gluon loops. These ``non-valence''
582: objects are nothing but a manifestation of quantum mechanical vacuum
583: fluctuations, which follow from Heisenberg's uncertainty principle as
584: applied to relativistic field theory. From a physical point of view,
585: flavor singlet objects are particularly intriguing, as they are
sensitive to (and thus allow one to explore) the topological structure of
587: the QCD vacuum.
588:
589: The reliable determination of disconnected diagrams has been a
590: long-standing issue ever since the early days of LQCD\@. It can be
591: traced back to the numerical problem of getting information about
592: functionals of the complete inverse fermionic matrix
593: $M^{-1}$.\footnote{In contrast to flavor singlet observables,
594: non-singlet masses are far simpler to compute: they imply the
595: solution of a few systems of linear equations of type $Mx=b$, the
596: discrete analogue of Dirac's equation with source term.}
597:
598: First attempts in this direction started only a few years ago, using
599: the so-called stochastic estimator method (SE)~\cite{Eicker:1996gk} for the
computation of the trace of $M^{-1}$. This approach requires solving
a linear system $Mx = \xi $ for hundreds of source vectors $\xi$, with
$\xi$ being noise vectors that are Z$_2$ or Gaussian distributed.
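
In its standard form, the stochastic estimate of the trace reads as
follows, where $N_\xi$ (our notation) is the number of noise vectors
$\xi^{(i)}$ and $x^{(i)}$ solves $M x^{(i)} = \xi^{(i)}$. This is a
sketch of the general technique and may differ in detail from the
implementation of Ref.~\cite{Eicker:1996gk}:
\[
  \mbox{Tr}\, M^{-1} \approx \frac{1}{N_\xi} \sum_{i=1}^{N_\xi}
  \xi^{(i)\dagger} M^{-1} \xi^{(i)}
  = \frac{1}{N_\xi} \sum_{i=1}^{N_\xi} \xi^{(i)\dagger} x^{(i)} .
\]
The estimate becomes exact for $N_\xi\to\infty$ because the noise
components are uncorrelated with unit variance,
$\langle \xi_a \xi_b^{*} \rangle = \delta_{ab}$.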
603:
604: In Ref.~\cite{Neff:2001zr}, we have shown how to estimate the mass of
605: the $\eta'$ meson just using a set of low-lying eigenmodes of $M$.
606: Strictly speaking, our approach deals with the matrix $Q$, the
607: hermitian form of $M$, the eigenvectors of which form an orthogonal
basis~\cite{Hip:2001hc}:
609: \begin{equation}
610: Q =\gamma_5 M.
611: \end{equation}
612: The Wilson-Dirac matrix is given by
613: \begin{eqnarray}
  M_{\mathsf{x,y}} = \mathbf{1}_{cs}\,\delta_{\mathsf{x,y}}
  - \kappa \sum_{\mu=1}^4 \Big[ &&
  (\mathbf{1}_s-\gamma_{\mu})\otimes U_{\mu}(\mathsf{x})\,
  \delta_{\mathsf{x,y}-\mu}\nonumber\\
  +&&
  (\mathbf{1}_s+\gamma_{\mu})\otimes U_{\mu}^{\dagger}(\mathsf{x}-\mu)\,
  \delta_{\mathsf{x,y}+\mu} \Big].
620: \end{eqnarray}
The symbols $\gamma_{\mu}$ stand for the $4\times 4$ Dirac spin
matrices. The $3\times 3$ matrices $U_{\mu} \in$ color-SU(3) represent
the gluonic vector field; accordingly, $\mathbf{1}_{cs}$ is the
$12\times 12$ unit matrix in color and spin space. $M$ is a sparse
matrix in 4-dimensional Euclidean space-time with matrix-valued,
stochastically distributed coefficient functions of type
$(\mathbf{1}_s\pm \gamma_{\mu})\otimes U_{\mu}(\mathsf{x})$ at site
$\mathsf{x}$.
628:
629: Once the low-lying modes are computed, it is possible to approximate
630: the full inverse matrix $Q^{-1}$ and those matrix functionals or
631: functions of $Q$ which are sensitive to small eigenvalues, i.e.\
632: long-range physics.
633:
634: \subsection{Numerical procedure}
635: In LQCD, hadronic masses are extracted from the large-time behavior of
636: correlation functions. The correlator of the flavor
637: \textit{non-singlet} $\pi$-meson is defined as
638: \begin{equation}
639: C_{\pi} (t=t_1-t_2) = \left<
640: \sum_{\vec x,\vec y} \mbox{Tr} \Big[ Q^{-1} (\vec x,t_1;\vec y,t_2) Q^{-1}
641: (\vec y,t_2;\vec x,t_1) \Big] \right>_U, \label{picorr}
642: \end{equation}
643: while the flavor \textit{singlet} $\eta'$ meson correlator is composed
644: of two terms,
645: the first
646: corresponding to the propagation of a quark-antiquark-pair from $\vec
647: x$ to $\vec y$ without annihilation in between and the second one
648: being characterized by intermediate pair annihilation:
649: \begin{eqnarray}\label{ETAPRIME}
650: C_{\eta'} (t_1-t_2) &=& C_{\pi} (t_1-t_2)\nonumber\\ &-&2 \left<
651: \sum_{\vec x,\vec y} \mbox{Tr} \Big[ Q^{-1} (\vec x,t_1;\vec x,t_1) \Big]
652: \mbox{Tr} \Big[ Q^{-1} (\vec y,t_2;\vec y,t_2) \Big] \right>_U.
653: \end{eqnarray}
The brackets $\left< \ldots \right>_U$ indicate the average over a
canonical ensemble of gauge field configurations. For large time
separations $t$, the higher excitations die out, so that the
respective correlation functions are dominated by the ground state
and become proportional to $\exp(-m_{0}t)$, where $m_0$ is the mass of
the particle associated with the correlation function.
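
In practice, this behavior is often made visible through a so-called
effective mass. The following definition is a generic sketch, in
lattice units and neglecting contributions from the periodic boundary
in time, and is not necessarily the fit procedure used for the results
cited here:
\begin{equation*}
  m_{\mathrm{eff}}(t) = \ln \frac{C(t)}{C(t+1)}
  \;\longrightarrow\; m_0 \quad \mbox{for large } t .
\end{equation*}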
660:
661: As already mentioned, the $\pi$-correlator (\ref{picorr}), equivalent
662: to the first term of the $\eta'$-correlator, is obtained by solving the
663: linear system
\begin{equation}
  M(\vec x,t_1;\vec y,t_2)\, c(\vec y,t_2) = \delta(\vec 1,1;\vec x,t_1) \label{lineq}
\end{equation}
on 12 source vectors (here located at site $(\vec 1,1)$). Of course,
the statistics could be improved by averaging over many sources;
however, this becomes prohibitively expensive as the effort increases
with the number of sources.
671:
672: The second term in \eq{ETAPRIME},
673: \begin{equation}
674: \sum_{\vec x,\vec y} \mbox{Tr}
675: \Big[ Q^{-1} (\vec x,t_1;\vec x,t_1) \Big] \mbox{Tr} \Big[ Q^{-1}
676: (\vec y,t_2;\vec y,t_2) \Big],
677: \end{equation}
678: depends on the diagonal elements of $Q^{-1}$. The inverse can be
679: expanded in terms of the eigenmodes weighted by the inverse
680: eigenvalues:
681: \begin{equation}
682: Q^{-1} (\vec x,t_1;\vec y,t_2) = \sum_i\frac{1}{\lambda_i}
683: \frac{|\psi_i(\vec x,t_1)\rangle\langle\psi_i(\vec y,t_2)|}{
684: \langle\psi_i|\psi_i\rangle} ,
685: \end{equation}
where $\lambda_i$ and $\psi_i$ are the eigenvalues and the
eigenvectors of $Q$, respectively. Due to the factor
$1/\lambda_i$, one expects the low-lying eigenmodes to
dominate the sum. Indeed, we found that the sum on the right-hand side
can be approximated by restricting it to the $\mathcal O(300)$
lowest-lying eigenvalues and their corresponding eigenvectors. We call
this procedure the truncated eigenmode approximation (TEA).
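
Written out, the truncation can be sketched as follows, where
$N_{\mathrm{ev}}$ (a symbol introduced here for illustration) denotes
the number of retained modes, assumed to be ordered by increasing
$|\lambda_i|$:
\begin{equation*}
  Q^{-1} (\vec x,t_1;\vec y,t_2) \approx
  \sum_{i=1}^{N_{\mathrm{ev}}} \frac{1}{\lambda_i}
  \frac{|\psi_i(\vec x,t_1)\rangle\langle\psi_i(\vec y,t_2)|}{
    \langle\psi_i|\psi_i\rangle} .
\end{equation*}
The time-slice sums entering the second term in \eq{ETAPRIME} then
reduce to
\begin{equation*}
  \sum_{\vec x} \mbox{Tr} \Big[ Q^{-1} (\vec x,t;\vec x,t) \Big] \approx
  \sum_{i=1}^{N_{\mathrm{ev}}} \frac{1}{\lambda_i}
  \frac{\sum_{\vec x} \psi_i^{\dagger}(\vec x,t)\,\psi_i(\vec x,t)}{
    \langle\psi_i|\psi_i\rangle} ,
\end{equation*}
where $\psi_i^{\dagger}\psi_i$ implies a sum over the spin and color
components. In this form, only the stored eigenmodes restricted to
single time-slices are needed and no further linear systems have to be
solved.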
693:
We compute the eigensystem by means of the implicitly restarted
Arnoldi method (IRAM), a generalization of the standard Lanczos
procedure. A crucial ingredient of our approach is a Chebyshev
acceleration technique: the spectrum is transformed such that the
Arnoldi eigenvalue determination becomes uniformly accurate over the
entire part of the spectrum we aim for. A convenient parallel
implementation of IRAM is provided by the PARPACK package
\cite{PARPACK}.
702:
We work on a space-time lattice of size $16^3 \times 32$. Taking into
account the Dirac and color indices, the Dirac matrix acts on a $12
\times 16^3 \times 32 = 1{,}572{,}864$ dimensional vector space. This
explains why the inversion of the entire Dirac matrix is not feasible:
the full inverse would occupy about 40~TB of memory, whereas the
300~low-lying eigenvectors require only about 7.5~GB.
Our computations are based on canonical ensembles
of 200~field configurations with $n_f = 2$ flavors of dynamical sea
quarks. Such ensembles have been generated at 5~different
712: dynamical quark masses in the framework of the SESAM full QCD project
713: \cite{Lippert:1999ha}. It takes us about 3.5~Tflops-hours to solve
714: for 300~low-lying modes per ensemble for each quark mass. Altogether
715: we aim at $\mathcal O(30)$ valence mass/coupling combinations. First
716: physical results of our computations can be found in
717: Refs.~\cite{Neff:2001zr,Attig:2001ty,Neff:2002mq,Schilling:2001xd}.
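
The memory estimates quoted above for the full inverse and for the
eigenvectors can be reproduced as follows, assuming that every matrix
or vector element is a double-precision complex number of 16~bytes (an
assumption; the number format is not stated here). With $N = 12 \times
16^3 \times 32 = 1{,}572{,}864$,
\begin{eqnarray*}
  N^2 \times 16~\mbox{bytes} &\approx& 4.0 \times 10^{13}~\mbox{bytes}
  \approx 40~\mbox{TB},\\
  300 \times N \times 16~\mbox{bytes} &\approx& 7.5 \times
  10^{9}~\mbox{bytes} \approx 7.5~\mbox{GB}.
\end{eqnarray*}
The first line corresponds to the dense inverse, the second to the
300~stored eigenvectors.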
718:
719: \subsection{Benchmarks}
720:
721: The typical compute partitions used for the TEA on ALiCE range from 16
722: to 64~nodes, depending on the number of eigenmodes required for the
723: approximation as well as on the memory available. Each node reads its
724: specific portion of a given eigenmode corresponding to the sub-lattice
725: assigned to this node. In our case, a simple regular space
726: decomposition of the $16^3 \times 32$ lattice in $z$ and $t$
727: directions is applied.
728:
Physically, the 300~eigenmodes for each field configuration are stored
as a single large file that is striped in round-robin manner. With a
stripe size that corresponds to the size of an entire time-slice of the
lattice, each time-slice is assigned to exactly one processor by PVFS\@.
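
As a rough illustration of the sizes involved, assume that each of the
$12 \times 16^3$ complex components of a time-slice is stored in double
precision, i.e.\ with 16~bytes (an assumption; the storage format is
not specified here). The stripe size $s$ and the size of one eigenmode
would then be
\begin{eqnarray*}
  s &=& 12 \times 16^3 \times 16~\mbox{bytes} = 786{,}432~\mbox{bytes},\\
  32\, s &=& 25{,}165{,}824~\mbox{bytes} \approx 2.5 \times
  10^{7}~\mbox{bytes}.
\end{eqnarray*}
In a round-robin layout over $N_{\mathrm{IO}}$ I/O-nodes (a symbol
introduced here for illustration), the $k$-th stripe of the file,
counting from zero, resides on I/O-node $k \bmod N_{\mathrm{IO}}$, up
to the choice of the first node.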
733:
734: In our application we use MPI-IO calls
735: (\texttt{MPI\_File\_read\_at()}) instead of standard I/O to
736: read from the PVFS\@. In \tab{ROMIO} performance averages over four
737: measurements are presented. The results fluctuate only marginally.
738:
739: \begin{table}
740: \begin{center}
741: \caption{\label{ROMIO} Read and write performances for three compute partitions.}
742: \begin{tabular}{l|ccc}
743: \hline
744: \# of compute nodes & 16 & 32 & 64 \\
745: \# of eigenmodes & 100 & 200 & 300 \\
746: \hline
747: read performance per I/O-node [MBytes/s] & 11.9& 13.3& 11.1\\
748: write performance per I/O-node [MBytes/s] & 11.9& 13.4& 13.2\\
749: \hline
750: \end{tabular}
751: \end{center}
752: \end{table}
753:
As expected from \fig{READ-DISK}, the throughput for
read-operations is as high as 13~MB/s, which is the actual hard-disk
performance. The effective throughput for write-operations is about
10~\% less than achieved in \fig{WRITE}, most likely due to a remaining
MPI-IO overhead.
759:
760: \subsection{Gain}
761:
762: A production step includes reading eigenmodes and computing
763: observables. Typically the computation takes about 10~minutes. Another
764: 10~minutes are needed for loading all 300~modes from the local disk.
765: Loading the data from an external archive (which is the usual
766: procedure since local disks do not provide enough capacity for an
entire ensemble of field configurations and eigenmodes) takes about
30~minutes. This large mismatch between compute time and I/O time
(during which the processors are idle) renders standard clusters very
inefficient for this type of computational problem, denoted as
\textit{data-intensive}.
772:
773: PVFS via ParaStation/Myrinet cuts down the I/O-times to less than
774: 20~seconds for both reading and writing. This is a substantial
775: acceleration compared to local or remote disk I/O. With these
776: improvements we were able to enter large-scale productions with
777: thousands of read-compute-write sequences to be carried out. After
several months of continuous heavy duty we find the I/O system to
be remarkably reliable and stable, with no failure encountered so
far.
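
Taking the round numbers above at face value, the gain can be
illustrated by the fraction $\varepsilon$ of a production step that is
spent computing. The symbols $\varepsilon$, $T_{\mathrm{comp}}$ and
$T_{\mathrm{read}}$ are introduced here for illustration, and the time
for writing is neglected:
\begin{equation*}
  \varepsilon = \frac{T_{\mathrm{comp}}}{T_{\mathrm{comp}} +
    T_{\mathrm{read}}} \approx
  \left\{ \begin{array}{ll}
      10/(10+30) = 25~\% & \mbox{external archive,}\\
      10/(10+10) = 50~\% & \mbox{local disk,}\\
      600/(600+20) \approx 97~\% & \mbox{PVFS via ParaStation/Myrinet.}
    \end{array} \right.
\end{equation*}
In terms of the read time alone, this corresponds to factors of at
least $1800/20 = 90$ and $600/20 = 30$ relative to the archive and the
local disk, respectively.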
781:
782: \section{Future Perspectives: ClusterFile}
783:
784: Most parallel file systems, including PVFS, distribute the files
785: stripe-wise over the I/O-nodes. However, parallel applications provide
their specific data structures that are often accessed in the form of
regular patterns called the ``virtual'' or ``logical'' partitioning of
the data. Access pattern studies~\cite{NK+96,SR97,SR98} have
shown that the performance and scalability of parallel scientific
applications with intensive parallel I/O-activity suffer significantly
from the mismatch between virtual partitioning and physical placement
of file data. Among other effects, this may result in an
under-utilization of disk and network bandwidths and in a decreased
parallel exploitation of independent disk capacity.
795:
For the above-mentioned reasons, it is important that a parallel file
system offers support for flexible physical partitioning. On the one
hand, one would wish a fully developed PFS to be able to ``translate''
efficiently from any physical partitioning of the files on the
I/O-nodes of the PFS to any virtual partitioning and vice versa. On
the other hand, one would like to control the physical data layout on
the PFS\@. So far, this goal is only partly realized by PVFS\@.
803: Therefore, we decided to develop a novel parallel file system, called
804: ClusterFile~\cite{IT01}, that addresses these issues.
805:
In its present state, ClusterFile is architecturally similar to
PVFS, insofar as several I/O-daemons are responsible
for storing the data on I/O-nodes with one central meta-data manager.
809: The clients must link to a library that provides transparent file
810: system access.
811:
Unlike PVFS, ClusterFile employs a file partitioning model that
allows the arbitrary distribution of data over a cluster, in regular
or irregular patterns. The model is optimized for regular patterns,
since the most frequently used data structures of parallel scientific
applications are multidimensional arrays~\cite{NK+96}, partitioned
into chunks among parallel processes.
818:
819: The file model of ClusterFile is used for both physical and logical
820: partitioning. A file is physically partitioned into sub-files. As an
example, it is possible to spread a file over several disks using a
block-cyclic distribution or any other regular or irregular
distribution, as supported for instance by High Performance
Fortran~\cite{HPF}.
825:
A file may be logically partitioned between several processors by
means of views. A view is a sequential window to a possibly
non-contiguous subset of a file and can be used in exactly the same
way as a file. An important advantage of views is that they
relieve the programmer from complex index computations. Additionally,
they provide hints on potential future access patterns and can be
used for matching the physical to the logical distribution. For
833: instance, if a file is striped in round-robin manner over the disks
834: while the parallel processes of an application set block-cyclic views,
835: it could be better for performance to convert the physical layout into
836: a conforming block-cyclic distribution.
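
To indicate the kind of index computation that a view hides, consider
a one-dimensional block-cyclic distribution of file elements over $p$
processes with block size $b$. This is a generic textbook mapping with
illustrative symbols $j$, $b$, $p$, $r$ and $\ell$, not ClusterFile's
internal representation. Counting from zero, element $j$ of the file
belongs to the view of process
\begin{equation*}
  r(j) = \lfloor j/b \rfloor \bmod p ,
\end{equation*}
where it appears at position
\begin{equation*}
  \ell(j) = \lfloor j/(b\,p) \rfloor \, b + (j \bmod b) .
\end{equation*}
For example, with $b=2$ and $p=2$, the elements $4$ and $5$ belong to
process~0 at positions $2$ and~$3$.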
837:
838: The processes of a parallel application access a shared file in many
839: cases at nearly equal times. This observation suggests the design of
840: collective I/O operations. Their main goal is to coalesce many small
841: I/O requests of several cooperating processes into a few large
842: requests. The two main categories of collective I/O are
843: two-phase~\cite{RB+93} and disk-directed~\cite{KD94} operations. The
two-phase I/O consists of a file access step that is independent of
the virtual partitioning, and a shuffle phase in which the accessed data
are distributed according to the access pattern. The MPI-IO library,
which is implemented on top of PVFS and ClusterFile, utilizes the
two-phase method since it is not aware of the physical file
partitioning. ClusterFile implements a version of the disk-directed
method, in which the requests are gathered at the I/O-nodes and coalesced
before the access is performed, and the result is returned to the compute
852: nodes. This approach has the advantage that it can exploit the
853: relationship between physical and logical partitionings, whereas the
854: two-phase method separates the operation into two distinct steps.
855:
856: In the future we plan to test ClusterFile on ALiCE and to perform a
857: detailed comparison with PVFS\@. We expect the collective I/O
implementation of ClusterFile to perform better than that of MPI-IO,
for the reasons given above.
860:
861: Another important step will be the incorporation of cooperative caching
862: policies~\cite{DW94} that allow the buffer-caches of I/O- and compute-nodes
to interact. The goal is a global caching policy that
makes better use of the buffer-caches and avoids unnecessary
disk requests.
866:
867: \section{Summary}
868: We have demonstrated that the ParaStation3 communication system speeds
869: up the performance of parallel I/O on cluster computers such as ALiCE.
I/O-benchmarks with PVFS using ParaStation over Myrinet achieve a
throughput for write-operations of up to 1~GB/s from a 32-processor
compute-partition, given a 32-processor PVFS I/O-partition. These
results outperform known benchmark results for PVFS on 1.28~Gbit
874: Myrinet by more than a factor of 2, a fact that is mainly due to the
875: superior communication features of ParaStation. Read-performance from
876: buffer-cache reaches up to 2.2~GB/s, while reading from hard-disk
877: saturates at the cumulative hard-disk performance. The I/O-performance
878: achieved with PVFS using ParaStation enables us to carry out extremely
879: data-intensive eigenmode computations on ALiCE in the framework of
880: lattice quantum chromodynamics. In the future the I/O-system will be
881: utilized for storing and processing mass data in high energy physics
882: data analysis on clusters.
883:
884: \section*{Acknowledgments}
885:
886: This work was supported by the Deutsche Forschungsgemeinschaft as
887: twinning project ``Alpha-Linux-Cluster'' (Ti264/6-1 \& Li701/3-1).
888: Physics related work was supported under Li701/4-1 (RESH
889: Forschergruppe FOR 240/4-1), and by the EU Research and Training
890: Network HPRN-CT-2000-00145 ``Hadron Properties from Lattice QCD''. We
891: thank Guido Arnold and Boris Orth for their help with the cluster
892: computer ALiCE\@.
893:
894:
895: \bibliographystyle{h-elsevier}
896: \bibliography{pvfs}
897:
898: \end{document}
899:
900: \bibitem{BONNIE} http://www.coker.com.au/bonnie++/
901: \bibitem{GPFS} http://www-124.ibm.com/gpfs/
902: \bibitem{PVFS-site} http://parlweb.parl.clemson.edu/pvfs/
903:
904:
905: %%% Local Variables:
906: %%% mode: latex
907: %%% TeX-master: "paper"
908: %%% time-stamp-line-limit: 16
909: %%% coding: latin-1-unix
910: %%% End:
911:
912: %%% LocalWords: Deutsche Forschungsgemeinschaft RESH Forschergruppe kB
913: %%% LocalWords: QCD Eicker Isaila Lippert PVFS ALiCE PFS AFS GPFS TCP
914: %%% LocalWords: IP MPI IPD ECC Düssel HPC Pallas Moschny Schilling Tichy
915: %%% LocalWords: partitionings Linux ARP ClusterFile ParTec Gaußstraße eigenmodes
916: