1: \documentclass[twocolumn,twoside,slac]{revtex4}
2: \usepackage{graphicx}
3: \usepackage{fancyhdr}
4:
5: % $Id$
6:
7: \pagestyle{fancy}
8: \fancyhead{} % clear all fields
9: \fancyhead[C]{\it {2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, Ca, USA, March 2003}} \fancyhead[RO,LE]{\thepage}
10: \fancyfoot{} % clear all fields
11: \fancyfoot[LE,LO]{\bf THAT006}
12: \renewcommand{\headrulewidth}{0pt}
13: \renewcommand{\footrulewidth}{0pt}
14: \renewcommand{\sfdefault}{phv}
15:
16: \setlength{\textheight}{235mm}
17: \setlength{\textwidth}{170mm}
18: \setlength{\topmargin}{-20mm}
19:
20: \bibliographystyle{apsrev}
21:
22: \begin{document}
23:
24: \title{Worldwide Fast File Replication on Grid Datafarm}
25:
26: \author{Osamu Tatebe, Satoshi Sekiguchi}
27: \affiliation{AIST, Tsukuba, Ibaraki 3058568, JAPAN}
28: %
29: \author{Youhei Morita}
30: \affiliation{KEK, Tsukuba, Ibaraki 3050801, JAPAN}
31: %
32: \author{Satoshi Matsuoka}
33: \affiliation{Tokyo Institute of Technology, Meguro, Tokyo 152-8552, JAPAN}
34: %
35: \author{Noriyuki Soda}
36: \affiliation{Software Research Associates, Inc., Naka, Nagoya, 4600003, JAPAN}
37:
38: \begin{abstract}
39: The Grid Datafarm architecture is designed for global petascale
40: data-intensive computing. It provides a global parallel filesystem
41: with online petascale storage, scalable I/O bandwidth, and scalable
42: parallel processing, and it can exploit local I/O in a grid of
clusters with tens of thousands of nodes. One of its features is that it
manages file replicas in filesystem metadata for fault tolerance and
load balancing.
46:
47: This paper discusses and evaluates several techniques to support
long-distance fast file replication. The Grid Datafarm manages a
ranked group of files as a Gfarm file. Each member file, called a Gfarm file
fragment, is stored on a filesystem node or replicated on several
filesystem nodes. Each Gfarm file fragment is replicated
independently and in parallel using rate-controlled HighSpeed TCP with
network striping. On a US-Japan testbed spanning 10,000 km, we
achieve 419 Mbps using 2 nodes on each side, and 741 Mbps using 4
nodes, out of the 893 Mbps available on two transpacific networks.
56: \end{abstract}
57:
58: \maketitle
59:
60: \thispagestyle{fancy}
61:
62: %-------------------------------------------------------------------------
63: \section{Introduction}
64: \label{sec:intro}
Petascale data-intensive computing is becoming a reality in fields such as
high-energy physics data analysis, astronomical data analysis, and
bio-informatics data analysis. More than ten petabytes of storage need
to be shared and analyzed by world-wide users with high efficiency,
high security, and high dependability.
70:
The Grid Datafarm architecture \cite{gfarm-ccgrid2002} is designed for
global petascale data-intensive computing, enabling large amounts of
data to be processed at multiple regional PC clusters. The aim of
this research is to establish a large-scale parallel filesystem that
exploits the local storage of cluster nodes spread over a wide
area, as a platform for petabyte-scale data-intensive
computing. The Grid Datafarm architecture enables
high-speed access to a large amount of data by exploiting file access
locality, and tolerates disk and network faults through data
replication.
81:
This paper discusses long-distance fast file replication on the
Grid Datafarm. To improve performance in high bandwidth-delay
product networks, congestion control that remains efficient, fair,
scalable, and stable plays a key role. The easiest way to improve
performance is to open multiple TCP connections in parallel. However,
this approach leaves the number of connections to be
chosen by the user, and too many connections may cause heavy
congestion. Several research efforts address
this issue, such as HighSpeed TCP \cite{HighSpeedTCP}, Scalable TCP
\cite{ScalableTCP}, FAST TCP \cite{FASTTCP}, and XCP
\cite{XCP-SIGCOMM2002}. HighSpeed TCP is an attempt to improve
the congestion control of TCP for large congestion windows with better
flexibility, better scaling, and better slow-start behavior, while
competing more fairly with current TCP and preserving backward
compatibility and incremental deployability. It modifies the TCP
response function only for large congestion windows, so as to reach
high bandwidth reasonably quickly when in slow-start, and to reach high
bandwidth without overly long delays when recovering from multiple
retransmit timeouts or when ramping up from a period with small
congestion windows.
101:
For replication of large files in high bandwidth-delay product
networks, it is necessary to improve not only the network performance
but also the disk I/O performance. At present, each cluster node is
capable of transmitting at a rate of 1 Gbps, while the performance of
an IDE or a SCSI disk is at most 50 MB/s. Disk striping such as RAID-0
is effective for improving the disk I/O bandwidth.
108:
109: This paper is organized as follows. In Section~\ref{sec:rep}, the
110: file replication on the Grid Datafarm is discussed.
111: Section~\ref{sec:eval} evaluates the network performance and the file
112: replication performance using a US-Japan Grid Datafarm testbed.
113:
114: %-------------------------------------------------------------------------
115: %\section{HighSpeed TCP}
116: %TCP connections with larger congestion windows
117: %modify the TCP response function
118:
119: %-------------------------------------------------------------------------
120: \section{File Replication on Grid Datafarm}
121: \label{sec:rep}
122: The Grid Datafarm provides a Grid file system that federates multiple
123: local filesystems on a Grid across administrative domains. The Grid
124: file system provides virtualized hierarchical namespaces for files
125: having consistent access control with flexible capabilities
126: management. There is a replica catalog to manage mappings from the
127: hierarchical namespace for files to one or more physical file
128: locations. This enables efficient, dependable, and transparent file
129: sharing on a Grid.
130:
The Grid Datafarm has a feature for data-parallel execution. It
manages a ranked group of files as a single Gfarm file. This makes it
possible to manage many distributed files as a single file, which
can be analyzed in parallel. Each parallel process may
generate a new set of output files, which is also managed as a single
file. File-affinity scheduling and the new concept of a file view
enable the ``owner computes'' strategy, or ``move the computation to
the data'' approach, for parallel data analysis.
139:
When replicating a Gfarm file, each file of the Gfarm file, or each
group of files, stored on a different filesystem node can be replicated
in parallel and independently. File replication on the Grid Datafarm is
thus a parallel file replication from multiple cluster
nodes to multiple (different) cluster nodes.
145:
In high bandwidth-delay product networks, using multiple TCP streams, or
network striping, is effective for improving network performance, but
disk accesses to striped data decrease the performance of disk I/O\@.
It is therefore necessary to use a modified TCP or another protocol to
achieve high performance with a single stream. HighSpeed TCP is one
of the proposals for modifying the congestion control of TCP, and it
is used in the performance evaluation on a US-Japan Grid Datafarm
testbed.
154:
As described in Section~\ref{sec:intro}, the disk I/O performance is
poorer than the network bandwidth on each cluster node. One
requirement of the Grid Datafarm architecture is dense,
high-performance storage on each node to provide online petabyte-scale
storage. To meet this requirement, each node of the AIST Gfarm
cluster was designed as a 1U server with a 3ware RAID card and
four 120GB IDE disks in RAID-0. This configuration provides 480 GB of
storage capacity and over 110 MB/s for contiguous block reads and
writes, which is comparable with the network performance.
164:
165: %-------------------------------------------------------------------------
166: \section{Performance Evaluation}
167: \label{sec:eval}
168:
169: \subsection{PC clusters and wide-area networks}
170: \label{ssec:env}
171: During the international conference SC2002, held in Baltimore from
172: November 16th to November 22nd of 2002, a Grid Datafarm Data Grid
173: environment was set up linking seven PC clusters in both Japan and the
174: U.S.
175:
176: \begin{figure}[tb]
177: \includegraphics[width=\columnwidth]{network2-en.eps}
178: \caption{\label{fig:env} Network logical map of world-wide Grid
179: Datafarm testbed. Four sites in Japan and three sites in U.S. are
180: integrated with the Grid Datafarm Data Grid middleware.}
181: \end{figure}
182:
Seven PC cluster systems, including the one at the SC2002 booth (a
total of 190 PCs), were located at research centers in Japan
and the U.S. (AIST, High Energy Accelerator Research Organization,
Tokyo Institute of Technology, the University of Tokyo, Indiana
University, and San Diego Supercomputer Center (SDSC)). They were
integrated with the Grid Datafarm Data Grid middleware
\cite{gfarm-home} (Figure~\ref{fig:env}). Tsukuba WAN, APAN/TransPAC
\cite{apan-transpac}, and Maffin \cite{maffin} provided support in
establishing the network.
192:
The system had a peak floating-point performance of 962 Gflops. As a
Gfarm wide-area filesystem, it provided a large-capacity file system of
18 TB with an access rate of 6,600 MB/s.
196:
This environment utilized multiple high-speed wide-area networks:
Tsukuba WAN and SuperSINET in Japan, APAN/TransPAC and NII-ESnet
HEP PVC between Japan and the U.S., Abilene and ESnet in the U.S., and
SCinet at the SC2002 booth. The bandwidth from SC2002 to both Indiana
University and SDSC was 622 Mbps. The bandwidth of the transpacific
network was 893 Mbps. The total theoretical maximum bandwidth of the
network linking the seven clusters was 2.173 Gbps one way (see
Figure~\ref{fig:env}).
204:
For the file replication, a large amount of scientific data from
particle physics was generated, mainly on the large-scale PC cluster at
Tokyo Institute of Technology, and data replicas of several
hundred GB were created at each of the other clusters in a single
filesystem image.
210:
At the SC2002 booth in Baltimore, there was a 12-node AIST Gfarm
cluster whose nodes were connected with gigabit ethernet and which
connected to SCinet with 10 gigabit ethernet through a Force10 E1200
switch. Each node consisted of dual Intel Xeon 2.4GHz processors, 1GB
of memory, and a 3ware Escalade 7500-4 RAID controller with four 120GB
3.5'' HDDs configured in RAID-0. The disk I/O performance for
contiguous blocks was 109 MB/s for writes and 168 MB/s for reads.
The network bandwidth of gigabit ethernet was 941 Mbps, measured with
the iperf bandwidth measurement tool \cite{iperf}. File replication
performance using the Gfarm Data Grid middleware was 75 MB/s, which is
equivalent to 629 Mbps.
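
As a rough check of this unit conversion, assuming that 1~MB denotes
$2^{20}$ bytes (an assumption; the convention is not stated explicitly
in the text), the replication rate corresponds to
\begin{equation}
75 \times 2^{20} \times 8~\mathrm{bit/s} \approx 629~\mathrm{Mbps},
\end{equation}
which is about 67\% of the 941 Mbps measured with iperf.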
222:
223: % Fri May 16 01:34:16 JST 2003
224: %gfm65% ./thput-fsys -rw -b 65536 -s 10240
225: %testing with 10240 MB file
226: %bufsize write [bytes/sec] read [bytes/sec]
227: % 65536 108959734 196902836
228: %gfm65% ./thput-fsys -rw -b 65536 -s 20480
229: %testing with 20480 MB file
230: %bufsize write [bytes/sec] read [bytes/sec]
231: % 65536 113479404 185055488
232: %gfm65% ./thput-fsys -rw -b 65536 -s 40960
233: %testing with 40960 MB file
234: %bufsize write [bytes/sec] read [bytes/sec]
235: % 65536 115339604 181332188
236: %gfm65% ./thput-fsys -rw -b 65536 -s 81920
237: %testing with 81920 MB file
238: %bufsize write [bytes/sec] read [bytes/sec]
239: % 65536 114586309 176661637
240:
At the Grid Technology Research Center, AIST in Tsukuba, Japan, there
was a 7-node AIST Gfarm cluster of the same configuration, connected to
the Tokyo XP with gigabit ethernet via the Tsukuba WAN and Maffin
networks.
244:
At Indiana University, there was a 15-node PC cluster connected
with Fast Ethernet, which connected to the Indianapolis GigaPoP with
OC-12. The disk I/O performance for contiguous blocks was 9.3 MB/s for
writes and 10.2 MB/s for reads.
249:
250: %gfarm01% thput-fsys -b 65536 -s 2560 -rw
251: %testing with 2560 MB file
252: %bufsize write [bytes/sec] read [bytes/sec]
253: % 65536 9710990 10711022
254:
At SDSC in San Diego, there was an 8-node PC cluster connected with
gigabit ethernet, which connected to the outside with OC-12. The disk
I/O performance for contiguous blocks was 29.4 MB/s for writes and
20.0 MB/s for reads.
259:
260: %slic04% thput-fsys -rw -b 65536 -s 10240
261: %testing with 10240 MB file
262: %bufsize write [bytes/sec] read [bytes/sec]
263: % 65536 30840332 21006343
264:
The APAN/TransPAC transpacific network consisted of two links: the
northern route (OC-12 POS) between Seattle and Tokyo, and the southern
route (OC-12 ATM) between Chicago and Tokyo. The southern route was
shaped to 271 Mbps. By default, all IP packets were transmitted via
the northern route. To utilize both routes, we configured
static routing such that all traffic between three specific nodes at
the SC2002 booth and AIST was transmitted via the southern route.
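
Taking the northern route at the nominal OC-12 rate of 622 Mbps and the
southern route at its shaped rate of 271 Mbps, the aggregate
transpacific capacity available when both routes are used is
\begin{equation}
622 + 271 = 893~\mathrm{Mbps}.
\end{equation}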
272:
273: \begin{table}[tb]
\caption{Round trip time between the SC2002 booth and other sites.
`AIST (N)' and `AIST (S)' mean that IP packets are transmitted via the
northern route and via the southern route, respectively.}
277: \label{tab:rtt}
278: \begin{center}
279: \begin{tabular}{cc}
280: \hline
281: AIST (N) & 199 msec \\
282: AIST (S) & 222 msec \\
283: Indiana & 30 msec \\
284: SDSC & 86 msec \\
285: \hline
286: \end{tabular}
287: \end{center}
288: \end{table}
289:
The round trip time (RTT) of IP packets between the SC2002 booth and
other sites is shown in Table~\ref{tab:rtt}.
292:
293: %-------------------------------------------------------------------------
294: \subsection{HighSpeed TCP over transpacific network}
Figure~\ref{fig:sc-gfm} shows the measured network bandwidth of
HighSpeed TCP from the SC2002 booth in the U.S. to AIST in Japan via
the APAN/TransPAC northern route. The bandwidth was measured using
iperf from one node to one node with two HighSpeed TCP streams.
The buffer size of each socket was 8 MB, which gives a theoretical
peak bandwidth of 337 Mbps per connection with an RTT of 199 ms. As
Figure~\ref{fig:sc-gfm} shows, the measured peak bandwidth reached 529
Mbps in 5-second average out of the physical network bandwidth of 622
Mbps. The bandwidth occasionally dropped due to packet loss,
but it recovered reasonably quickly thanks to HighSpeed
TCP\@.
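
The window-limited peak quoted above can be reproduced as a sketch,
assuming that 8~MB denotes $8 \times 2^{20}$ bytes and that a single
connection can send at most one socket buffer of data per round trip:
\begin{equation}
\frac{8 \times 2^{20} \times 8~\mathrm{bit}}{0.199~\mathrm{s}}
\approx 337~\mathrm{Mbps}.
\end{equation}
Two such streams could carry up to about 674 Mbps, which exceeds the
622 Mbps link, so with this buffer size the link rather than the socket
buffer is the limiting factor.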
306:
307: % 289 Mbps 5-second peak bandwidth
308:
309: \begin{figure}[tb]
310: \includegraphics[width=\columnwidth]{sc-gfm.eps}
\caption{\label{fig:sc-gfm} HighSpeed TCP bandwidth from U.S. to
Japan via the APAN/TransPAC northern route (OC-12 POS).}
313: \end{figure}
314:
Figure~\ref{fig:gfm-sc} shows the measured HighSpeed TCP bandwidth
from Japan to the U.S. via the northern route. The machine and network
configurations were the same as in the previous measurement except for
the traffic direction. Because the traffic was slightly heavy at this
time, the measured peak bandwidth was 443 Mbps. After heavy
packet loss, the bandwidth recovered slowly, just as with regular
TCP\@.
322:
323: \begin{figure}[tb]
324: \includegraphics[width=\columnwidth]{gfm-sc.eps}
\caption{\label{fig:gfm-sc} HighSpeed TCP bandwidth from Japan to
U.S. via the APAN/TransPAC northern route (OC-12 POS).}
327: \end{figure}
328:
Figure~\ref{fig:aist-sdsc-ge} shows the case using the APAN/TransPAC
southern route. Since the southern route was shaped to 271 Mbps, one
HighSpeed TCP stream should be able to fill the bandwidth. However,
the stream suffered critical packet loss, and achieved only 85.9
Mbps in 10-minute average, although it reached 251 Mbps in 5-second
peak bandwidth. One reason for the critical packet loss was that
the ATM switch was configured without random early drop. To
cope with this problem, it is necessary to control the network traffic
rate so that it does not exceed the physical network bandwidth, that
is, 271 Mbps.
338:
339: \begin{figure}[tb]
340: \includegraphics[width=\columnwidth]{aist-sdsc-ge.eps}
341: \caption{\label{fig:aist-sdsc-ge} HighSpeed TCP bandwidth from Japan
342: to U.S. via the APAN/TransPAC southern route (OC-12 ATM, 271 Mbps
343: shaping) with one non-rate-limited HighSpeed TCP stream.}
344: \end{figure}
345:
When the maximum bandwidth of each HighSpeed TCP stream was limited to
100 Mbps, it was possible to achieve stable and high bandwidth, as shown
in Figure~\ref{fig:aist-sdsc-fe2}. This rate control was realized by
changing the network from gigabit ethernet to fast ethernet. This
case achieved 166.1 Mbps in 10-minute average, and 190.0 Mbps in
5-second peak bandwidth.
352:
353: \begin{figure}[tb]
354: \includegraphics[width=\columnwidth]{aist-sdsc-fe2.eps}
355: \caption{\label{fig:aist-sdsc-fe2} HighSpeed TCP bandwidth from Japan
356: to U.S. via the APAN/TransPAC southern route (OC-12 ATM, 271 Mbps
357: shaping) with two 100-Mbps HighSpeed TCP streams.}
358: \end{figure}
359:
360: %-------------------------------------------------------------------------
361: \subsection{File replication}
The performance of file replication of large files that do not fit in
main memory is limited by the disk I/O performance and the network
performance. When replicating between sites, the file replication
performance is also limited by the bandwidth of the wide-area network
shown in Figure~\ref{fig:env}. Table~\ref{tab:max-rep} shows the
performance limit of file replication with one node at each site.
368:
369: \begin{table}[tb]
370: \begin{center}
371: \caption{Performance limit of file replication using one node at each
372: site in MB/s.}
373: \label{tab:max-rep}
374: \begin{tabular}{c|cc}
375: & To & From \\
376: \hline
377: Indiana & 9.3 & 10.2 \\
378: SDSC & 29.4 & 20.0 \\
379: AIST & 109 & 112 \\
380: SC2002 & 109 & 112 \\
381: \end{tabular}
382: \end{center}
383: \end{table}
384:
Figure~\ref{fig:sc-ussites} shows the performance of file replication
between one node at the SC2002 booth and various numbers of nodes at
Indiana Univ.\ and SDSC\@. Between the SC2002 booth and Indiana
Univ., the file replication performance increased almost in proportion
to the number of nodes at Indiana Univ., and reached a maximum
of 34.9 MB/s, that is, 293 Mbps, from four nodes at
Indiana Univ.\ to one node at the SC2002 booth with a 4GB file.
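
For comparison, Table~\ref{tab:max-rep} lists a single-node limit of
10.2 MB/s for transfers from Indiana Univ. As a rough estimate, assuming
that the per-node limits simply add up, four source nodes give an upper
bound of
\begin{equation}
4 \times 10.2 = 40.8~\mathrm{MB/s},
\end{equation}
so the measured 34.9 MB/s is about 86\% of this bound.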
392:
393: \begin{figure}[tb]
394: \includegraphics[width=\columnwidth]{sc-ussites.eps}
395: \caption{\label{fig:sc-ussites} File replication performance between
396: one node at SC2002 booth and several nodes at Indiana Univ.\ and SDSC\@.}
397: \end{figure}
398:
From the SC2002 booth to SDSC, the file replication performance of a
2GB file achieved 32.8 MB/s even with one node at SDSC\@. On the
other hand, the performance from SDSC to the SC2002 booth showed a
different tendency, increasing scalably with the number of
nodes at SDSC\@. One reason was that this direction required multiple
HighSpeed TCP streams to improve the network performance, because one
stream achieved only 1.1 to 2.3 Mbps. For the file replication from
SDSC to the SC2002 booth, seven parallel streams were used for each
node pair.
408:
Table~\ref{tab:us-japan-rep} shows the parameters and the measured
bandwidth of file replication from Baltimore in the U.S. to Tsukuba in
Japan, at a distance of more than 10,000 km. As shown in the previous
section, a HighSpeed TCP stream achieved about 260 Mbps with a socket
buffer size of 8 MB, while multiple rate-controlled streams were
effective in stabilizing the bandwidth.
415:
416: \begin{table*}[th]
417: \begin{center}
418: \caption{Parameters and measured performance of US-Japan file replication}
419: \label{tab:us-japan-rep}
420: \begin{tabular}{cccccc}
421: \# node pairs & \# streams & data size & 10-sec average BW
422: & Transfer time & Average BW \\
423: \hline
424: 1 (N1) & 1 (N1x1) & 2 GB & n/a & 162.6 sec & 106 Mbps \\
425: 1 (N1) & 8 (N8x1) & 2 GB & n/a & 124.7 sec & 138 Mbps \\
426: 1 (N1) & 16 (N16x1) & 2 GB & n/a & 113.0 sec & 152 Mbps \\
427: \hline
428: 1 (S1) & 1 (S1x1) & 1 GB & n/a & 193.0 sec & 44.5 Mbps \\
429: 1 (S1) & 8 (S8x1) & 1 GB & 170 Mbps & 91.5 sec & 93.9 Mbps \\
430: 1 (S1) & 16 (S16x1) & 1 GB & n/a & 173.3 sec & 49.6 Mbps \\
431: \hline
432: 2 (N2) & 32 (N16x2) & 2$\times$2 GB & 419 Mbps & 115.9 sec & 297 Mbps \\
433: 3 (N3) & 48 (N16x3) & 2$\times$3 GB & 593 Mbps & 139.6 sec & 369 Mbps \\
434: 4 (N3+S1) & 56 (N16x3+S8x1) & 2$\times$4 GB & 741 Mbps & 150.0 sec & 458 Mbps \\
435: \end{tabular}
436: \end{center}
437: \end{table*}
438:
To limit the traffic to a given rate, the socket buffer size
and the interval between data transmissions were adjusted. Adjusting
the interval was also necessary to suppress an overly fast
increase of the congestion window, which causes serious packet loss.
443:
Using the northern route, 16 streams achieved the best bandwidth of
152 Mbps on average for file replication of a 2 GB file. Using the
southern route, 8 streams achieved the best bandwidth of 93.9 Mbps on
average for file replication of a 1 GB file. Using three node pairs
for the northern route and one node pair for the southern route, the
file replication of an 8 GB file achieved 458 Mbps on average, and 741
Mbps in 10-second peak bandwidth out of 893 Mbps.
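
The average bandwidths in Table~\ref{tab:us-japan-rep} are consistent
with the data size divided by the transfer time, assuming that 1~GB
denotes $2^{30}$ bytes (an assumption on the unit convention). For
example, for the single-stream northern-route case and for the
four-node-pair case,
\begin{eqnarray}
\frac{2 \times 2^{30} \times 8~\mathrm{bit}}{162.6~\mathrm{s}}
 & \approx & 106~\mathrm{Mbps}, \\
\frac{8 \times 2^{30} \times 8~\mathrm{bit}}{150.0~\mathrm{s}}
 & \approx & 458~\mathrm{Mbps}.
\end{eqnarray}
The 10-second peak of 741 Mbps corresponds to about 83\% of the
893 Mbps transpacific capacity.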
451:
452: %parameter Northern route Southern route
453: %socket buffer size 610 KB 250 KB
454: %Traffic control per stream 50 Mbps 28.5 Mbps
455: %\# streams per node pair 16 streams 8 streams
456: %\# nodes 3 hosts 1 host
457: %stripe unit size 128 KB 128 KB
458:
For the SC2002 high-performance bandwidth challenge, the parameters in
Table~\ref{tab:sc2002-bwc} were set up based on the previous
measurements. In the remote site column, `AIST N' means AIST via
the APAN/TransPAC northern route, and `AIST S' means AIST via the
southern route. The `Measured BW' column shows the measured average
bandwidth of file replication of a 1 GB or 2 GB file, which is not the
same as the 5-second or 10-second peak bandwidth.
466:
467: \begin{table*}[tb]
468: \begin{center}
469: \caption{Parameters for file replication and expected bandwidth from
470: and to a 12-node Gfarm cluster at SC2002 booth}
471: \label{tab:sc2002-bwc}
472: \begin{tabular}{cccccc}
473: \multicolumn{6}{c}{Outgoing traffic} \\
474: \# nodes & Remote & \# nodes & \# streams &
475: Socket buffer size, & Measured BW \\
476: in Baltimore & site & at remote site & /node &
477: rate limit & (1-2min avg) \\
478: \hline
479: 3 & SDSC & 5 & 1 & 1 MB & $>$ 60 MB/s \\
480: 2 & Indiana & 8 & 1 & 1 MB & 56.8 MB/s \\
481: 3 & AIST N & 3 & 16 & 610 KB, 50 Mbps & 44.0 MB/s \\
482: 1 & AIST S & 1 & 16 & 346 KB, 28.5 Mbps & 10.6 MB/s \\
483: \hline
9 & S,I,A & 5,8,4 & - & & $>$ 171 MB/s \\
485: & & & & & ($>$ 1.43 Gbps) \\
486: \multicolumn{6}{c}{Incoming traffic} \\
487: \hline
488: 1 & SDSC & 3 & 7 & 7 MB & 23.1 MB/s \\
489: 1* & Indiana & 4 & 1 & 1 MB & 34.9 MB/s \\
490: 1* & AIST N & 1 & 1 & 610 KB, 50 Mbps & n/a \\
491: 1 & AIST S & 1 & 1 & 346 KB & n/a \\
492: \hline
493: 3 & S,I,A & 3,4,2 & - & & $>$ 58 MB/s \\
494: & & & & & ($>$ 487 Mbps)
495: \end{tabular}
496: \end{center}
497: \end{table*}
498:
Provided that there was no unknown congestion and no unexpected packet
drop caused by using several networks simultaneously, the average
bandwidth over one to two minutes was expected to exceed 1.43 Gbps for
outgoing traffic and 487 Mbps for incoming traffic, that is, over
1.92 Gbps in total for both directions.
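
These expectations can be traced to the per-site measurements in
Table~\ref{tab:sc2002-bwc}. The following is a sketch that takes the
SDSC lower bound of 60 MB/s, omits the rows marked n/a, and assumes
that 1~MB denotes $2^{20}$ bytes (an assumption on the unit
convention):
\begin{eqnarray}
60 + 56.8 + 44.0 + 10.6 & = & 171.4~\mathrm{MB/s}, \\
171 \times 2^{20} \times 8~\mathrm{bit/s} & \approx & 1.43~\mathrm{Gbps}, \\
23.1 + 34.9 & = & 58.0~\mathrm{MB/s}, \\
58 \times 2^{20} \times 8~\mathrm{bit/s} & \approx & 487~\mathrm{Mbps}.
\end{eqnarray}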
504:
505: \begin{figure}[tb]
506: \includegraphics[width=\columnwidth]{bwc02.eps}
507: \caption{\label{fig:bwc02} File replication performance in 10-second
508: average between the SC2002 booth and other sites during the SC2002
509: high-performance bandwidth challenge.}
510: \end{figure}
511:
The resulting file replication performance is shown in
Figure~\ref{fig:bwc02} as 10-second average bandwidth. The peak
bandwidth was 1.40 Gbps for outgoing traffic and 0.526 Gbps for
incoming traffic. The 0.1-second average bandwidth measured by
SCinet showed 1.691 Gbps for outgoing traffic and 0.595 Gbps for
incoming traffic, or 2.286 Gbps in total for both directions, using 12
nodes in Baltimore.
519:
520: %-------------------------------------------------------------------------
521: %\section{Related Works}
522:
523: %-------------------------------------------------------------------------
524: \section{Summary and Future Work}
The Grid Datafarm is an architecture for petabyte-scale data-intensive
computing that provides online storage on the scale of ten petabytes,
I/O bandwidth that scales to the TB/s range, and scalable computational
power, all of which are securely and dependably shared on a Grid. This
paper discussed and evaluated the performance of file replication on
the Grid Datafarm.
531:
In the evaluation of network performance in high bandwidth-delay
product networks between the U.S. and Japan, HighSpeed TCP performed
very well on the OC-12 POS transpacific network, and achieved 529
Mbps in 5-second average using two streams on one node pair. On the
other hand, application-level rate control of a HighSpeed TCP
stream was necessary on the OC-12 ATM network to achieve stable
and high bandwidth.
539:
Within the U.S., the file replication showed no performance problem,
while between the U.S. and Japan, application-level rate control of
HighSpeed TCP was also needed for stability. As a result, using
three node pairs for the northern route and one node pair for the
southern route, the file replication of an 8 GB file achieved 741 Mbps
in 10-second average out of 893 Mbps.
546:
Between the SC2002 booth and three other sites, including one in Japan,
the peak bandwidth of file replication in 0.1-second average was
1.691 Gbps for outgoing traffic and 0.595 Gbps for incoming traffic,
or 2.286 Gbps in total for both directions, using 12 nodes at the
SC2002 booth.
552:
The Grid Datafarm can be applied to theoretical or experimental
science that calls for large-scale data analysis and simulation. We
are planning to evaluate it on a world-wide Grid Datafarm testbed using
large-scale production applications such as high-energy physics data
analysis, analysis of all-sky multi-wavelength observational data in
astronomy, and gene analysis in bio-informatics.
559:
560: %-------------------------------------------------------------------------
561: \section*{Acknowledgments}
We would like to thank the PRAGMA members for their kind help,
especially Rick McMullen and John Hicks at Indiana Univ.\ and Phillip
Papadopoulos at SDSC\@.
We are thankful to the web100 and net100 projects for providing a
HighSpeed TCP patch for the Linux kernel. We are grateful to Hisashi
Eguchi at Maffin; Kazunori Konishi, Yoshinori Kitatsuji, and Ayumu
Kubota at APAN; and Chris Robb at Indiana Univ.\ for investigating
bottlenecks in the wide-area networks. We appreciate the great help of
Force10 Networks, Inc.\ in providing the E1200 switch with a 10 gigabit
ethernet network interface. This research was supported by the
Ministry of Economy, Trade and Industry through a research grant of the
Network computing project, and by the Ministry of Education, Culture,
Sports, Science and Technology of Japan through a Grant-in-Aid for
Scientific Research on Priority Areas (2) (No.\ 13224034).
575:
576:
577: %\bibliography{grid}
578: \begin{thebibliography}{9}
579: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
580: \expandafter\ifx\csname bibnamefont\endcsname\relax
581: \def\bibnamefont#1{#1}\fi
582: \expandafter\ifx\csname bibfnamefont\endcsname\relax
583: \def\bibfnamefont#1{#1}\fi
584: \expandafter\ifx\csname citenamefont\endcsname\relax
585: \def\citenamefont#1{#1}\fi
586: \expandafter\ifx\csname url\endcsname\relax
587: \def\url#1{\texttt{#1}}\fi
588: \expandafter\ifx\csname urlprefix\endcsname\relax\def\urlprefix{URL }\fi
589: \providecommand{\bibinfo}[2]{#2}
590: \providecommand{\eprint}[2][]{\url{#2}}
591:
592: %\bibitem[{\citenamefont{Tatebe et~al.}(2002)\citenamefont{Tatebe, Morita,
593: % Matsuoka, Soda, and Sekiguchi}}]{gfarm-ccgrid2002}
594: \bibitem{gfarm-ccgrid2002}
595: \bibinfo{author}{\bibfnamefont{O.}~\bibnamefont{Tatebe}},
596: \bibinfo{author}{\bibfnamefont{Y.}~\bibnamefont{Morita}},
597: \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Matsuoka}},
598: \bibinfo{author}{\bibfnamefont{N.}~\bibnamefont{Soda}}, \bibnamefont{and}
599: \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Sekiguchi}}, in
600: \emph{\bibinfo{booktitle}{Proceedings of the 2nd IEEE/ACM International
601: Symposium on Cluster Computing and the Grid (CCGrid 2002)}}
602: (\bibinfo{year}{2002}), pp. \bibinfo{pages}{102--110}.
603:
604: \bibitem[{\citenamefont{Floyd}(2003)}]{HighSpeedTCP}
605: \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Floyd}}, in
606: \emph{\bibinfo{booktitle}{Internet draft, draft-floyd-tcp-highspeed-02.txt}}
607: (\bibinfo{year}{2003}), \bibinfo{note}{\url{http://www.icir.org/floyd/hstcp.html}}.
608:
609: \bibitem[{Sca()}]{ScalableTCP}
610: \emph{\bibinfo{title}{Scalable {TCP}}},
611: \bibinfo{note}{\url{http://www-lce.eng.cam.ac.uk/~ctk21/scalable/}}.
612:
613: \bibitem[{FAS()}]{FASTTCP}
614: \emph{\bibinfo{title}{{FAST} {TCP}}},
615: \bibinfo{note}{\url{http://netlab.caltech.edu/FAST/}}.
616:
617: \bibitem[{\citenamefont{Katabi et~al.}(2002)\citenamefont{Katabi, Handley, and
618: Rohrs}}]{XCP-SIGCOMM2002}
619: \bibinfo{author}{\bibfnamefont{D.}~\bibnamefont{Katabi}},
620: \bibinfo{author}{\bibfnamefont{M.}~\bibnamefont{Handley}}, \bibnamefont{and}
621: \bibinfo{author}{\bibfnamefont{C.}~\bibnamefont{Rohrs}}, in
  \emph{\bibinfo{booktitle}{Proceedings of ACM SIGCOMM 2002 Conference}}
623: (\bibinfo{year}{2002}).
624:
625: \bibitem[{gfa()}]{gfarm-home}
626: \emph{\bibinfo{title}{Grid Datafarm}},
627: \bibinfo{note}{\url{http://datafarm.apgrid.org/}}.
628:
629: \bibitem[{apa()}]{apan-transpac}
630: \emph{\bibinfo{title}{{APAN}/{T}rans{PAC}}},
631: \bibinfo{note}{\url{http://www.transpac.org/}}.
632:
633: \bibitem[{maf()}]{maffin}
634: \emph{\bibinfo{title}{Maffin}},
635: \bibinfo{note}{\url{http://www.maffin.ad.jp/}}.
636:
637: \bibitem[{ipe()}]{iperf}
638: \emph{\bibinfo{title}{Iperf}},
639: \bibinfo{note}{\url{http://dast.nlanr.net/Projects/Iperf/}}.
640:
641: \end{thebibliography}
642:
643: \end{document}
644: