cs0306090/cr.tex
1: \documentclass[twocolumn,twoside,slac]{revtex4}
2: \usepackage{graphicx}
3: \usepackage{fancyhdr}
4: 
5: % $Id$
6: 
7: \pagestyle{fancy}
8: \fancyhead{} % clear all fields
9: \fancyhead[C]{\it {2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, Ca, USA, March 2003}} \fancyhead[RO,LE]{\thepage}
10: \fancyfoot{} % clear all fields
11: \fancyfoot[LE,LO]{\bf THAT006}
12: \renewcommand{\headrulewidth}{0pt}
13: \renewcommand{\footrulewidth}{0pt}
14: \renewcommand{\sfdefault}{phv}
15: 
16: \setlength{\textheight}{235mm}
17: \setlength{\textwidth}{170mm}
18: \setlength{\topmargin}{-20mm}
19: 
20: \bibliographystyle{apsrev}
21: 
22: \begin{document}
23: 
24: \title{Worldwide Fast File Replication on Grid Datafarm}
25: 
26: \author{Osamu Tatebe, Satoshi Sekiguchi}
27: \affiliation{AIST, Tsukuba, Ibaraki 3058568, JAPAN}
28: %
29: \author{Youhei Morita}
30: \affiliation{KEK, Tsukuba, Ibaraki 3050801, JAPAN}
31: %
32: \author{Satoshi Matsuoka}
33: \affiliation{Tokyo Institute of Technology, Meguro, Tokyo 152-8552, JAPAN}
34: %
35: \author{Noriyuki Soda}
36: \affiliation{Software Research Associates, Inc., Naka, Nagoya, 4600003, JAPAN}
37: 
38: \begin{abstract}
39: The Grid Datafarm architecture is designed for global petascale
40: data-intensive computing.  It provides a global parallel filesystem
41: with online petascale storage, scalable I/O bandwidth, and scalable
42: parallel processing, and it can exploit local I/O in a grid of
43: clusters with tens of thousands of nodes.  One of features is that it
44: manages file replicas in filesystem metadata for fault tolerance and
45: load balancing.
46: 
47: This paper discusses and evaluates several techniques to support
48: long-distance fast file replication.  The Grid Datafarm manages a
49: ranked group of files as a Gfarm file, each file, called a Gfarm file
50: fragment, being stored on a filesystem node, or replicated on several
51: filesystem nodes.  Each Gfarm file fragment is replicated
52: independently and in parallel using rate-controlled HighSpeed TCP with
53: network striping.  On a US-Japan testbed with 10,000 km distance, we
54: achieve 419 Mbps using 2 nodes on each side, and 741 Mbps using 4
55: nodes out of 893 Mbps with two transpacific networks.
56: \end{abstract}
57: 
58: \maketitle
59: 
60: \thispagestyle{fancy}
61: 
62: %------------------------------------------------------------------------- 
63: \section{Introduction}
64: \label{sec:intro}
65: Petascale data intensive computing wave has been coming such as
66: high-energy physics data analysis, astronomical data analysis, and
67: bio-informatics data analysis.  More than ten petabyte storage needs
68: to be shared and analyzed by world-wide users with high efficiency,
69: high security, and high dependability.
70: 
71: The Grid Datafarm architecture \cite{gfarm-ccgrid2002} is designed for
72: global petascale data-intensive computing to enable the process of
73: large amounts of data at multiple regional PC clusters.  The aim of
74: this research is to establish a large-scale parallel filesystem by
75: exploiting local storage of cluster nodes spread in the extensive
76: area, a platform system needed to support a petabyte scale data
77: intensive computing.  The Grid Datafarm architecture enables
78: high-speed access to a large amount of data by utilizing file access
79: locality, and realizes fault tolerance of disks and networks by data
80: replication.
81: 
82: This paper discusses about long-distance fast file replication on the
83: Grid Datafarm.  To improve the performance in high bandwidth-delay
84: product networks, congestion control keeping efficient, fair,
85: scalable, and stable plays a key role.  The easiest way to improve the
86: performance is to open multiple TCP connections in parallel, while
87: this approach leaves the parameter of the number of connections to be
88: determined by the user, which may result in heavy congestion with too
89: much number of connections.  There are several researches addressing
90: this issue such as HighSpeed TCP \cite{HighSpeedTCP}, Scalable TCP
91: \cite{ScalableTCP}, FAST TCP \cite{FASTTCP}, and XCP
92: \cite{XCP-SIGCOMM2002}.  HighSpeed TCP is an attempt to improve
93: congestion control of TCP for large congestion windows with better
94: flexibility, better scaling, better slow-start behavior, and competing
95: more fairly with current TCP, keeping backward compatibility and
96: incremental deployment.  It modifies the TCP response function only
97: for large congestion windows to reach high bandwidth reasonably
98: quickly when in slow-start, and to reach high bandwidth without overly
99: long delays when recovering from multiple retransmit timeouts, or when
100: ramping-up from a period with small congestion windows.
101: 
102: For file replication of large files in high bandwidth-delay product
103: networks, it is also necessary to improve disk I/O performance not
104: only the network performance.  At this time, each cluster node has
105: capability to transmit at a rate of 1 Gbps, while the performance of
106: an IDE or a SCSI disk is at most 50 MB/s.  To improve the disk I/O
107: bandwidth, disk striping such as RAID-0 is effective.
108: 
109: This paper is organized as follows.  In Section~\ref{sec:rep}, the
110: file replication on the Grid Datafarm is discussed.
111: Section~\ref{sec:eval} evaluates the network performance and the file
112: replication performance using a US-Japan Grid Datafarm testbed.
113: 
114: %------------------------------------------------------------------------- 
115: %\section{HighSpeed TCP}
116: %TCP connections with larger congestion windows
117: %modify the TCP response function
118: 
119: %------------------------------------------------------------------------- 
120: \section{File Replication on Grid Datafarm}
121: \label{sec:rep}
122: The Grid Datafarm provides a Grid file system that federates multiple
123: local filesystems on a Grid across administrative domains.  The Grid
124: file system provides virtualized hierarchical namespaces for files
125: having consistent access control with flexible capabilities
126: management.  There is a replica catalog to manage mappings from the
127: hierarchical namespace for files to one or more physical file
128: locations.  This enables efficient, dependable, and transparent file
129: sharing on a Grid.
130: 
131: The Grid Datafarm has a feature for data parallel execution.  It
132: manages a ranked group of files as a single Gfarm file.  This makes it
133: possible to manage a lot of distributed files as a single file, which
134: will be analyzed in parallel.  Each parallel process possibly
135: generates a new set of output files also managed as a single file.
136: File-affinity scheduling and new concept of a file view enable the
137: ``owner computes'' strategy, or ``move the computation to data''
138: approach for the parallel data analysis.
139: 
140: When replicating a Gfarm file, each file of the Gfarm file, or a group
141: of files, stored on a different filesystem node can be replicated in
142: parallel and independently.  File replication on the Grid Datafarm is
143: considered to be a parallel file replication from multiple cluster
144: nodes to (different) multiple cluster nodes.
145: 
146: In high bandwidth-delay product networks, multiple TCP streams, or
147: network striping, is effective to improve the performance, while disk
148: accesses to striping data decreases the performance of the disk I/O\@.
149: It is necessary to utilize a modified TCP or other protocols to
150: achieve high performance with a single stream.  HighSpeed TCP is one
151: of proposals of the modification of congestion control of TCP, which
152: is utilized by the performance evaluation on a US-Japan Grid Datafarm
153: testbed.
154: 
155: As described in Section~\ref{sec:intro}, the disk I/O performance is
156: poorer than the network bandwidth on each cluster node.  One of
157: requirements from the Grid Datafarm architecture is a dense and
158: high-performance storage on each node for online petabyte-scale
159: storage.  To meet this requirement, each node of the AIST Gfarm
160: cluster was designed to be a 1U server having a 3ware RAID card with
161: four 120GB IDE disks in RAID 0 that achieves 480 GB storage capacity
162: and over 110 MB/s for contiguous block reads and writes that is
163: comparable with the network performance.
164: 
165: %------------------------------------------------------------------------- 
166: \section{Performance Evaluation}
167: \label{sec:eval}
168: 
169: \subsection{PC clusters and wide-area networks}
170: \label{ssec:env}
171: During the international conference SC2002, held in Baltimore from
172: November 16th to November 22nd of 2002, a Grid Datafarm Data Grid
173: environment was set up linking seven PC clusters in both Japan and the
174: U.S.
175: 
176: \begin{figure}[tb]
177:  \includegraphics[width=\columnwidth]{network2-en.eps}
178:  \caption{\label{fig:env} Network logical map of world-wide Grid
179:  Datafarm testbed.  Four sites in Japan and three sites in U.S. are
180:  integrated with the Grid Datafarm Data Grid middleware.}
181: \end{figure}
182: 
183: Seven systems of PC clusters, including the one at the SC2002 booth (a
184: total of 190 PCs) were located in the research centers of both Japan
185: and the U.S. (AIST, High Energy Accelerator Research Organization,
186: Tokyo Institute of Technology, the University of Tokyo, Indiana
187: University and San Diego Supercomputer Center (SDSC)).  They were
188: integrated with the Grid Datafarm Data Grid middleware
189: \cite{gfarm-home} (Figure~\ref{fig:env}).  Tsukuba WAN, APAN/TransPAC
190: \cite{apan-transpac} and Maffin \cite{maffin} supported in
191: establishing the network.
192: 
193: The system had the peak floating-point performance of 962 Gflops.  It
194: was equipped with a large capacity file system of 18 TB at 6,600 MB/s
195: access rate as a Gfarm wide-area filesystem.
196: 
197: This environment utilized multiple high-speed wide-area networks, that
198: is, Tsukuba WAN and SuperSINET in Japan, APAN/TransPAC and NII-ESnet
199: HEP PVC between Japan and the U.S., Abilene and ESnet in the U.S. and
200: SCinet at the SC2002 booth.  The bandwidth from SC2002 to both Indiana
201: University and SDSC was 622 Mbps.  In the transpacific network, it was
202: 893 Mbps.  Total theoretical maximum bandwidth of the network linking
203: seven clusters was 2.173 Gbps one way (See Figure~\ref{fig:env}).
204: 
205: For the file replication, a large amount of scientific data taken from
206: particle physics was generated mainly in the large-scale PC cluster of
207: Tokyo Institutes of Technology and created data replicas of several
208: hundreds of GB at each of the other clusters in a single filesystem
209: image.
210: 
211: At the SC2002 booth in Baltimore, there was a 12-node AIST Gfarm
212: cluster connected with gigabit ethernet that connects to the SCinet
213: with 10 gigabit ethernet using the Force10 E1200 switch.  Each node
214: consisted of a dual Intel Xeon 2.4GHz processor, 1GB memory, and a
215: 3ware Escalade 7500-4 RAID controller with four 120GB 3.5'' HDDs,
216: which was configured in RAID-0.  The disk I/O performance for
217: contiguous blocks achieved 109 MB/s for writes and 168 MB/s for reads.
218: The network bandwidth of gigabit ethernet was 941 Mbps using the iperf
219: bandwidth measurement tool \cite{iperf}.  File replication performance
220: was 75 MB/s, that was equivalent to 629 Mbps, using the Gfarm Data
221: Grid Middleware.
222: 
223: % Fri May 16 01:34:16 JST 2003
224: %gfm65% ./thput-fsys -rw -b 65536 -s 10240
225: %testing with 10240 MB file
226: %bufsize     write [bytes/sec]     read [bytes/sec]
227: %  65536   108959734            196902836
228: %gfm65% ./thput-fsys -rw -b 65536 -s 20480
229: %testing with 20480 MB file
230: %bufsize     write [bytes/sec]     read [bytes/sec]
231: %  65536   113479404            185055488
232: %gfm65% ./thput-fsys -rw -b 65536 -s 40960
233: %testing with 40960 MB file
234: %bufsize     write [bytes/sec]     read [bytes/sec]
235: %  65536   115339604            181332188
236: %gfm65% ./thput-fsys -rw -b 65536 -s 81920
237: %testing with 81920 MB file
238: %bufsize     write [bytes/sec]     read [bytes/sec]
239: %  65536   114586309            176661637
240: 
241: At the Grid Technology Research Center, AIST in Tsukuba, Japan, there
242: was the same 7-node AIST Gfarm cluster that connects to the Tokyo XP
243: with gigabit ethernet via Tsukuba WAN and Maffin networks.
244: 
245: At the Indiana University, there was a 15-node PC cluster connected
246: with Fast Ethernet connects to Indianapolis GigaPoP with OC-12.  The
247: disk I/O performance for contiguous blocks was 9.3 MB/s for writes and
248: 10.2 MB/s for reads.
249: 
250: %gfarm01% thput-fsys -b 65536 -s 2560 -rw
251: %testing with 2560 MB file
252: %bufsize     write [bytes/sec]     read [bytes/sec]
253: %  65536     9710990             10711022          
254: 
255: At the SDSC in San Diego, there was a 8-node PC cluster connected with
256: gigabit ethernet that connects to outside with OC-12.  The disk I/O
257: performance for contiguous blocks was 29.4 MB/s for writes and 20.0
258: for reads.
259: 
260: %slic04% thput-fsys -rw -b 65536 -s 10240
261: %testing with 10240 MB file
262: %bufsize     write [bytes/sec]     read [bytes/sec]
263: %  65536    30840332             21006343          
264: 
265: APAN/TransPAC transpacific network consisted of two links; the
266: northern route (OC-12 POS) between Seattle and Tokyo, and the southern
267: route (OC-12 ATM) between Chicago and Tokyo.  The southern route was
268: shaped to 271 Mbps.  By default, all IP packets were transmitted via
269: the northern route.  To utilize the both routes, we configured a
270: static route such that every traffic between specific three nodes at
271: SC2002 booth and AIST was transmitted via the southern route.
272: 
273: \begin{table}[tb]
274: \caption{Round trip time between the SC2002 booth and other sites.
275: 'AIST (N)' and 'AIST (S)' mean that IP packets are transmitted via the
276: northern route and via the southern route, respectively}
277: \label{tab:rtt}
278: \begin{center}
279: \begin{tabular}{cc}
280: \hline
281: AIST (N) & 199 msec \\
282: AIST (S) & 222 msec \\
283: Indiana & 30 msec \\
284: SDSC & 86 msec \\
285: \hline
286: \end{tabular}
287: \end{center}
288: \end{table}
289: 
290: The round trip time (RTT) of IP packets between the SC2002 booth and
291: other sites is shown in the Table~\ref{tab:rtt}.
292: 
293: %------------------------------------------------------------------------- 
294: \subsection{HighSpeed TCP over transpacific network}
295: Figure~\ref{fig:sc-gfm} shows the measured network bandwidth of the
296: HighSpeed TCP from the SC2002 booth in U.S. to the AIST in Japan via
297: the APAN/TransPAC northern route.  The bandwidth was measured using
298: the iperf from one node to one node with two HighSpeed TCP streams.
299: The buffer size of each socket was 8 MB, which gave the theoretical
300: peak bandwidth 337 Mbps for one connection with the RTT 199 ms.  From
301: the Figure~\ref{fig:sc-gfm}, the measured peak bandwidth achieved 529
302: Mbps in 5-second average out of the physical network bandwidth 622
303: Mbps.  Due to the packet loss, the bandwidth occasionally dropped,
304: however, it was recovered reasonably quickly thanks to the HighSpeed
305: TCP\@.
306: 
307: % 289 Mbps 5-second peak bandwidth
308: 
309: \begin{figure}[tb]
310:  \includegraphics[width=\columnwidth]{sc-gfm.eps}
311:  \caption{\label{fig:sc-gfm} HighSpeed TCP bandwidth from U.S. to
312:  Japan via the APAN/TransPAC northern route (OS-12 POS).}
313: \end{figure}
314: 
315: Figure~\ref{fig:gfm-sc} shows the measured HighSpeed TCP bandwidth
316: from Japan to U.S. via the northern route.  The machine and network
317: configurations were the same as the previous measurement except the
318: traffic direction.  Because the traffic was slightly heavy at this
319: time, the measured peak bandwidth was 443 Mbps.  After the heavy
320: packet loss, the bandwidth was recovered slowly just like the regular
321: TCP\@.
322: 
323: \begin{figure}[tb]
324:  \includegraphics[width=\columnwidth]{gfm-sc.eps}
325:  \caption{\label{fig:gfm-sc} HighSpeed TCP bandwidth from Japan to
326:  U.S.  via the APAN/TransPAC northern route (OS-12 POS)}
327: \end{figure}
328: 
329: Figure~\ref{fig:aist-sdsc-ge} is the case using the APAN/TransPAC
330: southern route.  Since the southern route was shaped to 271 Mbps, one
331: HighSpeed TCP stream would be able to fill the bandwidth.  However,
332: the stream suffered the critical packet loss, and only achieved 85.9
333: Mbps in 10-minutes average although 251 Mbps in 5-second peak
334: bandwidth.  One of reasons for the critical packet loss was the
335: setting of an ATM switch configured without the random early drop.  To
336: cope with this problem, it is necessary to control the network traffic
337: rate not to excess the physical network bandwidth, that is, 271 Mbps.
338: 
339: \begin{figure}[tb]
340:  \includegraphics[width=\columnwidth]{aist-sdsc-ge.eps}
341:  \caption{\label{fig:aist-sdsc-ge} HighSpeed TCP bandwidth from Japan
342:  to U.S. via the APAN/TransPAC southern route (OC-12 ATM, 271 Mbps
343:  shaping) with one non-rate-limited HighSpeed TCP stream.}
344: \end{figure}
345: 
346: When the maximum bandwidth of one HighSpeed TCP stream was limited to
347: 100 Mbps, it was possible to achieve stable and high bandwidth shown
348: by Figure~\ref{fig:aist-sdsc-fe2}.  This rate control was realized by
349: changing the network from gigabit ethernet to fast ethernet.  This
350: case achieved 166.1 Mbps in 10-minutes average, and 190.0 Mbps in
351: 5-second peak bandwidth.
352: 
353: \begin{figure}[tb]
354:  \includegraphics[width=\columnwidth]{aist-sdsc-fe2.eps}
355:  \caption{\label{fig:aist-sdsc-fe2} HighSpeed TCP bandwidth from Japan
356:  to U.S. via the APAN/TransPAC southern route (OC-12 ATM, 271 Mbps
357:  shaping) with two 100-Mbps HighSpeed TCP streams.}
358: \end{figure}
359: 
360: %------------------------------------------------------------------------- 
361: \subsection{File replication}
362: The performance of file replication of large files that do not fit the
363: main memory is limited by the disk I/O performance and the network
364: performance.  When replicating between sites, the file replication
365: performance is also limited by the bandwidth of the wide-area network
366: shown by Figure~\ref{fig:env}.  Table~\ref{tab:max-rep} shows the
367: performance limit of file replication with one node at each site.
368: 
369: \begin{table}[tb]
370: \begin{center}
371: \caption{Performance limit of file replication using one node at each
372: site in MB/s.}
373: \label{tab:max-rep}
374: \begin{tabular}{c|cc}
375:         &   To   &   From \\
376: \hline
377: Indiana &   9.3  &   10.2 \\
378: SDSC    &  29.4  &   20.0 \\
379: AIST    & 109    &  112   \\
380: SC2002  & 109    &  112   \\
381: \end{tabular}
382: \end{center}
383: \end{table}
384: 
385: Figure~\ref{fig:sc-ussites} shows the performance of file replication
386: between one node at the SC2002 booth and various number of nodes at
387: Indiana Univ.\ and SDSC\@.  Between the SC2002 booth and Indiana
388: Univ., the file replication performance increased almost proportional
389: to the number of nodes at Indiana Univ., and achieved the maximum
390: performance of 34.9 MB/s, that is, 293 Mbps, from four nodes at
391: Indiana Univ.\ to one node at the SC2002 booth with a 4GB file.
392: 
393: \begin{figure}[tb]
394:  \includegraphics[width=\columnwidth]{sc-ussites.eps}
395:  \caption{\label{fig:sc-ussites} File replication performance between
396:  one node at SC2002 booth and several nodes at Indiana Univ.\ and SDSC\@.}
397: \end{figure}
398: 
399: From the SC2002 booth to SDSC, the file replication performance of a
400: 2GB file achieved 32.8 MB/s even with one node at SDSC\@.  On the
401: other hand, the performance from SDSC to the SC2002 booth showed a
402: different tendency to increase scalable with respect to the number of
403: nodes at SDSC\@.  One reason was this direction requires multiple
404: HighSpeed TCP streams to improve the network performance because one
405: stream achieved only 1.1 to 2.3 Mbps.  For the file replication from
406: SDSC to the SC2002 booth, seven parallel streams were used in each
407: node pair.
408: 
409: Table~\ref{tab:us-japan-rep} showed parameters and the measured
410: bandwidth of file replication from Baltimore in U.S. to Tsukuba in
411: Japan at a distance of more than 10,000 km.  As shown in the previous
412: section, a HighSpeed TCP stream achieved about 260 Mbps with socket
413: buffer size 8 MB, while multiple rate-controlled streams were
414: effective to stabilize the bandwidth.
415: 
416: \begin{table*}[th]
417: \begin{center}
418: \caption{Parameters and measured performance of US-Japan file replication}
419: \label{tab:us-japan-rep}
420: \begin{tabular}{cccccc}
421: \# node pairs & \# streams & data size & 10-sec average BW
422:  & Transfer time & Average BW \\
423: \hline
424: 1 (N1)    &  1 (N1x1)  & 2 GB & n/a & 162.6 sec & 106 Mbps \\
425: 1 (N1)    &  8 (N8x1)  & 2 GB & n/a & 124.7 sec & 138 Mbps \\
426: 1 (N1)    & 16 (N16x1) & 2 GB & n/a  & 113.0 sec & 152 Mbps \\
427: \hline
428: 1 (S1)    &  1 (S1x1)  & 1 GB & n/a   & 193.0 sec & 44.5 Mbps \\
429: 1 (S1)    &  8 (S8x1)  & 1 GB & 170 Mbps &  91.5 sec & 93.9 Mbps \\
430: 1 (S1)    & 16 (S16x1) & 1 GB & n/a   & 173.3 sec & 49.6 Mbps \\
431: \hline
432: 2 (N2)    & 32 (N16x2) & 2$\times$2 GB & 419 Mbps & 115.9 sec & 297 Mbps \\
433: 3 (N3)    & 48 (N16x3) & 2$\times$3 GB & 593 Mbps & 139.6 sec & 369 Mbps \\
434: 4 (N3+S1) & 56 (N16x3+S8x1) & 2$\times$4 GB & 741 Mbps & 150.0 sec & 458 Mbps \\
435: \end{tabular}
436: \end{center}
437: \end{table*}
438: 
439: To control and limit the traffic at any rate, the socket buffer size
440: and the interval of sending data were adjusted.  The interval of
441: sending data was also needed to be adjusted to suppress too fast
442: increase of the congestion window that causes the serious packet loss.
443: 
444: Using the northern route, 16 streams achieved the best bandwidth of
445: 152 Mbps in average for file replication of a 2 GB file.  Using the
446: southern route, 8 streams achieved the best bandwidth of 93.9 Mbps in
447: average for file replication of a 1 GB file.  Using three node pairs
448: for the northern route and one node pair for the southern route, the
449: file replication of a 8 GB file achieved 458 Mbps in average, and 741
450: Mbps in 10-second peak bandwidth out of 893 Mbps.
451: 
452: %parameter               Northern route   Southern route
453: %socket buffer size           610 KB          250 KB
454: %Traffic control per stream    50 Mbps         28.5 Mbps
455: %\# streams per node pair      16 streams       8 streams
456: %\# nodes                       3 hosts         1 host
457: %stripe unit size             128 KB           128 KB
458: 
459: For the SC2002 high-performance bandwidth challenge, parameters of
460: Table~\ref{tab:sc2002-bwc} was set up based on the previous
461: measurement.  In the remote site column, `AIST N' means the AIST via
462: the APAN/TransPAC northern route, and `AIST S' means the AIST via the
463: southern route.  The `Measured BW' column shows the measured average
464: bandwidth of file replication of a 1 GB or 2 GB file, which is not the
465: same as 5-second or 10-second peak bandwidth.
466: 
467: \begin{table*}[tb]
468: \begin{center}
469: \caption{Parameters for file replication and expected bandwidth from
470: and to a 12-node Gfarm cluster at SC2002 booth}
471: \label{tab:sc2002-bwc}
472: \begin{tabular}{cccccc}
473: \multicolumn{6}{c}{Outgoing traffic} \\
474: \# nodes & Remote & \# nodes & \# streams &
475: Socket buffer size, & Measured BW \\
476: in Baltimore & site & at remote site & /node &
477: rate limit & (1-2min avg) \\
478: \hline
479:  3 &  SDSC    & 5 &  1 & 1 MB              & $>$ 60 MB/s \\
480:  2 &  Indiana & 8 &  1 & 1 MB              &  56.8 MB/s \\
481:  3 &  AIST N  & 3 & 16 & 610 KB, 50 Mbps   &  44.0 MB/s \\
482:  1 &  AIST S  & 1 & 16 & 346 KB, 28.5 Mbps &  10.6 MB/s \\
483: \hline
484:  9 & S,I,A    & 5,8,4 & - & & $>$ 171 Mbps \\
485:    &          &       &   & & ($>$ 1.43 Gbps) \\
486: \multicolumn{6}{c}{Incoming traffic} \\
487: \hline
488:  1  &  SDSC    & 3 &  7 & 7 MB             &  23.1 MB/s \\
489:  1* &  Indiana & 4 &  1 & 1 MB             &  34.9 MB/s \\
490:  1* &  AIST N  & 1 &  1 & 610 KB, 50 Mbps  &   n/a \\
491:  1  &  AIST S  & 1 &  1 & 346 KB           &   n/a \\
492: \hline
493:  3  & S,I,A    & 3,4,2 & - & & $>$ 58 MB/s \\
494:     &          &       &   & & ($>$ 487 Mbps)
495: \end{tabular}
496: \end{center}
497: \end{table*}
498: 
499: The average bandwidth in one to two minutes would be expected to be
500: achieved over 1.43 Gbps for outgoing traffic and over 487 Mbps for
501: incoming traffic, over 1.92 Gbps in total for both directions, if
502: there were no unknown congestion and no unexpected packet drop by
503: using several networks simultaneously.
504: 
505: \begin{figure}[tb]
506:  \includegraphics[width=\columnwidth]{bwc02.eps}
507:  \caption{\label{fig:bwc02} File replication performance in 10-second
508:  average between the SC2002 booth and other sites during the SC2002
509:  high-performance bandwidth challenge.}
510: \end{figure}
511: 
512: As a result, the file replication performance was shown by
513: Figure~\ref{fig:bwc02} in 10-second average bandwidth.  The peak
514: bandwidth was 1.40 Gbps for outgoing traffic, and 0.526 Gbps for
515: incoming traffic.  The 0.1-second average bandwidth measured by the
516: SCinet showed 1.691 Gbps for outgoing traffic, and 0.595 Gbps for
517: incoming traffic, 2.286 Gbps in total for both directions using 12
518: nodes in Baltimore.
519: 
520: %------------------------------------------------------------------------- 
521: %\section{Related Works}
522: 
523: %------------------------------------------------------------------------- 
524: \section{Summary and Future Work}
525: The Grid Datafarm is an architecture for petabyte-scale data-intensive
526: computing providing online ten petabyte-scale storage, an I/O
527: bandwidth scales to the TB/s range and scalable computational power,
528: which is securely and dependably shared on a Grid.  This paper
529: discussed and evaluated the performance of file replication on the
530: Grid Datafarm.
531: 
532: For the evaluation of the network performance in high bandwidth-delay
533: product networks between U.S. and Japan, the HighSpeed TCP performed
534: very well on the transpacific network of OC-12 POS, and achieved 529
535: Mbps in 5-second average using two streams of one node pair.  On the
536: other hand, the application-level rate-control of a HighSpeed TCP
537: stream was necessary for the network of OC-12 ATM to achieve stable
538: and high bandwidth.
539: 
540: Within the U.S., the file replication showed any performance problem,
541: while between U.S. and Japan, application-level rate-control of a
542: HighSpeed TCP were also needed for stability.  As a result, using
543: three node pairs for the northern route and one node pair for the
544: southern route, the file replication of a 8 GB file achieved 741 Mbps
545: in 10-second average out of 893 Mbps.
546: 
547: Between the SC2002 booth and other three sites including a Japan site,
548: the peak bandwidth of file replication in 0.1-second average showed
549: 1.691 Gbps for outgoing traffic, and 0.595 Gbps for incoming traffic,
550: 2.286 Gbps in total for both directions using 12 nodes in the SC2002
551: booth.
552: 
553: The Grid Datafarm can be applied to theoretical or experimental
554: science that calls upon large-scale data analysis and simulation.  We
555: are planning to evaluate it using large-scale production applications
556: such as high-energy physics data analysis, analysis of observational
557: data of all-sky multiple wavelength bands in astronomy, gene analysis
558: in bio-informatics and so on on a world-wide Grid Datafarm testbed.
559: 
560: %------------------------------------------------------------------------- 
561: \section*{Acknowledgments}
562: We would like to thank kind help for PRAGMA members, especially, Rick
563: McMullen, John Hicks at Indiana Univ., Phillip Papadopoulos at SDSC\@.
564: We are thankful to the web100 and net100 projects for providing a
565: HighSpeed TCP patch for a linux kernel.  We are grateful to Hisashi
566: Eguchi at Maffin, Kazunori Konishi, Yoshinori Kitatsuji, and Ayumu
567: Kubota at APAN, Chris Robb at Indiana Univ.\ for investigation of
568: bottlenecks of wide-area networks.  We appreciate great help of Force
569: 10 Networks, Inc.\ for providing the E1200 switch with 10 gigabit
570: ethernet network interface.  This research was supported by the
571: Ministry of Economy, Trade and Industry through research grant of
572: Network computing project, and the Ministry of Education, Culture,
573: Sports, Science and Technology of Japan through a Grant-in-Aid for
574: Scientific Research on Priority Areas (2) (No.\ 13224034).
575: 
576: 
577: %\bibliography{grid}
578: \begin{thebibliography}{9}
579: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
580: \expandafter\ifx\csname bibnamefont\endcsname\relax
581:   \def\bibnamefont#1{#1}\fi
582: \expandafter\ifx\csname bibfnamefont\endcsname\relax
583:   \def\bibfnamefont#1{#1}\fi
584: \expandafter\ifx\csname citenamefont\endcsname\relax
585:   \def\citenamefont#1{#1}\fi
586: \expandafter\ifx\csname url\endcsname\relax
587:   \def\url#1{\texttt{#1}}\fi
588: \expandafter\ifx\csname urlprefix\endcsname\relax\def\urlprefix{URL }\fi
589: \providecommand{\bibinfo}[2]{#2}
590: \providecommand{\eprint}[2][]{\url{#2}}
591: 
592: %\bibitem[{\citenamefont{Tatebe et~al.}(2002)\citenamefont{Tatebe, Morita,
593: %  Matsuoka, Soda, and Sekiguchi}}]{gfarm-ccgrid2002}
594: \bibitem{gfarm-ccgrid2002}
595: \bibinfo{author}{\bibfnamefont{O.}~\bibnamefont{Tatebe}},
596:   \bibinfo{author}{\bibfnamefont{Y.}~\bibnamefont{Morita}},
597:   \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Matsuoka}},
598:   \bibinfo{author}{\bibfnamefont{N.}~\bibnamefont{Soda}}, \bibnamefont{and}
599:   \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Sekiguchi}}, in
600:   \emph{\bibinfo{booktitle}{Proceedings of the 2nd IEEE/ACM International
601:   Symposium on Cluster Computing and the Grid (CCGrid 2002)}}
602:   (\bibinfo{year}{2002}), pp. \bibinfo{pages}{102--110}.
603: 
604: \bibitem[{\citenamefont{Floyd}(2003)}]{HighSpeedTCP}
605: \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Floyd}}, in
606:   \emph{\bibinfo{booktitle}{Internet draft, draft-floyd-tcp-highspeed-02.txt}}
607:   (\bibinfo{year}{2003}), \bibinfo{note}{\url{http://www.icir.org/floyd/hstcp.html}}.
608: 
609: \bibitem[{Sca()}]{ScalableTCP}
610: \emph{\bibinfo{title}{Scalable {TCP}}},
611:   \bibinfo{note}{\url{http://www-lce.eng.cam.ac.uk/~ctk21/scalable/}}.
612: 
613: \bibitem[{FAS()}]{FASTTCP}
614: \emph{\bibinfo{title}{{FAST} {TCP}}},
615:   \bibinfo{note}{\url{http://netlab.caltech.edu/FAST/}}.
616: 
617: \bibitem[{\citenamefont{Katabi et~al.}(2002)\citenamefont{Katabi, Handley, and
618:   Rohrs}}]{XCP-SIGCOMM2002}
619: \bibinfo{author}{\bibfnamefont{D.}~\bibnamefont{Katabi}},
620:   \bibinfo{author}{\bibfnamefont{M.}~\bibnamefont{Handley}}, \bibnamefont{and}
621:   \bibinfo{author}{\bibfnamefont{C.}~\bibnamefont{Rohrs}}, in
622:   \emph{\bibinfo{booktitle}{Proceedings of ACM SIGCOMM 2002 Concerence}}
623:   (\bibinfo{year}{2002}).
624: 
625: \bibitem[{gfa()}]{gfarm-home}
626: \emph{\bibinfo{title}{Grid Datafarm}},
627:   \bibinfo{note}{\url{http://datafarm.apgrid.org/}}.
628: 
629: \bibitem[{apa()}]{apan-transpac}
630: \emph{\bibinfo{title}{{APAN}/{T}rans{PAC}}},
631:   \bibinfo{note}{\url{http://www.transpac.org/}}.
632: 
633: \bibitem[{maf()}]{maffin}
634: \emph{\bibinfo{title}{Maffin}},
635:   \bibinfo{note}{\url{http://www.maffin.ad.jp/}}.
636: 
637: \bibitem[{ipe()}]{iperf}
638: \emph{\bibinfo{title}{Iperf}},
639:   \bibinfo{note}{\url{http://dast.nlanr.net/Projects/Iperf/}}.
640: 
641: \end{thebibliography}
642: 
643: \end{document}
644: