0306:cs0306090/cr.tex

1: \documentclass[twocolumn,twoside,slac]{revtex4}

2: \usepackage{graphicx}

3: \usepackage{fancyhdr}

4:

5: % $Id$

6:

7: \pagestyle{fancy}

8: \fancyhead{} % clear all fields

9: \fancyhead[C]{\it {2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, Ca, USA, March 2003}} \fancyhead[RO,LE]{\thepage}

10: \fancyfoot{} % clear all fields

11: \fancyfoot[LE,LO]{\bf THAT006}

12: \renewcommand{\headrulewidth}{0pt}

13: \renewcommand{\footrulewidth}{0pt}

14: \renewcommand{\sfdefault}{phv}

15:

16: \setlength{\textheight}{235mm}

17: \setlength{\textwidth}{170mm}

18: \setlength{\topmargin}{-20mm}

19:

20: \bibliographystyle{apsrev}

21:

22: \begin{document}

23:

24: \title{Worldwide Fast File Replication on Grid Datafarm}

25:

26: \author{Osamu Tatebe, Satoshi Sekiguchi}

27: \affiliation{AIST, Tsukuba, Ibaraki 3058568, JAPAN}

28: %

29: \author{Youhei Morita}

30: \affiliation{KEK, Tsukuba, Ibaraki 3050801, JAPAN}

31: %

32: \author{Satoshi Matsuoka}

33: \affiliation{Tokyo Institute of Technology, Meguro, Tokyo 152-8552, JAPAN}

34: %

35: \author{Noriyuki Soda}

36: \affiliation{Software Research Associates, Inc., Naka, Nagoya, 4600003, JAPAN}

37:

38: \begin{abstract}

39: The Grid Datafarm architecture is designed for global petascale

40: data-intensive computing.  It provides a global parallel filesystem

41: with online petascale storage, scalable I/O bandwidth, and scalable

42: parallel processing, and it can exploit local I/O in a grid of

43: clusters with tens of thousands of nodes.  One of its features is that it

44: manages file replicas in filesystem metadata for fault tolerance and

45: load balancing.

46:

47: This paper discusses and evaluates several techniques to support

48: long-distance fast file replication.  The Grid Datafarm manages a

49: ranked group of files as a Gfarm file, each file, called a Gfarm file

50: fragment, being stored on a filesystem node, or replicated on several

51: filesystem nodes.  Each Gfarm file fragment is replicated

52: independently and in parallel using rate-controlled HighSpeed TCP with

53: network striping.  On a US-Japan testbed spanning 10,000 km, we

54: achieve 419 Mbps using 2 nodes on each side, and 741 Mbps using 4

55: nodes, out of the 893 Mbps available over two transpacific networks.

56: \end{abstract}

57:

58: \maketitle

59:

60: \thispagestyle{fancy}

61:

62: %-------------------------------------------------------------------------

63: \section{Introduction}

64: \label{sec:intro}

65: A wave of petascale data-intensive computing is arriving in fields such as

66: high-energy physics data analysis, astronomical data analysis, and

67: bio-informatics data analysis.  More than ten petabytes of storage need

68: to be shared and analyzed by world-wide users with high efficiency,

69: high security, and high dependability.

70:

71: The Grid Datafarm architecture \cite{gfarm-ccgrid2002} is designed for

72: global petascale data-intensive computing to enable the processing of

73: large amounts of data at multiple regional PC clusters.  The aim of

74: this research is to establish a large-scale parallel filesystem by

75: exploiting the local storage of cluster nodes spread over a wide

76: area, as the platform needed to support petabyte-scale

77: data-intensive computing.  The Grid Datafarm architecture enables

78: high-speed access to a large amount of data by utilizing file access

79: locality, and realizes fault tolerance of disks and networks by data

80: replication.

81:

82: This paper discusses long-distance fast file replication on the

83: Grid Datafarm.  To improve performance in high bandwidth-delay

84: product networks, congestion control that remains efficient, fair,

85: scalable, and stable plays a key role.  The easiest way to improve

86: performance is to open multiple TCP connections in parallel, but

87: this approach leaves the number of connections to be

88: determined by the user, and too many connections may cause heavy

89: congestion.  Several research efforts address

90: this issue, such as HighSpeed TCP \cite{HighSpeedTCP}, Scalable TCP

91: \cite{ScalableTCP}, FAST TCP \cite{FASTTCP}, and XCP

92: \cite{XCP-SIGCOMM2002}.  HighSpeed TCP is an attempt to improve

93: congestion control of TCP for large congestion windows with better

94: flexibility, better scaling, better slow-start behavior, and fairer

95: competition with current TCP, while keeping backward compatibility and allowing

96: incremental deployment.  It modifies the TCP response function only

97: for large congestion windows to reach high bandwidth reasonably

98: quickly when in slow-start, and to reach high bandwidth without overly

99: long delays when recovering from multiple retransmit timeouts, or when

100: ramping-up from a period with small congestion windows.

101:

102: For file replication of large files in high bandwidth-delay product

103: networks, it is necessary to improve disk I/O performance as well

104: as network performance.  At present, each cluster node is

105: capable of transmitting at a rate of 1 Gbps, while the performance of

106: an IDE or a SCSI disk is at most 50 MB/s.  To improve the disk I/O

107: bandwidth, disk striping such as RAID-0 is effective.
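
As a rough illustration of this gap (a sketch assuming $1~\mathrm{MB} = 2^{20}$ bytes, the convention that reproduces the unit conversions quoted later in this paper, and ideal linear scaling of RAID-0 striping, which real disks only approximate), a 1 Gbps link corresponds to
\[
\frac{10^{9}~\mathrm{bit/s}}{8 \times 2^{20}~\mathrm{bit/MB}} \approx 119~\mathrm{MB/s},
\]
so the number $n$ of 50 MB/s disks needed to keep up with the network satisfies
\[
n \ge \frac{119~\mathrm{MB/s}}{50~\mathrm{MB/s}} \approx 2.4,
\]
that is, at least three striped disks under these assumptions.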

108:

109: This paper is organized as follows.  In Section~\ref{sec:rep}, the

110: file replication on the Grid Datafarm is discussed.

111: Section~\ref{sec:eval} evaluates the network performance and the file

112: replication performance using a US-Japan Grid Datafarm testbed.

113:

114: %-------------------------------------------------------------------------

115: %\section{HighSpeed TCP}

116: %TCP connections with larger congestion windows

117: %modify the TCP response function

118:

119: %-------------------------------------------------------------------------

120: \section{File Replication on Grid Datafarm}

121: \label{sec:rep}

122: The Grid Datafarm provides a Grid file system that federates multiple

123: local filesystems on a Grid across administrative domains.  The Grid

124: file system provides virtualized hierarchical namespaces for files

125: having consistent access control with flexible capabilities

126: management.  There is a replica catalog to manage mappings from the

127: hierarchical namespace for files to one or more physical file

128: locations.  This enables efficient, dependable, and transparent file

129: sharing on a Grid.

130:

131: The Grid Datafarm has a feature for data parallel execution.  It

132: manages a ranked group of files as a single Gfarm file.  This makes it

133: possible to manage a lot of distributed files as a single file, which

134: will be analyzed in parallel.  Each parallel process possibly

135: generates a new set of output files also managed as a single file.

136: File-affinity scheduling and the new concept of a file view enable the

137: ``owner computes'' strategy, or ``move the computation to data''

138: approach for the parallel data analysis.

139:

140: When replicating a Gfarm file, each file of the Gfarm file, or a group

141: of files, stored on a different filesystem node can be replicated in

142: parallel and independently.  File replication on the Grid Datafarm is

143: considered to be a parallel file replication from multiple cluster

144: nodes to (different) multiple cluster nodes.

145:

146: In high bandwidth-delay product networks, using multiple TCP streams, or

147: network striping, is effective for improving performance, but disk

148: accesses to striped data decrease the performance of the disk I/O\@.

149: It is therefore necessary to utilize a modified TCP or other protocols to

150: achieve high performance with a single stream.  HighSpeed TCP is one

151: of the proposals for modifying the congestion control of TCP, and it

152: is used in the performance evaluation on a US-Japan Grid Datafarm

153: testbed.

154:

155: As described in Section~\ref{sec:intro}, the disk I/O performance is

156: poorer than the network bandwidth on each cluster node.  One of the

157: requirements of the Grid Datafarm architecture is dense and

158: high-performance storage on each node for online petabyte-scale

159: storage.  To meet this requirement, each node of the AIST Gfarm

160: cluster was designed to be a 1U server having a 3ware RAID card with

161: four 120 GB IDE disks in RAID-0, providing 480 GB of storage capacity

162: and over 110 MB/s for contiguous block reads and writes, a rate

163: comparable to the network performance.

164:

165: %-------------------------------------------------------------------------

166: \section{Performance Evaluation}

167: \label{sec:eval}

168:

169: \subsection{PC clusters and wide-area networks}

170: \label{ssec:env}

171: During the international conference SC2002, held in Baltimore from

172: November 16th to November 22nd of 2002, a Grid Datafarm Data Grid

173: environment was set up linking seven PC clusters in both Japan and the

174: U.S.

175:

176: \begin{figure}[tb]

177:  \includegraphics[width=\columnwidth]{network2-en.eps}

178:  \caption{\label{fig:env} Network logical map of world-wide Grid

179:  Datafarm testbed.  Four sites in Japan and three sites in the U.S. are

180:  integrated with the Grid Datafarm Data Grid middleware.}

181: \end{figure}

182:

183: Seven systems of PC clusters, including the one at the SC2002 booth (a

184: total of 190 PCs) were located in the research centers of both Japan

185: and the U.S. (AIST, High Energy Accelerator Research Organization,

186: Tokyo Institute of Technology, the University of Tokyo, Indiana

187: University and San Diego Supercomputer Center (SDSC)).  They were

188: integrated with the Grid Datafarm Data Grid middleware

189: \cite{gfarm-home} (Figure~\ref{fig:env}).  Tsukuba WAN, APAN/TransPAC

190: \cite{apan-transpac} and Maffin \cite{maffin} supported us in

191: establishing the network.

192:

193: The system had a peak floating-point performance of 962 Gflops.  It

194: was equipped with a large-capacity file system of 18 TB with an access rate of

195: 6,600 MB/s as a Gfarm wide-area filesystem.

196:

197: This environment utilized multiple high-speed wide-area networks, that

198: is, Tsukuba WAN and SuperSINET in Japan, APAN/TransPAC and NII-ESnet

199: HEP PVC between Japan and the U.S., Abilene and ESnet in the U.S. and

200: SCinet at the SC2002 booth.  The bandwidth from SC2002 to both Indiana

201: University and SDSC was 622 Mbps.  In the transpacific network, it was

202: 893 Mbps.  The total theoretical maximum bandwidth of the network linking

203: seven clusters was 2.173 Gbps one way (see Figure~\ref{fig:env}).

204:

205: For the file replication, a large amount of scientific data taken from

206: particle physics was generated mainly in the large-scale PC cluster of

207: Tokyo Institute of Technology, and data replicas of several

208: hundred GB were created at each of the other clusters in a single filesystem

209: image.

210:

211: At the SC2002 booth in Baltimore, there was a 12-node AIST Gfarm

212: cluster connected with gigabit ethernet, which connected to the SCinet

213: with 10 gigabit ethernet using the Force10 E1200 switch.  Each node

214: consisted of dual Intel Xeon 2.4 GHz processors, 1 GB memory, and a

215: 3ware Escalade 7500-4 RAID controller with four 120 GB 3.5'' HDDs,

216: which were configured in RAID-0.  The disk I/O performance for

217: contiguous blocks achieved 109 MB/s for writes and 168 MB/s for reads.

218: The network bandwidth of gigabit ethernet was 941 Mbps, measured with the iperf

219: bandwidth measurement tool \cite{iperf}.  File replication performance

220: was 75 MB/s, which is equivalent to 629 Mbps, using the Gfarm Data

221: Grid Middleware.
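
This conversion between MB/s and Mbps is consistent with taking $1~\mathrm{MB} = 2^{20}$ bytes (a convention inferred from the quoted figures, not one stated explicitly):
\[
75~\mathrm{MB/s} \times 2^{20} \times 8~\mathrm{bit/MB} \approx 629 \times 10^{6}~\mathrm{bit/s}.
\]
Relative to the 941 Mbps measured with iperf, this corresponds to about $629/941 \approx 67\%$ of the measured network bandwidth.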

222:

223: % Fri May 16 01:34:16 JST 2003

224: %gfm65% ./thput-fsys -rw -b 65536 -s 10240

225: %testing with 10240 MB file

226: %bufsize     write [bytes/sec]     read [bytes/sec]

227: %  65536   108959734            196902836

228: %gfm65% ./thput-fsys -rw -b 65536 -s 20480

229: %testing with 20480 MB file

230: %bufsize     write [bytes/sec]     read [bytes/sec]

231: %  65536   113479404            185055488

232: %gfm65% ./thput-fsys -rw -b 65536 -s 40960

233: %testing with 40960 MB file

234: %bufsize     write [bytes/sec]     read [bytes/sec]

235: %  65536   115339604            181332188

236: %gfm65% ./thput-fsys -rw -b 65536 -s 81920

237: %testing with 81920 MB file

238: %bufsize     write [bytes/sec]     read [bytes/sec]

239: %  65536   114586309            176661637

240:

241: At the Grid Technology Research Center, AIST in Tsukuba, Japan, there

242: was a 7-node AIST Gfarm cluster of the same configuration, connected to the Tokyo XP

243: with gigabit ethernet via Tsukuba WAN and Maffin networks.

244:

245: At Indiana University, there was a 15-node PC cluster connected

246: with Fast Ethernet, which connected to the Indianapolis GigaPoP with OC-12.  The

247: disk I/O performance for contiguous blocks was 9.3 MB/s for writes and

248: 10.2 MB/s for reads.

249:

250: %gfarm01% thput-fsys -b 65536 -s 2560 -rw

251: %testing with 2560 MB file

252: %bufsize     write [bytes/sec]     read [bytes/sec]

253: %  65536     9710990             10711022

254:

255: At the SDSC in San Diego, there was an 8-node PC cluster connected with

256: gigabit ethernet, which connected to the outside with OC-12.  The disk I/O

257: performance for contiguous blocks was 29.4 MB/s for writes and 20.0

258: MB/s for reads.

259:

260: %slic04% thput-fsys -rw -b 65536 -s 10240

261: %testing with 10240 MB file

262: %bufsize     write [bytes/sec]     read [bytes/sec]

263: %  65536    30840332             21006343

264:

265: The APAN/TransPAC transpacific network consisted of two links: the

266: northern route (OC-12 POS) between Seattle and Tokyo, and the southern

267: route (OC-12 ATM) between Chicago and Tokyo.  The southern route was

268: shaped to 271 Mbps.  By default, all IP packets were transmitted via

269: the northern route.  To utilize both routes, we configured a

270: static route such that all traffic between three specific nodes at

271: the SC2002 booth and AIST was transmitted via the southern route.

272:

273: \begin{table}[tb]

274: \caption{Round trip time between the SC2002 booth and other sites.

275: 'AIST (N)' and 'AIST (S)' mean that IP packets are transmitted via the

276: northern route and via the southern route, respectively}

277: \label{tab:rtt}

278: \begin{center}

279: \begin{tabular}{cc}

280: \hline

281: AIST (N) & 199 msec \\

282: AIST (S) & 222 msec \\

283: Indiana & 30 msec \\

284: SDSC & 86 msec \\

285: \hline

286: \end{tabular}

287: \end{center}

288: \end{table}

289:

290: The round trip time (RTT) of IP packets between the SC2002 booth and

291: other sites is shown in Table~\ref{tab:rtt}.

292:

293: %-------------------------------------------------------------------------

294: \subsection{HighSpeed TCP over transpacific network}

295: Figure~\ref{fig:sc-gfm} shows the measured network bandwidth of

296: HighSpeed TCP from the SC2002 booth in the U.S. to AIST in Japan via

297: the APAN/TransPAC northern route.  The bandwidth was measured using

298: iperf from one node to one node with two HighSpeed TCP streams.

299: The buffer size of each socket was 8 MB, which gave a theoretical

300: peak bandwidth of 337 Mbps for one connection with an RTT of 199 ms.  As

301: Figure~\ref{fig:sc-gfm} shows, the measured peak bandwidth reached 529

302: Mbps in 5-second average out of the physical network bandwidth of 622

303: Mbps.  The bandwidth occasionally dropped due to packet loss,

304: but it recovered reasonably quickly thanks to HighSpeed

305: TCP\@.
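
The theoretical peak quoted above follows from the usual window-limited bound for a single TCP connection, in which at most one socket buffer of data can be in flight per round trip.  As a sketch, assuming the 8 MB buffer means $8 \times 2^{20}$ bytes and writing $W$ for the buffer size and $B_{\max}$ for the bound (symbols introduced here only for illustration):
\[
B_{\max} = \frac{W}{\mathrm{RTT}}
 = \frac{8 \times 2^{20} \times 8~\mathrm{bit}}{0.199~\mathrm{s}}
 \approx 337~\mathrm{Mbps}.
\]
Two such streams are then bounded by about 674 Mbps in aggregate, which exceeds the 622 Mbps physical bandwidth, so under this bound the socket buffers were not the limiting factor for the two-stream measurement.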

306:

307: % 289 Mbps 5-second peak bandwidth

308:

309: \begin{figure}[tb]

310:  \includegraphics[width=\columnwidth]{sc-gfm.eps}

311:  \caption{\label{fig:sc-gfm} HighSpeed TCP bandwidth from U.S. to

312:  Japan via the APAN/TransPAC northern route (OC-12 POS).}

313: \end{figure}

314:

315: Figure~\ref{fig:gfm-sc} shows the measured HighSpeed TCP bandwidth

316: from Japan to the U.S. via the northern route.  The machine and network

317: configurations were the same as in the previous measurement except for the

318: traffic direction.  Because the traffic was slightly heavy at this

319: time, the measured peak bandwidth was 443 Mbps.  After the heavy

320: packet loss, the bandwidth recovered slowly, just as with regular

321: TCP\@.

322:

323: \begin{figure}[tb]

324:  \includegraphics[width=\columnwidth]{gfm-sc.eps}

325:  \caption{\label{fig:gfm-sc} HighSpeed TCP bandwidth from Japan to

326:  the U.S. via the APAN/TransPAC northern route (OC-12 POS).}

327: \end{figure}

328:

329: Figure~\ref{fig:aist-sdsc-ge} is the case using the APAN/TransPAC

330: southern route.  Since the southern route was shaped to 271 Mbps, one

331: HighSpeed TCP stream would be able to fill the bandwidth.  However,

332: the stream suffered critical packet loss, and achieved only 85.9

333: Mbps in 10-minute average, although it reached 251 Mbps in 5-second peak

334: bandwidth.  One of the reasons for the critical packet loss was the

335: setting of an ATM switch configured without random early drop.  To

336: cope with this problem, it is necessary to control the network traffic

337: rate so as not to exceed the physical network bandwidth, that is, 271 Mbps.

338:

339: \begin{figure}[tb]

340:  \includegraphics[width=\columnwidth]{aist-sdsc-ge.eps}

341:  \caption{\label{fig:aist-sdsc-ge} HighSpeed TCP bandwidth from Japan

342:  to U.S. via the APAN/TransPAC southern route (OC-12 ATM, 271 Mbps

343:  shaping) with one non-rate-limited HighSpeed TCP stream.}

344: \end{figure}

345:

346: When the maximum bandwidth of one HighSpeed TCP stream was limited to

347: 100 Mbps, it was possible to achieve stable and high bandwidth, as shown

348: in Figure~\ref{fig:aist-sdsc-fe2}.  This rate control was realized by

349: changing the network from gigabit ethernet to fast ethernet.  This

350: case achieved 166.1 Mbps in 10-minute average, and 190.0 Mbps in

351: 5-second peak bandwidth.

352:

353: \begin{figure}[tb]

354:  \includegraphics[width=\columnwidth]{aist-sdsc-fe2.eps}

355:  \caption{\label{fig:aist-sdsc-fe2} HighSpeed TCP bandwidth from Japan

356:  to U.S. via the APAN/TransPAC southern route (OC-12 ATM, 271 Mbps

357:  shaping) with two 100-Mbps HighSpeed TCP streams.}

358: \end{figure}

359:

360: %-------------------------------------------------------------------------

361: \subsection{File replication}

362: The performance of file replication of large files that do not fit the

363: main memory is limited by the disk I/O performance and the network

364: performance.  When replicating between sites, the file replication

365: performance is also limited by the bandwidth of the wide-area network

366: shown in Figure~\ref{fig:env}.  Table~\ref{tab:max-rep} shows the

367: performance limit of file replication with one node at each site.

368:

369: \begin{table}[tb]

370: \begin{center}

371: \caption{Performance limit of file replication using one node at each

372: site in MB/s.}

373: \label{tab:max-rep}

374: \begin{tabular}{c|cc}

375:         &   To   &   From \\

376: \hline

377: Indiana &   9.3  &   10.2 \\

378: SDSC    &  29.4  &   20.0 \\

379: AIST    & 109    &  112   \\

380: SC2002  & 109    &  112   \\

381: \end{tabular}

382: \end{center}

383: \end{table}

384:

385: Figure~\ref{fig:sc-ussites} shows the performance of file replication

386: between one node at the SC2002 booth and various numbers of nodes at

387: Indiana Univ.\ and SDSC\@.  Between the SC2002 booth and Indiana

388: Univ., the file replication performance increased almost in proportion

389: to the number of nodes at Indiana Univ., and achieved the maximum

390: performance of 34.9 MB/s, that is, 293 Mbps, from four nodes at

391: Indiana Univ.\ to one node at the SC2002 booth with a 4GB file.
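
As a rough check of this scaling, Table~\ref{tab:max-rep} gives a per-node limit of 10.2 MB/s for transfers from Indiana.  Assuming the four source nodes contribute independently and that no other bottleneck applies (a simplification), the aggregate limit and the fraction achieved are
\[
4 \times 10.2~\mathrm{MB/s} = 40.8~\mathrm{MB/s}, \qquad
\frac{34.9}{40.8} \approx 0.86,
\]
that is, the measured 34.9 MB/s is roughly 86\% of the aggregate per-node limit.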

392:

393: \begin{figure}[tb]

394:  \includegraphics[width=\columnwidth]{sc-ussites.eps}

395:  \caption{\label{fig:sc-ussites} File replication performance between

396:  one node at SC2002 booth and several nodes at Indiana Univ.\ and SDSC\@.}

397: \end{figure}

398:

399: From the SC2002 booth to SDSC, the file replication performance of a

400: 2GB file achieved 32.8 MB/s even with one node at SDSC\@.  On the

401: other hand, the performance from SDSC to the SC2002 booth showed a

402: different tendency, scaling with the number of

403: nodes at SDSC\@.  One reason was that this direction required multiple

404: HighSpeed TCP streams to improve the network performance because one

405: stream achieved only 1.1 to 2.3 Mbps.  For the file replication from

406: SDSC to the SC2002 booth, seven parallel streams were used in each

407: node pair.

408:

409: Table~\ref{tab:us-japan-rep} shows the parameters and the measured

410: bandwidth of file replication from Baltimore in the U.S. to Tsukuba in

411: Japan over a distance of more than 10,000 km.  As shown in the previous

412: section, a HighSpeed TCP stream achieved about 260 Mbps with a socket

413: buffer size of 8 MB, while multiple rate-controlled streams were

414: effective in stabilizing the bandwidth.

415:

416: \begin{table*}[th]

417: \begin{center}

418: \caption{Parameters and measured performance of US-Japan file replication}

419: \label{tab:us-japan-rep}

420: \begin{tabular}{cccccc}

421: \# node pairs & \# streams & data size & 10-sec average BW

422:  & Transfer time & Average BW \\

423: \hline

424: 1 (N1)    &  1 (N1x1)  & 2 GB & n/a & 162.6 sec & 106 Mbps \\

425: 1 (N1)    &  8 (N8x1)  & 2 GB & n/a & 124.7 sec & 138 Mbps \\

426: 1 (N1)    & 16 (N16x1) & 2 GB & n/a  & 113.0 sec & 152 Mbps \\

427: \hline

428: 1 (S1)    &  1 (S1x1)  & 1 GB & n/a   & 193.0 sec & 44.5 Mbps \\

429: 1 (S1)    &  8 (S8x1)  & 1 GB & 170 Mbps &  91.5 sec & 93.9 Mbps \\

430: 1 (S1)    & 16 (S16x1) & 1 GB & n/a   & 173.3 sec & 49.6 Mbps \\

431: \hline

432: 2 (N2)    & 32 (N16x2) & 2$\times$2 GB & 419 Mbps & 115.9 sec & 297 Mbps \\

433: 3 (N3)    & 48 (N16x3) & 2$\times$3 GB & 593 Mbps & 139.6 sec & 369 Mbps \\

434: 4 (N3+S1) & 56 (N16x3+S8x1) & 2$\times$4 GB & 741 Mbps & 150.0 sec & 458 Mbps \\

435: \end{tabular}

436: \end{center}

437: \end{table*}

438:

439: To control and limit the traffic to a given rate, the socket buffer size

440: and the interval between sends were adjusted.  The send interval

441: also needed to be adjusted to suppress a too-rapid

442: increase of the congestion window, which causes serious packet loss.

443:

444: Using the northern route, 16 streams achieved the best bandwidth of

445: 152 Mbps on average for file replication of a 2 GB file.  Using the

446: southern route, 8 streams achieved the best bandwidth of 93.9 Mbps on

447: average for file replication of a 1 GB file.  Using three node pairs

448: for the northern route and one node pair for the southern route, the

449: file replication of an 8 GB file achieved 458 Mbps on average, and 741

450: Mbps in 10-second peak bandwidth out of 893 Mbps.
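
The average bandwidths in Table~\ref{tab:us-japan-rep} are largely consistent with dividing the data size by the transfer time, assuming $1~\mathrm{GB} = 2^{30}$ bytes (a convention inferred from the figures, not stated explicitly).  For the four-node-pair case, for example,
\[
\frac{8 \times 2^{30} \times 8~\mathrm{bit}}{150.0~\mathrm{s}} \approx 458~\mathrm{Mbps},
\]
and the 10-second peak of 741 Mbps corresponds to $741/893 \approx 83\%$ of the combined transpacific bandwidth.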

451:

452: %parameter               Northern route   Southern route

453: %socket buffer size           610 KB          250 KB

454: %Traffic control per stream    50 Mbps         28.5 Mbps

455: %\# streams per node pair      16 streams       8 streams

456: %\# nodes                       3 hosts         1 host

457: %stripe unit size             128 KB           128 KB

458:

459: For the SC2002 high-performance bandwidth challenge, the parameters in

460: Table~\ref{tab:sc2002-bwc} were set up based on the previous

461: measurements.  In the remote site column, `AIST N' means the AIST via

462: the APAN/TransPAC northern route, and `AIST S' means the AIST via the

463: southern route.  The `Measured BW' column shows the measured average

464: bandwidth of file replication of a 1 GB or 2 GB file, which is not the

465: same as 5-second or 10-second peak bandwidth.

466:

467: \begin{table*}[tb]

468: \begin{center}

469: \caption{Parameters for file replication and expected bandwidth from

470: and to a 12-node Gfarm cluster at SC2002 booth}

471: \label{tab:sc2002-bwc}

472: \begin{tabular}{cccccc}

473: \multicolumn{6}{c}{Outgoing traffic} \\

474: \# nodes & Remote & \# nodes & \# streams &

475: Socket buffer size, & Measured BW \\

476: in Baltimore & site & at remote site & /node &

477: rate limit & (1-2min avg) \\

478: \hline

479:  3 &  SDSC    & 5 &  1 & 1 MB              & $>$ 60 MB/s \\

480:  2 &  Indiana & 8 &  1 & 1 MB              &  56.8 MB/s \\

481:  3 &  AIST N  & 3 & 16 & 610 KB, 50 Mbps   &  44.0 MB/s \\

482:  1 &  AIST S  & 1 & 16 & 346 KB, 28.5 Mbps &  10.6 MB/s \\

483: \hline

484:  9 & S,I,A    & 5,8,4 & - & & $>$ 171 MB/s \\

485:    &          &       &   & & ($>$ 1.43 Gbps) \\

486: \multicolumn{6}{c}{Incoming traffic} \\

487: \hline

488:  1  &  SDSC    & 3 &  7 & 7 MB             &  23.1 MB/s \\

489:  1* &  Indiana & 4 &  1 & 1 MB             &  34.9 MB/s \\

490:  1* &  AIST N  & 1 &  1 & 610 KB, 50 Mbps  &   n/a \\

491:  1  &  AIST S  & 1 &  1 & 346 KB           &   n/a \\

492: \hline

493:  3  & S,I,A    & 3,4,2 & - & & $>$ 58 MB/s \\

494:     &          &       &   & & ($>$ 487 Mbps)

495: \end{tabular}

496: \end{center}

497: \end{table*}

498:

499: The one- to two-minute average bandwidth was expected to

500: exceed 1.43 Gbps for outgoing traffic and 487 Mbps for

501: incoming traffic, that is, over 1.92 Gbps in total for both directions, provided

502: there was no unknown congestion and no unexpected packet drop caused by

503: using several networks simultaneously.
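
These expected totals can be reproduced from the per-site entries of Table~\ref{tab:sc2002-bwc}.  As a sketch, again assuming $1~\mathrm{MB} = 2^{20}$ bytes (an inferred convention) and treating the `$>$ 60 MB/s' SDSC entry as 60 MB/s:
\begin{eqnarray*}
\mathrm{outgoing:} & & 60 + 56.8 + 44.0 + 10.6 = 171.4~\mathrm{MB/s}, \\
\mathrm{incoming:} & & 23.1 + 34.9 = 58.0~\mathrm{MB/s}.
\end{eqnarray*}
Converting the rounded sums gives $171 \times 2^{20} \times 8 \approx 1.43 \times 10^{9}$ bit/s and $58 \times 2^{20} \times 8 \approx 487 \times 10^{6}$ bit/s, or about $1.43 + 0.487 \approx 1.92$ Gbps in total.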

504:

505: \begin{figure}[tb]

506:  \includegraphics[width=\columnwidth]{bwc02.eps}

507:  \caption{\label{fig:bwc02} File replication performance in 10-second

508:  average between the SC2002 booth and other sites during the SC2002

509:  high-performance bandwidth challenge.}

510: \end{figure}

511:

512: The resulting file replication performance is shown in

513: Figure~\ref{fig:bwc02} as 10-second average bandwidth.  The peak

514: bandwidth was 1.40 Gbps for outgoing traffic, and 0.526 Gbps for

515: incoming traffic.  The 0.1-second average bandwidth measured by the

516: SCinet showed 1.691 Gbps for outgoing traffic, and 0.595 Gbps for

517: incoming traffic, 2.286 Gbps in total for both directions using 12

518: nodes in Baltimore.

519:

520: %-------------------------------------------------------------------------

521: %\section{Related Works}

522:

523: %-------------------------------------------------------------------------

524: \section{Summary and Future Work}

525: The Grid Datafarm is an architecture for petabyte-scale data-intensive

526: computing that provides online storage on the ten-petabyte scale, I/O

527: bandwidth that scales to the TB/s range, and scalable computational power,

528: all securely and dependably shared on a Grid.  This paper

529: discussed and evaluated the performance of file replication on the

530: Grid Datafarm.

531:

532: For the evaluation of the network performance in high bandwidth-delay

533: product networks between the U.S. and Japan, HighSpeed TCP performed

534: very well on the transpacific network of OC-12 POS, and achieved 529

535: Mbps in 5-second average using two streams of one node pair.  On the

536: other hand, the application-level rate-control of a HighSpeed TCP

537: stream was necessary for the network of OC-12 ATM to achieve stable

538: and high bandwidth.

539:

540: Within the U.S., the file replication did not show any performance problem,

541: while between the U.S. and Japan, application-level rate control of

542: HighSpeed TCP was also needed for stability.  As a result, using

543: three node pairs for the northern route and one node pair for the

544: southern route, the file replication of an 8 GB file achieved 741 Mbps

545: in 10-second average out of 893 Mbps.

546:

547: Between the SC2002 booth and three other sites including a Japan site,

548: the peak bandwidth of file replication in 0.1-second average showed

549: 1.691 Gbps for outgoing traffic, and 0.595 Gbps for incoming traffic,

550: 2.286 Gbps in total for both directions using 12 nodes in the SC2002

551: booth.

552:

553: The Grid Datafarm can be applied to theoretical or experimental

554: science that calls upon large-scale data analysis and simulation.  We

555: are planning to evaluate it using large-scale production applications

556: such as high-energy physics data analysis, analysis of observational

557: data of all-sky multiple wavelength bands in astronomy, gene analysis

558: in bio-informatics, and so on, on a world-wide Grid Datafarm testbed.

559:

560: %-------------------------------------------------------------------------

561: \section*{Acknowledgments}

562: We would like to thank the PRAGMA members for their kind help, especially Rick

563: McMullen and John Hicks at Indiana Univ.\ and Phillip Papadopoulos at SDSC\@.

564: We are thankful to the web100 and net100 projects for providing a

565: HighSpeed TCP patch for the Linux kernel.  We are grateful to Hisashi

566: Eguchi at Maffin, Kazunori Konishi, Yoshinori Kitatsuji, and Ayumu

567: Kubota at APAN, and Chris Robb at Indiana Univ.\ for investigation of

568: bottlenecks of wide-area networks.  We appreciate the great help of Force

569: 10 Networks, Inc.\ for providing the E1200 switch with 10 gigabit

570: ethernet network interface.  This research was supported by the

571: Ministry of Economy, Trade and Industry through a research grant of the

572: Network computing project, and the Ministry of Education, Culture,

573: Sports, Science and Technology of Japan through a Grant-in-Aid for

574: Scientific Research on Priority Areas (2) (No.\ 13224034).

575:

576:

577: %\bibliography{grid}

578: \begin{thebibliography}{9}

579: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi

580: \expandafter\ifx\csname bibnamefont\endcsname\relax

581:   \def\bibnamefont#1{#1}\fi

582: \expandafter\ifx\csname bibfnamefont\endcsname\relax

583:   \def\bibfnamefont#1{#1}\fi

584: \expandafter\ifx\csname citenamefont\endcsname\relax

585:   \def\citenamefont#1{#1}\fi

586: \expandafter\ifx\csname url\endcsname\relax

587:   \def\url#1{\texttt{#1}}\fi

588: \expandafter\ifx\csname urlprefix\endcsname\relax\def\urlprefix{URL }\fi

589: \providecommand{\bibinfo}[2]{#2}

590: \providecommand{\eprint}[2][]{\url{#2}}

591:

592: %\bibitem[{\citenamefont{Tatebe et~al.}(2002)\citenamefont{Tatebe, Morita,

593: %  Matsuoka, Soda, and Sekiguchi}}]{gfarm-ccgrid2002}

594: \bibitem{gfarm-ccgrid2002}

595: \bibinfo{author}{\bibfnamefont{O.}~\bibnamefont{Tatebe}},

596:   \bibinfo{author}{\bibfnamefont{Y.}~\bibnamefont{Morita}},

597:   \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Matsuoka}},

598:   \bibinfo{author}{\bibfnamefont{N.}~\bibnamefont{Soda}}, \bibnamefont{and}

599:   \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Sekiguchi}}, in

600:   \emph{\bibinfo{booktitle}{Proceedings of the 2nd IEEE/ACM International

601:   Symposium on Cluster Computing and the Grid (CCGrid 2002)}}

602:   (\bibinfo{year}{2002}), pp. \bibinfo{pages}{102--110}.

603:

604: \bibitem[{\citenamefont{Floyd}(2003)}]{HighSpeedTCP}

605: \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Floyd}}, in

606:   \emph{\bibinfo{booktitle}{Internet draft, draft-floyd-tcp-highspeed-02.txt}}

607:   (\bibinfo{year}{2003}), \bibinfo{note}{\url{http://www.icir.org/floyd/hstcp.html}}.

608:

609: \bibitem[{Sca()}]{ScalableTCP}

610: \emph{\bibinfo{title}{Scalable {TCP}}},

611:   \bibinfo{note}{\url{http://www-lce.eng.cam.ac.uk/~ctk21/scalable/}}.

612:

613: \bibitem[{FAS()}]{FASTTCP}

614: \emph{\bibinfo{title}{{FAST} {TCP}}},

615:   \bibinfo{note}{\url{http://netlab.caltech.edu/FAST/}}.

616:

617: \bibitem[{\citenamefont{Katabi et~al.}(2002)\citenamefont{Katabi, Handley, and

618:   Rohrs}}]{XCP-SIGCOMM2002}

619: \bibinfo{author}{\bibfnamefont{D.}~\bibnamefont{Katabi}},

620:   \bibinfo{author}{\bibfnamefont{M.}~\bibnamefont{Handley}}, \bibnamefont{and}

621:   \bibinfo{author}{\bibfnamefont{C.}~\bibnamefont{Rohrs}}, in

622:   \emph{\bibinfo{booktitle}{Proceedings of ACM SIGCOMM 2002 Conference}}

623:   (\bibinfo{year}{2002}).

624:

625: \bibitem[{gfa()}]{gfarm-home}

626: \emph{\bibinfo{title}{Grid Datafarm}},

627:   \bibinfo{note}{\url{http://datafarm.apgrid.org/}}.

628:

629: \bibitem[{apa()}]{apan-transpac}

630: \emph{\bibinfo{title}{{APAN}/{T}rans{PAC}}},

631:   \bibinfo{note}{\url{http://www.transpac.org/}}.

632:

633: \bibitem[{maf()}]{maffin}

634: \emph{\bibinfo{title}{Maffin}},

635:   \bibinfo{note}{\url{http://www.maffin.ad.jp/}}.

636:

637: \bibitem[{ipe()}]{iperf}

638: \emph{\bibinfo{title}{Iperf}},

639:   \bibinfo{note}{\url{http://dast.nlanr.net/Projects/Iperf/}}.

640:

641: \end{thebibliography}

642:

643: \end{document}

644: