0206:cs0206038/pnew.tex

1:

2:

3: \documentclass[jpdc]{apjrnl}

4: \usepackage{graphics}

5: %

6: \listfiles

7:

8:

9: %

10: \begin{document}

11:

12:

13: \title{A Multilevel Approach to Topology-Aware Collective Operations

14:         in Computational Grids}

15:

16:

17: \author{Nicholas T. Karonis

18: %

19: \\Department of Computer Science

20: \\Northern Illinois University

21: \\DeKalb, IL~~60115

22: \\Argonne National Laboratory

23: \\Argonne, IL~~60439

24: \\Email: karonis@niu.edu

25: %

26: \and         %

27: Bronis de Supinski

28: %

29: \\Center for Applied Scientific Computing

30: \\Lawrence Livermore National Laboratory

31: \\Livermore, CA~~94551

32: \\Email: bronis@llnl.gov

33: %

34: \and

35: Ian Foster

36: %

37: \\Argonne National Laboratory

38: \\Argonne, IL~~60439

39: \\The University of Chicago

40: \\Chicago, IL~~60637

41: \\Email: foster@mcs.anl.gov

42: %

43: \and

44: William Gropp and Ewing Lusk

45: %

46: \\Mathematics and Computer Science Division

47: \\Argonne National Laboratory

48: \\Argonne, IL~~60439

49: \\Email: gropp@mcs.anl.gov, lusk@mcs.anl.gov

50: %

51: \and

52: Sebastien Lacour

53: %

54: \\IRISA / INRIA Rennes

55: \\University of Beaulieu

56: \\35042 Rennes, France

57: \\Email: Sebastien.Lacour@irisa.fr

58: %

59: }

60: \date{April 2002}

61:

62:

63: \maketitle

64:

65:

66: Proposed running head: Multilevel Topology-Aware Collective Operations

67:

68:

69:

70: \pagebreak

71:

72:

73: %

74: %

75: %

76: \begin{abstract}

77: The efficient implementation of collective communication operations has

78: received much attention.  Initial efforts produced ``optimal'' trees based

79: on network communication models that assumed

80: equal \mbox{point-to-point} latencies

81: between any two processes.  This assumption is violated in

82: most practical settings, however, particularly in heterogeneous

83: systems such as clusters of SMPs and wide-area ``computational Grids,''

84: with the result that collective operations

85: perform suboptimally.  In response, more recent work has focused

86: on creating {\em topology-aware} trees for collective operations that minimize

87: communication across slower channels (e.g., a wide-area network).  While these

88: efforts have significant communication benefits, they all limit their view of

89: the network to only two layers.  We present a strategy based upon a

90: {\em multilayer} view of the network.  By creating {\em multilevel

91: topology-aware} trees we take advantage of communication cost differences

92: at every level in the network.  We used this strategy to implement

93: topology-aware versions of several MPI collective operations

94: in \hbox{MPICH-G2}, the Globus Toolkit$^{TM}$-enabled version of the

95: popular MPICH implementation of the MPI standard.

96: Using information about topology provided by \hbox{MPICH-G2},

97: we construct these multilevel

98: topology-aware trees automatically during execution.

99: We present results demonstrating the advantages of our

100: multilevel approach by comparing it to the default (topology-unaware)

101: implementation provided by MPICH and a topology-aware two-layer implementation.

102: \end{abstract}

103:

104:

105: \begin{keywords}

106: MPI, collective operations, \hbox{MPICH-G2}, grid computing, Globus Toolkit

107: \end{keywords}

108:

109:

110: \pagebreak

111:

112:

113: %

114: %

115: %

116: \section{Introduction}

117:

118:

119: The problem of building ``optimal'' communication trees for

120: collective operations

121: has received much attention in recent years.

122: The telephone model, which assumes that send and receive times are

123: equal and that messages are not packetized, implies that the optimal broadcast

124: algorithm uses a binomial tree.

125: Under models that expand the telephone model to account for message latency,

126: such as the postal~\cite{postal} or LogP~\cite{logp} models, the communication

127: topology of an optimal

128: broadcast algorithm becomes a generalized Fibonacci tree.

129: All of these approaches construct optimal trees for collective operations by first

130: modeling the communication characteristics of a network with a set of

131: parameters and then building the optimal trees based on parameter values and

132: their model.

133:

134:

135: Underlying this work is

136: the assumption that the communication times between all process pairs

137: in the computation are equal.  While this is a reasonable approximation

138: when the entire computation is performed on a single machine,

139: it is not reasonable when the computation is executed on a cluster of

140: symmetric multiprocessors (SMPs) in a local-area network, or worse, in

141: a {\em computational Grid}~\cite{Globus,GridBook,globus-physics} environment,

142: in which

143: multiple parallel computers are connected by

144: local-area, campus-area, or even wide-area networks.  Rapid

145: improvements in network performance have engendered considerable

146: interest in parallel computing

147: in the last context, as evidenced by experiments and

148: initiatives such as

149: the I-WAY~\cite{isoftcpe}, National Technology Grid~\cite{CACMgrid},

150: Information Power Grid~\cite{IPG99}, and TeraGrid~\cite{teragrid}.

151:

152: %

153:

154:

155:

156: Under these circumstances the trees produced by the conventional

157: models perform suboptimally.

158: In such heterogeneous environments, communication costs over different links

159: %

160: %

161: can differ by an order of magnitude or

162: more.  In these situations, {\em topology-aware} algorithms can

163: dramatically

164: improveme the performance.  For example, in the case of N

165: processors distributed into two clusters, a traditional

166: reduction algorithm may generate \mbox{O(log N)} intercluster messages,

167: while a topology-aware algorithm generates only 1, for a

168: cost saving of a factor of \mbox{O(log N)} if intercluster message

169: costs dominate.

170:

171:

172: Previous work~\cite{starT,magpie} has demonstrated that

173: topology-aware

174: collective operations can indeed reduce communication costs by reducing

175: the amount of communication performed over slow channels.  However, this work

176: limited the depth of network stratification to only two levels: other

177: processors are either near or far.  In~\cite{optcollops} we compared a

178: prototype of our multilevel approach to the topology-{\em un}aware binomial

179: tree algorithm distributed with MPICH and to MagPIe, one of the topology-aware

180: two-level techniques.  In that prototype we ``guessed'' which computers shared

181: a local network by inspecting their fully qualified domain names, and

182: thereafter representing our multilevel clustering of processes with a

183: sequence of {\em hidden communicators} inside MPI communicators.

184:

185: In this paper we present a much improved refinement of that prototype

186: that allows collective operations to exploit knowledge concerning the

187: structure of a multilevel network, in which the neighbors are processors

188: that are categorized according to their expected \mbox{point-to-point}

189: communication characteristics.  The identification of which processes

190: share a local network is now a simple matter of users providing values

191: for selected environment variables.  Additionally the use of hidden

192: communicators to represent the multilevel clustering has been replaced

193: by integer vectors.  The use of hidden communicators required us to

194: implement the collective operations as a {\em sequence of collective

195: operations}, for example, an \verb+MPI_Bcast+ was implemented as a

196: sequence of \verb+MPI_Bcast+s sequencing over each of the hidden communicators

197: in turn, which typically resulted in the use of binomial trees at each level.

198: By replacing the hidden communicators with integer vectors we

199: are now free to implement collective operations using \mbox{point-to-point}

200: operations over any tree we create.

201:

202:

203: To permit experimental studies, we have implemented our multilevel

204: approach for five of the collective operations supported by the Message Passing

205: Interface (MPI) standard~\cite{mpi-forum:journal}: \verb+MPI_Bcast+,

206: \verb+MPI_Reduce+, \verb+MPI_Barrier+, \verb+MPI_Gather+, and

207: \verb+MPI_Scatter+.  We use \hbox{MPICH-G2}~\cite{mpichg2}, the successor to

208: \hbox{MPICH-G}~\cite{mpi-nexus-pc}, which

209: is based on the popular MPICH implementation~\cite{mpich} of the

210: MPI standard. \hbox{MPICH-G2} uses services provided

211: by the Globus Toolkit$^{TM}$, or simply Globus, to support execution in

212: heterogeneous and distributed

213: environments.  This use of \hbox{MPICH-G2} enables experimentation within

214: realistic wide-area environments that would not otherwise be easily accessible.

215:

216:

217: In the sections that follow, we describe our multilevel topology approach.

218: Then, we present experimental results that illustrate the benefits of

219: our multilevel approach by comparing it with (1) the topology-{\em un}aware

220: implementation currently distributed with MPICH and (2) MagPIe~\cite{magpie},

221: one of the topology-aware two-level implementations of collective operations.

222: We briefly discuss other recent topology-aware and optimized collective

223: operations efforts and conclude with a discussion of future work.

224:

225:

226: %

227: %

228: %

229: \section{Multilevel Topology-Aware Approach}

230:

231:

232: \begin{figure}

233: \begin{centering}

234: %

235: %

236: %

237: %

238: %

239: %

240: %

241: %

242: %

243: %

244: %

245: %

246: %

247: \includegraphics{grid.eps}

248: \caption{An example of a Grid computation involving 10 processes on one

249:     IBM~SP at SDSC and another 10 processes distributed evenly across two

250:     SGI~Origin2000s (O2K$_a$ and O2K$_b$) at NCSA.}\label{fig-mach}

251: \end{centering}

252: \end{figure}

253:

254:

255: Figure~\ref{fig-mach} depicts an MPI application

256: involving 20 processes distributed over three machines located at

257: the San Diego Supercomputer Center (SDSC) and the National Center

258: for Supercomputing Applications (NCSA).

259: We depict 10 processes on the IBM~SP at SDSC and 5 processes

260: on each of two Origin2000s, O2K$_a$ and O2K$_b$, at NCSA.

261: The slowest communication is between sites, which uses TCP over a

262: wide-area network, with faster communication between the O2Ks at NCSA,

263: which uses TCP over their local-area network, and the fastest

264: communication, of course, within each machine.

265:

266: In the remainder of this section we describe a broadcast using first

267: the topology-unaware implementation currently distributed with MPICH,

268: then a 2-level topology-aware approach, and finally our multilevel

269: topology-aware broadcast.

270:

271: %

272: \subsection{A Topology-{\em Un}aware Broadcast}

273:

274: \begin{figure}

275: \begin{centering}

276: \includegraphics{binomial.eps}

277: \caption{The binomial trees $B_0$ through $B_3$.}\label{fig-binomial}

278: \end{centering}

279: \end{figure}

280:

281: Topology-unaware implementations of broadcast, including the one

282: distributed with MPICH, often make the simplifying

283: assumption that the communication times between all process

284: pairs in the computation are equal.  Under this assumption the broadcast

285: is often implemented by using a {\em binomial tree}.

286:

287: A binomial tree $B_k$ is an ordered tree (i.e., children of each node

288: are ordered) of order $k \ge 0$ defined recursively.  As

289: shown in Figure~\ref{fig-binomial}, the binomial tree $B_0$ consists

290: of a single node.  The binomial tree $B_k$ ($k > 0$) has a root

291: with $k$ children where the $i^{th}$ child ($0 < i \le k$) is the root

292: of the binomial tree $B_{k-i}$.  Figure~\ref{fig-binomial} depicts the binomial

293: trees $B_0$ through $B_3$.

294:

295: When communication times between all process pairs in the computation are

296: equal and have relatively low latency, \hbox{Bar-Noy} and Kipnis

297: show that implementing a broadcast with a binomial tree has the desirable

298: property that all processes will complete the broadcast at approximately

299: the same time thus, achieving proper load balancing~\cite{postal}.

300:

301: %

302: \subsection{A 2-Level Topology-Aware Broadcast}

303: \label{subsec-2lvl}

304:

305:

306: \begin{figure}

307: \begin{centering}

308: %

309: %

310: %

311: %

312: %

313: %

314: %

315: %

316: %

317: %

318: %

319: %

320: %

321: %

322: %

323: %

324: %

325: %

326: \includegraphics{magpie.eps}

327: \caption{An example of two 2-level topology-aware broadcast trees

328:     rooted at SDSC spanning 2 Origin2000s (O2K$_a$ and O2K$_b$) at NCSA

329:     and an IBM~SP at SDSC: (a) clustering processes on {\em machine boundaries}

330:     and (b) clustering on {\em site boundaries}.}\label{fig-2lvlbcast}

331: \end{centering}

332: \end{figure}

333:

334:

335: Existing 2-level topology-aware approaches~\cite{starT,magpie} cluster

336: processes into groups.  The two natural choices for the machines

337: depicted in Figure~\ref{fig-mach} are to cluster the processes based

338: either on {\em machine boundaries}, creating three groups -- the IBM~SP, O2K$_a$,

339: and O2K$_b$, or {\em site boundaries} creating two groups -- SDSC and NCSA.

340: While both are reasonable choices and would improve

341: performance when compared with the topology-{\em un}aware binomial tree

342: distributed with MPICH,

343: both choices ignore the disparity in network performance between the

344: local- and wide-area networks.  Consider, for example,

345: a broadcast rooted at one of the processes at SDSC.  Figure~\ref{fig-2lvlbcast}a

346: depicts the broadcast tree of the 2-level approach when the processes

347: are clustered on machine boundaries.  The broadcast starts

348: with the SDSC root process sending messages to designated processes on each

349: of the O2Ks at NCSA, resulting in two messages travelling across the wide-area

350: network, and concludes with broadcasts within each machine.  By contrast,

351: Figure~\ref{fig-2lvlbcast}b depicts the broadcast tree when the processes

352: are clustered on site boundaries.  In this case the root at SDSC

353: sends a single message across the wide-area network to a process on

354: one of the two O2Ks at NCSA and concludes with a broadcast within the

355: IBM~SP with another simultaneous broadcast across all the processes

356: at NCSA, which would typically require multiple messages to travel

357: across NCSA's local network.

358:

359:

360: %

361: \subsection{A Multilevel Topology-Aware Broadcast}

362:

363:

364: \begin{figure}

365: \begin{centering}

366: %

367: %

368: %

369: %

370: %

371: %

372: %

373: %

374: %

375: %

376: %

377: %

378: %

379: %

380: %

381: %

382: %

383: \includegraphics{g2.eps}

384: \caption{An example of a multilevel topology-aware broadcast tree rooted

385:     at SDSC spanning 2 Origin 2000s (O2K$_a$ and O2K$_b$) at NCSA and an

386:     IBM~SP at SDSC.}\label{fig-mlvlbcast}

387: \end{centering}

388: \end{figure}

389:

390:

391: The multilevel topology-aware approach we present minimizes messaging

392: across the slowest links {\em at each level} by clustering the processes

393: at the wide-area level into site groups, and then within each site group,

394: clustering processes at the local-area level into machine groups.

395: Using the same broadcast example from Section~\ref{subsec-2lvl}, we depict in

396: Figure~\ref{fig-mlvlbcast} the broadcast tree used by a multilevel approach.

397: Here the broadcast starts with the SDSC root process sending a single message

398: across the wide-area network to one of the processes at NCSA, in

399: Figure~\ref{fig-mlvlbcast} we depict a process on O2K$_a$.  The broadcast

400: continues with the receiving process on O2K$_a$ sending a single message

401: across NCSA's local network to a process on O2K$_b$ and the entire broadcast

402: concludes with broadcasts within each machine.  This multilevel

403: clustering minimizes messaging over the slower wide- and local-area

404: networks.

405:

406:

407: %

408: %

409: %

410: \section{Multilevel Topology-aware Approach in \hbox{MPICH-G2}}

411:

412:

413: In this section we describe our implementation of multilevel topology-aware

414: collective operations in the Globus Toolkit-based \hbox{MPICH-G2}.  For illustrative purposes,

415: we discuss our implementation of \verb+MPI_Bcast+ in detail.

416:

417: %

418: %

419: %

420: %

421: %

422: %

423: %

424: %

425: %

426: %

427: %

428: %

429: %

430: %

431: %

432: %

433: %

434: %

435: %

436: %

437: %

438: %

439: %

440:

441:

442: %

443: \subsection{RSL Specification of Topology}

444:

445:

446: MPICH-G2 uses the Globus Toolkit's Resource Specification Language~(RSL)~\cite{GRAM97} to

447: describe the resources required to run an application.

448: Users write {\em RSL scripts}, which identify resources

449: (e.g., computers) and specify requirements (e.g., number of CPUs, memory,

450: execution time) and parameters (e.g., location of executables, command

451: line arguments, environment variables) for each.  An RSL script can

452: be used as

453: the user interface to \verb+globusrun+, an upper-level Globus service

454: that first authenticates the user by using the Grid Security

455: %

456: Infrastructure~(GSI)~\cite{GSI-journal} and then

457: schedules and monitors the job across the various machines by using

458: two other Globus Toolkit services: the Dynamically-Updated Request Online

459: Coallocator~(DUROC)~\cite{CoAllocation99} and Grid Resource Allocation

460: and Management~(GRAM)~\cite{GRAM97}.  RSL is designed to be an

461: easy-to-use language to describe multiresource multisite jobs while

462: hiding all the site-specific details associated with requesting

463: such resources.

464:

465:

466: \begin{figure}

467: \begin{footnotesize}

468: \begin{verbatim}

469: +

470: ( &(resourceManagerContact="sp.npaci.edu")

471:    (count=10)

472:    (jobtype=mpi)

473:    (label="subjob 0")

474:    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))

475:    (directory=/homes/users/smith)

476:    (executable=/homes/users/smith/myapp)

477: )

478: ( &(resourceManagerContact="o2ka.ncsa.uiuc.edu")

479:    (count=5)

480:    (jobtype=mpi)

481:    (label="subjob 1")

482:    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1))

483:    (directory=/users/smith)

484:    (executable=/users/smith/myapp)

485: )

486: ( &(resourceManagerContact="o2kb.ncsa.uiuc.edu")

487:    (count=5)

488:    (jobtype=mpi)

489:    (label="subjob 2")

490:    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 2))

491:    (directory=/users/smith)

492:    (executable=/users/smith/myapp)

493: )

494: \end{verbatim}

495: \end{footnotesize}

496: \caption{\strut{An RSL script for an \hbox{MPICH-G2} application running

497:     on three machines that facilitates {\em 2-level process

498:     clustering}.}\label{fig-rsl2lvl}}

499: \end{figure}

500:

501:

502: %

503: %

504: %

505: %

506: %

507: %

508: %

509: %

510: %

511: %

512: %

513: %

514: %

515: %

516: %

517: %

518: %

519: %

520: %

521: %

522: %

523: %

524: %

525: %

526: %

527: %

528: %

529: %

530: %

531: %

532: %

533:

534:

535: \begin{figure}

536: \begin{footnotesize}

537: \begin{verbatim}

538: +

539: ( &(resourceManagerContact="sp.npaci.edu")

540:    (count=10)

541:    (jobtype=mpi)

542:    (label="subjob 0")

543:    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))

544:    (directory=/homes/users/smith)

545:    (executable=/homes/users/smith/myapp)

546: )

547: ( &(resourceManagerContact="o2ka.ncsa.uiuc.edu")

548:    (count=5)

549:    (jobtype=mpi)

550:    (label="subjob 1")

551:    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)

552:        (GLOBUS_LAN_ID NCSAlan))

553:    (directory=/users/smith)

554:    (executable=/users/smith/myapp)

555: )

556: ( &(resourceManagerContact="o2kb.ncsa.uiuc.edu")

557:    (count=5)

558:    (jobtype=mpi)

559:    (label="subjob 2")

560:    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 2)

561:        (GLOBUS_LAN_ID NCSAlan))

562:    (directory=/users/smith)

563:    (executable=/users/smith/myapp)

564: )

565: \end{verbatim}

566: \end{footnotesize}

567: \caption{\strut{An RSL script for an \hbox{MPICH-G2} application running

568:     on three machines that facilitates {\em multilevel process

569:     clustering}.}\label{fig-rslMlvl}}

570: \end{figure}

571:

572:

573: Figure~\ref{fig-rsl2lvl} depicts an RSL script for an \hbox{MPICH-G2}

574: application intended to run on the computational Grid depicted in

575: Figure~\ref{fig-mach}.  It depicts a job as a set of three {\em subjobs},

576: where each subjob is associated with a particular resource, in our example,

577: a computer.  Subjobs define a natural machine-boundary partitioning of the

578: processes in \verb+MPI_COMM_WORLD+ and are sufficient for a 2-level machine

579: boundary clustering of the processes.  To achieve a multilevel clustering,

580: the user must identify those machines that are on the same local network

581: by specifying a value for an \hbox{MPICH-G2}-defined environment variable

582: \verb+GLOBUS_LAN_ID+, as depicted in the RSL script in

583: Figure~\ref{fig-rslMlvl}.  Specifying the same value (\verb+NCSAlan+)

584: in the second and third subjobs instructs \hbox{MPICH-G2} to cluster these

585: two machines into the same local-area network group.  This same technique

586: can be used to cluster many subjobs in the same local-area network group

587: while simultaneously creating multiple local-area network groups through

588: the assignment of multiple yet unique \verb+GLOBUS_LAN_ID+ values.  This

589: simple specification (the only difference between Figures~\ref{fig-rsl2lvl}

590: and~~\ref{fig-rslMlvl}) is all that is required to create multilevel

591: topology-aware clustering of the processes.

592:

593:

594: %

595: %

596: %

597: %

598: %

599: %

600: %

601: %

602: %

603: %

604: %

605: %

606: %

607: %

608: %

609:

610:

611: The multilevel clustering information specified in RSL (i.e., processes

612: gathered first into machine groups and then local network groups composed

613: of machine groups) creates a multilevel grouping of the processes in

614: \verb+MPI_COMM_WORLD+ and is distributed to all the processes during

615: \hbox{MPICH-G2} bootstrapping to be stored within \verb+MPI_COMM_WORLD+ on

616: each process.  When new communicators are created (e.g., via \verb+MPI_Comm_split+),

617: \hbox{MPICH-G2} propagates the relevant multilevel clustering information to

618: the newly created communicator so that {\em all communicators} in

619: \hbox{MPICH-G2} have the multilevel clustering information pertaining to their

620: process groups.  As an interesting side effect we have made this multilevel

621: topology information available to MPI applications through existing

622: MPI communicator caching idioms. See~\cite{mpichg2} for a full

623: description of \hbox{MPICH-G2}'s topology discovery mechanism.

624: %

625:

626:

627: %

628: \subsection{MPICH-G2's Multilevel Topology-Aware Broadcast}

629:

630:

631: A multilevel topology-aware clustering of processes is not sufficient

632: in itself to allow the construction of a broadcast tree such as that depicted in

633: Figure~\ref{fig-mlvlbcast}: \hbox{MPICH-G2} also needs to know which process is

634: the root of the broadcast.  Construction of the multilevel

635: topology-aware tree is therefore deferred until the application calls a

636: collective operation.  At that time each process simultaneously and

637: independently (i.e., without communication) construct an identical tree based

638: on the multilevel process grouping found in the communicator and the parameters

639: passed (e.g., identifying the root process of a broadcast) to the collective

640: operation.

641:

642:

643: One benefit of using a multilevel topology-aware tree to implement

644: a collective operation is that we are free to select different subtree

645: topologies at each level.  For example, a multilevel broadcast tree can start

646: with a broadcast from the root to selected processes at each site across a

647: wide-area network, followed by broadcasts at each site to selected processes

648: on each machine across the local networks, and concluding with broadcasts

649: within each machine.  We have the freedom to use different broadcast

650: topologies at each stage in the sequence.  Bar-Noy and Kipnis show

651: that in high-latency networks (e.g., a wide-area network)

652: the optimal broadcast topology is a flat tree in which the root sends the

653: data to all other processes directly, while in a low-latency network

654: (e.g., intramachine messaging), the optimal broadcast topology is a binomial

655: tree~\cite{postal}.  We take advantage of these findings and the flexibility

656: of our multilevel approach in our implementation of \verb+MPI_Bcast+ by using

657: a flat broadcast tree at the initial wide-area level and binomial trees at

658: the local-area and intramachine levels.

659:

660:

661: In the next section we present results demonstrating the advantages of our

662: multilevel approach by comparing it with the default (topology-{\em un}aware)

663: implementation provided by MPICH and a topology-aware two-layer implementation.

664:

665:

666: %

667: %

668: %

669: \section{Experimental Results}

670: \label{sec-exp}

671:

672:

673: %

674: %

675:

676:

677: To demonstrate the advantages of our multilevel approach, we examine

678: its effects on \verb+MPI_Bcast+.

679: The MPICH implementation of \verb+MPI_Bcast+ is based on binomial trees;

680: hence, in a distributed heterogeneous environment like a computational

681: Grid its performance is acutely sensitive to the distribution

682: of the processes and the root of the broadcast.

683: For example, in an application using $P=2^k$ processes distributed evenly

684: across $C=2^i, 0 \le i \le k$ clusters, a broadcast implemented using

685: a binomial tree propagates the message down its longest path

686: using at least $log_2C$ intercluster messages and $log_2\frac{P}{C}$

687: intracluster messages.

688: In contrast,

689: under certain intercluster network performance

690: conditions described by \mbox{Bar-Noy} and Kipnis in their postal model,

691: our multilevel method could be used to send

692: 1 intercluster message and $log_2\frac{P}{C}$ intracluster messages.

693: Assuming an intercluster latency $l_s$ sec and bandwidth $b_s$ Kb/sec;

694: and an intracluster latency $l_f$ sec and bandwidth $b_f$ Kb/sec,

695: broadcasting a message of N Kb using the

696: binomial tree conservatively takes

697: \mbox{$O((logC)(l_s+\frac{N}{b_s}) + (log\frac{P}{C})(l_f+\frac{N}{b_f}))$},

698: whereas

699: broadcasting the same message using our multilevel method takes only

700: \mbox{$O((l_s+\frac{N}{b_s}) + (log\frac{P}{C})(l_f+\frac{N}{b_f}))$}.

701:

702:

703: \begin{figure}

704: \begin{small}

705: \begin{verbatim}

706: For (each message size M)

707:   MPI_Barrier(MPI_COMM_WORLD)

708:   if (MPI_COMM_WORLD rank == 0)

709:     t0 = get_time()

710:   For (r = 0; r < Nprocs; r ++)

711:     MPI_Bcast(root=r to MPI_COMM_WORLD message size M)

712:     ack_barrier()

713:   if (MPI_COMM_WORLD rank == 0)

714:     t1 = get_time()

715:     report message size M, time t1-t0

716: \end{verbatim}

717: \end{small}

718: \caption{\strut{The broadcast timing application.}\label{fig-tim-bcast}}

719: \end{figure}

720:

721:

722: We wrote a small MPI application (depicted in Figure~\ref{fig-tim-bcast})

723: that times the broadcasts of messages of increasing size.

724: To represent a broadcast with an arbitrary

725: root, we timed how long it would take to broadcast each message

726: of size M as each process in \verb+MPI_COMM_WORLD+ took its turn

727: as the root.  Also, in order to eliminate any potential pipelining

728: that might occur between consecutive broadcasts, we inserted a barrier

729: (\verb+ack_barrier()+) after each broadcast in which all processes

730: other than rank 0 \verb+MPI_Send+ an ACK message to process 0 and then

731: wait to \verb+MPI_Recv+ a GO message.  Process 0, after \verb+MPI_Recv+'ing

732: the ACK message from all the other processes, \verb+MPI_Send+'s a GO

733: message to each of the other processes, one at a time.  We chose to write our

734: own barrier rather than calling \verb+MPI_Barrier+ because

735: we have reimplemented \verb+MPI_Barrier+ to reflect multilevel topology and

736: we wished these tests to reflect the differences only in the broadcast

737: implementations.

738:

739:

740: %

741: %

742: %

743: %

744: %

745: %

746: %

747: %

748: %

749: %

750: %

751:

752: We conducted experiments running the MPI

753: application depicted in Figure~\ref{fig-tim-bcast} on

754: three computers: the IBM~SP at the San Diego Supercomputer Center

755: (SDSC-SP) and the IBM~SP (ANL-SP) and SGI~Origin200 (ANL-O2K) at

756: Argonne National Laboratory.

757: We compare

758: our multilevel topology approach to the binomial tree provided by MPICH

759: and include comparisons to the 2-level approach provided by MagPIe.

760: We ran the application four times, each time using

761: 16 processes on each of the three computers.  These results are

762: depicted in Figure~\ref{fig-bcast-magpie}.

763: The curves labeled ``MagPIe-machine'' and ``MagPIe-site'' represent

764: two runs using MagPIe version 2.0.1, each time with

765: a different cluster definition.

766: In our first MagPIe

767: run (``MagPIe-machine'') we defined three clusters, one for each computer,

768: of 16 processes each.  In our second MagPIe run (``MagPIe-site'') we defined

769: two clusters: an ANL cluster comprising the two ANL machines having 32

770: processes and an SDSC cluster comprising the SDSC-SP having only 16 processes.

771:

772: %

773: %

774: %

775: \begin{figure}

776: \begin{centering}

777: \includegraphics{timing_bcast.16anlsp16sdscsp16sdscsp.magpie.eps}

778: \caption{Original MPICH broadcast~vs.~topology-aware MPICH broadcast~vs.~MagPIe

779:     broadcast running 16 processes

780:     on the IBM~SP at SDSC and 16 processes on each the IBM~SP and

781:     SGI~Origin2000 at ANL.}\label{fig-bcast-magpie}

782: \end{centering}

783: \end{figure}

784:

785: Figure~\ref{fig-bcast-magpie}

786: shows there are significant benefits to the multilevel approach when

787: compared with a simple binomial tree and even when compared with a 2-level

788: approach as implemented by MagPIe.

789: A multilevel view of the network allows an application to avoid

790: slower channels {\em at each level}.  In our experiments, the broadcast is

791: optimized by sending one message across the wide-area network, then one

792: message across the local-area network, and then many messages

793: within each computer.

794:

795:

796: %

797: %

798: %

799: \section{Related Work}

800:

801:

802: Previous efforts have focused on creating ``optimal'' trees for collective

803: operations where \mbox{point-to-point} communications are not necessarily

804: equal between any two processes.  Husbands and Hoe present

805: MPI-StarT~\cite{starT}, an MPI implementation for a cluster of SMPs

806: interconnected by a high-performance interconnect.  They report significant

807: improvements after modifying the MPICH broadcast algorithm, which uses

808: binomial trees.  Their modifications use information that describes their

809: cluster topology by minimizing intercluster communication during collective

810: operations. MagPIe~\cite{magpie} is another MPI system designed to construct

811: collective operation trees in heterogeneous communication environments.

812: MagPIe recognizes a two-layer communication network that distinguishes between

813: local- and wide-area communication.  By minimizing wide-area communication,

814: much in the same way MPI-StarT minimizes intercluster communication, MagPIe

815: has seen significant improvements in all the MPI collective operations.

816:

817:

818: Both efforts have produced impressive results and clearly demonstrate

819: that there are significant advantages to implementing collective operations in

820: a topology-aware manner.  However, both limit

821: their view of the network to only two layers; MPI-StarT distinguishes

822: between intra- and intercluster communication within the same

823: local-area, and MagPIe distinguishes between local- and wide-area communication.

824: There are opportunities for further optimization by

825: using trees that stratify the network deeper than two layers.

826:

827:

828: In~\cite{pipelining} van de Geijn et al. show the advantages of

829: implementing collective operations by segmenting and pipelining

830: messages when communicating over relatively slower channels (e.g., TCP

831: over local- and wide-area networks).

832:

833:

834:

835:

836: In~\cite{magpie-PC} Kielman et al. extend MagPIe by incorporating

837: van de Geijn's pipelining idea through a technique they call Parameterized

838: LogP (PLogP), which is an extension of the LogP model presented

839: by Culler et al~\cite{logp}.

840: In this extension, MagPIe still recognizes only a two-layer communication

841: network, but through parameterized studies of the network, the researchers determine

842: ``optimal'' packet sizes.  This technique works well for applications

843: that always run on the same computational grid having relatively

844: stable performance, but requires retuning when moving the application

845: from one computing environment or network to another.

846:

847:

848: %

849: %

850: %

851: \section{Future Work}

852:

853:

854: We have implemented five of the MPI collective operations in a

855: topology-aware multilevel manner in \hbox{MPICH-G2}.  Encouraged by our

856: initial results, we plan to upgrade \hbox{MPICH-G2's} remaining MPI collective

857: operations in a similar manner.

858:

859:

860: %

861: %

862: %

863: %

864: %

865: %

866: %

867: %

868: %

869: %

870: %

871: %

872: %

873: %

874: %

875: %

876: %

877: %

878: %

879: %

880: %

881: %

882: %

883: %

884: %

885:

886:

887: Our general strategy implements a collective operation by first stratifying

888: the network into multiple levels and then minimizing the communication

889: across the slowest channels.  In doing so, however, we may encounter

890: a tree that has multiple siblings at a particular level, for example,

891: many sites connected across the wide-area network or many machines

892: at a particular site.  When this situation happens, we implement the collective

893: operation at that level using a binomial tree at all but the wide-area

894: network level.  Unfortunately, a binomial tree is not always the best choice.

895: \mbox{Bar-Noy} and Kipnis show that the shape of a collective

896: operation tree depends heavily on the \mbox{point-to-point} communication

897: characteristics of the send/receive primitives on which it is implemented.

898: Their model incorporates a latency parameter $\lambda \ge 1$.  They show

899: that for low latencies, (for example, communication within a single machine),

900: the optimal broadcast tree is a binomial tree, but for higher latencies,

901: (for example, communication across a wide-area network), the optimal broadcast

902: tree becomes flatter.  We will investigate ways to select better, if

903: not optimal, collective operation trees by choosing those that respect

904: the different communication characteristics at each level of our multilevel

905: view.

906:

907:

908: The pipelining techniques presented by van de Geijn et al. can be used at

909: each of the levels in \hbox{MPICH-G2's} multilevel topology-aware collective

910: operations.  Using techniques similar to Kielman's PLogP method, we will

911: develop methods to determine the appropriate packet sizes with respect to

912: network performance at {\em each level} of our multilevel view.

913:

914:

915: %

916: %

917: %

918: \section{Summary}

919:

920:

921: As Grid computations become increasingly prevalent, the need for

922: topology-aware collective operations also increases.  We have a version

923: of \hbox{MPICH-G2} that implements five collective operations in a multilevel

924: topology-aware manner.  We have shown, at least for \verb+MPI_Bcast+, that when

925: compared with the binomial tree provided by MPICH and the 2-level approach

926: provided by MagPIe there are significant advantages to executing

927: collective operations using a multilevel view of the network.

928: Through a simple process of identifying machines that are common

929: to a local-area network, we have provided a means by which an MPI

930: application may take advantage of the multilevel topology-aware algorithms

931: without requiring code modifications or special functions.

932:

933:

934: %

935: %

936: %

937: \begin{acknowledge}

938:

939:

940: We thank the San Diego Supercomputer Center and the National Center

941: for Supercomputing Applications for providing access to their

942: machines.  We also thank the members of the Globus development

943: team for their support, patience, and many ideas. This work was

944: supported in part by the Mathematical, Information, and Computational

945: Sciences Division subprogram of the Office of Advanced Scientific

946: Computing Research, U.S. Department of Energy, under Contract

947: W-31-109-Eng-38; by the U.S. Department of Energy under

948: Cooperative Agreement No. DE-FC02-99ER25398;

949: by the National Science Foundation; by DARPA; and by

950: the NASA Information Power Grid program.

951:

952: \end{acknowledge}

953:

954:

955: %

956: \bibliographystyle{plain}

957: \bibliography{foster_bibliography,mpi-allbib,mpi-book,globus,prop}

958:

959:

960: %

961:

962: Nicholas T. Karonis received a B.S. in finance and a B.S. in computer

963: science from Northern Illinois University in 1985, an M.S. in computer

964: science from Northern Illinois University in 1987, and a Ph.D. in

965: computer science from Syracuse University in 1992.  He spent

966: summers from 1981 to 1991 as a student at Argonne National Laboratory

967: where he worked on the p4 message-passing library, automated reasoning,

968: and genetic sequence allignment.  From 1991 to 1995 he worked on the control

969: system at Argonne's Advanced Photon Source and from 1995 to 1996

970: for the Computing Division at Fermi National Accelerator Laboratory.

971: Since 1996 he has been an assistant professor of computer science at

972: Northern Illinois University and a resident associate guest of Argonne's

973: Mathematics and Computer Science Division where he has been a member

974: of the Globus Project.  His current research interest is message-passing

975: systems in computational Grids.

976:

977: Bronis R. de Supinski is a computer scientist in the Center

978: for Applied Scientific Computing at Lawrence Livermore

979: National Laboratory. His research interests include

980: message passing implementations and tools, memory performance

981: improvement, cache coherence and distributed shared memory,

982: consistency semantics and performance evaluation modeling

983: and tools. Bronis earned his Ph.D. in computer science

984: from the University of Virginia in 1998. He is a

985: member of the ACM and the IEEE Computer Society.

986:

987:

988: Ian Foster received his B.Sc. (Hons I) at the University of Canterbury

989: in 1979 and his Ph.D. from Imperial College, London, in 1998. He

990: is senior scientist and associate director of the

991: Mathematics and Computer Science Division at Argonne National

992: Laboratory, and professor of computer science at the University of

993: Chicago.  He has published four books and over 150 papers and

994: technical reports.  He co-leads the Globus Project, which provides

995: protocols and services used by industrial and academic distributed

996: computing projects worldwide.  He co-founded the influential Global

997: Grid Forum and co-edited the book ``The Grid: Blueprint for a New

998: Computing Infrastructure.''

999:

1000:

1001: William Gropp received his B.S. in mathematics from Case Western Reserve

1002: University in 1977, a an M.S. in physics from the University of Washington in

1003: 1978, and a Ph.D. in computer science from Stanford in 1982. He held the

1004: positions of assistant (1982-1988) and associate (1988-1990) professor in

1005: the Computer Science Department at Yale University. In 1990, he joined the

1006: numerical analysis group at Argonne, where he is a senior computer

1007: scientist and associate director of the Mathematics and Computer Science

1008: Division, a senior scientist in the Department of Computer Science at the

1009: University of Chicago, and a Senior Fellow in the Argonne-University of Chicago

1010: Computation Institute. His research interests are in parallel computing,

1011: software for scientific computing, and numerical methods for partial

1012: differential equations. He has played a major role in the development of

1013: the MPI message-passing standard. He is co-author of MPICH, the most widely used

1014: implementation of MPI, and was involved in the MPI Forum as a

1015: chapter author for both MPI-1 and MPI-2. He has written many books and

1016: papers on MPI including "Using MPI" and "Using MPI-2". He is also one of

1017: the designers of the PETSc parallel numerical library, and has developed

1018: efficient and scalable parallel algorithms for the solution of linear and

1019: nonlinear equations.

1020:

1021:

1022: Ewing Lusk received his B.A. in mathematics from the University of Notre Dame

1023: in 1965 and his Ph.D. in mathematics from the University of Maryland in 1970.

1024: He is currently a senior computer scientist in the Mathematics and Computer

1025: Science Division at Argonne National Laboratory.  His current projects include

1026: implementation of the MPI message-passing standard, research into programming

1027: models for parallel architectures, and parallel performance analysis tools.

1028: He is a leading member of the team responsible for MPICH implementation of the

1029: MPI message-passing interface standard.  He is the author of five books and

1030: more than seventy-five research articles in mathematics, automated deduction,

1031: and parallel computing.

1032:

1033:

1034: Sebastien Lacour graduated in physics in 1999 at the Ecole

1035: Normale Superieure of Lyon, France. He received his master's

1036: degree in computer science in 2002 at IFSIC, University of

1037: Rennes, France.  He is currently a Ph.D. student at IRISA/INRIA in

1038: Rennes.  His research interests include networks, compilation, and

1039: parallel and distributed systems.  His current work focuses on

1040: distributed shared-memory systems over large-scale, hierarchical

1041: architectures (multicluster platforms).

1042:

1043:

1044: \end{document}

1045: