1: \documentclass[jpdc]{apjrnl}

2: \usepackage{graphics}

3: \listfiles

4:

5:

6: %

7: \begin{document}

8:

9: \title{MPICH-G2: A Grid-Enabled Implementation of the Message Passing

10:        Interface}

11:

12: \author{Nicholas T. Karonis

13: \\Department of Computer Science

14: \\Northern Illinois University

15: \\DeKalb, IL~~60115

16: \\Argonne National Laboratory

17: \\Argonne, IL~~60439

18: \\Email: karonis@niu.edu

19: \and

20: Brian Toonen

21: \\Mathematics and Computer Science Division

22: \\Argonne National Laboratory

23: \\Argonne, IL~~60439

24: \\Email: toonen@mcs.anl.gov

25: \and

26: Ian Foster

27: \\Argonne National Laboratory

28: \\Argonne, IL~~60439

29: \\The University of Chicago

30: \\Chicago, IL~~60637

31: \\Email: foster@mcs.anl.gov

32: }

33:

34: \date{April 2002}

35:

36: \maketitle

37:

38: Proposed running head: MPICH-G2: A Grid-Enabled MPI

39:

40: \pagebreak

41:

42:

43: %

44: %

45: %

46: \begin{abstract}

47:

48: Application development for distributed computing ``Grids'' can

49: benefit from tools that variously hide or enable application-level

50: management of critical aspects of the heterogeneous environment. As

51: part of an investigation of these issues, we have developed

52: \hbox{MPICH-G2}, a Grid-enabled implementation of the Message Passing

53: Interface (MPI) that allows a user to run MPI programs across multiple

54: computers, at the same or different sites, using the same commands that

55: would be used on a parallel computer.  This library extends the

56: Argonne MPICH implementation of MPI to use services provided by the

57: Globus Toolkit for authentication, authorization, resource allocation,

58: executable staging, and I/O, as well as for process creation,

59: monitoring, and control.  Various performance-critical operations,

60: including startup and collective operations, are configured to exploit

61: network topology information. The library also exploits MPI constructs

62: for performance management; for example, the MPI communicator

63: construct is used for application-level discovery of, and

64: adaptation to, both network topology and network \hbox{quality-of-service}

65: mechanisms.  We describe the \hbox{MPICH-G2} design and

66: implementation, present performance results, and review application

67: experiences, including record-setting distributed simulations.

68:

69: \end{abstract}

70:

71: \begin{keywords}

72: MPI, Grid computing, message passing, Globus Toolkit, \hbox{MPICH-G2}

73: \end{keywords}

74:

75: \pagebreak

76:

77:

78: %

79: %

80: %

81: \section{Introduction}

82:

83: So-called computational Grids~\cite{GridBook,PhysicsToday} enable the

84: coupling and coordinated use of geographically distributed resources

85: for such purposes as large-scale computation, distributed data

86: analysis, and remote visualization. The development or

87: adaptation of applications for Grid environments is made challenging,

88: however, by the often heterogeneous nature of the resources involved and the

89: facts that these resources typically live in different

90: administrative domains, run different software, are subject to

91: different access control policies, and may be connected by networks

92: with widely varying performance characteristics.

93:

94: Such concerns have motivated various explorations of specialized, often

95: high-level, distributed programming models for Grid environments,

96: including various forms of object

97: systems~\cite{grim:legion,GannonChapterCite}, Web

98: technologies~\cite{FoxChapterCite,webos}, problem solving

99: environments~\cite{NetSolve,Ninf}, CORBA, workflow systems,

100: high-throughput computing systems~\cite{Nimrod,Condor}, and

101: compiler-based systems~\cite{KennedyChapterCite}.

102:

103: In contrast, we explore here a different approach that might appear

104: reactionary in its simplicity but that, in fact, delivers a remarkably

105: sophisticated technology for managing the heterogeneity associated

106: with Grid environments. Specifically, we advocate the use of a

107: well-known low-level parallel programming model, the Message Passing

108: Interface (MPI), as a basis for Grid programming.  While not a

109: high-level programming model by any means, MPI incorporates

110: sophisticated support for the management of heterogeneity (e.g., data

111: types), for the construction of modular programs (the communicator

112: construct), for management of latency (asynchronous operations), and

113: for the representation of global operations (collective

114: operations). These and other features have allowed MPI to achieve

115: tremendous success as a standard programming model for parallel

116: computers. We hypothesize that these same features can also be used to

117: good effect for Grid computing.

118:

119: Our investigation of MPI as a Grid programming model has focused on

120: three related questions.  First, can we implement MPI constructs

121: efficiently in Grid environments to {\em hide} heterogeneity

122: without introducing overhead?  Second, can we use MPI constructs to

123: enable users to {\em manage} heterogeneity, when this is required?

124: Third, do users find MPI useful in practice for application

125: development?

126:

127: To allow for the experimental exploration of these questions,

128: we have developed \hbox{MPICH-G2}, a complete implementation of the

129: MPI-1 standard~\cite{mpi-forum:journal} that uses services provided by

130: the Globus Toolkit$^{TM}$~\cite{GlobusHCW98} to extend the popular Argonne

131: MPICH implementation of MPI~\cite{mpich} for Grid

132: execution. \hbox{MPICH-G2} passes the MPICH test suite and represents

133: a complete redesign and reimplementation of the earlier \hbox{MPICH-G}

134: system~\cite{mpi-nexus-pc} that increases performance significantly

135: and incorporates a number of innovations.

136: Our experiences with \hbox{MPICH-G2}, as reported in this article,

137: allow us to respond in the affirmative to each question posed in the

138: preceding paragraph.

139:

140: MPICH-G2 hides heterogeneity by using Globus Toolkit services

141: for such purposes as authentication, authorization, executable

142: staging, process creation, process monitoring, process control,

143: communication, redirection of standard input and output, and remote

144: file access. The result is that a user can run MPI programs across

145: multiple computers at different sites using the same commands that

146: would be used on a parallel computer.  Furthermore, performance

147: studies show that overheads relative to native implementations of

148: basic communication functions are negligible.

149:

150: MPICH-G2 enables the use of several different MPI features for

151: user management of heterogeneity. MPI's asynchronous operations can be

152: used for latency management in wide-area networks. MPI's communicator

153: construct can be used to represent the hierarchical structure of

154: heterogeneous systems and thus allow applications to adapt their

155: behavior to such structures. (In separate work, we present

156: topology-aware collective operations as one example of an

157: ``application''~\cite{optcollops}.) We also show how MPI's

158: communicator construct can be used for user-level management of

159: network quality of service, as first introduced in an earlier

160: article~\cite{mpich-gq}.

161:

162: Many groups have used \hbox{MPICH-G2} for the execution of

163: both traditional parallel computing applications (e.g., numerical

164: simulation) and nontraditional distributed computing applications

165: (e.g., distributed visualization), in both local-area and wide-area

166: networks. This variety of applications and execution environments

167: persuades us that MPI can play a valuable role in Grid computing.

168:

169: MPICH-G2 is not the only implementation of MPI for

170: heterogeneous systems.  Others include MPICH with the ch\_p4 device

171: (which provides

172: limited support for heterogeneity), PACX-MPI~\cite{pacx},

173: and STAMPI~\cite{stampi}, each of which has

174: interesting features, as we discuss later. Magpie~\cite{magpie},

175: IMPI~\cite{impi-web}, and PVM~\cite{pvmbook}

176: also address relevant issues.

177: \hbox{MPICH-G2} is

178: unique, however, in the degree to which it hides and manages heterogeneity, as

179: well as in its large user community.

180:

181: In the rest of this article, we describe the problems that we faced in

182: developing \hbox{MPICH-G2}, the techniques used to overcome these

183: problems, and experimental results that indicate the performance of

184: the \hbox{MPICH-G2} implementation and the extent of its improvement

185: over \hbox{MPICH-G}. We conclude with a discussion of application

186: experiments and future directions.

187:

188:

189: %

190: %

191: %

192: \section{Background}

193:

194: We first provide some brief background on MPI, MPICH, and the Globus

195: Toolkit.

196:

197:

198: %

199: \subsection{Message Passing Interface}

200:

201: The Message Passing Interface standard defines a library of

202: routines that implement the message-passing model. These routines

203: include {\em point-to-point} communication functions, in which a {\em

204: send} operation is used to initiate a data transfer between two

205: concurrently executing program components and a matching {\em receive}

206: operation is used to extract that data from system data structures

207: into application memory space; and {\em collective} operations such as

208: broadcast and reductions that implement operations involving multiple

209: processes. Numerous other functions address other aspects of message

210: passing, including, in the MPI-2 extensions to

211: MPI~\cite{mpi-forum:mpi2-journal}, single-sided communication and

212: dynamic process creation.

213:

214: The primary interest of MPI from our perspective, apart from its broad

215: adoption, is the care taken in its design to ensure that underlying

216: performance issues are accessible to, not masked from, the programmer.

217: MPI mechanisms such as asynchronous operations, communicators,

218: and collective operations all turn out to be useful in Grid environments.

219:

220:

221: %

222: \subsection{MPICH Architecture}

223:

224: MPICH~\cite{gropp-lusk-doss-skjellum:mpich} is a popular

225: implementation of the Message Passing Interface standard.  It is a

226: high-performance, highly portable library originally developed as a

227: collaborative effort between Argonne National Laboratory and

228: Mississippi State University.  Argonne continues

229: research and development efforts aimed at improving MPICH performance

230: and functionality.

231:

232: In its present form, MPICH is a complete implementation of the MPI-1

233: standard with extensions to support the parallel I/O functionality

234: defined in the MPI-2 standard.  It is a mature, widely distributed

235: library, with more than 2,000 downloads per month, not including

236: downloads that occur at mirror sites.  Its free distribution and wide

237: portability have contributed materially to the adoption of the MPI

238: standard by the parallel computing community.

239:

240: MPICH derives its portability from its interfaces and layered

241: architecture.  At the top is the MPI interface as defined by the MPI

242: standards.  Directly beneath this interface is the MPICH layer, which

243: implements the MPI interface.  Much of the code in an MPI

244: implementation is independent of the networking device or process

245: management system.  This code, which includes error checking and

246: various manipulations of the opaque objects, is implemented directly

247: at the MPICH layer.  All other functionality is passed off to lower

248: layers by means of the Abstract Device Interface (ADI).

249:

250: The ADI is a simpler interface than MPI proper

251: and focuses on moving data between the MPI layer and the network

252: subsystem.  Those interested in implementing MPI for a particular

253: platform need only define the routines in the ADI in order to obtain a

254: full implementation.  Existing implementations of this device

255: interface for various MPPs, SMPs, and networks provide complete MPI

256: functionality in a wide variety of environments.  \hbox{MPICH-G2} is

257: another implementation of the ADI and is otherwise known as the {\em

258: globus2} device.

259:

260: %

261: %

262: %

263: %

264: %

265: %

266: %

267: %

268: %

269: %

270: %

271: %

272: %

273: %

274: %

275: %

276:

277:

278: %

279: \subsection{The Globus Toolkit}

280:

281: The Globus Toolkit is a collection of software components designed to

282: support the development of applications for high-performance

283: distributed computing environments, or

284: ``Grids''~\cite{GlobusHCW98,GridBook}.

285: Core components typically define a protocol for interacting with a

286: remote resource, plus an application program interface (API) used to

287: invoke that protocol.  (We introduce the protocols and APIs used

288: within \hbox{MPICH-G2} below.)  Higher-level libraries, services,

289: tools, and applications use core services to implement more complex

290: global functionality.

291: The various Globus Toolkit

292: components are reviewed in~\cite{Anatomy} and described in detail in

293: online documentation and in technical papers.

294:

295:

296: %

297: %

298: %

299: \section{MPICH-G2: A Grid-Enabled MPI}

300:

301: As noted in the introduction, \hbox{MPICH-G2} is a complete

302: implementation of the MPI-1 standard that uses Globus Toolkit services

303: to support efficient and transparent execution in heterogeneous Grid

304: environments, while also allowing for application management of

305: heterogeneity.  (It also implements client/server management functions

306: found in Section 5.4 of the MPI-2

307: standard~\cite{mpi-forum:mpi2-journal}. However, we do not discuss

308: these functions here.)

309:

310: In this section, we first describe the techniques used to hide

311: heterogeneity during startup and for process management, then the

312: techniques used to effect communication in heterogeneous systems,

313: and finally the support provided for application-level management

314: of heterogeneity.

315:

316:

317: %

318: \subsection{Hiding Heterogeneity during Startup and Management}

319: \label{sec-startup}

320:

321: As illustrated in Figure~\ref{fig-g} and discussed here,

322: MPICH-G2 uses a range of Globus Toolkit services to address the

323: various complex issues that arise in

324: heterogeneous, multisite Grid environments, such as cross-site

325: authentication, the need to deal with multiple schedulers with

326: different characteristics, coordinated process creation, heterogeneous

327: communication structures, executable staging, and collation of

328: standard output. In fact, \hbox{MPICH-G2} serves as an exemplary

329: case study of how Globus Toolkit mechanisms can be used to create

330: a Grid-enabled programming tool, as we now explain.

331:

332: \begin{figure}

333: \begin{center}

334: \resizebox{4.5in}{!}{\includegraphics{g2_fig.eps}}

335: \end{center}

336: \caption{Schematic of the MPICH-G2 startup, showing the various

337: Globus Toolkit components used to hide and manage heterogeneity. ``Fork,''

338: ``LSF,'' and ``LoadLeveler'' are different local schedulers.}

339: \label{fig-g}

340: \end{figure}

341:

342: Prior to startup of an \hbox{MPICH-G2} application, the user employs

343: the {\em Grid Security Infrastructure} (GSI)~\cite{GlobusSecurity} to

344: obtain a (public key) proxy credential that is used to authenticate

345: the user to each remote site. This step provides a single sign-on

346: capability.

347:

348: The user may also use the Monitoring and Discovery Service

349: (MDS)~\cite{mds97} to select computers on the basis of, for example,

350: configuration, availability, and network connectivity.

351:

352: Once authenticated, the user invokes the standard {\tt mpirun} command to

353: request the creation of an MPI computation. The \hbox{MPICH-G2}

354: implementation of this command uses the {\em Resource Specification

355: Language~(RSL)}~\cite{GRAM97} to describe the job.  In brief,

356: users write {\em RSL scripts}, which identify resources

357: (e.g., computers) and specify requirements (e.g., number of CPUs, memory,

358: and execution time) and parameters (e.g., location of executables, command

359: line arguments, and environment variables) for each.  Based on the

360: information found in an RSL script, \hbox{MPICH-G2}

361: calls a {\em co-allocation library}

362: distributed with the Globus Toolkit, the Dynamically-Updated Request

363: Online Coallocator (DUROC)~\cite{CoAllocation99}, to schedule and

364: start the application across the various computers specified by the

365: user.

366:

367: The DUROC library itself uses the {\em Grid Resource Allocation and

368: Management} (GRAM)~\cite{GRAM97} API and protocol to start and

369: subsequently manage a set of subcomputations, one for each

370: computer. For each subcomputation, DUROC generates a GRAM request to a

371: remote GRAM server, which authenticates the user, performs local

372: authorization, and then interacts with the local scheduler to initiate

373: the computation. DUROC and associated \hbox{MPICH-G2} libraries tie

374: the various subcomputations together into a single MPI computation.

375:

376: GRAM will, if directed, use {\em Global Access to Secondary Storage}

377: (GASS)~\cite{GASS99} to stage executable(s) from remote locations

378: (indicated by URLs).  GASS is also used, once an application has

379: started, to direct standard output and error (stdout and stderr)

380: streams to the user's terminal, and to provide access to files

381: regardless of location, thus masking essentially all aspects of

382: geographical distribution except those associated with performance.

383:

384: Once the application has started, \hbox{MPICH-G2} selects the most

385: efficient communication method possible between any two processes,

386: using vendor-supplied MPI ({\em v}MPI) if available, or {\em Globus

387: communication} (Globus~IO) with {\em Globus Data Conversion}

388: (Globus~DC) for TCP, otherwise.

389:

390: %

391: %

392: %

393: %

394: %

395: %

396: %

397: %

398: %

399: %

400: %

401:

402: DUROC and GRAM also interact to monitor and manage the execution of the

403: application.  Each GRAM server monitors the life cycle of its subcomputation

404: as it passes from pending to running and then to terminating, communicating each

405: state transition back to DUROC.  Each subcomputation is held at a

406: DUROC-controlled barrier and is released from that barrier only after

407: all subcomputations have started executing.  Also, a request to terminate

408: the computation (``control C'') may be initiated by the user, at which

409: time DUROC and the GRAM servers, communicating via GRAM process control

410: messages, terminate all processes.
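
The barrier rule described above can be sketched in a few lines of C. This is only an illustration of the release condition, not DUROC code: the type \verb+sub_state+ and the function \verb+barrier_may_release+ are hypothetical names, and the three states are simply those named in the text.

```c
#include <stddef.h>

/* Life-cycle states named in the text; the identifiers are hypothetical. */
typedef enum { SUB_PENDING, SUB_RUNNING, SUB_TERMINATING } sub_state;

/* Returns 1 when every subcomputation has reached SUB_RUNNING, which is
 * the condition under which the barrier described above would release;
 * returns 0 otherwise (including for an empty job). */
int barrier_may_release(const sub_state *states, size_t n)
{
    if (n == 0)
        return 0;
    for (size_t i = 0; i < n; i++) {
        if (states[i] != SUB_RUNNING)
            return 0;
    }
    return 1;
}
```

With three subcomputations of which one is still pending, the function returns 0; once all three report running, it returns 1.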

411:

412: \begin{figure}

413: \begin{center}

414: \includegraphics{grid.eps}

415: \end{center}

416: \caption{An example of an \hbox{MPICH-G2} application running on a

417:     computational grid involving 4 processes on an IBM~SP at Site A and 8

418:     processes distributed evenly across two Linux clusters at Site B.}

419: \label{fig-grid}

420: \end{figure}

421:

422: \begin{figure}

423: \begin{center}

424: \begin{tabular}{|cc|cccccccccccc|}                           \hline

425: \multicolumn{2}{|c|} {\em Rank}    & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 \\ \hline \hline

426: \multicolumn{2}{|c|} {\em Depth}   & 4 & 4 & 4 & 4 & 3 & 3 & 3 & 3 & 3 & 3 &  3 &  3 \\ \hline

427:                      & wide area   & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0  &  0 \\

428: {\em Colors}         & local area  & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 &  1 &  1 \\

429:                      & system area & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 2 & 2 &  2 &  2 \\

430:                      & {\em v}MPI  & 0 & 0 & 0 & 0 &   &   &   &   &   &   &    &    \\ \hline

431: \end{tabular}

432: \end{center}

433: \caption{An example of {\em depths} and {\em colors} used by \hbox{MPICH-G2}

434:     to represent network topology in a computational grid.}

435: \label{fig-depths-colors}

436: \end{figure}

437:

438: After the processes have started, \hbox{MPICH-G2} uses information

439: specified in the RSL script to create {\em multilevel clustering} of the

440: processes based on the underlying network topology.

441: Figure~\ref{fig-grid}

442: depicts an MPI application involving 12 processes distributed across

443: three machines located at two sites.  We depict 4 processes

444: (\verb+MPI_COMM_WORLD+ ranks 0-3) on the IBM~SP at Site~A and

445: 4 processes on each of two Linux clusters (\verb+MPI_COMM_WORLD+ ranks 4-7

446: and 8-11, respectively) at Site~B.

447: Each process in \verb+MPI_COMM_WORLD+ is assigned a {\em topology depth}.

448: Processes

449: that communicate using only TCP are assigned topology depths of 3

450: (to distinguish between wide area, local area, and intramachine TCP

451: messaging), and processes

452: that can also communicate using a {\em v}MPI have a topology

453: depth of 4.

454: Using these topology depths, \hbox{MPICH-G2} groups processes at a particular

455: level through the assignment of {\em colors}.  Two processes are assigned

456: the same color at a particular level if they can communicate with each

457: other at that network level.

458:

459: Figure~\ref{fig-depths-colors} depicts the {\em topology depths}

460: and {\em colors} for the processes depicted in Figure~\ref{fig-grid}.  Those

461: processes capable of communicating over {\em v}MPI (i.e., those executing

462: on the IBM~SP) have a depth of 4, while the other processes (i.e., those

463: executing on a Linux cluster) have a depth of 3.  Since all processes are on

464: the same wide-area network, they all have the same {\em color} (0) at the

465: wide-area level.  Similarly, at the local-area level, all the processes

466: at Site A are assigned one color (0), while all the processes at Site B

467: are assigned another (1).  This structure continues through the system-area level,

468: where processes are assigned the same color if and only if they are

469: on the same machine. Finally, processes that can communicate

470: over a {\em v}MPI are assigned the same color at the {\em v}MPI

471: level if and only if they can communicate directly with each other over

472: the {\em v}MPI.
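
To make the depth and color encoding concrete, the following minimal sketch finds the deepest level at which two processes share a color, that is, the closest network level over which they can reach each other. It assumes each process holds one color per level, indexed 0 to depth$-$1 in the row order of Figure~\ref{fig-depths-colors}; the function name \verb+deepest_shared_level+ and the level constants are hypothetical and are not part of \hbox{MPICH-G2}.

```c
/* Level indices follow the rows of the colors table: 0 = wide area,
 * 1 = local area, 2 = system area, 3 = vMPI. */
enum { WIDE_AREA = 0, LOCAL_AREA = 1, SYSTEM_AREA = 2, VMPI = 3 };

/* Returns the deepest level at which two processes share a color, or -1
 * if they differ even at the wide-area level. A process has colors only
 * for levels 0 .. depth-1. */
int deepest_shared_level(const int *colors_a, int depth_a,
                         const int *colors_b, int depth_b)
{
    int limit = depth_a < depth_b ? depth_a : depth_b;
    int level = -1;
    for (int i = 0; i < limit; i++) {
        if (colors_a[i] != colors_b[i])
            break;          /* the processes diverge below this level */
        level = i;
    }
    return level;
}
```

Applied to the values in Figure~\ref{fig-depths-colors}, ranks 0 and 1 share the {\em v}MPI level (3), ranks 4 and 5 share the system-area level (2), ranks 4 and 8 share only the local-area level (1), and ranks 0 and 4 share only the wide-area level (0).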

473:

474: Topology depths and colors are used in the multilevel topology-aware

475: collective operations and topology-discovery mechanism described in

476: Sections~\ref{g2-impl} and~\ref{g2-mgmt}, respectively.

477:

478:

479: %

480: \subsection{Heterogeneous Communications}

481: \label{g2-impl}

482:

483: MPICH-G2 achieves major performance improvements relative to

484: the earlier \hbox{MPICH-G}~\cite{mpi-nexus-pc} by replacing

485: Nexus~\cite{JPDCNexus}, the multimethod, single-sided communication library used for

486: all communication in \hbox{MPICH-G}, with specialized MPICH-specific

487: communication code.  While Nexus has attractive features (e.g.,

488: multiprotocol support with highly tuned TCP support and automatic data

489: conversion), other attributes have proved less

490: attractive from a performance perspective. \hbox{MPICH-G2} now handles

491: all communication directly, reimplementing the attractive features of

492: Nexus and improving on the others.  The result, as we show in

493: Section~\ref{sec-exp}, is that we achieve performance virtually

494: identical to vendor MPI and MPICH configured with the default TCP

495: (ch\_p4) device.  We provide here

496: a detailed description of the improvements and additions to

497: \hbox{MPICH-G} used to achieve this impressive performance.

498:

499: \paragraph{Increased bandwidth.}

500: In \hbox{MPICH-G}, each communication involved the copying of data

501: to and from Nexus buffers in sending and receiving processes.

502: \hbox{MPICH-G2} eliminates these two extra copies in the case of

503: intramachine messages where a vendor MPI exists. In this situation,

504: sends and receives now flow directly from and to application buffers, respectively.  In

505: addition, for TCP

506: messaging involving basic MPI datatypes (e.g., \verb+MPI_INT+,

507: \verb+MPI_FLOAT+) the sending process also transmits directly

508: from the application buffer.

509:

510: \paragraph{Reduced latency for intramachine vendor MPI messaging.}

511: Multiprotocol support is achieved in Nexus by polling each protocol

512: (TCP, vendor MPI, etc.) for incoming messages in a round-robin

513: fashion~\cite{MultiMethodJPDC}. However, this strategy is inefficient

514: in many situations: it is relatively expensive to poll a TCP socket,

515: and in practice it is often the case that many processes in an

516: \hbox{MPICH-G2} computation use only vendor MPI (for communicating with

517: other processes on the same machine).

518:

519: While this inefficiency can be reduced by adaptive

520: polling~\cite{MultiMethodJPDC} or by introducing distinct

521: proxy processes~\cite{pacx,stampi}, \hbox{MPICH-G2} takes a more direct

522: approach, exploiting the knowledge about message source that is

523: provided by TCP receive commands to eliminate TCP polling altogether

524: in many situations.  \hbox{MPICH-G2} polls TCP {\em only} when the

525: application is expecting data from a source that dictates, or might

526: dictate (e.g., \verb+MPI_Recv+ specifies source=\verb+MPI_ANY_SOURCE+),

527: TCP messaging.

528:

529: This avoidance of unnecessary polling, when coupled with the need to

530: guarantee progress on both the vendor MPI and TCP protocols, leads to

531: implementation decisions that can affect an application's

532: point-to-point communication performance.  Specifically, for processes

533: executing on machines where a vendor MPI is available, the context in

534: which the application calls \verb+MPI_Recv+ affects the manner in

535: which \hbox{MPICH-G2} implements that function, as follows:

536:

537: \begin{itemize}

538: \item {\bf Specified.}

539: The source rank specified in the call to \verb+MPI_Recv+ explicitly

540: identifies a process on the same machine (in the same vendor MPI job).

541: Furthermore, no asynchronous requests are outstanding (e.g., incomplete

542: \verb+MPI_Irecv+ and/or \verb+MPI_Isend+).  If these two conditions

543: are met, \hbox{MPICH-G2} implements \verb+MPI_Recv+ by directly

544: calling the \verb+MPI_Recv+ of the underlying vendor MPI.  This is the

545: most favorable circumstance under which an \verb+MPI_Recv+ can be

546: performed.

547:

548: \item {\bf Specified-pending.}

549: This category is similar to the {\em specified} category in that the

550: \verb+MPI_Recv+ specifies an explicit source rank on the same machine.

551: This time, however, one or more unsatisfied receive requests are

552: present, and each such request specifies a source on the same machine.

553: This situation forces \hbox{MPICH-G2} to continuously poll

554: (\verb+MPI_Iprobe+) the vendor MPI for incoming messages.  This

555: scenario results in less efficient MPICH-G2 performance since the

556: induced polling loop increases latency.

557:

558: \item {\bf Multimethod.}

559: Here the source rank for the \verb+MPI_Recv+ is \verb+MPI_ANY_SOURCE+

560: or \verb+MPI_Recv+ is called in the presence of unsatisfied

561: asynchronous requests that require, or might require, TCP messaging.

562: In this situation, \hbox{MPICH-G2} must poll both TCP and the

563: vendor MPI continuously.  This is the least efficient \hbox{MPICH-G2} scenario, since the relatively large cost of TCP polling results in even greater

564: latency.

565: \end{itemize}

566: In Section~\ref{sec-exp}, we present a quantitative analysis of the performance differences that

567: result from these different structures.
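
The three categories above amount to a small decision procedure, sketched below under stated assumptions. The types \verb+recv_mode+ and \verb+recv_context+ and the function \verb+classify_recv+ are hypothetical and do not appear in \hbox{MPICH-G2}; the sketch also assumes that outstanding requests can be counted separately according to whether they might require TCP. A receive from an explicit source outside the vendor MPI job is not covered by the list above, so the sketch reports it as unclassified rather than guessing.

```c
typedef enum {
    RECV_SPECIFIED,          /* call the vendor MPI receive directly */
    RECV_SPECIFIED_PENDING,  /* poll the vendor MPI only */
    RECV_MULTIMETHOD,        /* poll both TCP and the vendor MPI */
    RECV_UNCLASSIFIED        /* explicit remote source: not in the list */
} recv_mode;

/* Hypothetical description of the context of one blocking receive. */
typedef struct {
    int any_source;        /* nonzero if the source is MPI_ANY_SOURCE */
    int source_is_local;   /* nonzero if the explicit source is in the
                              same vendor MPI job */
    int pending_local;     /* outstanding requests that involve only
                              the vendor MPI */
    int pending_maybe_tcp; /* outstanding requests that require, or
                              might require, TCP */
} recv_context;

recv_mode classify_recv(const recv_context *c)
{
    /* Any possibility of TCP traffic forces polling of both methods. */
    if (c->any_source || c->pending_maybe_tcp > 0)
        return RECV_MULTIMETHOD;
    if (!c->source_is_local)
        return RECV_UNCLASSIFIED;
    /* Local source, but other local requests must still make progress. */
    if (c->pending_local > 0)
        return RECV_SPECIFIED_PENDING;
    return RECV_SPECIFIED;
}
```

Checking for possible TCP traffic first mirrors the ordering of costs in the text: the multimethod case dominates whenever it applies, regardless of the source named in the receive.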

568:

569: \paragraph{More efficient use of sockets.}

570: The Nexus single-sided communication paradigm resulted in \hbox{MPICH-G}

571: opening {\em two pairs of

572: sockets} between communicating processes and using each pair as a

573: simplex channel (i.e., data always flowing in one direction over each socket

574: pair). \hbox{MPICH-G2} opens a {\em single pair of sockets} between two

575: processes and sends data in both directions.  This approach reduces the use

576: of system resources; moreover, by using sockets in the bidirectional manner in which

577: they were intended, it also improves TCP efficiency.

578:

579: \paragraph{Multilevel topology-aware collective operations.}

580: Early implementations of MPI's collective operations sought to

581: construct communication structures that were optimal under the

582: assumption that all processes were equidistant from one

583: another~\cite{postal,logp}. Since this assumption is unlikely to be valid in

584: Grid environments, however, it is desirable that a Grid-enabled MPI

585: incorporate collective operation implementations that take into

586: account the actual topology. \hbox{MPICH-G2} does this, and we have

587: demonstrated substantial performance improvements for our {\em

588: multilevel topology-aware} approach~\cite{optcollops} relative both to

589: topology-{\em un}aware binomial trees and earlier topology-aware

590: approaches that distinguish only between ``intracluster'' and

591: ``intercluster'' communications~\cite{starT,magpie}.
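
A small calculation illustrates why topology awareness matters for a broadcast. The sketch below is not the multilevel algorithm of~\cite{optcollops}; it only counts wide-area messages for two simplified schemes. It assumes one particular binomial tree (in round $k$, each rank $r < 2^k$ sends to rank $r + 2^k$) and a site-aware scheme in which the root sends exactly one message to each remote site. Both function names are hypothetical.

```c
/* Counts tree edges whose endpoints lie at different sites for a
 * topology-unaware binomial broadcast from rank 0: in round k, each rank
 * r < 2^k that already holds the data sends to rank r + 2^k. */
int binomial_cross_site_edges(const int *site, int n)
{
    int crossings = 0;
    for (int step = 1; step < n; step <<= 1) {
        for (int r = 0; r < step && r + step < n; r++) {
            if (site[r] != site[r + step])
                crossings++;
        }
    }
    return crossings;
}

/* Counts wide-area messages when rank 0 sends once to a single
 * representative per remote site and every other message stays inside a
 * site: the number of distinct site colors other than the root's. */
int site_aware_cross_site_edges(const int *site, int n)
{
    int crossings = 0;
    for (int i = 1; i < n; i++) {
        int seen = (site[i] == site[0]);
        for (int j = 1; j < i && !seen; j++)
            seen = (site[j] == site[i]);
        if (!seen)
            crossings++;
    }
    return crossings;
}
```

For the 12 processes of Figure~\ref{fig-grid}, with the local-area colors of Figure~\ref{fig-depths-colors} used as site identifiers, this binomial tree sends 8 messages across the wide area, whereas the site-aware scheme sends 1.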

592:

593: As we explain in the next subsection, \hbox{MPICH-G2}'s topology-aware

594: collective operations are constructed in terms of topology discovery

595: mechanisms that can also be used by topology-aware applications.

596:

597:

598: %

599: \subsection{Application-Level Management of Heterogeneity}

600: \label{g2-mgmt}

601:

602: We have experimented within MPICH-G2 with a variety of mechanisms for

603: application-level management of heterogeneity in the underlying

604: platform. We mention two here.

605:

606: \paragraph{Topology discovery.}

607: Once an MPI program starts, all processes can be viewed as equivalent,

608: distinguished only by their rank. This level of abstraction is

609: desirable from a programming viewpoint but makes it difficult to

610: write programs that exploit aspects of the underlying physical

611: topology, for example, to minimize expensive intercluster

612: communications.

613:

614: MPICH-G2 addresses this issue {\em within the standard MPI framework}

615: by using the MPI communicator construct to deliver topology

616: information to an application.

617: It associates {\em attributes} with each MPI

618: communicator to communicate this topology information, which is expressed

619: within each process in terms of {\em topology depths} and

620: {\em colors}, as described in Section~\ref{sec-startup}.

621:

622: \begin{figure}

623: \begin{small}

624: \begin{verbatim}

#include <mpi.h>

int main(int argc, char *argv[])
{
  int me, flag;
  int *depths;     /* depths[r]: topology depth of rank r */
  int **colors;    /* colors[r][l]: color of rank r at level l */
  MPI_Comm LANcomm, VcommA, VcommB;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Attr_get(MPI_COMM_WORLD, MPICHX_TOPOLOGY_DEPTHS, &depths, &flag);
  MPI_Attr_get(MPI_COMM_WORLD, MPICHX_TOPOLOGY_COLORS, &colors, &flag);

  /* group processes by site */
  MPI_Comm_split(MPI_COMM_WORLD, colors[me][1], 0, &LANcomm);
  /* group by vMPI; processes without vMPI share one communicator */
  MPI_Comm_split(MPI_COMM_WORLD, (depths[me] == 4 ? colors[me][3] : -1),
                 0, &VcommA);
  /* group by vMPI; processes without vMPI receive MPI_COMM_NULL */
  MPI_Comm_split(MPI_COMM_WORLD,
                 (depths[me] == 4 ? colors[me][3] : MPI_UNDEFINED),
                 0, &VcommB);

  MPI_Finalize();
  return 0;
}

648: \end{verbatim}

649: \end{small}

650: \caption{An example \hbox{MPICH-G2} application that uses {\em topology depths}

651:     and {\em colors} to create communicators that group processes into

652:     various topology-aware clusters.}

653: \label{fig-hier}

654: \end{figure}

655:

MPICH-G2 applications can then query communicators to retrieve
attribute values and structure themselves appropriately. For example,
it is straightforward to create new communicators that reflect
the underlying network topology. Figure~\ref{fig-hier}
shows an \hbox{MPICH-G2} application that first queries
the \hbox{MPICH-G2-defined} communicator attributes
\verb+MPICHX_TOPOLOGY_DEPTHS+ and \verb+MPICHX_TOPOLOGY_COLORS+
to discover topology depths and colors, respectively, and
then uses those values to create three communicators.
\verb+LANcomm+ groups processes based on site boundaries.
\verb+VcommA+ groups processes based on their ability to communicate
with each other over {\em v}MPI and places all processes that cannot
communicate over {\em v}MPI into a single separate communicator.
\verb+VcommB+ groups processes in the same way as \verb+VcommA+ but
does not place processes that cannot communicate over {\em v}MPI
in any communicator (i.e., \verb+VcommB+ is set to
\verb+MPI_COMM_NULL+ for those processes).
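The grouping performed by the three \verb+MPI_Comm_split+ calls in
Figure~\ref{fig-hier} can be traced without an MPI installation. The
following sketch is only an illustration: the six-process depth and
color tables, the helper names (\verb+lan_color+, \verb+vcomm_a_color+,
\verb+vcomm_b_color+, \verb+group_size+), and the \verb+UNDEFINED_COLOR+
stand-in for \verb+MPI_UNDEFINED+ are assumptions made for this example
and are not values produced by \hbox{MPICH-G2}.

```c
#include <assert.h>

#define NPROCS   6                 /* hypothetical job size */
#define MAXDEPTH 4                 /* deepest level used in this example */
#define UNDEFINED_COLOR (-32766)   /* arbitrary stand-in for MPI_UNDEFINED */

/* Hypothetical topology: ranks 0-1 and 2-3 run on two vMPI machines at
 * one site (depth 4); ranks 4-5 are TCP-only workstations at a second
 * site (depth 3). Unused color slots are marked -1. */
static const int depths[NPROCS] = { 4, 4, 4, 4, 3, 3 };
static const int colors[NPROCS][MAXDEPTH] = {
    /* WAN LAN system vMPI */
    { 0, 0, 0,  0 },
    { 0, 0, 0,  0 },
    { 0, 0, 1,  1 },
    { 0, 0, 1,  1 },
    { 0, 1, 2, -1 },
    { 0, 1, 3, -1 },
};

/* Color argument each rank would pass for LANcomm, VcommA, and VcommB. */
int lan_color(int me)
{
    return colors[me][1];
}

int vcomm_a_color(int me)
{
    return depths[me] == 4 ? colors[me][3] : -1;
}

int vcomm_b_color(int me)
{
    return depths[me] == 4 ? colors[me][3] : UNDEFINED_COLOR;
}

/* Size of the group that rank `me` joins when every rank r passes
 * color_of(r); a result of 0 models MPI_COMM_NULL. */
int group_size(int (*color_of)(int), int me)
{
    int mine, r, n = 0;

    assert(me >= 0 && me < NPROCS);
    mine = color_of(me);
    if (mine == UNDEFINED_COLOR)
        return 0;
    for (r = 0; r < NPROCS; r++)
        if (color_of(r) == mine)
            n++;
    return n;
}
```

With these tables, ranks 0 to 3 form one \verb+LANcomm+ group of size
four, ranks 4 and 5 share the catch-all \verb+VcommA+ group, and
\verb+group_size+ reports 0 for ranks 4 and 5 under the \verb+VcommB+
rule.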

673:

674: \paragraph{Quality-of-service management.}

We have experimented with similar techniques for
quality-of-service management~\cite{mpich-gq}. When running over a shared
network, an MPI application may wish to negotiate with an external
resource management system to obtain dedicated access to (part of) the
network. That work shows that communicator attributes can be used to set and
initiate \hbox{quality-of-service} parameters between selected processes.

681:

682:

683: %

684: %

685: %

686: \section{Performance Experiments}

687: \label{sec-exp}

688:

We present the results of detailed performance experiments that
characterize the performance of \hbox{MPICH-G2} and demonstrate the
major improvements achieved relative to its predecessor,
\hbox{MPICH-G}.  We begin by looking at the performance of {\em
intramachine} communication over a vendor MPI.  Then, we examine
performance when TCP is the only choice for communicating between a
pair of processes.  All results were obtained with
mpptest~\cite{pvmmpi99-mpptest}, the performance tool included in the
MPICH distribution.

698:

699:

700: %

701: \subsection{Vendor MPI}

702:

703: Evaluating the performance of MPICH-G2 when using a vendor MPI as an

704: underlying communication mechanism is not as simple as running a

705: single set of ping-pong tests.  As discussed earlier, the performance

706: achieved by \hbox{MPICH-G2} can be affected by outstanding requests

707: and by the use of \verb+MPI_ANY_SOURCE+.  Therefore, we have divided the

708: experiments into the three categories described in Section~\ref{g2-impl}.

709:

710: Our vendor MPI experiments were run on an SGI Origin2000 at Argonne

711: National Laboratory.

712: Both \hbox{MPICH-G2} and \hbox{MPICH-G} were built using a

713: nonthreaded, no-debug flavor of Globus 1.1.4 and performed

714: intramachine communication via SGI's implementation of MPI.

715:

716: \begin{figure}

717: \begin{center}

718: \includegraphics{vmpi-latency.eps}

719: \end{center}

720: \caption{vMPI experiments -- small message latency.}

721: \label{fig-vmpi-lat}

722: \end{figure}

One \hbox{MPICH-G2} design goal was to minimize latency overhead for
intramachine communication relative to an underlying vendor MPI.  As
can be seen in Figure~\ref{fig-vmpi-lat}, \hbox{MPICH-G2} does an
outstanding job in this regard: only a few extra microseconds of
latency are introduced by \hbox{MPICH-G2} when the source of the
message is specified and no other requests are outstanding.  In
contrast, \hbox{MPICH-G} added approximately 80 microseconds of
latency to each message because of the multiple steps required to
implement the Nexus single-sided communication model.

732:

733: The introduction of pending receive requests has a modest impact on

734: \hbox{MPICH-G2} message latencies.  Messages falling into the {\em

735: specified-pending} category incur slightly more overhead, as the

736: \hbox{MPICH-G2} progress engine must continuously poll (probe) the

737: vendor MPI rather than blocking in a receive.  Overall,

738: \hbox{MPICH-G2} latencies increase by several microseconds relative to

739: the first case but are still far less than those of \hbox{MPICH-G}.

740:

The use of \verb+MPI_ANY_SOURCE+ has the largest impact on
\hbox{MPICH-G2} performance.  The additional cost is associated with
having to poll TCP as well as the vendor MPI.  Polling TCP
increases the latency of
messages by nearly 20 microseconds over those in the {\em
specified-pending} category.  While the increase is significant, these
latencies are still considerably less than those for \hbox{MPICH-G}.
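The three latency regimes discussed above can be summarized as a small
decision function. This is a sketch of the behavior described in the
text, not the \hbox{MPICH-G2} progress engine itself: the enumerator
names and the functions \verb+classify_receive+ and \verb+action_for+
are hypothetical, and only the {\em specified-pending} label comes from
the discussion above.

```c
#include <assert.h>
#include <stdbool.h>

/* Receive categories; all names except "specified-pending" are
 * illustrative. */
enum recv_category {
    RECV_SPECIFIED,          /* source given, nothing else outstanding */
    RECV_SPECIFIED_PENDING,  /* source given, other receives pending */
    RECV_ANY_SOURCE          /* MPI_ANY_SOURCE was used */
};

/* What the progress engine must do while waiting, per the text. */
enum progress_action {
    BLOCK_IN_VENDOR_RECV,     /* cheapest: block in the vendor MPI */
    POLL_VENDOR_MPI,          /* probe the vendor MPI repeatedly */
    POLL_VENDOR_MPI_AND_TCP   /* probe the vendor MPI and poll TCP */
};

enum recv_category classify_receive(bool any_source, int other_pending)
{
    assert(other_pending >= 0);
    if (any_source)
        return RECV_ANY_SOURCE;
    return other_pending > 0 ? RECV_SPECIFIED_PENDING : RECV_SPECIFIED;
}

enum progress_action action_for(enum recv_category category)
{
    switch (category) {
    case RECV_SPECIFIED:
        return BLOCK_IN_VENDOR_RECV;
    case RECV_SPECIFIED_PENDING:
        return POLL_VENDOR_MPI;
    default:
        return POLL_VENDOR_MPI_AND_TCP;
    }
}
```

Each step down this list adds work to every receive, which matches the
ordering of the measured latencies: lowest when the receive can block,
higher when the vendor MPI must be probed, and highest when TCP must be
polled as well.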

748:

749: \begin{figure}

750: \begin{center}

751: \includegraphics{vmpi-bandwidth.eps}

752: \end{center}

753: \caption{vMPI experiments -- realized bandwidth.}

754: \label{fig-vmpi-bw}

755: \end{figure}

While \hbox{MPICH-G2} message latencies are affected by the use of
\verb+MPI_ANY_SOURCE+ and pending receive requests, the realized
bandwidths are largely unaffected.  Figure~\ref{fig-vmpi-bw} shows
the bandwidths obtained for messages up to one megabyte.
We see that the bandwidths for
\hbox{MPICH-G2} are nearly identical across the three categories for
all but small messages.  While the large-message bandwidths for
\hbox{MPICH-G2} are approximately 7\% less than those for the vendor
MPI (for reasons we do not yet understand), they
represent an improvement of more than 60\% over \hbox{MPICH-G}.
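As a rough consistency check on these percentages (our arithmetic, not
an additional measurement), let $B_{v}$, $B_{G2}$, and $B_{G}$ denote the
large-message bandwidths of the vendor MPI, \hbox{MPICH-G2}, and
\hbox{MPICH-G}, respectively; these symbols are introduced here only for
illustration. The two figures quoted above then imply
\[
B_{G2} \approx 0.93\,B_{v}, \qquad B_{G2} > 1.60\,B_{G}
\quad\Longrightarrow\quad
B_{G} < \frac{0.93}{1.60}\,B_{v} \approx 0.58\,B_{v},
\]
that is, \hbox{MPICH-G} realized less than roughly 58\% of the vendor
MPI's large-message bandwidth.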

766:

767:

768: %

769: \subsection{TCP/IP}

770:

Performance optimization work on \hbox{MPICH-G2} performed to date has
focused on intramachine messaging when a vendor MPI is used as the
underlying communication mechanism.  The \hbox{MPICH-G2} TCP/IP
communication code has not been optimized.  However, its performance
is quite reasonable when compared with that of \hbox{MPICH-G} and of MPICH
configured with the default TCP (ch\_p4) device.

777:

778: All TCP/IP performance measurements were taken using a pair of

779: SUN workstations in Argonne's Mathematics and Computer Science

780: Division.  These two machines were connected to a local-area network

781: via gigabit Ethernet.  Both \hbox{MPICH-G} and \hbox{MPICH-G2} were

782: built using a nonthreaded, no-debug flavor of Globus 1.1.4.

783:

784: \begin{figure}

785: \begin{center}

786: \includegraphics{tcp-latency.eps}

787: \end{center}

788: \caption{TCP/IP experiments -- small message latency.}

789: \label{fig-tcp-lat}

790: \end{figure}

791: Figure~\ref{fig-tcp-lat} shows the small message

792: latencies exhibited by all three systems.  We see that

793: for most message sizes, \hbox{MPICH-G2} is 20\% to 30\% slower than

794: MPICH/ch\_p4, although the difference is much smaller for very small

795: messages.  We also see that \hbox{MPICH-G2} latencies, in most cases,

796: are somewhat less than those of \hbox{MPICH-G}.

797:

The most notable data point is barely visible on the graph but
highlights a clear optimization that is missing in \hbox{MPICH-G2}.
The latency for zero-byte messages is 140 microseconds, while the
latency for an eight-byte message is 224 microseconds.  The reason for
this large difference is that \hbox{MPICH-G2} currently uses separate
system calls to send
the message header and the message data.
This data point suggests that
by combining these two writes into a single vector write, we could
reduce the latency of small messages significantly.  While this
difference might seem unimportant for machines separated by a wide-area
network, it can be significant when \hbox{MPICH-G2} is used to
combine multiple machines within the same machine room or even at the
same site.
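The vector write suggested above can be sketched with the POSIX
\verb+writev+ call, which hands the kernel several buffers in one system
call. The code below is a minimal illustration under stated assumptions
rather than the \hbox{MPICH-G2} implementation: the function names
\verb+send_header_and_data+ and \verb+demo_roundtrip+ and the four-byte
header are hypothetical, a pipe stands in for the TCP socket, and short
writes are not retried.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>   /* writev, struct iovec */
#include <unistd.h>    /* pipe, read, close */

/* Send a message header and its payload with one system call.
 * Returns the byte count reported by writev, or -1 on error.
 * A production version would loop on short writes. */
ssize_t send_header_and_data(int fd, const void *hdr, size_t hdr_len,
                             const void *data, size_t data_len)
{
    struct iovec iov[2];
    int iovcnt = 1;

    assert(hdr != NULL && hdr_len > 0);
    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len = hdr_len;
    if (data_len > 0) {            /* zero-byte message: header only */
        iov[1].iov_base = (void *)data;
        iov[1].iov_len = data_len;
        iovcnt = 2;
    }
    return writev(fd, iov, iovcnt);
}

/* Round-trip check over a pipe: returns 0 if the reader sees the header
 * immediately followed by the payload, -1 otherwise. */
int demo_roundtrip(void)
{
    static const char hdr[4] = { 'H', 'D', 'R', '1' };
    static const char data[8] = { 'p', 'a', 'y', 'l', 'o', 'a', 'd', '!' };
    char buf[16];
    int fds[2];
    int ok;

    if (pipe(fds) != 0)
        return -1;
    ok = send_header_and_data(fds[1], hdr, sizeof hdr,
                              data, sizeof data) == 12
         && read(fds[0], buf, sizeof buf) == 12
         && memcmp(buf, hdr, sizeof hdr) == 0
         && memcmp(buf + sizeof hdr, data, sizeof data) == 0;
    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : -1;
}
```

Because the header and the data leave in one call, an eight-byte
message costs one system call, the same as a zero-byte message. The
measurements above attribute the 84-microsecond gap between those two
cases (224 versus 140 microseconds) to the second call.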

812:

813: \begin{figure}

814: \begin{center}

815: \includegraphics{tcp-bandwidth.eps}

816: \end{center}

817: \caption{TCP/IP experiments -- realized bandwidth.}

818: \label{fig-tcp-bw}

819: \end{figure}

820:

821: Figure~\ref{fig-tcp-bw} shows the bandwidths

822: obtained by all three systems for message sizes up to one megabyte.

823: For large messages, we see that \hbox{MPICH-G2} performs approximately

824: 5\% better than the other two systems.  This improvement is a result

825: of the message data being sent directly from the user buffer rather

826: than being copied into a separate buffer before \verb+write+ is

827: called.  For preposted receives with contiguous data, further

828: improvement is possible.  Data for these receives can be read directly

829: into the user buffer, avoiding a buffer copy that, at present, always

830: takes place at the receiver.

831:

832:

833: %

834: %

835: %

836: \section{Application Experiences}

837: \label{sec-apps}

838:

839: MPICH-G2 has been used by many groups worldwide for a wide variety

840: of purposes. Here we mention a few relevant experiences that

841: highlight interesting features of the system.

842:

One interesting use of MPICH-G2 is to run conventional MPI programs
across multiple parallel computers within the same machine room.  In
this case, \hbox{MPICH-G2} is used primarily to manage startup and to
achieve efficient communication via use of different low-level
communication methods.  Other groups are using \hbox{MPICH-G2} to
distribute applications across computers located at different sites,
for example, Taylor for MM5 climate modeling on the NSF
TeraGrid~\cite{teragrid,ncsa-news}, Mahinthakumar for forming multivariate
geographic clusters to produce maps of regions of ecological
similarity~\cite{kumarsc99}, Larsson for studies of distributed
execution of a large computational electromagnetics code~\cite{Olle},
and Chen and Taylor for studies of automatic partitioning techniques
as applied to finite element codes~\cite{Jian}.

856:

857: MPICH-G2 has also been successfully used in demonstrations that

858: promote MPI as an application-level interface to Grids for

859: nontraditional distributed computing applications, for example, Roy et al.\

860: for studies in using MPI idioms for setting QoS

861: parameters~\cite{mpich-gq} and Papka and Binns for creating

862: distributed visualization pipelines using \hbox{MPICH-G2's} client/server MPI-2

863: extensions~\cite{teragrid,ncsa-news}.

864:

\hbox{MPICH-G2} was awarded a 2001 Gordon Bell Award
for its role in an astrophysics application used for solving problems
in numerical relativity to study gravitational waves from colliding
black holes~\cite{CactusGBsc01}.  The winning team used
\hbox{MPICH-G2} to run across four supercomputers in California and
Illinois, achieving scaling of 88\% (1,140 CPUs) and 63\% (1,500 CPUs)
while computing a problem five times larger than any previous
run.

873:

874:

875: %

876: %

877: %

878: \section{Future Work}

879:

880: The successful development of MPICH-G2 and its widespread adoption

881: both make it a useful platform for future research and create

882: significant interest in its continued development.

883:

884: One immediate area of concern is full support for MPI-2 features.  In

885: particular, support for dynamic process management will allow

886: \hbox{MPICH-G2} to be used for a wider class of Grid computations in

887: which either application requirements or resource availability changes

888: dynamically over time. The necessary support exists in the Globus

889: Toolkit, and so this work depends primarily on the availability of the

890: next-generation ADI-3. Less obvious, but very interesting, is how to

891: integrate support for fault tolerance into \hbox{MPICH-G2} in a

892: meaningful way.

893:

894: A second area of concern relates to exploring and refining

895: \hbox{MPICH-G2} support for application-level management of

896: heterogeneity. Initial experiments with topology discovery and

897: quality-of-service management have been encouraging, but it seems inevitable

898: that application experiences will reveal deficiencies in current

899: techniques or suggest additional \hbox{MPICH-G2} support that

900: could further improve application flexibility.

901:

Our work on collective operations can be improved in various ways.  In
particular, van de Geijn et al.~\cite{pipelining} have shown that there are
advantages in implementing collective operations by segmenting and
pipelining messages when communicating over relatively slow
channels (e.g., TCP over local- and wide-area networks).  These
pipelining techniques can be used at many of the levels in
\hbox{MPICH-G2's} multilevel topology-aware collective operations.
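The benefit of segmentation can be illustrated with a simple cost
model. The model and the numbers below are our own assumptions, chosen
for illustration; they are not taken from~\cite{pipelining} or from
measurements. Suppose a message of $n$ bytes is forwarded along a chain
of $h$ links, each with startup cost $\alpha$ and per-byte cost $\beta$.
Sending the message whole, store-and-forward, and sending it as $k$
equal segments that are pipelined through the chain cost, respectively,
\[
T_{1} = h\,(\alpha + n\beta), \qquad
T_{k} = (h + k - 1)\left(\alpha + \frac{n}{k}\,\beta\right).
\]
For example, with $h = 3$, $\alpha = 0.1$~ms, and $n\beta = 10$~ms, the
unsegmented transfer takes $T_{1} = 3\,(0.1 + 10) = 30.3$~ms, whereas
$k = 10$ segments give $T_{10} = 12\,(0.1 + 1) = 13.2$~ms. The saving
grows with the per-byte cost of the channel, which is why slower
channels profit most, and shrinks again once $k$ is so large that the
per-segment startup cost $\alpha$ dominates.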

909:

910:

911: %

912: %

913: %

914: \section{Related Work}

915:

A variety of approaches have been proposed for programming Grid
applications, including object
systems~\cite{grim:legion,GannonChapterCite}, Web
technologies~\cite{FoxChapterCite,webos}, problem solving
environments~\cite{NetSolve,Ninf}, CORBA, workflow systems,
high-throughput computing systems~\cite{Nimrod,Condor}, and
compiler-based systems~\cite{KennedyChapterCite}. We assume that while
different technologies will prove attractive for different purposes, a
programming model such as MPI that allows direct control over low-level
communications will always be attractive for certain applications.

926:

Other systems that support message passing in heterogeneous environments
include the pioneering Parallel Virtual Machine
(PVM)~\cite{pvmbook,PVMATM2} and the PACX-MPI~\cite{pacx},
MetaMPI~\cite{metampi}, and STAMPI~\cite{stampi} implementations of MPI,
each of which addresses issues relating to efficient communication in
heterogeneous wide-area systems.  STAMPI supports MPI-2 dynamic process
management features.  PACX-MPI, like \hbox{MPICH-G2}, supports the
automatic startup of distributed computations but uses ssh, rather than
the GRAM protocol with its integrated GSI authentication, for that
purpose; it also does not address issues of executable staging.
PACX-MPI and STAMPI also differ from \hbox{MPICH-G2} in how they
address wide-area communication. While in \hbox{MPICH-G2} any
processor may speak both local and wide-area communication protocols,
PACX-MPI and STAMPI forward all off-cluster communication operations to
an intermediate gateway node.

943:

Other implementations of MPI include MPICH with the ch\_p4 device
and LAM/MPI~\cite{lam,lam-www}.  In contrast to \hbox{MPICH-G2}, these
implementations were designed for local-area networks rather than for
computational Grids.

947:

948: The Interoperable MPI (IMPI) standards effort~\cite{impi-web} defines

949: standard message formats and protocols with a view to enabling

950: interoperability among different MPI implementations. IMPI does {\em

951: not} address issues of computation management and control;

952: in principle, the techniques developed within \hbox{MPICH-G2} could be

953: used for that purpose.

954:

Other related projects include MagPIe~\cite{magpie} and
MPI-StarT~\cite{starT}, which show how careful consideration of
communication topologies can result in significant improvements after
modifying the MPICH broadcast algorithm, which uses topology-{\em
un}aware binomial trees.  However, both limit their view of the
network to only two layers; processors are either near or far.
Further performance improvements can be realized by adopting the
multilevel network view.  We referred in the preceding section to the
work of van de Geijn et al.~\cite{pipelining}.  In~\cite{magpie-PC},
Kielmann et al.\ have extended MagPIe by incorporating van de Geijn's
pipelining idea through a technique they call Parameterized LogP
(PLogP), which is an extension of the LogP model presented
by Culler et al.~\cite{logp}.  In this extension, MagPIe still recognizes only a
two-layer communication network, but through parameterized studies of
the network the authors determine ``optimal'' packet sizes.
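The cost of a topology-unaware binomial tree can be made concrete by
counting how many of its messages cross between ``far'' processors. The
sketch below is an illustration under our own assumptions, not code
from MagPIe, MPI-StarT, or MPICH: the broadcast root is rank 0, the
function names and the \verb+demo_site+ table are hypothetical, and the
topology-aware count assumes a two-layer scheme that sends exactly one
far message to each remote site before broadcasting locally.

```c
#include <assert.h>

/* Parent of `rank` in a binomial broadcast tree rooted at rank 0:
 * clear the lowest set bit of the rank. */
int binomial_parent(int rank)
{
    assert(rank > 0);
    return rank & (rank - 1);
}

/* Count tree edges whose endpoints lie at different sites when the
 * tree ignores topology. site[r] is the site id of rank r. */
int unaware_far_messages(const int *site, int n)
{
    int r, far = 0;

    for (r = 1; r < n; r++)
        if (site[r] != site[binomial_parent(r)])
            far++;
    return far;
}

/* Two-layer scheme: one far message per remote site, so the count is
 * the number of distinct sites minus one. */
int aware_far_messages(const int *site, int n)
{
    int r, s, distinct = 0;

    for (r = 0; r < n; r++) {
        int seen = 0;
        for (s = 0; s < r; s++) {
            if (site[s] == site[r]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct > 0 ? distinct - 1 : 0;
}

/* Hypothetical placement: eight ranks alternating between two sites. */
const int demo_site[8] = { 0, 1, 0, 1, 0, 1, 0, 1 };
```

For this placement the binomial tree sends four of its seven messages
between sites, whereas the two-layer scheme sends one. A multilevel
view applies the same idea again inside each site.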

970:

Various projects have investigated programming model extensions to
enable application management of QoS, for example, Quo~\cite{Quo}.
The only other relevant effort in the context of MPI is work on
real-time extensions to MPI.  MPI/RT~\cite{mpirt-web} provides a QoS
interface but is not an established standard and introduces a new
programming interface.  Furthermore, its focus is on real-time needs,
such as predictability of performance and system resource usage, that
are more appropriate for embedded systems than for wide-area networks.

979:

980:

981: %

982: %

983: %

984: \section{Summary}

985:

986: We have described \hbox{MPICH-G2}, an implementation of the Message

987: Passing Interface that uses Globus Toolkit mechanisms to support the

988: execution of MPI programs in heterogeneous wide-area

989: environments. \hbox{MPICH-G2} masks details of underlying networks,

990: software systems, policies, and computer architectures so that diverse

991: distributed resources can appear as a single {\tt

992: MPI\_COMM\_WORLD}. Arbitrary MPI applications can be started on

993: heterogeneous collections of machines simply by typing mpirun:

994: authentication, authorization, executable staging, resource

995: allocation, job creation, startup, and routing of stdout and stderr

996: are all handled automatically via Globus Toolkit mechanisms.

997: \hbox{MPICH-G2} also enables the use of MPI features for user-level

998: management of heterogeneity, for example, via the use of MPI's

999: communicator construct to access system topology information.  A wide

1000: range of successful application experiences have demonstrated

1001: \hbox{MPICH-G2}'s utility in practical settings, both for traditional

1002: simulation applications and for less traditional applications such as

1003: distributed visualization pipelines.

1004:

1005: While \hbox{MPICH-G2} is already a sophisticated tool that is seeing

1006: widespread use, there are also several areas in which it can be

1007: extended and improved. Support for MPI-2 features, in particular

1008: dynamic process management, will be invaluable for Grid applications

1009: that adapt their resource usage to changing conditions and application

1010: requirements.  This support will be provided as soon as it is

1011: incorporated into MPICH. More challenging is the design of techniques

1012: for effective fault management, a major topic for future research.

1013: Here we may be able to draw upon techniques developed within systems

1014: such as PVM~\cite{pvmbook}.

1015:

1016: %

1017: %

1018: %

1019: \begin{acknowledge}

1020:

1021: We thank Olle~Larsson and Warren~Smith for early discussions and for

1022: prototyping the techniques that enable us to use vendor-supplied MPI.

1023: \hbox{MPICH-G2} is, to a large extent, the result of our

1024: \hbox{MPICH-G} experiences.  We therefore thank Jonathan~Geisler, who

1025: originally designed and implemented \hbox{MPICH-G} while at Argonne,

1026: and George~Thiruvathukal, who further developed \hbox{MPICH-G} also

1027: while at Argonne.  We thank William~Gropp, Ewing~Lusk, David~Ashton,

1028: Anthony~Chan, Rob~Ross, Debbie~Swider, and Rajeev~Thakur of the MPICH

1029: group at Argonne for their guidance, assistance, insight, and many

1030: discussions.  We thank Sebastien~Lacour for his efforts in conducting

1031: the performance evaluation and his many other contributions. His

1032: insight and ingenuity were invaluable to the implementation of the

1033: topology-aware components of \hbox{MPICH-G2}.  Finally, we thank all

1034: the members of the Globus development team for their support,

1035: patience, and many ideas.

1036:

1037: This work was supported in part by the Mathematical, Information, and

1038: Computational Sciences Division subprogram of the Office of Advanced

1039: Scientific Computing Research, U.S. Department of Energy, under Contract

1040: W-31-109-Eng-38; by the U.S. Department of Energy under Cooperative

1041: Agreement No. DE-FC02-99ER25398; by the Defense Advanced Research

1042: Projects Agency under contract N66001-96-C-8523; by the National Science

1043: Foundation; and by the NASA Information Power Grid program.

1044:

1045: \end{acknowledge}

1046:

1047:

1048: %

1049: %

1050: %

1051: \bibliographystyle{plain}

1052: \bibliography{foster_bibliography,mpi-allbib,mpi-book,globus,prop}

1053:

1054:

1055: %

1056: %

1057:

1058: Nicholas T. Karonis received a B.S. in finance and a B.S. in computer

1059: science from Northern Illinois University in 1985, an M.S. in computer

1060: science from Northern Illinois University in 1987, and a Ph.D. in

1061: computer science from Syracuse University in 1992.  He spent

1062: summers from 1981 to 1991 as a student at Argonne National Laboratory,

1063: where he worked on the p4 message-passing library, automated reasoning,

1064: and genetic sequence alignment.  From 1991 to 1995 he worked on the control

1065: system at Argonne's Advanced Photon Source and from 1995 to 1996

1066: for the Computing Division at Fermi National Accelerator Laboratory.

1067: Since 1996 he has been an assistant professor of computer science at

1068: Northern Illinois University and a resident associate guest of Argonne's

1069: Mathematics and Computer Science Division where he has been a member

of the Globus Project.  His current research interest is message-passing
systems in computational grids.

1072:

1073: Brian Toonen received his B.S. in computer science from the University of

1074: Wisconsin Oshkosh in 1993, and his M.S. in computer science from the

1075: University of Wisconsin-Madison in 1997.  He is a senior scientific

1076: programmer with the Mathematics and Computer Science Division at Argonne

1077: National Laboratory.  Brian's research interests include parallel and

1078: distributed computing, operating systems, and networking.  He is currently

1079: working with the MPICH team to create a portable, high-performance

1080: implementation of the MPI-2 standard.  Prior to joining the MPICH team, he

1081: was a senior developer for the Globus Project.

1082:

Ian Foster received his B.Sc. (Hons I) at the University of Canterbury
in 1979 and his Ph.D. from Imperial College, London, in 1988. He

1085: is a senior scientist and associate director of the

1086: Mathematics and Computer Science Division at Argonne National

1087: Laboratory, and professor of computer science at the University of

1088: Chicago.  He has published four books and over 150 papers and

1089: technical reports.  He co-leads the Globus Project, which provides

1090: protocols and services used by industrial and academic distributed

1091: computing projects worldwide.  He co-founded the influential Global

1092: Grid Forum and co-edited the book ``The Grid: Blueprint for a New

1093: Computing Infrastructure.''

1094:

1095: \end{document}

1096: