cs0206040/newp.tex
1: \documentclass[jpdc]{apjrnl}
2: \usepackage{graphics}
3: \listfiles
4: 
5: 
6: %
7: \begin{document}
8: 
9: \title{MPICH-G2: A Grid-Enabled Implementation of the Message Passing
10:        Interface}
11: 
12: \author{Nicholas T. Karonis
13: \\Department of Computer Science
14: \\Northern Illinois University
15: \\DeKalb, IL~~60115
16: \\Argonne National Laboratory
17: \\Argonne, IL~~60439
18: \\Email: karonis@niu.edu
19: \and
20: Brian Toonen
21: \\Mathematics and Computer Science Division
22: \\Argonne National Laboratory
23: \\Argonne, IL~~60439
24: \\Email: toonen@mcs.anl.gov
25: \and
26: Ian Foster
27: \\Argonne National Laboratory
28: \\Argonne, IL~~60439
29: \\The University of Chicago
30: \\Chicago, IL~~60637
31: \\Email: foster@mcs.anl.gov
32: }
33: 
34: \date{April 2002}
35: 
36: \maketitle
37: 
38: Proposed running head: MPICH-G2: A Grid-Enabled MPI
39: 
40: \pagebreak
41: 
42: 
43: %
44: %
45: %
46: \begin{abstract}
47: 
48: Application development for distributed computing ``Grids'' can
49: benefit from tools that variously hide or enable application-level
50: management of critical aspects of the heterogeneous environment. As
51: part of an investigation of these issues, we have developed
52: \hbox{MPICH-G2}, a Grid-enabled implementation of the Message Passing
53: Interface (MPI) that allows a user to run MPI programs across multiple
54: computers, at the same or different sites, using the same commands that
55: would be used on a parallel computer.  This library extends the
56: Argonne MPICH implementation of MPI to use services provided by the
57: Globus Toolkit for authentication, authorization, resource allocation,
58: executable staging, and I/O, as well as for process creation,
59: monitoring, and control.  Various performance-critical operations,
60: including startup and collective operations, are configured to exploit
61: network topology information. The library also exploits MPI constructs
62: for performance management; for example, the MPI communicator
63: construct is used for application-level discovery of, and
64: adaptation to, both network topology and network \hbox{quality-of-service}
65: mechanisms.  We describe the \hbox{MPICH-G2} design and
66: implementation, present performance results, and review application
67: experiences, including record-setting distributed simulations.
68: 
69: \end{abstract}
70: 
71: \begin{keywords}
72: MPI, Grid computing, message passing, Globus Toolkit, \hbox{MPICH-G2}
73: \end{keywords}
74: 
75: \pagebreak
76: 
77: 
78: %
79: %
80: %
81: \section{Introduction}
82: 
83: So-called computational Grids~\cite{GridBook,PhysicsToday} enable the
84: coupling and coordinated use of geographically distributed resources
85: for such purposes as large-scale computation, distributed data
86: analysis, and remote visualization. The development or
87: adaptation of applications for Grid environments is made challenging,
88: however, by the often heterogeneous nature of the resources involved and the
89: fact that these resources typically live in different
90: administrative domains, run different software, are subject to
91: different access control policies, and may be connected by networks
92: with widely varying performance characteristics.
93: 
94: Such concerns have motivated various explorations of specialized, often
95: high-level, distributed programming models for Grid environments,
96: including various forms of object
97: systems~\cite{grim:legion,GannonChapterCite}, Web
98: technologies~\cite{FoxChapterCite,webos}, problem solving
99: environments~\cite{NetSolve,Ninf}, CORBA, workflow systems,
100: high-throughput computing systems~\cite{Nimrod,Condor}, and
101: compiler-based systems~\cite{KennedyChapterCite}.
102: 
103: In contrast, we explore here a different approach that might appear
104: reactionary in its simplicity but that, in fact, delivers a remarkably
105: sophisticated technology for managing the heterogeneity associated
106: with Grid environments. Specifically, we advocate the use of a
107: well-known low-level parallel programming model, the Message Passing
108: Interface (MPI), as a basis for Grid programming.  While not a
109: high-level programming model by any means, MPI incorporates
110: sophisticated support for the management of heterogeneity (e.g., data
111: types), for the construction of modular programs (the communicator
112: construct), for management of latency (asynchronous operations), and
113: for the representation of global operations (collective
114: operations). These and other features have allowed MPI to achieve
115: tremendous success as a standard programming model for parallel
116: computers. We hypothesize that these same features can also be used to
117: good effect for Grid computing.
118: 
119: Our investigation of MPI as a Grid programming model has focused on
120: three related questions.  First, can we implement MPI constructs
121: efficiently in Grid environments to {\em hide} heterogeneity
122: without introducing overhead?  Second, can we use MPI constructs to
123: enable users to {\em manage} heterogeneity, when this is required?
124: Third, do users find MPI useful in practice for application
125: development?
126: 
127: To allow for the experimental exploration of these questions,
128: we have developed \hbox{MPICH-G2}, a complete implementation of the
129: MPI-1 standard~\cite{mpi-forum:journal} that uses services provided by
130: the Globus Toolkit$^{TM}$~\cite{GlobusHCW98} to extend the popular Argonne
131: MPICH implementation of MPI~\cite{mpich} for Grid
132: execution. \hbox{MPICH-G2} passes the MPICH test suite and represents
133: a complete redesign and reimplementation of the earlier \hbox{MPICH-G}
134: system~\cite{mpi-nexus-pc} that increases performance significantly 
135: and incorporates a number of innovations.  
136: Our experiences with \hbox{MPICH-G2}, as reported in this article,
137: allow us to respond in the affirmative to each question posed in the
138: preceding paragraph.
139: 
140: MPICH-G2 hides heterogeneity by using Globus Toolkit services
141: for such purposes as authentication, authorization, executable
142: staging, process creation, process monitoring, process control,
143: communication, redirection of standard input and output, and remote
144: file access. The result is that a user can run MPI programs across
145: multiple computers at different sites using the same commands that
146: would be used on a parallel computer.  Furthermore, performance
147: studies show that overheads relative to native implementations of
148: basic communication functions are negligible.
149: 
150: MPICH-G2 enables the use of several different MPI features for
151: user management of heterogeneity. MPI's asynchronous operations can be
152: used for latency management in wide-area networks. MPI's communicator
153: construct can be used to represent the hierarchical structure of
154: heterogeneous systems and thus allow applications to adapt their
155: behavior to such structures. (In separate work, we present
156: topology-aware collective operations as one example of an
157: ``application''~\cite{optcollops}.) We also show how MPI's
158: communicator construct can be used for user-level management of
159: network quality of service, as first introduced in an earlier
160: article~\cite{mpich-gq}.
161: 
162: Many groups have used \hbox{MPICH-G2} for the execution of
163: both traditional parallel computing applications (e.g., numerical
164: simulation) and nontraditional distributed computing applications
165: (e.g., distributed visualization), in both local-area and wide-area
166: networks. This variety of applications and execution environments
167: persuades us that MPI can play a valuable role in Grid computing.
168: 
169: MPICH-G2 is not the only implementation of MPI for
170: heterogeneous systems.  Others include MPICH with the ch\_p4 device
171: (which provides
172: limited support for heterogeneity), PACX-MPI~\cite{pacx},
173: and STAMPI~\cite{stampi}, each of which has
174: interesting features, as we discuss later. Magpie~\cite{magpie},
175: IMPI~\cite{impi-web}, and PVM~\cite{pvmbook}
176: also address relevant issues.
177: \hbox{MPICH-G2} is
178: unique, however, in the degree to which it hides and manages heterogeneity, as
179: well as in its large user community.
180: 
181: In the rest of this article, we describe the problems that we faced in
182: developing \hbox{MPICH-G2}, the techniques used to overcome these
183: problems, and experimental results that indicate the performance of
184: the \hbox{MPICH-G2} implementation and the extent of its improvement
185: over \hbox{MPICH-G}. We conclude with a discussion of application
186: experiments and future directions.
187: 
188: 
189: %
190: %
191: %
192: \section{Background}
193: 
194: We first provide some brief background on MPI, MPICH, and the Globus
195: Toolkit.
196: 
197: 
198: %
199: \subsection{Message Passing Interface}
200: 
201: The Message Passing Interface standard defines a library of
202: routines that implement the message-passing model. These routines
203: include {\em point-to-point} communication functions, in which a {\em
204: send} operation is used to initiate a data transfer between two
205: concurrently executing program components and a matching {\em receive}
206: operation is used to extract that data from system data structures
207: into application memory space; and {\em collective} operations such as
208: broadcast and reductions that implement operations involving multiple
209: processes. Numerous other functions address other aspects of message
210: passing, including, in the MPI-2 extensions to
211: MPI~\cite{mpi-forum:mpi2-journal}, single-sided communication and
212: dynamic process creation.
213: 
214: The primary interest of MPI from our perspective, apart from its broad
215: adoption, is the care taken in its design to ensure that underlying
216: performance issues are accessible to, not masked from, the programmer.
217: MPI mechanisms such as asynchronous operations, communicators,
218: and collective operations all turn out to be useful in Grid environments.
219: 
220: 
221: %
222: \subsection{MPICH Architecture}
223: 
224: MPICH~\cite{gropp-lusk-doss-skjellum:mpich} is a popular
225: implementation of the Message Passing Interface standard.  It is a
226: high-performance, highly portable library originally developed as a
227: collaborative effort between Argonne National Laboratory and
228: Mississippi State University.  Argonne continues
229: research and development efforts aimed at improving MPICH performance
230: and functionality.
231: 
232: In its present form, MPICH is a complete implementation of the MPI-1
233: standard with extensions to support the parallel I/O functionality
234: defined in the MPI-2 standard.  It is a mature, widely distributed
235: library, with more than 2,000 downloads per month, not including
236: downloads that occur at mirror sites.  Its free distribution and wide
237: portability have contributed materially to the adoption of the MPI
238: standard by the parallel computing community.
239: 
240: MPICH derives its portability from its interfaces and layered
241: architecture.  At the top is the MPI interface as defined by the MPI
242: standards.  Directly beneath this interface is the MPICH layer, which
243: implements the MPI interface.  Much of the code in an MPI
244: implementation is independent of the networking device or process
245: management system.  This code, which includes error checking and
246: various manipulations of the opaque objects, is implemented directly
247: at the MPICH layer.  All other functionality is passed off to lower
248: layers by means of the Abstract Device Interface (ADI).
249: 
250: The ADI is a simpler interface than MPI proper
251: and focuses on moving data between the MPI layer and the network
252: subsystem.  Those interested in implementing MPI for a particular
253: platform need only define the routines in the ADI in order to obtain a
254: full implementation.  Existing implementations of this device
255: interface for various MPPs, SMPs, and networks provide complete MPI
256: functionality in a wide variety of environments.  \hbox{MPICH-G2} is
257: another implementation of the ADI and is otherwise known as the {\em
258: globus2} device.
259: 
276: 
277: 
278: %
279: \subsection{The Globus Toolkit}
280: 
281: The Globus Toolkit is a collection of software components designed to
282: support the development of applications for high-performance
283: distributed computing environments, or
284: ``Grids''~\cite{GlobusHCW98,GridBook}.  
285: Core components typically define a protocol for interacting with a
286: remote resource, plus an application program interface (API) used to
287: invoke that protocol.  (We introduce the protocols and APIs used
288: within \hbox{MPICH-G2} below.)  Higher-level libraries, services,
289: tools, and applications use core services to implement more complex
290: global functionality.
291: The various Globus Toolkit
292: components are reviewed in~\cite{Anatomy} and described in detail in
293: online documentation and in technical papers.
294: 
295: 
296: %
297: %
298: %
299: \section{MPICH-G2: A Grid-Enabled MPI}
300: 
301: As noted in the introduction, \hbox{MPICH-G2} is a complete
302: implementation of the MPI-1 standard that uses Globus Toolkit services
303: to support efficient and transparent execution in heterogeneous Grid
304: environments, while also allowing for application management of
305: heterogeneity.  (It also implements client/server management functions
306: found in Section 5.4 of the MPI-2
307: standard~\cite{mpi-forum:mpi2-journal}. However, we do not discuss
308: these functions here.)
309: 
310: In this section, we first describe the techniques used to hide
311: heterogeneity during startup and for process management, then the
312: techniques used to effect communication in heterogeneous systems,
313: and finally the support provided for application-level management
314: of heterogeneity.
315: 
316: 
317: %
318: \subsection{Hiding Heterogeneity during Startup and Management}
319: \label{sec-startup}
320: 
321: As illustrated in Figure~\ref{fig-g} and discussed here,
322: MPICH-G2 uses a range of Globus Toolkit services to address the
323: various complex issues that arise in
324: heterogeneous, multisite Grid environments, such as cross-site
325: authentication, the need to deal with multiple schedulers with
326: different characteristics, coordinated process creation, heterogeneous
327: communication structures, executable staging, and collation of
328: standard output. In fact, \hbox{MPICH-G2} serves as an exemplary
329: case study of how Globus Toolkit mechanisms can be used to create
330: a Grid-enabled programming tool, as we now explain.
331: 
332: \begin{figure}
333: \begin{center}
334: \resizebox{4.5in}{!}{\includegraphics{g2_fig.eps}}
335: \end{center}
336: \caption{Schematic of the MPICH-G2 startup, showing the various
337: Globus Toolkit components used to hide and manage heterogeneity. ``Fork,''
338: ``LSF,'' and ``LoadLeveler'' are different local schedulers.}
339: \label{fig-g}
340: \end{figure}
341: 
342: Prior to startup of an \hbox{MPICH-G2} application, the user employs
343: the {\em Grid Security Infrastructure} (GSI)~\cite{GlobusSecurity} to
344: obtain a (public key) proxy credential that is used to authenticate
345: the user to each remote site. This step provides a single sign-on
346: capability.
347: 
348: The user may also use the Monitoring and Discovery Service
349: (MDS)~\cite{mds97} to select computers on the basis of, for example,
350: configuration, availability, and network connectivity.
351: 
352: Once authenticated, the user invokes the standard {\tt mpirun} command to
353: request the creation of an MPI computation. The \hbox{MPICH-G2}
354: implementation of this command uses the {\em Resource Specification 
355: Language~(RSL)}~\cite{GRAM97} to describe the job.  In brief,
356: users write {\em RSL scripts}, which identify resources 
357: (e.g., computers) and specify requirements (e.g., number of CPUs, memory, 
358: execution time) and parameters (e.g., location of executables, command
359: line arguments, environment variables) for each.  Based on the
360: information found in an RSL script, \hbox{MPICH-G2}
361: calls a {\em co-allocation library}
362: distributed with the Globus Toolkit, the Dynamically-Updated Request
363: Online Coallocator (DUROC)~\cite{CoAllocation99}, to schedule and
364: start the application across the various computers specified by the
365: user.
366: 
367: The DUROC library itself uses the {\em Grid Resource Allocation and
368: Management} (GRAM)~\cite{GRAM97} API and protocol to start and
369: subsequently manage a set of subcomputations, one for each
370: computer. For each subcomputation, DUROC generates a GRAM request to a
371: remote GRAM server, which authenticates the user, performs local
372: authorization, and then interacts with the local scheduler to initiate
373: the computation. DUROC and associated \hbox{MPICH-G2} libraries tie
374: the various subcomputations together into a single MPI computation.
375: 
376: GRAM will, if directed, use {\em Global Access to Secondary Storage}
377: (GASS)~\cite{GASS99} to stage executable(s) from remote locations
378: (indicated by URLs).  GASS is also used, once an application has
379: started, to direct standard output and error (stdout and stderr)
380: streams to the user's terminal, and to provide access to files
381: regardless of location, thus masking essentially all aspects of
382: geographical distribution except those associated with performance.
383: 
384: Once the application has started, \hbox{MPICH-G2} selects the most
385: efficient communication method possible between any two processes,
386: using vendor-supplied MPI ({\em v}MPI) if available, or {\em Globus
387: communication} (Globus~IO) with {\em Globus Data Conversion}
388: (Globus~DC) for TCP, otherwise.
389: 
401: 
402: DUROC and GRAM also interact to monitor and manage the execution of the
403: application.  Each GRAM server monitors the life cycle of its subcomputation
404: as it passes from pending to running and then to terminating, communicating each
405: state transition back to DUROC.  Each subcomputation is held at a
406: DUROC-controlled barrier and is released from that barrier only after 
407: all subcomputations have started executing.  Also, a request to terminate
408: the computation (``control C'') may be initiated by the user, at which
409: time DUROC and the GRAM servers, communicating via GRAM process control
410: messages, terminate all processes.
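
The monitoring and barrier behavior described above can be pictured with a
small state tracker.  The following C fragment is only a sketch under stated
assumptions, not DUROC or GRAM code: the names \verb+subjob_state+,
\verb+report_transition+, and \verb+barrier_may_release+ are hypothetical,
and the three states are simply those named in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Life-cycle states named in the text; the identifiers are illustrative. */
enum subjob_state { SUBJOB_PENDING, SUBJOB_RUNNING, SUBJOB_TERMINATING };

/* Record one reported state transition.  States only move forward, one
 * step at a time (pending -> running -> terminating).
 * Returns 1 if the transition was accepted, 0 if it was rejected. */
int report_transition(enum subjob_state *state, enum subjob_state next)
{
    assert(state != NULL);
    if ((int)next != (int)*state + 1)
        return 0;
    *state = next;
    return 1;
}

/* The barrier opens only once every subcomputation has started executing,
 * that is, once no subcomputation is still pending. */
int barrier_may_release(const enum subjob_state *states, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (states[i] == SUBJOB_PENDING)
            return 0;
    return 1;
}
```

In this sketch a coordinator would call \verb+report_transition+ each time a
server reports a change and would release the barrier the first time
\verb+barrier_may_release+ returns nonzero.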
411: 
412: \begin{figure}
413: \begin{center}
414: \includegraphics{grid.eps}
415: \end{center}
416: \caption{An example of an \hbox{MPICH-G2} application running on a 
417:     computational grid involving 4 processes on an IBM~SP at Site A and 8 
418:     processes distributed evenly across two Linux clusters at Site B.}
419: \label{fig-grid}
420: \end{figure}
421: 
422: \begin{figure}
423: \begin{center}
424: \begin{tabular}{|cc|cccccccccccc|}                           \hline
425: \multicolumn{2}{|c|} {\em Rank}    & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 \\ \hline \hline
426: \multicolumn{2}{|c|} {\em Depth}   & 4 & 4 & 4 & 4 & 3 & 3 & 3 & 3 & 3 & 3 &  3 &  3 \\ \hline
427:                      & wide area   & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0  &  0 \\ 
428: {\em Colors}         & local area  & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 &  1 &  1 \\
429:                      & system area & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 2 & 2 &  2 &  2 \\
430:                      & {\em v}MPI  & 0 & 0 & 0 & 0 &   &   &   &   &   &   &    &    \\ \hline
431: \end{tabular}
432: \end{center}
433: \caption{An example of {\em depths} and {\em colors} used by \hbox{MPICH-G2}
434:     to represent network topology in a computational grid.}
435: \label{fig-depths-colors}
436: \end{figure}
437: 
438: After the processes have started, \hbox{MPICH-G2} uses information
439: specified in the RSL script to create a {\em multilevel clustering} of the
440: processes based on the underlying network topology.
441: Figure~\ref{fig-grid}
442: depicts an MPI application involving 12 processes distributed across 
443: three machines located at two sites.  We depict 4 processes 
444: (\verb+MPI_COMM_WORLD+ ranks 0-3) on the IBM~SP at Site~A and 
445: 4 processes on each of two Linux clusters (\verb+MPI_COMM_WORLD+ ranks 4-7
446: and 8-11, respectively) at Site~B.
447: Each process in \verb+MPI_COMM_WORLD+ is assigned a {\em topology depth}.
448: Processes
449: that communicate using only TCP are assigned topology depths of 3 
450: (to distinguish between wide area, local area, and intramachine TCP 
451: messaging), and processes
452: that can also communicate using a {\em v}MPI have a topology
453: depth of 4.
454: Using these topology depths, \hbox{MPICH-G2} groups processes at a particular
455: level through the assignment of {\em colors}.  Two processes are assigned
456: the same color at a particular level if they can communicate with each
457: other at that network level.
458: 
459: Figure~\ref{fig-depths-colors} depicts the {\em topology depths}
460: and {\em colors} for the processes depicted in Figure~\ref{fig-grid}.  Those
461: processes capable of communicating over {\em v}MPI (i.e., those executing
462: on the IBM~SP) have a depth of 4, while the other processes (i.e., those
463: executing on a Linux cluster) have a depth of 3.  Since all processes are on
464: the same wide-area network, they all have the same {\em color} (0) at the
465: wide-area level.  Similarly, at the local-area level, all the processes 
466: at Site A are assigned one color (0), while all the processes at Site B
467: are assigned another (1).  This structure continues through the system-area level,
468: where processes are assigned the same color if and only if they are 
469: on the same machine. Finally, processes that can communicate
470: over a {\em v}MPI are assigned the same color at the {\em v}MPI
471: level if and only if they can communicate directly with each other over
472: the {\em v}MPI.
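
The assignment rules above can be made concrete with a short C sketch.  It
is an illustration only, not \hbox{MPICH-G2} source code: the
\verb+placement+ structure, the function \verb+assign_topology+, and the
marker \verb+NO_COLOR+ are hypothetical, and the sketch assumes that machine
indices are unique across sites and that all processes on a {\em v}MPI-capable
machine belong to one vendor MPI job.

```c
#include <assert.h>
#include <stddef.h>

enum { WAN = 0, LAN = 1, SAN = 2, VMPI = 3, MAX_LEVELS = 4 };
#define NO_COLOR (-1)   /* assumed marker for a level a process lacks */

/* Hypothetical description of where one process runs. */
struct placement {
    int site;       /* index of the site (local-area network)   */
    int machine;    /* machine index, unique across all sites   */
    int has_vmpi;   /* nonzero if the machine has a vendor MPI  */
};

/* Fill depth[i] and color[i][level] for n processes following the rules
 * in the text: everyone shares the wide-area color, the site decides the
 * local-area color, the machine decides the system-area color, and only
 * vMPI-capable processes receive a fourth-level color. */
void assign_topology(const struct placement *p, size_t n,
                     int *depth, int (*color)[MAX_LEVELS])
{
    size_t i;
    assert(p != NULL && depth != NULL && color != NULL);
    for (i = 0; i < n; i++) {
        depth[i]       = p[i].has_vmpi ? 4 : 3;
        color[i][WAN]  = 0;
        color[i][LAN]  = p[i].site;
        color[i][SAN]  = p[i].machine;
        color[i][VMPI] = p[i].has_vmpi ? p[i].machine : NO_COLOR;
    }
}
```

Applied to the 12 processes of Figure~\ref{fig-grid}, with the IBM~SP as
machine 0 at site 0 and the two Linux clusters as machines 1 and 2 at site 1,
this sketch produces the depths and colors listed in
Figure~\ref{fig-depths-colors}, with \verb+NO_COLOR+ standing in for the
blank {\em v}MPI entries.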
473: 
474: Topology depths and colors are used in the multilevel topology-aware
475: collective operations and topology-discovery mechanism described in
476: Sections~\ref{g2-impl} and~\ref{g2-mgmt}, respectively.
477: 
478: 
479: %
480: \subsection{Heterogeneous Communications}
481: \label{g2-impl}
482: 
483: MPICH-G2 achieves major performance improvements relative to
484: the earlier \hbox{MPICH-G}~\cite{mpi-nexus-pc} by replacing
485: Nexus~\cite{JPDCNexus}, the multimethod, single-sided communication library used for
486: all communication in \hbox{MPICH-G}, with specialized MPICH-specific
487: communication code.  While Nexus has attractive features (e.g.,
488: multiprotocol support with highly tuned TCP support and automatic data
489: conversion), other attributes have proved less
490: attractive from a performance perspective. \hbox{MPICH-G2} now handles
491: all communication directly by reimplementing the strengths of
492: Nexus and improving on its weaknesses.  The result, as we show in
493: Section~\ref{sec-exp}, is that we achieve performance virtually
494: identical to vendor MPI and MPICH configured with the default TCP
495: (ch\_p4) device.  We provide here
496: a detailed description of the improvements and additions to
497: \hbox{MPICH-G} used to achieve this impressive performance.
498: 
499: \paragraph{Increased bandwidth.}
500: In \hbox{MPICH-G}, each communication involved the copying of data
501: to and from Nexus buffers in sending and receiving processes.
502: \hbox{MPICH-G2} eliminates these two extra copies in the case of
503: intramachine messages where a vendor MPI exists. In this situation, 
504: sends and receives now flow directly from and to application buffers, respectively.  In
505: addition, for TCP
506: messaging involving basic MPI datatypes (e.g., \verb+MPI_INT+, 
507: \verb+MPI_FLOAT+) the sending process also transmits directly 
508: from the application buffer.
509: 
510: \paragraph{Reduced latency for intramachine vendor MPI messaging.}
511: Multiprotocol support is achieved in Nexus by polling each protocol
512: (TCP, vendor MPI, etc.) for incoming messages in a roundrobin
513: fashion~\cite{MultiMethodJPDC}. However, this strategy is inefficient
514: in many situations: it is relatively expensive to poll a TCP socket,
515: and in practice many processes in an
516: \hbox{MPICH-G2} computation use only vendor MPI (for communicating with
517: other processes on the same machine).
518: 
519: While this inefficiency can be reduced by adaptive
520: polling~\cite{MultiMethodJPDC} or by introducing distinct
521: proxy processes~\cite{pacx,stampi}, \hbox{MPICH-G2} takes a more direct
522: approach, exploiting the knowledge about message source that is
523: provided by MPI receive commands to eliminate TCP polling altogether
524: in many situations.  \hbox{MPICH-G2} polls TCP {\em only} when the
525: application is expecting data from a source that dictates, or might
526: dictate (e.g., \verb+MPI_Recv+ specifies source=\verb+MPI_ANY_SOURCE+),
527: TCP messaging.
528: 
529: This avoidance of unnecessary polling when coupled with the need to
530: guarantee progress on both the vendor MPI and TCP protocols leads to
531: implementation decisions that can affect an application's
532: point-to-point communication performance.  Specifically, for processes
533: executing on machines where a vendor MPI is available, the context in
534: which the application calls \verb+MPI_Recv+ affects the manner in
535: which \hbox{MPICH-G2} implements that function, as follows:
536: 
537: \begin{itemize}
538: \item {\bf Specified.}
539: The source rank specified in the call to \verb+MPI_Recv+ explicitly
540: identifies a process on the same machine (in the same vendor MPI job).
541: Furthermore, no asynchronous requests are outstanding (e.g., incomplete
542: \verb+MPI_Irecv+ and/or \verb+MPI_Isend+).  If these two conditions
543: are met, \hbox{MPICH-G2} implements \verb+MPI_Recv+ by directly
544: calling the \verb+MPI_Recv+ of the underlying vendor MPI.  These are the
545: most favorable circumstances under which an \verb+MPI_Recv+ can be
546: performed.
547: 
548: \item {\bf Specified-pending.}
549: This category is similar to the {\em specified} category in that the
550: \verb+MPI_Recv+ specifies an explicit source rank on the same machine.
551: This time, however, one or more unsatisfied receive requests are
552: present, and each such request specifies a source on the same machine.
553: This situation forces \hbox{MPICH-G2} to continuously poll
554: (\verb+MPI_Iprobe+) the vendor MPI for incoming messages.  This
555: scenario results in less efficient MPICH-G2 performance since the
556: induced polling loop increases latency.
557: 
558: \item {\bf Multimethod.}
559: Here the source rank for the \verb+MPI_Recv+ is \verb+MPI_ANY_SOURCE+
560: or \verb+MPI_Recv+ is called in the presence of unsatisfied
561: asynchronous requests that require, or might require, TCP messaging.
562: In this situation, \hbox{MPICH-G2} must poll both TCP and the
563: vendor MPI continuously.  This is the least efficient \hbox{MPICH-G2} scenario, since the relatively large cost of TCP polling results in even greater
564: latency.
565: \end{itemize}
566: In Section~\ref{sec-exp}, we present a quantitative analysis of the performance differences that
567: result from these different structures.
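
The three cases above amount to a small decision procedure, sketched here in
C.  This is a simplified illustration, not the \hbox{MPICH-G2}
implementation: \verb+classify_recv+, \verb+recv_path+, and
\verb+ANY_SOURCE+ (a stand-in for \verb+MPI_ANY_SOURCE+) are hypothetical
names, and an explicit source on another machine is treated as requiring TCP,
which is an assumption because the three categories do not cover that case.

```c
#include <assert.h>
#include <stddef.h>

#define ANY_SOURCE (-1)   /* stand-in for MPI_ANY_SOURCE in this sketch */

enum recv_path { RECV_SPECIFIED, RECV_SPECIFIED_PENDING, RECV_MULTIMETHOD };

/* Nonzero if a message from `source` requires, or might require, TCP.
 * same_machine[r] is nonzero when rank r shares this vendor MPI job. */
static int may_need_tcp(int source, const int *same_machine)
{
    return source == ANY_SOURCE || !same_machine[source];
}

/* Classify a blocking receive.  `pending` lists the peer ranks of the
 * npending outstanding asynchronous requests. */
enum recv_path classify_recv(int source, const int *same_machine,
                             const int *pending, size_t npending)
{
    size_t i;
    assert(same_machine != NULL);
    assert(npending == 0 || pending != NULL);

    /* TCP must be polled if this receive or any pending request may use it. */
    if (may_need_tcp(source, same_machine))
        return RECV_MULTIMETHOD;
    for (i = 0; i < npending; i++)
        if (may_need_tcp(pending[i], same_machine))
            return RECV_MULTIMETHOD;

    /* Only vendor MPI is involved: poll it if requests are outstanding,
     * otherwise call the vendor receive directly. */
    return npending > 0 ? RECV_SPECIFIED_PENDING : RECV_SPECIFIED;
}
```

The ordering of the checks mirrors the cost ordering in the list: any
possibility of TCP traffic forces the most expensive path, and the direct
vendor call is chosen only when nothing else is outstanding.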
568: 
569: \paragraph{More efficient use of sockets.}
570: The Nexus single-sided communication paradigm resulted in \hbox{MPICH-G}
571: opening {\em two pairs of 
572: sockets} between communicating processes and using each pair as a 
573: simplex channel (i.e., data always flowing in one direction over each socket 
574: pair). \hbox{MPICH-G2} opens a {\em single pair of sockets} between two 
575: processes and sends data in both directions.  This approach reduces the use 
576: of system resources; moreover, by using sockets in the bidirectional manner in which 
577: they were intended, it also improves TCP efficiency.
578: 
579: \paragraph{Multilevel topology-aware collective operations.}
580: Early implementations of MPI's collective operations sought to
581: construct communication structures that were optimal under the
582: assumption that all processes were equidistant from one
583: another~\cite{postal,logp}. Since this assumption is unlikely to be valid in
584: Grid environments, however, it is desirable that a Grid-enabled MPI
585: incorporate collective operation implementations that take into
586: account the actual topology. \hbox{MPICH-G2} does this, and we have
587: demonstrated substantial performance improvements for our {\em
588: multilevel topology-aware} approach~\cite{optcollops} relative both to
589: topology-{\em un}aware binomial trees and earlier topology-aware
590: approaches that distinguish only between ``intracluster'' and
591: ``intercluster'' communications~\cite{starT,magpie}.
592: 
593: As we explain in the next subsection, \hbox{MPICH-G2}'s topology-aware
594: collective operations are constructed in terms of topology discovery
595: mechanisms that can also be used by topology-aware applications.
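
To illustrate how colors can drive a topology-aware collective, the
following C sketch derives a broadcast tree rooted at rank 0 in which each
group at each level is entered through a single leader.  It is a deliberately
simplified illustration, not the algorithm of~\cite{optcollops}: the names
\verb+group_leader+ and \verb+bcast_parent+ are hypothetical, only the three
TCP levels are considered, and each leader forwards to its group with a flat
fan-out.

```c
#include <assert.h>
#include <stddef.h>

enum { NLEVELS = 3 };   /* wide area, local area, system area */

/* Leader of rank's group at `level`: the lowest rank whose colors match
 * those of `rank` at every level from 0 through `level`. */
static int group_leader(int (*color)[NLEVELS], int rank, int level)
{
    int q, k;
    for (q = 0; q < rank; q++) {
        int same = 1;
        for (k = 0; k <= level; k++)
            if (color[q][k] != color[rank][k])
                same = 0;
        if (same)
            return q;
    }
    return rank;
}

/* Parent of `rank` in a broadcast rooted at rank 0, or -1 for the root.
 * A process receives from the leader of the innermost group that it does
 * not lead itself, so each machine and each site is entered only once. */
int bcast_parent(int (*color)[NLEVELS], int rank)
{
    int level;
    assert(color != NULL && rank >= 0);
    for (level = NLEVELS - 1; level >= 0; level--) {
        int leader = group_leader(color, rank, level);
        if (leader != rank)
            return leader;
    }
    return -1;
}
```

With the colors of Figure~\ref{fig-depths-colors}, rank 4 receives from
rank 0, rank 8 receives from rank 4, and every other process receives from
the leader of its own machine, so exactly one message crosses the wide-area
link.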
596: 
597: 
598: %
599: \subsection{Application-Level Management of Heterogeneity}
600: \label{g2-mgmt}
601: 
602: We have experimented within MPICH-G2 with a variety of mechanisms for
603: application-level management of heterogeneity in the underlying
604: platform. We mention two here.
605: 
606: \paragraph{Topology discovery.}
607: Once an MPI program starts, all processes can be viewed as equivalent,
608: distinguished only by their rank. This level of abstraction is
609: desirable from a programming viewpoint but makes it difficult to
610: write programs that exploit aspects of the underlying physical
611: topology, for example, to minimize expensive intercluster
612: communications.
613: 
614: MPICH-G2 addresses this issue {\em within the standard MPI framework}
615: by using the MPI communicator construct to deliver topology
616: information to an application. 
617: It associates {\em attributes} with each MPI
618: communicator to communicate this topology information, which is expressed
619: within each process in terms of {\em topology depths} and
620: {\em colors}, as described in Section~\ref{sec-startup}.
621: 
622: \begin{figure}
623: \begin{small}
624: \begin{verbatim}
625: #include <mpi.h>
626: 
627: int main(int argc, char *argv[])
628: {
629:   int me, flag;
630:   int *depths;
631:   int **colors;
632:   MPI_Comm LANcomm, VcommA, VcommB;
633: 
634:   MPI_Init(&argc, &argv);
635:   MPI_Comm_rank(MPI_COMM_WORLD, &me);
636:   MPI_Attr_get(MPI_COMM_WORLD, MPICHX_TOPOLOGY_DEPTHS, &depths, &flag);
637:   MPI_Attr_get(MPI_COMM_WORLD, MPICHX_TOPOLOGY_COLORS, &colors, &flag);
638: 
639:   MPI_Comm_split(MPI_COMM_WORLD, colors[me][1], 0, &LANcomm);
640:   MPI_Comm_split(MPI_COMM_WORLD, (depths[me] == 4 ? colors[me][3] : -1), 
641:                  0, &VcommA);
642:   MPI_Comm_split(MPI_COMM_WORLD, 
643:                  (depths[me] == 4 ? colors[me][3] : MPI_UNDEFINED), 
644:                  0, &VcommB);
645: 
646:   MPI_Finalize();
647: }
648: \end{verbatim}
649: \end{small}
650: \caption{An example \hbox{MPICH-G2} application that uses {\em topology depths}
651:     and {\em colors} to create communicators that group processes into
652:     various topology-aware clusters.}
653: \label{fig-hier}
654: \end{figure}
655: 
656: MPICH-G2 applications can then query communicators to retrieve
657: attribute values and structure themselves appropriately. For example,
658: it is straightforward to create new communicators that reflect
659: the underlying network topology. Figure~\ref{fig-hier}
660: depicts an \hbox{MPICH-G2} application that first queries
661: the \hbox{MPICH-G2-defined} communicator attributes 
662: \verb+MPICHX_TOPOLOGY_DEPTHS+ and \verb+MPICHX_TOPOLOGY_COLORS+
663: to discover topology depths and colors, respectively, and
664: then uses those values to create three communicators: \verb+LANcomm+,
665: which groups processes based on site boundaries, \verb+VcommA+, which
666: groups processes based on their ability to communicate with each
667: other over {\em v}MPI, while placing all processes that cannot communicate
668: over {\em v}MPI into a separate communicator, and \verb+VcommB+,
669: which groups the processes in much the same way as \verb+VcommA+, but 
670: this time does not place processes that cannot communicate over {\em v}MPI 
671: in a communicator (i.e., \verb+VcommB+ is set to 
672: \verb+MPI_COMM_NULL+ for those processes).  
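
The color passed to each \verb+MPI_Comm_split+ call in
Figure~\ref{fig-hier} determines which processes end up together.  As a
minimal sketch that needs no MPI installation, the plain C fragment
below computes those colors and the resulting group sizes from a table
of depths and colors.  The type \verb+proc_topology+, the helpers
\verb+split_color+ and \verb+group_size+, and the sentinel
\verb+NO_COMM+ (standing in for \verb+MPI_UNDEFINED+) are illustrative
names and are not part of \hbox{MPICH-G2}.

\begin{small}
\begin{verbatim}
#include <assert.h>
#include <stddef.h>

#define MAX_LEVELS 4
#define NO_COMM (-1000)  /* hypothetical stand-in for MPI_UNDEFINED;
                            any value that is never a real color */

typedef struct {
  int depth;               /* number of valid topology levels */
  int colors[MAX_LEVELS];  /* colors[l] is valid for l < depth */
} proc_topology;

/* Color a process would pass to MPI_Comm_split for `level`, or
   `fallback` if the process has no entry at that level. */
int split_color(const proc_topology *p, int level, int fallback)
{
  assert(p != NULL && level >= 0 && level < MAX_LEVELS);
  return (level < p->depth) ? p->colors[level] : fallback;
}

/* Number of processes that would share a communicator with `me`;
   0 if `me` would receive no communicator at all. */
int group_size(const proc_topology *procs, int nprocs, int me,
               int level, int fallback)
{
  int mine = split_color(&procs[me], level, fallback);
  int i, n = 0;

  if (mine == NO_COMM)
    return 0;
  for (i = 0; i < nprocs; i++)
    if (split_color(&procs[i], level, fallback) == mine)
      n++;
  return n;
}
\end{verbatim}
\end{small}

For a hypothetical four-process run in which processes 0 and 1 share a
vMPI-capable machine (depth 4) and processes 2 and 3 are workstations
without vMPI (depth 3), \verb+group_size+ at level 3 reports a group of
two for process 0.  Processes 2 and 3 either form their own group of
two when the fallback color is $-1$, as for \verb+VcommA+, or receive
no communicator when the fallback is \verb+NO_COMM+, as for
\verb+VcommB+.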
673: 
674: \paragraph{Quality-of-service management.}
We have experimented with similar techniques for purposes of
quality-of-service management~\cite{mpich-gq}. When running over a shared
677: network, an MPI application may wish to negotiate with an external
678: resource management system to obtain dedicated access to (part of) the
679: network. We show that communicator attributes can be used to set and
680: initiate \hbox{quality-of-service} parameters between selected processes.
681: 
682: 
683: %
684: %
685: %
686: \section{Performance Experiments}
687: \label{sec-exp}
688: 
689: We present the results of detailed performance experiments that
690: characterize the performance of \hbox{MPICH-G2} and demonstrate the
691: major improvements achieved relative to its predecessor,
692: \hbox{MPICH-G}.  We begin by looking at the performance of {\em
693: intramachine} communication over a vendor MPI.  Then, we examine
694: performance when TCP is the only choice for communicating between a
pair of processes.  All results were obtained with
mpptest~\cite{pvmmpi99-mpptest}, the performance tool included in the
MPICH distribution.
698: 
699: 
700: %
701: \subsection{Vendor MPI}
702: 
703: Evaluating the performance of MPICH-G2 when using a vendor MPI as an
704: underlying communication mechanism is not as simple as running a
705: single set of ping-pong tests.  As discussed earlier, the performance
706: achieved by \hbox{MPICH-G2} can be affected by outstanding requests
707: and by the use of \verb+MPI_ANY_SOURCE+.  Therefore, we have divided the
708: experiments into the three categories described in Section~\ref{g2-impl}.
709: 
710: Our vendor MPI experiments were run on an SGI Origin2000 at Argonne
711: National Laboratory.
712: Both \hbox{MPICH-G2} and \hbox{MPICH-G} were built using a
713: nonthreaded, no-debug flavor of Globus 1.1.4 and performed
714: intramachine communication via SGI's implementation of MPI.
715: 
716: \begin{figure}
717: \begin{center}
718: \includegraphics{vmpi-latency.eps}
719: \end{center}
720: \caption{vMPI experiments -- small message latency.}
721: \label{fig-vmpi-lat}
722: \end{figure}
723: One \hbox{MPICH-G2} design goal was to minimize latency overhead for
724: intramachine communication relative to an underlying vendor MPI.  As
can be seen in Figure~\ref{fig-vmpi-lat}, \hbox{MPICH-G2} does an
726: outstanding job in this regard: only a few extra microseconds of
727: latency are introduced by \hbox{MPICH-G2} when the source of the
728: message is specified and no other requests are outstanding.  In
contrast, \hbox{MPICH-G} added approximately 80 microseconds of
latency to each message because of the multiple steps required to
implement the Nexus single-sided communication model.
732: 
733: The introduction of pending receive requests has a modest impact on
734: \hbox{MPICH-G2} message latencies.  Messages falling into the {\em
735: specified-pending} category incur slightly more overhead, as the
736: \hbox{MPICH-G2} progress engine must continuously poll (probe) the
737: vendor MPI rather than blocking in a receive.  Overall,
738: \hbox{MPICH-G2} latencies increase by several microseconds relative to
739: the first case but are still far less than those of \hbox{MPICH-G}.
740: 
741: The use of \verb+MPI_ANY_SOURCE+ has the largest impact on
742: \hbox{MPICH-G2} performance.  The additional cost is associated with
743: having to poll TCP as well as the vendor MPI.  Polling TCP
744: increases the latency of
745: messages by nearly 20 microseconds over those in the {\em
specified-pending} category.  Although the increase is significant, these
latencies are still considerably lower than those of \hbox{MPICH-G}.
748: 
749: \begin{figure}
750: \begin{center}
751: \includegraphics{vmpi-bandwidth.eps}
752: \end{center}
753: \caption{vMPI experiments -- realized bandwidth.}
754: \label{fig-vmpi-bw}
755: \end{figure}
756: While \hbox{MPICH-G2} message latencies are affected by the use of
757: \verb+MPI_ANY_SOURCE+ and pending receive requests, the realized
758: bandwidths are largely unaffected.  Figure~\ref{fig-vmpi-bw} shows
759: the bandwidths obtained for messages up to one megabyte.
760: We see that the bandwidths for
761: \hbox{MPICH-G2} are nearly identical for all but small
762: messages.  While the large message bandwidths for \hbox{MPICH-G2} are
approximately 7\% less than those for the vendor MPI (for reasons
764: we do not yet understand), they
765: represent an improvement of more than 60\% over \hbox{MPICH-G}.
766: 
767: 
768: %
769: \subsection{TCP/IP}
770: 
771: Performance optimization work on \hbox{MPICH-G2} performed to date has
772: focused on intramachine messaging when a vendor MPI is used as the
773: underlying communication mechanism.  The \hbox{MPICH-G2} TCP/IP
communication code has not been optimized.  However, its performance
is quite reasonable when compared with that of \hbox{MPICH-G} and of MPICH
configured with the default TCP (ch\_p4) device.
777: 
778: All TCP/IP performance measurements were taken using a pair of
779: SUN workstations in Argonne's Mathematics and Computer Science
780: Division.  These two machines were connected to a local-area network
781: via gigabit Ethernet.  Both \hbox{MPICH-G} and \hbox{MPICH-G2} were
782: built using a nonthreaded, no-debug flavor of Globus 1.1.4.
783: 
784: \begin{figure}
785: \begin{center}
786: \includegraphics{tcp-latency.eps}
787: \end{center}
788: \caption{TCP/IP experiments -- small message latency.}
789: \label{fig-tcp-lat}
790: \end{figure}
791: Figure~\ref{fig-tcp-lat} shows the small message
792: latencies exhibited by all three systems.  We see that
793: for most message sizes, \hbox{MPICH-G2} is 20\% to 30\% slower than
794: MPICH/ch\_p4, although the difference is much smaller for very small
795: messages.  We also see that \hbox{MPICH-G2} latencies, in most cases,
796: are somewhat less than those of \hbox{MPICH-G}.
797: 
798: The most notable data point is barely visible on the graph but
799: emphasizes a clear optimization that is missing in \hbox{MPICH-G2}.
800: The latency for zero-byte messages is 140 microseconds, while the
801: latency for an eight-byte message is 224 microseconds.  The reason for
802: this large difference is that \hbox{MPICH-G2} currently uses separate
803: system calls to send
804: the message header and the message data.
805: This data point suggests that
806: by combining these two writes into a single vector write, we could
807: reduce the latency of small messages significantly.  While this
808: difference might seem unimportant for machines separated by a wide-area
809: network, it can be significant when \hbox{MPICH-G2} is used to
combine multiple machines within the same machine room or even at the
811: same site.
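
The optimization suggested above can be sketched with the POSIX
\verb+writev+ call, which gathers several buffers into a single system
call.  The fragment below is a minimal illustration rather than
\hbox{MPICH-G2} code: the header layout \verb+msg_header+ and the
function names \verb+send_two_writes+ and \verb+send_vectored+ are
assumptions made for the example, and short writes are not retried.

\begin{small}
\begin{verbatim}
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>    /* writev, struct iovec */
#include <unistd.h>     /* write */

/* Hypothetical fixed-size header; the real wire format differs. */
typedef struct {
  uint32_t tag;
  uint32_t length;      /* payload length in bytes */
} msg_header;

/* Behavior described in the text: header and data are sent with
   separate system calls.  Returns bytes written or -1 on error. */
ssize_t send_two_writes(int fd, const msg_header *hdr,
                        const void *data, size_t len)
{
  ssize_t n1, n2;

  assert(hdr != NULL);
  n1 = write(fd, hdr, sizeof *hdr);
  if (n1 < 0)
    return -1;
  if (len == 0)
    return n1;          /* zero-byte message: header only */
  n2 = write(fd, data, len);
  return (n2 < 0) ? -1 : n1 + n2;
}

/* Suggested alternative: one gathered write for header and data.
   Returns the byte count reported by writev or -1 on error. */
ssize_t send_vectored(int fd, const msg_header *hdr,
                      const void *data, size_t len)
{
  struct iovec iov[2];

  assert(hdr != NULL);
  iov[0].iov_base = (void *)hdr;   /* iov_base is not const */
  iov[0].iov_len  = sizeof *hdr;
  iov[1].iov_base = (void *)data;
  iov[1].iov_len  = len;
  return writev(fd, iov, len > 0 ? 2 : 1);
}
\end{verbatim}
\end{small}

Either function can be exercised on any stream descriptor, for example
one end of a pipe or a connected TCP socket.  A zero-byte message sends
only the header in both versions, so the saving applies to messages
that carry data.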
812: 
813: \begin{figure}
814: \begin{center}
815: \includegraphics{tcp-bandwidth.eps}
816: \end{center}
817: \caption{TCP/IP experiments -- realized bandwidth.}
818: \label{fig-tcp-bw}
819: \end{figure}
820: 
821: Figure~\ref{fig-tcp-bw} shows the bandwidths
822: obtained by all three systems for message sizes up to one megabyte.
823: For large messages, we see that \hbox{MPICH-G2} performs approximately
824: 5\% better than the other two systems.  This improvement is a result
825: of the message data being sent directly from the user buffer rather
826: than being copied into a separate buffer before \verb+write+ is
827: called.  For preposted receives with contiguous data, further
828: improvement is possible.  Data for these receives can be read directly
829: into the user buffer, avoiding a buffer copy that, at present, always
830: takes place at the receiver.
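
The receive-side improvement described above amounts to removing one
\verb+memcpy+.  The sketch below contrasts the two paths for a payload
whose length is already known from the message header.  The function
names are hypothetical, the staging buffer is allocated per message
only to keep the example short, and \verb+EINTR+ handling is omitted.

\begin{small}
\begin{verbatim}
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly len bytes, retrying on short reads.
   Returns 0 on success, -1 on error or end of file. */
static int read_full(int fd, void *buf, size_t len)
{
  char *p = buf;

  while (len > 0) {
    ssize_t n = read(fd, p, len);
    if (n <= 0)
      return -1;
    p += n;
    len -= (size_t)n;
  }
  return 0;
}

/* Path described in the text: the payload is read into a
   staging buffer and then copied into the user buffer. */
int recv_staged(int fd, void *user_buf, size_t len)
{
  void *tmp;
  int rc;

  assert(user_buf != NULL || len == 0);
  if (len == 0)
    return 0;
  tmp = malloc(len);
  if (tmp == NULL)
    return -1;
  rc = read_full(fd, tmp, len);
  if (rc == 0)
    memcpy(user_buf, tmp, len);   /* the copy to be avoided */
  free(tmp);
  return rc;
}

/* Possible improvement for a preposted receive of contiguous
   data: read straight into the user buffer. */
int recv_direct(int fd, void *user_buf, size_t len)
{
  assert(user_buf != NULL || len == 0);
  return read_full(fd, user_buf, len);
}
\end{verbatim}
\end{small}

Both functions leave the same bytes in the user buffer; the direct
version simply touches each byte once instead of twice.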
831: 
832: 
833: %
834: %
835: %
836: \section{Application Experiences}
837: \label{sec-apps}
838: 
839: MPICH-G2 has been used by many groups worldwide for a wide variety
840: of purposes. Here we mention a few relevant experiences that
841: highlight interesting features of the system.
842: 
843: One interesting use of MPICH-G2 is to run conventional MPI programs
844: across multiple parallel computers within the same machine room.  In
845: this case, \hbox{MPICH-G2} is used primarily to manage startup and to
846: achieve efficient communication via use of different low-level
847: communication methods.  Other groups are using \hbox{MPICH-G2} to
848: distribute applications across computers located at different sites,
for example, Taylor for MM5 climate modeling on the NSF
TeraGrid~\cite{teragrid,ncsa-news}, Mahinthakumar for forming multivariate
geographic clusters to produce maps of regions of ecological
similarity~\cite{kumarsc99}, Larsson for studies of distributed
execution of a large computational electromagnetics code~\cite{Olle},
and Chen and Taylor for studies of automatic partitioning techniques
as applied to finite element codes~\cite{Jian}.
856: 
857: MPICH-G2 has also been successfully used in demonstrations that
858: promote MPI as an application-level interface to Grids for
859: nontraditional distributed computing applications, for example, Roy et al.\
860: for studies in using MPI idioms for setting QoS
861: parameters~\cite{mpich-gq} and Papka and Binns for creating
862: distributed visualization pipelines using \hbox{MPICH-G2's} client/server MPI-2
863: extensions~\cite{teragrid,ncsa-news}.
864: 
865: \hbox{MPICH-G2} was awarded a 2001 Gordon Bell Award
866: for its role in an astrophysics application used for solving problems
867: in numerical relativity to study gravitational waves from colliding
868: black holes~\cite{CactusGBsc01}.  The winning team used
869: \hbox{MPICH-G2} to run across four supercomputers in California and
Illinois, achieving scaling of 88\% (1,140 CPUs) and 63\% (1,500 CPUs)
while computing a problem five times larger than any previous
run.
873: 
874: 
875: %
876: %
877: %
878: \section{Future Work}
879: 
880: The successful development of MPICH-G2 and its widespread adoption
881: both make it a useful platform for future research and create
882: significant interest in its continued development.
883: 
884: One immediate area of concern is full support for MPI-2 features.  In
885: particular, support for dynamic process management will allow
886: \hbox{MPICH-G2} to be used for a wider class of Grid computations in
887: which either application requirements or resource availability changes
888: dynamically over time. The necessary support exists in the Globus
889: Toolkit, and so this work depends primarily on the availability of the
890: next-generation ADI-3. Less obvious, but very interesting, is how to
891: integrate support for fault tolerance into \hbox{MPICH-G2} in a
892: meaningful way.
893: 
894: A second area of concern relates to exploring and refining
895: \hbox{MPICH-G2} support for application-level management of
896: heterogeneity. Initial experiments with topology discovery and 
897: quality-of-service management have been encouraging, but it seems inevitable
898: that application experiences will reveal deficiencies in current
899: techniques or suggest additional \hbox{MPICH-G2} support that
900: could further improve application flexibility.
901: 
902: Our work on collective operations can be improved in various ways.  In
particular, van de Geijn et al.~\cite{pipelining} have shown that there are
advantages in implementing collective operations by segmenting and
pipelining messages when communicating over relatively slow
906: channels (e.g., TCP over local- and wide-area networks).  These
907: pipelining techniques can be used throughout many of the levels in
908: \hbox{MPICH-G2's} multilevel topology-aware collective operations.
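
To illustrate why segmentation helps on slow channels, the sketch below
uses a simple linear cost model that is an assumption of this example
rather than a result taken from~\cite{pipelining}: sending $n$ bytes
between neighbors costs $\alpha + n\beta$, and the message is forwarded
along a chain of $p$ processes.  Without segmentation the last process
finishes after $(p-1)(\alpha + n\beta)$; with $k$ equal segments
forwarded in pipeline fashion it finishes after
$(p+k-2)(\alpha + (n/k)\beta)$.  The function names are hypothetical.

\begin{small}
\begin{verbatim}
#include <assert.h>

/* Store-and-forward along a chain of p processes:
   each of the p-1 hops carries the whole n-byte message. */
double chain_time_unsegmented(int p, double n,
                              double alpha, double beta)
{
  assert(p >= 2);
  return (p - 1) * (alpha + n * beta);
}

/* Same chain with the message cut into k equal segments.
   The first segment needs p-1 steps to reach the end and the
   remaining k-1 segments arrive one step apart. */
double chain_time_pipelined(int p, double n, int k,
                            double alpha, double beta)
{
  assert(p >= 2 && k >= 1);
  return (p + k - 2) * (alpha + (n / k) * beta);
}
\end{verbatim}
\end{small}

With the illustrative values $p=4$, $n=10^6$ bytes, $\alpha=1$~ms, and
$\beta=10^{-7}$~s/byte, the unsegmented time is $0.303$~s, whereas
$k=10$ segments give $0.132$~s.  Because every extra segment adds one
more $\alpha$ term, very large $k$ eventually costs more than it saves.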
909: 
910: 
911: %
912: %
913: %
914: \section{Related Work}
915: 
A variety of approaches have been proposed for programming Grid
917: applications, including object
918: systems~\cite{grim:legion,GannonChapterCite}, Web
919: technologies~\cite{FoxChapterCite,webos}, problem solving
920: environments~\cite{NetSolve,Ninf}, CORBA, workflow systems,
921: high-throughput computing systems~\cite{Nimrod,Condor}, and
922: compiler-based systems~\cite{KennedyChapterCite}. We assume that while
923: different technologies will prove attractive for different purposes, a
924: programming model such as MPI that allows direct control over low-level
925: communications will always be attractive for certain applications.
926: 
927: Other systems that support message passing in heterogeneous environments
928: include the pioneering Parallel Virtual Machine
929: (PVM)~\cite{pvmbook,PVMATM2} and the PACX-MPI~\cite{pacx},
930: MetaMPI~\cite{metampi}, and STAMPI~\cite{stampi} implementations of MPI,
931: each of which addresses issues relating to efficient communication in
932: heterogeneous wide-area systems.  STAMPI supports MPI-2 dynamic process
management features.  PACX-MPI, like \hbox{MPICH-G2}, supports the
automatic startup of distributed computations but uses ssh for that
purpose rather than the GRAM protocol with its integrated GSI
authentication; nor does it address executable staging.
PACX-MPI and STAMPI also differ from \hbox{MPICH-G2} in how they
address wide-area communication. While in \hbox{MPICH-G2} any
940: processor may speak both local and wide-area communication protocols,
941: PACX-MPI and STAMPI forward all off-cluster communication operations to
942: an intermediate gateway node.
943: 
944: Other implementations of MPI include MPICH with the ch\_p4 device
and LAM/MPI~\cite{lam,lam-www}.  In contrast to \hbox{MPICH-G2}, these
implementations were designed for local-area networks rather than
computational Grids.
947: 
948: The Interoperable MPI (IMPI) standards effort~\cite{impi-web} defines
949: standard message formats and protocols with a view to enabling
950: interoperability among different MPI implementations. IMPI does {\em
951: not} address issues of computation management and control;
952: in principle, the techniques developed within \hbox{MPICH-G2} could be
953: used for that purpose.
954: 
955: Other related projects include MagPIe~\cite{magpie} and
956: MPI-StarT~\cite{starT}, which show how careful consideration of
957: communication topologies can result in significant improvements after
958: modifying the MPICH broadcast algorithm, which uses topology-{\em
959: un}aware binomial trees.  However, both limit their view of the
960: network to only two layers; processors are either near or far.
961: Further performance improvements can be realized by adopting the
962: multilevel network view.  We referred in the preceding section to the
work of van de Geijn et al.~\cite{pipelining}.  In~\cite{magpie-PC},
Kielmann et al.\ have extended MagPIe by incorporating van de Geijn's
pipelining idea through a technique they call Parameterized LogP
(PLogP), which is an extension of the LogP model presented
by Culler et al.~\cite{logp}.  In this extension, MagPIe still recognizes only a
968: two-layer communication network, but through parameterized studies of
969: the network they determine ``optimal'' packet sizes.
970: 
971: Various projects have investigated programming model extensions to
972: enable application management of QoS, for example, Quo~\cite{Quo}.
973: The only other relevant effort in the context of MPI is work on
974: real-time extensions to MPI.  MPI/RT~\cite{mpirt-web} provides a QoS
975: interface but is not an established standard and introduces a new
programming interface.  Furthermore, its focus is on real-time needs,
such as predictability of performance and system resource usage, that
are more appropriate for embedded systems than for wide-area networks.
979: 
980: 
981: %
982: %
983: %
984: \section{Summary}
985: 
986: We have described \hbox{MPICH-G2}, an implementation of the Message
987: Passing Interface that uses Globus Toolkit mechanisms to support the
988: execution of MPI programs in heterogeneous wide-area
989: environments. \hbox{MPICH-G2} masks details of underlying networks,
990: software systems, policies, and computer architectures so that diverse
991: distributed resources can appear as a single {\tt
992: MPI\_COMM\_WORLD}. Arbitrary MPI applications can be started on
993: heterogeneous collections of machines simply by typing mpirun:
994: authentication, authorization, executable staging, resource
995: allocation, job creation, startup, and routing of stdout and stderr
996: are all handled automatically via Globus Toolkit mechanisms.
997: \hbox{MPICH-G2} also enables the use of MPI features for user-level
998: management of heterogeneity, for example, via the use of MPI's
999: communicator construct to access system topology information.  A wide
1000: range of successful application experiences have demonstrated
1001: \hbox{MPICH-G2}'s utility in practical settings, both for traditional
1002: simulation applications and for less traditional applications such as
1003: distributed visualization pipelines.
1004: 
1005: While \hbox{MPICH-G2} is already a sophisticated tool that is seeing
1006: widespread use, there are also several areas in which it can be
1007: extended and improved. Support for MPI-2 features, in particular
1008: dynamic process management, will be invaluable for Grid applications
1009: that adapt their resource usage to changing conditions and application
1010: requirements.  This support will be provided as soon as it is
1011: incorporated into MPICH. More challenging is the design of techniques
1012: for effective fault management, a major topic for future research.
1013: Here we may be able to draw upon techniques developed within systems
1014: such as PVM~\cite{pvmbook}.
1015: 
1016: %
1017: %
1018: %
1019: \begin{acknowledge}
1020: 
1021: We thank Olle~Larsson and Warren~Smith for early discussions and for
1022: prototyping the techniques that enable us to use vendor-supplied MPI.
1023: \hbox{MPICH-G2} is, to a large extent, the result of our
1024: \hbox{MPICH-G} experiences.  We therefore thank Jonathan~Geisler, who
1025: originally designed and implemented \hbox{MPICH-G} while at Argonne,
1026: and George~Thiruvathukal, who further developed \hbox{MPICH-G} also
1027: while at Argonne.  We thank William~Gropp, Ewing~Lusk, David~Ashton,
1028: Anthony~Chan, Rob~Ross, Debbie~Swider, and Rajeev~Thakur of the MPICH
1029: group at Argonne for their guidance, assistance, insight, and many
1030: discussions.  We thank Sebastien~Lacour for his efforts in conducting
1031: the performance evaluation and his many other contributions. His
1032: insight and ingenuity were invaluable to the implementation of the
1033: topology-aware components of \hbox{MPICH-G2}.  Finally, we thank all
1034: the members of the Globus development team for their support,
1035: patience, and many ideas.
1036: 
1037: This work was supported in part by the Mathematical, Information, and
1038: Computational Sciences Division subprogram of the Office of Advanced
1039: Scientific Computing Research, U.S. Department of Energy, under Contract
1040: W-31-109-Eng-38; by the U.S. Department of Energy under Cooperative
1041: Agreement No. DE-FC02-99ER25398; by the Defense Advanced Research
1042: Projects Agency under contract N66001-96-C-8523; by the National Science
1043: Foundation; and by the NASA Information Power Grid program.
1044: 
1045: \end{acknowledge}
1046: 
1047: 
1048: %
1049: %
1050: %
1051: \bibliographystyle{plain}
1052: \bibliography{foster_bibliography,mpi-allbib,mpi-book,globus,prop}
1053: 
1054: 
1055: %
1056: %
1057: 
1058: Nicholas T. Karonis received a B.S. in finance and a B.S. in computer
1059: science from Northern Illinois University in 1985, an M.S. in computer
1060: science from Northern Illinois University in 1987, and a Ph.D. in 
1061: computer science from Syracuse University in 1992.  He spent 
1062: summers from 1981 to 1991 as a student at Argonne National Laboratory,
1063: where he worked on the p4 message-passing library, automated reasoning,
1064: and genetic sequence alignment.  From 1991 to 1995 he worked on the control
1065: system at Argonne's Advanced Photon Source and from 1995 to 1996 
1066: for the Computing Division at Fermi National Accelerator Laboratory.
1067: Since 1996 he has been an assistant professor of computer science at 
1068: Northern Illinois University and a resident associate guest of Argonne's
1069: Mathematics and Computer Science Division where he has been a member
of the Globus Project.  His current research interest is message-passing
systems in computational grids.
1072: 
1073: Brian Toonen received his B.S. in computer science from the University of
1074: Wisconsin Oshkosh in 1993, and his M.S. in computer science from the
1075: University of Wisconsin-Madison in 1997.  He is a senior scientific
1076: programmer with the Mathematics and Computer Science Division at Argonne
1077: National Laboratory.  Brian's research interests include parallel and
1078: distributed computing, operating systems, and networking.  He is currently
1079: working with the MPICH team to create a portable, high-performance
1080: implementation of the MPI-2 standard.  Prior to joining the MPICH team, he
1081: was a senior developer for the Globus Project.
1082: 
1083: Ian Foster received his B.Sc. (Hons I) at the University of Canterbury
in 1979 and his Ph.D. from Imperial College, London, in 1988. He
1085: is a senior scientist and associate director of the
1086: Mathematics and Computer Science Division at Argonne National
1087: Laboratory, and professor of computer science at the University of
1088: Chicago.  He has published four books and over 150 papers and
1089: technical reports.  He co-leads the Globus Project, which provides
1090: protocols and services used by industrial and academic distributed
1091: computing projects worldwide.  He co-founded the influential Global
1092: Grid Forum and co-edited the book ``The Grid: Blueprint for a New
1093: Computing Infrastructure.''
1094: 
1095: \end{document}
1096: