cs0206038/pnew.tex
1: 
2: 
3: \documentclass[jpdc]{apjrnl}
4: \usepackage{graphics}
5: %
6: \listfiles
7: 
8: 
9: %
10: \begin{document}
11: 
12: 
13: \title{A Multilevel Approach to Topology-Aware Collective Operations 
14:         in Computational Grids}
15: 
16: 
17: \author{Nicholas T. Karonis
18: %
19: \\Department of Computer Science
20: \\Northern Illinois University
21: \\DeKalb, IL~~60115
22: \\Argonne National Laboratory
23: \\Argonne, IL~~60439
24: \\Email: karonis@niu.edu
25: %
26: \and         %
27: Bronis de Supinski
28: %
29: \\Center for Applied Scientific Computing
30: \\Lawrence Livermore National Laboratory
31: \\Livermore, CA~~94551
32: \\Email: bronis@llnl.gov
33: %
34: \and         
35: Ian Foster
36: %
37: \\Argonne National Laboratory
38: \\Argonne, IL~~60439
39: \\The University of Chicago
40: \\Chicago, IL~~60637
41: \\Email: foster@mcs.anl.gov
42: %
43: \and         
44: William Gropp and Ewing Lusk
45: %
46: \\Mathematics and Computer Science Division
47: \\Argonne National Laboratory
48: \\Argonne, IL~~60439
49: \\Email: gropp@mcs.anl.gov, lusk@mcs.anl.gov
50: %
51: \and         
52: Sebastien Lacour
53: %
54: \\IRISA / INRIA Rennes
55: \\University of Beaulieu
56: \\35042 Rennes, France
57: \\Email: Sebastien.Lacour@irisa.fr
58: %
59: }
60: \date{April 2002}
61: 
62: 
63: \maketitle
64: 
65: 
66: Proposed running head: Multilevel Topology-Aware Collective Operations
67: 
68: 
69: 
70: \pagebreak
71: 
72: 
73: %
74: %
75: %
76: \begin{abstract}
77: The efficient implementation of collective communication operations has 
78: received much attention.  Initial efforts produced ``optimal'' trees based
79: on network communication models that assumed  
80: equal \mbox{point-to-point} latencies 
81: between any two processes.  This assumption is violated in
82: most practical settings, however, particularly in heterogeneous 
83: systems such as clusters of SMPs and wide-area ``computational Grids,''
84: with the result that collective operations 
85: perform suboptimally.  In response, more recent work has focused 
86: on creating {\em topology-aware} trees for collective operations that minimize
87: communication across slower channels (e.g., a wide-area network).  While these 
88: efforts have significant communication benefits, they all limit their view of 
89: the network to only two layers.  We present a strategy based upon a 
90: {\em multilayer} view of the network.  By creating {\em multilevel 
91: topology-aware} trees we take advantage of communication cost differences 
92: at every level in the network.  We used this strategy to implement 
93: topology-aware versions of several MPI collective operations 
94: in \hbox{MPICH-G2}, the Globus Toolkit$^{TM}$-enabled version of the 
95: popular MPICH implementation of the MPI standard.  
96: Using information about topology provided by \hbox{MPICH-G2},
97: we construct these multilevel 
98: topology-aware trees automatically during execution.
99: We present results demonstrating the advantages of our 
100: multilevel approach by comparing it to the default (topology-unaware)
101: implementation provided by MPICH and a topology-aware two-layer implementation.
102: \end{abstract}
103: 
104: 
105: \begin{keywords}
106: MPI, collective operations, \hbox{MPICH-G2}, grid computing, Globus Toolkit
107: \end{keywords}
108: 
109: 
110: \pagebreak
111: 
112: 
113: %
114: %
115: %
116: \section{Introduction}
117: 
118: 
119: The problem of building ``optimal'' communication trees for
120: collective operations
121: has received much attention in recent years.
122: The telephone model, which assumes that send and receive times are
123: equal and that messages are not packetized, implies that the optimal broadcast 
124: algorithm uses a binomial tree.
125: Under models that expand the telephone model to account for message latency,
126: such as the postal~\cite{postal} or LogP~\cite{logp} models, the communication 
127: topology of an optimal
128: broadcast algorithm becomes a generalized Fibonacci tree.
129: All of these approaches construct optimal trees for collective operations by first 
130: modeling the communication characteristics of a network with a set of 
131: parameters and then building the optimal trees based on parameter values and 
132: their model.
133: 
134: 
135: Underlying this work is
136: the assumption that the communication times between all process pairs
137: in the computation are equal.  While this is a reasonable approximation 
138: when the entire computation is performed on a single machine, 
139: it is not reasonable when the computation is executed on a cluster of 
140: symmetric multiprocessors (SMPs) in a local-area network, or worse, in 
141: a {\em computational Grid}~\cite{Globus,GridBook,globus-physics} environment, 
142: in which
143: multiple parallel computers are connected by
144: local-area, campus-area, or even wide-area networks.  Rapid
145: improvements in network performance have engendered considerable
146: interest in parallel computing
147: in the last context, as evidenced by experiments and
148: initiatives such as
149: the I-WAY~\cite{isoftcpe}, National Technology Grid~\cite{CACMgrid}, 
150: Information Power Grid~\cite{IPG99}, and TeraGrid~\cite{teragrid}.
151: 
152: %
153: 
154: 
155: 
156: Under these circumstances the trees produced by the conventional
157: models perform suboptimally.
158: In such heterogeneous environments, communication costs over different links
159: %
160: %
161: can differ by an order of magnitude or
162: more.  In these situations, {\em topology-aware} algorithms can 
163: dramatically
164: improveme the performance.  For example, in the case of N
165: processors distributed into two clusters, a traditional
166: reduction algorithm may generate \mbox{O(log N)} intercluster messages,
167: while a topology-aware algorithm generates only 1, for a
168: cost saving of a factor of \mbox{O(log N)} if intercluster message 
169: costs dominate.
170: 
171: 
172: Previous work~\cite{starT,magpie} has demonstrated that 
173: topology-aware 
174: collective operations can indeed reduce communication costs by reducing 
175: the amount of communication performed over slow channels.  However, this work
176: limited the depth of network stratification to only two levels: other 
177: processors are either near or far.  In~\cite{optcollops} we compared a
178: prototype of our multilevel approach to the topology-{\em un}aware binomial
179: tree algorithm distributed with MPICH and to MagPIe, one of the topology-aware
180: two-level techniques.  In that prototype we ``guessed'' which computers shared
181: a local network by inspecting their fully qualified domain names, and 
182: thereafter representing our multilevel clustering of processes with a
183: sequence of {\em hidden communicators} inside MPI communicators.
184: 
185: In this paper we present a much improved refinement of that prototype 
186: that allows collective operations to exploit knowledge concerning the 
187: structure of a multilevel network, in which the neighbors are processors
188: that are categorized according to their expected \mbox{point-to-point}
189: communication characteristics.  The identification of which processes
190: share a local network is now a simple matter of users providing values
191: for selected environment variables.  Additionally the use of hidden
192: communicators to represent the multilevel clustering has been replaced
193: by integer vectors.  The use of hidden communicators required us to
194: implement the collective operations as a {\em sequence of collective
195: operations}, for example, an \verb+MPI_Bcast+ was implemented as a 
196: sequence of \verb+MPI_Bcast+s sequencing over each of the hidden communicators
197: in turn, which typically resulted in the use of binomial trees at each level.  
198: By replacing the hidden communicators with integer vectors we
199: are now free to implement collective operations using \mbox{point-to-point}
200: operations over any tree we create.
201: 
202: 
203: To permit experimental studies, we have implemented our multilevel 
204: approach for five of the collective operations supported by the Message Passing
205: Interface (MPI) standard~\cite{mpi-forum:journal}: \verb+MPI_Bcast+, 
206: \verb+MPI_Reduce+, \verb+MPI_Barrier+, \verb+MPI_Gather+, and 
207: \verb+MPI_Scatter+.  We use \hbox{MPICH-G2}~\cite{mpichg2}, the successor to 
208: \hbox{MPICH-G}~\cite{mpi-nexus-pc}, which 
209: is based on the popular MPICH implementation~\cite{mpich} of the
210: MPI standard. \hbox{MPICH-G2} uses services provided 
211: by the Globus Toolkit$^{TM}$, or simply Globus, to support execution in 
212: heterogeneous and distributed 
213: environments.  This use of \hbox{MPICH-G2} enables experimentation within
214: realistic wide-area environments that would not otherwise be easily accessible.
215: 
216: 
217: In the sections that follow, we describe our multilevel topology approach.
218: Then, we present experimental results that illustrate the benefits of
219: our multilevel approach by comparing it with (1) the topology-{\em un}aware
220: implementation currently distributed with MPICH and (2) MagPIe~\cite{magpie},
221: one of the topology-aware two-level implementations of collective operations.
222: We briefly discuss other recent topology-aware and optimized collective 
223: operations efforts and conclude with a discussion of future work.
224: 
225: 
226: %
227: %
228: %
229: \section{Multilevel Topology-Aware Approach}
230: 
231: 
232: \begin{figure}
233: \begin{centering}
234: %
235: %
236: %
237: %
238: %
239: %
240: %
241: %
242: %
243: %
244: %
245: %
246: %
247: \includegraphics{grid.eps}
248: \caption{An example of a Grid computation involving 10 processes on one 
249:     IBM~SP at SDSC and another 10 processes distributed evenly across two 
250:     SGI~Origin2000s (O2K$_a$ and O2K$_b$) at NCSA.}\label{fig-mach}
251: \end{centering}
252: \end{figure}
253: 
254: 
255: Figure~\ref{fig-mach} depicts an MPI application
256: involving 20 processes distributed over three machines located at
257: the San Diego Supercomputer Center (SDSC) and the National Center
258: for Supercomputing Applications (NCSA).
259: We depict 10 processes on the IBM~SP at SDSC and 5 processes
260: on each of two Origin2000s, O2K$_a$ and O2K$_b$, at NCSA.  
261: The slowest communication is between sites, which uses TCP over a
262: wide-area network, with faster communication between the O2Ks at NCSA,
263: which uses TCP over their local-area network, and the fastest
264: communication, of course, within each machine.  
265: 
266: In the remainder of this section we describe a broadcast using first
267: the topology-unaware implementation currently distributed with MPICH, 
268: then a 2-level topology-aware approach, and finally our multilevel
269: topology-aware broadcast.
270: 
271: %
272: \subsection{A Topology-{\em Un}aware Broadcast}
273: 
274: \begin{figure}
275: \begin{centering}
276: \includegraphics{binomial.eps}
277: \caption{The binomial trees $B_0$ through $B_3$.}\label{fig-binomial}
278: \end{centering}
279: \end{figure}
280: 
281: Topology-unaware implementations of broadcast, including the one
282: distributed with MPICH, often make the simplifying 
283: assumption that the communication times between all process 
284: pairs in the computation are equal.  Under this assumption the broadcast
285: is often implemented by using a {\em binomial tree}.
286: 
287: A binomial tree $B_k$ is an ordered tree (i.e., children of each node
288: are ordered) of order $k \ge 0$ defined recursively.  As 
289: shown in Figure~\ref{fig-binomial}, the binomial tree $B_0$ consists
290: of a single node.  The binomial tree $B_k$ ($k > 0$) has a root
291: with $k$ children where the $i^{th}$ child ($0 < i \le k$) is the root 
292: of the binomial tree $B_{k-i}$.  Figure~\ref{fig-binomial} depicts the binomial 
293: trees $B_0$ through $B_3$.
294: 
295: When communication times between all process pairs in the computation are
296: equal and have relatively low latency, \hbox{Bar-Noy} and Kipnis 
297: show that implementing a broadcast with a binomial tree has the desirable 
298: property that all processes will complete the broadcast at approximately
299: the same time thus, achieving proper load balancing~\cite{postal}.
300: 
301: %
302: \subsection{A 2-Level Topology-Aware Broadcast}
303: \label{subsec-2lvl}
304: 
305: 
306: \begin{figure}
307: \begin{centering}
308: %
309: %
310: %
311: %
312: %
313: %
314: %
315: %
316: %
317: %
318: %
319: %
320: %
321: %
322: %
323: %
324: %
325: %
326: \includegraphics{magpie.eps}
327: \caption{An example of two 2-level topology-aware broadcast trees
328:     rooted at SDSC spanning 2 Origin2000s (O2K$_a$ and O2K$_b$) at NCSA 
329:     and an IBM~SP at SDSC: (a) clustering processes on {\em machine boundaries} 
330:     and (b) clustering on {\em site boundaries}.}\label{fig-2lvlbcast}
331: \end{centering}
332: \end{figure}
333: 
334: 
335: Existing 2-level topology-aware approaches~\cite{starT,magpie} cluster 
336: processes into groups.  The two natural choices for the machines
337: depicted in Figure~\ref{fig-mach} are to cluster the processes based
338: either on {\em machine boundaries}, creating three groups -- the IBM~SP, O2K$_a$, 
339: and O2K$_b$, or {\em site boundaries} creating two groups -- SDSC and NCSA.  
340: While both are reasonable choices and would improve
341: performance when compared with the topology-{\em un}aware binomial tree
342: distributed with MPICH,
343: both choices ignore the disparity in network performance between the
344: local- and wide-area networks.  Consider, for example,
345: a broadcast rooted at one of the processes at SDSC.  Figure~\ref{fig-2lvlbcast}a
346: depicts the broadcast tree of the 2-level approach when the processes
347: are clustered on machine boundaries.  The broadcast starts
348: with the SDSC root process sending messages to designated processes on each 
349: of the O2Ks at NCSA, resulting in two messages travelling across the wide-area 
350: network, and concludes with broadcasts within each machine.  By contrast, 
351: Figure~\ref{fig-2lvlbcast}b depicts the broadcast tree when the processes
352: are clustered on site boundaries.  In this case the root at SDSC
353: sends a single message across the wide-area network to a process on 
354: one of the two O2Ks at NCSA and concludes with a broadcast within the 
355: IBM~SP with another simultaneous broadcast across all the processes
356: at NCSA, which would typically require multiple messages to travel
357: across NCSA's local network.
358: 
359: 
360: %
361: \subsection{A Multilevel Topology-Aware Broadcast}
362: 
363: 
364: \begin{figure}
365: \begin{centering}
366: %
367: %
368: %
369: %
370: %
371: %
372: %
373: %
374: %
375: %
376: %
377: %
378: %
379: %
380: %
381: %
382: %
383: \includegraphics{g2.eps}
384: \caption{An example of a multilevel topology-aware broadcast tree rooted
385:     at SDSC spanning 2 Origin 2000s (O2K$_a$ and O2K$_b$) at NCSA and an 
386:     IBM~SP at SDSC.}\label{fig-mlvlbcast}
387: \end{centering}
388: \end{figure}
389: 
390: 
391: The multilevel topology-aware approach we present minimizes messaging 
392: across the slowest links {\em at each level} by clustering the processes 
393: at the wide-area level into site groups, and then within each site group, 
394: clustering processes at the local-area level into machine groups.
395: Using the same broadcast example from Section~\ref{subsec-2lvl}, we depict in
396: Figure~\ref{fig-mlvlbcast} the broadcast tree used by a multilevel approach.
397: Here the broadcast starts with the SDSC root process sending a single message 
398: across the wide-area network to one of the processes at NCSA, in 
399: Figure~\ref{fig-mlvlbcast} we depict a process on O2K$_a$.  The broadcast 
400: continues with the receiving process on O2K$_a$ sending a single message 
401: across NCSA's local network to a process on O2K$_b$ and the entire broadcast 
402: concludes with broadcasts within each machine.  This multilevel 
403: clustering minimizes messaging over the slower wide- and local-area 
404: networks.
405: 
406: 
407: %
408: %
409: %
410: \section{Multilevel Topology-aware Approach in \hbox{MPICH-G2}}
411: 
412: 
413: In this section we describe our implementation of multilevel topology-aware
414: collective operations in the Globus Toolkit-based \hbox{MPICH-G2}.  For illustrative purposes,
415: we discuss our implementation of \verb+MPI_Bcast+ in detail.
416: 
417: %
418: %
419: %
420: %
421: %
422: %
423: %
424: %
425: %
426: %
427: %
428: %
429: %
430: %
431: %
432: %
433: %
434: %
435: %
436: %
437: %
438: %
439: %
440: 
441: 
442: %
443: \subsection{RSL Specification of Topology}
444: 
445: 
446: MPICH-G2 uses the Globus Toolkit's Resource Specification Language~(RSL)~\cite{GRAM97} to
447: describe the resources required to run an application.  
448: Users write {\em RSL scripts}, which identify resources 
449: (e.g., computers) and specify requirements (e.g., number of CPUs, memory, 
450: execution time) and parameters (e.g., location of executables, command 
451: line arguments, environment variables) for each.  An RSL script can
452: be used as
453: the user interface to \verb+globusrun+, an upper-level Globus service
454: that first authenticates the user by using the Grid Security 
455: %
456: Infrastructure~(GSI)~\cite{GSI-journal} and then
457: schedules and monitors the job across the various machines by using
458: two other Globus Toolkit services: the Dynamically-Updated Request Online 
459: Coallocator~(DUROC)~\cite{CoAllocation99} and Grid Resource Allocation 
460: and Management~(GRAM)~\cite{GRAM97}.  RSL is designed to be an
461: easy-to-use language to describe multiresource multisite jobs while
462: hiding all the site-specific details associated with requesting
463: such resources.  
464: 
465: 
466: \begin{figure}
467: \begin{footnotesize}
468: \begin{verbatim}
469: +
470: ( &(resourceManagerContact="sp.npaci.edu") 
471:    (count=10)
472:    (jobtype=mpi)
473:    (label="subjob 0")
474:    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
475:    (directory=/homes/users/smith)
476:    (executable=/homes/users/smith/myapp)
477: )
478: ( &(resourceManagerContact="o2ka.ncsa.uiuc.edu") 
479:    (count=5)
480:    (jobtype=mpi)
481:    (label="subjob 1")
482:    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1))
483:    (directory=/users/smith)
484:    (executable=/users/smith/myapp)
485: )
486: ( &(resourceManagerContact="o2kb.ncsa.uiuc.edu") 
487:    (count=5)
488:    (jobtype=mpi)
489:    (label="subjob 2")
490:    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 2))
491:    (directory=/users/smith)
492:    (executable=/users/smith/myapp)
493: )
494: \end{verbatim}
495: \end{footnotesize}
496: \caption{\strut{An RSL script for an \hbox{MPICH-G2} application running
497:     on three machines that facilitates {\em 2-level process 
498:     clustering}.}\label{fig-rsl2lvl}}
499: \end{figure}
500: 
501: 
502: %
503: %
504: %
505: %
506: %
507: %
508: %
509: %
510: %
511: %
512: %
513: %
514: %
515: %
516: %
517: %
518: %
519: %
520: %
521: %
522: %
523: %
524: %
525: %
526: %
527: %
528: %
529: %
530: %
531: %
532: %
533: 
534: 
535: \begin{figure}
536: \begin{footnotesize}
537: \begin{verbatim}
538: +
539: ( &(resourceManagerContact="sp.npaci.edu") 
540:    (count=10)
541:    (jobtype=mpi)
542:    (label="subjob 0")
543:    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
544:    (directory=/homes/users/smith)
545:    (executable=/homes/users/smith/myapp)
546: )
547: ( &(resourceManagerContact="o2ka.ncsa.uiuc.edu") 
548:    (count=5)
549:    (jobtype=mpi)
550:    (label="subjob 1")
551:    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
552:        (GLOBUS_LAN_ID NCSAlan))
553:    (directory=/users/smith)
554:    (executable=/users/smith/myapp)
555: )
556: ( &(resourceManagerContact="o2kb.ncsa.uiuc.edu") 
557:    (count=5)
558:    (jobtype=mpi)
559:    (label="subjob 2")
560:    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 2)
561:        (GLOBUS_LAN_ID NCSAlan))
562:    (directory=/users/smith)
563:    (executable=/users/smith/myapp)
564: )
565: \end{verbatim}
566: \end{footnotesize}
567: \caption{\strut{An RSL script for an \hbox{MPICH-G2} application running
568:     on three machines that facilitates {\em multilevel process 
569:     clustering}.}\label{fig-rslMlvl}}
570: \end{figure}
571: 
572: 
573: Figure~\ref{fig-rsl2lvl} depicts an RSL script for an \hbox{MPICH-G2}
574: application intended to run on the computational Grid depicted in
575: Figure~\ref{fig-mach}.  It depicts a job as a set of three {\em subjobs}, 
576: where each subjob is associated with a particular resource, in our example, 
577: a computer.  Subjobs define a natural machine-boundary partitioning of the
578: processes in \verb+MPI_COMM_WORLD+ and are sufficient for a 2-level machine
579: boundary clustering of the processes.  To achieve a multilevel clustering,
580: the user must identify those machines that are on the same local network
581: by specifying a value for an \hbox{MPICH-G2}-defined environment variable 
582: \verb+GLOBUS_LAN_ID+, as depicted in the RSL script in 
583: Figure~\ref{fig-rslMlvl}.  Specifying the same value (\verb+NCSAlan+) 
584: in the second and third subjobs instructs \hbox{MPICH-G2} to cluster these 
585: two machines into the same local-area network group.  This same technique 
586: can be used to cluster many subjobs in the same local-area network group 
587: while simultaneously creating multiple local-area network groups through 
588: the assignment of multiple yet unique \verb+GLOBUS_LAN_ID+ values.  This 
589: simple specification (the only difference between Figures~\ref{fig-rsl2lvl} 
590: and~~\ref{fig-rslMlvl}) is all that is required to create multilevel 
591: topology-aware clustering of the processes.
592: 
593: 
594: %
595: %
596: %
597: %
598: %
599: %
600: %
601: %
602: %
603: %
604: %
605: %
606: %
607: %
608: %
609: 
610: 
611: The multilevel clustering information specified in RSL (i.e., processes
612: gathered first into machine groups and then local network groups composed
613: of machine groups) creates a multilevel grouping of the processes in 
614: \verb+MPI_COMM_WORLD+ and is distributed to all the processes during 
615: \hbox{MPICH-G2} bootstrapping to be stored within \verb+MPI_COMM_WORLD+ on
616: each process.  When new communicators are created (e.g., via \verb+MPI_Comm_split+),
617: \hbox{MPICH-G2} propagates the relevant multilevel clustering information to 
618: the newly created communicator so that {\em all communicators} in 
619: \hbox{MPICH-G2} have the multilevel clustering information pertaining to their 
620: process groups.  As an interesting side effect we have made this multilevel 
621: topology information available to MPI applications through existing
622: MPI communicator caching idioms. See~\cite{mpichg2} for a full 
623: description of \hbox{MPICH-G2}'s topology discovery mechanism.
624: %
625: 
626: 
627: %
628: \subsection{MPICH-G2's Multilevel Topology-Aware Broadcast}
629: 
630: 
631: A multilevel topology-aware clustering of processes is not sufficient
632: in itself to allow the construction of a broadcast tree such as that depicted in 
633: Figure~\ref{fig-mlvlbcast}: \hbox{MPICH-G2} also needs to know which process is 
634: the root of the broadcast.  Construction of the multilevel 
635: topology-aware tree is therefore deferred until the application calls a 
636: collective operation.  At that time each process simultaneously and 
637: independently (i.e., without communication) construct an identical tree based 
638: on the multilevel process grouping found in the communicator and the parameters 
639: passed (e.g., identifying the root process of a broadcast) to the collective 
640: operation.
641: 
642: 
643: One benefit of using a multilevel topology-aware tree to implement 
644: a collective operation is that we are free to select different subtree 
645: topologies at each level.  For example, a multilevel broadcast tree can start
646: with a broadcast from the root to selected processes at each site across a 
647: wide-area network, followed by broadcasts at each site to selected processes 
648: on each machine across the local networks, and concluding with broadcasts 
649: within each machine.  We have the freedom to use different broadcast 
650: topologies at each stage in the sequence.  Bar-Noy and Kipnis show
651: that in high-latency networks (e.g., a wide-area network) 
652: the optimal broadcast topology is a flat tree in which the root sends the 
653: data to all other processes directly, while in a low-latency network 
654: (e.g., intramachine messaging), the optimal broadcast topology is a binomial 
655: tree~\cite{postal}.  We take advantage of these findings and the flexibility 
656: of our multilevel approach in our implementation of \verb+MPI_Bcast+ by using
657: a flat broadcast tree at the initial wide-area level and binomial trees at 
658: the local-area and intramachine levels.
659: 
660: 
661: In the next section we present results demonstrating the advantages of our
662: multilevel approach by comparing it with the default (topology-{\em un}aware)
663: implementation provided by MPICH and a topology-aware two-layer implementation.
664: 
665: 
666: %
667: %
668: %
669: \section{Experimental Results}
670: \label{sec-exp}
671: 
672: 
673: %
674: %
675: 
676: 
677: To demonstrate the advantages of our multilevel approach, we examine
678: its effects on \verb+MPI_Bcast+.  
679: The MPICH implementation of \verb+MPI_Bcast+ is based on binomial trees;
680: hence, in a distributed heterogeneous environment like a computational
681: Grid its performance is acutely sensitive to the distribution 
682: of the processes and the root of the broadcast.  
683: For example, in an application using $P=2^k$ processes distributed evenly 
684: across $C=2^i, 0 \le i \le k$ clusters, a broadcast implemented using
685: a binomial tree propagates the message down its longest path 
686: using at least $log_2C$ intercluster messages and $log_2\frac{P}{C}$
687: intracluster messages.
688: In contrast, 
689: under certain intercluster network performance
690: conditions described by \mbox{Bar-Noy} and Kipnis in their postal model,
691: our multilevel method could be used to send 
692: 1 intercluster message and $log_2\frac{P}{C}$ intracluster messages.
693: Assuming an intercluster latency $l_s$ sec and bandwidth $b_s$ Kb/sec; 
694: and an intracluster latency $l_f$ sec and bandwidth $b_f$ Kb/sec, 
695: broadcasting a message of N Kb using the 
696: binomial tree conservatively takes 
697: \mbox{$O((logC)(l_s+\frac{N}{b_s}) + (log\frac{P}{C})(l_f+\frac{N}{b_f}))$},
698: whereas
699: broadcasting the same message using our multilevel method takes only 
700: \mbox{$O((l_s+\frac{N}{b_s}) + (log\frac{P}{C})(l_f+\frac{N}{b_f}))$}.
701: 
702: 
703: \begin{figure}
704: \begin{small}
705: \begin{verbatim}
706: For (each message size M)
707:   MPI_Barrier(MPI_COMM_WORLD)
708:   if (MPI_COMM_WORLD rank == 0)
709:     t0 = get_time()
710:   For (r = 0; r < Nprocs; r ++)
711:     MPI_Bcast(root=r to MPI_COMM_WORLD message size M)
712:     ack_barrier()
713:   if (MPI_COMM_WORLD rank == 0)
714:     t1 = get_time()
715:     report message size M, time t1-t0
716: \end{verbatim}
717: \end{small}
718: \caption{\strut{The broadcast timing application.}\label{fig-tim-bcast}}
719: \end{figure}
720: 
721: 
722: We wrote a small MPI application (depicted in Figure~\ref{fig-tim-bcast}) 
723: that times the broadcasts of messages of increasing size.  
724: To represent a broadcast with an arbitrary
725: root, we timed how long it would take to broadcast each message
726: of size M as each process in \verb+MPI_COMM_WORLD+ took its turn
727: as the root.  Also, in order to eliminate any potential pipelining
728: that might occur between consecutive broadcasts, we inserted a barrier
729: (\verb+ack_barrier()+) after each broadcast in which all processes
730: other than rank 0 \verb+MPI_Send+ an ACK message to process 0 and then
731: wait to \verb+MPI_Recv+ a GO message.  Process 0, after \verb+MPI_Recv+'ing
732: the ACK message from all the other processes, \verb+MPI_Send+'s a GO
733: message to each of the other processes, one at a time.  We chose to write our
734: own barrier rather than calling \verb+MPI_Barrier+ because 
735: we have reimplemented \verb+MPI_Barrier+ to reflect multilevel topology and
736: we wished these tests to reflect the differences only in the broadcast 
737: implementations.
738: 
739: 
740: %
741: %
742: %
743: %
744: %
745: %
746: %
747: %
748: %
749: %
750: %
751: 
752: We conducted experiments running the MPI
753: application depicted in Figure~\ref{fig-tim-bcast} on 
754: three computers: the IBM~SP at the San Diego Supercomputer Center
755: (SDSC-SP) and the IBM~SP (ANL-SP) and SGI~Origin200 (ANL-O2K) at
756: Argonne National Laboratory.
757: We compare
758: our multilevel topology approach to the binomial tree provided by MPICH
759: and include comparisons to the 2-level approach provided by MagPIe.  
760: We ran the application four times, each time using 
761: 16 processes on each of the three computers.  These results are
762: depicted in Figure~\ref{fig-bcast-magpie}.  
763: The curves labeled ``MagPIe-machine'' and ``MagPIe-site'' represent
764: two runs using MagPIe version 2.0.1, each time with 
765: a different cluster definition.  
766: In our first MagPIe 
767: run (``MagPIe-machine'') we defined three clusters, one for each computer, 
768: of 16 processes each.  In our second MagPIe run (``MagPIe-site'') we defined 
769: two clusters: an ANL cluster comprising the two ANL machines having 32 
770: processes and an SDSC cluster comprising the SDSC-SP having only 16 processes.
771: 
772: %
773: %
774: %
775: \begin{figure}
776: \begin{centering}
777: \includegraphics{timing_bcast.16anlsp16sdscsp16sdscsp.magpie.eps}
778: \caption{Original MPICH broadcast~vs.~topology-aware MPICH broadcast~vs.~MagPIe
779:     broadcast running 16 processes 
780:     on the IBM~SP at SDSC and 16 processes on each the IBM~SP and 
781:     SGI~Origin2000 at ANL.}\label{fig-bcast-magpie}
782: \end{centering}
783: \end{figure}
784: 
785: Figure~\ref{fig-bcast-magpie} 
786: shows there are significant benefits to the multilevel approach when
787: compared with a simple binomial tree and even when compared with a 2-level
788: approach as implemented by MagPIe.
789: A multilevel view of the network allows an application to avoid 
790: slower channels {\em at each level}.  In our experiments, the broadcast is 
791: optimized by sending one message across the wide-area network, then one 
792: message across the local-area network, and then many messages
793: within each computer.
794: 
795: 
796: %
797: %
798: %
799: \section{Related Work}
800: 
801: 
802: Previous efforts have focused on creating ``optimal'' trees for collective 
803: operations where \mbox{point-to-point} communications are not necessarily 
804: equal between any two processes.  Husbands and Hoe present 
805: MPI-StarT~\cite{starT}, an MPI implementation for a cluster of SMPs 
806: interconnected by a high-performance interconnect.  They report significant 
807: improvements after modifying the MPICH broadcast algorithm, which uses 
808: binomial trees.  Their modifications use information that describes their 
809: cluster topology by minimizing intercluster communication during collective 
810: operations. MagPIe~\cite{magpie} is another MPI system designed to construct 
811: collective operation trees in heterogeneous communication environments.  
812: MagPIe recognizes a two-layer communication network that distinguishes between 
813: local- and wide-area communication.  By minimizing wide-area communication, 
814: much in the same way MPI-StarT minimizes intercluster communication, MagPIe 
815: has seen significant improvements in all the MPI collective operations. 
816: 
817: 
818: Both efforts have produced impressive results and clearly demonstrate 
819: that there are significant advantages to implementing collective operations in
820: a topology-aware manner.  However, both limit
821: their view of the network to only two layers; MPI-StarT distinguishes
822: between intra- and intercluster communication within the same
823: local-area, and MagPIe distinguishes between local- and wide-area communication.
824: There are opportunities for further optimization by 
825: using trees that stratify the network deeper than two layers.
826: 
827: 
828: In~\cite{pipelining} van de Geijn et al. show the advantages of
829: implementing collective operations by segmenting and pipelining
830: messages when communicating over relatively slower channels (e.g., TCP
831: over local- and wide-area networks).  
832: 
833: 
834: 
835: 
836: In~\cite{magpie-PC} Kielman et al. extend MagPIe by incorporating
837: van de Geijn's pipelining idea through a technique they call Parameterized
838: LogP (PLogP), which is an extension of the LogP model presented 
839: by Culler et al~\cite{logp}.
840: In this extension, MagPIe still recognizes only a two-layer communication
841: network, but through parameterized studies of the network, the researchers determine
842: ``optimal'' packet sizes.  This technique works well for applications
843: that always run on the same computational grid having relatively 
844: stable performance, but requires retuning when moving the application
845: from one computing environment or network to another.
846: 
847: 
848: %
849: %
850: %
851: \section{Future Work}
852: 
853: 
854: We have implemented five of the MPI collective operations in a 
855: topology-aware multilevel manner in \hbox{MPICH-G2}.  Encouraged by our
856: initial results, we plan to upgrade \hbox{MPICH-G2's} remaining MPI collective 
857: operations in a similar manner.
858: 
859: 
860: %
861: %
862: %
863: %
864: %
865: %
866: %
867: %
868: %
869: %
870: %
871: %
872: %
873: %
874: %
875: %
876: %
877: %
878: %
879: %
880: %
881: %
882: %
883: %
884: %
885: 
886: 
887: Our general strategy implements a collective operation by first stratifying 
888: the network into multiple levels and then minimizing the communication
889: across the slowest channels.  In doing so, however, we may encounter
890: a tree that has multiple siblings at a particular level, for example,
891: many sites connected across the wide-area network or many machines
892: at a particular site.  When this situation happens, we implement the collective
893: operation at that level using a binomial tree at all but the wide-area
894: network level.  Unfortunately, a binomial tree is not always the best choice.
895: \mbox{Bar-Noy} and Kipnis show that the shape of a collective 
896: operation tree depends heavily on the \mbox{point-to-point} communication 
897: characteristics of the send/receive primitives on which it is implemented.  
898: Their model incorporates a latency parameter $\lambda \ge 1$.  They show
899: that for low latencies, (for example, communication within a single machine), 
900: the optimal broadcast tree is a binomial tree, but for higher latencies,
901: (for example, communication across a wide-area network), the optimal broadcast
902: tree becomes flatter.  We will investigate ways to select better, if
903: not optimal, collective operation trees by choosing those that respect
904: the different communication characteristics at each level of our multilevel
905: view.
906: 
907: 
908: The pipelining techniques presented by van de Geijn et al. can be used at 
909: each of the levels in \hbox{MPICH-G2's} multilevel topology-aware collective 
910: operations.  Using techniques similar to Kielman's PLogP method, we will 
911: develop methods to determine the appropriate packet sizes with respect to 
912: network performance at {\em each level} of our multilevel view. 
913: 
914: 
915: %
916: %
917: %
918: \section{Summary}
919: 
920: 
921: As Grid computations become increasingly prevalent, the need for 
922: topology-aware collective operations also increases.  We have a version 
923: of \hbox{MPICH-G2} that implements five collective operations in a multilevel 
924: topology-aware manner.  We have shown, at least for \verb+MPI_Bcast+, that when 
925: compared with the binomial tree provided by MPICH and the 2-level approach 
926: provided by MagPIe there are significant advantages to executing 
927: collective operations using a multilevel view of the network.
928: Through a simple process of identifying machines that are common
929: to a local-area network, we have provided a means by which an MPI
930: application may take advantage of the multilevel topology-aware algorithms
931: without requiring code modifications or special functions.
932: 
933: 
934: %
935: %
936: %
937: \begin{acknowledge}
938: 
939: 
940: We thank the San Diego Supercomputer Center and the National Center
941: for Supercomputing Applications for providing access to their
942: machines.  We also thank the members of the Globus development
943: team for their support, patience, and many ideas. This work was
944: supported in part by the Mathematical, Information, and Computational
945: Sciences Division subprogram of the Office of Advanced Scientific
946: Computing Research, U.S. Department of Energy, under Contract
947: W-31-109-Eng-38; by the U.S. Department of Energy under
948: Cooperative Agreement No. DE-FC02-99ER25398; 
949: by the National Science Foundation; by DARPA; and by
950: the NASA Information Power Grid program.
951: 
952: \end{acknowledge}
953: 
954: 
955: %
956: \bibliographystyle{plain}
957: \bibliography{foster_bibliography,mpi-allbib,mpi-book,globus,prop}
958: 
959: 
960: %
961: 
962: Nicholas T. Karonis received a B.S. in finance and a B.S. in computer
963: science from Northern Illinois University in 1985, an M.S. in computer
964: science from Northern Illinois University in 1987, and a Ph.D. in
965: computer science from Syracuse University in 1992.  He spent
966: summers from 1981 to 1991 as a student at Argonne National Laboratory
967: where he worked on the p4 message-passing library, automated reasoning,
968: and genetic sequence allignment.  From 1991 to 1995 he worked on the control
969: system at Argonne's Advanced Photon Source and from 1995 to 1996
970: for the Computing Division at Fermi National Accelerator Laboratory.
971: Since 1996 he has been an assistant professor of computer science at
972: Northern Illinois University and a resident associate guest of Argonne's
973: Mathematics and Computer Science Division where he has been a member
974: of the Globus Project.  His current research interest is message-passing 
975: systems in computational Grids.
976: 
977: Bronis R. de Supinski is a computer scientist in the Center
978: for Applied Scientific Computing at Lawrence Livermore
979: National Laboratory. His research interests include
980: message passing implementations and tools, memory performance
981: improvement, cache coherence and distributed shared memory,
982: consistency semantics and performance evaluation modeling
983: and tools. Bronis earned his Ph.D. in computer science
984: from the University of Virginia in 1998. He is a
985: member of the ACM and the IEEE Computer Society.
986: 
987: 
988: Ian Foster received his B.Sc. (Hons I) at the University of Canterbury 
989: in 1979 and his Ph.D. from Imperial College, London, in 1998. He
990: is senior scientist and associate director of the
991: Mathematics and Computer Science Division at Argonne National
992: Laboratory, and professor of computer science at the University of
993: Chicago.  He has published four books and over 150 papers and
994: technical reports.  He co-leads the Globus Project, which provides
995: protocols and services used by industrial and academic distributed
996: computing projects worldwide.  He co-founded the influential Global
997: Grid Forum and co-edited the book ``The Grid: Blueprint for a New
998: Computing Infrastructure.''
999: 
1000: 
1001: William Gropp received his B.S. in mathematics from Case Western Reserve 
1002: University in 1977, a an M.S. in physics from the University of Washington in 
1003: 1978, and a Ph.D. in computer science from Stanford in 1982. He held the 
1004: positions of assistant (1982-1988) and associate (1988-1990) professor in 
1005: the Computer Science Department at Yale University. In 1990, he joined the 
1006: numerical analysis group at Argonne, where he is a senior computer 
1007: scientist and associate director of the Mathematics and Computer Science 
1008: Division, a senior scientist in the Department of Computer Science at the 
1009: University of Chicago, and a Senior Fellow in the Argonne-University of Chicago 
1010: Computation Institute. His research interests are in parallel computing, 
1011: software for scientific computing, and numerical methods for partial 
1012: differential equations. He has played a major role in the development of 
1013: the MPI message-passing standard. He is co-author of MPICH, the most widely used 
1014: implementation of MPI, and was involved in the MPI Forum as a 
1015: chapter author for both MPI-1 and MPI-2. He has written many books and 
1016: papers on MPI including "Using MPI" and "Using MPI-2". He is also one of 
1017: the designers of the PETSc parallel numerical library, and has developed 
1018: efficient and scalable parallel algorithms for the solution of linear and 
1019: nonlinear equations.
1020: 
1021: 
1022: Ewing Lusk received his B.A. in mathematics from the University of Notre Dame
1023: in 1965 and his Ph.D. in mathematics from the University of Maryland in 1970.
1024: He is currently a senior computer scientist in the Mathematics and Computer
1025: Science Division at Argonne National Laboratory.  His current projects include
1026: implementation of the MPI message-passing standard, research into programming
1027: models for parallel architectures, and parallel performance analysis tools.
1028: He is a leading member of the team responsible for MPICH implementation of the
1029: MPI message-passing interface standard.  He is the author of five books and
1030: more than seventy-five research articles in mathematics, automated deduction,
1031: and parallel computing.
1032: 
1033: 
1034: Sebastien Lacour graduated in physics in 1999 at the Ecole
1035: Normale Superieure of Lyon, France. He received his master's
1036: degree in computer science in 2002 at IFSIC, University of
1037: Rennes, France.  He is currently a Ph.D. student at IRISA/INRIA in
1038: Rennes.  His research interests include networks, compilation, and
1039: parallel and distributed systems.  His current work focuses on
1040: distributed shared-memory systems over large-scale, hierarchical
1041: architectures (multicluster platforms).
1042: 
1043: 
1044: \end{document}
1045: