1:
2:
3: \documentclass[jpdc]{apjrnl}
4: \usepackage{graphics}
5: %
6: \listfiles
7:
8:
9: %
10: \begin{document}
11:
12:
13: \title{A Multilevel Approach to Topology-Aware Collective Operations
14: in Computational Grids}
15:
16:
17: \author{Nicholas T. Karonis
18: %
19: \\Department of Computer Science
20: \\Northern Illinois University
21: \\DeKalb, IL~~60115
22: \\Argonne National Laboratory
23: \\Argonne, IL~~60439
24: \\Email: karonis@niu.edu
25: %
26: \and %
27: Bronis de Supinski
28: %
29: \\Center for Applied Scientific Computing
30: \\Lawrence Livermore National Laboratory
31: \\Livermore, CA~~94551
32: \\Email: bronis@llnl.gov
33: %
34: \and
35: Ian Foster
36: %
37: \\Argonne National Laboratory
38: \\Argonne, IL~~60439
39: \\The University of Chicago
40: \\Chicago, IL~~60637
41: \\Email: foster@mcs.anl.gov
42: %
43: \and
44: William Gropp and Ewing Lusk
45: %
46: \\Mathematics and Computer Science Division
47: \\Argonne National Laboratory
48: \\Argonne, IL~~60439
49: \\Email: gropp@mcs.anl.gov, lusk@mcs.anl.gov
50: %
51: \and
52: Sebastien Lacour
53: %
54: \\IRISA / INRIA Rennes
55: \\University of Beaulieu
56: \\35042 Rennes, France
57: \\Email: Sebastien.Lacour@irisa.fr
58: %
59: }
60: \date{April 2002}
61:
62:
63: \maketitle
64:
65:
66: Proposed running head: Multilevel Topology-Aware Collective Operations
67:
68:
69:
70: \pagebreak
71:
72:
73: %
74: %
75: %
76: \begin{abstract}
77: The efficient implementation of collective communication operations has
78: received much attention. Initial efforts produced ``optimal'' trees based
79: on network communication models that assumed
80: equal \mbox{point-to-point} latencies
81: between any two processes. This assumption is violated in
82: most practical settings, however, particularly in heterogeneous
83: systems such as clusters of SMPs and wide-area ``computational Grids,''
84: with the result that collective operations
85: perform suboptimally. In response, more recent work has focused
86: on creating {\em topology-aware} trees for collective operations that minimize
87: communication across slower channels (e.g., a wide-area network). While these
88: efforts have significant communication benefits, they all limit their view of
89: the network to only two layers. We present a strategy based upon a
90: {\em multilayer} view of the network. By creating {\em multilevel
91: topology-aware} trees we take advantage of communication cost differences
92: at every level in the network. We used this strategy to implement
93: topology-aware versions of several MPI collective operations
94: in \hbox{MPICH-G2}, the Globus Toolkit$^{TM}$-enabled version of the
95: popular MPICH implementation of the MPI standard.
96: Using information about topology provided by \hbox{MPICH-G2},
97: we construct these multilevel
98: topology-aware trees automatically during execution.
99: We present results demonstrating the advantages of our
100: multilevel approach by comparing it to the default (topology-unaware)
101: implementation provided by MPICH and a topology-aware two-layer implementation.
102: \end{abstract}
103:
104:
105: \begin{keywords}
106: MPI, collective operations, \hbox{MPICH-G2}, grid computing, Globus Toolkit
107: \end{keywords}
108:
109:
110: \pagebreak
111:
112:
113: %
114: %
115: %
116: \section{Introduction}
117:
118:
119: The problem of building ``optimal'' communication trees for
120: collective operations
121: has received much attention in recent years.
122: The telephone model, which assumes that send and receive times are
123: equal and that messages are not packetized, implies that the optimal broadcast
124: algorithm uses a binomial tree.
125: Under models that expand the telephone model to account for message latency,
126: such as the postal~\cite{postal} or LogP~\cite{logp} models, the communication
127: topology of an optimal
128: broadcast algorithm becomes a generalized Fibonacci tree.
129: All of these approaches construct optimal trees for collective operations by first
130: modeling the communication characteristics of a network with a set of
131: parameters and then building the optimal trees based on parameter values and
132: their model.
133:
134:
135: Underlying this work is
136: the assumption that the communication times between all process pairs
137: in the computation are equal. While this is a reasonable approximation
138: when the entire computation is performed on a single machine,
139: it is not reasonable when the computation is executed on a cluster of
140: symmetric multiprocessors (SMPs) in a local-area network, or worse, in
141: a {\em computational Grid}~\cite{Globus,GridBook,globus-physics} environment,
142: in which
143: multiple parallel computers are connected by
144: local-area, campus-area, or even wide-area networks. Rapid
145: improvements in network performance have engendered considerable
146: interest in parallel computing
147: in the last context, as evidenced by experiments and
148: initiatives such as
149: the I-WAY~\cite{isoftcpe}, National Technology Grid~\cite{CACMgrid},
150: Information Power Grid~\cite{IPG99}, and TeraGrid~\cite{teragrid}.
151:
152: %
153:
154:
155:
156: Under these circumstances the trees produced by the conventional
157: models perform suboptimally.
158: In such heterogeneous environments, communication costs over different links
159: %
160: %
161: can differ by an order of magnitude or
162: more. In these situations, {\em topology-aware} algorithms can
163: dramatically
improve performance. For example, in the case of N
165: processors distributed into two clusters, a traditional
166: reduction algorithm may generate \mbox{O(log N)} intercluster messages,
167: while a topology-aware algorithm generates only 1, for a
168: cost saving of a factor of \mbox{O(log N)} if intercluster message
169: costs dominate.
170:
171:
172: Previous work~\cite{starT,magpie} has demonstrated that
173: topology-aware
174: collective operations can indeed reduce communication costs by reducing
175: the amount of communication performed over slow channels. However, this work
176: limited the depth of network stratification to only two levels: other
177: processors are either near or far. In~\cite{optcollops} we compared a
178: prototype of our multilevel approach to the topology-{\em un}aware binomial
179: tree algorithm distributed with MPICH and to MagPIe, one of the topology-aware
two-level techniques. In that prototype we ``guessed'' which computers shared
a local network by inspecting their fully qualified domain names and
then represented our multilevel clustering of processes with a
sequence of {\em hidden communicators} inside MPI communicators.
184:
In this paper we present a substantially refined version of that prototype
that allows collective operations to exploit knowledge of the
structure of a multilevel network, in which processors
are categorized according to their expected \mbox{point-to-point}
communication characteristics. The identification of which processes
share a local network is now a simple matter of users providing values
for selected environment variables. Additionally, the use of hidden
communicators to represent the multilevel clustering has been replaced
by integer vectors. The use of hidden communicators required us to
implement each collective operation as a {\em sequence of collective
operations}; for example, an \verb+MPI_Bcast+ was implemented as a
sequence of \verb+MPI_Bcast+s, one over each of the hidden communicators
in turn, which typically resulted in the use of binomial trees at each level.
198: By replacing the hidden communicators with integer vectors we
199: are now free to implement collective operations using \mbox{point-to-point}
200: operations over any tree we create.
201:
202:
203: To permit experimental studies, we have implemented our multilevel
204: approach for five of the collective operations supported by the Message Passing
205: Interface (MPI) standard~\cite{mpi-forum:journal}: \verb+MPI_Bcast+,
206: \verb+MPI_Reduce+, \verb+MPI_Barrier+, \verb+MPI_Gather+, and
207: \verb+MPI_Scatter+. We use \hbox{MPICH-G2}~\cite{mpichg2}, the successor to
208: \hbox{MPICH-G}~\cite{mpi-nexus-pc}, which
209: is based on the popular MPICH implementation~\cite{mpich} of the
210: MPI standard. \hbox{MPICH-G2} uses services provided
211: by the Globus Toolkit$^{TM}$, or simply Globus, to support execution in
212: heterogeneous and distributed
213: environments. This use of \hbox{MPICH-G2} enables experimentation within
214: realistic wide-area environments that would not otherwise be easily accessible.
215:
216:
217: In the sections that follow, we describe our multilevel topology approach.
218: Then, we present experimental results that illustrate the benefits of
219: our multilevel approach by comparing it with (1) the topology-{\em un}aware
220: implementation currently distributed with MPICH and (2) MagPIe~\cite{magpie},
221: one of the topology-aware two-level implementations of collective operations.
222: We briefly discuss other recent topology-aware and optimized collective
223: operations efforts and conclude with a discussion of future work.
224:
225:
226: %
227: %
228: %
229: \section{Multilevel Topology-Aware Approach}
230:
231:
232: \begin{figure}
233: \begin{centering}
247: \includegraphics{grid.eps}
248: \caption{An example of a Grid computation involving 10 processes on one
249: IBM~SP at SDSC and another 10 processes distributed evenly across two
250: SGI~Origin2000s (O2K$_a$ and O2K$_b$) at NCSA.}\label{fig-mach}
251: \end{centering}
252: \end{figure}
253:
254:
255: Figure~\ref{fig-mach} depicts an MPI application
256: involving 20 processes distributed over three machines located at
257: the San Diego Supercomputer Center (SDSC) and the National Center
258: for Supercomputing Applications (NCSA).
259: We depict 10 processes on the IBM~SP at SDSC and 5 processes
260: on each of two Origin2000s, O2K$_a$ and O2K$_b$, at NCSA.
261: The slowest communication is between sites, which uses TCP over a
262: wide-area network, with faster communication between the O2Ks at NCSA,
263: which uses TCP over their local-area network, and the fastest
264: communication, of course, within each machine.
265:
266: In the remainder of this section we describe a broadcast using first
267: the topology-unaware implementation currently distributed with MPICH,
268: then a 2-level topology-aware approach, and finally our multilevel
269: topology-aware broadcast.
270:
271: %
272: \subsection{A Topology-{\em Un}aware Broadcast}
273:
274: \begin{figure}
275: \begin{centering}
276: \includegraphics{binomial.eps}
277: \caption{The binomial trees $B_0$ through $B_3$.}\label{fig-binomial}
278: \end{centering}
279: \end{figure}
280:
281: Topology-unaware implementations of broadcast, including the one
282: distributed with MPICH, often make the simplifying
283: assumption that the communication times between all process
284: pairs in the computation are equal. Under this assumption the broadcast
285: is often implemented by using a {\em binomial tree}.
286:
287: A binomial tree $B_k$ is an ordered tree (i.e., children of each node
288: are ordered) of order $k \ge 0$ defined recursively. As
289: shown in Figure~\ref{fig-binomial}, the binomial tree $B_0$ consists
290: of a single node. The binomial tree $B_k$ ($k > 0$) has a root
291: with $k$ children where the $i^{th}$ child ($0 < i \le k$) is the root
292: of the binomial tree $B_{k-i}$. Figure~\ref{fig-binomial} depicts the binomial
293: trees $B_0$ through $B_3$.
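
The recursive definition translates directly into code. The following
Python sketch builds $B_k$ as a map from each node to its ordered list of
children. The preorder node labelling and the names \verb+binomial_tree+
and \verb+depth+ are illustrative assumptions; in particular, the labels
do not correspond to the rank assignment used by MPICH.

\begin{small}
\begin{verbatim}
def binomial_tree(k, root=0):
    """Return {node: ordered list of children} for the binomial tree B_k.

    Nodes are labelled in preorder starting at `root` (an illustrative
    choice, not the MPICH rank mapping).
    """
    children = {root: []}
    next_label = root + 1
    for i in range(1, k + 1):
        # The i-th child is the root of B_{k-i}, which has 2**(k-i) nodes.
        children[root].append(next_label)
        children.update(binomial_tree(k - i, next_label))
        next_label += 2 ** (k - i)
    return children


def depth(children, node=0):
    """Length of the longest path from `node` down to a leaf."""
    return max((1 + depth(children, c) for c in children[node]), default=0)
\end{verbatim}
\end{small}

For example, \verb+binomial_tree(3)+ contains $2^3 = 8$ nodes, its root has
three children, and its depth is 3, in agreement with $B_3$ in
Figure~\ref{fig-binomial}.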
294:
295: When communication times between all process pairs in the computation are
296: equal and have relatively low latency, \hbox{Bar-Noy} and Kipnis
297: show that implementing a broadcast with a binomial tree has the desirable
298: property that all processes will complete the broadcast at approximately
the same time, thus achieving proper load balancing~\cite{postal}.
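
The cost of such a broadcast is easy to estimate under the same
assumption. As a sketch, suppose (the symbols $l$ and $b$ are introduced
here only for illustration) that every \mbox{point-to-point} message of
$N$~Kb takes time $l + N/b$ and that $P = 2^k$. In each round every process
that already holds the data forwards it to one new process, so the number
$n_j$ of informed processes after round $j$ doubles:
\[
n_0 = 1, \qquad n_j = 2\,n_{j-1} \quad\Longrightarrow\quad n_j = 2^j .
\]
The broadcast over $B_k$ therefore completes after $k = \log_2 P$ rounds,
that is, after roughly
\[
T_{\mathrm{binomial}} \approx (\log_2 P)\left(l + \frac{N}{b}\right).
\]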
300:
301: %
302: \subsection{A 2-Level Topology-Aware Broadcast}
303: \label{subsec-2lvl}
304:
305:
306: \begin{figure}
307: \begin{centering}
326: \includegraphics{magpie.eps}
327: \caption{An example of two 2-level topology-aware broadcast trees
328: rooted at SDSC spanning 2 Origin2000s (O2K$_a$ and O2K$_b$) at NCSA
329: and an IBM~SP at SDSC: (a) clustering processes on {\em machine boundaries}
330: and (b) clustering on {\em site boundaries}.}\label{fig-2lvlbcast}
331: \end{centering}
332: \end{figure}
333:
334:
335: Existing 2-level topology-aware approaches~\cite{starT,magpie} cluster
336: processes into groups. The two natural choices for the machines
337: depicted in Figure~\ref{fig-mach} are to cluster the processes based
either on {\em machine boundaries}, creating three groups (the IBM~SP, O2K$_a$,
and O2K$_b$), or on {\em site boundaries}, creating two groups (SDSC and NCSA).
340: While both are reasonable choices and would improve
341: performance when compared with the topology-{\em un}aware binomial tree
342: distributed with MPICH,
343: both choices ignore the disparity in network performance between the
344: local- and wide-area networks. Consider, for example,
345: a broadcast rooted at one of the processes at SDSC. Figure~\ref{fig-2lvlbcast}a
346: depicts the broadcast tree of the 2-level approach when the processes
347: are clustered on machine boundaries. The broadcast starts
348: with the SDSC root process sending messages to designated processes on each
349: of the O2Ks at NCSA, resulting in two messages travelling across the wide-area
350: network, and concludes with broadcasts within each machine. By contrast,
351: Figure~\ref{fig-2lvlbcast}b depicts the broadcast tree when the processes
are clustered on site boundaries. In this case the root at SDSC
sends a single message across the wide-area network to a process on
one of the two O2Ks at NCSA; the operation concludes with a broadcast
within the IBM~SP and a simultaneous broadcast across all the processes
at NCSA, which would typically require multiple messages to travel
across NCSA's local network.
358:
359:
360: %
361: \subsection{A Multilevel Topology-Aware Broadcast}
362:
363:
364: \begin{figure}
365: \begin{centering}
383: \includegraphics{g2.eps}
384: \caption{An example of a multilevel topology-aware broadcast tree rooted
at SDSC spanning 2 Origin2000s (O2K$_a$ and O2K$_b$) at NCSA and an
386: IBM~SP at SDSC.}\label{fig-mlvlbcast}
387: \end{centering}
388: \end{figure}
389:
390:
391: The multilevel topology-aware approach we present minimizes messaging
392: across the slowest links {\em at each level} by clustering the processes
393: at the wide-area level into site groups, and then within each site group,
394: clustering processes at the local-area level into machine groups.
395: Using the same broadcast example from Section~\ref{subsec-2lvl}, we depict in
396: Figure~\ref{fig-mlvlbcast} the broadcast tree used by a multilevel approach.
Here the broadcast starts with the SDSC root process sending a single message
across the wide-area network to one of the processes at NCSA; in
Figure~\ref{fig-mlvlbcast} we depict a process on O2K$_a$. The broadcast
continues with the receiving process on O2K$_a$ sending a single message
across NCSA's local network to a process on O2K$_b$, and the entire broadcast
concludes with broadcasts within each machine. This multilevel
403: clustering minimizes messaging over the slower wide- and local-area
404: networks.
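
The message counts in this example generalize. As a sketch, assume
(this notation is introduced here for illustration and is not part of
\hbox{MPICH-G2}) $S$ sites, $M_s$ machines at site $s$, a root located at
site $r$, and exactly one designated process per group that receives the
data on behalf of that group. Let $W$ and $L$ denote the number of messages
that cross the wide-area and local-area networks, respectively. If, under
machine-boundary clustering, the root sends directly to the designated
process of every other machine, then
\[
W_{\mathrm{machine}} = \sum_{s \ne r} M_s , \qquad
L_{\mathrm{machine}} = M_r - 1 ,
\]
whereas under multilevel clustering every site and every machine receives
the data exactly once at its own level, so that
\[
W_{\mathrm{multilevel}} = S - 1 , \qquad
L_{\mathrm{multilevel}} = \sum_{s=1}^{S} (M_s - 1) .
\]
For the machines in Figure~\ref{fig-mach} ($S = 2$, one machine at SDSC,
two at NCSA, root at SDSC) these expressions give $W = 2$, $L = 0$ for
machine-boundary clustering and $W = 1$, $L = 1$ for multilevel clustering,
matching Figures~\ref{fig-2lvlbcast}a and~\ref{fig-mlvlbcast}.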
405:
406:
407: %
408: %
409: %
\section{Multilevel Topology-Aware Approach in \hbox{MPICH-G2}}
411:
412:
413: In this section we describe our implementation of multilevel topology-aware
414: collective operations in the Globus Toolkit-based \hbox{MPICH-G2}. For illustrative purposes,
415: we discuss our implementation of \verb+MPI_Bcast+ in detail.
416:
440:
441:
442: %
443: \subsection{RSL Specification of Topology}
444:
445:
446: MPICH-G2 uses the Globus Toolkit's Resource Specification Language~(RSL)~\cite{GRAM97} to
447: describe the resources required to run an application.
448: Users write {\em RSL scripts}, which identify resources
449: (e.g., computers) and specify requirements (e.g., number of CPUs, memory,
450: execution time) and parameters (e.g., location of executables, command
451: line arguments, environment variables) for each. An RSL script can
452: be used as
453: the user interface to \verb+globusrun+, an upper-level Globus service
454: that first authenticates the user by using the Grid Security
455: %
456: Infrastructure~(GSI)~\cite{GSI-journal} and then
457: schedules and monitors the job across the various machines by using
458: two other Globus Toolkit services: the Dynamically-Updated Request Online
459: Coallocator~(DUROC)~\cite{CoAllocation99} and Grid Resource Allocation
460: and Management~(GRAM)~\cite{GRAM97}. RSL is designed to be an
461: easy-to-use language to describe multiresource multisite jobs while
462: hiding all the site-specific details associated with requesting
463: such resources.
464:
465:
466: \begin{figure}
467: \begin{footnotesize}
468: \begin{verbatim}
+
( &(resourceManagerContact="sp.npaci.edu")
   (count=10)
   (jobtype=mpi)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
   (directory=/homes/users/smith)
   (executable=/homes/users/smith/myapp)
)
( &(resourceManagerContact="o2ka.ncsa.uiuc.edu")
   (count=5)
   (jobtype=mpi)
   (label="subjob 1")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1))
   (directory=/users/smith)
   (executable=/users/smith/myapp)
)
( &(resourceManagerContact="o2kb.ncsa.uiuc.edu")
   (count=5)
   (jobtype=mpi)
   (label="subjob 2")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 2))
   (directory=/users/smith)
   (executable=/users/smith/myapp)
)
494: \end{verbatim}
495: \end{footnotesize}
496: \caption{\strut{An RSL script for an \hbox{MPICH-G2} application running
497: on three machines that facilitates {\em 2-level process
498: clustering}.}\label{fig-rsl2lvl}}
499: \end{figure}
500:
501:
533:
534:
535: \begin{figure}
536: \begin{footnotesize}
537: \begin{verbatim}
+
( &(resourceManagerContact="sp.npaci.edu")
   (count=10)
   (jobtype=mpi)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
   (directory=/homes/users/smith)
   (executable=/homes/users/smith/myapp)
)
( &(resourceManagerContact="o2ka.ncsa.uiuc.edu")
   (count=5)
   (jobtype=mpi)
   (label="subjob 1")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
                (GLOBUS_LAN_ID NCSAlan))
   (directory=/users/smith)
   (executable=/users/smith/myapp)
)
( &(resourceManagerContact="o2kb.ncsa.uiuc.edu")
   (count=5)
   (jobtype=mpi)
   (label="subjob 2")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 2)
                (GLOBUS_LAN_ID NCSAlan))
   (directory=/users/smith)
   (executable=/users/smith/myapp)
)
565: \end{verbatim}
566: \end{footnotesize}
567: \caption{\strut{An RSL script for an \hbox{MPICH-G2} application running
568: on three machines that facilitates {\em multilevel process
569: clustering}.}\label{fig-rslMlvl}}
570: \end{figure}
571:
572:
573: Figure~\ref{fig-rsl2lvl} depicts an RSL script for an \hbox{MPICH-G2}
574: application intended to run on the computational Grid depicted in
Figure~\ref{fig-mach}. It depicts a job as a set of three {\em subjobs},
where each subjob is associated with a particular resource (in our example,
a computer). Subjobs define a natural machine-boundary partitioning of the
578: processes in \verb+MPI_COMM_WORLD+ and are sufficient for a 2-level machine
579: boundary clustering of the processes. To achieve a multilevel clustering,
580: the user must identify those machines that are on the same local network
581: by specifying a value for an \hbox{MPICH-G2}-defined environment variable
582: \verb+GLOBUS_LAN_ID+, as depicted in the RSL script in
583: Figure~\ref{fig-rslMlvl}. Specifying the same value (\verb+NCSAlan+)
584: in the second and third subjobs instructs \hbox{MPICH-G2} to cluster these
two machines into the same local-area network group. The same technique
can be used to cluster many subjobs into one local-area network group
and to create several local-area network groups by
assigning a distinct \verb+GLOBUS_LAN_ID+ value to each group. This
simple specification (the only difference between Figures~\ref{fig-rsl2lvl}
and~\ref{fig-rslMlvl}) is all that is required to create multilevel
topology-aware clustering of the processes.
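
The grouping implied by these settings can be sketched in a few lines of
Python. The sketch is illustrative only: the function name
\verb+cluster_subjobs+, the tuple layout, the assumption that ranks follow
subjob order, and the rule that a subjob without a \verb+GLOBUS_LAN_ID+
forms a local-area group of its own are assumptions made for this example,
not a description of \hbox{MPICH-G2}'s internals.

\begin{small}
\begin{verbatim}
def cluster_subjobs(subjobs):
    """Map each process rank to a (lan_group, machine_group) pair.

    `subjobs` is a list of (process_count, lan_id) tuples in subjob order;
    lan_id is None when no GLOBUS_LAN_ID was given (assumed here to mean
    that the subjob forms a local-area group of its own).
    """
    lan_groups = {}   # LAN key -> integer group index
    placement = []    # one (lan_group, machine_group) pair per rank
    for machine, (count, lan_id) in enumerate(subjobs):
        key = lan_id if lan_id is not None else ("subjob", machine)
        lan = lan_groups.setdefault(key, len(lan_groups))
        placement.extend([(lan, machine)] * count)
    return placement


# The three subjobs of the multilevel RSL script.
placement = cluster_subjobs([(10, None), (5, "NCSAlan"), (5, "NCSAlan")])
\end{verbatim}
\end{small}

Under these assumptions the ten processes of the first subjob fall into one
local-area group, while the processes of the second and third subjobs share
a second local-area group but remain in separate machine groups.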
592:
593:
609:
610:
611: The multilevel clustering information specified in RSL (i.e., processes
612: gathered first into machine groups and then local network groups composed
613: of machine groups) creates a multilevel grouping of the processes in
614: \verb+MPI_COMM_WORLD+ and is distributed to all the processes during
615: \hbox{MPICH-G2} bootstrapping to be stored within \verb+MPI_COMM_WORLD+ on
616: each process. When new communicators are created (e.g., via \verb+MPI_Comm_split+),
617: \hbox{MPICH-G2} propagates the relevant multilevel clustering information to
618: the newly created communicator so that {\em all communicators} in
619: \hbox{MPICH-G2} have the multilevel clustering information pertaining to their
620: process groups. As an interesting side effect we have made this multilevel
621: topology information available to MPI applications through existing
622: MPI communicator caching idioms. See~\cite{mpichg2} for a full
623: description of \hbox{MPICH-G2}'s topology discovery mechanism.
624: %
625:
626:
627: %
628: \subsection{MPICH-G2's Multilevel Topology-Aware Broadcast}
629:
630:
631: A multilevel topology-aware clustering of processes is not sufficient
632: in itself to allow the construction of a broadcast tree such as that depicted in
633: Figure~\ref{fig-mlvlbcast}: \hbox{MPICH-G2} also needs to know which process is
634: the root of the broadcast. Construction of the multilevel
635: topology-aware tree is therefore deferred until the application calls a
collective operation. At that time each process simultaneously and
independently (i.e., without communication) constructs an identical tree based
638: on the multilevel process grouping found in the communicator and the parameters
639: passed (e.g., identifying the root process of a broadcast) to the collective
640: operation.
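
The following Python sketch illustrates how every process can derive the
same tree from the same inputs without communicating. It is a simplified
illustration under assumptions of our own choosing: each process is
described by a (local-area group, machine) pair, the lowest-ranked process
of a group acts as its representative unless the root belongs to that
group, and a flat fan-out is used at every level for brevity. None of the
names below belong to \hbox{MPICH-G2}.

\begin{small}
\begin{verbatim}
def multilevel_tree(placement, root):
    """Return broadcast edges (sender, receiver, level) for the given root.

    placement[rank] = (lan_group, machine); level is "wan", "lan" or
    "intra". The result depends only on the arguments, so every process
    calling this function with the same inputs obtains an identical tree.
    """
    lans = {}
    for rank, (lan, machine) in enumerate(placement):
        lans.setdefault(lan, {}).setdefault(machine, []).append(rank)

    edges = []
    for lan, machines in sorted(lans.items()):
        lan_ranks = [r for ranks in machines.values() for r in ranks]
        # The root represents its own groups; otherwise the lowest rank
        # does (an illustrative convention).
        lan_rep = root if root in lan_ranks else min(lan_ranks)
        if lan_rep != root:
            edges.append((root, lan_rep, "wan"))
        for machine, ranks in sorted(machines.items()):
            machine_rep = lan_rep if lan_rep in ranks else min(ranks)
            if machine_rep != lan_rep:
                edges.append((lan_rep, machine_rep, "lan"))
            edges.extend((machine_rep, r, "intra")
                         for r in ranks if r != machine_rep)
    return edges


# Three-machine example: 10 ranks on one machine, and 5 + 5 ranks on two
# machines that share a local-area network.
placement = [(0, 0)] * 10 + [(1, 1)] * 5 + [(1, 2)] * 5
edges = multilevel_tree(placement, root=0)
levels = [level for _, _, level in edges]
\end{verbatim}
\end{small}

With the layout of Figure~\ref{fig-mach} and rank~0 as the root, the sketch
yields one wide-area edge, one local-area edge, and 17 intramachine edges.
Our implementation of \verb+MPI_Bcast+ differs from this sketch in the
shape of the subtree chosen at each level.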
641:
642:
643: One benefit of using a multilevel topology-aware tree to implement
644: a collective operation is that we are free to select different subtree
645: topologies at each level. For example, a multilevel broadcast tree can start
646: with a broadcast from the root to selected processes at each site across a
647: wide-area network, followed by broadcasts at each site to selected processes
648: on each machine across the local networks, and concluding with broadcasts
649: within each machine. We have the freedom to use different broadcast
650: topologies at each stage in the sequence. Bar-Noy and Kipnis show
651: that in high-latency networks (e.g., a wide-area network)
652: the optimal broadcast topology is a flat tree in which the root sends the
653: data to all other processes directly, while in a low-latency network
654: (e.g., intramachine messaging), the optimal broadcast topology is a binomial
655: tree~\cite{postal}. We take advantage of these findings and the flexibility
656: of our multilevel approach in our implementation of \verb+MPI_Bcast+ by using
657: a flat broadcast tree at the initial wide-area level and binomial trees at
658: the local-area and intramachine levels.
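
A rough comparison shows why the preferred shape depends on latency. As a
sketch, take the postal model with the following normalization (stated here
as an assumption): a sender is busy for one time unit per message, and a
message arrives $\lambda \ge 1$ time units after its send begins. For a
broadcast among $n = 2^k$ processes,
\[
T_{\mathrm{flat}} = (n - 2) + \lambda , \qquad
T_{\mathrm{binomial}} = k\lambda = \lambda \log_2 n ,
\]
because in the flat tree the root starts the last of its $n-1$ sends at
time $n-2$, while in the binomial tree (with each node serving its largest
subtree first) the longest path consists of $k$ consecutive hops. The flat
tree is therefore faster whenever
\[
\lambda > \frac{n-2}{\log_2 n - 1} \qquad (n > 2),
\]
for example whenever $\lambda > 3$ for $n = 8$. These expressions are
illustrative only; see~\cite{postal} for the exact optimal trees.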
659:
660:
661: In the next section we present results demonstrating the advantages of our
662: multilevel approach by comparing it with the default (topology-{\em un}aware)
663: implementation provided by MPICH and a topology-aware two-layer implementation.
664:
665:
666: %
667: %
668: %
669: \section{Experimental Results}
670: \label{sec-exp}
671:
672:
673: %
674: %
675:
676:
677: To demonstrate the advantages of our multilevel approach, we examine
678: its effects on \verb+MPI_Bcast+.
679: The MPICH implementation of \verb+MPI_Bcast+ is based on binomial trees;
680: hence, in a distributed heterogeneous environment like a computational
681: Grid its performance is acutely sensitive to the distribution
682: of the processes and the root of the broadcast.
For example, in an application using $P=2^k$ processes distributed evenly
across $C=2^i$ clusters, $0 \le i \le k$, a broadcast implemented using
a binomial tree propagates the message down its longest path
using at least $\log_2 C$ intercluster messages and $\log_2\frac{P}{C}$
intracluster messages.
In contrast,
under certain intercluster network performance
conditions described by \mbox{Bar-Noy} and Kipnis in their postal model,
our multilevel method could be used to send
1 intercluster message and $\log_2\frac{P}{C}$ intracluster messages.
Assuming an intercluster latency of $l_s$ sec and bandwidth of $b_s$ Kb/sec,
and an intracluster latency of $l_f$ sec and bandwidth of $b_f$ Kb/sec,
broadcasting a message of $N$ Kb using the
binomial tree conservatively takes
\mbox{$O((\log C)(l_s+\frac{N}{b_s}) + (\log\frac{P}{C})(l_f+\frac{N}{b_f}))$},
whereas
broadcasting the same message using our multilevel method takes only
\mbox{$O((l_s+\frac{N}{b_s}) + (\log\frac{P}{C})(l_f+\frac{N}{b_f}))$}.
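
The two estimates can be compared numerically. The Python sketch below
evaluates both expressions with base-2 logarithms and with the constant
factors hidden by the $O(\cdot)$ notation taken to be 1. The function names
and all parameter values are hypothetical; they were chosen only for
illustration and are not measurements.

\begin{small}
\begin{verbatim}
from math import log2


def binomial_cost(P, C, N, l_s, b_s, l_f, b_f):
    """Estimated time (sec) of the topology-unaware binomial broadcast."""
    return log2(C) * (l_s + N / b_s) + log2(P / C) * (l_f + N / b_f)


def multilevel_cost(P, C, N, l_s, b_s, l_f, b_f):
    """Estimated time (sec) of the multilevel broadcast."""
    return (l_s + N / b_s) + log2(P / C) * (l_f + N / b_f)


# Hypothetical setting: 64 processes in 4 clusters, a 1000 Kb message,
# slow links with 30 ms latency and 1000 Kb/sec bandwidth, fast links
# with 0.1 ms latency and 100000 Kb/sec bandwidth.
params = dict(P=64, C=4, N=1000, l_s=0.03, b_s=1000, l_f=0.0001, b_f=100000)
t_binomial = binomial_cost(**params)
t_multilevel = multilevel_cost(**params)
\end{verbatim}
\end{small}

With these made-up values the estimates are about 2.10~sec and 1.07~sec,
so the saving approaches the factor of $\log_2 C = 2$ expected when the
intercluster term dominates.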
701:
702:
703: \begin{figure}
704: \begin{small}
705: \begin{verbatim}
For (each message size M)
    MPI_Barrier(MPI_COMM_WORLD)
    if (MPI_COMM_WORLD rank == 0)
        t0 = get_time()
    For (r = 0; r < Nprocs; r ++)
        MPI_Bcast(root=r to MPI_COMM_WORLD message size M)
        ack_barrier()
    if (MPI_COMM_WORLD rank == 0)
        t1 = get_time()
        report message size M, time t1-t0
716: \end{verbatim}
717: \end{small}
718: \caption{\strut{The broadcast timing application.}\label{fig-tim-bcast}}
719: \end{figure}
720:
721:
722: We wrote a small MPI application (depicted in Figure~\ref{fig-tim-bcast})
723: that times the broadcasts of messages of increasing size.
724: To represent a broadcast with an arbitrary
725: root, we timed how long it would take to broadcast each message
726: of size M as each process in \verb+MPI_COMM_WORLD+ took its turn
727: as the root. Also, in order to eliminate any potential pipelining
728: that might occur between consecutive broadcasts, we inserted a barrier
729: (\verb+ack_barrier()+) after each broadcast in which all processes
730: other than rank 0 \verb+MPI_Send+ an ACK message to process 0 and then
731: wait to \verb+MPI_Recv+ a GO message. Process 0, after \verb+MPI_Recv+'ing
732: the ACK message from all the other processes, \verb+MPI_Send+'s a GO
733: message to each of the other processes, one at a time. We chose to write our
734: own barrier rather than calling \verb+MPI_Barrier+ because
735: we have reimplemented \verb+MPI_Barrier+ to reflect multilevel topology and
736: we wished these tests to reflect the differences only in the broadcast
737: implementations.
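
For concreteness, the barrier just described can be written in the same
pseudocode style as Figure~\ref{fig-tim-bcast}. This is a sketch of the
protocol as described above, not a listing of the actual test program: the
message names ACK and GO come from the description, the order in which
process 0 receives the ACK messages is an assumption, and tags, datatypes,
and error handling are omitted.

\begin{small}
\begin{verbatim}
ack_barrier()
    if (MPI_COMM_WORLD rank == 0)
        For (r = 1; r < Nprocs; r ++)
            MPI_Recv(ACK from rank r)
        For (r = 1; r < Nprocs; r ++)
            MPI_Send(GO to rank r)
    else
        MPI_Send(ACK to rank 0)
        MPI_Recv(GO from rank 0)
\end{verbatim}
\end{small}

Because process 0 sends no GO message until it has received every ACK, no
process can enter the next broadcast before all processes have completed
the current one.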
738:
739:
751:
752: We conducted experiments running the MPI
753: application depicted in Figure~\ref{fig-tim-bcast} on
754: three computers: the IBM~SP at the San Diego Supercomputer Center
(SDSC-SP) and the IBM~SP (ANL-SP) and SGI~Origin2000 (ANL-O2K) at
756: Argonne National Laboratory.
757: We compare
758: our multilevel topology approach to the binomial tree provided by MPICH
759: and include comparisons to the 2-level approach provided by MagPIe.
760: We ran the application four times, each time using
761: 16 processes on each of the three computers. These results are
762: depicted in Figure~\ref{fig-bcast-magpie}.
763: The curves labeled ``MagPIe-machine'' and ``MagPIe-site'' represent
764: two runs using MagPIe version 2.0.1, each time with
765: a different cluster definition.
766: In our first MagPIe
767: run (``MagPIe-machine'') we defined three clusters, one for each computer,
768: of 16 processes each. In our second MagPIe run (``MagPIe-site'') we defined
769: two clusters: an ANL cluster comprising the two ANL machines having 32
770: processes and an SDSC cluster comprising the SDSC-SP having only 16 processes.
771:
772: %
773: %
774: %
775: \begin{figure}
776: \begin{centering}
777: \includegraphics{timing_bcast.16anlsp16sdscsp16sdscsp.magpie.eps}
778: \caption{Original MPICH broadcast~vs.~topology-aware MPICH broadcast~vs.~MagPIe
779: broadcast running 16 processes
on the IBM~SP at SDSC and 16 processes on each of the IBM~SP and
781: SGI~Origin2000 at ANL.}\label{fig-bcast-magpie}
782: \end{centering}
783: \end{figure}
784:
785: Figure~\ref{fig-bcast-magpie}
786: shows there are significant benefits to the multilevel approach when
787: compared with a simple binomial tree and even when compared with a 2-level
788: approach as implemented by MagPIe.
789: A multilevel view of the network allows an application to avoid
790: slower channels {\em at each level}. In our experiments, the broadcast is
791: optimized by sending one message across the wide-area network, then one
792: message across the local-area network, and then many messages
793: within each computer.
794:
795:
796: %
797: %
798: %
799: \section{Related Work}
800:
801:
802: Previous efforts have focused on creating ``optimal'' trees for collective
803: operations where \mbox{point-to-point} communications are not necessarily
804: equal between any two processes. Husbands and Hoe present
805: MPI-StarT~\cite{starT}, an MPI implementation for a cluster of SMPs
806: interconnected by a high-performance interconnect. They report significant
807: improvements after modifying the MPICH broadcast algorithm, which uses
binomial trees. Their modifications use information describing the
cluster topology to minimize intercluster communication during collective
operations. MagPIe~\cite{magpie} is another MPI system designed to construct
811: collective operation trees in heterogeneous communication environments.
812: MagPIe recognizes a two-layer communication network that distinguishes between
813: local- and wide-area communication. By minimizing wide-area communication,
814: much in the same way MPI-StarT minimizes intercluster communication, MagPIe
815: has seen significant improvements in all the MPI collective operations.
816:
817:
818: Both efforts have produced impressive results and clearly demonstrate
819: that there are significant advantages to implementing collective operations in
820: a topology-aware manner. However, both limit
821: their view of the network to only two layers; MPI-StarT distinguishes
822: between intra- and intercluster communication within the same
823: local-area, and MagPIe distinguishes between local- and wide-area communication.
824: There are opportunities for further optimization by
825: using trees that stratify the network deeper than two layers.
826:
827:
828: In~\cite{pipelining} van de Geijn et al. show the advantages of
829: implementing collective operations by segmenting and pipelining
830: messages when communicating over relatively slower channels (e.g., TCP
831: over local- and wide-area networks).
832:
833:
834:
835:
In~\cite{magpie-PC} Kielmann et al. extend MagPIe by incorporating
van de Geijn's pipelining idea through a technique they call Parameterized
LogP (PLogP), which is an extension of the LogP model presented
by Culler et al.~\cite{logp}.
840: In this extension, MagPIe still recognizes only a two-layer communication
841: network, but through parameterized studies of the network, the researchers determine
842: ``optimal'' packet sizes. This technique works well for applications
843: that always run on the same computational grid having relatively
844: stable performance, but requires retuning when moving the application
845: from one computing environment or network to another.
846:
847:
848: %
849: %
850: %
851: \section{Future Work}
852:
853:
854: We have implemented five of the MPI collective operations in a
855: topology-aware multilevel manner in \hbox{MPICH-G2}. Encouraged by our
856: initial results, we plan to upgrade \hbox{MPICH-G2's} remaining MPI collective
857: operations in a similar manner.
858:
859:
885:
886:
887: Our general strategy implements a collective operation by first stratifying
888: the network into multiple levels and then minimizing the communication
889: across the slowest channels. In doing so, however, we may encounter
890: a tree that has multiple siblings at a particular level, for example,
891: many sites connected across the wide-area network or many machines
892: at a particular site. When this situation happens, we implement the collective
893: operation at that level using a binomial tree at all but the wide-area
894: network level. Unfortunately, a binomial tree is not always the best choice.
895: \mbox{Bar-Noy} and Kipnis show that the shape of a collective
896: operation tree depends heavily on the \mbox{point-to-point} communication
897: characteristics of the send/receive primitives on which it is implemented.
Their model incorporates a latency parameter $\lambda \ge 1$. They show
that for low latencies (for example, communication within a single machine),
the optimal broadcast tree is a binomial tree, but for higher latencies
(for example, communication across a wide-area network), the optimal broadcast
tree becomes flatter. We will investigate ways to select better, if
903: not optimal, collective operation trees by choosing those that respect
904: the different communication characteristics at each level of our multilevel
905: view.
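
The flattening can be seen from a short counting argument, which we restate
here as a sketch under simplifying assumptions: time is measured in integer
units, $\lambda$ is an integer, a sender is busy for one unit per message, and
a message sent at time $t$ arrives at time $t + \lambda$. The notation
$N_\lambda(t)$, the largest number of processes that can hold the data by
time $t$, is introduced here for illustration. The root's first message
arrives at time $\lambda$, leaving its recipient $t - \lambda$ units to
inform others, while the root continues with $t - 1$ units, so
\[
N_\lambda(t) = \left\{
\begin{array}{ll}
1, & 0 \le t < \lambda, \\
N_\lambda(t-1) + N_\lambda(t-\lambda), & t \ge \lambda.
\end{array}
\right.
\]
For $\lambda = 1$ this gives $N_1(t) = 2^t$, which corresponds to the binomial
tree. For $\lambda = 2$ it gives the Fibonacci numbers $1, 1, 2, 3, 5, \ldots$:
the root sends several messages before its first child can contribute, so the
root's fan-out grows relative to the depth of the tree. As $\lambda$ increases
further, the same effect pushes the tree toward a flat shape.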
906:
907:
908: The pipelining techniques presented by van de Geijn et al. can be used at
909: each of the levels in \hbox{MPICH-G2's} multilevel topology-aware collective
operations. Using techniques similar to Kielmann's PLogP method, we will
911: develop methods to determine the appropriate packet sizes with respect to
912: network performance at {\em each level} of our multilevel view.
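
One simple way to reason about the packet size, offered as a sketch and not
as the PLogP method itself, is a linear cost model in which sending a segment
of $s$ bytes over one hop costs $\alpha + \beta s$. The symbols $\alpha$
(per-message startup cost), $\beta$ (per-byte cost), $m$ (message size), and
$k$ (number of hops in a chain) are assumptions introduced here for
illustration. Pipelining an $m$-byte message as $m/s$ segments along a chain
of $k$ hops then takes approximately
\[
T(s) = \left( k - 1 + \frac{m}{s} \right) (\alpha + \beta s),
\]
and setting $dT/ds = (k-1)\beta - m\alpha/s^2 = 0$ gives, for $k > 1$,
\[
s^{*} = \sqrt{\frac{\alpha\, m}{(k-1)\,\beta}}.
\]
Because $\alpha$ and $\beta$ can differ greatly between the wide-area,
local-area, and intramachine levels, this model suggests that a different
segment size may be appropriate at each level.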
913:
914:
915: %
916: %
917: %
918: \section{Summary}
919:
920:
921: As Grid computations become increasingly prevalent, the need for
922: topology-aware collective operations also increases. We have a version
923: of \hbox{MPICH-G2} that implements five collective operations in a multilevel
topology-aware manner. We have shown, at least for \verb+MPI_Bcast+, that
executing collective operations using a multilevel view of the network
offers significant advantages over both the binomial tree provided by MPICH
and the two-level approach provided by MagPIe.
928: Through a simple process of identifying machines that are common
929: to a local-area network, we have provided a means by which an MPI
930: application may take advantage of the multilevel topology-aware algorithms
931: without requiring code modifications or special functions.
932:
933:
934: %
935: %
936: %
937: \begin{acknowledge}
938:
939:
940: We thank the San Diego Supercomputer Center and the National Center
941: for Supercomputing Applications for providing access to their
942: machines. We also thank the members of the Globus development
943: team for their support, patience, and many ideas. This work was
944: supported in part by the Mathematical, Information, and Computational
945: Sciences Division subprogram of the Office of Advanced Scientific
946: Computing Research, U.S. Department of Energy, under Contract
947: W-31-109-Eng-38; by the U.S. Department of Energy under
948: Cooperative Agreement No. DE-FC02-99ER25398;
949: by the National Science Foundation; by DARPA; and by
950: the NASA Information Power Grid program.
951:
952: \end{acknowledge}
953:
954:
955: %
956: \bibliographystyle{plain}
957: \bibliography{foster_bibliography,mpi-allbib,mpi-book,globus,prop}
958:
959:
960: %
961:
962: Nicholas T. Karonis received a B.S. in finance and a B.S. in computer
963: science from Northern Illinois University in 1985, an M.S. in computer
964: science from Northern Illinois University in 1987, and a Ph.D. in
965: computer science from Syracuse University in 1992. He spent
966: summers from 1981 to 1991 as a student at Argonne National Laboratory
967: where he worked on the p4 message-passing library, automated reasoning,
and genetic sequence alignment. From 1991 to 1995 he worked on the control
969: system at Argonne's Advanced Photon Source and from 1995 to 1996
970: for the Computing Division at Fermi National Accelerator Laboratory.
971: Since 1996 he has been an assistant professor of computer science at
972: Northern Illinois University and a resident associate guest of Argonne's
973: Mathematics and Computer Science Division where he has been a member
974: of the Globus Project. His current research interest is message-passing
975: systems in computational Grids.
976:
977: Bronis R. de Supinski is a computer scientist in the Center
978: for Applied Scientific Computing at Lawrence Livermore
979: National Laboratory. His research interests include
message-passing implementations and tools, memory performance
improvement, cache coherence and distributed shared memory,
consistency semantics, and performance evaluation modeling
and tools. Bronis earned his Ph.D. in computer science
984: from the University of Virginia in 1998. He is a
985: member of the ACM and the IEEE Computer Society.
986:
987:
988: Ian Foster received his B.Sc. (Hons I) at the University of Canterbury
in 1979 and his Ph.D. from Imperial College, London, in 1988. He
990: is senior scientist and associate director of the
991: Mathematics and Computer Science Division at Argonne National
992: Laboratory, and professor of computer science at the University of
993: Chicago. He has published four books and over 150 papers and
994: technical reports. He co-leads the Globus Project, which provides
995: protocols and services used by industrial and academic distributed
996: computing projects worldwide. He co-founded the influential Global
997: Grid Forum and co-edited the book ``The Grid: Blueprint for a New
998: Computing Infrastructure.''
999:
1000:
1001: William Gropp received his B.S. in mathematics from Case Western Reserve
University in 1977, an M.S. in physics from the University of Washington in
1003: 1978, and a Ph.D. in computer science from Stanford in 1982. He held the
positions of assistant (1982--1988) and associate (1988--1990) professor in
1005: the Computer Science Department at Yale University. In 1990, he joined the
1006: numerical analysis group at Argonne, where he is a senior computer
1007: scientist and associate director of the Mathematics and Computer Science
1008: Division, a senior scientist in the Department of Computer Science at the
1009: University of Chicago, and a Senior Fellow in the Argonne-University of Chicago
1010: Computation Institute. His research interests are in parallel computing,
1011: software for scientific computing, and numerical methods for partial
1012: differential equations. He has played a major role in the development of
1013: the MPI message-passing standard. He is co-author of MPICH, the most widely used
1014: implementation of MPI, and was involved in the MPI Forum as a
1015: chapter author for both MPI-1 and MPI-2. He has written many books and
papers on MPI, including ``Using MPI'' and ``Using MPI-2.'' He is also one of
1017: the designers of the PETSc parallel numerical library, and has developed
1018: efficient and scalable parallel algorithms for the solution of linear and
1019: nonlinear equations.
1020:
1021:
1022: Ewing Lusk received his B.A. in mathematics from the University of Notre Dame
1023: in 1965 and his Ph.D. in mathematics from the University of Maryland in 1970.
1024: He is currently a senior computer scientist in the Mathematics and Computer
1025: Science Division at Argonne National Laboratory. His current projects include
1026: implementation of the MPI message-passing standard, research into programming
1027: models for parallel architectures, and parallel performance analysis tools.
He is a leading member of the team responsible for the MPICH implementation of the
1029: MPI message-passing interface standard. He is the author of five books and
1030: more than seventy-five research articles in mathematics, automated deduction,
1031: and parallel computing.
1032:
1033:
Sebastien Lacour graduated in physics in 1999 from the Ecole
Normale Superieure of Lyon, France. He received his master's
degree in computer science in 2002 from IFSIC, University of
Rennes, France. He is currently a Ph.D. student at IRISA/INRIA in
1038: Rennes. His research interests include networks, compilation, and
1039: parallel and distributed systems. His current work focuses on
1040: distributed shared-memory systems over large-scale, hierarchical
1041: architectures (multicluster platforms).
1042:
1043:
1044: \end{document}
1045: