1: \documentclass{aadebug}
2: \usepackage[psamsfonts]{amssymb}
3: \usepackage{graphicx}
4: \usepackage{type1cm}
5: \usepackage{times, mathptmx}
6:
7: \newcommand{\lw}[1]{\smash{\lower1.5ex\hbox{#1}}}
8:
9: \newcommand{\sor}{\stackrel{S}{\rightarrow}}
10: \newcommand{\cor}{\stackrel{C}{\rightarrow}}
11: \newcommand{\nor}{\stackrel{N}{\rightarrow}}
12: \newcommand{\hbr}{\rightarrow}
13: \newcommand{\todo}[1]{{\bf (Todo: #1)}}
14:
15: \begin{document}
16: \corr{0309027}{127}
17:
18: \runningheads{Masao Okita et al.}{Debugging Tool for Localizing Faulty Processes in Message Passing Programs}
19:
20: \title{Debugging Tool for Localizing Faulty Processes in Message Passing Programs}
21:
22: \author{
23: Masao~Okita\addressnum{1}\comma\extranum{1},
24: Fumihiko~Ino\addressnum{1}\comma\extranum{1},\\
25: and Kenichi Hagihara\addressnum{1}\comma\extranum{1}
26: }
27:
28:
29: \address{1}{
30: Graduate School of Information Science and Technology,
31: Osaka University\\
32: 1-3 Machikaneyama,
33: Toyonaka,
34: Osaka 560-8531,
35: Japan
36: }
37: \extra{1}{E-mail: \{m-okita,ino,hagihara\}@ist.osaka-u.ac.jp}
38:
39: \pdfinfo{
40: /Title (Debugging Tool for Localizing Faulty Processes in Message Passing Programs)
41: /Author (Masao Okita, Fumihiko Ino, and Kenichi Hagihara)
42: }
43:
44: \begin{abstract}
45: In message passing programs, once a process terminates with an
46: unexpected error, the terminated process can propagate the error to the
47: rest of the processes through communication dependencies, resulting in a program
48: failure. Therefore, to locate faults, developers must identify the
49: group of processes involved in the original error and {\em
50: faulty processes} that activate faults. This paper presents a
51: novel debugging tool, named {\em MPI-PreDebugger} (MPI-PD), for
52: localizing faulty processes in message passing programs. MPI-PD
53: automatically distinguishes the original and the propagated errors by
54: checking communication errors during program execution. If MPI-PD
55: observes any communication errors, it backtraces communication dependencies and
56: points out potential faulty processes in a timeline view. We also
57: introduce three case studies, in which MPI-PD has been shown to play
58: the key role in their debugging. From these studies, we believe that
59: MPI-PD helps developers to locate faults and allows them to concentrate on
60: correcting their programs.
61: \end{abstract}
62:
63: \keywords{parallel processing; message passing; debugging; fault localization}
64:
65: \section{Introduction}
66: \label{sec:introduction}
67: In recent years, cluster/grid computing
68: \cite{buyya99cluster,ian98grid} has emerged as a cost-effective
69: methodology for high performance computing. The message passing
70: paradigm \cite{mpif94mpi} is a widely employed programming paradigm
71: that gives us efficient parallel programs on these computing
72: environments.
73:
74: However, debugging message passing programs is usually time-consuming,
75: since we have to investigate a large amount of debugging information
76: compared to sequential programs. Furthermore, once a process
77: terminates with an unexpected error \cite{melliar-smith77ldrs}, the
78: terminated process can propagate the error to the rest of the processes
79: through communication dependencies. For example, if a process terminates
80: before sending an intended message, the receiver process that has no
81: original fault also terminates, since it fails to receive the expected
82: message. This {\em error propagation} makes it difficult to locate
83: the hidden faults among a number of observed errors.
84:
85: To give developers valuable insights for debugging, a number of
86: debugging tools have been developed for message passing
87: programs. Post-mortem performance debuggers such as ParaGraph
88: \cite{heath91paragraph}, ATEMPT \cite{kran96graph}, XMPI
89: \cite{www02xmpi}, and Vampir \cite{pallas99vampir} visualize a detailed
90: timeline view of communications, so that developers can
91: intuitively understand program behaviors.
92:
93: Source-level debuggers such as TotalView \cite{etnus01tv}, MPIGDB
94: \cite{ralph00mpd}, and CDB \cite{wu02cdb} allow stepwise
95: execution of programs. TotalView also has a visualization facility,
96: named Message Queue Graph (MQG), which shows the states of the
97: pending send and receive operations. MPIGDB is based on a
98: sequential debugger, GDB \cite{stallman02gdb}, and allows developers
99: to broadcast terminal input to all GDB processes attached to
100: computing processes. CDB also provides a similar debugging
101: environment by employing GDB at its lower layer.
102:
103: {\em Fault localization} \cite{jones02faultlocalzation} is another
104: approach for debugging programs.
105: {\em Relative debugging} \cite{hood00relative,gregory01relative} is
106: a kind of fault localization
107: for programs that have been
108: ported from sequential to parallel architectures or between different
109: parallel architectures.
110: It dynamically compares data between two
111: executing programs, so that it can locate errors in the compared
112: programs. In \cite{robert96racecondition}, Netzer et al.~have pointed
113: out that unforeseen consequences of bugs can cause messages to arrive
114: in unexpected orders. Their algorithm dynamically locates errors by
115: detecting unintended nondeterminism, or race conditions.
116:
117: Process grouping
118: \cite{kran02ipdps,kunz93wpdd,stringhini00pdpta} is a
119: fundamental technique for scalable visualization and debugging. DeWiz
120: \cite{kran02pdp,kran02ipdps} aims at identifying closely related
121: processes and reducing the amount of trace data. Given a specific
122: process, DeWiz isolates the related processes according to the
123: accumulated length of transmitted messages.
124:
125: Thus, a number of tools provide useful debugging functions. However,
126: developers still struggle to select the original error from a number
127: of observed errors, including original and propagated errors.
128: Once the original error is given to developers,
129: they can immediately investigate faults by using existing debuggers and
130: concentrate on correcting them.
131:
132: In this paper, we propose a novel debugging tool, named {\em
133: MPI-PreDebugger} (MPI-PD), for localizing faulty processes in message
134: passing programs. The current MPI-PD supports programs written using the
135: Message Passing Interface (MPI) standard \cite{mpif94mpi} and focuses
136: on faults that terminate program execution. MPI-PD aims at reducing
137: developers' workloads required for localizing faulty processes in
138: timeline visualization.
139:
140: To achieve this, MPI-PD dynamically checks communication errors in
141: accordance with the error definition in a program execution model. If
142: MPI-PD observes any communication errors, it then generates a trace
143: file, backtraces communication dependencies and points out
144: potentially faulty processes in a timeline view. Thus, MPI-PD reduces
145: the amount of debugging information before developers visualize and
146: investigate it by using performance debuggers and source-level
147: debuggers.
148:
149: The rest of this paper is organized as follows. Section
150: \ref{sec:definition} formally characterizes communication errors in
151: MPI programs and makes clear the differences among faults, errors, and
152: failures. Section \ref{sec:algorithm} gives an algorithm for
153: localizing faulty processes in a given trace file while Section
154: \ref{sec:mpipd} presents MPI-PD, which implements the proposed
155: algorithm. Section \ref{sec:studies} introduces three case studies
156: assisted by MPI-PD. Finally, Section \ref{sec:conclusions} concludes
157: this paper.
158:
159:
160: \section{Modeling Behavior of Message Passing Programs}
161: \label{sec:definition}
162: This section shows a definition of communication errors in MPI
163: programs. We define it by extending the program execution model
164: described in \cite{robert92model}.
165:
166:
167: \subsection{Event graph: program execution model}
168: \label{subsec:event_graph}
169: An execution of a message passing program is defined as a directed
170: graph, $G=(E,\rightarrow)$, where $E$ represents a finite set of
171: {\em events} while $\rightarrow$ represents the
172: {\em happened-before relation} \cite{lamport78cacm} defined over $E$
173: \cite{robert92model}. In the following, we call this directed graph
174: the {\em event graph} \cite{kran02pdp}.
175:
176: An event in this context represents the execution instance of a set
177: of consecutively executed statements in some process
178: \cite{robert92model}. Any event $e \in E$ is observed during a
179: program execution. In the following, let $e_{p,i}$ be the
180: $i^{\rm th}$ event on process $p$.
181:
182: The happened-before relation $\hbr$ shows how events potentially
183: affect one another \cite{lamport78cacm}. This relation is defined as the
184: irreflexive transitive closure of the union of two other relations:
185: $\hbr = (\sor \cup \cor)^+$. Here, $\sor$ and $\cor$ respectively
186: represent the sequential order relation and the concurrent order
187: relation as follows \cite{kran02pdp}:
188:
189: \begin{figure}[ht]
190: \centering
191: \includegraphics[scale=.40]{eps/relation-blocking.eps}
192: \qquad
193: \includegraphics[scale=.40]{eps/relation-nonblocking.eps}
194:
195: (a) Blocking communication\hspace{1em}
196: (b) Nonblocking communication
197: \caption{Order relations between events. A node represents an event
198: and an arrow represents a relation.}
199: \label{fig:relation}
200: \end{figure}
201:
202: \begin{description}
203: \item[Sequential order relation, $\sor$:]
204:
205: As illustrated in Figure \ref{fig:relation}(a), the
206: sequential order of events, $e_{p,i} \sor e_{p,i+1}$,
207: defines that the $i^{\rm th}$ event $e_{p,i}$ on any
208: sequential process $p$ occurred before the
209: $(i+1)^{\rm th}$ event $e_{p,i+1}$.
210:
211: \item[Concurrent order relation, $\cor$:]
212:
213: As illustrated in Figure \ref{fig:relation}(a), the
214: concurrent order of events, $e_{p,i} \cor e_{q,j}$,
215: defines that the $i^{\rm th}$ event $e_{p,i}$ on any
216: process $p$ occurred directly before the $j^{\rm th}$
217: event $e_{q,j}$ on another process $q$, if $e_{p,i}$ is
218: the sending of a message by process $p$ and $e_{q,j}$ is
219: the receipt of the same message by another process $q$.
220:
221: \end{description}
222:
223: Although the event graph is a sufficient model for visualizing the
224: behavior of message passing programs, we have to add one relation to
225: this graph to characterize the errors relevant to
226: nonblocking communications \cite{mpif94mpi}. This additional relation
227: exists between a pair of events caused by the initiation and the
228: completion of a nonblocking send/receive operation:
229:
230: \begin{description}
231: \item[Nonblocking order relation, $\nor$:]
232:
233: As illustrated in Figure \ref{fig:relation}(b), the
234: nonblocking order relation, $\nor$, shows the order in
235: which nonblocking messages are initiated and then
236: completed: $e_{p,i} \nor e_{p,k}$ defines that
237: $e_{p,i} \sor e_{p,k}$, if $e_{p,i}$ is the send/receipt
238: initiation of a message by process $p$ and $e_{p,k}$ is
239: the completion of the same message by the same process $p$.
240: \end{description}
241:
242: In our extended event graph, the happened-before relation is
243: redefined as $\hbr = (\sor \cup \cor \cup \nor)^+$.
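
As a small worked illustration of the closure (the event indices below are
chosen arbitrarily and are not taken from Figure \ref{fig:relation}), suppose
that process $p$ executes $e_{p,1}$ and then sends a message at $e_{p,2}$,
and that process $q$ receives this message at $e_{q,1}$ and then executes
$e_{q,2}$. The individual relations are
\[
e_{p,1} \sor e_{p,2}, \qquad e_{p,2} \cor e_{q,1}, \qquad e_{q,1} \sor e_{q,2},
\]
and chaining them through the transitive closure yields
\[
e_{p,1} \hbr e_{q,2},
\]
although no single relation links $e_{p,1}$ and $e_{q,2}$ directly.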
244:
245: \subsection{Fault, error, and failure}
246: \label{subsec:fault-error-failure}
247: The concepts of faults, errors, and failures
248: \cite{melliar-smith77ldrs} used in our discussion are briefly
249: explained as follows: a program with a bug has a fault in itself and
250: an active fault causes an error. If the error fails to be corrected,
251: it causes a failure.
252:
253: \begin{figure}[ht]
254: \centering
255: \includegraphics[scale=.5]{eps/fault-error-failure.eps}
256:
257: \caption{Fault, error, and failure events. A crossed node
258: represents an unexpectedly terminated event, while a dotted node
259: represents an expected but non-occurred event.}
260: \label{fig:fault-error-failure}
261: \end{figure}
262:
263: Figure \ref{fig:fault-error-failure} shows an example that interprets
264: these three concepts on events. In this example, process $r$ is the
265: faulty process, since it executes a faulty statement and causes a
266: faulty event. It also terminates against the developer's
267: intention, thereby causing a failure event. After this, process $q$
268: fails to pass a message to process $r$, which causes an error
269: event and results in a failure event (since it terminates). Process
270: $p$ also faces a communication error; however, its error handler
271: avoids a failure.
272:
273: Let $is\_failed(e)$ denote whether event $e$ causes a failure or
274: not. Since failure events have no successor and occur when programs
275: unexpectedly terminate, $is\_failed(e)$ is defined as follows:
276: \begin{eqnarray}
277: is\_failed(e) \hspace{-.5em} &=& \hspace{-.5em} \left\{ \begin{array}{ll} true & \mbox{if the program terminated unexpectedly at event $e$,} \\ false & \mbox{otherwise.} \end{array} \right. \nonumber
278: \label{eqn:isfailed}
279: \end{eqnarray}
280:
281: \subsection{Communication errors in MPI programs}
282: \label{subsec:comm_faults}
283: In MPI programs, an event causes a communication error, if it
284: satisfies one of the following two conditions: isolated or
285: truncated, defined as follows:
286:
287: \begin{itemize}
288: \item {\em Isolated events}.
289:
290: \begin{itemize}
291: \item An event $e_{p,i}$ ($e_{q,j}$) is called an
292: isolated send (receive) event,
293: if $\neg \exists~e_{q,j} \in E~(e_{p,i} \in E)$
294: such that $e_{p,i} \cor e_{q,j}$, respectively
295: \cite{kran02pdp}.
296: \item An event $e_{p,i}$ ($e_{p,k}$) is called an
297: isolated send/receive initiation (completion)
298: event, if
299: $\neg \exists~e_{p,k} \in E~(e_{p,i} \in E)$ such
300: that $e_{p,i} \nor e_{p,k}$, respectively.
301: \end{itemize}
302:
303: \item {\em Truncated events}.
304:
305: \begin{itemize}
306: \item Two events $e_{p,i}$ and $e_{q,j}$ are called
307: truncated events, if $e_{p,i} \cor e_{q,j}$ and
308: $len(e_{p,i}) > len(e_{q,j})$, where $len(e_{p,i})$
309: and $len(e_{q,j})$ represent the length of the send
310: buffer specified in event $e_{p,i}$ and the receive
311: buffer specified in event $e_{q,j}$, respectively.
312: \end{itemize}
313: \end{itemize}
314:
315: Isolated events arise in the following two situations. One is
316: the mismatch of occurred events and the other is the
317: non-occurrence of expected events. First, occurred but mismatched
318: events can trigger error propagation. For example, an MPI
319: routine call with an invalid tag/communicator \cite{mpif94mpi} or an
320: invalid source/destination rank fails to pass the
321: intended message. A similar mismatch can occur between the initiation
322: and the completion of a nonblocking send/receive operation. Next,
323: expected but non-occurred events cause serious problems, since they
324: can propagate errors through all processes. For example, if a
325: process terminates before sending an intended message,
326: the receiver process that has no original fault also terminates,
327: since it fails to receive the expected message. Thus, isolated events
328: propagate errors like a domino effect,
329: leading to a program failure.
330:
331: A pair of truncated events indicates an occurrence of an overflow at the
332: receive buffer.
333: In a strict sense, a message should be passed between the send and the
334: receive operations with the same buffer length
335: \cite{kran02pdp}. However, as MPI does, we also permit
336: passing a message between events $e_{p,i}$ and $e_{q,j}$ such that
337: $e_{p,i} \cor e_{q,j}$ and $len(e_{p,i}) < len(e_{q,j})$. In practice,
338: some nondeterministic applications require this flexibility, because
339: the receiver processes in these applications need to receive a
340: variable-length message in one receive operation.
341: Therefore, we permit passing a message
342: between events with different buffer lengths except for truncated
343: events.
344:
345: Thus, the error of an event can depend on that of an event on another
346: process. In this paper, we say that processes $p$ and $q$ have a
347: {\em communication dependency} if the error of event $e_{p,i}$ on process
348: $p$ determines that of event $e_{q,j}$ on another process $q$.
349:
350: Note that MPI has four communication modes \cite{mpif94mpi}: the
351: standard, buffered, synchronous, and ready modes. These modes differ in
352: when they resolve the matching of outgoing messages. For example, when
353: two processes send a message to each other, they fall into a deadlock
354: in the synchronous mode while they are deadlock-free in the buffered
355: mode. Therefore, we have to check communication errors without
356: destroying these communication semantics in the target programs. That
357: is, outgoing messages have to be checked in the same mode as their
358: original mode. The error detection mechanism employed in MPI-PD is
359: presented later in Section \ref{subsec:error-detection}.
360:
361: For collective communications, since they can be implemented by using
362: point-to-point communications, we repeatedly apply the above
363: error definition to all of the point-to-point messages that compose
364: the collective communication.
365:
366: In the following, let $is\_isolated(e_{p,i})$ denote whether event
367: $e_{p,i}$ is an isolated event or not. Let
368: $is\_truncated(e_{p,i},e_{q,j})$ also denote whether events
369: $e_{p,i}$ and $e_{q,j}$ are truncated events or not.
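
The two predicates can be illustrated with a short Python sketch. It is
only a sketch under simplifying assumptions and not part of MPI-PD: the
function name {\tt classify}, the dictionaries {\tt sends} and {\tt recvs}
(event identifier to buffer length), and the set {\tt matched} (the pairs
related by $\cor$) are hypothetical names introduced here for illustration.

\begin{verbatim}
def classify(sends, recvs, matched):
    """Return (isolated events, truncated pairs).

    sends, recvs: dict mapping an event id to its buffer length
    matched:      set of (send id, receive id) pairs, i.e. the C relation
    All three containers are hypothetical stand-ins for the event graph.
    """
    # Every event that takes part in some matched pair.
    paired = {s for s, _ in matched} | {r for _, r in matched}
    # Isolated: a send or receive event without a matching partner.
    isolated = {e for e in list(sends) + list(recvs) if e not in paired}
    # Truncated: matched, but the send buffer exceeds the receive buffer.
    truncated = {(s, r) for s, r in matched if sends[s] > recvs[r]}
    return isolated, truncated


sends = {("p", 1): 8, ("p", 2): 4}
recvs = {("q", 1): 4, ("q", 2): 16, ("r", 1): 4}
matched = {(("p", 1), ("q", 1)), (("p", 2), ("q", 2))}
isolated, truncated = classify(sends, recvs, matched)
\end{verbatim}

In this example the receive event on $r$ has no partner and is isolated,
the first pair is truncated because the send buffer is longer than the
receive buffer, and the second pair is permitted because the receive
buffer is the longer one.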
370:
371: \section{Algorithm for Localizing Faulty Processes}
372: \label{sec:algorithm}
373: This section presents the details of our proposed algorithm. We
374: describe how to localize faulty processes in a given event graph. We
375: assume here that the event graph is already generated by the error
376: detection mechanism presented later in Section
377: \ref{subsec:error-detection}.
378:
379: \begin{figure}[htb]\footnotesize
380: \setbox0\vbox{
381: 1. {\bf Algorithm} LocalizeFaultyProcesses($P$, $G$, $P_e$, $E_e$)\\
382: \hspace{.5em}2. \hspace{1em} // Input: $P$, a set of process ranks.\\
383: \hspace{.5em}3. \hspace{1em} // \hspace{2.4em} $G$, an event graph.\\
384: \hspace{.5em}4. \hspace{1em} // Output: $P_e$, a set of localized faulty process ranks.\\
385: \hspace{.5em}5. \hspace{1em} // \hspace{3em} $E_e$, a set of failure events on each process.\\
386: \hspace{.5em}6. {\bf begin}\\
387: \hspace{.5em}7. \hspace{2em}// (1) Identify failure events that occurred on each process.\\
388: \hspace{.5em}8. \hspace{2em}$E_e := \emptyset$;\\
389: \hspace{.5em}9. \hspace{2em}{\bf foreach} ($p \in P$) {\bf begin}\\
390: 10. \hspace{4em}{\bf if} $e_{p,i}$ such that $is\_failed(e_{p,i})=true$ exists. \hspace{0.5em}{\bf then} \hspace{0.5em}$fe_p := e_{p,i}$\\
391: 11. \hspace{4em}{\bf else}\hspace{0.5em}$fe_p := null$\\
392: 12. \hspace{4em}{\bf endif}\\
393: 13. \hspace{4em}$E_e := E_e \cup \{ fe_p \}$;\\
394: 14. \hspace{2em}{\bf end}\\
395: 15. \hspace{2em}// (2) Localize faulty processes by recursive analysis.\\
396: 16. \hspace{2em}$P_e := \emptyset$;\\
397: 17. \hspace{2em}{\bf foreach} ($p \in P$) {\bf begin}\\
398: 18. \hspace{4em}{\bf if} (BacktraceCommDep($p$, $\emptyset$) $\ne$ 0)\hspace{0.5em}{\bf then}\hspace{0.5em}$P_e := P_e \cup \{ p \}$;\hspace{3em}// Process $p$ has faults.\\
399: 19. \hspace*{2em}{\bf end}\\
400: 20. {\bf end}\\
401: 21. // A recursive function that backtraces communication dependencies from process $p$.\\
402: 22. {\bf function} BacktraceCommDep($p$, $P_{dep}$)\\
403: 23. {\bf begin}\\
404: 24. \hspace*{2em}{\bf if} (($p \in P_e$) $\vert \vert$ (($fe_p = null$) \&\& ($P_{dep} = \emptyset$))) \hspace{0.5em}{\bf then} \hspace{0.5em}{\bf return} 0;\hspace{1.4em}// $p$ is already traced or valid.\\
405: 25. \hspace*{2em}{\bf else if} ($fe_p$ is a calculation event) \hspace{0.5em}{\bf then} \hspace{0.5em}{\bf return} --1;\hspace{5em}// (a) Calculation fault.\\
406: 26. \hspace*{2em}{\bf else if} ($fe_p = null$) \hspace{0.5em}{\bf then}\hspace{0.5em}{\bf return} --2;\hspace{11em}// (b) Non-occurred event.\\
407: 27. \hspace*{2em}{\bf else if} ($p \in P_{dep}$) \hspace{0.5em}{\bf then}\hspace{0.5em}{\bf return} $p$;\hspace{12.5em}// (c) Deadlock or (d) Overflow.\\
408: 28. \hspace*{2em}{\bf endif}\\
409: 29. \hspace*{2em}$q := ptnr(fe_p)$;\hspace{3em}// Source/destination rank for $fe_p$\\
410: 30. \hspace*{2em}$Q_{dep} := P_{dep} \cup \{ p \}$;\hspace{1.4em}// Update the call history.\\
411: 31. \hspace*{2em}$retval :=$ BacktraceCommDep($q$, $Q_{dep}$);\\
412: 32. \hspace*{2em}{\bf if} ($retval \ne 0$) \hspace{0.5em}{\bf then}\hspace{0.5em}$P_e := P_e \cup \{ q \}$;\hspace{1.4em}// Process $q$ has faults.\\
413: 33. \hspace*{2em}{\bf if} ($retval = p$) \hspace{0.5em}{\bf then}\hspace{0.5em}$retval := 0$;\\
414: 34. \hspace*{2em}{\bf else if} ($retval < 0$) \hspace{0.5em}{\bf then}\hspace{0.5em}$retval$++;\\
415: 35. \hspace*{2em}{\bf endif}\\
416: 36. \hspace*{2em}{\bf return} $retval$;\\
417: 37. {\bf end}
418: }\centerline{\fbox{\box0}}
419: \caption{Algorithm for localizing faulty processes.}
420: \label{fig:algorithm}
421: \end{figure}
422:
423: \begin{figure}[ht]
424: \centering
425: \hspace{5em}
426: \includegraphics[scale=.43]{eps/failure-status-1.eps}
427:
428: (a) Calculation fault\hspace{4em}
429: (b) Non-occurred event
430: \includegraphics[scale=.43]{eps/failure-status-2.eps}
431:
432: (c) Deadlock\hspace{8em}
433: (d) Overflow
434: \caption{Four failure situations classified by the proposed algorithm.}
435: \label{fig:failure-status}
436: \end{figure}
437:
438: Figure \ref{fig:algorithm} shows our algorithm, which requires a set
439: of process ranks, $P$, and an event graph, $G$, and returns sets of
440: localized faulty processes and the failure events on each process,
441: $P_e$ and $E_e$, respectively. Our algorithm consists of two stages
442: as follows:
443:
444: \begin{itemize}
445: \item Identification of failure events (see lines 7--14 in Figure
446: \ref{fig:algorithm}).
447: \item Localization of faulty processes (see lines 15--37 in Figure
448: \ref{fig:algorithm}).
449: \end{itemize}
450:
451: At the first stage, the algorithm identifies all failure events.
452: After this stage, it localizes
453: faulty processes by backtracing communication dependencies in a recursive
454: manner. Our algorithm then classifies program failures into the
455: following four situations:
456:
457: \begin{description}
458: \item[(a) Calculation fault:]
459:
460: Figure \ref{fig:failure-status}(a) illustrates this
461: situation. As a result of backtracing, our algorithm finds
462: that process $s$ terminates unexpectedly and has no
463: communication dependency on any other process. Therefore, the
464: algorithm determines that the faulty process is process
465: $s$, which causes a calculation fault.
466:
467: \item[(b) Non-occurred event:]
468:
469: Figure \ref{fig:failure-status}(b) illustrates this
470: situation, in which process $s$ has a communication
471: dependency from $r$ but terminates successfully.
472: In this situation, either process $r$ could have sent a
473: redundant message or process $s$ could
474: have failed to call a receive routine.
475: However, it is difficult to
476: automatically identify which of processes
477: $r$ and $s$ is faulty. Therefore, our
478: algorithm determines that both
479: processes $r$ and $s$ are faulty, that is, the normally terminated
480: process and the process left waiting by it.
481:
482: \item[(c) Deadlock:]
483:
484: A deadlock occurs if there exists a cyclic communication
485: dependency. In Figure \ref{fig:failure-status}(c),
486: processes $q$, $r$ and $s$ fall into a deadlock. Our
487: algorithm determines that the faulty processes are all the
488: processes that participate in the deadlock.
489:
490: \item[(d) Buffer overflow:]
491:
492: In Figure \ref{fig:failure-status}(d), process $s$ causes
493: a buffer overflow.
494: As in situation (b), it is difficult to
495: identify which of processes $r$ and $s$ has called an MPI
496: routine with an invalid buffer length. Therefore, our
497: algorithm determines that both
498: processes $r$ and $s$, which have a pair of truncated events, are faulty.
499: \end{description}
500:
501: Note that the algorithm described in Figure
502: \ref{fig:algorithm} backtraces communication dependencies by assuming that
503: all the source/destination ranks are valid. Therefore, if a faulty
504: process calls an MPI routine with an invalid source/destination,
505: this algorithm may omit the faulty process from the localized
506: processes. We discuss this problem later in Section
507: \ref{subsec:studies_applicability}.
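
The recursion in Figure \ref{fig:algorithm} can be traced on small inputs
with the following Python sketch. It is an illustration under simplifying
assumptions, not the MPI-PD implementation: the event graph is reduced to a
hypothetical dictionary {\tt failure} that maps each rank to {\tt None} (no
failure event), the string {\tt "calc"} (a calculation event), or the
partner rank $ptnr(fe_p)$, and ranks are assumed to be positive integers so
that a returned rank cannot be confused with the return value 0.

\begin{verbatim}
def localize_faulty_processes(ranks, failure):
    """Return the set of localized faulty ranks (P_e in the figure).

    ranks:   iterable of positive integer ranks (assumption of this sketch)
    failure: hypothetical dict, rank -> None | "calc" | partner rank
    """
    faulty = set()

    def backtrace(p, deps):
        fe = failure.get(p)
        if p in faulty or (fe is None and not deps):
            return 0                 # already traced, or valid
        if fe == "calc":
            return -1                # (a) calculation fault
        if fe is None:
            return -2                # (b) non-occurred event
        if p in deps:
            return p                 # (c) deadlock or (d) overflow
        q = fe                       # partner rank of the failure event
        ret = backtrace(q, deps | {p})
        if ret != 0:
            faulty.add(q)            # process q has faults
        if ret == p:
            ret = 0                  # the cycle is closed at p
        elif ret < 0:
            ret += 1                 # -2 reaches one more caller, -1 stops
        return ret

    for p in ranks:
        if backtrace(p, frozenset()) != 0:
            faulty.add(p)
    return faulty


# (a) 1 waits for 2, 2 waits for 3, 3 fails in a calculation event.
case_a = localize_faulty_processes([1, 2, 3], {1: 2, 2: 3, 3: "calc"})
# (b) 1 waits for 2, but 2 terminated normally.
case_b = localize_faulty_processes([1, 2], {1: 2, 2: None})
# (c) 1, 2 and 3 wait for each other in a cycle.
case_c = localize_faulty_processes([1, 2, 3], {1: 2, 2: 3, 3: 1})
\end{verbatim}

Under these assumptions, case (a) localizes only rank 3, case (b)
localizes ranks 1 and 2, and case (c) localizes all three ranks, which
agrees with the classification given above.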
508:
509: \section{MPI-PreDebugger}
510: \label{sec:mpipd}
511: This section presents the details of MPI-PD, including its environment
512: for debugging and its mechanism for run-time error detection.
513:
514: \begin{figure*}[ht]
515: \centering
516: \includegraphics[width=14.0cm]{eps/debugging-process.eps}
517: \caption{Debugging process with MPI-PD.}
518: \label{fig:debugging-process}
519: \end{figure*}
520:
521: \subsection{Overview of debugging environment}
522: \label{subsec:overview}
523:
524: Figure \ref{fig:debugging-process} shows the debugging process with
525: MPI-PD. The debugging functions in MPI-PD are implemented using the
526: C++ language and the Ruby-GNOME toolkit \cite{www02rubygnome} and
527: composed of three components: the instrumentation tool mpi2pd, the
528: run-time error detection library libpdmpi.a, and the localization
529: and visualization tool pdview.
530:
531: The instrumentation tool mpi2pd automatically replaces all of the
532: MPI routines in programs with instrumented MPI routines
533: based on pattern-match rules.
534: The instrumented routine is a combination of the original MPI routine
535: and the run-time error detection function.
536: After this replacement, developers have to generate the
537: object codes by compiling their programs and the executable binary
538: file by linking the object codes with the run-time error detection
539: library.
540:
541: The run-time error detection library checks communication errors
542: whenever the processes call the instrumented MPI routines (see Section
543: \ref{subsec:error-detection}). If the library detects any
544: communication error, it terminates program execution and generates a
545: trace file. The trace file has the following information for every
546: event observed during program execution: (1) event number, (2) process
547: rank, (3) corresponding line in source code and its file name, and (4)
548: corresponding MPI routine and its arguments.
549:
550: Given a trace file, the visualization tool pdview allows developers to
551: view the behavior of the terminated program, as shown in Figure
552: \ref{fig:debugging-process}. It visualizes the event graph, with
553: processes along the vertical axis and time along the horizontal axis,
554: and shows the result of the fault localization described in Section
555: \ref{sec:algorithm}. In the event graph, a colored node corresponds
556: to an event and the type of the MPI operation that caused the event
557: decides its color. A solid line between two nodes
558: corresponds to a successful communication while a dotted line
559: corresponds to a failed communication.
560:
561: In default mode, pdview avoids visualizing the entire event graph. It
562: visualizes all of the failure events that occurred on each process and the
563: successful events that occurred directly before the failure
564: events. Furthermore, pdview can isolate faulty processes from
565: the event graph. Developers can visualize an isolated event graph by
566: selecting whichever processes they want. In addition to
567: these visualization functions, pdview also shows the following information:
568:
569: \begin{itemize}
570: \item Faulty processes localized by the proposed algorithm.
571: \item Failure situation selected from four situations (see
572: Figure \ref{fig:failure-status}).
573: \end{itemize}
574:
575: Furthermore, developers can investigate every visualized event. If
576: they click the mouse on a node in the visualized event graph, then
577: pdview pops up a dialog, which shows information (1)--(4) about the
578: corresponding event and its error reason (isolated/truncated). This
579: information is useful for developers to locate faults in
580: programs. After this fault localization, source-level debuggers can
581: effectively assist developers in investigating the detailed behavior of
582: the localized part.
583:
584:
585: \subsection{Mechanism for run-time error detection}
586: \label{subsec:error-detection}
587: MPI-PD checks the occurrence of communication errors during program
588: execution. If it detects any errors, it generates a trace file.
589:
590: To realize this, we employ three methodologies. We first discuss
591: the synchronous blocking send ({\tt MPI\_Ssend}) and then the others. The
592: three methodologies are as follows:
593:
594: \begin{itemize}
595: \item Manager process:
596: To generate trace files under a deadlock situation, we
597: employ a manager process $M_p$ for every process $p$.
598: $M_p$ checks the value
599: of $is\_failed(e_{p,i})$ before its responsible process $p$
600: executes event $e_{p,i}$.
601: We describe how to check $is\_failed(e_{p,i})$
602: in the following paragraphs.
603: If $M_p$ obtains
604: $is\_failed(e_{p,i})=false$, it allows $p$ to execute event
605: $e_{p,i}$ and pushes the information about $e_{p,i}$ into
606: its local Event Graph $E_p$. Otherwise, it detects a
607: communication error, terminates $p$ and generates a trace file
608: from $E_p$.
609:
610: \item Message queue: To handle nonblocking communications, we employ
611: a message queue. For nonblocking communications, to decide the
612: failure of completion event $e_{p,k}$, we have to refer to the
613: information about its corresponding initiation event $e_{p,i}$
614: ($e_{p,i} \nor e_{p,k}$). Therefore, for all processes $p$,
615: manager $M_p$ has its own message queue $Q_p$ for
616: referring to the information about the past events.
617:
618: \item Timeout mechanism: We also employ a timeout mechanism because
619: it is difficult to distinguish valid from invalid
620: computation. For example, a receive event $e_{q,j}$ that never
621: receives a message has to be judged as
622: $is\_isolated(e_{q,j})=true$. However, it is hard for $M_q$
623: to identify whether the sender $p$ will send the message or not.
624: That is, $p$ may send the message after heavy computation or may
625: fall into an infinite loop. Therefore,
626: $M_p$ holds a timeout time $t(e_{p,i})$ for every
627: $e_{p,i}$ and decides $is\_isolated(e_{p,i})=true$ when the
628: time is up.
629: \end{itemize}
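
The interplay of the message queue and the timeout mechanism can be
sketched in Python as follows. This is a simplified illustration and not
the code of the run-time library: the class {\tt PendingQueue} and its
methods are hypothetical names, and the current time is passed in
explicitly as {\tt now} so that the behavior is deterministic.

\begin{verbatim}
class PendingQueue:
    """Hypothetical stand-in for the per-manager queue Q_p."""

    def __init__(self):
        self._deadline = {}          # event id -> absolute deadline

    def push(self, event, now, timeout):
        # Register a pending event together with its timeout time t(e).
        self._deadline[event] = now + timeout

    def complete(self, event):
        # A matching partner was found: the event is no longer pending.
        self._deadline.pop(event, None)

    def expired(self, now):
        # Events whose time is up are judged to be isolated.
        return sorted(e for e, d in self._deadline.items() if now >= d)


queue = PendingQueue()
queue.push(("p", 1), now=0.0, timeout=5.0)
queue.push(("p", 2), now=1.0, timeout=5.0)
before = queue.expired(now=4.0)      # nothing has timed out yet
queue.complete(("p", 2))             # partner of the second event arrived
after = queue.expired(now=5.0)       # only the first event is left
\end{verbatim}

The choice of the timeout value is left open in this sketch; a value that
is too short would misjudge a sender that is merely busy with heavy
computation.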
630:
631: Figure \ref{fig:error-detection} shows the process of run-time error
632: detection for {\tt MPI\_Ssend}. In Figure \ref{fig:error-detection},
633: the manager of the sender has three states (states C, S1 and S2) and
634: that of the receiver has four states (states C, R1, R2 and R3) as
635: follows:
636:
637: \begin{figure*}[ht]
638: \centering
639: \includegraphics[width=6.0cm]{eps/error-detection-success.eps}
640: \qquad
641: \includegraphics[width=6.0cm]{eps/error-detection-failure.eps}
642:
643: (a) Successful case \hspace{13em}(b) Failure case
644: \caption{Process of run-time error detection for the synchronous
645: blocking send ({\tt MPI\_Ssend}). Events $e_{p,i}$ and $e_{q,j}$
646: correspond to {\tt MPI\_Ssend} and {\tt MPI\_Recv} calls,
647: respectively.}
648: \label{fig:error-detection}
649: \end{figure*}
650:
651:
652: \begin{description}
653: \item[Common state for the sender/receiver:]
\item State C: {\em Timeout checking and control-message waiting}. In
this state, $M_p$ keeps checking $Q_p$ for timed-out
events until it receives a control
message (an ack or a request message) from $p$ or another
658: manager. If $M_p$ detects a timeout event $e_{p,i}$,
659: then it decides $is\_failed(e_{p,i})=true$ and sends an
660: abort request $abort_p(e_{p,i})$ to $p$. It also adds the
661: failure event $e_{p,i}$ to $E_p$ and terminates.
662: If $M_p$ receives a control message, then it changes
663: its state to an appropriate state.
664:
665: \item[States for the sender:]
666: \item State S1: {\em Send initiating}. If $M_p$ receives a send
667: request $req_p(e_{p,i})$ from $p$, then it pushes the
668: information about $e_{p,i}$ into $Q_p$ with
669: $t(e_{p,i})$. It also checks the destination rank of
670: $e_{p,i}$ and transmits a send request $req_m(e_{p,i})$ to
671: the destination process's manager, $M_q$ (go to state C).
672: \item State S2: {\em Message sending}. If $M_p$ receives an
673: ack $ack_m(e_{q,j})$ from another manager, then it
674: searches $Q_p$ and selects
675: $e_{p,i}$ such that $is\_isolated(e_{p,i}) = false$. It
676: also checks whether $e_{p,i}$ and $e_{q,j}$ are
677: truncated events.
678: \begin{itemize}
679: \item If $is\_truncated(e_{p,i},e_{q,j})=false$, $M_p$
680: decides $is\_failed(e_{p,i})=false$ and sends an
681: ack $ack_p(e_{p,i})$ to
682: $p$. After this acknowledgement, it deletes
683: $e_{p,i}$ from $Q_p$, and adds
684: both $e_{p,i}$ and $e_{q,j}$ to $E_p$ (go to
685: state C).
686: \item Otherwise, $M_p$ decides
687: $is\_failed(e_{p,i})=true$ and sends an abort
688: request $abort_p(e_{p,i})$ to $p$.
689: It also adds both $e_{p,i}$ and $e_{q,j}$ to $E_p$
690: as failure events and terminates.
691: \end{itemize}
692:
693: \item[States for the receiver:]
\item State R1: {\em Receive initiating}. If $M_q$ receives a receive
request $req_q(e_{q,j})$, then it searches $Q_q$ and
selects $e_{p,i}$ such that $(is\_isolated(e_{p,i}) \lor
is\_isolated(e_{q,j}))=false$.
698:
699: \begin{itemize}
700: \item If such $e_{p,i}$ exists, $M_q$
701: decides that $e_{p,i}$ and $e_{q,j}$ are the
702: matching events (go to state R3).
703: \item Otherwise, it leaves the error detection on
704: $e_{q,j}$ and pushes the information about
705: $e_{q,j}$ into $Q_q$ with $t(e_{q,j})$ (go to
706: state C).
707: \end{itemize}
708:
709: \item State R2: {\em Send-request receiving}. If $M_q$
710: receives a request $req_m(e_{p,i})$ from another manager,
711: then it searches $Q_q$ and selects
712: $e_{q,j}$ such that
$(is\_isolated(e_{p,i}) \lor is\_isolated(e_{q,j}))=false$.
714:
715: \begin{itemize}
716: \item If such $e_{q,j}$ exists, $M_q$
717: decides that $e_{p,i}$ and $e_{q,j}$ are the
718: matching events (go to state R3).
719: \item Otherwise, it leaves the error detection on
720: $e_{p,i}$ and pushes the information about
721: $e_{p,i}$ into $Q_q$ with $t(e_{p,i})$ (go to
722: state C).
723: \end{itemize}
724:
725: \item State R3: {\em Message receiving}. $M_q$ sends an ack
726: $ack_m(e_{q,j})$ to $M_p$. It then checks if
727: $e_{p,i}$ and $e_{q,j}$ are truncated events.
728:
729: \begin{itemize}
730: \item If $is\_truncated(e_{p,i},e_{q,j})=false$, then
731: $M_q$ decides $is\_failed(e_{q,j})=false$
732: and sends an ack $ack_r(e_{q,j})$ to
$q$. After this acknowledgement, it deletes
the queued event ($e_{q,j}$ or $e_{p,i}$) from $Q_q$
735: and adds both $e_{p,i}$ and $e_{q,j}$ to $E_q$
736: (go to state C).
737: \item If $is\_truncated(e_{p,i},e_{q,j})=true$, then
738: $M_q$ decides $is\_failed(e_{q,j})=true$
739: and sends an abort request $abort_q(e_{q,j})$ to $q$.
740: It also adds both $e_{p,i}$ and $e_{q,j}$ to $E_q$
741: as failure events and terminates.
742: \end{itemize}
743: \end{description}
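
The receiver-side states R1, R2, and R3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual MPI-PD code: the class {\tt ReceiverManager}, the matching rule (same source, destination, and tag), and the truncation rule (the message is longer than the receive buffer) are all hypothetical, and the timeout handling of state C is omitted.

\begin{verbatim}
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str      # "send" or "recv"
    src: int       # rank of the sending process
    dst: int       # rank of the receiving process
    tag: int
    length: int    # message length (send) or buffer length (recv)


def matches(send, recv):
    """Assumed matching rule: same source, destination, and tag."""
    return (send.src, send.dst, send.tag) == (recv.src, recv.dst, recv.tag)


def is_truncated(send, recv):
    """Assumed rule: the message does not fit into the receive buffer."""
    return send.length > recv.length


class ReceiverManager:
    """Hypothetical sketch of M_q handling states R1, R2, and R3."""

    def __init__(self):
        self.queue = []   # Q_q: events still waiting for a partner
        self.trace = []   # E_q: (send, recv, failed) records

    def on_request(self, event):
        """R1 (receive request from q) or R2 (send request from M_p)."""
        for queued in self.queue:
            if queued.kind == event.kind:
                continue
            if event.kind == "send":
                send, recv = event, queued
            else:
                send, recv = queued, event
            if matches(send, recv):
                self.queue.remove(queued)
                return self._receive(send, recv)   # go to state R3
        self.queue.append(event)                   # defer; back to state C
        return "queued"

    def _receive(self, send, recv):
        """R3: decide success or failure of the matched pair."""
        failed = is_truncated(send, recv)
        self.trace.append((send, recv, failed))
        return "abort" if failed else "ack"


m_q = ReceiverManager()
print(m_q.on_request(Event("recv", src=0, dst=1, tag=7, length=16)))
print(m_q.on_request(Event("send", src=0, dst=1, tag=7, length=32)))
```
\end{verbatim}

In the usage above, the receive request arrives first and is queued; the later send request matches it, and because the 32-unit message exceeds the 16-unit buffer, the pair is recorded as a failure.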
744:
The manager processes buffer all events until they detect an error,
so their local memory may become full.
Our algorithm described in Figure \ref{fig:algorithm} requires the
failure events on each process.
Therefore, if the local memory of $M_p$ is full,
we allow $M_p$ to delete the information about the oldest successful event
from $E_p$.
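
One way to realize such a bounded buffer is sketched below in Python. The class {\tt EventLog} and its {\tt capacity} parameter are hypothetical names introduced for illustration; the sketch only demonstrates the policy of discarding the oldest successful event while always keeping failure events.

\begin{verbatim}
```python
from collections import deque


class EventLog:
    """Hypothetical bounded buffer for E_p that never drops failure events."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.successes = deque()   # successful events, oldest first
        self.failures = []         # failure events are always kept

    def add(self, event, failed):
        if failed:
            self.failures.append(event)
        else:
            self.successes.append(event)
        # Evict the oldest successful events while the log is over capacity.
        while (len(self.successes) + len(self.failures) > self.capacity
               and self.successes):
            self.successes.popleft()


log = EventLog(capacity=3)
for name, failed in [("e1", False), ("e2", False), ("e3", False), ("e4", True)]:
    log.add(name, failed)
print(list(log.successes), log.failures)
```
\end{verbatim}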
752:
Here, recall that we have to keep the communication semantics, as
explained in Section \ref{subsec:comm_faults}. Therefore, for the
blocking buffered mode send ({\tt MPI\_Bsend}), we alter the sequence
of error detection. That is, to keep the buffered behavior of message
passing, process $p$ passes the original message immediately after
sending request $req_p(e_{p,i})$ to its manager $M_p$. This
alteration omits receiving an ack $ack_p(e_{p,i})$ from
$M_p$. To compensate for this omission, $p$ checks for an abort message
$abort_p(e_{p,i})$ from $M_p$ whenever it calls an
instrumented MPI routine. If $p$ has received the abort message
$abort_p(e_{p,i})$, it terminates its execution. Otherwise, it
continues processing the original routine.
This alteration allows $p$ to execute a few events after the original
faulty event; however, it does not affect faulty process
localization, since $M_p$ identifies the faulty event correctly.
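
The altered sequence for the buffered send can be modeled by the following Python sketch. It does not call MPI; the class {\tt InstrumentedProcess}, the {\tt abort\_inbox} list, and the {\tt notify\_manager} callback are hypothetical stand-ins for process $p$, the abort messages from $M_p$, and the request $req_p(e_{p,i})$, respectively.

\begin{verbatim}
```python
class InstrumentedProcess:
    """Hypothetical model of process p with a buffered-send wrapper."""

    def __init__(self):
        self.abort_inbox = []   # abort messages posted by manager M_p
        self.sent = []          # messages handed to the MPI library
        self.terminated = False

    def _check_abort(self):
        """Run at the entry of every instrumented MPI routine."""
        if self.abort_inbox:
            self.terminated = True
        return self.terminated

    def bsend(self, event_id, message, notify_manager):
        if self._check_abort():
            return False
        notify_manager(event_id)    # send the request to M_p ...
        self.sent.append(message)   # ... and pass the message without an ack
        return True


p = InstrumentedProcess()
requests = []
p.bsend("e1", "payload-1", requests.append)
p.abort_inbox.append("abort(e1)")   # M_p later decides that e1 failed
print(p.bsend("e2", "payload-2", requests.append), p.sent)
```
\end{verbatim}

The second call returns without sending anything, which corresponds to $p$ terminating at the next instrumented routine after the faulty event.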
768:
769: For nonblocking communications, we process states S1 and R1 at the
770: send initiation and the receive initiation of nonblocking operations,
771: respectively; and process send acks at the completion of the nonblocking
772: operations. For collective communications, we can apply the same
773: approach as for the blocking mode point-to-point routines, since the
774: collective communications can be implemented by using those
775: point-to-point routines.
776:
777: Thus, exchanging information about every event among managers
778: enables us to detect communication errors and generate trace files
779: before program failure.
780:
781: \section{Case Studies: Debugging Message Passing Programs with MPI-PD}
782: \label{sec:studies}
783:
In this section, we present three case studies. The aim of each
study is to investigate the effectiveness of MPI-PD from one of the
following points of view:
787:
788: \begin{enumerate}
\item {\em Applicability}:
We investigated the kinds of faults for which MPI-PD is effective.
To do this, we applied MPI-PD to 28 Gaussian elimination
programs developed by MPI beginners (see Section
\ref{subsec:studies_applicability}).
794:
795: \item {\em Scalability}:
796: This study shows an example of scalable debugging using
797: MPI-PD. We applied MPI-PD to a parallel rendering program
798: \cite{takeuti03sac} developed by MPI experts on 64 processes
799: (see Section \ref{subsec:studies_scalability}).
800:
801: \item {\em Usability}:
802: We investigated the usability of faulty process
803: localization. To do this, we applied MPI-PD to a complicated
804: program generated automatically by a parallelizing
compiler \cite{y-yamamt01ebcsh}. We also compared the
visualization results of the proposed MPI-PD with those of the existing
TotalView debugger \cite{etnus01tv} (see Section
\ref{subsec:studies_usability}).
809: \end{enumerate}
810:
811: \begin{table*}[tb]
812: \caption{Summary of case studies. $|L|$, $|P|$, and $|E|$ represent
813: the numbers of lines, processes, and events, respectively.}
814: \label{tab:summary}
815: \begin{center}
816: \begin{tabular}{|l|l|c|l|c|c|}\hline
817: \lw{Case study} & \multicolumn{3}{l|}{Details of program}
818: & \multicolumn{2}{l|}{Details of trace file}
819: \\ \cline{2-6}
820: & Developer & $|L|$ & Employed MPI routines
821: & $|P|$ & $|E|$
822: \\ \hline
823: 1. Applicability & Beginner & ~~~300 & {\tt Send}, {\tt Recv}, {\tt Isend}, {\tt Irecv}, {\tt Wait}
824: & ~~~4~~ & ~~412
825: \\ \hline
826: 2. Scalability & Expert & 40,000 & {\tt Send}, {\tt Recv}, {\tt Sendrecv}
827: & ~~64~~ & 9,774
828: \\ \hline
829: 3. Usability & Compiler & 20,000 &{\tt Isend}, {\tt Irecv}, {\tt Waitall}
830: & ~~15~~ & ~~253
831: \\ \hline
832: \end{tabular}
833: \end{center}
834: \end{table*}
835:
836: Table \ref{tab:summary} shows a summary of the above studies. In the
837: following, we omit ``{\tt MPI\_}'', the prefix of MPI routines, as shown
838: in Table \ref{tab:summary}.
839:
In these studies, we used a PC cluster with 64 symmetric
multiprocessor (SMP) nodes. Each node in the cluster has two 1-GHz
Pentium III processors and is connected to a Myrinet-2000
switch \cite{nanette95myrinet}. We used MPICH-GM \cite{www02mpichgm}
as the MPI implementation.
845:
846: \subsection{Study 1: Applicability of MPI-PD}
847: \label{subsec:studies_applicability}
848:
In this study, we applied MPI-PD to 28 faulty programs developed by
six graduate students during an MPI programming exercise. These
programs solve simultaneous equations using Gaussian elimination.
852:
853: \begin{table}[tb]
854: \caption{Application results of MPI-PD.}
855: \label{tab:application}
856: \begin{center}
857: \begin{tabular}{|l|c|c|}\hline
858: \lw{Debugging phase} & \multicolumn{2}{c|}{Number of programs}\\ \cline{2-3}
859: & Success & Failure\\ \hline
860: MPI Program execution & 13 of 28 & 15 of 28\\ \hline
861: Event graph visualization & 15 of 15 & \hspace{.5em}0 of 15 \\ \hline
862: Faulty process localization & 12 of 15 & \hspace{.5em}3 of 15\\ \hline
863: \end{tabular}
864: \end{center}
865: \end{table}
866:
867: We first executed the programs on our PC cluster and then
868: visualized localization results by using MPI-PD. Table
869: \ref{tab:application} shows the application results at each
870: debugging phase.
871:
At the execution phase, 15 of 28 programs terminated unexpectedly. As
we mentioned in Section \ref{sec:introduction}, the current MPI-PD
focuses on faults that cause program failures, so it could not visualize the
event graph for the remaining 13 programs, which never terminated
unexpectedly but returned incorrect results. These programs contain semantic faults
such as invalid specifications of operators/variables and invalid
writing to message buffers before the completion of nonblocking
communications.
880:
At the localization phase, MPI-PD successfully localized faulty
processes for 12 of 15 programs, while it failed to localize them for
the remaining three programs. These three programs have calculation
faults activated by all processes at the same statement. Therefore,
every process terminated outside the instrumented MPI routines, so
that their trace files contained no information about failure
events. Thus, MPI-PD failed to localize their faulty
processes. However, in these cases, since every process terminates
without any communication dependency, error propagation cannot
occur. Therefore, developers have to investigate every process. That
is, they have to investigate the code between the last MPI
routine that completed successfully and the next MPI routine expected to be
executed, especially the statements that every process
executes in common.
895:
The 12 programs for which MPI-PD successfully localized faulty processes
had a variety of faults, classified into the following four types.
Notice that MPI-PD localized not the faults but the faulty processes
that activate them.
900:
901: \begin{itemize}
902: \item Invalid source/destination rank (six programs).
903: \item Invalid length of message buffer (three programs).
904: \item Calculation fault (two programs).
\item Deadlock that occurs when passing long messages (one program).
906: \end{itemize}
907:
We next confirmed that no faulty process was omitted from the
localized results. For all cases where invalid source/destination
ranks were specified, MPI-PD pointed out the deadlocked processes,
including the faulty process. The deadlocked processes
pointed out by MPI-PD can therefore include valid processes, so
there is room for improving the accuracy of localization. However,
this redundancy was only a minor problem for the programs examined in this
study. Since their faults appear with any number of processes,
developers can scale down the number of processes without
missing the activated faults.
918:
919: \subsection{Study 2: Scalable debugging with MPI-PD}
920: \label{subsec:studies_scalability}
921:
922: \begin{figure}[ht]
923: \centering
924: \includegraphics[width=8.3cm]{eps/VR64-all.eps}
925: \caption{Localized faulty processes in event graph visualized by MPI-PD.}
926: \label{fig:vr-all}
927: \end{figure}
928:
We applied MPI-PD to a parallel rendering program
\cite{takeuti03sac} executed on 64 processes. This program has a
fault in gathering and compositing rendered images generated by distributed
processors. For the purpose of high-speed compositing, the
developers have implemented their own collective communication routines
for the gather and the broadcast operations by using the point-to-point
routines {\tt Send} and {\tt Recv}. Their collective routines are
called at every compositing stage, with the processes of each group
being split into two groups from one stage to the next.
That is, given $n$ processes, each of $2^{i-1}$ groups
performs collective communications at the $i^{\rm th}$ stage, where
$1 \leq i \leq \log_2 n$.
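
As a worked instance of this staging, take the $n = 64$ processes used in this study and assume that the logarithm is taken to base 2:
\[
  \log_2 64 = 6, \qquad 2^{i-1} = 1,\ 2,\ 4,\ 8,\ 16,\ 32 \quad (i = 1, \ldots, 6).
\]
If, in addition, the groups at each stage are of equal size (an assumption made here only for illustration), each group at the $i^{\rm th}$ stage contains $64 / 2^{i-1}$ processes, that is, 64, 32, 16, 8, 4, and 2 processes, respectively.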
940:
Figure \ref{fig:vr-all} shows the event graph for all processes
visualized by MPI-PD. While the program generates a total of 9,774
events, the visualized event graph is composed of 164 events: 64
failure events and 100 successful events that occurred directly before the
failure events. In Figure \ref{fig:vr-all}, MPI-PD points out five
faulty processes among the 64 processes: processes PE21, PE37, PE44, PE48,
and PE52. It also points out that these five processes fall into a
deadlock and that each of them has one failure event.
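
The size of this reduction can be checked with simple arithmetic. The ratios below are rounded and are derived here from the counts given in the text; they are not separately reported measurements:
\[
  64 + 100 = 164, \qquad \frac{164}{9{,}774} \approx 0.017, \qquad \frac{5}{64} \approx 0.078 .
\]
That is, the event graph shows roughly 1.7\% of the generated events, and the five processes pointed out amount to roughly 7.8\% of the 64 processes.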
949:
As we mentioned in Section \ref{subsec:overview}, MPI-PD allows
developers to visualize whichever processes they want. For
example, developers can view only the deadlocked processes, as shown in
Figure \ref{fig:vr-fp}, so that they can easily see how the processes fell
into the deadlock. They can also add related processes that
communicated with the deadlocked processes (see Figure
\ref{fig:vr-fpplus}), so that they can intuitively see that process PE48
received many messages compared to the other four faulty processes:
processes PE21, PE37, PE44, and PE52.
959:
Thus, MPI-PD guided the developers to the five faulty events, so that
they easily found that process PE48, the root process of a
broadcast operation, made an extra call of the {\tt Send}
routine because of a missing {\tt break} statement. Therefore, MPI-PD assists
developers in scalable debugging, where the numbers of processes and
events are too large for them to understand the behavior of programs.
966:
We also note that the buffered send operation complicates
locating faults, since this operation causes a gap between the faulty
send event and the failure event. For example, when we executed
the rendering program without error detection, process PE48
pushed out messages in the buffered mode, so it successfully returned
from the faulty {\tt Send} routine and terminated at a succeeding {\tt
Recv} routine. Therefore, without MPI-PD, the developers may end up
investigating the {\tt Recv} routine, whose error is not the original
one but a result of error propagation. Thus, MPI-PD's run-time error
detection is necessary for handling the buffered send operation.
977:
978:
979: \begin{figure}[ht]
980: \centering \includegraphics[width=8.3cm]{eps/VR64-fp.eps}
981: \caption{Faulty processes isolated by MPI-PD. This graph shows only
982: faulty processes and communications among them.}
983: \label{fig:vr-fp}
984: \end{figure}
985:
986: \begin{figure}[ht]
987: \centering \includegraphics[width=8.3cm]{eps/VR64-fpplus.eps}
\caption{Faulty processes and their related processes isolated by MPI-PD.
Related processes are those with which the faulty processes communicate.}
990: \label{fig:vr-fpplus}
991: \end{figure}
992:
993:
994: \subsection{Study 3: Comparison with existing debuggers}
995: \label{subsec:studies_usability}
996:
To clarify the usability of fault localization, we compared MPI-PD
998: with TotalView \cite{etnus01tv} by applying them to a complicated program. This
999: program is automatically generated by a parallelizing compiler based
1000: on a task scheduling algorithm, Scheduling with Packaged
1001: Point-to-point Communications (SPPC) \cite{y-yamamt01ebcsh}.
1002:
1003: The MPI program generated by SPPC consists of two layers, the calculation
1004: and the communication layers, which repeatedly appear during program
1005: execution. In the calculation layer, each process independently
1006: performs calculation without any communication. In the communication
1007: layer, it exchanges messages by calling nonblocking communication
1008: routines. Each process first calls many initiation routines,
{\tt Isend} and {\tt Irecv}, then a completion routine, {\tt
Waitall}. Since the parallelizing compiler mechanically generates
large-scale MPI programs, debugging them is complicated
work. Furthermore, since the {\tt Waitall} routine completes all of
the initiated communications at once, it is time-consuming to
distinguish the failed communications from the many communications
completed by the {\tt Waitall} routine.
1016:
Figure \ref{fig:sppc} shows the visualizations
obtained by MPI-PD and TotalView. While MPI-PD
visualizes all of the failure events that occurred on each process and the
successful events that occurred directly before the failure events,
TotalView shows {\em pending sends/receives} and {\em
unexpected messages} \cite{james99debugger,etnus01tv} at an arbitrary execution step.
1023: Pending sends/receives represent the sends/receives that have been
1024: initiated but have not yet been matched. Unexpected messages
1025: represent messages that have been sent to a process but have not
1026: yet been received.
1027:
In this program, every process terminated at a call of the {\tt Waitall}
routine. At the termination, the processes tried to complete a total
of 171 nonblocking operations. For this faulty program, TotalView
visualizes 50 pending receives, represented as arrows in Figure
\ref{fig:sppc}(b). However, it is time-consuming for the developers to
investigate each of the 50 pending receives. On the other hand, MPI-PD
checks every communication for errors and localizes faulty
processes, so that it visualizes 34 of the 171 events as shown in Figure
\ref{fig:sppc}(a). Since eight of the 34 events are successfully communicated
events, MPI-PD reduces the number of events that have to be
investigated from 171 to 26. Furthermore, it points out that
processes PE5 and PE10 fall into a deadlock. Here, processes PE5 and
PE10 have three and seven error events, respectively, so that the
number of events that have to be investigated is reduced further,
to 10 of the 171 events.
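
The counts in this paragraph fit together as follows. The ratios are rounded and are computed here from the reported counts for illustration only:
\[
  34 - 8 = 26, \qquad 3 + 7 = 10, \qquad
  \frac{34}{171} \approx 0.199, \qquad \frac{26}{171} \approx 0.152, \qquad \frac{10}{171} \approx 0.058 .
\]
In other words, the developer has to inspect roughly 15\% of the nonblocking operations after the successful events are excluded, and roughly 6\% once the investigation is restricted to the two deadlocked processes.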
1043:
With the assistance of MPI-PD, the developer successfully debugged
this program in less than five minutes. He first investigated process PE5
and confirmed that it had no fault, and then investigated process
PE10. Finally, he found the fault: an invalid source was
specified in a call of the {\tt Irecv} routine.
1049:
1050: \begin{figure}[ht]
1051: \centering
1052: \includegraphics[width=6.0cm]{eps/sppc15.eps}
1053: \qquad
1054: \includegraphics[width=6.0cm]{eps/mqgL.eps}
1055:
1056: \hspace{1em}(a) Event Graph by MPI-PD \hspace{6em}(b) Message Queue Graph by TotalView
1057: \caption{Visualizations obtained by MPI-PD and TotalView.}
1058: \label{fig:sppc}
1059: \end{figure}
1060:
1061: \begin{table*}[tb]
\caption{Differences among MPI-PD, TotalView, and DeWiz.}
1063: \label{tab:difference}
1064: \begin{center}
1065: \begin{tabular}{|l|c|c|c|} \hline
1066: Function & MPI-PD & DeWiz \cite{kran02pdp,kran02ipdps} & TotalView \cite{etnus01tv}\\ \hline
1067: 1. Faulty process localization & by dependency analysis & --- & ---\\ \hline
1068: 2. Run-time error detection & every message & every message & every message\\ \hline
1069: 3. Process grouping & by dependency analysis & by message length & ---\\ \hline
1070: 4. Timeline visualization & yes & yes & ---\\ \hline
1071: 5. Trace file reduction & --- & yes & ---\\ \hline
1072: 6. Stepwise execution & --- & --- & yes\\ \hline
1073: \end{tabular}
1074: \end{center}
1075: \end{table*}
1076:
Table \ref{tab:difference} summarizes the differences among MPI-PD,
TotalView, and DeWiz \cite{kran02pdp,kran02ipdps}. While MPI-PD is
useful for reducing the events that have to be investigated, TotalView allows
us to execute the target program stepwise. DeWiz also provides an
analysis using the event graph. However, DeWiz aims at
identifying closely related processes and reducing the total amount of
trace data. In DeWiz, given a specific process, the process
grouping function accumulates the length of transmitted messages for
every pair of processes and isolates related processes by using a
certain threshold. Therefore, developers have to decide which
processes have to be specified, which is similar to the problem addressed in
this paper. Furthermore, since error propagation has no relevance to message
length, the message-length-based approach is inappropriate for the
purpose of faulty process localization.
1091:
Summarizing the above discussion, DeWiz is useful for reducing the
total amount of trace files, and TotalView is useful for investigating
the detailed behavior of programs. MPI-PD is useful for reducing the
number of events that have to be investigated for
debugging. Therefore, we think that an appropriate combined use of these
tools is a good choice for debugging message passing programs.
For example, developers can first localize faulty processes by using MPI-PD
and then investigate them in detail by using TotalView.
1100:
1101:
1102: \section{Conclusions}
1103: \label{sec:conclusions}
1104: We have presented a novel debugging tool, named MPI-PD, for localizing
1105: faulty processes in message passing programs, aiming at reducing
1106: developers' efforts. MPI-PD helps us to identify the source of
1107: failure from a number of observed errors by automatically checking
1108: communication errors during program execution. If MPI-PD observes any
1109: communication errors, it then generates a trace file, backtraces
1110: communication dependencies and points out potentially faulty
1111: processes in the event graph visualization.
1112:
1113: MPI-PD reduces the amount of debugging information before visualizing
1114: and investigating it by using post-mortem performance debuggers and
1115: source-level debuggers, respectively.
1116: Therefore, we think that appropriate combined use of these tools
1117: is a good choice for debugging message passing programs.
1118:
1119: \section*{Acknowledgements}
1120: This work was partly supported by JSPS Grant-in-Aid for Young
1121: Researchers (B)(15700030), for Scientific
1122: Research (C)(2)(14580374), JSPS Research for the Future Program
1123: JSPS-RFTF99I00903, and Network Development Laboratories, NEC.
1124: We are also grateful to the anonymous reviewers
1125: for their valuable comments.
1126:
1127: \bibliographystyle{plain}
1128: \bibliography{main}
1129:
1130: \end{document}
1131: