1: \documentclass{aadebug}

2: \usepackage[psamsfonts]{amssymb}

3: \usepackage{graphicx}

4: \usepackage{type1cm}

5: \usepackage{times, mathptmx}

6:

7: \newcommand{\lw}[1]{\smash{\lower1.5ex\hbox{#1}}}

8:

9: \newcommand{\sor}{\stackrel{S}{\rightarrow}}

10: \newcommand{\cor}{\stackrel{C}{\rightarrow}}

11: \newcommand{\nor}{\stackrel{N}{\rightarrow}}

12: \newcommand{\hbr}{\rightarrow}

13: \newcommand{\todo}[1]{{\bf (Todo: #1)}}

14:

15: \begin{document}

16: \corr{0309027}{127}

17:

18: \runningheads{Masao Okita et al.}{Debugging Tool for Localizing Faulty Processes in Message Passing Programs}

19:

20: \title{Debugging Tool for Localizing Faulty Processes in Message Passing Programs}

21:

22: \author{

23: Masao~Okita\addressnum{1}\comma\extranum{1},

24: Fumihiko~Ino\addressnum{1}\comma\extranum{1},\\

25: and Kenichi Hagihara\addressnum{1}\comma\extranum{1}

26: }

27:

28:

29: \address{1}{

30: Graduate School of Information Science and Technology,

31: Osaka University\\

32: 1-3 Machikaneyama,

33: Toyonaka,

34: Osaka 560-8531,

35: Japan

36: }

37: \extra{1}{E-mail: \{m-okita,ino,hagihara\}@ist.osaka-u.ac.jp}

38:

39: \pdfinfo{

40: /Title (Debugging Tool for Localizing Faulty Processes in Message Passing Programs)

41: /Author (Masao Okita, Fumihiko Ino, and Kenichi Hagihara)

42: }

43:

44: \begin{abstract}

45: In message passing programs, once a process terminates with an

46: unexpected error, the terminated process can propagate the error to the

47: rest of the processes through communication dependencies, resulting in a program

48: failure. Therefore, to locate faults, developers must identify the

49: group of processes involved in the original error and {\em

50: faulty processes} that activate faults. This paper presents a

51: novel debugging tool, named {\em MPI-PreDebugger} (MPI-PD), for

52: localizing faulty processes in message passing programs. MPI-PD

53: automatically distinguishes the original and the propagated errors by

54: checking communication errors during program execution. If MPI-PD

55: observes any communication errors, it backtraces communication dependencies and

56: points out potential faulty processes in a timeline view. We also

57: introduce three case studies, in which MPI-PD played

58: a key role in debugging. From these studies, we believe that

59: MPI-PD helps developers to locate faults and allows them to concentrate on

60: correcting their programs.

61: \end{abstract}

62:

63: \keywords{parallel processing; message passing; debugging; fault localization}

64:

65: \section{Introduction}

66: \label{sec:introduction}

67: In recent years, cluster/grid computing

68: \cite{buyya99cluster,ian98grid} has emerged as a cost-effective

69: methodology for high performance computing. The message passing

70: paradigm \cite{mpif94mpi} is a widely employed programming paradigm

71: that gives us efficient parallel programs on these computing

72: environments.

73:

74: However, debugging message passing programs is usually time-consuming,

75: since we have to investigate a large amount of debugging information

76: compared to sequential programs. Furthermore, once a process

77: terminates with an unexpected error \cite{melliar-smith77ldrs}, the

78: terminated process can propagate the error to the rest of the processes

79: through communication dependencies. For example, if a process terminates

80: before sending an intended message, the receiver process that has no

81: original fault also terminates, since it fails to receive the expected

82: message. This {\em error propagation} makes it complicated to locate

83: the hidden faults from a number of observed errors.

84:

85: To give developers valuable insights for debugging, a number of

86: debugging tools have been developed for message passing

87: programs. Post-mortem performance debuggers such as ParaGraph

88: \cite{heath91paragraph}, ATEMPT \cite{kran96graph}, XMPI

89: \cite{www02xmpi}, and Vampir \cite{pallas99vampir} visualize a detailed

90: timeline view of communications, so that developers can

91: intuitively understand program behaviors.

92:

93: Source-level debuggers such as TotalView \cite{etnus01tv}, MPIGDB

94: \cite{ralph00mpd}, and CDB \cite{wu02cdb} allow stepwise

95: execution of programs. TotalView also provides a visualization facility,

96: named Message Queue Graph (MQG), which shows the states of the

97: pending send and receive operations. MPIGDB is based on a

98: sequential debugger, GDB \cite{stallman02gdb}, and allows developers

99: to broadcast terminal input to all GDB processes attached to

100: computing processes. CDB also provides a similar debugging

101: environment by employing GDB at its lower layer.

102:

103: {\em Fault localization} \cite{jones02faultlocalzation} is another

104: approach for debugging programs.

105: {\em Relative debugging} \cite{hood00relative,gregory01relative} is

106: a kind of fault localization

107: for programs that have been

108: ported from sequential to parallel architectures or between different

109: parallel architectures.

110: It dynamically compares data between two

111: executing programs, so that it can locate errors in the compared

112: programs. In \cite{robert96racecondition}, Netzer et al.~have pointed

113: out that unforeseen consequences of bugs can cause messages to arrive

114: in unexpected orders. Their algorithm dynamically locates errors by

115: detecting unintended nondeterminism, or race conditions.

116:

117: Process grouping

118: \cite{kran02ipdps,kunz93wpdd,stringhini00pdpta} is a

119: fundamental technique for scalable visualization and debugging. DeWiz

120: \cite{kran02pdp,kran02ipdps} aims at identifying closely related

121: processes and reducing the amount of trace data. Given a specific

122: process, DeWiz isolates the related processes according to the

123: accumulated length of transmitted messages.

124:

125: Thus, a number of tools provide useful debugging functions. However,

126: developers still struggle to pick out the original error from a number

127: of observed errors, which include both original and propagated errors.

128: Once the original error is identified for them,

129: they can immediately investigate faults by using existing debuggers and

130: concentrate on correcting them.

131:

132: In this paper, we propose a novel debugging tool, named {\em

133: MPI-PreDebugger} (MPI-PD), for localizing faulty processes in message

134: passing programs. The current version of MPI-PD supports programs written using the

135: Message Passing Interface (MPI) standard \cite{mpif94mpi} and focuses

136: on faults that terminate program execution. MPI-PD aims at reducing

137: the workload that developers face when localizing faulty processes in

138: a timeline visualization.

139:

140: To achieve this, MPI-PD dynamically checks communication errors in

141: accordance with the error definition in a program execution model. If

142: MPI-PD observes any communication errors, it then generates a trace

143: file, backtraces communication dependencies and points out

144: potentially faulty processes in a timeline view. Thus, MPI-PD reduces

145: the amount of debugging information before developers visualize and

146: investigate it by using performance debuggers and source-level

147: debuggers.

148:

149: The rest of this paper is organized as follows. Section

150: \ref{sec:definition} formally characterizes communication errors in

151: MPI programs and makes clear the differences among faults, errors, and

152: failures. Section \ref{sec:algorithm} gives an algorithm for

153: localizing faulty processes in a given trace file while Section

154: \ref{sec:mpipd} presents MPI-PD, which implements the proposed

155: algorithm. Section \ref{sec:studies} introduces three case studies

156: assisted by MPI-PD. Finally, Section \ref{sec:conclusions} concludes

157: this paper.

158:

159:

160: \section{Modeling Behavior of Message Passing Programs}

161: \label{sec:definition}

162: This section defines communication errors in MPI

163: programs. We define them by extending the program execution model

164: described in \cite{robert92model}.

165:

166:

167: \subsection{Event graph: program execution model}

168: \label{subsec:event_graph}

169: An execution of a message passing program is defined as a directed

170: graph, $G=(E,\rightarrow)$, where $E$ represents a finite set of

171: {\em events} while $\rightarrow$ represents the

172: {\em happened-before relation} \cite{lamport78cacm} defined over $E$

173: \cite{robert92model}. In the following, we call this directed graph

174: the {\em event graph} \cite{kran02pdp}.

175:

176: An event in this context represents the execution instance of a set

177: of consecutively executed statements in some process

178: \cite{robert92model}. Any event $e \in E$ is observed during a

179: program execution. In the following, let $e_{p,i}$ be the

180: $i^{\rm th}$ event on process $p$.

181:

182: The happened-before relation $\hbr$ shows how events potentially

183: affect one another \cite{lamport78cacm}. This relation is defined as the

184: irreflexive transitive closure of the union of two other relations:

185: $\hbr = (\sor \cup \cor)^+$. Here, $\sor$ and $\cor$ respectively

186: represent the sequential order relation and the concurrent order

187: relation as follows \cite{kran02pdp}:

188:

189: \begin{figure}[ht]

190: \centering

191: \includegraphics[scale=.40]{eps/relation-blocking.eps}

192: \qquad

193: \includegraphics[scale=.40]{eps/relation-nonblocking.eps}

194:

195: (a) Blocking communication\hspace{1em}

196: (b) Nonblocking communication

197: \caption{Order relations between events. A node represents an event

198: and an arrow represents a relation.}

199: \label{fig:relation}

200: \end{figure}

201:

202: \begin{description}

203: \item[Sequential order relation, $\sor$:]

204:

205: As illustrated in Figure \ref{fig:relation}(a), the

206: sequential order of events, $e_{p,i} \sor e_{p,i+1}$,

207: states that the $i^{\rm th}$ event $e_{p,i}$ on any

208: sequential process $p$ occurred before the

209: $(i+1)^{\rm th}$ event $e_{p,i+1}$.

210:

211: \item[Concurrent order relation, $\cor$:]

212:

213: As illustrated in Figure \ref{fig:relation}(a), the

214: concurrent order of events, $e_{p,i} \cor e_{q,j}$,

215: states that the $i^{\rm th}$ event $e_{p,i}$ on any

216: process $p$ occurred directly before the $j^{\rm th}$

217: event $e_{q,j}$ on another process $q$, if $e_{p,i}$ is

218: the sending of a message by process $p$ and $e_{q,j}$ is

219: the receipt of the same message by another process $q$.

220:

221: \end{description}

222:

223: Although the event graph is a sufficient model for visualizing the

224: behavior of message passing programs, we have to add one relation to

225: this graph to characterize the errors relevant to

226: nonblocking communications \cite{mpif94mpi}. This additional relation

227: exists between a pair of events caused by the initiation and the

228: completion of a nonblocking send/receive operation:

229:

230: \begin{description}

231: \item[Nonblocking order relation, $\nor$:]

232:

233: As illustrated in Figure \ref{fig:relation}(b), the

234: nonblocking order relation, $\nor$, shows the order in

235: which nonblocking messages are initiated and then

236: completed: $e_{p,i} \nor e_{p,k}$ states that

237: $e_{p,i} \sor e_{p,k}$, if $e_{p,i}$ is the send/receive

238: initiation of a message by process $p$ and $e_{p,k}$ is

239: the completion of the same message by the same process $p$.

240: \end{description}

241:

242: In our extended event graph, the happened-before relation is

243: redefined as $\hbr = (\sor \cup \cor \cup \nor)^+$.
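
As a minimal sketch of this definition, the happened-before relation can be computed as the transitive closure of the union of the three order relations. The sketch assumes that events are represented as (process, index) pairs and that each relation is a set of (earlier, later) pairs; the function name and the toy relations below are illustrative only and are not part of MPI-PD.

```python
from itertools import product


def happened_before(*relations):
    """Return the transitive closure of the union of the given relations.

    Each relation is a set of (earlier, later) pairs of events; an event is
    any hashable value, here a (process, index) tuple.
    """
    closure = set().union(*relations)
    changed = True
    while changed:
        changed = False
        # Join every pair (a, b) with every pair (b, d) until nothing is new.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


# Toy execution (assumed for illustration): p initiates a nonblocking send
# at (p,1), computes at (p,2) and completes the send at (p,3); q computes at
# (q,1) and receives the message at (q,2).
seq = {(("p", 1), ("p", 2)), (("p", 2), ("p", 3)), (("q", 1), ("q", 2))}
conc = {(("p", 1), ("q", 2))}      # send -> matching receive
nonblk = {(("p", 1), ("p", 3))}    # initiation -> completion

hb = happened_before(seq, conc, nonblk)
```

In this toy execution the send at (p,1) happened before the receive at (q,2), whereas (q,1) and (p,3) remain unordered. The quadratic join is chosen for brevity; a real trace would call for a graph traversal instead.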

244:

245: \subsection{Fault, error, and failure}

246: \label{subsec:fault-error-failure}

247: The concepts of faults, errors, and failures

248: \cite{melliar-smith77ldrs} used in our discussion are briefly

249: explained as follows: a program with a bug has a fault in itself and

250: an active fault causes an error. If the error fails to be corrected,

251: it causes a failure.

252:

253: \begin{figure}[ht]

254: \centering

255: \includegraphics[scale=.5]{eps/fault-error-failure.eps}

256:

257: \caption{Fault, error, and failure events. A crossed node

258: represents an unexpectedly terminated event, while a dotted node

259: represents an expected event that did not occur.}

260: \label{fig:fault-error-failure}

261: \end{figure}

262:

263: Figure \ref{fig:fault-error-failure} shows an example that interprets

264: these three concepts on events. In this example, process $r$ is the

265: faulty process, since it executes a faulty statement and causes a

266: faulty event. It also terminates against the developer's

267: intention, which causes a failure event. After this, process $q$

268: fails to pass a message to process $r$, which causes an error

269: event and, since $q$ then terminates, a failure event. Process

270: $p$ also encounters a communication error; however, its error handler

271: prevents a failure.

272:

273: Let $is\_failed(e)$ denote whether event $e$ causes a failure or

274: not. Since failure events have no successor and occur when programs

275: unexpectedly terminate, $is\_failed(e)$ is defined as follows:

276: \begin{eqnarray}

277: is\_failed(e) \hspace{-.5em} &=& \hspace{-.5em} {\rm~the~program~terminated~unexpectedly.} \nonumber

278: \label{eqn:isfailed}

279: \end{eqnarray}

280:

281: \subsection{Communication errors in MPI programs}

282: \label{subsec:comm_faults}

283: In MPI programs, an event causes a communication error if it

284: satisfies one of the following two conditions, isolated or

285: truncated, which are defined as follows:

286:

287: \begin{itemize}

288: \item {\em Isolated events}.

289:

290: \begin{itemize}

291: \item An event $e_{p,i}$ ($e_{q,j}$) is called an

292:     isolated send (receive) event,

293:     if $\neg \exists~e_{q,j} \in E~(e_{p,i} \in E)$

294:     such that $e_{p,i} \cor e_{q,j}$, respectively

295:     \cite{kran02pdp}.

296: \item An event $e_{p,i}$ ($e_{p,k}$) is called an

297:     isolated send/receive initiation (completion)

298:     event, if

299:     $\neg \exists~e_{p,k} \in E~(e_{p,i} \in E)$ such

300:     that $e_{p,i} \nor e_{p,k}$, respectively.

301: \end{itemize}

302:

303: \item {\em Truncated events}.

304:

305: \begin{itemize}

306: \item Two events $e_{p,i}$ and $e_{q,j}$ are called

307:     truncated events, if $e_{p,i} \cor e_{q,j}$ and

308:     $len(e_{p,i}) > len(e_{q,j})$, where $len(e_{p,i})$

309:     and $len(e_{q,j})$ represent the length of the send

310:     buffer specified in event $e_{p,i}$ and the receive

311:     buffer specified in event $e_{q,j}$, respectively.

312: \end{itemize}

313: \end{itemize}

314:

315: Isolated events arise in two situations. One is

316: a mismatch between events that occurred and the other is the

317: non-occurrence of expected events. First, events that occurred but do not match

318: can trigger error propagation. For example, an MPI

319: routine call with an invalid tag/communicator \cite{mpif94mpi} or an

320: invalid source/destination rank fails to pass the

321: intended message. A similar mismatch can occur between the initiation

322: and the completion of a nonblocking send/receive operation. Second,

323: expected events that did not occur cause serious problems, since they

324: can propagate errors through all processes. For example, if a

325: process terminates before sending an intended message,

326: the receiver process that has no original fault also terminates,

327: since it fails to receive the expected message. Thus, isolated events

328: propagate errors like a domino effect,

329: leading to a program failure.

330:

331: A pair of truncated events indicates an overflow of the

332: receive buffer.

333: In a strict sense, a message should be passed between the send and the

334: receive operations with the same buffer length

335: \cite{kran02pdp}. However, as MPI does, we also permit

336: passing a message between events $e_{p,i}$ and $e_{q,j}$ such that

337: $e_{p,i} \cor e_{q,j}$ and $len(e_{p,i}) < len(e_{q,j})$. In practice,

338: some nondeterministic applications require this flexibility, because

339: the receiver processes in these applications need to receive a

340: variable-length message in a single receive operation.

341: Therefore, we permit passing a message

342: between events with different buffer lengths, except for truncated

343: events.

344:

345: Thus, the error of an event can depend on that of an event on another

346: process. In this paper, we say that processes $p$ and $q$ have a

347: {\em communication dependency} if the error of event $e_{p,i}$ on process

348: $p$ determines that of event $e_{q,j}$ on another process $q$.

349:

350: Note that MPI has four communication modes \cite{mpif94mpi}: the

351: standard, buffered, synchronous, and ready modes. These modes differ in

352: when they resolve the matching of outgoing messages. For example, when

353: two processes each send a message to the other before receiving, they fall into a deadlock

354: in the synchronous mode while they are deadlock-free in the buffered

355: mode. Therefore, we have to check communication errors without

356: destroying these communication semantics in the target programs. That

357: is, outgoing messages have to be checked in the same mode as their

358: original mode. The error detection mechanism employed in MPI-PD is

359: presented later in Section \ref{subsec:error-detection}.

360:

361: For collective communications, since they can be implemented by using

362: point-to-point communications, we repeatedly apply the above

363: error definition to all of the point-to-point messages that compose

364: the collective communication.

365:

366: In the following, let $is\_isolated(e_{p,i})$ denote whether event

367: $e_{p,i}$ is an isolated event or not. Let

368: $is\_truncated(e_{p,i},e_{q,j})$ also denote whether events

369: $e_{p,i}$ and $e_{q,j}$ are truncated events or not.
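
The two predicates can be sketched as follows, under the assumption that the concurrent order relation (or the nonblocking order relation) is available as a set of event pairs and that buffer lengths are stored in a dictionary. The argument {\tt side} and the toy data are illustrative choices made here, not part of the definitions above.

```python
def is_isolated(event, relation, side):
    """True if `event` has no partner in `relation`.

    side=0 tests a send or initiation event (left-hand side of a pair);
    side=1 tests a receive or completion event (right-hand side).
    """
    return all(pair[side] != event for pair in relation)


def is_truncated(send, recv, conc, length):
    """True if the matched send buffer is longer than the receive buffer."""
    return (send, recv) in conc and length[send] > length[recv]


# Toy trace (assumed): p's first send is matched by q's first receive, whose
# buffer is too short; p's second send is never received.
conc = {(("p", 1), ("q", 1))}
length = {("p", 1): 8, ("q", 1): 4, ("p", 2): 8}

unmatched_send = is_isolated(("p", 2), conc, side=0)               # True
overflow = is_truncated(("p", 1), ("q", 1), conc, length)          # True
```

Passing the nonblocking order relation instead of the concurrent order relation makes the same {\tt is\_isolated} function cover isolated initiation and completion events.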

370:

371: \section{Algorithm for Localizing Faulty Processes}

372: \label{sec:algorithm}

373: This section presents the details of our proposed algorithm. We

374: describe how to localize faulty processes in a given event graph. We

375: assume here that the event graph is already generated by the error

376: detection mechanism presented later in Section

377: \ref{subsec:error-detection}.

378:

379: \begin{figure}[htb]\footnotesize

380: \setbox0\vbox{

381: 1. {\bf Algorithm} LocalizeFaultyProcesses($P$, $G$, $P_e$, $E_e$)\\

382: \hspace{.5em}2. \hspace{1em} // Input: $P$, a set of process ranks.\\

383: \hspace{.5em}3. \hspace{1em} // \hspace{2.4em} $G$, an event graph.\\

384: \hspace{.5em}4. \hspace{1em} // Output: $P_e$, a set of localized faulty process ranks.\\

385: \hspace{.5em}5. \hspace{1em} // \hspace{3em} $E_e$, a set of failure events on each process.\\

386: \hspace{.5em}6. {\bf begin}\\

387: \hspace{.5em}7. \hspace{2em}// (1) Identify the failure events that occurred on each process.\\

388: \hspace{.5em}8. \hspace{2em}$E_e := \emptyset$;\\

389: \hspace{.5em}9. \hspace{2em}{\bf foreach} ($p \in P$) {\bf begin}\\

390: 10. \hspace{4em}{\bf if} there exists $e_{p,i}$ such that $is\_failed(e_{p,i})=true$ \hspace{0.5em}{\bf then} \hspace{0.5em}$fe_p := e_{p,i}$\\

391: 11. \hspace{4em}{\bf else}\hspace{0.5em}$fe_p := null$\\

392: 12. \hspace{4em}{\bf endif}\\

393: 13. \hspace{4em}$E_e := E_e \cup \{ fe_p \}$;\\

394: 14. \hspace{2em}{\bf end}\\

395: 15. \hspace{2em}// (2) Localize faulty processes by recursive analysis.\\

396: 16. \hspace{2em}$P_e := \emptyset$;\\

397: 17. \hspace{2em}{\bf foreach} ($p \in P$) {\bf begin}\\

398: 18. \hspace{4em}{\bf if} (BacktraceCommDep($p$, $\emptyset$) $\ne$ 0)\hspace{0.5em}{\bf then}\hspace{0.5em}$P_e := P_e \cup \{ p \}$;\hspace{3em}// Process $p$ has faults.\\

399: 19. \hspace*{2em}{\bf end}\\

400: 20. {\bf end}\\

401: 21. // A recursive function that backtraces communication dependencies from process $p$.\\

402: 22. {\bf function} BacktraceCommDep($p$, $P_{dep}$)\\

403: 23. {\bf begin}\\

404: 24. \hspace*{2em}{\bf if} (($p \in P_e$) $\vert \vert$ (($fe_p = null$) \&\& ($P_{dep} = \emptyset$))) \hspace{0.5em}{\bf then} \hspace{0.5em}{\bf return} 0;\hspace{1.4em}// $p$ is already traced or valid.\\

405: 25. \hspace*{2em}{\bf else if} ($fe_p$ is a calculation event) \hspace{0.5em}{\bf then} \hspace{0.5em}{\bf return} --1;\hspace{5em}// (a) Calculation fault.\\

406: 26. \hspace*{2em}{\bf else if} ($fe_p = null$) \hspace{0.5em}{\bf then}\hspace{0.5em}{\bf return} --2;\hspace{11em}// (b) Non-occurred event.\\

407: 27. \hspace*{2em}{\bf else if} ($p \in P_{dep}$) \hspace{0.5em}{\bf then}\hspace{0.5em}{\bf return} $p$;\hspace{12.5em}// (c) Deadlock or (d) Overflow.\\

408: 28. \hspace*{2em}{\bf endif}\\

409: 29. \hspace*{2em}$q := ptnr(fe_p)$;\hspace{3em}// Source/destination rank for $fe_p$\\

410: 30. \hspace*{2em}$Q_{dep} := P_{dep} \cup \{ p \}$;\hspace{1.4em}// Update the call history.\\

411: 31. \hspace*{2em}$retval :=$ BacktraceCommDep($q$, $Q_{dep}$);\\

412: 32. \hspace*{2em}{\bf if} ($retval \ne 0$) \hspace{0.5em}{\bf then}\hspace{0.5em}$P_e := P_e \cup \{ q \}$;\hspace{1.4em}// Process $q$ has faults.\\

413: 33. \hspace*{2em}{\bf if} ($retval = p$) \hspace{0.5em}{\bf then}\hspace{0.5em}$retval := 0$;\\

414: 34. \hspace*{2em}{\bf else if} ($retval < 0$) \hspace{0.5em}{\bf then}\hspace{0.5em}$retval$++;\\

415: 35. \hspace*{2em}{\bf endif}\\

416: 36. \hspace*{2em}{\bf return} $retval$;\\

417: 37. {\bf end}

418: }\centerline{\fbox{\box0}}

419: \caption{Algorithm for localizing faulty processes.}

420: \label{fig:algorithm}

421: \end{figure}

422:

423: \begin{figure}[ht]

424: \centering

425: \hspace{5em}

426: \includegraphics[scale=.43]{eps/failure-status-1.eps}

427:

428: (a) Calculation fault\hspace{4em}

429: (b) Non-occurred event

430: \includegraphics[scale=.43]{eps/failure-status-2.eps}

431:

432: (c) Deadlock\hspace{8em}

433: (d) Overflow

434: \caption{Four failure situations classified by the proposed algorithm.}

435: \label{fig:failure-status}

436: \end{figure}

437:

438: Figure \ref{fig:algorithm} shows our algorithm, which takes a set

439: of process ranks, $P$, and an event graph, $G$, and returns the set of

440: localized faulty processes, $P_e$, and the set of failure events on each process,

441: $E_e$. Our algorithm consists of two stages

442: as follows:

443:

444: \begin{itemize}

445: \item Identification of failure events (see lines 7--14 in Figure

446: \ref{fig:algorithm}).

447: \item Localization of faulty processes (see lines 15--37 in Figure

448: \ref{fig:algorithm}).

449: \end{itemize}

450:

451: At the first stage, the algorithm identifies all failure events.

452: After this stage, it localizes

453: faulty processes by backtracing communication dependencies in a recursive

454: manner. Our algorithm then classifies the program failure into one of the

455: following four situations:

456:

457: \begin{description}

458: \item[(a) Calculation fault:]

459:

460: Figure \ref{fig:failure-status}(a) illustrates this

461: situation. As a result of backtracing, our algorithm finds

462: that process $s$ terminates unexpectedly and has no

463: communication dependency on any other process. Therefore, the

464: algorithm determines that the faulty process is process

465: $s$, which causes a calculation fault.

466:

467: \item[(b) Non-occurred event:]

468:

469: Figure \ref{fig:failure-status}(b) illustrates this

470: situation, in which process $s$ has a communication

471: dependency from $r$ but terminates successfully.

472: In this situation, either process $r$ could have sent a

473: redundant message or process $s$ could

474: have failed to call a receive routine.

475: However, it is difficult to

476: automatically determine which of processes

477: $r$ and $s$ is faulty. Therefore, our

478: algorithm determines that both

479: processes $r$ and $s$ are faulty, that is, the normally terminated

480: process and the process that it left behind.

481:

482: \item[(c) Deadlock:]

483:

484: A deadlock occurs if there exists a cyclic communication

485: dependency. In Figure \ref{fig:failure-status}(c),

486: processes $q$, $r$ and $s$ fall into a deadlock. Our

487: algorithm determines that the faulty processes are all the

488: processes that participate in the deadlock.

489:

490: \item[(d) Buffer overflow:]

491:

492: In Figure \ref{fig:failure-status}(d), process $s$ causes

493: a buffer overflow.

494: As in situation (b), it is difficult to

495: identify which of processes $r$ and $s$ has called an MPI

496: routine with an invalid buffer length. Therefore, our

497: algorithm determines that the faulty processes are both

498: processes $r$ and $s$, which have a pair of truncated events.

499: \end{description}
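
The recursion in Figure \ref{fig:algorithm} can be sketched in Python as follows. This is a minimal illustration rather than the MPI-PD implementation: the function names and the encoding of a failure event as a (kind, partner) pair are assumptions made here. The sketch also assumes that ranks are positive integers, because the figure uses 0 and negative values as status codes, so a rank of 0 could not be told apart from the return value 0. A negative return value counts how many callers up the chain are still to be marked as faulty: $-1$ marks only the process itself, while $-2$ marks the process and the one that waited for it.

```python
from typing import Dict, Optional, Set, Tuple

# A failure event is modelled as (kind, partner):
#   kind    -- "calc" for a calculation event, "comm" for a communication
#   partner -- source/destination rank of a "comm" event (ptnr in the figure)
FailureEvent = Tuple[str, Optional[int]]


def localize_faulty_processes(
    failure_events: Dict[int, Optional[FailureEvent]]
) -> Set[int]:
    """Return P_e, given the failure event fe_p (or None) of every rank."""
    faulty: Set[int] = set()  # P_e

    def backtrace(p: int, deps: frozenset) -> int:
        fe = failure_events[p]
        if p in faulty or (fe is None and not deps):
            return 0            # already traced, or terminated normally
        if fe is not None and fe[0] == "calc":
            return -1           # (a) calculation fault
        if fe is None:
            return -2           # (b) non-occurred event
        if p in deps:
            return p            # (c) deadlock or (d) overflow
        q = fe[1]               # partner rank of the failed communication
        retval = backtrace(q, deps | {p})
        if retval != 0:
            faulty.add(q)       # process q has faults
        if retval == p:
            retval = 0          # the cycle closes at p
        elif retval < 0:
            retval += 1         # one step further from the origin
        return retval

    for p in sorted(failure_events):
        if backtrace(p, frozenset()) != 0:
            faulty.add(p)
    return faulty


# The four situations, with ranks 1..4 chosen for illustration.
calc = localize_faulty_processes(
    {1: ("comm", 2), 2: ("comm", 3), 3: ("calc", None)})      # {3}
missing = localize_faulty_processes(
    {1: ("comm", 2), 2: ("comm", 3), 3: None})                # {2, 3}
deadlock = localize_faulty_processes(
    {1: ("comm", 2), 2: ("comm", 3), 3: ("comm", 1), 4: ("comm", 1)})
overflow = localize_faulty_processes({1: ("comm", 2), 2: ("comm", 1)})
```

In the deadlock example, process 4 waits for a member of the cycle but is not itself reported, so the result is the set of ranks 1, 2, and 3. The overflow example is treated as a cycle of length two, which matches the shared return path for situations (c) and (d) in the figure.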

500:

501: Note that the algorithm described in Figure

502: \ref{fig:algorithm} backtraces communication dependencies by assuming that

503: all the source/destination ranks are valid. Therefore, if a faulty

504: process calls an MPI routine with an invalid source/destination,

505: this algorithm can omit the faulty process from the localized

506: processes. We discuss this problem later in Section

507: \ref{subsec:studies_applicability}.

508:

509: \section{MPI-PreDebugger}

510: \label{sec:mpipd}

511: This section presents the details of MPI-PD, including its environment

512: for debugging and its mechanism for run-time error detection.

513:

514: \begin{figure*}[ht]

515: \centering

516: \includegraphics[width=14.0cm]{eps/debugging-process.eps}

517: \caption{Debugging process with MPI-PD.}

518: \label{fig:debugging-process}

519: \end{figure*}

520:

521: \subsection{Overview of debugging environment}

522: \label{subsec:overview}

523:

524: Figure \ref{fig:debugging-process} shows the debugging process with

525: MPI-PD. The debugging functions in MPI-PD are implemented using the

526: C++ language and the Ruby-GNOME toolkit \cite{www02rubygnome} and are

527: composed of three components: the instrumentation tool mpi2pd, the

528: run-time error detection library libpdmpi.a, and the localization

529: and visualization tool pdview.

530:

531: The instrumentation tool mpi2pd automatically replaces all of the

532: MPI routines in programs with instrumented MPI routines

533: based on pattern-match rules.

534: The instrumented routine is a combination of the original MPI routine

535: and the run-time error detection function.

536: After this replacement, developers have to generate the

537: object code by compiling their programs and the executable binary

538: file by linking the object code with the run-time error detection

539: library.

540:

541: The run-time error detection library checks communication errors

542: whenever the processes call the instrumented MPI routines (see Section

543: \ref{subsec:error-detection}). If the library detects any

544: communication error, it terminates program execution and generates a

545: trace file. The trace file has the following information for every

546: event observed during program execution: (1) event number, (2) process

547: rank, (3) corresponding line in source code and its file name, and (4)

548: corresponding MPI routine and its arguments.
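
The four items recorded for each event can be pictured as one record per event. The sketch below is purely illustrative: the field names and the one-JSON-object-per-line encoding are assumptions made here, since the on-disk format of the MPI-PD trace file is not specified in this paper.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class TraceRecord:
    event_number: int      # (1) event number
    rank: int              # (2) process rank
    source_file: str       # (3) file name ...
    line: int              #     ... and line in the source code
    routine: str           # (4) MPI routine ...
    args: List[str]        #     ... and its arguments, kept as text


def dump(records):
    """Serialize records as one JSON object per line (hypothetical format)."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


def load(text):
    """Inverse of dump()."""
    return [TraceRecord(**json.loads(line)) for line in text.splitlines()]


record = TraceRecord(3, 1, "main.c", 42, "MPI_Ssend",
                     ["buf", "8", "MPI_INT", "2", "0", "MPI_COMM_WORLD"])
restored = load(dump([record]))
```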

549:

550: Given a trace file, the visualization tool pdview allows developers to

551: view the behavior of the terminated program, as shown in Figure

552: \ref{fig:debugging-process}. It visualizes the event graph, with

553: the process axis vertical and the time axis horizontal,

554: and shows the result of the fault localization described in Section

555: \ref{sec:algorithm}. In the event graph, a colored node corresponds

556: to an event and the type of the MPI operation that caused the event

557: determines its color. A solid line between two nodes

558: corresponds to a successful communication while a dotted line

559: corresponds to a failed communication.

560:

561: In its default mode, pdview avoids visualizing the entire event graph. It

562: visualizes only the failure events that occurred on each process and the

563: successful events that occurred directly before the failure

564: events. Furthermore, pdview can isolate faulty processes from

565: the event graph. Developers can visualize an isolated event graph by

566: selecting whichever processes they want. In addition to

567: these visualization functions, pdview also shows the following information:

568:

569: \begin{itemize}

570: \item Faulty processes localized by the proposed algorithm.

571: \item Failure situation selected from four situations (see

572: Figure \ref{fig:failure-status}).

573: \end{itemize}

574:

575: Furthermore, developers can investigate every visualized event. If

576: they click the mouse on a node in the visualized event graph, then

577: pdview pops up a dialog, which shows information (1)--(4) about the

578: corresponding event and its error reason (isolated/truncated). This

579: information is useful for developers to locate faults in

580: programs. After this fault localization, source-level debuggers can

581: effectively assist developers in investigating the detailed behavior of

582: the localized part.

583:

584:

585: \subsection{Mechanism for run-time error detection}

586: \label{subsec:error-detection}

587: MPI-PD checks the occurrence of communication errors during program

588: execution. If it detects any errors, it generates a trace file.

589:

590: To realize this, we employ three methodologies. We first discuss

591: the synchronous blocking send ({\tt MPI\_Ssend}) and then the others. The

592: three methodologies are as follows:

593:

594: \begin{itemize}

595: \item Manager process:

596: To generate trace files under a deadlock situation, we

597: employ a manager process $M_p$ for every process $p$.

598: $M_p$ checks the value

599: of $is\_failed(e_{p,i})$ before its responsible process $p$

600: executes event $e_{p,i}$.

601: How $M_p$ checks $is\_failed(e_{p,i})$

602: is described below.

603: If $M_p$ obtains

604: $is\_failed(e_{p,i})=false$, it allows $p$ to execute event

605: $e_{p,i}$ and pushes the information about $e_{p,i}$ into

606: its local Event Graph $E_p$. Otherwise, it detects a

607: communication error, terminates $p$ and generates a trace file

608: from $E_p$.

609:

610: \item Message queue: To handle nonblocking communications, we employ

611: a message queue. For nonblocking communications, to decide the

612: failure of completion event $e_{p,k}$, we have to refer to the

613: information about its corresponding initiation event $e_{p,i}$

614: ($e_{p,i} \nor e_{p,k}$). Therefore, for all processes $p$,

615: manager $M_p$ has its own message queue $Q_p$ for

616: referring to the information about the past events.

617:

618: \item Timeout mechanism: We also employ a timeout mechanism due to

619: the difficulty of distinguishing valid from invalid

620: computation. For example, a receive event $e_{q,j}$ that never

621: receives a message has to be classified as

622: $is\_isolated(e_{q,j})=true$. However, it is hard for $M_q$

623: to determine whether the sender $p$ will send the message or not.

624: That is, $p$ may send the message after heavy computation or may

625: have fallen into an infinite loop. Therefore,

626: $M_p$ holds a timeout time $t(e_{p,i})$ for every

627: $e_{p,i}$ and decides $is\_isolated(e_{p,i})=true$ when the

628: time is up.

629: \end{itemize}
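
The interplay of the message queue and the timeout mechanism can be sketched as a queue of pending events with deadlines. The class below is a hypothetical illustration, not the data structure used inside libpdmpi.a: its name, its methods, and the use of explicit timestamps instead of a real clock are assumptions made here to keep the example deterministic.

```python
class ManagerQueue:
    """Pending events of one process, each with an absolute deadline."""

    def __init__(self):
        self._deadline = {}          # event -> time at which it times out

    def push(self, event, now, timeout):
        """Register a pending event that must complete within `timeout`."""
        self._deadline[event] = now + timeout

    def complete(self, event):
        """Forget an event whose communication finished in time."""
        self._deadline.pop(event, None)

    def timed_out(self, now):
        """Return the pending events whose deadline has passed."""
        return sorted(e for e, t in self._deadline.items() if t <= now)


queue = ManagerQueue()
queue.push(("p", 1), now=0.0, timeout=5.0)   # deadline at t = 5.0
queue.push(("p", 2), now=1.0, timeout=5.0)   # deadline at t = 6.0
queue.complete(("p", 1))                     # matched before its deadline
expired = queue.timed_out(now=6.0)           # only ("p", 2) is left
```

An event returned by {\tt timed\_out} corresponds to the case in which the manager decides $is\_isolated(e_{p,i})=true$. Choosing the timeout value remains a trade-off between false alarms on long computations and the delay before a genuine hang is reported.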

630:

631: Figure \ref{fig:error-detection} shows the process of run-time error

632: detection for {\tt MPI\_Ssend}. In Figure \ref{fig:error-detection},

633: the manager of the sender has three states (states C, S1 and S2) and

634: that of the receiver has four states (states C, R1, R2 and R3) as

635: follows:

636:

637: \begin{figure*}[ht]

638: \centering

639: \includegraphics[width=6.0cm]{eps/error-detection-success.eps}

640: \qquad

641: \includegraphics[width=6.0cm]{eps/error-detection-failure.eps}

642:

643: (a) Successful case \hspace{13em}(b) Failure case

644: \caption{Process of run-time error detection for the synchronous

645: blocking send ({\tt MPI\_Ssend}). Events $e_{p,i}$ and $e_{q,j}$

646: correspond to {\tt MPI\_Ssend} and {\tt MPI\_Recv} calls,

647: respectively.}

648: \label{fig:error-detection}

649: \end{figure*}

650:

651:

652: \begin{description}

653: \item[Common state for the sender/receiver:]

654: \item State C: {\em Timeout checking and control-message waiting}. In

655: this state, $M_p$ repeatedly checks $Q_p$ for timed-out

656: events until it receives a control

657: message (an ack or a request) from $p$ or another

658: manager. If $M_p$ detects a timeout event $e_{p,i}$,

659: then it decides $is\_failed(e_{p,i})=true$ and sends an

660: abort request $abort_p(e_{p,i})$ to $p$. It also adds the

661: failure event $e_{p,i}$ to $E_p$ and terminates.

662: If $M_p$ receives a control message, then it moves

663: to the state that handles that message.

664:

665: \item[States for the sender:]

666: \item State S1: {\em Send initiating}. If $M_p$ receives a send

667: request $req_p(e_{p,i})$ from $p$, then it pushes the

668: information about $e_{p,i}$ into $Q_p$ with

669: $t(e_{p,i})$. It also checks the destination rank of

670: $e_{p,i}$ and transmits a send request $req_m(e_{p,i})$ to

671: the destination process's manager, $M_q$ (go to state C).

672: \item State S2: {\em Message sending}. If $M_p$ receives an

673: ack $ack_m(e_{q,j})$ from another manager, then it

674: searches $Q_p$ and selects

675: $e_{p,i}$ such that $is\_isolated(e_{p,i}) = false$. It

676: also checks whether $e_{p,i}$ and $e_{q,j}$ are

677: truncated events.

678: \begin{itemize}

679: \item If $is\_truncated(e_{p,i},e_{q,j})=false$, $M_p$

680:      decides $is\_failed(e_{p,i})=false$ and sends an

681:      ack $ack_p(e_{p,i})$ to

682:      $p$. After this acknowledgement, it deletes

683:      $e_{p,i}$ from $Q_p$, and adds

684:      both $e_{p,i}$ and $e_{q,j}$ to $E_p$ (go to

685:      state C).

686: \item Otherwise, $M_p$ decides

687:      $is\_failed(e_{p,i})=true$ and sends an abort

688:      request $abort_p(e_{p,i})$ to $p$.

689:      It also adds both $e_{p,i}$ and $e_{q,j}$ to $E_p$

690:      as failure events and terminates.

691: \end{itemize}

692:

693: \item[States for the receiver:]

694: \item State R1: {\em Receive initiating}. If $M_q$ receives a receive

695: request $req_q(e_{q,j})$, it then searches $Q_q$ and

696: selects $e_{p,i}$ such that $is\_isolated(e_{p,i}) \lor

697: is\_isolated(e_{q,j})=false$.

698:

699: \begin{itemize}

700: \item If such $e_{p,i}$ exists, $M_q$

701:      decides that $e_{p,i}$ and $e_{q,j}$ are the

702:      matching events (go to state R3).

703: \item Otherwise, it defers the error detection for

704:      $e_{q,j}$ and pushes the information about

705:      $e_{q,j}$ into $Q_q$ with $t(e_{q,j})$ (go to

706:      state C).

707: \end{itemize}

708:

709: \item State R2: {\em Send-request receiving}. If $M_q$

710: receives a request $req_m(e_{p,i})$ from another manager,

711: then it searches $Q_q$ and selects

712: $e_{q,j}$ such that

713: $is\_isolated(e_{p,i}) \lor is\_isolated(e_{q,j})=false$.

714:

715: \begin{itemize}

716: \item If such $e_{q,j}$ exists, $M_q$

717:      decides that $e_{p,i}$ and $e_{q,j}$ are the

718:      matching events (go to state R3).

719: \item Otherwise, it defers the error detection for

720:      $e_{p,i}$ and pushes the information about

721:      $e_{p,i}$ into $Q_q$ with $t(e_{p,i})$ (go to

722:      state C).

723: \end{itemize}

724:

725: \item State R3: {\em Message receiving}. $M_q$ sends an ack

726: $ack_m(e_{q,j})$ to $M_p$.  It then checks if

727: $e_{p,i}$ and $e_{q,j}$ are truncated events.

728:

729: \begin{itemize}

730: \item If $is\_truncated(e_{p,i},e_{q,j})=false$, then

731:      $M_q$ decides $is\_failed(e_{q,j})=false$

732:      and sends an ack $ack_r(e_{q,j})$ to

733:      $q$. After this acknowledgement, it deletes

734:      the queued event ($e_{q,j}$ or $e_{p,i}$) from $Q_q$

735:      and adds both $e_{p,i}$ and $e_{q,j}$ to $E_q$

736:      (go to state C).

737: \item If $is\_truncated(e_{p,i},e_{q,j})=true$, then

738:      $M_q$ decides $is\_failed(e_{q,j})=true$

739:      and sends an abort request $abort_q(e_{q,j})$ to $q$.

740:      It also adds both $e_{p,i}$ and $e_{q,j}$ to $E_q$

741:      as failure events and terminates.

742: \end{itemize}

743: \end{description}

744:
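The receiver-side states R1, R2, and R3 can be sketched in Python as follows. This is an illustrative model rather than the MPI-PD code: timeouts are omitted, events are matched only by source rank and tag, and {\tt is\_truncated} is assumed to mean that the message is longer than the receive buffer. All names in the sketch are hypothetical.

```python
def is_truncated(send_len, recv_len):
    """Assumed meaning: the message does not fit into the receive buffer."""
    return send_len > recv_len


class ReceiverManager:
    """Hypothetical manager M_q covering states R1, R2, and R3 (no timeouts)."""

    def __init__(self):
        self.queue = []        # Q_q: unmatched (kind, event) entries
        self.event_graph = []  # E_q: (send_id, recv_id, status) triples

    def _take_match(self, wanted_kind, event):
        """Remove and return a queued event of the other kind that matches."""
        for entry in self.queue:
            kind, other = entry
            if (kind == wanted_kind and other["src"] == event["src"]
                    and other["tag"] == event["tag"]):
                self.queue.remove(entry)
                return other
        return None

    def on_recv_request(self, recv):
        """State R1: a receive request req_q(e_{q,j}) arrives."""
        send = self._take_match("send", recv)
        if send is None:
            self.queue.append(("recv", recv))  # defer detection, go to state C
            return "wait"
        return self._complete(send, recv)

    def on_send_request(self, send):
        """State R2: a send request req_m(e_{p,i}) arrives from M_p."""
        recv = self._take_match("recv", send)
        if recv is None:
            self.queue.append(("send", send))  # defer detection, go to state C
            return "wait"
        return self._complete(send, recv)

    def _complete(self, send, recv):
        """State R3: decide is_failed for the matched pair."""
        failed = is_truncated(send["length"], recv["length"])
        status = "failed" if failed else "ok"
        self.event_graph.append((send["id"], recv["id"], status))
        return "abort" if failed else "ack"


m_q = ReceiverManager()
results = [
    m_q.on_recv_request({"id": "e_q_0", "src": 0, "tag": 7, "length": 16}),
    m_q.on_send_request({"id": "e_p_0", "src": 0, "tag": 7, "length": 16}),
    m_q.on_send_request({"id": "e_p_1", "src": 0, "tag": 8, "length": 64}),
    m_q.on_recv_request({"id": "e_q_1", "src": 0, "tag": 8, "length": 32}),
]
```

The first pair matches and is acknowledged, whereas the second pair is matched but flagged as a failure because the 64-byte message exceeds the 32-byte buffer.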

745: The manager processes buffer all events until they detect an error,

746: so their local memory may become full.

747: Our algorithm described in Figure \ref{fig:algorithm} requires

748: failure events on each process.

749: Therefore, if the local memory of $M_p$ is full,

750: we allow $M_p$ to delete information about the oldest successful event

751: from $E_p$.

752:
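A minimal Python sketch of this eviction policy is given below. The fixed {\tt capacity}, the list representation of $E_p$, and the function name are assumptions for illustration; the sketch only shows that successful events are discarded oldest-first while failure events are always kept.

```python
def append_event(event_graph, event, capacity):
    """Append an event to E_p, evicting the oldest successful one when full.

    Events are (event_id, status) pairs. Failure events are never evicted,
    so the graph may exceed `capacity` if it holds no successful event.
    """
    if len(event_graph) >= capacity:
        for idx, (_, status) in enumerate(event_graph):
            if status == "ok":
                del event_graph[idx]
                break
    event_graph.append(event)


e_p = []
for i in range(3):
    append_event(e_p, (f"e_p_{i}", "ok"), capacity=3)
append_event(e_p, ("e_p_3", "failed"), capacity=3)  # evicts e_p_0
append_event(e_p, ("e_p_4", "ok"), capacity=3)      # evicts e_p_1
```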

753: Here, recall that we have to keep the communication semantics, as

754: explained in Section \ref{subsec:comm_faults}. Therefore, for the

755: blocking buffered mode send ({\tt MPI\_Bsend}), we alter the sequence

756: of error detection. That is, to keep the buffered behavior of message

757: passing, process $p$ passes the original message immediately after

758: sending request $req_p(e_{p,i})$ to its manager $M_p$. This

759: alteration omits receiving an ack $ack_p(e_{p,i})$ from

760: $M_p$. In place of the omitted ack, $p$ checks for an abort message

761: $abort_p(e_{p,i})$ from $M_p$ whenever it calls an

762: instrumented MPI routine. If $p$ receives the abort message

763: $abort_p(e_{p,i})$, it terminates its execution. Otherwise, it

764: continues processing the original routine.

765: This alteration allows $p$ to execute a few events after an original

766: faulty event; however, this does not affect faulty process

767: localization since $M_p$ identifies the faulty event correctly.

768:
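The deferred abort check for the buffered mode can be sketched in Python as follows. The class {\tt InstrumentedProcess} and its methods are hypothetical stand-ins for the instrumented routines, not the actual MPI-PD wrappers; the sketch only illustrates that $p$ does not wait for an ack and notices an abort at its next instrumented call.

```python
class InstrumentedProcess:
    """Hypothetical process-side view of the deferred abort check."""

    def __init__(self):
        self.pending_abort = None  # event named in abort_p(e), once it arrives
        self.executed = []         # routines that actually ran
        self.terminated = False

    def deliver_abort(self, event_id):
        """Stand-in for M_p sending abort_p(e) some time after the send."""
        self.pending_abort = event_id

    def call(self, routine):
        """Entry of every instrumented MPI routine: check for an abort first."""
        if self.terminated:
            return False
        if self.pending_abort is not None:
            self.terminated = True     # abort seen: stop before the routine
            return False
        self.executed.append(routine)  # Bsend proceeds without awaiting an ack
        return True


p = InstrumentedProcess()
p.call("Bsend#0")           # faulty send returns immediately
p.call("Send#1")            # runs before M_p has reached a decision
p.deliver_abort("Bsend#0")  # M_p decides is_failed = true for Bsend#0
ran = p.call("Recv#2")      # the abort is noticed here
```

Here one further event runs after the faulty send, yet the abort message still names the original faulty event.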

769: For nonblocking communications, we process states S1 and R1 at the

770: send initiation and the receive initiation of nonblocking operations,

771: respectively; and process send acks at the completion of the nonblocking

772: operations. For collective communications, we can apply the same

773: approach as for the blocking mode point-to-point routines, since the

774: collective communications can be implemented by using those

775: point-to-point routines.

776:

777: Thus, exchanging information about every event among managers

778: enables us to detect communication errors and generate trace files

779: before program failure.

780:

781: \section{Case Studies: Debugging Message Passing Programs with MPI-PD}

782: \label{sec:studies}

783:

784: In this section we introduce three case studies. The aim of each

785: study is to investigate the effectiveness of MPI-PD from one of the

786: following points of view:

787:

788: \begin{enumerate}

789: \item {\em Applicability}:

790: We investigated the kinds of faults for which MPI-PD is effective.

791: To do this, we applied MPI-PD to 28 Gaussian elimination

792: programs developed by MPI beginners (see Section

793: \ref{subsec:studies_applicability}).

794:

795: \item {\em Scalability}:

796: This study shows an example of scalable debugging using

797: MPI-PD. We applied MPI-PD to a parallel rendering program

798: \cite{takeuti03sac} developed by MPI experts on 64 processes

799: (see Section \ref{subsec:studies_scalability}).

800:

801: \item {\em Usability}:

802: We investigated the usability of faulty process

803: localization. To do this, we applied MPI-PD to a complicated

804: program generated automatically by a parallelizing

805: compiler \cite{y-yamamt01ebcsh}. We also compared

806: visualization results between the proposed MPI-PD and the existing

807: TotalView \cite{etnus01tv} (see Section

808: \ref{subsec:studies_usability}).

809: \end{enumerate}

810:

811: \begin{table*}[tb]

812: \caption{Summary of case studies. $|L|$, $|P|$, and $|E|$ represent

813: the numbers of lines, processes, and events, respectively.}

814: \label{tab:summary}

815: \begin{center}

816: \begin{tabular}{|l|l|c|l|c|c|}\hline

817: \lw{Case study}  & \multicolumn{3}{l|}{Details of program}

818:       & \multicolumn{2}{l|}{Details of trace file}

819:       \\ \cline{2-6}

820:      & Developer & $|L|$ & Employed MPI routines

821:       & $|P|$ & $|E|$

822:       \\ \hline

823: 1. Applicability & Beginner  & ~~~300 & {\tt Send}, {\tt Recv}, {\tt Isend}, {\tt Irecv}, {\tt Wait}

824:       & ~~~4~~ & ~~412

825:       \\ \hline

826: 2. Scalability   & Expert    & 40,000 & {\tt Send}, {\tt Recv}, {\tt Sendrecv}

827:       & ~~64~~ & 9,774

828:       \\ \hline

829: 3. Usability     & Compiler   & 20,000 &{\tt Isend}, {\tt Irecv}, {\tt Waitall}

830:       & ~~15~~ & ~~253

831:       \\ \hline

832: \end{tabular}

833: \end{center}

834: \end{table*}

835:

836: Table \ref{tab:summary} shows a summary of the above studies. In the

837: following, we omit ``{\tt MPI\_}'', the prefix of MPI routines, as shown

838: in Table \ref{tab:summary}.

839:

840: In these studies we used a PC cluster with 64 symmetric

841: multiprocessor (SMP) nodes. Each node in the cluster has two Pentium

842: III 1GHz processors and connects to a Myrinet-2000

843: switch \cite{nanette95myrinet}. We also employed an MPI

844: implementation, MPICH-GM \cite{www02mpichgm}.

845:

846: \subsection{Study 1: Applicability of MPI-PD}

847: \label{subsec:studies_applicability}

848:

849: In this study, we applied MPI-PD to 28 faulty programs developed by

850: six graduate students during an MPI programming exercise. These

851: programs solve simultaneous equations using Gaussian elimination.

852:

853: \begin{table}[tb]

854: \caption{Application results of MPI-PD.}

855: \label{tab:application}

856: \begin{center}

857: \begin{tabular}{|l|c|c|}\hline

858: \lw{Debugging phase}       & \multicolumn{2}{c|}{Number of programs}\\ \cline{2-3}

859: 		& Success & Failure\\ \hline

860: MPI Program execution      & 13 of 28   & 15 of 28\\ \hline

861: Event graph visualization  & 15 of 15   & \hspace{.5em}0 of 15 \\ \hline

862: Faulty process localization & 12 of 15   & \hspace{.5em}3 of 15\\ \hline

863: \end{tabular}

864: \end{center}

865: \end{table}

866:

867: We first executed the programs on our PC cluster and then

868: visualized localization results by using MPI-PD. Table

869: \ref{tab:application} shows the application results at each

870: debugging phase.

871:

872: At the execution phase, 15 of 28 programs unexpectedly terminated. As

873: we mentioned in Section \ref{sec:introduction}, since the current MPI-PD

874: focuses on faults that cause program failures, it could not visualize the

875: event graph for the remaining 13 programs, which never terminated unexpectedly but

876: returned incorrect results. These programs contain semantic faults

877: such as invalid specifications of operators/variables and invalid

878: writing to message buffers before the completion of nonblocking

879: communications.

880:

881: At the localization phase, MPI-PD successfully localized faulty

882: processes for 12 of 15 programs while it failed to localize them for

883: the remaining three programs. These three programs have calculation

884: faults activated by all processes at the same statement. Therefore,

885: every process terminated outside the instrumented MPI routines, so

886: that their trace files contained no information about failure

887: events. Thus, MPI-PD failed to localize their faulty

888: processes. However, in these cases, since every process terminates

889: without any communication dependency, error propagation cannot

890: occur. Therefore, developers have to investigate every process. That

891: is, they have to investigate their programs between the last MPI

892: routine that completed successfully and the next MPI routine expected to be

893: executed, especially the common statements that every process

894: executes.

895:

896: The 12 programs for which MPI-PD successfully localized faulty processes had a variety

897: of faults, classified into the following four types.

898: Note that MPI-PD localized not the faults themselves but the faulty processes

899: that activate them.

900:

901: \begin{itemize}

902: \item Invalid source/destination rank (six programs).

903: \item Invalid length of message buffer (three programs).

904: \item Calculation fault (two programs).

905: \item Deadlock occurred when passing long messages (one program).

906: \end{itemize}

907:

908: We next confirmed that there was no faulty process omitted from the

909: localized results. For all cases where invalid source/destination

910: ranks were specified, MPI-PD pointed out deadlock processes,

911: including the faulty process. Therefore, the deadlock processes

912: pointed out by MPI-PD can include valid processes, so there

913: is room for improving the accuracy of localization. However,

914: this redundancy was a minor problem for the programs used in this

915: study. Since their faults appear with any number of processes,

916: developers can scale down the number of processes without

917: missing the activated faults.

918:

919: \subsection{Study 2: Scalable debugging with MPI-PD}

920: \label{subsec:studies_scalability}

921:

922: \begin{figure}[ht]

923: \centering

924: \includegraphics[width=8.3cm]{eps/VR64-all.eps}

925: \caption{Localized faulty processes in event graph visualized by MPI-PD.}

926: \label{fig:vr-all}

927: \end{figure}

928:

929: We applied MPI-PD to a parallel rendering program

930: \cite{takeuti03sac} implemented on 64 processes. This program has a

931: fault in gathering and compositing rendered images generated by distributed

932: processors. For the purpose of high-speed compositing, the

933: developers have implemented their own collective communication routines

934: for the gather and the broadcast operations by using point-to-point

935: routines, {\tt Send} and {\tt Recv}. Their collective routines are

936: called at every compositing stage while splitting the processes into

937: two groups. That is, given $n$ processes, each of the $2^{i-1}$ groups

938: performs collective communications at the $i^{\rm th}$ stage, where

939: $1 \leq i \leq \log n$.

940:
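As a worked instance of this staging, assuming that the logarithm is taken to base two and that $n$ is a power of two (neither is stated explicitly above), the 64-process run has
\[
\log_2 64 = 6 \mbox{ stages}, \qquad 2^{i-1} = 1, 2, 4, 8, 16, 32 \mbox{ groups for } i = 1, \ldots, 6,
\]
so that, summed over all stages, $\sum_{i=1}^{6} 2^{i-1} = 2^{6} - 1 = 63$ groups perform collective communications.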

941: Figure \ref{fig:vr-all} shows the event graph for all processes

942: visualized by MPI-PD. While the program generates a total of 9,774

943: events, the visualized event graph is composed of 164 events: 64

944: failure events and 100 successful events that occurred directly before the

945: failure events. In Figure \ref{fig:vr-all}, MPI-PD points out five

946: faulty processes among the 64 processes: processes PE21, PE37, PE44, PE48,

947: and PE52. It also points out that these five processes fall into a

948: deadlock and that each of them has one failure event.

949:

950: As we mentioned in Section \ref{subsec:overview}, MPI-PD allows

951: developers to visualize whichever processes they want. For

952: example, developers can view only the deadlock processes as shown in

953: Figure \ref{fig:vr-fp}, so that they can easily see how the processes fell

954: into the deadlock. They can also add related processes that

955: communicated with the deadlock processes (see Figure

956: \ref{fig:vr-fpplus}), so that they can see at a glance that process PE48

957: received more messages than the other four faulty processes:

958: processes PE21, PE37, PE44, and PE52.

959:

960: Thus, MPI-PD guided the developers to the five faulty events, so that

961: they easily found that process PE48, the root process of a

962: broadcast operation, called the {\tt Send} routine too many

963: times because a {\tt break} statement was missing. Therefore, MPI-PD assists

964: developers in scalable debugging, where the numbers of processes and

965: events are too large for them to understand the behavior of programs.

966:

967: We also note that the buffered send operation makes it harder

968: to locate faults, since this operation causes a gap between the faulty

969: send event and the failure event. For example, when we executed

970: the rendering program without error detection, since process PE48

971: pushed out messages in the buffered mode, it successfully returned

972: from the faulty {\tt Send} routine and terminated at a succeeding {\tt

973: Recv} routine. Therefore, without MPI-PD, the developers may be led to

974: investigate the {\tt Recv} routine, where the observed error is not the original one

975: but a result of error propagation. Thus, MPI-PD's run-time error

976: detection is necessary for handling the buffered send operation.

977:

978:

979: \begin{figure}[ht]

980: \centering \includegraphics[width=8.3cm]{eps/VR64-fp.eps}

981: \caption{Faulty processes isolated by MPI-PD. This graph shows only

982: faulty processes and communications among them.}

983: \label{fig:vr-fp}

984: \end{figure}

985:

986: \begin{figure}[ht]

987: \centering \includegraphics[width=8.3cm]{eps/VR64-fpplus.eps}

988: \caption{Faulty processes and their related processes isolated by MPI-PD.

989: Related processes are those with which the faulty processes communicate.}

990: \label{fig:vr-fpplus}

991: \end{figure}

992:

993:

994: \subsection{Study 3: Comparison with existing debuggers}

995: \label{subsec:studies_usability}

996:

997: To clarify the usability of fault localization, we compared MPI-PD

998: with TotalView \cite{etnus01tv} by applying them to a complicated program. This

999: program is automatically generated by a parallelizing compiler based

1000: on a task scheduling algorithm, Scheduling with Packaged

1001: Point-to-point Communications (SPPC) \cite{y-yamamt01ebcsh}.

1002:

1003: The MPI program generated by SPPC consists of two layers, the calculation

1004: and the communication layers, which repeatedly appear during program

1005: execution. In the calculation layer, each process independently

1006: performs calculation without any communication. In the communication

1007: layer, it exchanges messages by calling nonblocking communication

1008: routines. Each process first calls many initiation routines,

1009: {\tt Isend} and {\tt Irecv}, then a completion routine, {\tt

1010: Waitall}. Since the parallelizing compiler mechanically generates

1011: large-scale MPI programs, debugging them is complicated

1012: work. Furthermore, since the {\tt Waitall} routine completes all

1013: initiated communications at once, it is time-consuming to

1014: distinguish failed communications from the many communications

1015: completed by the {\tt Waitall} routine.

1016:

1017: Figure \ref{fig:sppc} shows the visualizations

1018: obtained by MPI-PD and TotalView. While MPI-PD

1019: visualizes all failure events that occurred on each process and the

1020: successful events that occurred directly before the failure events,

1021: TotalView shows {\em pending sends/receives} and {\em

1022: unexpected messages} \cite{james99debugger,etnus01tv} at an arbitrary execution step.

1023: Pending sends/receives represent the sends/receives that have been

1024: initiated but have not yet been matched. Unexpected messages

1025: represent messages that have been sent to a process but have not

1026: yet been received.

1027:

1028: In this program, every process terminated at a call to the {\tt Waitall}

1029: routine. At termination, the processes were trying to complete a total

1030: of 171 nonblocking operations. For this faulty program, TotalView

1031: visualizes 50 pending receives, represented as arrows in Figure

1032: \ref{fig:sppc}(b). However, it is time-consuming for the developers to

1033: investigate each of the 50 pending receives. On the other hand, MPI-PD

1034: checks the error of every communication and localizes faulty

1035: processes, so that it visualizes 34 of 171 events as shown in Figure

1036: \ref{fig:sppc}(a). Since eight of 34 events are successfully communicated

1037: events, MPI-PD reduces the number of events that have to be

1038: investigated from 171 to 26 events. Furthermore, it points out that

1039: processes PE5 and PE10 fall into a deadlock. Here, processes PE5 and

1040: PE10 have three and seven error events, respectively, so that the

1041: number of events that have to be investigated is reduced further from

1042: 171 to 10 events.

1043:
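The reduction reported in this study can be restated as simple arithmetic, using only the counts given above:
\[
171 \;\rightarrow\; 34 \;\rightarrow\; 34 - 8 = 26 \;\rightarrow\; 3 + 7 = 10,
\]
that is, $10/171 \approx 5.8\%$ of the nonblocking operations remain to be investigated by hand.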

1044: With the assistance of MPI-PD, the developer successfully debugged

1045: this program in less than five minutes. He first investigated process PE5

1046: and confirmed that it had no fault, and then investigated process

1047: PE10. Finally, he found the fault: an invalid source was

1048: specified in an {\tt Irecv} call.

1049:

1050: \begin{figure}[ht]

1051: \centering

1052: \includegraphics[width=6.0cm]{eps/sppc15.eps}

1053: \qquad

1054: \includegraphics[width=6.0cm]{eps/mqgL.eps}

1055:

1056: \hspace{1em}(a) Event Graph by MPI-PD \hspace{6em}(b) Message Queue Graph by TotalView

1057: \caption{Visualizations obtained by MPI-PD and TotalView.}

1058: \label{fig:sppc}

1059: \end{figure}

1060:

1061: \begin{table*}[tb]

1062: \caption{Differences among MPI-PD, TotalView, and DeWiz.}

1063: \label{tab:difference}

1064: \begin{center}

1065: \begin{tabular}{|l|c|c|c|} \hline

1066: Function                        & MPI-PD & DeWiz \cite{kran02pdp,kran02ipdps} & TotalView \cite{etnus01tv}\\ \hline

1067: 1. Faulty process localization & by dependency analysis &   --- & ---\\ \hline

1068: 2. Run-time error detection     & every message & every message & every message\\ \hline

1069: 3. Process grouping             & by dependency analysis & by message length & ---\\ \hline

1070: 4. Timeline visualization       &    yes &   yes & ---\\ \hline

1071: 5. Trace file reduction         &    --- &   yes & ---\\ \hline

1072: 6. Stepwise execution           &    --- &   --- & yes\\ \hline

1073: \end{tabular}

1074: \end{center}

1075: \end{table*}

1076:

1077: Table \ref{tab:difference} summarizes the differences among MPI-PD,

1078: TotalView, and DeWiz \cite{kran02pdp,kran02ipdps}. While MPI-PD is

1079: useful for reducing the events that have to be investigated, TotalView allows

1080: us to execute the target program step by step. DeWiz also provides an

1081: analysis using the event graph. However, DeWiz aims at

1082: identifying closely related processes and reducing the total amount of

1083: trace data. In DeWiz, given a specific process, the process

1084: grouping function accumulates the length of transmitted messages for

1085: every pair of processes and isolates related processes by using a

1086: certain threshold. Therefore, developers have to decide which

1087: processes have to be specified, which is similar to the problem addressed in

1088: this paper. Furthermore, since error propagation has no relevance to message

1089: length, their message length based approach is inappropriate for the

1090: purpose of faulty process localization.

1091:
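To make the contrast concrete, the following Python sketch mimics grouping by accumulated message length as characterized above. It is not DeWiz's implementation or API: the function name, the trace format, and the threshold value are assumptions chosen only to illustrate why message length is a poor proxy for error propagation.

```python
from collections import defaultdict


def related_processes(messages, target, threshold):
    """Group processes around `target` by accumulated message length.

    messages: iterable of (src_rank, dst_rank, length_in_bytes).
    Returns the ranks whose total traffic with `target` reaches `threshold`.
    """
    volume = defaultdict(int)
    for src, dst, length in messages:
        volume[frozenset((src, dst))] += length  # undirected process pair
    related = set()
    for pair, total in volume.items():
        if target in pair and len(pair) == 2 and total >= threshold:
            (other,) = pair - {target}
            related.add(other)
    return sorted(related)


trace = [(0, 1, 4096), (1, 0, 4096), (0, 2, 8), (2, 3, 65536)]
group = related_processes(trace, target=0, threshold=1024)
```

In this hypothetical trace, rank 2 exchanged only eight bytes with rank 0 and is therefore excluded from the group, even though a short message can propagate an error just as well as a long one.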

1092: Summarizing the above discussions, DeWiz is useful to reduce the

1093: total amount of trace files and TotalView is useful to investigate

1094: the detailed behavior of programs. MPI-PD is useful to reduce the

1095: number of events that have to be investigated for

1096: debugging. Therefore, we think that appropriate combined use of these

1097: tools is a good choice for debugging message passing programs.

1098: For example, developers can first localize faulty processes by using MPI-PD

1099: and then investigate them in detail by using TotalView.

1100:

1101:

1102: \section{Conclusions}

1103: \label{sec:conclusions}

1104:  We have presented a novel debugging tool, named MPI-PD, for localizing

1105:  faulty processes in message passing programs, aiming at reducing

1106:  developers' efforts. MPI-PD helps us to identify the source of

1107:  failure from a number of observed errors by automatically checking

1108:  communication errors during program execution. If MPI-PD observes any

1109:  communication errors, it then generates a trace file, backtraces

1110:  communication dependencies and points out potentially faulty

1111:  processes in the event graph visualization.

1112:

1113:  MPI-PD reduces the amount of debugging information before visualizing

1114:  and investigating it by using post-mortem performance debuggers and

1115:  source-level debuggers, respectively.

1116:  Therefore, we think that appropriate combined use of these tools

1117:  is a good choice for debugging message passing programs.

1118:

1119:  \section*{Acknowledgements}

1120:  This work was partly supported by JSPS Grant-in-Aid for Young

1121:  Researchers (B)(15700030), for Scientific

1122:  Research (C)(2)(14580374), JSPS Research for the Future Program

1123:  JSPS-RFTF99I00903, and Network Development Laboratories, NEC.

1124:  We are also grateful to the anonymous reviewers

1125:  for their valuable comments.

1126:

1127: \bibliographystyle{plain}

1128: \bibliography{main}

1129:

1130: \end{document}

1131: