
1: \documentclass{tlp}

2: \usepackage{epsf}

3: \usepackage{isolatin1}

4: \usepackage{multicol}

5: \usepackage{amssymb}

6:

7:

8:

9: \title[On Applying Or-Parallelism and Tabling to Logic Programs]

10:       {On Applying Or-Parallelism and Tabling to Logic Programs}

11:

12: \author[R. Rocha, F. Silva and V. Santos Costa]

13:        {RICARDO ROCHA, FERNANDO SILVA\\

14:          DCC-FC \& LIACC\\

15:          Universidade do Porto, Portugal\\

16:          \email{\{ricroc,fds\}@ncc.up.pt}

17:        \and VITOR SANTOS COSTA\\

18:          COPPE Systems \& LIACC\\

19:          Universidade do Rio de Janeiro, Brasil\\

20:          \email{vitor@cos.ufrj.br}

21:        }

22:

23:

24:

25: \begin{document}

26: \maketitle

27:

28: \begin{abstract}

29:   Logic Programming languages, such as Prolog, provide a high-level,

30:   declarative approach to programming. Logic Programming offers great

31:   potential for implicit parallelism, thus allowing parallel systems

32:   to often reduce a program's execution time without programmer

33:   intervention. We believe that for complex applications that take

34:   several hours, if not days, to return an answer, even limited

35:   speedups from parallel execution can directly translate to very

36:   significant productivity gains.

37:

38:   It has been argued that Prolog's evaluation strategy --~SLD

39:   resolution~-- often limits the potential of the logic programming

40:   paradigm. The past years have therefore seen widening efforts at

41:   increasing Prolog's declarativeness and expressiveness. Tabling has

42:   proved to be a viable technique to efficiently overcome SLD's

43:   susceptibility to infinite loops and redundant subcomputations.

44:

45:   Our research demonstrates that implicit or-parallelism is a natural

46:   fit for logic programs with tabling. To substantiate this belief, we

47:   have designed and implemented an or-parallel tabling engine

48:   --~OPTYap~-- and we used a shared-memory parallel machine to

49:   evaluate its performance. To the best of our knowledge, OPTYap is

50:   the first implementation of a parallel tabling engine for logic

51:   programming systems. OPTYap builds on Yap's efficient sequential

52:   Prolog engine.  Its execution model is based on the SLG-WAM for

53:   tabling, and on the environment copying for or-parallelism.

54:

55:   Preliminary results indicate that the mechanisms proposed to

56:   parallelize search in the context of SLD resolution can indeed be

57:   effectively and naturally generalized to parallelize tabled

58:   computations, and that the resulting systems can achieve good

59:   performance on shared-memory parallel machines. More importantly, it

60:   emphasizes our belief that through applying or-parallelism and

61:   tabling to logic programs the range of applications for Logic

62:   Programming can be increased.

63: \end{abstract}

64:

65: \begin{keywords}

66: Or-Parallelism, Tabling, Implementation, Performance.

67: \end{keywords}

68:

69:

70:

71: \section{Introduction}

72:

73: Logic programming provides a high-level, declarative approach to

74: programming. Arguably, Prolog is the most popular and powerful logic

75: programming language. Prolog's popularity was sparked by the success

76: of the sequential execution model presented in 1983 by David H. D.

77: Warren, the \emph{Warren Abstract Machine}

78: (\emph{WAM})~\cite{Warren-83}. Throughout its history, Prolog has

79: demonstrated the potential of logic programming in application areas

80: such as Artificial Intelligence, Natural Language Processing,

81: Knowledge Based Systems, Machine Learning, Database Management, or

82: Expert Systems.

83:

84: Logic programs are written in a subset of First-Order Logic, Horn

85: clauses, that has an intuitive interpretation as positive facts and as

86: rules. Programs use the logic to express the problem, whilst questions

87: are answered by a resolution procedure with the aid of user

88: annotations. The combination was summarized by Kowalski's

89: motto~\cite{Kowalski-79}:

90: \[algorithm~=~logic~+~control\]

91: Ideally, one would want Prolog programs to be written as logical

92: statements first, and for control to be tackled as a separate issue.

93: In practice, the limitations of Prolog's operational semantics, SLD

94: resolution, mean that Prolog programmers must be concerned with SLD

95: semantics throughout program development.
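
As a small illustration of why control cannot be ignored in practice,
consider the following sketch. The predicates \texttt{edge/2},
\texttt{reach\_r/2} and \texttt{reach\_l/2} are hypothetical names
introduced here only for this example; both reachability definitions
have the same declarative reading, yet they behave differently under
SLD resolution.

\begin{verbatim}
% A small acyclic graph (illustrative data).
edge(a, b).
edge(b, c).

% Right-recursive definition: under SLD resolution every
% query on this acyclic graph terminates.
reach_r(X, Z) :- edge(X, Z).
reach_r(X, Z) :- edge(X, Y), reach_r(Y, Z).

% Left-recursive definition: same logical meaning, but the
% recursive call repeats the original goal, so SLD resolution
% recurses without bound once the finite answers are exhausted.
reach_l(X, Z) :- edge(X, Z).
reach_l(X, Z) :- reach_l(X, Y), edge(Y, Z).
\end{verbatim}

Under standard Prolog, the query \texttt{?- reach\_r(a,Z)} enumerates
\texttt{Z=b} and \texttt{Z=c} and then fails finitely, whereas
\texttt{?- reach\_l(a,Z)} produces the same two answers and then
recurses without bound when asked for more.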

96:

97: Several proposals have been put forth to overcome some of these

98: limitations and therefore improve the declarativeness and

99: expressiveness of Prolog. One such proposal that has been gaining in

100: popularity is \emph{tabling}, also referred to as \emph{tabulation} or

101: \emph{memoing}~\cite{Michie-68}. In a nutshell, tabling consists of

102: storing intermediate answers for subgoals so that they can be reused

103: when a repeated subgoal appears during the resolution process. It can

104: be shown that tabling based execution models, such as SLG

105: resolution~\cite{Chen-96}, are able to reduce the search space, avoid

106: looping, and that they have better termination properties than SLD

107: based models. For instance, SLG resolution is guaranteed to terminate

108: for all logic programs with the \emph{bounded term-size

109: property}~\cite{Chen-96}.

110:

111: Work on SLG resolution, as implemented in the XSB logic programming

112: system~\cite{xsb}, proved the viability of tabling technology for

113: applications such as Natural Language Processing, Knowledge Based

114: Systems and Data Cleaning, Model Checking, and Program Analysis. SLG

115: resolution also includes several extensions to Prolog, namely support

116: for negation~\cite{Apt-94}, hence allowing for novel applications in

117: the areas of Non-Monotonic Reasoning and Deductive Databases.

118:

119: One of the major advantages of logic programming is that it is well

120: suited for parallel execution. The interest in the parallel execution

121: of logic programs mainly arose from the fact that parallelism can be

122: exploited \emph{implicitly} from logic programs. This means that

123: parallelism can be automatically exploited, that is, without input

124: from the programmer to express or manage parallelism, ideally making

125: parallel logic programming as easy as logic programming.

126:

127: Logic programming offers two major forms of implicit parallelism,

128: \emph{Or-Parallelism} and \emph{And-Parallelism}. Or-parallelism

129: results from the parallel execution of alternative clauses for a given

130: predicate goal, while and-parallelism stems from the parallel

131: evaluation of subgoals in an alternative clause. Some of the most

132: well-known systems that successfully supported these forms of

133: parallelism are: Aurora~\cite{Aurora-88} and Muse~\cite{Ali-90a} for

134: or-parallelism; \&-Prolog~\cite{Hermenegildo-91},

135: DASWAM~\cite{Shen-92}, and ACE~\cite{Pontelli-97} for and-parallelism;

136: and Andorra-I~\cite{Costa-91} for or-parallelism together with

137: and-parallelism. A detailed presentation of such systems and the

138: challenges and problems in their implementation can be found

139: in~\cite{Gupta-01}. Arguably, or-parallel systems have been the most

140: successful parallel logic programming systems so far.  Experience has

141: shown that or-parallel systems can obtain very good speedups for

142: applications that require search.  Examples can be found in

143: application areas such as Parsing, Optimization, Structured Database

144: Querying, Expert Systems and Knowledge Discovery applications.
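
To make the two forms of parallelism concrete, the following sketch
marks where each one could arise in an ordinary program. All predicate
names and facts are hypothetical and chosen only for illustration; the
program itself is plain sequential Prolog.

\begin{verbatim}
% Illustrative facts.
by_train(porto, lisbon).
by_bus(porto, braga).
by_plane(porto, rio).
hotel(lisbon, central).

% Or-parallelism: route/2 has three alternative clauses. An
% or-parallel system may explore them simultaneously, each
% alternative being taken by a different worker.
route(X, Y) :- by_train(X, Y).
route(X, Y) :- by_bus(X, Y).
route(X, Y) :- by_plane(X, Y).

% And-parallelism: the body of trip/3 has two subgoals. If
% trip/3 is called with X and Y bound, the subgoals share no
% unbound variables and could be evaluated simultaneously.
trip(X, Y, H) :- route(X, Y), hotel(Y, H).
\end{verbatim}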

145:

146: The good results obtained with parallelism and with tabling raise the

147: question of whether further efficiency improvements may be achievable

148: through parallelism. Freire and colleagues were the first to research

149: this area~\cite{Freire-95}. Although tabling works for both

150: deterministic and non-deterministic applications, Freire focused on

151: the search process, because tabling has frequently been used to reduce

152: the search space. In their model, each tabled subgoal is computed

153: independently in a separate computational thread, a \emph{generator

154: thread}. Each generator thread is solely responsible for fully

155: exploiting its subgoal and obtaining the complete set of answers.

156: Arguably, Freire's model will work particularly well if we have many

157: non-deterministic generators. On the other hand, it will not exploit

158: parallelism if there is a single generator and many non-tabled

159: subgoals. It also does not exploit parallelism between a generator's

160: clauses. As we discuss in Section~\ref{section_performance_analysis},

161: experience has shown that interesting applications do indeed have a

162: limited number of generators.

163:

164: Ideally, we would like to exploit maximum parallelism and take maximum

165: advantage of current technology for tabling and parallel systems. To

166: exploit maximum parallelism, we would like to exploit parallelism from

167: both tabled and non-tabled subgoals. Further, we would like to reuse

168: existing technology for tabling and parallelism. As such, we would

169: like to exploit parallelism from tabled and non-tabled subgoals

170: \emph{in much the same way}. As with Freire, we focus on

171: or-parallelism first, and we will focus throughout on shared-memory

172: platforms.

173:

174: Towards this goal, we proposed two new computational

175: models~\cite{Rocha-99a}, \emph{Or-Parallelism within Tabling}

176: (\emph{OPT}) and \emph{Tabling within Or-Parallelism} (\emph{TOP}).

177: Both models are based on the idea that all open alternatives in the

178: search tree should be amenable to parallel exploitation, be they from

179: tabled or non-tabled subgoals. The OPT model further assumes tabling

180: as the base component of the parallel system, that is, each

181: \emph{worker}\footnote{The term \emph{worker} is widely used in the

182: literature to refer to each computational unit contributing to the

183: parallel execution.} is a full sequential tabling engine. OPT

184: triggers or-parallelism when workers run out of alternatives to

185: exploit: at this point, a worker will share part of its SLG

186: derivations with the other. In contrast, the TOP model represents the

187: whole SLG forest as a shared search tree, thus unifying parallelism

188: with tabling. Workers are logically positioned at branches in this

189: tree. When a branch completes or suspends, workers move to nodes with

190: open alternatives, that is, alternatives with either open

191: clauses or new answers stored in the table.

192:

193: The main contribution of this work is the design and performance

194: evaluation of what to the best of our knowledge is the first parallel

195: tabling logic programming system, OPTYap~\cite{Rocha-01}. We chose the

196: OPT model for two main advantages, both stemming from the fact that

197: OPT encapsulates or-parallelism within tabling. First, implementation

198: of the OPT model follows naturally from two well-understood

199: implementation issues: we need to implement a tabling engine, and then

200: we need to support or-parallelism. Second, in the OPT model a worker

201: can keep its nodes \emph{private} until reaching a sharing point. This

202: is a key issue in reducing parallel overheads. We remark that it is

203: common in or-parallel works to say that work is initially

204: \emph{private}, and that it is made \emph{public} after sharing.

205:

206: OPTYap builds on the YapOr~\cite{Rocha-99b} and YapTab~\cite{Rocha-00}

207: engines. YapOr was previous work on supporting or-parallelism over

208: Yap's Prolog system~\cite{Costa-99b}. YapOr is based on the

209: environment copying model for shared-memory machines, as originally

210: implemented in Muse~\cite{Ali-90b}. YapTab is a sequential tabling

211: engine that extends Yap's execution model to support tabled evaluation

212: for definite programs. YapTab's implementation is largely based on the

213: ground-breaking design of the XSB system~\cite{Sagonas-94,Rao-97},

214: which implements the SLG-WAM~\cite{Swift-94b,Sagonas-96,Sagonas-98}.

215: YapTab has been designed from scratch and its development was done

216: taking into account the major purpose of further integration to

217: achieve an efficient parallel tabling computational model, whilst

218: comparing favorably with current \emph{state of the art} technology.

219: In other words, we aim at respecting the \emph{no-slowdown

220: principle}~\cite{Hermenegildo-Phd}: our or-parallel tabling system

221: should, when executed with a single worker, run as fast as or faster

222: than the currently available sequential tabling systems. Otherwise,

223: parallel performance results would not be significant and fair.

224:

225: In order to validate our design we studied in detail the performance

226: of OPTYap on shared-memory machines with up to 32 workers. The results we

227: gathered show that OPTYap does indeed introduce low overheads for

228: sequential execution and that it compares favorably with current

229: versions of XSB.  Furthermore, the results show that OPTYap maintains

230: YapOr's speedups for parallel execution of non-tabled programs, and

231: that there are tabled applications that can achieve very high

232: performance through parallelism. This substantiates our belief that

233: tabling and parallelism can together contribute to increasing the

234: range of applications for Logic Programming.

235:

236:

237:

238: \section{Tabling for Logic Programs}

239:

240: The basic idea behind tabling is straightforward: programs are

241: evaluated by storing newly found answers of current subgoals in an

242: appropriate data space, called the \emph{table space}. The method then

243: uses this table to verify whether calls to subgoals are repeated.

244: Whenever such a repeated call is found, the subgoal's answers are

245: recalled from the table instead of being re-evaluated against the

246: program clauses. In practice, two major issues have to be addressed:

247:

248: \begin{enumerate}

249: \item What is a repeated subgoal? We may say that a subgoal repeats if

250:   it is the same as a previous subgoal, up to variable renaming;

251:   alternatively, we may say it is repeated if it is an instance of a

252:   previous subgoal. The former approach is known as

253:   \emph{variant-based tabling}~\cite{Ramakrishnan-99}, the latter as

254:   \emph{subsumption-based tabling}~\cite{Rao-96}. Variant-based

255:   tabling has been researched first and is arguably better understood,

256:   although there has been significant recent progress in

257:   subsumption-based tabling~\cite{Johnson-99}. We shall use

258:   the variant-based tabling approach in this work.

259: \item How to execute subgoals? Clearly, we must change the selection

260:   function and search rule to accommodate repeated subgoals. In

261:   particular, we must address the situation where we recursively call

262:   a tabled subgoal before we have fully tabled all its

263:   answers. Several strategies to do so have been

264:   proposed~\cite{Tamaki-86,Vieille-89,Chen-96}. We use the popular SLG

265:   resolution~\cite{Chen-96} in this work, mainly because this approach

266:   has good termination properties.

267: \end{enumerate}
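
The difference between the two notions of a repeated subgoal can be
sketched with two small checks. The predicate names
\texttt{is\_variant/2} and \texttt{is\_instance/2} are hypothetical;
the sketch assumes a Prolog system that provides the \texttt{=@=/2}
variant test (as SWI-Prolog and Yap do) and the ISO built-in
\texttt{subsumes\_term/2}.

\begin{verbatim}
% Variant-based test: the two subgoals are identical up to
% variable renaming.
is_variant(G1, G2) :- G1 =@= G2.

% Subsumption-based test: New is an instance of the
% previously seen subgoal Old.
is_instance(New, Old) :- subsumes_term(Old, New).
\end{verbatim}

With these definitions, \texttt{is\_variant(path(a,Z), path(a,W))}
succeeds and \texttt{is\_variant(path(a,Z), path(X,Z))} fails, while
\texttt{is\_instance(path(a,W), path(X,Z))} succeeds: a call to
\texttt{path(a,W)} would count as repeated under subsumption-based
tabling once \texttt{path(X,Z)} has been called, but not under
variant-based tabling.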

268:

269: In the following, we illustrate the main principles of tabled

270: evaluation using SLG resolution through an example.

271:

272:

273:

274: \subsection{Tabled Evaluation}

275:

276: Consider the Prolog program of Figure~\ref{fig_finite_SLG_tree}. The

277: program defines a small directed graph, represented by the

278: \texttt{arc/2} predicate, with a relation of reachability, given by

279: the \texttt{path/2} predicate. In this example we ask the query goal

280: \texttt{?- path(a,Z)} on this program. Note that traditional Prolog

281: would immediately enter an infinite loop because the first clause of

282: \texttt{path/2} leads to a repeated call to \texttt{path(a,Z)}. In

283: contrast, if tabling is applied then termination is ensured. The

284: declaration \texttt{:- table path/2} in the program code indicates

285: that predicate \texttt{path/2} should be tabled.

286: Figure~\ref{fig_finite_SLG_tree} illustrates the evaluation sequence

287: when using tabling.
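
For readers without access to the figure, the following is a sketch of
a program consistent with the evaluation steps described below; the
exact clause text in the figure may differ. The \texttt{arc/2} facts
are inferred from which resolution steps succeed or fail in the
walkthrough, and the code assumes a system that accepts the
\texttt{:- table} directive in this form.

\begin{verbatim}
:- table path/2.

% Reachability: the first clause calls path/2 recursively
% before anything else, which is what makes plain SLD loop.
path(X, Z) :- path(X, Y), path(Y, Z).
path(X, Z) :- arc(X, Z).

% A small directed graph with a cycle between b and c.
arc(a, b).
arc(b, c).
arc(c, b).
\end{verbatim}

With tabling enabled, a query such as
\texttt{?- findall(Z, path(a,Z), L0), sort(L0, L)} should terminate
with \texttt{L = [b,c]}. The order in which individual answers are
returned depends on the scheduling strategy, which is why the sketch
sorts them.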

288:

289: \begin{figure}[!ht]

290: \centerline{

291: \epsfxsize=12cm

292: \epsffile{finite_SLG_tree.eps}

293: }

294: \caption{A finite tabled evaluation.}

295: \label{fig_finite_SLG_tree}

296: \end{figure}

297:

298: At the top, the figure illustrates the program code and the state of

299: the table space at the end of the evaluation. The main sub-figure

300: shows the forest of SLG trees for the original query. The topmost tree

301: represents the original invocation of the tabled subgoal

302: \texttt{path(a,Z)}. It thus computes all nodes reachable from node

303: \texttt{a}. As we shall see, computing all nodes reachable from

304: \texttt{a} requires computing all nodes reachable from \texttt{b} and

305: all nodes reachable from \texttt{c}. The middle tree represents the

306: SLG tree \texttt{path(b,Z)}, that is, it computes all nodes reachable

307: from node \texttt{b}. The bottommost tree represents the SLG tree

308: \texttt{path(c,Z)}.

309:

310: Next, we describe in detail the evaluation sequence presented in the

311: figure. For simplicity of presentation, the root nodes of the SLG

312: trees \texttt{path(b,Z)} and \texttt{path(c,Z)}, nodes $6$ and $13$,

313: are shown twice. The numbering of nodes denotes the evaluation

314: sequence.

315:

316: Whenever a tabled subgoal is first called, a new tree is added to the

317: forest of trees and a new entry is added to the table space. We name

318: first calls to tabled subgoals \emph{generator nodes} (nodes depicted

319: by white oval boxes). In this case, execution starts with a generator

320: node, node $0$. The evaluation thus begins by creating a new tree

321: rooted by \texttt{path(a,Z)} and by inserting a new entry in the table

322: space for it.

323:

324: The second step is to resolve \texttt{path(a,Z)} against the first

325: clause for \texttt{path/2}, creating node $1$. Node $1$ is a variant

326: call to \texttt{path(a,Z)}. We do not resolve the subgoal against the

327: program at these nodes, instead we consume answers from the table

328: space. Such nodes are thus called \emph{consumer nodes} (nodes

329: depicted by gray oval boxes). At this point, the table does not have

330: answers for this call. The consumer therefore must \emph{suspend},

331: either by freezing the whole stacks~\cite{Sagonas-98}, or by copying

332: the stacks to separate storage~\cite{Demoen-00}.

333:

334: The only possible move after suspending is to backtrack to node $0$.

335: We then try the second clause to \texttt{path/2}, thus calling

336: \texttt{arc(a,Z)}. The \texttt{arc/2} predicate is not tabled, hence

337: it must be resolved against the program, as Prolog would. We name such

338: nodes \emph{interior nodes}. The first clause for \texttt{arc/2}

339: immediately succeeds (step $3$). We return to the context for the

340: original goal, obtaining an answer for \texttt{path(a,Z)}, and store

341: the answer \texttt{Z=b} in the table.

342:

343: We can now choose between two options. We may backtrack and try the

344: alternative clauses for \texttt{arc/2}. Otherwise, we may suspend the

345: current execution, and resume node $1$ with the newly found answer. We

346: decide to continue exploiting the interior node. Both steps $4$ and

347: $5$ fail, so we backtrack to node $0$. Node $0$ has no more clauses

348: left to try, so we try to check whether it has \emph{completed}. It

349: has not, as node $1$ has not consumed all its answers. We therefore

350: must resume node $1$. The stacks are thus restored to their state at

351: node $1$, and the answer \texttt{Z=b} is forwarded to this node. The

352: subgoal succeeds trivially and we call the continuation,

353: \texttt{path(b,Z)}. This is the first call to \texttt{path(b,Z)}, so

354: we must create a new tree rooted by \texttt{path(b,Z)} (node $6$),

355: insert a new entry in the table space for it, and proceed with the

356: evaluation of \texttt{path(b,Z)}, as shown in the middle tree.

357:

358: Again, \texttt{path(b,Z)} calls itself recursively, and suspends at

359: node $7$. We now have two consumers, node $1$ and node $7$. The only

360: answer in the table was already consumed, so we have to backtrack to

361: node $6$. This leads to generating a new interior node (node $8$) and

362: consulting the program for clauses to \texttt{arc(b,Z)}. The first

363: clause fails (step $9$), but the second clause matches (step $10$).

364: The answer is returned to node $6$ and stored in the table. We next

365: have three choices: continue forward execution, backtrack to the open

366: interior node, or resume the consumer node $7$. In the example we

367: choose to follow a Prolog-like strategy and continue forward

368: execution. Step $11$ thus returns the binding \texttt{Z=c} to the

369: subgoal \texttt{path(a,Z)}. We store this answer in

370: \texttt{path(a,Z)}'s table entry.

371:

372: This will be the last answer to \texttt{path(a,Z)}, but we can only

373: prove so after fully exploiting the tree: we still have an open

374: interior node (node $8$), and two suspended consumers (nodes $1$ and

375: $7$). We now choose to backtrack to node $8$, and exploit the last

376: clause for \texttt{arc/2} (step $12$). At this point we fail all the

377: way back to node $6$. We cannot complete node $6$ yet, as we have an

378: unfinished consumer below (node $7$). The only answer in the table for

379: this consumer is \texttt{Z=c}. We use this answer and obtain a first

380: call to \texttt{path(c,Z)}.

381:

382: The new generator, node $13$, needs a new table. Again, we try the

383: first clause and suspend on the recursive call (node $14$). Next, we

384: backtrack to the second clause. Resolution on \texttt{arc(c,Z)} (node

385: $15$) fails twice (steps $16$ and $17$), and then generates an answer,

386: \texttt{Z=b} (step $18$). We return the answer to node $13$, and store

387: the answer in the table. Again, we choose to continue forward

388: execution, thus finding a new answer to \texttt{path(b,Z)}, which is

389: again stored in the table (step $19$). Next, we continue forward

390: execution (step $20$), and find an answer to \texttt{path(a,Z)},

391: \texttt{Z=b}. This answer had already been found at step $3$. SLG

392: resolution does not store duplicate answers in the table. Instead,

393: repeated answers \emph{fail}. This is how the SLG-WAM avoids

394: unnecessary computations, and even looping in some cases.

395:

396: What to do next? We do not have interior nodes to exploit, so we

397: backtrack to generator node $13$. The generator cannot complete

398: because it has a consumer below (node $14$). We thus try to complete

399: by sending answers to consumer node $14$. The first answer,

400: \texttt{Z=b}, leads to a new consumer for \texttt{path(b,Z)} (node

401: $21$). The table has two answers for \texttt{path(b,Z)}, so we can

402: continue the consumer immediately. This gives a new answer

403: \texttt{Z=c} to \texttt{path(c,Z)}, which is stored in the table

404: (step $22$). Continuing forward execution results in the answer

405: \texttt{Z=c} to \texttt{path(b,Z)} (step $23$). This answer repeats

406: what we found in step $10$, so we must fail at this point.

407: Backtracking sends us back to consumer node $21$. We then consume the

408: second answer for \texttt{path(b,Z)}, which generates a repeated

409: answer, so we fail again (step $24$). We then try consumer node $14$.

410: It next consumes the second answer, again leading to repeated

411: subgoals, as shown in steps $25$ to $27$. At this point we fail back

412: to node $13$, which makes sure that all answers to the consumers below

413: (nodes $14$, $21$, and $25$) have been tried. Unfortunately, node $13$

414: cannot complete, because it depends on subgoal \texttt{path(b,Z)}

415: (node $21$). Completing \texttt{path(c,Z)} earlier is not safe because

416: we can lose answers. Note that, at this point, new answers can still

417: be found for subgoal \texttt{path(b,Z)}. If new answers are found,

418: consumer node $21$ should be resumed with the newly found answers,

419: which in turn can lead to new answers for subgoal \texttt{path(c,Z)}.

420: If we complete sooner, we can lose such answers.

421:

422: Execution thus backtracks and we try the answer left for consumer node

423: $7$. Steps $28$ to $30$ show that again we only get repeated answers.

424: We fail and return to node $6$. All nodes in the trees for node $6$

425: and node $13$ have been exploited. As these trees do not depend on any

426: other tree, we are sure no more answers are forthcoming, so at last

427: step $31$ declares the two trees to be complete, and closes the

428: corresponding table entries.

429:

430: Next we backtrack to consumer node $1$. We had not tried \texttt{Z=c}

431: on this node, but exploiting this answer leads to no further answers

432: (steps $32$ to $34$). The computation has thus fully exploited every

433: node, and we can complete the remaining table entry (step $35$).
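
As a summary of the walkthrough, the final contents of the table space
can be reconstructed from the steps at which each new answer was
inserted. The layout below is only a sketch derived from the text, not
a reproduction of the figure.

\begin{center}
\begin{tabular}{lll}
Subgoal & Answers (insertion order) & Inserted at steps \\
\texttt{path(a,Z)} & \texttt{Z=b}, \texttt{Z=c} & $3$, $11$ \\
\texttt{path(b,Z)} & \texttt{Z=c}, \texttt{Z=b} & $10$, $19$ \\
\texttt{path(c,Z)} & \texttt{Z=b}, \texttt{Z=c} & $18$, $22$ \\
\end{tabular}
\end{center}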

434:

435:

436:

437: \subsection{SLG-WAM Operations}

438:

439: The example showed four new main operations: entering a tabled subgoal;

440: adding a new answer to a generator; exporting an answer from the

441: table; and trying to complete the tree. In more detail:

442:

443: \begin{enumerate}

444: \item The \emph{tabled subgoal call} operation is a call to a tabled

445:   subgoal. It checks if a subgoal is in the table, and if not, adds a

446:   new entry for it and allocates a new generator node (nodes $0$, $6$

447:   and $13$). Otherwise, it allocates a consumer node and starts

448:   consuming the available answers (nodes $1$, $7$, $14$, $21$, $25$,

449:   $28$ and $32$).

450: \item The \emph{new answer} operation returns a new answer to a

451:   generator. It verifies whether a newly generated answer is already

452:   in the table, and if not, inserts it (steps $3$, $10$, $11$, $18$,

453:   $19$ and $22$). Otherwise, it fails (steps $20$, $23$, $24$, $26$,

454:   $27$, $29$, $30$, $33$, and $34$).

455: \item The \emph{answer resolution} operation forwards answers from

456:   the table to a consumer node. It verifies whether newly found

457:   answers are available for a particular consumer node and, if any,

458:   consumes the next one. Otherwise, it schedules a possible resolution

459:   to continue the execution. Answers are consumed in the same order

460:   they are inserted in the table. The answer resolution operation is

461:   executed every time the computation reaches a consumer node.

462: \item The \emph{completion} operation determines whether a tabled

463:   subgoal is \emph{completely evaluated}. It executes when we

464:   backtrack to a generator node and all of its clauses have been

465:   tried. If the subgoal has been completely evaluated, the operation

466:   closes its table entry and reclaims space (steps $31$ and $35$).

467:   Otherwise, it schedules a possible resolution to continue the

468:   execution.

469: \end{enumerate}
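
The observable behavior of the \emph{new answer} and \emph{answer
resolution} operations can be sketched in a few lines of Prolog. This
is only an illustration under strong simplifying assumptions: the
predicates \texttt{table\_answer/2}, \texttt{new\_answer/2} and
\texttt{consume/1} are hypothetical, answers are kept in the clause
database rather than in a dedicated table space, and suspension,
scheduling and completion are not modeled. The sketch assumes the
\texttt{=@=/2} variant test is available.

\begin{verbatim}
% table_answer(Subgoal, Answer): hypothetical answer store.
:- dynamic table_answer/2.

% new_answer(+Subgoal, +Answer): insert Answer for Subgoal
% unless a variant is already stored; repeated answers fail.
new_answer(Subgoal, Answer) :-
    \+ ( table_answer(S, A), S-A =@= Subgoal-Answer ),
    assertz(table_answer(Subgoal, Answer)).

% consume(?Subgoal): bind Subgoal to each stored answer of a
% variant subgoal, in the order the answers were inserted.
consume(Subgoal) :-
    table_answer(S, A),
    S =@= Subgoal,
    Subgoal = A.
\end{verbatim}

For example, \texttt{new\_answer(path(a,\_), path(a,b))} succeeds the
first time and fails when repeated, mirroring the way repeated answers
fail in the example; afterwards \texttt{consume(path(a,Z))} binds
\texttt{Z} to \texttt{b}.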

470:

471: The example also shows that we have some latitude on where and when to

472: apply these operations. The actual sequence of operations thus depends

473: on a \emph{scheduling strategy}. We next discuss the main principles

474: for completion and scheduling strategies in some more detail.

475:

476:

477:

478: \subsection{Completion}

479:

480: Completion is needed in order to recover space and to support

481: negation. We are most interested in space recovery in this work.

482: Arguably, in this case we could delay completion until the very end of

483: execution. Unfortunately, doing so would also mean that we could only

484: recover space for suspended (consumer) subgoals at the very end of the

485: execution. Instead we shall try to achieve \emph{incremental

486: completion}~\cite{Chen-95} to detect whether a generator node has been

487: fully exploited, and if so to recover space for all its consumers.

488:

489: Completion is hard because a number of generators may be mutually

490: dependent. Figure~\ref{fig_graph_dependencies} shows the dependencies

491: for the completed graph. Node $0$ depends on itself recursively

492: through consumer node $1$, and on generator node $6$. Node $6$ depends

493: on itself, consumer nodes $7$ and $28$, and on node $13$. Node $13$

494: also depends on itself, consumer nodes $14$ and $25$, and on node $6$

495: through consumer node $21$. There is thus a loop between nodes $6$ and

496: $13$: if we find a new answer for node $6$, we may get new answers for

497: node $13$, and so for node $6$.

498:

499: \begin{figure}[!ht]

500: \centerline{

501: \epsfxsize=7cm

502: \epsffile{graph_dependencies.eps}

503: }

504: \caption{Node dependencies for the completed graph.}

505: \label{fig_graph_dependencies}

506: \end{figure}

507:

508: In general, a set of mutually dependent subgoals forms a

509: \emph{Strongly Connected Component} (or

510: \emph{SCC})~\cite{Tarjan-72}. Clearly, we can only complete SCCs

511: together. We will usually represent an SCC through the oldest

512: generator. More precisely, the youngest generator node which does not

513: depend on older generators is called the \emph{leader node}. A leader

514: node is also the oldest node for its SCC, and defines the current

515: completion point.
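
The dependencies listed above can be encoded directly, and tabling
itself can be used to check which generators belong to the same SCC.
The following is a sketch only: the atoms \texttt{n0}, \texttt{n6} and
\texttt{n13} stand for generator nodes $0$, $6$ and $13$, and the
predicates \texttt{depends/2}, \texttt{reaches/2} and
\texttt{same\_scc/2} are hypothetical names. It is not the mechanism
used by any tabling engine to detect completion.

\begin{verbatim}
:- table reaches/2.

% depends(G1, G2): generator G1 depends on generator G2.
depends(n0, n0).
depends(n0, n6).
depends(n6, n6).
depends(n6, n13).
depends(n13, n13).
depends(n13, n6).

% Transitive closure of the dependency relation; the left
% recursion terminates because reaches/2 is tabled.
reaches(X, Y) :- depends(X, Y).
reaches(X, Y) :- reaches(X, Z), depends(Z, Y).

% Two generators are in the same SCC if each reaches the other.
same_scc(X, Y) :- reaches(X, Y), reaches(Y, X).
\end{verbatim}

Here \texttt{same\_scc(n6, n13)} succeeds, reflecting the loop between
nodes $6$ and $13$, while \texttt{same\_scc(n0, n6)} fails because
node $6$ does not depend on node $0$. This agrees with the example,
where the trees for nodes $6$ and $13$ are completed together before
node $0$.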

516:

517: XSB uses a stack of generators to detect completion

518: points~\cite{Sagonas-98}. Each time a new generator is introduced it

519: becomes the current leader node. Each time a new consumer is

520: introduced one verifies if it is for an older generator node ${\cal

521: G}$. If so, ${\cal G}$'s leader node becomes the current leader

522: node. Unfortunately, this algorithm does not scale well for parallel

523: execution, which is not easily representable with a single stack.

524:

525:

526:

527: \subsection{Scheduling}

528:

529: At several points we had to choose between continuing forward

530: execution, backtracking to interior nodes, returning answers to

531: consumer nodes, or performing completion. Ideally, we would like to

532: run these operations in \emph{parallel}. In a sequential system, the

533: decision on which operation to perform is crucial to system

534: performance and is determined by the \emph{scheduling strategy}.

535: Different scheduling strategies may have a significant impact on

536: performance, and may lead to a different order of answers. YapTab

537: implements two different scheduling strategies, \emph{batched} and

538: \emph{local}~\cite{Freire-96}. YapTab's default scheduling strategy is

539: batched.

540:

541: Batched scheduling is the strategy we followed in the example: it

542: favors forward execution first, backtracking to interior nodes next,

543: and returning answers or completion last. It thus tries to delay the

544: need to move around the search tree by \emph{batching} the return of

545: answers. When new answers are found for a particular tabled subgoal,

546: they are added to the table space and the evaluation continues until

547: it resolves all program clauses for the subgoal in hand.

548:

549: Batched scheduling runs all interior nodes before restarting the

550: consumers. In the worst case, this strategy may result in creating a

551: complex graph of interdependent consumers. Local scheduling is an

552: alternative tabling scheduling strategy that tries to evaluate

553: subgoals as independently as possible, by executing one SCC at a time.

554: Answers are only returned to the leader's calling environment when its

555: SCC is completely evaluated.

556:

557:

558:

559: \section{The Sequential Tabling Engine}

560:

561: We next give a brief introduction to the implementation of YapTab.

562: Throughout, we focus on support for the parallel execution of definite

563: programs.

564:

565: The YapTab design is WAM based, as is the SLG-WAM. Yap's data

566: structures are very close to the WAM's~\cite{Warren-83}: there is a

567: \emph{local stack}, storing both choice points and environment frames;

568: a \emph{global stack}, storing compound terms and variables; a

569: \emph{code space area}, storing code and the internal database; a

570: \emph{trail}; and an \emph{auxiliary stack}. To support the SLG-WAM we

571: must extend the WAM with a new data area, the \emph{table space}; a

572: new set of registers, the \emph{freeze registers}; and an extension of the

573: standard trail, the \emph{forward trail}. We must support four new

574: operations: \emph{tabled subgoal call}, \emph{new answer},

575: \emph{answer resolution}, and \emph{completion}. Last, we must support

576: one or several \emph{scheduling strategies}.

577:

578: We reconsidered decisions in the original SLG-WAM that can be a

579: potential source of parallel overheads. Namely, we argue that the

580: stack based completion detection mechanism used in the SLG-WAM is not

581: suitable for a parallel implementation. The SLG-WAM considers that the

582: control of leader detection and scheduling of unconsumed answers

583: should be done at the level of the data structures corresponding to

584: first calls to tabled subgoals, and it does so by associating

585: completion frames to generator nodes. On the other hand, YapTab

586: considers that such control should be performed through the data

587: structures corresponding to variant calls to tabled subgoals, and thus

588: it associates a new data structure, the \emph{dependency frame}, to

589: consumer nodes. We believe that managing dependencies at the level of

590: the consumer nodes is a more intuitive approach that we can take

591: advantage of.

592:

593: The introduction of this new data structure allows us to reduce the

594: number of extra fields in tabled choice points and to eliminate the

595: need for a separate completion stack. Furthermore, allocating the data

596: structure in a separate area simplifies the implementation of

597: parallelism. We next review the main data structures and algorithms of

598: the YapTab design. A more detailed description is given

599: in~\cite{Rocha-PhD}.

600:

601:

602:

603: \subsection{Table Space}

604:

605: The table space can be accessed in different ways: to look up if a

606: subgoal is in the table, and if not insert it; to verify whether a

607: newly found answer is already in the table, and if not insert it; to

608: pick up answers to consumer nodes; and to mark subgoals as

609: completed. Hence, a correct design of the algorithms to access and

610: manipulate the table data is a critical issue to obtain an efficient

611: tabling system implementation.

612:

613: Our implementation of tables uses tries as proposed by Ramakrishnan

614: \emph{et al.}~\cite{Ramakrishnan-99}. Tries provide complete

615: discrimination for terms and permit lookup and possibly insertion to

616: be performed in a single pass through a term. In

617: section~\ref{section_concurrent_table_access} we discuss how OPTYap

618: supports concurrent access to tries.

619:

620: Figure~\ref{fig_tries} shows the completed table for the query shown

621: in Figure~\ref{fig_finite_SLG_tree}. Table lookup starts from the

622: \emph{table entry} data structure. Each tabled predicate has one such

623: structure, which is allocated at compilation time. A pointer to the

624: table entry can thus be included in the compiled code. Calls to the

625: predicate will always access the table starting from this point.

626:

627: \begin{figure}[!ht]

628: \centerline{

629: \epsfxsize=11cm

630: \epsffile{tries_example.eps}

631: }

632: \caption{Using tries to organize the table space.}

633: \label{fig_tries}

634: \end{figure}

635:

636: The table entry points to a tree of trie nodes, the \emph{subgoal trie

637:   structure}. More precisely, each different call to \texttt{path/2}

638: corresponds to a unique path through the subgoal trie structure. Such

639: a path always starts from the table entry, follows a sequence of

640: subgoal trie data units, the \emph{subgoal trie nodes}, and terminates

641: at a leaf data structure, the \emph{subgoal frame}.

642:

643: Each subgoal trie node represents a binding for an argument or

644: sub-argument of the subgoal. In the example, we have three possible

645: bindings for the first argument, \texttt{X=c}, \texttt{X=b}, and

646: \texttt{X=a}. Each binding stores two pointers: one to be followed if

647: the argument matches the binding, the other to be followed otherwise.

648:

649: We often have to search through a chain of sibling nodes that

650: represent alternative paths, e.g., in the query \texttt{path(a,Z)} we

651: have to search through nodes \texttt{X=c} and \texttt{X=b} until

652: finding node \texttt{X=a}. By default, this search is done

653: sequentially. When the chain becomes larger than a threshold value, we

654: dynamically index the nodes through a hash table to provide direct

655: node access and therefore optimize the search.
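
The sibling-chain search and its switch to hashing can be sketched as follows. This is only an illustrative Python model, not YapTab code: the class \texttt{TrieNode}, the helper \texttt{insert\_term}, the threshold value of 8, and the use of a Python dictionary as the hash table are all assumptions made for the example.

```python
# Illustrative sketch only: all names and the threshold value are assumptions.
HASH_THRESHOLD = 8  # hypothetical; the text only says "a threshold value"


class TrieNode:
    """One trie node: a symbol plus the chain of its child nodes."""

    def __init__(self, symbol=None):
        self.symbol = symbol
        self.children = []   # sibling chain, newest node first
        self.index = None    # hash index, built once the chain grows large

    def check_insert(self, symbol):
        """Return the child holding `symbol`, inserting it if it is missing."""
        if self.index is not None:
            child = self.index.get(symbol)            # direct node access
        else:
            child = next(                             # sequential search
                (c for c in self.children if c.symbol == symbol), None)
        if child is None:
            child = TrieNode(symbol)
            self.children.insert(0, child)
            if self.index is not None:
                self.index[symbol] = child
            elif len(self.children) > HASH_THRESHOLD:
                # Chain exceeded the threshold: index all siblings by symbol.
                self.index = {c.symbol: c for c in self.children}
        return child


def insert_term(root, symbols):
    """Look up, and if needed insert, a path of symbols in a single pass."""
    node = root
    for s in symbols:
        node = node.check_insert(s)
    return node
```

Inserting \texttt{a}, \texttt{b} and \texttt{c} in that order under the same parent gives the chain \texttt{c}, \texttt{b}, \texttt{a}, so a later lookup of \texttt{a} walks past the two newer siblings first, as in the \texttt{path(a,Z)} example.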

656:

657: Each subgoal frame stores information about the subgoal, namely an

658: entry point to its \emph{answer trie structure}. Each unique path

659: through the answer trie data units, the \emph{answer trie nodes},

660: corresponds to a different answer to the entry subgoal. All answer

661: leaf nodes are inserted in a linked list: the subgoal frame points at

662: the first and last entry in this list. Answer leaf nodes are

663: chained together in insertion time order, so that we can recover

664: answers in the same order they were inserted. A consumer node thus

665: needs only to point at the leaf node for its last consumed answer, and

666: consumes more answers just by following the chain of leaves.
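
A minimal Python sketch of this chaining is given below. It is a simplified model under stated assumptions: a dictionary stands in for the answer trie, and the names \texttt{AnswerTable}, \texttt{Leaf}, \texttt{Consumer}, \texttt{new\_answer} and \texttt{next\_answer} are hypothetical, chosen only for illustration.

```python
class Leaf:
    """Leaf of the answer trie; leaves are chained in insertion order."""

    def __init__(self, answer):
        self.answer = answer
        self.next = None


class AnswerTable:
    """Answers of one subgoal (a dict stands in for the answer trie)."""

    def __init__(self):
        self.seen = {}      # answer -> leaf, used for the duplicate check
        self.first = None   # first leaf in insertion order
        self.last = None    # last leaf in insertion order

    def new_answer(self, answer):
        """Insert `answer` if it is new; return True when it was inserted."""
        if answer in self.seen:
            return False
        leaf = Leaf(answer)
        self.seen[answer] = leaf
        if self.last is None:
            self.first = leaf
        else:
            self.last.next = leaf
        self.last = leaf
        return True


class Consumer:
    """A consumer only remembers the leaf of its last consumed answer."""

    def __init__(self, table):
        self.table = table
        self.last_consumed = None

    def next_answer(self):
        """Return the next unconsumed answer, or None if there is none yet."""
        if self.last_consumed is None:
            leaf = self.table.first
        else:
            leaf = self.last_consumed.next
        if leaf is None:
            return None
        self.last_consumed = leaf
        return leaf.answer
```

Because the consumer keeps a pointer into the chain rather than a copy of the answers, answers inserted after it suspends become visible the next time it is resumed.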

667:

668:

669:

670: \subsection{Generator and Consumer Nodes}

671:

672: Generator and consumer nodes correspond, respectively, to first and

673: variant calls to tabled subgoals, while interior nodes correspond to

674: normal, not tabled, subgoals. Interior nodes are implemented at the

675: engine level as WAM choice points. To implement generator nodes we

676: extended the WAM choice points with a pointer to the corresponding

677: subgoal frame. To implement consumer nodes we use the notion of

678: \emph{dependency frame}. Dependency frames will be stored in a proper

679: space, the \emph{dependency space}.

680: Figure~\ref{fig_nodes_relationships} illustrates how generator and

681: consumer nodes interact with the table and dependency spaces. As we

682: shall see in section~\ref{section_leader_nodes}, having a separate

683: dependency space is quite useful for our copying-based implementation,

684: although dependency frames could be stored together with the

685: corresponding choice point in the sequential implementation. All

686: dependency frames are linked together to form a dependency list of

687: consumer nodes. Additionally, dependency frames store information

688: about the last consumed answer for the corresponding consumer node;

689: and information for detecting completion points, as we discuss next.

690:

691: \begin{figure}[!ht]

692: \centerline{

693: \epsfxsize=12cm

694: \epsffile{nodes_relationships.eps}

695: }

696: \caption{The nodes and their relationship with the table and dependency spaces.}

697: \label{fig_nodes_relationships}

698: \end{figure}

699:

700:

701:

702: \subsection{Leader Nodes}

703:

704: We need to perform completion in order to recover space and in order

705: to determine negative loops between subgoals in programs with

706: negation. In this work we focus on positive programs only, so our goal

707: will be to recover space. Unfortunately, as an artifact of the

708: SLG-WAM, it can happen that the stack segments for a SCC ${\cal S}$

709: remain within the stack segments for another SCC ${\cal S'}$. In such

710: cases, ${\cal S}$ cannot be recovered in advance when completed, and

711: thus, recovering its space must be delayed until ${\cal S'}$ also

712: completes. To approximate SCCs in a stack-based implementation,

713: Sagonas~\cite{Sagonas-PhD} denotes a set of SCCs whose space must be

714: recovered together as an \emph{Approximate SCC} or \emph{ASCC}. For

715: simplicity, in the following we will use the SCC notation to refer to

716: both ASCCs and SCCs.

717:

718: The completion operation takes place when we backtrack to a generator

719: node that \textbf{(i)} has exhausted all its alternatives and that

720: \textbf{(ii)} is a leader node (recall that the youngest

721: generator node which does not depend on older generators is called a

722: leader node). We designed novel algorithms to quickly determine

723: whether a generator node is a leader node. The key idea in our

724: algorithms is that each dependency frame holds a pointer to the

725: resulting leader node of the SCC that includes the corresponding

726: consumer node. Using the leader node pointer from the dependency

727: frames, a generator node can quickly determine whether it is a leader

728: node. More precisely, in our algorithm, a generator ${\cal L}$ is a

729: leader node when either \textbf{(a)} ${\cal L}$ is the youngest tabled

730: node, or \textbf{(b)} the youngest consumer node says that ${\cal L}$ is

731: the leader.

732:

733: Our algorithm thus requires computing leader node information whenever

734: creating a new consumer node ${\cal C}$. We proceed as follows. First,

735: we hypothesize that the leader node is ${\cal C}$'s generator, say

736: ${\cal G}$. Next, for all consumer nodes older than ${\cal C}$ and

737: younger than ${\cal G}$, we check whether they depend on an older

738: generator node. Suppose that there is at least one such consumer and that

739: the oldest generator they depend on is ${\cal G'}$. If so, then ${\cal G'}$ is

740: the leader node. Otherwise, our hypothesis was correct and the leader

741: node is indeed ${\cal G}$. Leader node information is implemented as a

742: pointer to the choice point of the newly computed leader node.
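
The leader computation can be sketched in Python as follows. The sketch is a simplified model and not the YapTab implementation: nodes are identified by an integer \texttt{age} (smaller means older), the dependency list is a Python list ordered oldest first, and the names \texttt{Generator}, \texttt{DependencyFrame}, \texttt{new\_consumer} and \texttt{is\_leader} are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Generator:
    name: str
    age: int              # creation order: a smaller value means an older node


@dataclass
class DependencyFrame:
    consumer_age: int
    generator: Generator
    leader: Generator     # leader of the SCC that includes this consumer


def new_consumer(dep_list, consumer_age, generator):
    """Create the dependency frame for a new consumer of `generator`.

    `dep_list` holds the existing frames, oldest first.
    """
    leader = generator                      # hypothesis: our own generator
    for frame in dep_list:
        # Only consumers younger than the generator and older than the
        # new consumer are relevant.
        if generator.age < frame.consumer_age < consumer_age:
            if frame.leader.age < leader.age:
                leader = frame.leader       # depends on an older generator
    frame = DependencyFrame(consumer_age, generator, leader)
    dep_list.append(frame)
    return frame


def is_leader(gen, dep_list, youngest_tabled_age):
    """(a) `gen` is the youngest tabled node, or (b) the youngest consumer
    names `gen` as its leader."""
    if gen.age == youngest_tabled_age:
        return True
    return bool(dep_list) and dep_list[-1].leader is gen
```

Feeding this model the consumers ${\cal N}_1$, ${\cal N}_7$, ${\cal N}_{14}$, ${\cal N}_{21}$ and ${\cal N}_{25}$ of the running example, with generators ${\cal N}_0$, ${\cal N}_6$ and ${\cal N}_{13}$, yields the leaders ${\cal N}_0$, ${\cal N}_6$, ${\cal N}_{13}$, ${\cal N}_6$ and ${\cal N}_6$, respectively.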

743:

744: Figure~\ref{fig_spotting_current_leader} uses the example from

745: Figure~\ref{fig_finite_SLG_tree} to illustrate the leader node

746: algorithm. For compactness, the figure presents calls to

747: \texttt{path(a,Z)}, \texttt{path(b,Z)}, \texttt{path(c,Z)} and

748: \texttt{arc(a,Z)}, as \texttt{pa}, \texttt{pb}, \texttt{pc}, and

749: \texttt{aa}, respectively. Figure~\ref{fig_spotting_current_leader}(a)

750: shows the initial configuration. The generator node ${\cal N}_0$ is

751: the current leader node because it is the only subgoal. Figure

752: \ref{fig_spotting_current_leader}(b) shows the dependency graph after

753: creating node ${\cal N}_2$. First, we called a variant of

754: \texttt{path(a,Z)}, and allocated the corresponding dependency frame.

755: As ${\cal N}_0$ is the generator node for the variant call

756: \texttt{path(a,Z)}, ${\cal N}_0$ is the leader node for ${\cal

757:   N}_1$. ${\cal N}_1$ then suspended, we backtracked to ${\cal N}_0$

758: and called \texttt{arc(a,Z)}. As \texttt{arc(a,Z)} is not tabled, we

759: had to allocate an interior node for ${\cal N}_2$.

760:

761: \begin{figure}[!ht]

762: \centerline{

763: \epsfxsize=12cm

764: \epsffile{spotting_leader.eps}

765: }

766: \caption{Spotting the current leader node.}

767: \label{fig_spotting_current_leader}

768: \end{figure}

769:

770: Figure \ref{fig_spotting_current_leader}(c) shows the graph after we

771: created node ${\cal N}_{14}$. We have already created first and

772: variant calls to subgoals \texttt{path(b,Z)} and \texttt{path(c,Z)}.

773: Two new dependency frames were allocated and initialized. We thus have

774: three SCCs on stack: one per generator. The youngest SCC on stack is

775: for subgoal \texttt{path(c,Z)}. As a result, the current leader node

776: for the new set of nodes becomes ${\cal N}_{13}$. This is the one

777: referred to in the youngest dependency frame.

778:

779: Figure \ref{fig_spotting_current_leader}(d) shows the interesting case

780: where tabled nodes exist between a consumer and its generator. In the

781: example, consumer node ${\cal N}_{21}$ has two consumers, ${\cal

782:   N}_7$ and ${\cal N}_{14}$, separating it from its generator, ${\cal

783:   N}_6$. As neither consumer depends on nodes older than ${\cal

784:   N}_6$, the leader node for ${\cal N}_{21}$ is still ${\cal N}_6$,

785: and ${\cal N}_6$ becomes the current leader node. This situation

786: represents the point at which subgoal \texttt{path(c,Z)} starts

787: depending on subgoal \texttt{path(b,Z)} and their SCCs are merged

788: together. Next, we allocated consumer node ${\cal N}_{25}$. Nodes

789: ${\cal N}_{14}$ and ${\cal N}_{21}$ are between ${\cal N}_{25}$ and

790: the generator ${\cal N}_{13}$. Our algorithm says that since ${\cal

791:   N}_{21}$ depends on an older generator node, ${\cal N}_6$, the

792: leader node information for ${\cal N}_{25}$ is also ${\cal N}_6$. As a

793: result, ${\cal N}_6$ remains the current leader node.

794:

795: Finally, Figure \ref{fig_spotting_current_leader}(e) shows the point

796: after the subgoals \texttt{path(b,Z)} and \texttt{path(c,Z)} have

797: completed and the segments belonging to their SCC have been

798: released. The computation switches back to ${\cal N}_1$, consumes the

799: next answer and calls \texttt{path(c,Z)}. At this point,

800: \texttt{path(c,Z)} is already completed, and thus we can avoid

801: consumer node allocation and instead perform what is called the

802: \emph{completed table optimization}~\cite{Sagonas-98}. This

803: optimization allocates a node, similar to an interior node, that will

804: consume the set of found answers executing compiled code directly from

805: the trie data structure associated with the completed

806: subgoal~\cite{Ramakrishnan-99}.

807:

808:

809:

810: \subsection{Completion and Answer Resolution}

811: \label{section_completion_answer_resolution}

812:

813: After backtracking to a leader node, we must check whether all younger

814: consumer nodes have consumed all their answers. To do so, we walk the

815: chain of dependency frames looking for a frame which has not yet

816: consumed all the generated answers. If there is such a frame, we

817: should resume the computation of the corresponding consumer node. We

818: do this by restoring the stack pointers and backtracking to the node.

819: Otherwise, we can perform completion. This includes \textbf{(i)}

820: marking as complete all the subgoals in the SCC; \textbf{(ii)}

821: deallocating all younger dependency frames; and \textbf{(iii)}

822: backtracking to the previous node to continue the execution.
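
The check performed at a leader node can be sketched as below. This Python fragment is a simplified model under explicit assumptions: the dependency list is a Python list ordered youngest first, each frame counts its consumed answers instead of pointing into the answer chain, and the names \texttt{Subgoal}, \texttt{DepFrame} and \texttt{try\_completion} are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Subgoal:
    name: str
    answers: list = field(default_factory=list)
    complete: bool = False


@dataclass
class DepFrame:
    age: int              # creation order of the consumer node
    subgoal: Subgoal
    consumed: int = 0     # number of answers consumed so far

    def has_unconsumed(self):
        return self.consumed < len(self.subgoal.answers)


def try_completion(leader_age, dep_frames, scc_subgoals):
    """Run when backtracking to a leader that has no alternatives left.

    `dep_frames` is ordered youngest first. Returns the frame to resume,
    or None when completion was performed.
    """
    for frame in dep_frames:
        if frame.age < leader_age:
            break                  # older than the leader: not in this SCC
        if frame.has_unconsumed():
            return frame           # resume the corresponding consumer node
    for sg in scc_subgoals:        # (i) mark the subgoals in the SCC complete
        sg.complete = True
    # (ii) deallocate all dependency frames younger than the leader
    dep_frames[:] = [f for f in dep_frames if f.age < leader_age]
    return None                    # (iii) the caller backtracks further
```

In the real engine, resuming a frame also restores the stack pointers before backtracking to the consumer node; that step is left out of this model.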

823:

824: Backtracking to a consumer node results in executing the answer

825: resolution operation. The operation first checks the table space for

826: unconsumed answers. If there are new answers, it loads the next

827: available answer and proceeds. Otherwise, it backtracks again. If this

828: is the first time that backtracking from that consumer node takes

829: place, then it is performed as usual. Otherwise, we know that the

830: computation has been resumed from an older generator node ${\cal G}$

831: during an unsuccessful completion operation. Therefore, backtracking

832: must be done to the next consumer node that has unconsumed answers and

833: that is younger than ${\cal G}$. If no such consumer node can be

834: found, backtracking must be done to the generator node ${\cal G}$.

835:

836: The process of resuming a consumer node, consuming the available set

837: of answers, suspending and then resuming another consumer node can be

838: seen as an iterative process which repeats until a fixpoint is

839: reached. This fixpoint is reached when the SCC is completely

840: evaluated.

841:

842:

843:

844: \section{Or-Parallelism within Tabling}

845:

846: The first step in our research was to design a model that would allow

847: concurrent execution of all available alternatives, be they from

848: generator, consumer or interior nodes. We researched two designs: the

849: TOP (Tabling within Or Parallelism) model and the OPT (Or-Parallelism

850: within Tabling) model.

851:

852: Parallelism in the TOP model is supported by considering that a

853: parallel evaluation is performed by a set of independent WAM engines,

854: each managing a unique branch of the search tree at a time. These

855: engines are extended to include direct support for the basic table

856: access operations that allow the insertion of new subgoals and

857: answers. When exploiting parallelism, some branches may be

858: \emph{suspended}. Generator and interior nodes suspend alternatives

859: because we do not have enough processors to exploit them all. Consumer

860: nodes may also suspend because they are waiting for more answers.

861: Workers move in the search tree, looking for points where they can

862: exploit parallelism.

863:

864: Parallel evaluation in the OPT model is done by a set of independent

865: tabling engines that \emph{may} share different common branches of the

866: search tree during execution. Each worker can be considered a

867: sequential tabling engine that fully implements the tabling

868: operations: access the table space to insert new subgoals or answers;

869: allocate data structures for the different types of nodes; suspend

870: tabled subgoals; resume subcomputations to consume newly found

871: answers; and complete private (not shared) subgoals. As most of the

872: computation time is spent in exploiting the search tree involved in a

873: tabled evaluation, we can say that tabling is the base component of

874: the system.

875:

876: The or-parallel component of the system is triggered to allow

877: synchronized access to the shared parts of the execution tree, in

878: order to get new work when a worker runs out of alternatives to

879: exploit, and to perform completion of shared subgoals. Unexploited

880: alternatives should be made available for parallel execution,

881: regardless of whether they originate from generator, consumer or

882: interior nodes. From the viewpoint of SLG resolution, the OPT

883: computational model generalizes Warren's multi-sequential engine

884: framework for the exploitation of or-parallelism. Or-parallelism stems

885: from having several engines that implement SLG resolution, instead of

886: implementing Prolog's SLD resolution.

887:

888: We have already seen that the SLG-WAM presents several opportunities

889: for parallelism. Figure~\ref{fig_opt_example} illustrates how this

890: parallelism can be specifically exploited in the OPT model. The

891: example assumes two workers, ${{\cal W}_1}$ and ${{\cal W}_2}$, and

892: the program code and query goal from Figure~\ref{fig_finite_SLG_tree}.

893: For simplicity, we use the same abbreviation introduced in

894: Figure~\ref{fig_spotting_current_leader} to denote the subgoals.

895:

896: \begin{figure}[!ht]

897: \centerline{

898: \epsfxsize=10cm

899: \epsffile{opt_example.eps}

900: }

901: \caption{Exploiting parallelism in the OPT model.}

902: \label{fig_opt_example}

903: \end{figure}

904:

905: Consider that worker ${\cal W}_1$ starts the evaluation. It first

906: allocates a generator and a consumer node for tabled subgoal

907: \texttt{path(a,Z)}. Because there are no available answers for

908: \texttt{path(a,Z)}, it backtracks. The next alternative leads to a

909: non-tabled subgoal \texttt{arc(a,Z)} for which we create an interior

910: node. The first alternative for \texttt{arc(a,Z)} succeeds with the

911: answer \texttt{Z=b}. The worker inserts the newly found answer in the

912: table and starts exploiting the next alternative for

913: \texttt{arc(a,Z)}. This is shown in the left sub-figure. At this

914: point, worker ${\cal W}_2$ requests work. Assume that worker

915: ${\cal W}_1$ decides to share all of its private nodes. The two

916: workers will share three nodes: the generator and consumer nodes for

917: \texttt{path(a,Z)}, and the interior node for \texttt{arc(a,Z)}.

918: Worker ${\cal W}_2$ takes the next unexploited alternative of

919: \texttt{arc(a,Z)} and from now on, either worker can find further

920: answers for \texttt{path(a,Z)} or resume the shared consumer node.

921:

922: The OPT model offers two important advantages over the TOP model.

923: First, OPT reduces to a minimum the overlap between or-parallelism and

924: tabling. Namely, as the example shows, in OPT it is straightforward to

925: make nodes public only when we want to share them. This is very

926: important because execution of private nodes is almost as fast as

927: sequential execution. Second, OPT enables different data structures

928: for or-parallelism and for tabling. For instance, one can use the

929: SLG-WAM for tabling, and environment copying or binding arrays for

930: or-parallelism.

931:

932: The question now is whether we can achieve an implementation of the

933: OPT model, and whether that implementation is \emph{efficient}. We

934: implemented OPTYap in order to answer this question. In OPTYap,

935: tabling is implemented by freezing the whole stacks when a consumer

936: blocks. Or-parallelism is implemented through copying of stacks. More

937: precisely, we optimize copying by using \emph{incremental copying},

938: where workers only copy the differences between their stacks. We

939: adopted this framework because environment copying and the SLG-WAM

940: are, respectively, two of the most successful or-parallel and tabling

941: engines. In our case, we already had the experience of implementing

942: environment copying in Yap Prolog, the YapOr system, with

943: excellent performance results~\cite{Rocha-99b}. Adopting YapOr for the

944: or-parallel component of the combined system was therefore our first

945: choice.

946:

947: Regarding the tabling component, an alternative to freezing the stacks

948: is copying them to a separate storage as in CHAT~\cite{Demoen-00}. We

949: found two major problems with CHAT. First, to take best advantage of

950: CHAT we need to have separate environment and choice point stacks, but

951: Yap has an integrated local stack. Second, and more importantly, we

952: believe that CHAT is less suitable than the SLG-WAM for an efficient

953: extension to or-parallelism because of its incremental completion

954: technique. CHAT implements incremental completion through an

955: incremental copying mechanism that saves intermediate states of the

956: execution stacks up to the nearest generator node. This works fine for

957: sequential tabling, because leader nodes are always generator nodes.

958: However, as we will see, for parallel tabling this does not hold

959: because any public node can be a potential leader node. To preserve

960: incremental completion efficiency in a parallel tabling environment,

961: incremental saving should be performed up to the parent node, as

962: potentially it can be a leader node. Obviously, this node-to-node

963: segmentation of the incremental saving technique will degrade the

964: efficiency of any parallel system.

965:

966:

967:

968: \section{The Or-Parallel Tabling Engine}

969:

970: The OPT model requires changes to both the initial designs for

971: parallelism and tabling. As we enumerate next, supporting or-parallelism

972: plus tabling requires changes to memory allocation, table access, and the

973: completion algorithm. We must further ensure that environment copying

974: and tabling suspension do not interfere. Or-parallelism issues refer

975: to scheduling and to speculative work. In more detail:

976:

977: \begin{enumerate}

978: \item We must support parallel memory allocation and deallocation of

979:   the several data structures we use. Fortunately, most of our data

980:   structures are fixed-sized and parallel memory allocation can be

981:   implemented efficiently.

982: \item We must allow for several workers to concurrently read and

983:   update the table. To do so workers need to be able to lock the

984:   table. As we shall see, finer locking allows for more parallelism,

985:   but coarser locking has lower overheads.

986: \item OPTYap uses the copying model, where workers do not see the

987:   whole search tree, but instead only the branches corresponding to

988:   their current SLG-WAM. It is thus possible that a generator may not

989:   be in the stacks for a consumer (and vice-versa). We show that one

990:   can generalize the concept of leader node for such cases, and that

991:   such a generalization still gives a conservative approximation for a

992:   SCC. Completion can thus be performed when we are the last worker

993:   backtracking to the generalized leader nodes, and there is no more

994:   work below. The first condition can be easily checked through the

995:   or-parallel machinery. The second condition uses the sequential

996:   tabling machinery.

997: \item Or-parallelism and tabling are not strictly orthogonal. More

998:   precisely, naively sharing or-parallel work might result in

999:   overwriting suspended stacks. Several approaches may be used to

1000:   tackle this problem; we have proposed and implemented a suspension

1001:   mechanism that gives maximum scheduling flexibility.

1002: \item Scheduling or-parallel work in our system is based on the Muse

1003:   scheduler~\cite{Ali-90b}. Intuitively this corresponds to a form of

1004:   hierarchical scheduling, where we favor tabled scheduling

1005:   operations, and resort to the more expensive or-parallel scheduling

1006:   when no tabling operations are available. Other approaches are

1007:   possible, but this one has served OPTYap well so far. We also

1008:   discuss how moving around the shared parts of the search tree

1009:   changes in the presence of parallelism.

1010: \item Last, we briefly discuss pruning issues. Although pruning in the

1011:   presence of tabling is a complex issue~\cite{Guo-02,Castro-03}, we

1012:   still should execute correctly for non-tabled regions of the search

1013:   tree (interior nodes).

1014: \end{enumerate}

1015:

1016: We next discuss these issues in some detail, presenting the general

1017: execution framework.

1018:

1019:

1020:

1021: \subsection{Memory Organization}

1022:

1023: In OPTYap, memory is divided into a \emph{global} addressing space and

1024: a collection of \emph{local} spaces, as illustrated in

1025: Figure~\ref{fig_optyap_memory}. The global space includes the code

1026: area and a parallel data area that consists of all the data structures

1027: required to support concurrent execution. Each local space represents

1028: one system worker and it contains the four WAM execution stacks

1029: inherited from Yap: global stack, local stack, trail, and auxiliary

1030: stack.

1031:

1032: \begin{figure}[!ht]

1033: \centerline{

1034: \epsfxsize=7cm

1035: \epsffile{optyap_memory.eps}

1036: }

1037: \caption{Memory organization in OPTYap.}

1038: \label{fig_optyap_memory}

1039: \end{figure}

1040:

1041: The parallel data area includes the table and dependency spaces

1042: inherited from YapTab, and the \emph{or-frame space}~\cite{Ali-90b}

1043: inherited from YapOr to synchronize access to shared nodes.

1044: Additionally, we have an extra data structure to preserve the stacks

1045: of suspended SCCs (further details in section~\ref{section_scc}).

1046: Remember that we use specific extra fields in the choice points to

1047: access the data structures in the parallel data area. When sharing

1048: work, the execution stacks of the sharing worker are copied from its

1049: local space to the local space of the requesting worker. The data

1050: structures from the parallel data area associated with the shared

1051: stacks are automatically inherited by the requesting worker in the

1052: copied choice points.

1053:

1054: The efficiency of a parallel system largely depends on how concurrent

1055: handling of shared data is achieved and synchronized. Page faults and

1056: memory cache misses are a major source of overhead regarding data

1057: access or update in parallel systems. OPTYap tries to avoid these

1058: overheads by adopting a page-based organization scheme to split memory

1059: among different data structures, in a way similar to Bonwick's Slab

1060: memory allocator~\cite{Bonwick-94}. Each memory page of the parallel

1061: data area only contains data structures of the same type. Whenever a

1062: new request for a data structure of type ${\cal T}$ appears, the next

1063: available structure on one of the ${\cal T}$ pages is returned. If

1064: there are no available structures in any ${\cal T}$ page, then one of

1065: the free pages is made to be of type ${\cal T}$. A page is freed when

1066: all its data structures are released. A free page can be immediately

1067: reassigned to a different structure type.
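
The page-based scheme just described can be modelled in a few lines of Python. This is only a sketch of the allocation policy, not of OPTYap's allocator: pages and slots are represented by integers, no locking is shown, and the class \texttt{PagePool} with its methods \texttt{alloc} and \texttt{free} is a hypothetical interface introduced for the example.

```python
class PagePool:
    """Fixed-capacity pages; each page holds structures of one type only."""

    def __init__(self, num_pages, slots_per_page):
        self.slots_per_page = slots_per_page
        self.free_pages = list(range(num_pages))
        self.page_type = {}    # page id -> structure type stored in the page
        self.in_use = {}       # page id -> number of allocated slots
        self.free_slots = {}   # structure type -> available (page, slot) pairs

    def alloc(self, kind):
        """Return a (page, slot) handle for a structure of type `kind`."""
        slots = self.free_slots.setdefault(kind, [])
        if not slots:
            # No structure available in any page of this type:
            # turn one of the free pages into a page of type `kind`.
            if not self.free_pages:
                raise MemoryError("no free pages")
            page = self.free_pages.pop()
            self.page_type[page] = kind
            self.in_use[page] = 0
            slots.extend((page, s) for s in range(self.slots_per_page))
        page, slot = slots.pop()
        self.in_use[page] += 1
        return page, slot

    def free(self, handle):
        """Release a structure; free its page once the page is empty."""
        page, _slot = handle
        kind = self.page_type[page]
        self.in_use[page] -= 1
        if self.in_use[page] == 0:
            # All structures released: the page can take any type again.
            self.free_slots[kind] = [
                h for h in self.free_slots[kind] if h[0] != page]
            del self.page_type[page], self.in_use[page]
            self.free_pages.append(page)
        else:
            self.free_slots[kind].append(handle)
```

Since all structures in a page have the same fixed size, a request never needs to search for a block of suitable size, which is part of what makes this style of allocation cheap.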

1068:

1069:

1070:

1071: \subsection{Concurrent Table Access}

1072: \label{section_concurrent_table_access}

1073:

1074: Our experience showed that the table space is the major data area open

1075: to concurrent access operations in a parallel tabling environment. To

1076: maximize parallelism, whilst minimizing overheads, accessing and

1077: updating the table space must be carefully controlled. Reader/writer

1078: locks are the ideal implementation scheme for this purpose. In a

1079: nutshell, we can say that there are two critical issues that determine

1080: the efficiency of a locking scheme for the table. One is the

1081: \emph{lock duration}, that is, the amount of time a data structure is

1082: locked. The other is the \emph{lock grain}, that is, the number of

1083: data structures that are protected through a single lock request. It

1084: is the balance between lock duration and lock grain that determines

1085: the efficiency of different table locking approaches. For instance, if

1086: the lock scheme is short duration or fine grained, then inserting many

1087: trie nodes in sequence, corresponding to a long trie path, may result

1088: in a large number of lock requests. On the other hand, if the lock

1089: scheme is long duration or coarse grain, then going through a trie

1090: path without extending or updating its trie structure, may

1091: unnecessarily lock data and prevent possible concurrent access by

1092: others.

1093:

1094: Unfortunately, it is impossible to know beforehand which locking

1095: scheme would be optimal. Therefore, in OPTYap we experimented with

1096: four alternative locking schemes to deal with concurrent accesses to

1097: the table space data structures, the \emph{Table Lock at Entry Level}

1098: scheme, TLEL, the \emph{Table Lock at Node Level} scheme, TLNL, the

1099: \emph{Table Lock at Write Level} scheme, TLWL, and the \emph{Table

1100: Lock at Write Level - Allocate Before Check} scheme, TLWL-ABC.

1101:

1102: The TLEL scheme essentially allows a single writer per subgoal trie

1103: structure and a single writer per answer trie structure. The main

1104: drawback of TLEL is the contention resulting from long lock

1105: duration. The TLNL scheme enables a single writer per chain of sibling nodes

1106: that represent alternative paths from a common parent node. The TLWL

1107: scheme is similar to TLNL in that it enables a single writer per chain

1108: of sibling nodes that represent alternative paths from a common parent

1109: node. However, in TLWL, the common parent node is only locked when

1110: writing to the table is likely. TLWL also avoids the TLNL memory usage

1111: problem by replacing trie node lock fields with a global array of lock

1112: entries. Last, the TLWL-ABC scheme anticipates the allocation

1113: and initialization of nodes that are likely to be inserted in the

1114: table space before locking.
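
The write-level idea can be illustrated with a minimal Python sketch. It is only an approximation under stated assumptions: the names \texttt{TrieNode}, \texttt{LOCKS}, \texttt{find\_child}, \texttt{insert\_child} and \texttt{insert\_path} are hypothetical, the size of the lock array is arbitrary, and OPTYap's actual C data structures are not reproduced. The sketch first searches the chain of sibling nodes without locking, and only when an insertion seems necessary does it take a lock from a global array indexed by the parent node, re-checking the chain before writing.

```python
import threading

NUM_LOCKS = 64  # assumed size of the global array of lock entries
LOCKS = [threading.Lock() for _ in range(NUM_LOCKS)]


class TrieNode:
    def __init__(self, symbol=None):
        self.symbol = symbol
        self.children = []  # chain of sibling nodes below this parent


def find_child(parent, symbol):
    # Iterate over a snapshot so a concurrent append cannot disturb the scan.
    for child in list(parent.children):
        if child.symbol == symbol:
            return child
    return None


def insert_child(parent, symbol):
    """Return the child of `parent` labelled `symbol`, creating it if needed."""
    child = find_child(parent, symbol)       # optimistic check, no lock taken
    if child is not None:
        return child                         # repeated path: nothing to write
    lock = LOCKS[id(parent) % NUM_LOCKS]     # lock only when writing is likely
    with lock:
        child = find_child(parent, symbol)   # re-check: another writer may have won
        if child is None:
            child = TrieNode(symbol)
            parent.children.append(child)
        return child


def insert_path(root, symbols):
    """Insert a whole trie path, one node at a time."""
    node = root
    for symbol in symbols:
        node = insert_child(node, symbol)
    return node
```

In this sketch a repeated answer follows an existing path and never touches a lock, which is the behaviour that distinguishes the write-level schemes from TLEL and TLNL.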

1115:

1116: Through experimentation, we observed that the locking schemes, TLWL

1117: and TLWL-ABC, present the best speedup ratios and they are the only

1118: schemes showing scalability. Since neither of these two schemes clearly

1119: outperforms the other, we adopted TLWL as the default. The observed

1120: slowdown with a higher number of workers for the TLEL and TLNL schemes is

1121: mainly due to their locking of the table space even when writing is

1122: not likely. In particular, for repeated answers they pay the cost of

1123: performing locking operations without inserting any new trie node. For

1124: these schemes the number of potential contention points is

1125: proportional to the number of answers found during execution, whether

1126: they are unique or redundant.

1127:

1128:

1129:

1130: \subsection{Leader Nodes}

1131: \label{section_leader_nodes}

1132:

1133: Or-parallel systems execute alternatives early. As a result, different

1134: workers may execute the generator and the consumer subgoals. In fact,

1135: it is possible that generators will execute earlier, and in a

1136: different branch than in sequential execution. As

1137: Figure~\ref{fig_guess_leader} shows, this may induce complex

1138: dependencies between workers, therefore requiring a more elaborate

1139: completion algorithm that may involve branches from several workers.

1140:

1141: \begin{figure}[!ht]

1142: \centerline{

1143: \epsfxsize=7cm

1144: \epsffile{guess_leader.eps}

1145: }

1146: \caption{At which node should we check for completion?}

1147: \label{fig_guess_leader}

1148: \end{figure}

1149:

1150: In this example, worker ${\cal W}_1$ takes the leftmost alternative

1151: while worker ${\cal W}_2$ takes the rightmost from the youngest common

1152: node. While exploiting their alternatives, ${\cal W}_1$ calls a tabled

1153: subgoal \texttt{a} and ${\cal W}_2$ calls a tabled subgoal \texttt{b}.

1154: As this is the first call to both subgoals, a generator node is stored

1155: for each one. Next, each worker calls the tabled subgoal first

1156: called by the other, and two consumer nodes, one per worker, are

1157: therefore allocated. At this point both workers hold a consumer node

1158: while not having the corresponding generator node in their branches.

1159: Conversely, the owner of each generator node has consumer nodes being

1160: executed by a different worker. The question is: where should we check

1161: for completion? Intuitively, we would like to choose a node that is

1162: common to both branches and the youngest common node seems the best

1163: choice. But that node is not a generator node!

1164:

1165: We could avoid this problem by disallowing consumer nodes for

1166: generator nodes on other branches. Unfortunately, such a solution

1167: would severely restrict parallelism. Our solution was therefore to

1168: \emph{allow completion at all kinds of public nodes}.

1169:

1170: To clarify these new situations we introduce a new concept, the

1171: \emph{Generator Dependency Node} (or \emph{GDN}). Its purpose is to

1172: signal the nodes that are candidates to be leader nodes, therefore

1173: playing a role similar to that of the generator nodes for

1174: sequential tabling. A GDN is calculated whenever a new consumer node,

1175: say ${\cal C}$, is created. We define the GDN ${\cal D}$ for a

1176: consumer node ${\cal C}$ with generator ${\cal G}$ to be \emph{the

1177:   youngest node on ${\cal C}$'s current branch that is an ancestor of

1178:   ${\cal G}$}. Obviously, if ${\cal G}$ belongs to the current branch

1179: of ${\cal C}$ then ${\cal G}$ must be the GDN. Thus GDN reduces to

1180: leader node for sequential computations. On the other hand, if the

1181: worker allocating ${\cal C}$ is not the one that allocated ${\cal G}$

1182: then the youngest node ${\cal D}$ is a public node, but not

1183: necessarily ${\cal G}$. Figure~\ref{fig_public_generator_dependency}

1184: presents three different situations that better illustrate the GDN

1185: concept. ${\cal W_G}$ is always the worker that allocated the

1186: generator node ${\cal G}$, and ${\cal W_C}$ is the worker that is

1187: allocating a consumer node ${\cal C}$.

1188:

1189: \begin{figure}[!ht]

1190: \centerline{

1191: \epsfxsize=11cm

1192: \epsffile{public_generator_dependency.eps}

1193: }

1194: \caption{Spotting the generator dependency node.}

1195: \label{fig_public_generator_dependency}

1196: \end{figure}

1197:

1198: In situation (a), the generator node ${\cal G}$ is on the branch of

1199: the consumer node ${\cal C}$, and thus, ${\cal G}$ is the GDN. In

1200: situation (b), nodes ${\cal N}_1$ and ${\cal N}_2$ are on the branch

1201: of ${\cal C}$ and both contain a branch leading to the generator

1202: ${\cal G}$. As ${\cal N}_2$ is the youngest node of the two, it is the

1203: GDN. Situation (c) differs from (b) in that the public nodes represent

1204: more than one branch and, in this case, are interleaved in the

1205: physical stack. In this situation, ${\cal N}_1$ is the unique node

1206: that belongs to ${\cal C}$'s branch and that also contains ${\cal G}$

1207: in a branch below. ${\cal N}_2$ contains ${\cal G}$ in a branch below,

1208: but it is not on ${\cal C}$'s branch, while ${\cal N}_3$ is on ${\cal

1209:   C}$'s branch, but it does not contain ${\cal G}$ in a branch below.

1210: Therefore, ${\cal N}_1$ is the GDN. Notice that in both cases (b) and

1211: (c) the GDN can be a generator, a consumer or an interior node.
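
The GDN definition can be made concrete with a short Python sketch. It assumes a simplified search tree in which every node stores a pointer to its parent; the names \texttt{Node}, \texttt{branch} and \texttt{gdn} are hypothetical and are not taken from the OPTYap sources. The function walks the consumer's branch from the youngest node upwards and returns the first node that is either the generator itself or one of its ancestors.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # next older node on the same branch


def branch(node):
    """Yield the nodes from `node` up to the root, youngest first."""
    while node is not None:
        yield node
        node = node.parent


def gdn(consumer, generator):
    """Youngest node on the consumer's branch that is the generator
    or an ancestor of the generator."""
    generator_branch = {id(n) for n in branch(generator)}
    for n in branch(consumer):
        if id(n) in generator_branch:
            return n
    return None  # unreachable when both nodes share a root


# Situation (a): the generator is on the consumer's branch.
root = Node("root")
g_a = Node("G", root)
c_a = Node("C", g_a)

# Situation (c), simplified: N1 is shared, N2 leads only to G, N3 only to C.
n1 = Node("N1", root)
n2 = Node("N2", n1)
g_c = Node("G", n2)
n3 = Node("N3", n1)
c_c = Node("C", n3)

print(gdn(c_a, g_a).name, gdn(c_c, g_c).name)
```

The tree shapes used here only approximate the situations in the figure; the interleaving of public nodes in the physical stack is not modelled.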

1212:

1213: The procedure that computes the leader node information when

1214: allocating a new dependency frame now relies on the GDN

1215: concept. Remember that it is through this information that a node can

1216: determine whether it is a leader node. The main difference from the

1217: sequential algorithm is that now we first hypothesize that the leader

1218: node for the consumer node at hand is its GDN, and not its generator

1219: node. Then, we check the consumer nodes younger than the newly found

1220: GDN for an older dependency. Note that as soon as an older dependency

1221: ${\cal D}$ is found in a consumer node ${\cal C'}$, the remaining

1222: consumer nodes, older than ${\cal C'}$ but younger than the GDN, do

1223: not need to be checked. This is safe because the previous computation

1224: of the leader node information for the consumer node ${\cal C'}$

1225: already represents the oldest dependency that includes the remaining

1226: consumer nodes. We next give an argument on the correctness of the

1227: algorithm.
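
The procedure just described can be sketched in Python as follows. This is a simplification under explicit assumptions: nodes on a worker's branch are identified by an integer depth (a larger depth means a younger node), the worker's dependency frames are given as a list ordered from youngest to oldest, and the names \texttt{DepFrame} and \texttt{leader\_for\_new\_consumer} are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DepFrame:
    consumer_depth: int  # depth of the consumer node (larger = younger)
    leader_depth: int    # leader recorded when the consumer was allocated


def leader_for_new_consumer(gdn_depth, frames):
    """Compute the leader for a new consumer node.

    `frames` holds the worker's existing dependency frames, youngest first.
    """
    leader = gdn_depth                       # first hypothesis: the GDN
    for frame in frames:
        if frame.consumer_depth <= gdn_depth:
            break                            # consumer is not younger than the GDN
        if frame.leader_depth < leader:
            leader = frame.leader_depth      # older dependency found
            break                            # remaining frames need not be checked
    return leader
```

For example, with a single existing frame whose consumer is at depth 4 and whose leader is at depth 1, a new consumer with its GDN at depth 3 inherits the leader at depth 1, while a new consumer whose GDN is younger than every existing consumer keeps its GDN as leader.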

1228:

1229: Consider a consumer node with GDN ${\cal G}$ and assume that its

1230: leader node ${\cal D}$ is found in the dependency frame for consumer

1231: node ${\cal C}$. Now hypothesize that there is a consumer node ${\cal

1232:   N}$ younger than ${\cal G}$ with a reference ${\cal D'}$ older than

1233: ${\cal D}$. Therefore, when previously computing the leader node for

1234: ${\cal C}$ one of the following situations occurred: \textbf{(i)}

1235: ${\cal D}$ is the GDN for ${\cal C}$ or \textbf{(ii)} ${\cal D}$ was

1236: found in a dependency frame for a consumer node ${\cal C'}$. Situation

1237: \textbf{(i)} is not possible because ${\cal N}$ is younger than ${\cal

1238:   D}$ and it holds a reference older than ${\cal D}$. Regarding

1239: situation \textbf{(ii)}, ${\cal C'}$ is necessarily younger than

1240: ${\cal N}$ as otherwise the reference found for ${\cal C}$ would have been

1241: ${\cal D'}$. By recursively applying the previous argument to the

1242: computation of the leader node for ${\cal C'}$ we conclude that our

1243: initial hypothesis cannot hold because the number of nodes between

1244: ${\cal C}$ and ${\cal N}$ is finite.

1245:

1246: With this scheme, concurrency is not a problem. Each worker views its

1247: own leader node independently from the execution being done by

1248: others. A new consumer node is always a private node and a new

1249: dependency frame is always the youngest dependency frame for a

1250: worker. The leader information stored in a dependency frame denotes

1251: the resulting leader node at the time the corresponding consumer node

1252: was allocated. Thus, after computing such information it remains

1253: unchanged. If the leader changes when a new consumer node is allocated,

1254: the new leader information is only stored in the dependency frame for

1255: the new consumer, therefore not influencing others. Observe, for

1256: example, the situation from Figure~\ref{fig_dependency_frames}. Two

1257: workers, ${\cal W}_1$ and ${\cal W}_2$, exploiting different

1258: alternatives from a common public node, ${\cal N}_4$, are allocating

1259: new private consumer nodes. They compute the leader node information

1260: for the new dependency frames without requiring any explicit

1261: communication between them and without requiring any synchronization

1262: when consulting the common dependency frame for node ${\cal N}_4$. The

1263: resulting dependency chain for each worker is illustrated on each side

1264: of the figure. Note that the dependency frame for consumer node ${\cal

1265: N}_4$ is common to both workers. It is illustrated twice only for

1266: simplicity.

1267:

1268: \begin{figure}[!ht]

1269: \centerline{

1270: \epsfxsize=9cm

1271: \epsffile{dependency_frames.eps}

1272: }

1273: \caption{Dependency frames in the parallel environment.}

1274: \label{fig_dependency_frames}

1275: \end{figure}

1276:

1277: Within this scenario, worker ${\cal W}_1$ will check for completion at

1278: node ${\cal N}_1$, its current leader node, and worker ${\cal W}_2$

1279: will check for completion at node ${\cal N}_2$. Obviously, ${\cal

1280:   W}_2$ cannot perform completion when reaching ${\cal N}_2$. If

1281: ${\cal W}_1$ finds new answers for subgoal \texttt{c}, they should be

1282: consumed in node ${\cal N}_6$. Moreover, as ${\cal W}_1$ has a

1283: dependency for an older node, ${\cal N}_1$, the SCCs from both workers

1284: should only be completed together at node ${\cal N}_1$. However,

1285: ${\cal W}_1$ can allocate another consumer node that changes its

1286: current leader node. Therefore, ${\cal W}_2$ cannot know beforehand

1287: the leader where both SCCs should be completed. Determining the leader

1288: node where several dependent SCCs from different workers may be

1289: completed together is the problem that we address next.

1290:

1291:

1292:

1293: \subsection{SCC Suspension}

1294: \label{section_scc}

1295:

1296: Different paths may be followed when a worker ${\cal W}$ reaches a

1297: leader node for a SCC ${\cal S}$. The simplest case is when the node

1298: is private. In this case, we proceed as for sequential tabling.

1299: Otherwise, the node is public, and other workers can still influence

1300: ${\cal S}$. For instance, these workers may find new answers for a

1301: consumer node in ${\cal S}$, in which case the consumer must be

1302: resumed to consume the new answers. Clearly, in such cases, ${\cal W}$

1303: should not complete. On the other hand, ${\cal W}$ has tried all

1304: available alternatives and would like to move anywhere in the tree,

1305: say to node ${\cal N}$, to try other work. According to the copying

1306: model we use for or-parallelism, we should backtrack to the youngest

1307: node common to ${\cal N}$'s branch, that is, we should reset our

1308: stacks to the values of the common node. According to the freezing

1309: model that we use for tabling, we cannot recover the current consumers

1310: because they are frozen. We thus have a contradiction.

1311:

1312: Note that this is the only case where or-parallelism and tabling

1313: conflict. One solution would be to disallow movement in this case.

1314: Unfortunately, we would again severely restrict parallelism. As a

1315: result, in order to allow ${\cal W}$ to continue execution it becomes

1316: necessary to \emph{suspend the SCC} at hand. Suspending a SCC includes

1317: saving the SCC's stacks to a proper space, leaving in the leader node

1318: a reference to the suspended SCC. These suspended computations are

1319: considered again when the remaining workers do completion.

1320:

1321: In order to find out which suspended SCCs need to be resumed, each

1322: worker maintains a list of nodes with suspended SCCs. The last worker

1323: backtracking from a public node ${\cal N}$ checks if it holds

1324: references to suspended SCCs. If so, then ${\cal N}$ is included in

1325: the worker's list of nodes with suspended SCCs (the nodes are linked

1326: in stack order). If the node already belongs to another worker's list,

1327: it is not collected.

1328:

1329: A suspended SCC should be resumed if it contains consumer nodes with

1330: unconsumed answers. To resume a suspended SCC a worker needs to copy

1331: the saved stacks to the correct position in its own stacks, and thus,

1332: it has to suspend its current SCC first. Figure~\ref{fig_resuming_scc}

1333: illustrates the management of suspended SCCs when searching for SCCs

1334: to resume. It considers a worker ${\cal W}$, positioned in the leader

1335: node ${\cal N}_1$ of its current SCC ${\cal S}_1$. ${\cal W}$ consults

1336: its list of nodes with suspended SCCs, and starts checking the

1337: suspended SCC ${\cal S}_4$ for unconsumed answers. Assuming that

1338: ${\cal S}_4$ does not contain unconsumed answers, the search continues

1339: in the next node in the list. Here, suppose that SCC ${\cal S}_2$ does

1340: not have consumer nodes with unconsumed answers, but SCC ${\cal S}_3$

1341: does. The current SCC ${\cal S}_1$ is then suspended, and only then is

1342: ${\cal S}_3$ resumed.
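
The search for a suspended SCC to resume can be sketched in Python. The sketch assumes simplified records for consumers, SCCs and nodes; all names (\texttt{Consumer}, \texttt{SCC}, \texttt{Node}, \texttt{find\_scc\_to\_resume}, \texttt{resume}) are hypothetical, the placement of ${\cal S}_2$ and ${\cal S}_3$ in a node called \texttt{N2} is an assumption, and the copying of stack segments is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Consumer:
    unconsumed: int = 0  # number of answers not yet consumed


@dataclass
class SCC:
    name: str
    consumers: list = field(default_factory=list)

    def has_unconsumed(self):
        return any(c.unconsumed > 0 for c in self.consumers)


@dataclass
class Node:
    name: str
    suspended: list = field(default_factory=list)  # SCCs suspended at this node


def find_scc_to_resume(nodes_with_suspended):
    """Scan the worker's list (stack order) for an SCC with unconsumed answers."""
    for node in nodes_with_suspended:
        for scc in node.suspended:
            if scc.has_unconsumed():
                return node, scc
    return None


def resume(current_scc, leader, node, scc):
    """Suspend the current SCC at its leader first, then resume `scc`."""
    leader.suspended.append(current_scc)
    node.suspended.remove(scc)
    return scc  # the resumed SCC becomes the worker's current SCC


s1 = SCC("S1")
s2 = SCC("S2", [Consumer(0)])
s3 = SCC("S3", [Consumer(2)])
s4 = SCC("S4", [Consumer(0)])
n1, n2, n3 = Node("N1"), Node("N2", [s2, s3]), Node("N3", [s4])

node, scc = find_scc_to_resume([n3, n2])
current = resume(s1, n1, node, scc)
```

Here \texttt{S4} and \texttt{S2} are skipped because they hold no unconsumed answers, \texttt{S1} is stored in its leader node, and \texttt{S3} becomes the current SCC.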

1343:

1344: \begin{figure}[!ht]

1345: \centerline{

1346: \epsfxsize=12cm

1347: \epsffile{resuming_scc.eps}

1348: }

1349: \caption{Resuming a suspended SCC.}

1350: \label{fig_resuming_scc}

1351: \end{figure}

1352:

1353: Notice that node ${\cal N}_3$ was removed from ${\cal W}$'s list of

1354: suspended SCCs because ${\cal S}_3$ may not include ${\cal N}_3$ in

1355: its stack segments. For simplicity and efficiency, instead of checking

1356: ${\cal S}_3$'s segments, we simply remove ${\cal N}_3$ from ${\cal

1357:   W}$'s list. Note that this is a safe decision as a SCC only depends

1358: on branches below the leader node. Thus, if ${\cal S}_3$ does not

1359: include ${\cal N}_3$ then no new answers can be found for ${\cal

1360:   S}_4$'s consumer nodes. Otherwise, if this is not the case then

1361: ${\cal W}$ or other workers can eventually be scheduled to a node held

1362: by ${\cal S}_4$ and find new answers for at least one of its consumer

1363: nodes. In this case, when failing, these workers will necessarily

1364: backtrack through ${\cal N}_3$, ${\cal S}_4$'s leader. Therefore, the

1365: last worker backtracking from ${\cal N}_3$ will collect it for its own

1366: list, which allows ${\cal S}_4$ to be later resumed when executing

1367: completion in an older leader node.

1368:

1369:

1370:

1371: \subsection{The Flow of Control}

1372:

1373: Actual execution control of a parallel tabled evaluation mainly flows

1374: through four procedures. The process of completely evaluating SCCs is

1375: accomplished by the \texttt{completion()} and

1376: \texttt{answer\_resolution()} procedures, while parallel

1377: synchronization is achieved by the \texttt{getwork()} and

1378: \texttt{scheduler()} procedures. Here we focus on the execution in

1379: engine mode, that is on the \texttt{completion()},

1380: \texttt{answer\_resolution()} and \texttt{getwork()} procedures, and

1381: leave scheduling for the following section.

1382: Figure~\ref{fig_control_flow} presents a general overview of how

1383: control flows between the three procedures and how it flows within

1384: each procedure.

1385:

1386: \begin{figure}[!ht]

1387: \centerline{

1388: \epsfxsize=12cm

1389: \epsffile{control_flow.eps}

1390: }

1391: \caption{The flow of control in a parallel tabled evaluation.}

1392: \label{fig_control_flow}

1393: \end{figure}

1394:

1395: A novel completion procedure, \texttt{public\_completion()},

1396: implements completion detection for public leader nodes. As for

1397: private nodes, whenever a public node finds that it is a leader, it

1398: starts to check for younger consumer nodes with unconsumed answers. If

1399: there is such a node, the computation is resumed at it. Otherwise, it

1400: checks for suspended SCCs with unconsumed answers. Remember that to

1401: resume a suspended SCC a worker needs to suspend its current SCC

1402: first.

1403:

1404: We thus adopted the strategy of resuming suspended SCCs \emph{only

1405:   when the worker finds itself at a leader node}, since this is a

1406: decision point where the worker either completes or suspends the

1407: current SCC. Hence, if the worker resumes a suspended SCC it does not

1408: introduce further dependencies. This would not be the case if the worker

1409: resumed a suspended SCC ${\cal R}$ as soon as it reached the node

1410: where ${\cal R}$ had been suspended. In that situation, the worker would have to

1411: suspend its current SCC ${\cal S}$, and after resuming ${\cal R}$ it

1412: would probably have to also resume ${\cal S}$ to continue its

1413: execution. A first disadvantage is that the worker would have to make

1414: more suspensions and resumptions. Moreover, if we resume earlier,

1415: ${\cal R}$ may include consumer nodes with unconsumed answers that are

1416: common with ${\cal S}$. More importantly, suspending in non-leader

1417: nodes leads to further complexity that can be very difficult to

1418: manage.

1419:

1420: A SCC ${\cal S}$ is completely evaluated when \textbf{(i)} there are

1421: no unconsumed answers in any consumer node belonging to ${\cal S}$ or

1422: in any consumer node within a SCC suspended in a node belonging to

1423: ${\cal S}$; and \textbf{(ii)} there are no other representations of

1424: the leader node ${\cal N}$ in the computational environment, whether ${\cal

1425: N}$ is represented in the execution stacks of a worker or

1426: in the suspended stack segments of a SCC. Completing a SCC includes

1427: \textbf{(i)} marking all dependent subgoals as complete; \textbf{(ii)}

1428: releasing the frames belonging to the complete branches, including the

1429: branches in suspended SCCs; \textbf{(iii)} releasing the frozen stacks

1430: and the memory space used to hold the stacks from suspended SCCs; and

1431: \textbf{(iv)} readjusting the freeze registers and the whole set of

1432: stack and frame pointers.
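
The decisions taken at a public leader node can be summarised in a small Python sketch. It is a reading of the text rather than a transcription of \texttt{public\_completion()}: the function name \texttt{public\_completion\_step}, its boolean parameters and the returned action labels are all hypothetical, and the choice to suspend the current SCC when other representations of the leader exist is an assumption drawn from the discussion of SCC suspension.

```python
def public_completion_step(younger_consumer_has_answers,
                           suspended_scc_has_answers,
                           other_representations_of_leader):
    """Return the action taken by a worker positioned at a public leader node."""
    if younger_consumer_has_answers:
        # A younger consumer still has unconsumed answers: resume it.
        return "resume_consumer"
    if suspended_scc_has_answers:
        # The current SCC must be suspended before another one is resumed.
        return "suspend_current_then_resume_scc"
    if other_representations_of_leader:
        # Condition (ii) fails: other workers may still influence the SCC.
        return "suspend_current_scc"
    # Conditions (i) and (ii) both hold: the SCC is completely evaluated.
    return "complete_scc"


print(public_completion_step(False, False, False))
```

The order of the tests mirrors the order in which the text introduces them; completion is reached only when none of the earlier conditions applies.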

1433:

1434: The answer resolution operation for the parallel environment

1435: essentially uses the same algorithm as previously described for

1436: private nodes (please refer to

1437: section~\ref{section_completion_answer_resolution}). Initially, the

1438: procedure checks for unconsumed answers to be loaded for execution. If

1439: we have answers, execution will jump to them. Otherwise, we schedule

1440: for a backtracking node. If this is not the first time that

1441: backtracking from that consumer node takes place, we know that the

1442: computation has been resumed from an older leader node ${\cal L}$

1443: during an unsuccessful completion operation. ${\cal L}$ is thus the

1444: oldest node to where we can backtrack. Backtracking must be done to

1445: the next consumer node that has unconsumed answers and that is younger

1446: than ${\cal L}$. Otherwise, if there are no such consumer nodes,

1447: backtracking must be done to ${\cal L}$.
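
The choice of the backtracking node can be sketched as follows. The sketch assumes that the worker's dependency frames are given youngest first as pairs of consumer depth and number of unconsumed answers, with a larger depth meaning a younger node. The name \texttt{answer\_resolution\_step} and the action labels are hypothetical, and the handling of the first backtrack from a consumer node (ordinary backtracking) is an assumption, since it is not detailed in this section.

```python
def answer_resolution_step(frames, index, leader_depth, first_backtrack):
    """Decide what a consumer node does during answer resolution.

    frames: list of (consumer_depth, unconsumed_answers), youngest first.
    index: position of the current consumer node in `frames`.
    """
    depth, unconsumed = frames[index]
    if unconsumed > 0:
        return ("consume_answers", depth)
    if first_backtrack:
        # Assumed: the first backtrack proceeds as ordinary backtracking.
        return ("backtrack_to_previous_node", None)
    # Resumed from leader L: look for the next older consumer, still
    # younger than L, that has unconsumed answers.
    for older_depth, older_unconsumed in frames[index + 1:]:
        if older_depth <= leader_depth:
            break
        if older_unconsumed > 0:
            return ("backtrack_to_consumer", older_depth)
    return ("backtrack_to_leader", leader_depth)
```

With frames \texttt{[(9, 0), (7, 2), (5, 0), (2, 1)]} and a leader at depth 3, the consumer at depth 9 backtracks to the consumer at depth 7, whereas the consumer at depth 5 backtracks to the leader because the only remaining consumer with answers is older than the leader.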

1448:

1449: The \texttt{getwork()} procedure contributes to the progress of a

1450: parallel tabled evaluation by moving to effective work. The usual way

1451: to execute \texttt{getwork()} is through failure to the youngest

1452: public node on the current branch. We can distinguish two main

1453: procedures in \texttt{getwork()}. One detects completion points and

1454: therefore makes the computation flow to the

1455: \texttt{public\_completion()} procedure. The other corresponds to

1456: or-parallel execution. It synchronizes to check for available

1457: alternatives and executes the next one, if any. Otherwise, it invokes

1458: the scheduler. A completion point is detected when the public node ${\cal N}$

1459: being visited is the leader node pointed to by the youngest dependency frame. The exception is

1460: if ${\cal N}$ is itself a generator node for a consumer node within

1461: the current SCC and it contains unexploited alternatives. In such

1462: cases, the current SCC is not fully exploited. Hence, we should

1463: first exploit the available alternatives, and only then invoke

1464: completion.
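
The dispatch performed by \texttt{getwork()} can be outlined with a short Python sketch. The function \texttt{getwork\_step}, its boolean parameters and the action labels are hypothetical simplifications; synchronization between workers is not modelled.

```python
def getwork_step(is_leader_of_youngest_frame,
                 is_generator_for_scc_consumer,
                 has_unexploited_alternatives):
    """Return the next action for a worker that failed to public node N."""
    if is_leader_of_youngest_frame:
        if is_generator_for_scc_consumer and has_unexploited_alternatives:
            # Exception: the current SCC is not fully exploited yet.
            return "execute_next_alternative"
        return "public_completion"
    if has_unexploited_alternatives:
        return "execute_next_alternative"   # ordinary or-parallel execution
    return "scheduler"


print(getwork_step(True, True, True))
```

Under this reading, a leader node only postpones completion when it is also a generator for a consumer in the current SCC and still has alternatives to exploit.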

1465:

1466:

1467:

1468: \subsection{Scheduling Work}

1469:

1470: Scheduling work is the scheduler's task. It is about efficiently

1471: distributing the available work for exploitation between the running

1472: workers. In a parallel tabling environment we have the extra

1473: constraint of keeping the correctness of sequential tabling semantics.

1474: A worker enters scheduling mode when it runs out of work and

1475: returns to execution whenever a new piece of unexploited work is

1476: assigned to it by the scheduler.

1477:

1478: The scheduler for the OPTYap engine is mainly based on YapOr's

1479: scheduler. All the scheduler strategies implemented for YapOr were

1480: used in OPTYap. However, extensions were introduced in order to

1481: preserve the correctness of tabling semantics. These extensions allow

1482: support for leader nodes, frozen stack segments, and suspended

1483: SCCs. The OPTYap model was designed to enclose the computation within

1484: a SCC until the SCC was suspended or completely evaluated. Thus,

1485: OPTYap introduces the constraint that the \emph{computation cannot

1486: flow outside the current SCC, and workers cannot be scheduled to

1487: execute at nodes older than their current leader node}. Therefore,

1488: when scheduling for the nearest node with unexploited alternatives, if

1489: it is found that the current leader node is younger than the potential

1490: nearest node with unexploited alternatives, then the current leader

1491: node is the node scheduled to proceed with the evaluation.

1492:

1493: The next case is when the scheduling to determine the nearest node

1494: with unexploited alternatives does not return any node to proceed

1495: execution. The scheduler then starts searching for busy\footnote{A

1496: worker is said to be busy when it is in engine mode exploiting

1497: alternatives. A worker is said to be idle when it is in scheduling

1498: mode searching for work.} workers that can be asked for work. If

1499: such a worker ${\cal B}$ is found, then the requesting worker moves up

1500: to the youngest node that it has in common with ${\cal B}$, in order to become

1501: partially consistent with part of ${\cal B}$. Otherwise, no busy

1502: worker was found, and the scheduler moves the idle worker to a better

1503: position in the search tree. Therefore, we can enumerate three

1504: different situations for a worker to move up to a node ${\cal N}$:

1505: \textbf{(i)} ${\cal N}$ is the nearest node with unexploited

1506: alternatives; \textbf{(ii)} ${\cal N}$ is the youngest node common

1507: with the busy worker we found; or \textbf{(iii)} ${\cal N}$

1508: corresponds to a better position in the search tree.

1509:

1510: The process of moving up in the search tree from a current node ${\cal

1511: N}_0$ to a target node ${\cal N}_f$ is mainly implemented by the

1512: \texttt{move\_up\_one\_node()} procedure. This procedure is invoked

1513: for each node that has to be traversed until reaching ${\cal

1514: N}_f$. The presence of frozen stack segments or the presence of

1515: suspended SCCs in the nodes being traversed influences and can even

1516: abort the usual moving up process.

1517:

1518: Assume that the idle worker ${\cal W}$ is currently positioned at

1519: ${{\cal N}}_i$ and that it wants to move up one node. Initially, the

1520: procedure checks for frozen nodes on the stack to infer whether ${\cal

1521: W}$ is moving within a SCC. If so, ${\cal W}$ simply moves up. The

1522: interesting case is when ${\cal W}$ is not within a SCC. If ${{\cal

1523: N}}_i$ holds a suspended SCC, then ${\cal W}$ can safely resume it. If

1524: resumption does not take place, the procedure proceeds to check

1525: whether ${\cal W}$ holds the unique representation of ${\cal

1526: N}_i$. This being the case, the suspended SCCs in ${\cal N}_i$ can be

1527: completed. Completion can be safely performed over the suspended SCCs

1528: in ${\cal N}_i$ not only because the SCCs are completely evaluated, as

1529: none was previously resumed, but also because no more dependencies

1530: exist, as there are no other branches below ${\cal N}_i$. Moreover, if

1531: ${\cal N}_i$ is a generator node then its corresponding subgoal can

1532: also be marked as completed. Otherwise, ${\cal W}$ simply moves up.
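
The cases handled when moving up one node can be sketched in Python. This is an interpretation under stated assumptions and not the code of \texttt{move\_up\_one\_node()}: the function \texttt{move\_up\_one\_node\_step}, its parameters and the action labels are hypothetical; a suspended SCC is assumed to be resumed only when it holds unconsumed answers, and the worker is assumed to continue moving up after completing the suspended SCCs.

```python
def move_up_one_node_step(within_scc, suspended_unconsumed,
                          unique_representation, is_generator):
    """Return the list of actions for an idle worker leaving node N_i.

    suspended_unconsumed: one boolean per SCC suspended in N_i, true when
    that SCC has consumer nodes with unconsumed answers.
    """
    if within_scc:
        return ["move_up"]                    # still inside the current SCC
    if any(suspended_unconsumed):
        return ["resume_suspended_scc"]       # aborts the usual moving up
    actions = []
    if unique_representation:
        if suspended_unconsumed:
            actions.append("complete_suspended_sccs")
        if is_generator:
            actions.append("mark_subgoal_completed")
    actions.append("move_up")
    return actions


print(move_up_one_node_step(False, [False, False], True, True))
```

Completion is only attempted in the branch where no suspended SCC was resumed and the worker holds the unique representation of the node, matching the two safety conditions given in the text.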

1533:

1534: The scheduler extensions described are mainly related to tabling

1535: support. As the scheduling strategies inherited from YapOr's

1536: scheduler were designed for an or-parallel model, and not for an

1537: or-parallel tabling model, further work is still needed to implement

1538: and experiment with proper scheduling strategies that can take

1539: advantage of the parallel tabling environment.

1540:

1541:

1542:

1543: \subsection{Speculative Work}

1544:

1545: In~\cite{Ciepielewski-91}, Ciepielewski defines speculative work as

1546: \emph{work which would not be done in a system with one

1547: processor}. The definition clearly shows that speculative work is an

1548: implementation problem for parallelism and it must be addressed

1549: carefully in order to reduce its impact. The presence of pruning

1550: operators during or-parallel execution introduces the problem of

1551: speculative work~\cite{Hausman-PhD,Ali-92a,Beaumont-93}. Prolog has an

1552: explicit pruning operator, the \emph{cut} operator. When a computation

1553: executes a cut operation, all branches to the right of the cut are

1554: pruned. Computations that can potentially be pruned are thus

1555: \emph{speculative}. Earlier execution of such computations may result

1556: in wasted effort compared to sequential execution.

1557:

1558: In parallel tabling, not only the answers found for the query goal may

1559: not be valid, but also answers found for tabled predicates may be

1560: invalidated. The problem here is even more serious because tabled

1561: answers can be consumed elsewhere in the tree, which makes

1562: any late attempt to prune computations resulting from

1563: the consumption of invalid tabled answers impracticable. Indeed, consuming invalid

1564: tabled answers may result in finding more invalid answers for the same

1565: or other tabled predicates. Notice that finding and consuming answers

1566: is the natural way to get a tabled computation going forward. Delaying

1567: the consumption of answers may compromise such flow. Therefore, tabled

1568: answers should be released as soon as it is found that they are safe

1569: from being pruned. Whereas for all-solution queries the requirement is

1570: that, at the end of the execution, we have the set of valid

1571: answers, in tabling the requirement is to have the set of valid tabled

1572: answers released as soon as possible.

1573:

1574: Currently, OPTYap implements an extension of the cut scheme proposed

1575: by Ali and Karlsson~\cite{Ali-92a}, that prunes useless work as early

1576: as possible, by optimizing the delivery of tabled answers as soon as

1577: it is found that they are safe from being pruned~\cite{Rocha-PhD}. As

1578: cut semantics for operations that prune tabled nodes is still an open

1579: problem, OPTYap does not handle cut operations that prune tabled nodes

1580: and for such cases execution is aborted.

1581:

1582:

1583:

1584: \section{Related Work}

1585:

1586: A first proposal on how to exploit implicit parallelism in tabling

1587: systems was Freire's \emph{Table-parallelism}~\cite{Freire-95}. In

1588: this model, each tabled subgoal is computed independently in a single

1589: computational thread, a \emph{generator thread}. Each generator thread

1590: is associated with a unique tabled subgoal and it is responsible for

1591: fully exploiting its search tree in order to obtain the complete set

1592: of answers. A generator thread dependent on other tabled subgoals will

1593: asynchronously consume answers as the corresponding generator threads

1594: make them available. Within this model, parallelism results from

1595: having several generator threads running concurrently. Parallelism

1596: arising from non-tabled subgoals or from execution alternatives to

1597: tabled subgoals is not exploited. Moreover, we expect that scheduling

1598: and load balancing would be even harder than for traditional parallel

1599: systems.

1600:

1601: More recent work~\cite{Guo-01} proposes a different approach to the

1602: problem of exploiting implicit parallelism in tabled logic

1603: programs. The approach is a consequence of a new sequential tabling

1604: scheme based on \emph{dynamic reordering of alternatives with variant

1605: calls}. This dynamic alternative reordering strategy not only tables

1606: the answers to tabled subgoals, but also the alternatives leading to

1607: variant calls, the \emph{looping alternatives}. Looping alternatives

1608: are reordered and placed at the end of the alternative list for the

1609: call. After exploiting all matching clauses, the subgoal enters a

1610: looping state, where the looping alternatives, if they exist, start

1611: being tried repeatedly until a fixpoint is reached.  An important

1612: characteristic of tabling is that it avoids recomputation of tabled

1613: subgoals. An interesting point of the dynamic reordering strategy is

1614: that it avoids recomputation through performing recomputation. The

1615: process of retrying alternatives may cause redundant recomputations of

1616: the non-tabled subgoals that appear in the body of a looping

1617: alternative. It may also cause redundant consumption of answers if the

1618: body of a looping alternative contains more than one variant subgoal

1619: call. Within this model, parallelism arises if we schedule the

1620: multiple looping alternatives to different workers. Therefore,

1621: parallelism may not come as naturally as for SLD evaluations and

1622: parallel execution may lead to doing more work.

1623:

1624: There have been other proposals for concurrent tabling but in a

1625: distributed memory context. Hu~\cite{Hu-PhD} was the first to

1626: formulate a method for distributed tabled evaluation termed

1627: \emph{Multi-Processor SLG (SLGMP)}. This method matches subgoals with

1628: processors in a similar way to Freire's approach.  Each processor gets

1629: a single subgoal and it is responsible for fully exploiting its search

1630: tree and obtaining the complete set of answers. One of the main

1631: contributions of SLGMP is its controlled scheme of propagation of

1632: subgoal dependencies in order to safely perform distributed

1633: completion. An implementation prototype of SLGMP was developed, but as

1634: far as we know no results have been reported.

1635:

1636: A different approach for distributed tabling was proposed by

1637: Dam\'asio~\cite{Damasio-00}. The architecture for this proposal relies

1638: on four types of components: a \emph{goal manager} that interfaces

1639: with the outside world; a \emph{table manager} that selects the

1640: clients for storing tables; \emph{table storage clients} that keep the

1641: consumers and answers of tables; and \emph{prover clients} that

1642: perform evaluation. An interesting aspect of this proposal is the

1643: completion detection algorithm. It is based on a classical credit

1644: recovery algorithm~\cite{Mattern-89} for distributed termination

1645: detection. Dependencies among subgoals are not propagated and,

1646: instead, a controller client, associated with each SCC, controls the

1647: credits for its SCC and detects completion if the credits reach the

1648: zero value. An implementation prototype has also been developed, but

1649: further analysis is required.
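
The credit-based completion check described above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not Dam\'asio's implementation: the names \texttt{Controller}, \texttt{spawn} and \texttt{recover} are hypothetical, and we assume the usual formulation in which each unit of work holds a share of credit, splits it when it creates new work, and returns it to the controller when it finishes. Exact fractions are used so that the shares always add up without rounding errors.

```python
from fractions import Fraction


class Controller:
    """Tracks the credit still held by active work for one SCC (hypothetical)."""

    def __init__(self):
        self.outstanding = Fraction(0)

    def start(self):
        # Hand the whole credit (1) to the initial unit of work.
        self.outstanding = Fraction(1)
        return Fraction(1)

    def recover(self, credit):
        # A unit of work finished and returned its share of credit.
        self.outstanding -= credit

    def completed(self):
        # Completion is detected when the outstanding credit reaches zero.
        return self.outstanding == 0


def spawn(credit):
    """Split a share in two: one half stays, the other goes with the new work."""
    half = credit / 2
    return half, half


ctrl = Controller()
root = ctrl.start()
root, child_a = spawn(root)          # root keeps 1/2, child_a gets 1/2
child_a, child_b = spawn(child_a)    # child_a and child_b hold 1/4 each

ctrl.recover(root)
ctrl.recover(child_b)
print(ctrl.completed())              # False: child_a is still active
ctrl.recover(child_a)
print(ctrl.completed())              # True: all credit has been recovered
```

Note that the controller never needs to know how many units of work exist or how they depend on each other; it only tracks the credit that has not yet been returned.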

1650:

1651: Marques \emph{et al.}~\cite{Marques-00} have proposed an initial

1652: design for an architecture for a multi-threaded tabling engine. Their

1653: first aim is to implement an engine capable of processing multiple

1654: query requests concurrently. The main idea behind this proposal seems

1655: very interesting; however, the work is still at an initial stage.

1656:

1657: Other related mechanisms for sequential tabling have also been

1658: proposed. Demoen and Sagonas proposed a copying approach to deal with

1659: tabled evaluations and implemented two different models, the

1660: CAT~\cite{Demoen-98} and the CHAT~\cite{Demoen-00}. The main idea of

1661: the CAT implementation is that it replaces SLG-WAM's freezing of the

1662: stacks by copying the state of suspended computations to a proper

1663: separate stack area. The CHAT implementation improves the CAT design

1664: by combining ideas from the SLG-WAM with those from the CAT. It avoids

1665: copying all the execution stacks that represent the state of a

1666: suspended computation by introducing a technique for freezing stacks

1667: without using freeze registers.

1668:

1669: Zhou \emph{et al.}~\cite{Zhou-00,Zhou-01a} developed a linear tabling

1670: mechanism that works on a single SLD tree without requiring

1671: suspensions/resumptions of computations. The main idea is to let

1672: variant calls execute from the remaining clauses of the former first

1673: call. It works as follows: when there are answers available in the

1674: table, the call consumes the answers; otherwise, it uses the predicate

1675: clauses to produce answers.  Meanwhile, if a call that is a variant of

1676: some former call occurs, it takes the remaining clauses from the

1677: former call and tries to produce new answers by using them. The

1678: variant call is then repeatedly re-executed, until all the available

1679: answers and clauses have been exhausted, that is, until a fixpoint is

1680: reached.
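
The fixpoint criterion mentioned above (re-execute until no new answers are produced) can be illustrated with the following sketch. It is not the mechanism of Zhou \emph{et al.}, which operates on a single SLD tree; it is a naive iteration over an answer table, written in Python, for an assumed left-recursive \texttt{path/2} predicate over a hypothetical three-node cyclic graph.

```python
def path_answers(edges):
    """Answers for:  path(X,Y) :- path(X,Z), edge(Z,Y).   path(X,Y) :- edge(X,Y)."""
    table = set()                      # tabled answers found so far
    changed = True
    while changed:                     # re-execute until a fixpoint is reached
        changed = False
        # Non-recursive clause: every edge is a path.
        candidates = set(edges)
        # Recursive clause: consume the answers already in the table.
        candidates |= {(x, y) for (x, z) in table for (z2, y) in edges if z == z2}
        new = candidates - table
        if new:
            table |= new
            changed = True
    return table


edges = {(1, 2), (2, 3), (3, 1)}       # a cycle, where SLD resolution would not terminate
answers = path_answers(edges)
print(len(answers))                    # 9: every node reaches every node
```

Each pass may recompute answers that are already in the table, which is the price such re-execution schemes pay for avoiding suspension and resumption of computations.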

1681:

1682:

1683:

1684: \section{Performance Analysis}

1685: \label{section_performance_analysis}

1686:

1687: To assess the efficiency of our parallel tabling implementation and

1688: address the question of whether parallel tabling is worthwhile, we

1689: present next a detailed analysis of OPTYap's performance. We start by

1690: presenting an overall view of the overheads of supporting the several Yap

1691: extensions: YapOr, YapTab and OPTYap. Then, we compare YapOr's

1692: parallel performance with that of OPTYap for a set of non-tabled

1693: programs. Next, we use a set of tabled programs to measure the

1694: sequential behavior of YapTab, OPTYap and XSB, and to assess OPTYap's

1695: performance when running the tabled programs in parallel.

1696:

1697: YapOr, YapTab and OPTYap are based on Yap's~4.2.1 engine\footnote{Note

1698: that sequential execution would be somewhat better with more recent

1699: Yap engines.}. We used the same compilation flags for Yap, YapOr,

1700: YapTab and OPTYap. Regarding XSB Prolog, we used version~2.3 with the

1701: default configuration and the default execution parameters. All

1702: systems use batched scheduling for tabling.

1703:

1704: The environment for our experiments was \emph{oscar}, a Silicon

1705: Graphics Cray Origin2000 parallel computer from the Oxford

1706: Supercomputing Centre. \emph{Oscar} consists of 96 MIPS 195 MHz R10000

1707: processors each with 256 Mbytes of main memory (for a total shared

1708: memory of 24 Gbytes) and running the IRIX~6.5.12 kernel. While

1709: benchmarking, the jobs were submitted to an execution queue

1710: responsible for scheduling the pending jobs through the available

1711: processors in such a way that, when a job is scheduled for execution,

1712: the processors attached to the job are fully available during the

1713: period of time requested for execution. We have limited our

1714: experiments to 32 processors because the machine was always under a

1715: very high load and we were limited to a guest account.

1716:

1717:

1718:

1719: \subsection{Performance on Non-Tabled Programs}

1720:

1721: A fundamental criterion for judging the success of an or-parallel, a

1722: tabling, or a combined or-parallel tabling model is the

1723: overhead introduced by the model when running programs that do not

1724: take advantage of the particular extension. Ideally, a program should

1725: not pay a penalty for mechanisms that it does not require.

1726:

1727: To place our performance results in perspective we first evaluate how

1728: the original Yap Prolog engine compares against the several Yap

1729: extensions and against the most well-known tabling engine, XSB

1730: Prolog. We use a set of standard non-tabled logic programming

1731: benchmarks. All benchmarks find all the answers for the

1732: problem. Multiple answers are computed through automatic failure after

1733: a valid answer has been found. The set includes the following

1734: benchmark programs:

1735:

1736: \begin{description}

1737: \item[cubes:] solves the N-cubes or instant insanity problem from

1738:   Tick's book~\cite{Tick-91}. It consists of stacking 7 colored cubes

1739:   in a column so that no color appears twice within any given side of

1740:   the column.

1741:

1742: \item[ham:] finds all Hamiltonian cycles for a graph consisting of 26

1743:   nodes with each node connected to 3 other nodes.

1744:

1745: \item[map:] solves the problem of coloring a map of 10 countries with

1746:   five colors such that no two adjacent countries have the same color.

1747:

1748: \item[nsort:] naive sort algorithm. It sorts a list of 10 elements by

1749:   brute force starting from the reverse-ordered (and worst) case.

1750:

1751: \item[puzzle:] places numbers 1 to 19 in a hexagon pattern such that

1752:   the sums in all 15 diagonals add to the same value (also taken from

1753:   Tick's book~\cite{Tick-91}).

1754:

1755: \item[queens:] a non-naive algorithm to solve the problem of placing

1756:   11 queens on an 11x11 chess board such that no two queens attack each

1757:   other.

1758: \end{description}

1759:

1760: Table~\ref{non_tabled_sequential} shows the base execution time, in

1761: seconds, for Yap, YapOr, YapTab, OPTYap and XSB for the set of

1762: non-tabled benchmarks. In parentheses, it shows the overhead as a ratio to the

1763: Yap execution time. The timings reported for YapOr and OPTYap

1764: correspond to the execution with a single worker. The results indicate

1765: that YapOr, YapTab and OPTYap introduce, on average, an overhead of

1766: about 10\%, 5\% and 17\% respectively over standard Yap. Regarding

1767: XSB, the results show that, on average, XSB is 2.47 times slower than

1768: Yap, a result mainly due to the faster Yap engine.

1769:

1770: \begin{table}[!ht]

1771: \caption{Yap, YapOr, YapTab, OPTYap and XSB execution time on non-tabled programs.}

1772: \label{non_tabled_sequential}

1773: \begin{tabular}{lrrrrr}

1774: \hline\hline

1775:     {\bf Bench}

1776:     & \multicolumn{1}{c}{\bf Yap}

1777:     & \multicolumn{1}{c}{\bf YapOr}

1778:     & \multicolumn{1}{c}{\bf YapTab}

1779:     & \multicolumn{1}{c}{\bf OPTYap}

1780:     & \multicolumn{1}{c}{\bf XSB} \\

1781: \hline

1782: cubes       &  1.97 &  2.06 (1.05) &  2.05 (1.04) &  2.16 (1.10) &  4.81 (2.44) \\

1783: ham         &  4.04 &  4.61 (1.14) &  4.28 (1.06) &  4.95 (1.23) & 10.36 (2.56) \\

1784: map         &  9.01 & 10.25 (1.14) &  9.19 (1.02) & 11.08 (1.23) & 24.11 (2.68) \\

1785: nsort       & 33.05 & 37.52 (1.14) & 35.85 (1.08) & 39.95 (1.21) & 83.72 (2.53) \\

1786: puzzle      &  2.04 &  2.22 (1.09) &  2.19 (1.07) &  2.36 (1.16) &  4.97 (2.44) \\

1787: queens      & 16.77 & 17.68 (1.05) & 17.58 (1.05) & 18.57 (1.11) & 36.40 (2.17) \\

1788: \noalign{\vspace{.5cm}}

1789: \multicolumn{2}{l}{\it Average}

1790:                     &       (1.10) &       (1.05) &       (1.17) &       (2.47) \\

1791: \hline\hline

1792: \end{tabular}

1793: \end{table}
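
The values in parentheses in Table~\ref{non_tabled_sequential} can be recomputed from the execution times. The following Python sketch is only an illustration: it assumes that each overhead is the ratio between a system's time and Yap's time for the same benchmark, and that the average is the arithmetic mean of those ratios. The function names are hypothetical.

```python
# Execution times in seconds, copied from the table of non-tabled benchmarks.
SYSTEMS = ("Yap", "YapOr", "YapTab", "OPTYap", "XSB")
times = {
    "cubes":  (1.97, 2.06, 2.05, 2.16, 4.81),
    "ham":    (4.04, 4.61, 4.28, 4.95, 10.36),
    "map":    (9.01, 10.25, 9.19, 11.08, 24.11),
    "nsort":  (33.05, 37.52, 35.85, 39.95, 83.72),
    "puzzle": (2.04, 2.22, 2.19, 2.36, 4.97),
    "queens": (16.77, 17.68, 17.58, 18.57, 36.40),
}


def overhead(bench, system):
    """Execution time of `system` relative to Yap for one benchmark."""
    row = times[bench]
    return row[SYSTEMS.index(system)] / row[SYSTEMS.index("Yap")]


def average_overhead(system):
    """Arithmetic mean of the per-benchmark ratios (assumed averaging rule)."""
    ratios = [overhead(bench, system) for bench in times]
    return sum(ratios) / len(ratios)


for system in SYSTEMS[1:]:
    print(system, f"{average_overhead(system):.2f}")
```

Under these assumptions the sketch reproduces the averages reported in the last row of the table.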

1794:

1795: YapOr overheads result from handling the work load register and from

1796: testing operations that \textbf{(i)} verify whether a node is shared

1797: or private, \textbf{(ii)} check for sharing requests, and

1798: \textbf{(iii)} check for backtracking messages due to cut

1799: operations. On the other hand, YapTab overheads are due to the

1800: handling of the freeze registers and support of the forward

1801: trail. OPTYap overheads inherits both sources of

1802: overheads. Considering that Yap Prolog is one of the fastest Prolog

1803: engines currently available, the low overheads achieved by YapOr,

1804: YapTab and OPTYap are very good results.

1805:

1806: Since OPTYap is based on the same environment model as the one used by

1807: YapOr, we then compare OPTYap's performance with that of

1808: YapOr. Table~\ref{non_tabled_parallel} shows the speedups relative to

1809: the single worker case for YapOr and OPTYap with 4, 8, 16, 24 and 32

1810: workers. Each speedup corresponds to the best execution time obtained

1811: in a set of 3 runs. The results show that YapOr and OPTYap achieve

1812: very similar effective speedups in all benchmark programs. These results

1813: allow us to conclude that OPTYap maintains YapOr's behavior in

1814: exploiting or-parallelism in non-tabled programs, despite it including

1815: all the machinery required to support tabled programs.

1816:

1817: \begin{table}[!ht]

1818: \caption{Speedups for YapOr and OPTYap on non-tabled programs.}

1819: \label{non_tabled_parallel}

1820: \begin{tabular}{lrrrrrrrrrrr}

1821: \hline\hline

1822:     & \multicolumn{5}{c}{\bf YapOr}

1823:     &

1824:     & \multicolumn{5}{c}{\bf OPTYap} \\ \noalign{\vspace{.2cm}}

1825:     {\bf Bench}

1826:     & \multicolumn{1}{c}{\bf 4}

1827:     & \multicolumn{1}{c}{\bf 8}

1828:     & \multicolumn{1}{c}{\bf 16}

1829:     & \multicolumn{1}{c}{\bf 24}

1830:     & \multicolumn{1}{c}{\bf 32}

1831:     &

1832:     & \multicolumn{1}{c}{\bf 4}

1833:     & \multicolumn{1}{c}{\bf 8}

1834:     & \multicolumn{1}{c}{\bf 16}

1835:     & \multicolumn{1}{c}{\bf 24}

1836:     & \multicolumn{1}{c}{\bf 32} \\

1837: \hline

1838: cubes   & 3.99 & 7.81 & 14.66 & 19.26 & 20.55 & & 3.98 & 7.74 & 14.29 & 18.67 & 20.97 \\

1839: ham     & 3.93 & 7.61 & 13.71 & 15.62 & 15.75 & & 3.92 & 7.64 & 13.54 & 16.25 & 17.51 \\

1840: map     & 3.98 & 7.73 & 14.03 & 17.11 & 18.28 & & 3.98 & 7.88 & 13.74 & 18.36 & 16.68 \\

1841: nsort   & 3.98 & 7.92 & 15.62 & 22.90 & 29.73 & & 3.96 & 7.84 & 15.50 & 22.75 & 29.47 \\

1842: puzzle  & 3.93 & 7.56 & 13.71 & 18.18 & 16.53 & & 3.93 & 7.51 & 13.53 & 16.57 & 16.73 \\

1843: queens  & 4.00 & 7.95 & 15.39 & 21.69 & 25.69 & & 3.99 & 7.93 & 15.41 & 20.90 & 25.23 \\

1844: \noalign{\vspace{.5cm}}

1845: {\it Average}

1846:         & 3.97 & 7.76 & 14.52 & 19.13 & 21.09 & & 3.96 & 7.76 & 14.34 & 18.92 & 21.10 \\

1847: \hline\hline

1848: \end{tabular}

1849: \end{table}

1850:

1851:

1852:

1853: \subsection{Performance on Tabled Programs}

1854:

1855: In order to place OPTYap's results in perspective we start by

1856: analyzing the overheads introduced to extend YapTab to parallel

1857: execution and by measuring YapTab and OPTYap behavior when compared

1858: with XSB. We use a set of tabled benchmark programs from the

1859: XMC\footnote{The XMC system~\cite{Ramakrishnan-00} is a model checker

1860: implemented atop the XSB system which verifies properties written in

1861: the alternation-free fragment of the modal

1862: $\mu$-calculus~\cite{Kozen-83} for systems specified in XL, an

1863: extension of value-passing CCS~\cite{Milner-89}.}~\cite{xmc} and

1864: XSB~\cite{xsb} \emph{world wide web} sites that are frequently used in

1865: the literature to evaluate such systems. The benchmark programs are:

1866:

1867: \begin{description}

1868: \item[sieve:] the transition relation graph for the \emph{sieve}

1869:   specification\footnote{We are thankful to C. R. Ramakrishnan for

1870:     helping us in dumping the transition relation graph of the

1871:     automata corresponding to each given XL specification, and in

1872:     building runnable versions out of the XMC environment.} defined

1873:   for 5 processes and 4 overflow prime numbers.

1874:

1875: \item[leader:] the transition relation graph for the \emph{leader

1876:     election} specification defined for 5 processes.

1877:

1878: \item[iproto:] the transition relation graph for the \emph{i-protocol}

1879:   specification defined for a correct version (fix) with a huge window

1880:   size (w = 2).

1881:

1882: \item[samegen:] solves the same generation problem for a randomly

1883:   generated 24x24x2 cylinder. This benchmark is very interesting

1884:   because for sequential execution it does not allocate any consumer

1885:   node. Variant calls to tabled subgoals only occur when the subgoals

1886:   are already completed.

1887:

1888: \item[lgrid:] computes the transitive closure of a 25x25 grid using a

1889:   left recursion algorithm. A link between two nodes, $n$ and $m$, is

1890:   defined by two different relations; one indicates that we can reach

1891:   $m$ from $n$ and the other indicates that we can reach $n$ from $m$.

1892:

1893: \item[lgrid/2:] the same as \textbf{lgrid} but it only requires half

1894:   the relations to indicate that two nodes are connected. It defines

1895:   links between two nodes by a single relation, and it uses a

1896:   predicate to achieve symmetric reachability. This modification

1897:   alters the order by which answers are found. Moreover, as indexing

1898:   in the first argument is not possible for some calls, the execution

1899:   time increases significantly. For this reason, we only use here a

1900:   20x20 grid.

1901:

1902: \item[rgrid/2:] the same as \textbf{lgrid/2} but it computes the

1903:   transitive closure of a 25x25 grid and it uses a right recursion

1904:   algorithm.

1905: \end{description}

1906:

1907: Table~\ref{tabled_sequential} shows the execution time, in seconds,

1908: for YapTab, OPTYap and XSB for the set of tabled benchmarks. In

1909: parentheses, it shows the overhead over the YapTab execution time. The

1910: execution time reported for OPTYap corresponds to the execution with a

1911: single worker.

1912:

1913: \begin{table}[!ht]

1914: \caption{YapTab, OPTYap and XSB execution time on tabled programs.}

1915: \label{tabled_sequential}

1916: \begin{tabular}{lrrr}

1917: \hline\hline

1918:     {\bf Bench}

1919:     & \multicolumn{1}{c}{\bf YapTab}

1920:     & \multicolumn{1}{c}{\bf OPTYap}

1921:     & \multicolumn{1}{c}{\bf XSB} \\

1922: \hline

1923: sieve   & 235.31 & 268.13 (1.14) & 433.53 (1.84) \\

1924: leader  &  76.60 &  85.56 (1.12) & 158.23 (2.07) \\

1925: iproto  &  20.73 &  23.68 (1.14) &  53.04 (2.56) \\

1926: samegen &  23.36 &  26.00 (1.11) &  37.91 (1.62) \\

1927: lgrid   &   3.55 &   4.28 (1.21) &   7.41 (2.09) \\

1928: lgrid/2 &  59.53 &  69.02 (1.16) &  98.22 (1.65) \\

1929: rgrid/2 &   6.24 &   7.51 (1.20) &  15.40 (2.47) \\

1930: \noalign{\vspace{.5cm}}

1931: {\it Average} &  &        (1.15) &        (2.04) \\

1932: \hline\hline

1933: \end{tabular}

1934: \end{table}

1935:

1936: The results indicate that, for this set of tabled benchmark programs,

1937: OPTYap introduces, on average, an overhead of about 15\% over

1938: YapTab. This overhead is very close to that observed for non-tabled

1939: programs (about 11\% for OPTYap over YapTab in

1940: Table~\ref{non_tabled_sequential}). The small difference results from

1941: locking requests to handle the data structures introduced by tabling. Locks

1942: are required to insert new trie nodes into the table space, and to update

1943: subgoal and dependency frame pointers to tabled answers. These locking

1944: operations are all related to the management of tabled answers. Therefore,

1945: the benchmarks that deal with more tabled answers are the ones that can

1946: potentially perform more locking operations. This relation seems to be

1947: reflected in the execution times shown in Table~\ref{tabled_sequential},

1948: because the benchmarks that show higher overheads are also the ones that find more

1949: answers. The answers found by each benchmark are presented next in

1950: Table~\ref{tabled_stats}.

1951:

1952: Table~\ref{tabled_sequential} also shows that YapTab is on average

1953: about twice as fast as XSB for this set of benchmarks. This may be

1954: partly due to the faster Yap engine, as seen in

1955: Table~\ref{non_tabled_sequential}, and also to the fact that XSB

1956: implements functionalities that are still lacking in YapTab and that

1957: XSB may incur overheads in supporting those functionalities. These

1958: results show that we have accomplished our initial aim of implementing

1959: an or-parallel tabling system that compares favorably with current

1960: \emph{state of the art} technology.  Hence, we believe the following

1961: evaluation of the parallel engine is significant and fair.

1962:

1963: In order to gain deeper insight into the behavior of each

1964: benchmark, and therefore clarify some of the results presented next,

1965: we first present in Table~\ref{tabled_stats} data on the benchmark

1966: programs. The columns in Table~\ref{tabled_stats} have the following

1967: meaning:

1968:

1969: \begin{description}

1970: \item[first:] is the number of first calls to subgoals corresponding

1971:   to tabled predicates. It corresponds to the number of generator

1972:   choice points allocated.

1973:

1974: \item[nodes:] is the number of subgoal/answer trie nodes used to

1975:   represent the complete subgoal/answer trie structures of the tabled

1976:   predicates in the given benchmark. For the answer tries, in

1977:   parentheses, it shows the percentage of saving that the trie's

1978:   design achieves on these data structures. Given the $total$ number

1979:   of nodes required to represent individually each answer and the

1980:   number of nodes $used$ by the trie structure, the $saving$ can be

1981:   obtained by the following expression:

1982: \[saving~=~\frac{total~-~used}{total}\]

1983: As an example, consider two answers whose individual representations

1984: require 12 and 8 answer trie nodes, respectively ($total = 20$). Assuming

1985: that the answer trie representation of both answers requires only 15

1986: answer trie nodes ($used = 15$), with 5 nodes common to both paths, it

1987: achieves a saving of $5/20 = 25\%$. Higher percentages of saving reflect higher

1988: probabilities of lock contention when concurrently accessing the table

1989: space.

1990:

1991: \item[depth:] is the average depth of the whole set of paths in the

1992:   corresponding answer trie structure. In other words, it is the

1993:   average number of answer trie nodes required to represent an answer.

1994:   Trie structures with smaller average depth values are more amenable

1995:   to higher lock contention.

1996:

1997: \item[unique:] is the number of non-redundant answers found for tabled

1998:   subgoals. It corresponds to the number of answers stored in the

1999:   table space.

2000:

2001: \item[repeated:] is the number of redundant answers found for tabled

2002:   subgoals. A high number of redundant answers can degrade the

2003:   performance of the parallel system when using table locking schemes

2004:   that lock the table space without taking into account whether

2005:   writing to the table is, or is not, likely.

2006: \end{description}
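
The saving measure defined for the \textbf{nodes} column can be reproduced with a small sketch. The Python code below is an illustration under simplifying assumptions, not OPTYap's trie implementation: answers are modelled as lists of tokens, the trie is a nested dictionary, and the helper names \texttt{count\_trie\_nodes} and \texttt{saving} are hypothetical. The two example answers are constructed so that they match the example given above (12 and 8 nodes, with a common prefix of 5).

```python
def count_trie_nodes(answers):
    """Number of trie nodes needed to store all answers (root not counted)."""
    root = {}
    used = 0
    for tokens in answers:
        node = root
        for tok in tokens:
            if tok not in node:        # a node is created only where paths diverge
                node[tok] = {}
                used += 1
            node = node[tok]
    return used


def saving(answers):
    """(total - used) / total, as in the expression given in the text."""
    total = sum(len(tokens) for tokens in answers)   # one node per token, no sharing
    used = count_trie_nodes(answers)
    return (total - used) / total


# Two hypothetical answers of 12 and 8 tokens sharing a 5-token prefix.
prefix = [f"p{i}" for i in range(5)]
a1 = prefix + [f"a{i}" for i in range(7)]
a2 = prefix + [f"b{i}" for i in range(3)]
print(count_trie_nodes([a1, a2]), saving([a1, a2]))   # 15 0.25
```

The shared prefix is exactly the part of the trie that concurrent insertions must traverse in common, which is why a higher saving suggests a higher chance of lock contention.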

2007:

2008: \begin{table}[!ht]

2009: \caption{Characteristics of the tabled programs.}

2010: \label{tabled_stats}

2011: \begin{tabular}{lrrrrrrrr}

2012: \hline\hline

2013:     & \multicolumn{2}{c}{\bf Subgoal Tries}

2014:     &

2015:     & \multicolumn{2}{c}{\bf Answer Tries}

2016:     &

2017:     & \multicolumn{2}{c}{\bf New Answers} \\ \noalign{\vspace{.2cm}}

2018:       {\bf Bench}

2019:     & \multicolumn{1}{c}{\bf first}

2020:     & \multicolumn{1}{c}{\bf nodes}

2021:     &

2022:     & \multicolumn{1}{c}{\bf nodes}

2023:     & \multicolumn{1}{c}{\bf depth}

2024:     &

2025:     & \multicolumn{1}{c}{\bf unique}

2026:     & \multicolumn{1}{c}{\bf repeated} \\

2027: \hline

2028: sieve   &   1 &    7 & &    8624(57\%) & 53   & &    380 & 1386181 \\

2029: leader  &   1 &    5 & &   41793(70\%) & 81   & &   1728 &  574786 \\

2030: iproto  &   1 &    6 & & 1554896(77\%) & 51   & & 134361 &  385423 \\

2031: samegen & 485 &  971 & &   24190(33\%) &  1.5 & &  23152 &   65597 \\

2032: lgrid   &   1 &    3 & &  391251(49\%) &  2   & & 390625 & 1111775 \\

2033: lgrid/2 &   1 &    3 & &  160401(49\%) &  2   & & 160000 &  449520 \\

2034: rgrid/2 & 626 & 1253 & &  782501(33\%) &  1.5 & & 781250 & 2223550 \\

2035: \hline\hline

2036: \end{tabular}

2037: \end{table}

2038:

2039: By observing Table~\ref{tabled_stats} it seems that \emph{sieve} and

2040: \emph{leader} are the benchmarks least amenable to table lock

2041: contention because they are the ones that find the fewest

2042: answers and also the ones that have the deepest trie structures. In

2043: this regard, \emph{lgrid}, \emph{lgrid/2} and \emph{rgrid/2}

2044: correspond to the opposite case. They find the largest number of

2045: answers and they have very shallow trie structures. However,

2046: \emph{rgrid/2} is a benchmark with a large number of first subgoal

2047: calls, which can reduce the probability of lock contention because

2048: answers can be found for different subgoal calls and therefore be

2049: inserted with minimum overlap.  Likewise, \emph{samegen} is a

2050: benchmark that can also benefit from its large number of first subgoal

2051: calls, despite also presenting a very shallow trie structure.

2052: Finally, \emph{iproto} is a benchmark that can also lead to higher

2053: ratios of lock contention. It presents a deep trie structure, but it

2054: inserts a huge number of trie nodes in the table space. Moreover, it

2055: is the benchmark showing the highest percentage of saving.

2056:

2057: To assess OPTYap's performance when running tabled programs in

2058: parallel, we ran OPTYap with a varying number of workers for the set of

2059: tabled benchmark programs. Table~\ref{tabled_parallel_batched}

2060: presents the speedups for OPTYap with 4, 8, 16, 24 and 32 workers. The

2061: speedups are relative to the single worker case of

2062: Table~\ref{tabled_sequential}. They correspond to the best speedup

2063: obtained in a set of 3 runs. The table is divided into two main blocks:

2064: the upper block groups the benchmarks that showed potential for

2065: parallel execution, whilst the bottom block groups the benchmarks that

2066: do not show any gains when run in parallel.

2067:

2068: \begin{table}[!ht]

2069: \caption{Speedups for OPTYap on tabled programs.}

2070: \label{tabled_parallel_batched}

2071: \begin{tabular}{lrrrrr}

2072: \hline\hline

2073:     & \multicolumn{5}{c}{\bf Number of Workers} \\ \noalign{\vspace{.2cm}}

2074:     {\bf Bench}

2075:     & \multicolumn{1}{c}{\bf  4}

2076:     & \multicolumn{1}{c}{\bf  8}

2077:     & \multicolumn{1}{c}{\bf 16}

2078:     & \multicolumn{1}{c}{\bf 24}

2079:     & \multicolumn{1}{c}{\bf 32} \\

2080: \hline

2081: sieve   & 3.99 & 7.97 & 15.87 & 23.78 & 31.50 \\

2082: leader  & 3.98 & 7.92 & 15.78 & 23.57 & 31.18 \\

2083: iproto  & 3.05 & 5.08 &  9.01 &  8.81 &  7.21 \\

2084: samegen & 3.72 & 7.27 & 13.91 & 19.77 & 24.17 \\

2085: lgrid/2 & 3.63 & 7.19 & 13.53 & 19.93 & 24.35 \\

2086: \noalign{\vspace{.5cm}}

2087: {\it Average}

2088:         & 3.67 & 7.09 & 13.62 & 19.17 & 23.68 \\

2089: \hline

2090: lgrid   & 0.65 & 0.68 &  0.55 &  0.46 &  0.39 \\

2091: rgrid/2 & 0.94 & 1.15 &  0.72 &  0.77 &  0.65 \\

2092: \noalign{\vspace{.5cm}}

2093: {\it Average}

2094:         & 0.80 & 0.92 &  0.64 &  0.62 &  0.52 \\

2095: \hline\hline

2096: \end{tabular}

2097: \end{table}

2098:

2099: The results show superb speedups for the XMC \emph{sieve} and the

2100: \emph{leader} benchmarks up to 32 workers. These benchmarks reach

2101: speedups of 31.5 and 31.18 with 32 workers! Two other benchmarks in

2102: the upper block, \emph{samegen} and \emph{lgrid/2}, also show

2103: excellent speedups up to 32 workers. Both reach a speedup of 24 with

2104: 32 workers. The remaining benchmark, \emph{iproto}, shows a good

2105: result up to 16 workers and then it slows down with 24 and 32 workers.

2106: Globally, the results for the upper block are quite good, especially

2107: considering that they include the three XMC benchmarks that are more

2108: representative of real-world applications.

2109:

2110: On the other hand, the bottom block shows almost no speedups at all.

2111: Only for \emph{rgrid/2} with 8 workers do we obtain a slight positive

2112: speedup of 1.15. The worst case is for \emph{lgrid} with 32 workers,

2113: where we are about 2.5 times slower than execution with a single

2114: worker. In this case, surprisingly, we observed that for the whole set

2115: of benchmarks the workers are busy for more than 95\% of the execution

2116: time, even for 32 workers. The actual slowdown is therefore not caused

2117: by workers becoming idle and starting to search for work, as usually

2118: happens with parallel execution of non-tabled programs. Here the

2119: problem seems more complex: workers do have available work, but there

2120: is a lot of contention to access that work.
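
The speedups in Table~\ref{tabled_parallel_batched} can also be read in terms of parallel efficiency. The sketch below is an illustration only: efficiency (speedup divided by the number of workers) is a standard derived measure that we assume here and that is not reported in the tables, and the function names are hypothetical.

```python
WORKERS = (4, 8, 16, 24, 32)

# Speedups copied from the table of OPTYap speedups on tabled programs.
speedups = {
    "sieve":   (3.99, 7.97, 15.87, 23.78, 31.50),
    "leader":  (3.98, 7.92, 15.78, 23.57, 31.18),
    "iproto":  (3.05, 5.08, 9.01, 8.81, 7.21),
    "samegen": (3.72, 7.27, 13.91, 19.77, 24.17),
    "lgrid/2": (3.63, 7.19, 13.53, 19.93, 24.35),
    "lgrid":   (0.65, 0.68, 0.55, 0.46, 0.39),
    "rgrid/2": (0.94, 1.15, 0.72, 0.77, 0.65),
}


def efficiency(bench):
    """Speedup per worker for each worker count (1.0 would be ideal)."""
    return [s / w for s, w in zip(speedups[bench], WORKERS)]


def best_worker_count(bench):
    """Worker count at which the benchmark reaches its highest speedup."""
    row = speedups[bench]
    return WORKERS[row.index(max(row))]


for bench in speedups:
    print(bench, best_worker_count(bench), f"{efficiency(bench)[-1]:.2f}")
```

Read this way, \emph{sieve} and \emph{leader} stay close to ideal efficiency up to 32 workers, \emph{iproto} peaks at 16 workers, and the speedup of 0.39 for \emph{lgrid} with 32 workers corresponds to the slowdown factor of about 2.5 mentioned above.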

2121:

2122: The parallel execution behavior of each benchmark program can be

2123: better understood through the statistics described in the tables that

2124: follow. The columns in these tables have the following meaning:

2125:

2126: \begin{description}

2127: \item[variant:] is the number of variant calls to subgoals

2128:   corresponding to tabled predicates. It matches the number of

2129:   consumer choice points allocated.

2130:

2131: \item[complete:] is the number of variant calls to completed tabled

2132:   subgoals. This is when the \emph{completed table optimization} takes

2133:   place, that is, when the set of found answers is consumed by

2134:   executing compiled code directly from the trie structure associated

2135:   with the completed subgoal.

2136:

2137: \item[SCC suspend:] is the number of SCCs suspended.

2138:

2139: \item[SCC resume:] is the number of suspended SCCs that were resumed.

2140:

2141: \item[contention points:] is the total number of unsuccessful first

2142:   attempts to lock data structures of all types. Note that when a

2143:   first attempt fails, the requesting worker performs an arbitrary number of

2144:   locking requests until it succeeds. Here, we only consider the first

2145:   attempts.

2146:

2147:   \begin{description}

2148:   \item[subgoal frame:] is the number of unsuccessful first attempts

2149:     to lock subgoal frames. A subgoal frame is locked in three main

2150:     different situations: \textbf{(i)} when a new answer is found

2151:     which requires updating the subgoal frame pointer to the last

2152:     found answer; \textbf{(ii)} when marking a subgoal as completed;

2153:     \textbf{(iii)} when traversing the whole answer trie structure to

2154:     remove pruned answers and compute the code for direct compiled

2155:     code execution.

2156:

2157:   \item[dependency frame:] is the number of unsuccessful first

2158:     attempts to lock dependency frames. A dependency frame has to be

2159:     locked when it is checked for unconsumed answers.

2160:

2161:   \item[trie node:] is the number of unsuccessful first attempts to

2162:     lock trie nodes. Trie nodes must be locked when a worker has to

2163:     traverse the subgoal trie structure during a tabled subgoal call

2164:     operation or the answer trie structure during a new answer

2165:     operation.

2166:   \end{description}

2167: \end{description}
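
The notion of a contention point used above (an unsuccessful \emph{first} attempt to obtain a lock, with later retries not counted) can be sketched as follows. This is a minimal Python illustration, not OPTYap's locking code: the class \texttt{CountingLock} and its attribute \texttt{contention\_points} are hypothetical, and Python threads stand in for workers.

```python
import threading
import time


class CountingLock:
    """A lock that counts unsuccessful first attempts (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stats = threading.Lock()   # protects the counter itself
        self.contention_points = 0

    def acquire(self):
        if self._lock.acquire(blocking=False):
            return                       # first attempt succeeded
        with self._stats:
            self.contention_points += 1  # only the first failure is counted
        while not self._lock.acquire(blocking=False):
            time.sleep(0)                # keep retrying until the lock is obtained

    def release(self):
        self._lock.release()


trie_node_lock = CountingLock()
trie_node_lock.acquire()                 # the main thread holds the lock


def worker():
    trie_node_lock.acquire()             # first attempt fails, then it retries
    trie_node_lock.release()


t = threading.Thread(target=worker)
t.start()
while trie_node_lock.contention_points == 0:
    time.sleep(0.001)                    # wait until the worker has failed once
trie_node_lock.release()
t.join()
print(trie_node_lock.contention_points)  # 1
```

However many times the worker retries before the lock becomes free, the counter increases by one, which matches the definition of a contention point given above.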

2168:

2169: To gather these statistics it was necessary to introduce into the

2170: system a set of counters to measure the several parameters. Although

2171: the counting mechanism introduces an additional overhead in the

2172: execution time, we assume that it does not significantly influence the

2173: parallel execution pattern of each benchmark program.

2174:

2175: Tables~\ref{stats_batched_upper} and~\ref{stats_batched_below} show

2176: respectively the statistics gathered for the group of programs with

2177: and without parallelism. We do not include the statistics for the

2178: \emph{leader} benchmark because its execution behavior proved to be

2179: identical to that observed for the \emph{sieve} benchmark.

\begin{table}[!ht]
\caption{Statistics of OPTYap using batched scheduling for the group
         of programs with parallelism.}
\label{stats_batched_upper}
\begin{tabular}{lrrrrr}
\hline\hline
    & \multicolumn{5}{c}{\bf Number of Workers} \\ \noalign{\vspace{.2cm}}
    {\bf Parameter}
    & \multicolumn{1}{c}{\bf  4}
    & \multicolumn{1}{c}{\bf  8}
    & \multicolumn{1}{c}{\bf 16}
    & \multicolumn{1}{c}{\bf 24}
    & \multicolumn{1}{c}{\bf 32} \\
\hline
{\bf sieve}         &        &        &        &        &        \\
variant/complete    &    1/0 &    1/0 &    1/0 &    1/0 &    1/0 \\
SCC suspend/resume  &   20/0 &   70/0 &  136/0 &  214/0 &  261/0 \\
contention points   &    108 &    329 &    852 &   1616 &   3040 \\
~~~subgoal frame    &      0 &      0 &      0 &      0 &      2 \\
~~~dependency frame &      0 &      0 &      1 &      0 &      4 \\
~~~trie node        &     96 &    188 &    415 &    677 &   1979 \\
\noalign{\vspace{.5cm}}
{\bf iproto}        &        &        &        &        &        \\
variant/complete    &    1/0 &    1/0 &    1/0 &    1/0 &    1/0 \\
SCC suspend/resume  &    5/0 &    9/0 &   17/0 &   26/0 &   32/0 \\
contention points   &   7712 &  22473 &  60703 & 120162 & 136734 \\
~~~subgoal frame    &   3832 &   9894 &  21271 &  33162 &  33307 \\
~~~dependency frame &    678 &   4685 &  25006 &  66334 &  81515 \\
~~~trie node        &   3045 &   6579 &  10537 &  11816 &  11736 \\
\noalign{\vspace{.5cm}}
{\bf samegen}       &          &          &          &          &          \\
variant/complete    & 485/1067 & 1359/193 & 1355/197 & 1384/168 & 1363/189 \\
SCC suspend/resume  &    187/2 &   991/11 &  1002/20 &  1024/25 &  1020/34 \\
contention points   &      255 &      314 &      743 &     1160 &     1607 \\
~~~subgoal frame    &        8 &       52 &      112 &      283 &      493 \\
~~~dependency frame &        0 &        0 &        1 &        0 &        0 \\
~~~trie node        &      154 &      119 &      201 &      364 &      417 \\
\noalign{\vspace{.5cm}}
{\bf lgrid/2}       &        &        &        &        &        \\
variant/complete    &    1/0 &    1/0 &    1/0 &    1/0 &    1/0 \\
SCC suspend/resume  &    4/0 &    8/0 &   16/0 &   24/0 &   32/0 \\
contention points   &   4004 &  10072 &  28669 &  59283 &  88541 \\
~~~subgoal frame    &    167 &   1124 &   7319 &  17440 &  27834 \\
~~~dependency frame &     98 &   1209 &   5987 &  23357 &  35991 \\
~~~trie node        &   2958 &   5292 &  10341 &  12870 &  12925 \\
\hline\hline
\end{tabular}
\end{table}

The statistics obtained for the \emph{sieve} benchmark support the
excellent speedups observed for parallel execution. It shows an
insignificant number of contention points, it calls a single variant
subgoal, and although it suspends some SCCs it never has to resume
them. The \emph{samegen} benchmark also shows an insignificant number
of contention points. However, the number of variant subgoal calls and
the number of suspended/resumed SCCs indicate that it introduces more
dependencies between workers. Curiously, for more than 4 workers, the
number of variant calls and the number of suspended SCCs seem to be
stable. The only parameter that increases slightly is the number of
resumed SCCs. Regarding \emph{iproto} and \emph{lgrid/2}, lock
contention seems to be the major problem. Trie nodes show similar lock
contention; note, however, that \emph{iproto} inserts about 10 times
more answer trie nodes than \emph{lgrid/2}. Subgoal and dependency
frames show a similar pattern of contention, but \emph{iproto} presents
higher contention ratios. Moreover, if we recall from
Table~\ref{tabled_sequential} that \emph{iproto} executes about 3 times
faster than \emph{lgrid/2}, we can conclude that the contention ratio
for \emph{iproto} is much higher per time unit, which explains its
worse behavior.
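As a rough illustration of this last argument, take the 32-worker
totals of contention points from Table~\ref{stats_batched_upper} and
assume, as a simplification, that the factor of about 3 between the
execution times of \emph{lgrid/2} and \emph{iproto} also holds for the
parallel runs:
\[
  \frac{136734}{88541} \approx 1.54,
  \qquad
  1.54 \times 3 \approx 4.6 .
\]
Under that assumption, \emph{iproto} would hit roughly 4.6 times as
many contention points per time unit as \emph{lgrid/2}. The figure is
only indicative, since the parallel execution times themselves are not
part of this table.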

\begin{table}[!ht]
\caption{Statistics of OPTYap using batched scheduling for the group
         of programs without parallelism.}
\label{stats_batched_below}
\begin{tabular}{lrrrrr}
\hline\hline
    & \multicolumn{5}{c}{\bf Number of Workers} \\ \noalign{\vspace{.2cm}}
    {\bf Parameter}
    & \multicolumn{1}{c}{\bf  4}
    & \multicolumn{1}{c}{\bf  8}
    & \multicolumn{1}{c}{\bf 16}
    & \multicolumn{1}{c}{\bf 24}
    & \multicolumn{1}{c}{\bf 32} \\
\hline
{\bf lgrid}         &        &        &        &        &        \\
variant/complete    &    1/0 &    1/0 &    1/0 &    1/0 &    1/0 \\
SCC suspend/resume  &    4/0 &    8/0 &   16/0 &   24/0 &   32/0 \\
contention points   & 112740 & 293328 & 370540 & 373910 & 452712 \\
~~~subgoal frame    &  18502 &  73966 &  77930 &  68313 & 115862 \\
~~~dependency frame &  17687 & 113594 & 215429 & 223792 & 248603 \\
~~~trie node        &  72751 &  91909 &  61857 &  62629 &  64029 \\
\noalign{\vspace{.5cm}}
{\bf rgrid/2}       &           &           &           &          &           \\
variant/complete    & 3051/1124 & 3072/1103 & 3168/1007 & 3226/949 &  3234/941 \\
SCC suspend/resume  &  1668/465 &  1978/766 & 2326/1107 & 2121/882 & 2340/1078 \\
contention points   &     58761 &    110984 &    133058 &   170653 &    173773 \\
~~~subgoal frame    &     55415 &    103104 &    122938 &   159709 &    160771 \\
~~~dependency frame &         0 &         8 &         5 &      259 &       268 \\
~~~trie node        &      1519 &      3595 &      5016 &     4780 &      4737 \\
\hline\hline
\end{tabular}
\end{table}

The statistics gathered for the second group of programs are
particularly revealing. Recall that \emph{lgrid} and \emph{rgrid/2} are
the benchmarks that find the largest number of answers per time unit
(see Tables~\ref{tabled_sequential} and~\ref{tabled_stats}). The
\emph{lgrid} benchmark shows high contention ratios in all the
parameters considered. A closer analysis shows that its pattern of
contention is similar to that of \emph{lgrid/2}. The problem is that
the ratio per time unit is significantly worse for \emph{lgrid}. This
reflects the fact that most of \emph{lgrid}'s execution time is spent
\emph{massively} accessing the table space to insert new answers and
to consume answers already found.
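The similarity between the two benchmarks can be checked against the
tables. As an illustration, taking the 32-worker columns of
Tables~\ref{stats_batched_upper} and~\ref{stats_batched_below}, the
share of each itemised lock in the total number of contention points
is approximately:

\begin{center}
\begin{tabular}{lrr}
\hline
{\bf Lock} & \emph{lgrid/2} & \emph{lgrid} \\
\hline
subgoal frame    & $27834/88541 \approx 31\%$ & $115862/452712 \approx 26\%$ \\
dependency frame & $35991/88541 \approx 41\%$ & $248603/452712 \approx 55\%$ \\
trie node        & $12925/88541 \approx 15\%$ & $64029/452712 \approx 14\%$ \\
\hline
\end{tabular}
\end{center}

In both benchmarks dependency frames account for the largest share,
followed by subgoal frames and trie nodes, while the total for
\emph{lgrid} is about $452712/88541 \approx 5.1$ times larger. The
three itemised categories do not add up to the totals, so the remainder
presumably corresponds to locks that the tables do not itemise; this
reading is an inference from the numbers, not something the tables
state.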

The sequential order in which answers are accessed in the trie
structure is the key factor behind the high number of contention
points in subgoal and dependency frames. When inserting a new answer,
we need to update the subgoal frame pointer to point at the last answer
found. When consuming a new answer, we need to update the dependency
frame pointer to point at the last answer consumed. For programs that
find a large number of answers per time unit, this increases contention
when accessing such pointers. Regarding trie nodes, the small depth of
\emph{lgrid}'s answer trie structure (2 trie nodes) is one of the main
factors contributing to the high number of contention points when
massively inserting trie nodes. Tries are a compact data structure in
which many answers share the same nodes, so concurrent insertions tend
to collide on those nodes. Obtaining good parallel performance in the
presence of massive table access will therefore always be a difficult
task.
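The effect described above can be made concrete with a small model. The
sketch below is not OPTYap's implementation, which is written in C; it
is a hypothetical Python model in which every trie node and the subgoal
frame carry their own lock, and in which a parent node is locked while
one of its children is looked up or created. The names
\texttt{TrieNode}, \texttt{SubgoalFrame}, \texttt{insert\_answer} and
\texttt{lock\_profile}, as well as the locking scheme itself, are
assumptions made for illustration. The model runs sequentially and only
counts lock acquisitions, which indicates where concurrent workers
would have to queue.

```python
import threading


class TrieNode:
    """Answer trie node guarded by its own lock (assumed scheme)."""

    def __init__(self, token=None):
        self.token = token
        self.children = {}
        self.lock = threading.Lock()
        self.acquisitions = 0  # times this node's lock was taken


class SubgoalFrame:
    """Owns the answer trie and a pointer to the last answer found."""

    def __init__(self):
        self.root = TrieNode()
        self.lock = threading.Lock()
        self.acquisitions = 0  # times the frame lock was taken
        self.last_answer = None


def insert_answer(frame, answer):
    """Insert a tuple of tokens; return True if the answer is new."""
    node = frame.root
    created_leaf = False
    for token in answer:
        # Lock the parent while its child is looked up or created.
        with node.lock:
            node.acquisitions += 1
            child = node.children.get(token)
            created_leaf = child is None
            if created_leaf:
                child = TrieNode(token)
                node.children[token] = child
        node = child
    if created_leaf:
        # Every new answer updates the same last-answer pointer.
        with frame.lock:
            frame.acquisitions += 1
            frame.last_answer = node
    return created_leaf


def lock_profile(answers):
    """Return (root lock acquisitions, frame lock acquisitions)."""
    frame = SubgoalFrame()
    for answer in answers:
        insert_answer(frame, answer)
    return frame.root.acquisitions, frame.acquisitions


if __name__ == "__main__":
    # 20 distinct answers of depth 2, followed by 5 repeated ones.
    answers = [(x, y) for x in range(4) for y in range(5)]
    print(lock_profile(answers + answers[:5]))
```

In this model every insertion, new or repeated, takes the root lock,
and every new answer takes the frame lock, so the run above takes the
root lock 25 times and the frame lock 20 times. With an answer trie
only 2 nodes deep there are few other nodes over which that traffic
could be spread.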

For \emph{rgrid/2}, the number of variant subgoal calls and the number
of suspended/resumed SCCs suggest that this benchmark leads to complex
dependencies between workers. Curiously, despite the large number of
consumer nodes that the benchmark allocates, contention in dependency
frames is not a problem. Contention for subgoal frames, on the other
hand, seems to be a major problem. The statistics suggest that the
large number of SCC resume operations and the large number of answers
that the benchmark finds are the key aspects constraining parallel
performance. A closer analysis shows that the number of resumed SCCs
remains approximately constant as the number of workers increases. This
may suggest that some answers can only be found after other answers
have been found, and that the process of finding such answers cannot be
anticipated. As a consequence, suspended SCCs always have to be resumed
to consume the answers that could not be found sooner. We believe that
this sequencing in the order in which answers are found is the other
major problem that restricts parallelism in tabled programs.

Another aspect that can negatively influence this benchmark is the
number of completed calls. Before executing the first call to a
completed subgoal, we need to traverse the trie structure of that
subgoal. While the trie structure is being traversed, the corresponding
subgoal frame is locked. As \emph{rgrid/2} stores a huge number of
answer trie nodes in the table (see Table~\ref{tabled_stats}), this can
lead to longer periods of lock contention.

\section{Concluding Remarks}

We have presented the design, implementation and evaluation of
OPTYap. OPTYap is the first available system that exploits both
or-parallelism and tabling in logic programs. A major guideline in
developing OPTYap was to make the best use of the excellent technology
already developed for previous systems. Accordingly, OPTYap uses Yap's
efficient sequential Prolog engine as its starting framework, and the
SLG-WAM and environment copying approaches as the basis for its tabling
and or-parallel components, respectively.

Through this research we aimed to show that the models developed to
exploit implicit or-parallelism in standard logic programming systems
can also be used to successfully exploit implicit or-parallelism in
tabled logic programming systems. Our first results reinforce our
belief that tabling and parallelism are a very good match that can
contribute to expanding the range of applications for Logic
Programming.

OPTYap introduces low overheads for sequential execution and compares
favorably with current versions of XSB. Moreover, it maintains YapOr's
effective speedups in exploiting or-parallelism in non-tabled
programs. Our best results for parallel execution of tabled programs
were obtained on applications that have a limited number of tabled
nodes, but high or-parallelism. However, we have also obtained good
speedups on applications with a large number of tabled nodes.

On the other hand, there are tabled programs for which OPTYap may not
speed up execution. Table access has been the main factor limiting
parallel speedups so far. OPTYap implements tables as tries, thus
obtaining good indexing and compression. However, tries are designed to
avoid redundancy and, to do so, they restrict concurrency, especially
when updating. We plan to study whether alternative designs for the
table data structure can achieve scalable speedups even when tables are
updated frequently.

So far, our applications do not show the completion algorithm to be a
major factor in performance. In the future, we plan to study OPTYap
over a wider range of applications, namely natural language and
database processing, and non-monotonic reasoning. We expect that
non-monotonic reasoning applications, for instance, will raise more
complex dependencies and further stress the completion algorithm. We
are also interested in the implementation of pruning in the parallel
environment.

\section*{Acknowledgments}

The authors are thankful to the anonymous reviewers for their valuable
comments. This work has been partially supported by $CLoP^n$ (CNPq),
PLAG (FAPERJ), APRIL (POSI/SRI/40749/2001), and by funds granted to
LIACC through the Programa de Financiamento Plurianual, Funda\c{c}\~ao
para a Ci\^encia e Tecnologia and Programa POSI.

\bibliographystyle{plain}
\bibliography{references}

\end{document}