1: \documentclass{tlp}
2: \usepackage{epsf}
3: \usepackage{isolatin1}
4: \usepackage{multicol}
5: \usepackage{amssymb}
6: 
7: 
8: 
9: \title[On Applying Or-Parallelism and Tabling to Logic Programs]
10:       {On Applying Or-Parallelism and Tabling to Logic Programs}
11: 
12: \author[R. Rocha, F. Silva and V. Santos Costa]
13:        {RICARDO ROCHA, FERNANDO SILVA\\
14:          DCC-FC \& LIACC\\
15:          Universidade do Porto, Portugal\\
16:          \email{\{ricroc,fds\}@ncc.up.pt}
17:        \and VITOR SANTOS COSTA\\
18:          COPPE Systems \& LIACC\\
19:          Universidade do Rio de Janeiro, Brasil\\
20:          \email{vitor@cos.ufrj.br}
21:        }
22: 
23: 
24: 
25: \begin{document}
26: \maketitle
27: 
28: \begin{abstract}
29:   Logic Programming languages, such as Prolog, provide a high-level,
30:   declarative approach to programming. Logic Programming offers great
31:   potential for implicit parallelism, thus allowing parallel systems
32:   to often reduce a program's execution time without programmer
33:   intervention. We believe that for complex applications that take
34:   several hours, if not days, to return an answer, even limited
35:   speedups from parallel execution can directly translate to very
36:   significant productivity gains.
37:   
38:   It has been argued that Prolog's evaluation strategy --~SLD
39:   resolution~-- often limits the potential of the logic programming
40:   paradigm. The past years have therefore seen widening efforts at
41:   increasing Prolog's declarativeness and expressiveness. Tabling has
42:   proved to be a viable technique to efficiently overcome SLD's
43:   susceptibility to infinite loops and redundant subcomputations.
44:   
45:   Our research demonstrates that implicit or-parallelism is a natural
46:   fit for logic programs with tabling. To substantiate this belief, we
47:   have designed and implemented an or-parallel tabling engine
48:   --~OPTYap~-- and we used a shared-memory parallel machine to
49:   evaluate its performance. To the best of our knowledge, OPTYap is
50:   the first implementation of a parallel tabling engine for logic
51:   programming systems. OPTYap builds on Yap's efficient sequential
52:   Prolog engine.  Its execution model is based on the SLG-WAM for
53:   tabling, and on the environment copying for or-parallelism.
54:   
55:   Preliminary results indicate that the mechanisms proposed to
56:   parallelize search in the context of SLD resolution can indeed be
57:   effectively and naturally generalized to parallelize tabled
58:   computations, and that the resulting systems can achieve good
59:   performance on shared-memory parallel machines. More importantly,
60:   they reinforce our belief that through applying or-parallelism and
61:   tabling to logic programs the range of applications for Logic
62:   Programming can be increased.
63: \end{abstract}
64: 
65: \begin{keywords}
66: Or-Parallelism, Tabling, Implementation, Performance.
67: \end{keywords}
68: 
69: 
70: 
71: \section{Introduction}
72: 
73: Logic programming provides a high-level, declarative approach to
74: programming. Arguably, Prolog is the most popular and powerful logic
75: programming language. Prolog's popularity was sparked by the success
76: of the sequential execution model presented in 1983 by David H. D.
77: Warren, the \emph{Warren Abstract Machine}
78: (\emph{WAM})~\cite{Warren-83}. Throughout its history, Prolog has
79: demonstrated the potential of logic programming in application areas
80: such as Artificial Intelligence, Natural Language Processing,
81: Knowledge Based Systems, Machine Learning, Database Management, or
82: Expert Systems.
83: 
84: Logic programs are written in a subset of First-Order Logic, Horn
85: clauses, that has an intuitive interpretation as positive facts and as
86: rules. Programs use the logic to express the problem, whilst questions
87: are answered by a resolution procedure with the aid of user
88: annotations. The combination was summarized by Kowalski's
89: motto~\cite{Kowalski-79}:
90: \[algorithm~=~logic~+~control\]
91: Ideally, one would want Prolog programs to be written as logical
92: statements first, and for control to be tackled as a separate issue.
93: In practice, the limitations of Prolog's operational semantics, SLD
94: resolution, mean that Prolog programmers must be concerned with SLD
95: semantics throughout program development.
96: 
97: Several proposals have been put forth to overcome some of these
98: limitations and therefore improve the declarativeness and
99: expressiveness of Prolog. One such proposal that has been gaining in
100: popularity is \emph{tabling}, also referred to as \emph{tabulation} or
101: \emph{memoing}~\cite{Michie-68}. In a nutshell, tabling consists of
102: storing intermediate answers for subgoals so that they can be reused
103: when a repeated subgoal appears during the resolution process. It can
104: be shown that tabling based execution models, such as SLG
105: resolution~\cite{Chen-96}, are able to reduce the search space, avoid
106: looping, and that they have better termination properties than SLD
107: based models. For instance, SLG resolution is guaranteed to terminate
108: for all logic programs with the \emph{bounded term-size
109: property}~\cite{Chen-96}.
110: 
111: Work on SLG resolution, as implemented in the XSB logic programming
112: system~\cite{xsb}, proved the viability of tabling technology for
113: applications such as Natural Language Processing, Knowledge Based
114: Systems and Data Cleaning, Model Checking, and Program Analysis. SLG
115: resolution also includes several extensions to Prolog, namely support
116: for negation~\cite{Apt-94}, hence allowing for novel applications in
117: the areas of Non-Monotonic Reasoning and Deductive Databases.
118: 
119: One of the major advantages of logic programming is that it is well
120: suited for parallel execution. The interest in the parallel execution
121: of logic programs mainly arose from the fact that parallelism can be
122: exploited \emph{implicitly} from logic programs. This means that
123: parallelism can be automatically exploited, that is, without input
124: from the programmer to express or manage parallelism, ideally making
125: parallel logic programming as easy as logic programming.
126: 
127: Logic programming offers two major forms of implicit parallelism,
128: \emph{Or-Parallelism} and \emph{And-Parallelism}. Or-parallelism
129: results from the parallel execution of alternative clauses for a given
130: predicate goal, while and-parallelism stems from the parallel
131: evaluation of subgoals in an alternative clause. Some of the most
132: well-known systems that successfully supported these forms of
133: parallelism are: Aurora~\cite{Aurora-88} and Muse~\cite{Ali-90a} for
134: or-parallelism; \&-Prolog~\cite{Hermenegildo-91},
135: DASWAM~\cite{Shen-92}, and ACE~\cite{Pontelli-97} for and-parallelism;
136: and Andorra-I~\cite{Costa-91} for or-parallelism together with
137: and-parallelism. A detailed presentation of such systems and the
138: challenges and problems in their implementation can be found
139: in~\cite{Gupta-01}. Arguably, or-parallel systems have been the most
140: successful parallel logic programming systems so far.  Experience has
141: shown that or-parallel systems can obtain very good speedups for
142: applications that require search.  Examples can be found in
143: application areas such as Parsing, Optimization, Structured Database
144: Querying, Expert Systems and Knowledge Discovery applications.
145: 
146: The good results obtained with parallelism and with tabling raise the
147: question of whether further efficiency improvements may be achievable
148: through parallelism. Freire and colleagues were the first to research
149: this area~\cite{Freire-95}. Although tabling works for both
150: deterministic and non-deterministic applications, Freire focused on
151: the search process, because tabling has frequently been used to reduce
152: the search space. In their model, each tabled subgoal is computed
153: independently in a separate computational thread, a \emph{generator
154: thread}. Each generator thread is solely responsible for fully
155: exploiting its subgoal and obtaining the complete set of answers.
156: Arguably, Freire's model will work particularly well if we have many
157: non-deterministic generators. On the other hand, it will not exploit
158: parallelism if there is a single generator and many non-tabled
159: subgoals. It also does not exploit parallelism between a generator's
160: clauses. As we discuss in Section~\ref{section_performance_analysis},
161: experience has shown that interesting applications do indeed have a
162: limited number of generators.
163: 
164: Ideally, we would like to exploit maximum parallelism and take maximum
165: advantage of current technology for tabling and parallel systems. To
166: exploit maximum parallelism, we would like to exploit parallelism from
167: both tabled and non-tabled subgoals. Further, we would like to reuse
168: existing technology for tabling and parallelism. As such, we would
169: like to exploit parallelism from tabled and non-tabled subgoals
170: \emph{in much the same way}. As with Freire, we focus on
171: or-parallelism first, and we will focus throughout on shared-memory
172: platforms.
173: 
174: Towards this goal, we proposed two new computational
175: models~\cite{Rocha-99a}, \emph{Or-Parallelism within Tabling}
176: (\emph{OPT}) and \emph{Tabling within Or-Parallelism} (\emph{TOP}).
177: Both models are based on the idea that all open alternatives in the
178: search tree should be amenable to parallel exploitation, be they from
179: tabled or non-tabled subgoals. The OPT model further assumes tabling
180: as the base component of the parallel system, that is, each
181: \emph{worker}\footnote{The term \emph{worker} is widely used in the
182: literature to refer to each computational unit contributing to the
183: parallel execution.} is a full sequential tabling engine. OPT
184: triggers or-parallelism when workers run out of alternatives to
185: exploit: at this point, a busy worker will share part of its SLG
186: derivations with the idle one. In contrast, the TOP model represents the
187: whole SLG forest as a shared search tree, thus unifying parallelism
188: with tabling. Workers are logically positioned at branches in this
189: tree. When a branch completes or suspends, workers move to nodes with
190: open alternatives, that is, alternatives with either open
191: clauses or new answers stored in the table.
192: 
193: The main contribution of this work is the design and performance
194: evaluation of what to the best of our knowledge is the first parallel
195: tabling logic programming system, OPTYap~\cite{Rocha-01}. We chose the
196: OPT model for two main advantages, both stemming from the fact that
197: OPT encapsulates or-parallelism within tabling. First, implementation
198: of the OPT model follows naturally from two well-understood
199: implementation issues: we need to implement a tabling engine, and then
200: we need to support or-parallelism. Second, in the OPT model a worker
201: can keep its nodes \emph{private} until reaching a sharing point. This
202: is a key issue in reducing parallel overheads. We remark that it is
203: common in or-parallel works to say that work is initially
204: \emph{private}, and that it is made \emph{public} after sharing.
205: 
206: OPTYap builds on the YapOr~\cite{Rocha-99b} and YapTab~\cite{Rocha-00}
207: engines. YapOr was previous work on supporting or-parallelism over
208: Yap's Prolog system~\cite{Costa-99b}. YapOr is based on the
209: environment copying model for shared-memory machines, as originally
210: implemented in Muse~\cite{Ali-90b}. YapTab is a sequential tabling
211: engine that extends Yap's execution model to support tabled evaluation
212: for definite programs. YapTab's implementation is largely based on the
213: ground-breaking design of the XSB system~\cite{Sagonas-94,Rao-97},
214: which implements the SLG-WAM~\cite{Swift-94b,Sagonas-96,Sagonas-98}.
215: YapTab has been designed from scratch and its development was done
216: taking into account the major purpose of further integration to
217: achieve an efficient parallel tabling computational model, whilst
218: comparing favorably with current \emph{state of the art} technology.
219: In other words, we aim at respecting the \emph{no-slowdown
220: principle}~\cite{Hermenegildo-Phd}: our or-parallel tabling system
221: should, when executed with a single worker, run as fast as or faster
222: than the currently available sequential tabling systems. Otherwise,
223: parallel performance results would not be significant and fair.
224: 
225: In order to validate our design we studied in detail the performance
226: of OPTYap on shared-memory machines with up to 32 workers. The results we
227: gathered show that OPTYap does indeed introduce low overheads for
228: sequential execution and that it compares favorably with current
229: versions of XSB.  Furthermore, the results show that OPTYap maintains
230: YapOr's speedups for parallel execution of non-tabled programs, and
231: that there are tabled applications that can achieve very high
232: performance through parallelism. This substantiates our belief that
233: tabling and parallelism can together contribute to increasing the
234: range of applications for Logic Programming.
235: 
236: 
237: 
238: \section{Tabling for Logic Programs}
239: 
240: The basic idea behind tabling is straightforward: programs are
241: evaluated by storing newly found answers of current subgoals in an
242: appropriate data space, called the \emph{table space}. The method then
243: uses this table to verify whether calls to subgoals are repeated.
244: Whenever such a repeated call is found, the subgoal's answers are
245: recalled from the table instead of being re-evaluated against the
246: program clauses. In practice, two major issues have to be addressed:
247: 
248: \begin{enumerate}
249: \item What is a repeated subgoal? We may say that a subgoal repeats if
250:   it is the same as a previous subgoal, up to variable renaming;
251:   alternatively, we may say it is repeated if it is an instance of a
252:   previous subgoal. The former approach is known as
253:   \emph{variant-based tabling}~\cite{Ramakrishnan-99}, the latter as
254:   \emph{subsumption-based tabling}~\cite{Rao-96}. Variant-based
255:   tabling has been researched first and is arguably better understood,
256:   although there has been significant recent progress in
257:   subsumption-based tabling~\cite{Johnson-99}. We shall use the
258:   variant-based tabling approach in this work.
259: \item How to execute subgoals? Clearly, we must change the selection
260:   function and search rule to accommodate repeated subgoals. In
261:   particular, we must address the situation where we recursively call
262:   a tabled subgoal before we have fully tabled all its
263:   answers. Several strategies to do so have been
264:   proposed~\cite{Tamaki-86,Vieille-89,Chen-96}. We use the popular SLG
265:   resolution~\cite{Chen-96} in this work, mainly because this approach
266:   has good termination properties.
267: \end{enumerate}
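
As a rough illustration of the two notions of a repeated subgoal, the
following Python sketch (not taken from any tabling engine) represents
terms as nested tuples and variables as capitalised strings. The helper
names \texttt{is\_var}, \texttt{match}, \texttt{is\_instance} and
\texttt{is\_variant} are hypothetical, and the sketch assumes that the
variables of the two terms being compared are treated as distinct even
when they happen to share a name.

\begin{verbatim}
```python
def is_var(term):
    """Prolog convention: variables start with an uppercase letter or '_'."""
    return isinstance(term, str) and (term[0].isupper() or term[0] == "_")


def match(general, specific, binding):
    """One-way matching: bind variables of `general` so it equals `specific`.

    Variables occurring in `specific` are treated as constants.
    """
    if is_var(general):
        if general in binding:
            return binding[general] == specific
        binding[general] = specific
        return True
    if isinstance(general, tuple) and isinstance(specific, tuple):
        return (len(general) == len(specific)
                and all(match(g, s, binding)
                        for g, s in zip(general, specific)))
    return general == specific


def is_instance(specific, general):
    """Subsumption test: `specific` is an instance of `general`."""
    return match(general, specific, {})


def is_variant(s, t):
    """Variant test: the terms are equal up to variable renaming."""
    return is_instance(s, t) and is_instance(t, s)
```
\end{verbatim}

Under these assumptions, \texttt{path(a,Z)} and \texttt{path(a,W)} are
variants, whereas \texttt{path(a,b)} is only an instance of
\texttt{path(a,Z)}: it would count as a repeated call under
subsumption-based tabling but not under variant-based tabling.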
268: 
269: In the following, we illustrate the main principles of tabled
270: evaluation using SLG resolution through an example.
271: 
272: 
273: 
274: \subsection{Tabled Evaluation}
275: 
276: Consider the Prolog program of Figure~\ref{fig_finite_SLG_tree}. The
277: program defines a small directed graph, represented by the
278: \texttt{arc/2} predicate, with a relation of reachability, given by
279: the \texttt{path/2} predicate. In this example we ask the query goal
280: \texttt{?- path(a,Z)} on this program. Note that traditional Prolog
281: would immediately enter an infinite loop because the first clause of
282: \texttt{path/2} leads to a repeated call to \texttt{path(a,Z)}. In
283: contrast, if tabling is applied then termination is ensured. The
284: declaration \texttt{:- table path/2} in the program code indicates
285: that predicate \texttt{path/2} should be tabled.
286: Figure~\ref{fig_finite_SLG_tree} illustrates the evaluation sequence
287: when using tabling.
288: 
289: \begin{figure}[!ht]
290: \centerline{
291: \epsfxsize=12cm
292: \epsffile{finite_SLG_tree.eps}
293: }
294: \caption{A finite tabled evaluation.}
295: \label{fig_finite_SLG_tree}
296: \end{figure}
297: 
298: At the top, the figure illustrates the program code and the state of
299: the table space at the end of the evaluation. The main sub-figure
300: shows the forest of SLG trees for the original query. The topmost tree
301: represents the original invocation of the tabled subgoal
302: \texttt{path(a,Z)}. It thus computes all nodes reachable from node
303: \texttt{a}. As we shall see, computing all nodes reachable from
304: \texttt{a} requires computing all nodes reachable from \texttt{b} and
305: all nodes reachable from \texttt{c}. The middle tree represents the
306: SLG tree \texttt{path(b,Z)}, that is, it computes all nodes reachable
307: from node \texttt{b}. The bottommost tree represents the SLG tree
308: \texttt{path(c,Z)}.
309: 
310: Next, we describe in detail the evaluation sequence presented in the
311: figure. For simplicity of presentation, the root nodes of the SLG
312: trees \texttt{path(b,Z)} and \texttt{path(c,Z)}, nodes $6$ and $13$,
313: are shown twice. The numbering of nodes denotes the evaluation
314: sequence.
315: 
316: Whenever a tabled subgoal is first called, a new tree is added to the
317: forest of trees and a new entry is added to the table space. We name
318: first calls to tabled subgoals \emph{generator nodes} (nodes depicted
319: by white oval boxes). In this case, execution starts with a generator
320: node, node $0$. The evaluation thus begins by creating a new tree
321: rooted by \texttt{path(a,Z)} and by inserting a new entry in the table
322: space for it.
323: 
324: The second step is to resolve \texttt{path(a,Z)} against the first
325: clause for \texttt{path/2}, creating node $1$. Node $1$ is a variant
326: call to \texttt{path(a,Z)}. We do not resolve the subgoal against the
327: program at such nodes; instead, we consume answers from the table
328: space. Such nodes are thus called \emph{consumer nodes} (nodes
329: depicted by gray oval boxes). At this point, the table does not have
330: answers for this call. The consumer therefore must \emph{suspend},
331: either by freezing the whole stacks~\cite{Sagonas-98}, or by copying
332: the stacks to separate storage~\cite{Demoen-00}.
333: 
334: The only possible move after suspending is to backtrack to node $0$.
335: We then try the second clause to \texttt{path/2}, thus calling
336: \texttt{arc(a,Z)}. The \texttt{arc/2} predicate is not tabled, hence
337: it must be resolved against the program, as Prolog would. We name such
338: nodes \emph{interior nodes}. The first clause for \texttt{arc/2}
339: immediately succeeds (step $3$). We return back to the context for the
340: original goal, obtaining an answer for \texttt{path(a,Z)}, and store
341: the answer \texttt{Z=b} in the table.
342: 
343: We can now choose between two options. We may backtrack and try the
344: alternative clauses for \texttt{arc/2}. Otherwise, we may suspend the
345: current execution, and resume node $1$ with the newly found answer. We
346: decide to continue exploiting the interior node. Both steps $4$ and
347: $5$ fail, so we backtrack to node $0$. Node $0$ has no more clauses
348: left to try, so we try to check whether it has \emph{completed}. It
349: has not, as node $1$ has not consumed all its answers. We therefore
350: must resume node $1$. The stacks are thus restored to their state at
351: node $1$, and the answer \texttt{Z=b} is forwarded to this node. The
352: subgoal succeeds trivially and we call the continuation,
353: \texttt{path(b,Z)}. This is the first call to \texttt{path(b,Z)}, so
354: we must create a new tree rooted by \texttt{path(b,Z)} (node $6$),
355: insert a new entry in the table space for it, and proceed with the
356: evaluation of \texttt{path(b,Z)}, as shown in the middle tree.
357: 
358: Again, \texttt{path(b,Z)} calls itself recursively, and suspends at
359: node $7$. We now have two consumers, node $1$ and node $7$. The only
360: answer in the table was already consumed, so we have to backtrack to
361: node $6$. This leads to generating a new interior node (node $8$) and
362: consulting the program for clauses to \texttt{arc(b,Z)}. The first
363: clause fails (step $9$), but the second clause matches (step $10$).
364: The answer is returned to node $6$ and stored in the table. We next
365: have three choices: continue forward execution, backtrack to the open
366: interior node, or resume the consumer node $7$. In the example we
367: choose to follow a Prolog-like strategy and continue forward
368: execution. Step $11$ thus returns the binding \texttt{Z=c} to the
369: subgoal \texttt{path(a,Z)}. We store this answer in
370: \texttt{path(a,Z)}'s table entry.
371: 
372: This will be the last answer to \texttt{path(a,Z)}, but we can only
373: prove so after fully exploiting the tree: we still have an open
374: interior node (node $8$), and two suspended consumers (nodes $1$ and
375: $7$). We now choose to backtrack to node $8$, and exploit the last
376: clause for \texttt{arc/2} (step $12$). At this point we fail all the
377: way back to node $6$. We cannot complete node $6$ yet, as we have an
378: unfinished consumer below (node $7$). The only answer in the table for
379: this consumer is \texttt{Z=c}. We use this answer and obtain a first
380: call to \texttt{path(c,Z)}.
381: 
382: The new generator, node $13$, needs a new table. Again, we try the
383: first clause and suspend on the recursive call (node $14$). Next, we
384: backtrack to the second clause. Resolution on \texttt{arc(c,Z)} (node
385: $15$) fails twice (steps $16$ and $17$), and then generates an answer,
386: \texttt{Z=b} (step $18$). We return the answer to node $13$, and store
387: the answer in the table. Again, we choose to continue forward
388: execution, thus finding a new answer to \texttt{path(b,Z)}, which is
389: again stored in the table (step $19$). Next, we continue forward
390: execution (step $20$), and find an answer to \texttt{path(a,Z)},
391: \texttt{Z=b}. This answer had already been found at step $3$. SLG
392: resolution does not store duplicate answers in the table. Instead,
393: repeated answers \emph{fail}. This is how the SLG-WAM avoids
394: unnecessary computations, and even looping in some cases.
395: 
396: What to do next? We do not have interior nodes to exploit, so we
397: backtrack to generator node $13$. The generator cannot complete
398: because it has a consumer below (node $14$). We thus try to complete
399: by sending answers to consumer node $14$. The first answer,
400: \texttt{Z=b}, leads to a new consumer for \texttt{path(b,Z)} (node
401: $21$). The table has two answers for \texttt{path(b,Z)}, so we can
402: continue the consumer immediately. This gives a new answer
403: \texttt{Z=c} to \texttt{path(c,Z)}, which is stored in the table
404: (step $22$). Continuing forward execution results in the answer
405: \texttt{Z=c} to \texttt{path(b,Z)} (step $23$). This answer repeats
406: what we found in step $10$, so we must fail at this point.
407: Backtracking sends us back to consumer node $21$. We then consume the
408: second answer for \texttt{path(b,Z)}, which generates a repeated
409: answer, so we fail again (step $24$). We then try consumer node $14$.
410: It next consumes the second answer, again leading to repeated
411: subgoals, as shown in steps $25$ to $27$. At this point we fail back
412: to node $13$, which makes sure that all answers to the consumers below
413: (nodes $14$, $21$, and $25$) have been tried. Unfortunately, node $13$
414: cannot complete, because it depends on subgoal \texttt{path(b,Z)}
415: (node $21$). Completing \texttt{path(c,Z)} earlier is not safe because
416: we can lose answers. Note that, at this point, new answers can still
417: be found for subgoal \texttt{path(b,Z)}. If new answers are found,
418: consumer node $21$ should be resumed with the newly found answers,
419: which in turn can lead to new answers for subgoal \texttt{path(c,Z)}.
420: If we complete sooner, we can lose such answers.
421: 
422: Execution thus backtracks and we try the answer left for consumer node
423: $7$. Steps $28$ to $30$ show that again we only get repeated answers.
424: We fail and return to node $6$. All nodes in the trees for node $6$
425: and node $13$ have been exploited. As these trees do not depend on any
426: other tree, we are sure no more answers are forthcoming, so at last
427: step $31$ declares the two trees to be complete, and closes the
428: corresponding table entries.
429: 
430: Next we backtrack to consumer node $1$. We had not tried \texttt{Z=c}
431: on this node, but exploiting this answer leads to no further answers
432: (steps $32$ to $34$). The computation has thus fully exploited every
433: node, and we can complete the remaining table entry (step $35$).
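
The tables reached at the end of this walkthrough can be reproduced
with a small fixpoint computation. The Python sketch below is not SLG
resolution: it has no suspension, scheduling or completion, and simply
re-applies the two \texttt{path/2} clauses until no table gains an
answer. It assumes that the program of
Figure~\ref{fig_finite_SLG_tree} consists of the clauses
\texttt{path(X,Z) :- path(X,Y), path(Y,Z)} and
\texttt{path(X,Z) :- arc(X,Z)} with the facts \texttt{arc(a,b)},
\texttt{arc(b,c)} and \texttt{arc(c,b)}, which is what the steps above
imply. The names \texttt{ARCS} and \texttt{tabled\_path} are
hypothetical.

\begin{verbatim}
```python
# Assumed graph, inferred from the walkthrough: a -> b, b -> c, c -> b.
ARCS = {("a", "b"), ("b", "c"), ("c", "b")}


def tabled_path(start):
    """Return a table mapping X to the answers Z found for path(X, Z)."""
    table = {start: []}           # one entry per tabled subgoal path(X, Z)
    changed = True
    while changed:                # stop when no table gained an answer
        changed = False
        for x in list(table):
            # Second clause: path(X,Z) :- arc(X,Z).
            candidates = [z for (src, z) in sorted(ARCS) if src == x]
            # First clause: path(X,Z) :- path(X,Y), path(Y,Z).
            for y in list(table[x]):
                if y not in table:        # first call to path(Y, Z)
                    table[y] = []
                    changed = True
                candidates.extend(table[y])
            for z in candidates:
                if z not in table[x]:     # repeated answers are discarded
                    table[x].append(z)
                    changed = True
    return table


if __name__ == "__main__":
    for subgoal, answers in tabled_path("a").items():
        print(f"path({subgoal},Z): {answers}")
```
\end{verbatim}

Under these assumptions, calling \texttt{tabled\_path("a")} creates
entries for \texttt{path(a,Z)}, \texttt{path(b,Z)} and
\texttt{path(c,Z)}, each holding the answers \texttt{b} and \texttt{c},
and terminates even though the first clause is left-recursive.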
434: 
435: 
436: 
437: \subsection{SLG-WAM Operations}
438: 
439: The example showed four new main operations: entering a tabled subgoal;
440: adding a new answer to a generator; exporting an answer from the
441: table; and trying to complete the tree. In more detail:
442: 
443: \begin{enumerate}
444: \item The \emph{tabled subgoal call} operation is a call to a tabled
445:   subgoal. It checks if a subgoal is in the table, and if not, adds a
446:   new entry for it and allocates a new generator node (nodes $0$, $6$
447:   and $13$). Otherwise, it allocates a consumer node and starts
448:   consuming the available answers (nodes $1$, $7$, $14$, $21$, $25$,
449:   $28$ and $32$).
450: \item The \emph{new answer} operation returns a new answer to a
451:   generator. It verifies whether a newly generated answer is already
452:   in the table, and if not, inserts it (steps $3$, $10$, $11$, $18$,
453:   $19$ and $22$). Otherwise, it fails (steps $20$, $23$, $24$, $26$,
454:   $27$, $29$, $30$, $33$, and $34$).
455: \item The \emph{answer resolution} operation forwards answers from
456:   the table to a consumer node. It verifies whether newly found
457:   answers are available for a particular consumer node and, if any,
458:   consumes the next one. Otherwise, it schedules a possible resolution
459:   to continue the execution. Answers are consumed in the same order
460:   they are inserted in the table. The answer resolution operation is
461:   executed every time the computation reaches a consumer node.
462: \item The \emph{completion} operation determines whether a tabled
463:   subgoal is \emph{completely evaluated}. It executes when we
464:   backtrack to a generator node and all of its clauses have been
465:   tried. If the subgoal has been completely evaluated, the operation
466:   closes its table entry and reclaims space (steps $31$ and $35$).
467:   Otherwise, it schedules a possible resolution to continue the
468:   execution.
469: \end{enumerate}
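
The four operations above can be pictured with a toy table space. The
Python sketch below is only a model under simplifying assumptions: the
class and method names are hypothetical, subgoals are plain strings
that are assumed to be already normalised up to variable renaming, and
the \texttt{completion} method only checks that the consumers passed to
it have no unconsumed answers, ignoring dependencies between different
subgoals.

\begin{verbatim}
```python
class TableEntry:
    """Table entry for one tabled subgoal (hypothetical layout)."""

    def __init__(self):
        self.answers = []       # answers, kept in insertion order
        self.complete = False   # set once the subgoal is completely evaluated


class Consumer:
    """A consumer node: remembers how many answers it has consumed."""

    def __init__(self, entry):
        self.entry = entry
        self.consumed = 0

    def answer_resolution(self):
        """Return the next unconsumed answer, or None if there is none."""
        if self.consumed < len(self.entry.answers):
            answer = self.entry.answers[self.consumed]
            self.consumed += 1
            return answer
        return None


class TableSpace:
    def __init__(self):
        self.entries = {}

    def tabled_subgoal_call(self, subgoal):
        """First call: new entry and generator. Variant call: consumer."""
        if subgoal not in self.entries:
            self.entries[subgoal] = TableEntry()
            return "generator", self.entries[subgoal]
        return "consumer", Consumer(self.entries[subgoal])

    def new_answer(self, subgoal, answer):
        """Insert a new answer; return False (fail) if it is repeated."""
        entry = self.entries[subgoal]
        if answer in entry.answers:
            return False
        entry.answers.append(answer)
        return True

    def completion(self, subgoal, consumers):
        """Close the entry if the given consumers have seen every answer."""
        entry = self.entries[subgoal]
        if all(c.consumed == len(c.entry.answers) for c in consumers):
            entry.complete = True
        return entry.complete
```
\end{verbatim}

In this model a repeated answer makes \texttt{new\_answer} return
\texttt{False}, mirroring the failing steps listed above, and each
consumer reads answers in the order in which they were inserted.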
470: 
471: The example also shows that we have some latitude on where and when to
472: apply these operations. The actual sequence of operations thus depends
473: on a \emph{scheduling strategy}. We next discuss the main principles
474: for completion and scheduling strategies in some more detail.
475: 
476: 
477: 
478: \subsection{Completion}
479: 
480: Completion is needed in order to recover space and to support
481: negation. We are most interested in space recovery in this work.
482: Arguably, in this case we could delay completion until the very end of
483: execution. Unfortunately, doing so would also mean that we could only
484: recover space for suspended (consumer) subgoals at the very end of the
485: execution. Instead we shall try to achieve \emph{incremental
486: completion}~\cite{Chen-95} to detect whether a generator node has been
487: fully exploited, and if so to recover space for all its consumers.
488: 
489: Completion is hard because a number of generators may be mutually
490: dependent. Figure~\ref{fig_graph_dependencies} shows the dependencies
491: for the completed graph. Node $0$ depends on itself recursively
492: through consumer node $1$, and on generator node $6$. Node $6$ depends
493: on itself through consumer nodes $7$ and $28$, and on node $13$. Node $13$
494: also depends on itself through consumer nodes $14$ and $25$, and on node $6$
495: through consumer node $21$. There is thus a loop between nodes $6$ and
496: $13$: if we find a new answer for node $6$, we may get new answers for
497: node $13$, and so for node $6$.
498: 
499: \begin{figure}[!ht]
500: \centerline{
501: \epsfxsize=7cm
502: \epsffile{graph_dependencies.eps}
503: }
504: \caption{Node dependencies for the completed graph.}
505: \label{fig_graph_dependencies}
506: \end{figure}
507: 
508: In general, a set of mutually dependent subgoals forms a
509: \emph{Strongly Connected Component} (or
510: \emph{SCC})~\cite{Tarjan-72}. Clearly, we can only complete SCCs
511: together. We will usually represent an SCC through the oldest
512: generator. More precisely, the youngest generator node which does not
513: depend on older generators is called the \emph{leader node}. A leader
514: node is also the oldest node for its SCC, and defines the current
515: completion point.
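
To make the definition concrete, the following Python sketch computes
the SCCs of the generator dependencies described above using Tarjan's
algorithm~\cite{Tarjan-72}. It is an offline illustration of the
definition only, not a description of how a tabling engine detects
leaders during execution. The names \texttt{DEPENDS\_ON} and
\texttt{strongly\_connected\_components} are hypothetical, and the
oldest generator of an SCC is taken to be the one with the smallest
node number, since the numbering follows the evaluation sequence.

\begin{verbatim}
```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; `graph` maps each node to its successors."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:        # v is the root of an SCC
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(sorted(component))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


# Generator dependencies as described in the text for nodes 0, 6 and 13.
DEPENDS_ON = {0: [0, 6], 6: [6, 13], 13: [13, 6]}

if __name__ == "__main__":
    for component in strongly_connected_components(DEPENDS_ON):
        print("SCC", component, "leader", min(component))
```
\end{verbatim}

With these dependencies, nodes $6$ and $13$ fall into one SCC led by
node $6$, while node $0$ forms an SCC of its own. This agrees with the
example, where the trees for nodes $6$ and $13$ are completed together
and the tree for node $0$ is completed afterwards.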
516:   
517: XSB uses a stack of generators to detect completion
518: points~\cite{Sagonas-98}. Each time a new generator is introduced it
519: becomes the current leader node. Each time a new consumer is
520: introduced one verifies if it is for an older generator node ${\cal
521: G}$. If so, ${\cal G}$'s leader node becomes the current leader
522: node. Unfortunately, this algorithm does not scale well for parallel
523: execution, which is not easily representable with a single stack.
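
One plausible reading of this stack-based scheme is sketched below in
Python. It is a simplification written for illustration and does not
reproduce XSB's actual data structures: the class
\texttt{CompletionStack} and its methods are hypothetical, generators
are identified by their node numbers, and a consumer call for
generator ${\cal G}$ is modelled by giving every generator above
${\cal G}$ on the stack the leader of ${\cal G}$.

\begin{verbatim}
```python
class CompletionStack:
    """Toy model of leader detection with a single stack of generators."""

    def __init__(self):
        self.generators = []    # generator ids, oldest first
        self.leader = {}        # generator id -> current leader id

    def push_generator(self, gen):
        """A new generator starts out as its own leader."""
        self.generators.append(gen)
        self.leader[gen] = gen

    def add_consumer(self, gen):
        """A consumer (variant call) of generator `gen` was created."""
        target = self.leader[gen]
        position = self.generators.index(gen)
        # Every younger generator now depends on `gen`, hence on its leader.
        for younger in self.generators[position + 1:]:
            self.leader[younger] = target


if __name__ == "__main__":
    cs = CompletionStack()
    cs.push_generator(0)    # path(a,Z), node 0
    cs.add_consumer(0)      # node 1
    cs.push_generator(6)    # path(b,Z), node 6
    cs.add_consumer(6)      # node 7
    cs.push_generator(13)   # path(c,Z), node 13
    cs.add_consumer(13)     # node 14
    cs.add_consumer(6)      # node 21: path(c,Z) now depends on path(b,Z)
    print(cs.leader)
```
\end{verbatim}

Replaying the calls of the example in this model leaves node $6$ as the
leader of node $13$, which matches the walkthrough: node $13$ cannot
complete on its own. The model relies on one totally ordered stack of
generators, which is the property that is hard to keep when several
workers explore the search tree at once.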
524: 
525: 
526: 
527: \subsection{Scheduling}
528: 
529: At several points we had to choose between continuing forward
530: execution, backtracking to interior nodes, returning answers to
531: consumer nodes, or performing completion. Ideally, we would like to
532: run these operations in \emph{parallel}. In a sequential system, the
533: decision on which operation to perform is crucial to system
534: performance and is determined by the \emph{scheduling strategy}.
535: Different scheduling strategies may have a significant impact on
536: performance, and may lead to a different order of answers. YapTab
537: implements two different scheduling strategies, \emph{batched} and
538: \emph{local}~\cite{Freire-96}. YapTab's default scheduling strategy is
539: batched.
540: 
541: Batched scheduling is the strategy we followed in the example: it
542: favors forward execution first, backtracking to interior nodes next,
543: and returning answers or completion last. It thus tries to delay the
544: need to move around the search tree by \emph{batching} the return of
545: answers. When new answers are found for a particular tabled subgoal,
546: they are added to the table space and the evaluation continues until
547: it resolves all program clauses for the subgoal in hand.
548: 
549: Batched scheduling runs all interior nodes before restarting the
550: consumers. In the worst case, this strategy may result in creating a
551: complex graph of interdependent consumers. Local scheduling is an
552: alternative tabling scheduling strategy that tries to evaluate
553: subgoals as independently as possible, by executing one SCC at a time.
554: Answers are only returned to the leader's calling environment when its
555: SCC is completely evaluated.
556: 
557: 
558: 
559: \section{The Sequential Tabling Engine}
560: 
561: We next give a brief introduction to the implementation of YapTab.
562: Throughout, we focus on support for the parallel execution of definite
563: programs.
564: 
565: The YapTab design is WAM based, as is the SLG-WAM. Yap's data
566: structures are very close to the WAM's~\cite{Warren-83}: there is a
567: \emph{local stack}, storing both choice points and environment frames;
568: a \emph{global stack}, storing compound terms and variables; a
569: \emph{code space area}, storing code and the internal database; a
570: \emph{trail}; and an \emph{auxiliary stack}. To support the SLG-WAM we
571: must extend the WAM with a new data area, the \emph{table space}; a
572: new set of registers, the \emph{freeze registers}; and an extension of the
573: standard trail, the \emph{forward trail}. We must support four new
574: operations: \emph{tabled subgoal call}, \emph{new answer},
575: \emph{answer resolution}, and \emph{completion}. Last, we must support
576: one or several \emph{scheduling strategies}.
577: 
578: We reconsidered decisions in the original SLG-WAM that can be a
579: potential source of parallel overheads. Namely, we argue that the
580: stack based completion detection mechanism used in the SLG-WAM is not
581: suitable for a parallel implementation. The SLG-WAM considers that the
582: control of leader detection and scheduling of unconsumed answers
583: should be done at the level of the data structures corresponding to
584: first calls to tabled subgoals, and it does so by associating
585: completion frames to generator nodes. On the other hand, YapTab
586: considers that such control should be performed through the data
587: structures corresponding to variant calls to tabled subgoals, and thus
588: it associates a new data structure, the \emph{dependency frame}, to
589: consumer nodes. We believe that managing dependencies at the level of
590: the consumer nodes is a more intuitive approach that we can take
591: advantage of.
592: 
593: The introduction of this new data structure allows us to reduce the
594: number of extra fields in tabled choice points and to eliminate the
595: need for a separate completion stack. Furthermore, allocating the data
596: structure in a separate area simplifies the implementation of
597: parallelism. We next review the main data structures and algorithms of
598: the YapTab design. A more detailed description is given
599: in~\cite{Rocha-PhD}.
600: 
601: 
602: 
603: \subsection{Table Space}
604: 
605: The table space can be accessed in different ways: to look up if a
606: subgoal is in the table, and if not insert it; to verify whether a
607: newly found answer is already in the table, and if not insert it; to
608: pick up answers to consumer nodes; and to mark subgoals as
609: completed. Hence, a correct design of the algorithms to access and
610: manipulate the table data is a critical issue to obtain an efficient
611: tabling system implementation.
612: 
613: Our implementation of tables uses tries as proposed by Ramakrishnan
614: \emph{et al.}~\cite{Ramakrishnan-99}. Tries provide complete
615: discrimination for terms and permit lookup and possibly insertion to
616: be performed in a single pass through a term. In
617: section~\ref{section_concurrent_table_access} we discuss how OPTYap
618: supports concurrent access to tries.
619: 
620: Figure~\ref{fig_tries} shows the completed table for the query shown
621: in Figure~\ref{fig_finite_SLG_tree}. Table lookup starts from the
622: \emph{table entry} data structure. Each tabled predicate has one such
623: structure, which is allocated at compilation time. A pointer to the
624: table entry can thus be included in the compiled code. Calls to the
625: predicate will always access the table starting from this point.
626: 
627: \begin{figure}[!ht]
628: \centerline{
629: \epsfxsize=11cm
630: \epsffile{tries_example.eps}
631: }
632: \caption{Using tries to organize the table space.}
633: \label{fig_tries}
634: \end{figure}
635: 
636: The table entry points to a tree of trie nodes, the \emph{subgoal trie
637:   structure}. More precisely, each different call to \texttt{path/2}
638: corresponds to a unique path through the subgoal trie structure. Such
639: a path always starts from the table entry, follows a sequence of
640: subgoal trie data units, the \emph{subgoal trie nodes}, and terminates
641: at a leaf data structure, the \emph{subgoal frame}.
642: 
643: Each subgoal trie node represents a binding for an argument or
644: sub-argument of the subgoal. In the example, we have three possible
645: bindings for the first argument, \texttt{X=c}, \texttt{X=b}, and
646: \texttt{X=a}. Each binding stores two pointers: one to be followed if
647: the argument matches the binding, the other to be followed otherwise.
648: 
649: We often have to search through a chain of sibling nodes that
650: represent alternative paths, e.g., in the query \texttt{path(a,Z)} we
651: have to search through nodes \texttt{X=c} and \texttt{X=b} until
652: finding node \texttt{X=a}. By default, this search is done
653: sequentially. When the chain becomes larger than a threshold value, we
654: dynamically index the nodes through a hash table to provide direct
655: node access and therefore optimize the search.
656: 
657: Each subgoal frame stores information about the subgoal, namely an
658: entry point to its \emph{answer trie structure}. Each unique path
659: through the answer trie data units, the \emph{answer trie nodes},
660: corresponds to a different answer to the entry subgoal. All answer
661: leaf nodes are inserted in a linked list: the subgoal frame points at
662: the first and last entry in this list. Leaf answer nodes are
663: chained together in insertion time order, so that we can recover
664: answers in the same order they were inserted. A consumer node thus
665: needs only to point at the leaf node for its last consumed answer, and
666: consumes more answers just by following the chain of leaves.
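
The lookup-and-insert discipline described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not YapTab's actual implementation: the class and function names, the flat token encoding of terms (one token per argument, with \texttt{VAR0} standing for the first variable), and the threshold value are all hypothetical.

```python
HASH_THRESHOLD = 8  # assumed value; the text only says "a threshold value"


class TrieNode:
    def __init__(self, token, parent=None):
        self.token = token        # binding represented by this node
        self.parent = parent
        self.child = None         # followed when the argument matches
        self.sibling = None       # followed otherwise (next alternative)
        self.n_children = 0
        self.hash = None          # dict index, built once the chain is long
        self.next_leaf = None     # answer leaves: next answer in insertion order
        self.frame = None         # subgoal trie leaves: the subgoal frame


def check_insert(parent, token):
    """Find the child of `parent` holding `token`, inserting it if absent."""
    if parent.hash is not None:
        node = parent.hash.get(token)          # direct access via hash table
    else:
        node = parent.child                    # sequential sibling search
        while node is not None and node.token != token:
            node = node.sibling
    if node is not None:
        return node, False
    node = TrieNode(token, parent)
    node.sibling = parent.child
    parent.child = node
    parent.n_children += 1
    if parent.hash is not None:
        parent.hash[token] = node
    elif parent.n_children > HASH_THRESHOLD:   # index the chain dynamically
        parent.hash = {}
        n = parent.child
        while n is not None:
            parent.hash[n.token] = n
            n = n.sibling
    return node, True


def insert_path(root, tokens):
    """Single pass over the term: lookup and, if needed, insertion."""
    node, created = root, False
    for token in tokens:
        node, created = check_insert(node, token)
    return node, created


class SubgoalFrame:
    def __init__(self):
        self.answer_trie = TrieNode(None)
        self.first_answer = None
        self.last_answer = None

    def new_answer(self, tokens):
        """Insert an answer; return False if it was already in the table."""
        leaf, created = insert_path(self.answer_trie, tokens)
        if created:                            # chain leaves in insertion order
            if self.last_answer is None:
                self.first_answer = leaf
            else:
                self.last_answer.next_leaf = leaf
            self.last_answer = leaf
        return created


class TableEntry:
    def __init__(self):
        self.subgoal_trie = TrieNode(None)

    def subgoal_call(self, tokens):
        """Return (subgoal frame, is_first_call) for a tabled call."""
        leaf, created = insert_path(self.subgoal_trie, tokens)
        if created:
            leaf.frame = SubgoalFrame()
        return leaf.frame, created


def answers_after(frame, last_consumed):
    """Yield the answer leaves a consumer has not consumed yet."""
    leaf = frame.first_answer if last_consumed is None else last_consumed.next_leaf
    while leaf is not None:
        yield leaf
        leaf = leaf.next_leaf
```

In this sketch a consumer only needs to remember the last leaf it consumed; calling \texttt{answers\_after} again later picks up exactly the answers inserted in the meantime.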
667: 
668: 
669: 
670: \subsection{Generator and Consumer Nodes}
671: 
672: Generator and consumer nodes correspond, respectively, to first and
673: variant calls to tabled subgoals, while interior nodes correspond to
674: normal, not tabled, subgoals. Interior nodes are implemented at the
675: engine level as WAM choice points. To implement generator nodes we
676: extended the WAM choice points with a pointer to the corresponding
677: subgoal frame. To implement consumer nodes we use the notion of
678: \emph{dependency frame}. Dependency frames will be stored in a proper
679: space, the \emph{dependency space}.
680: Figure~\ref{fig_nodes_relationships} illustrates how generator and
681: consumer nodes interact with the table and dependency spaces. As we
682: shall see in section~\ref{section_leader_nodes}, having a separate
683: dependency space is quite useful for our copying-based implementation,
684: although dependency frames could be stored together with the
685: corresponding choice point in the sequential implementation. All
686: dependency frames are linked together to form a dependency list of
687: consumer nodes. Additionally, dependency frames store information
688: about the last consumed answer for the corresponding consumer node;
689: and information for detecting completion points, as we discuss next.
690: 
691: \begin{figure}[!ht]
692: \centerline{
693: \epsfxsize=12cm
694: \epsffile{nodes_relationships.eps}
695: }
696: \caption{The nodes and their relationship with the table and dependency spaces.}
697: \label{fig_nodes_relationships}
698: \end{figure}
699: 
700: 
701: 
702: \subsection{Leader Nodes}
703: 
704: We need to perform completion in order to recover space and in order
705: to determine negative loops between subgoals in programs with
706: negation. In this work we focus on positive programs only, so our goal
707: will be to recover space. Unfortunately, as an artifact of the
708: SLG-WAM, it can happen that the stack segments for a SCC ${\cal S}$
709: remain within the stack segments for another SCC ${\cal S'}$. In such
710: cases, ${\cal S}$ cannot be recovered in advance when completed, and
711: thus, recovering its space must be delayed until ${\cal S'}$ also
712: completes. To approximate SCCs in a stack-based implementation,
713: Sagonas~\cite{Sagonas-PhD} denotes a set of SCCs whose space must be
714: recovered together as an \emph{Approximate SCC} or \emph{ASCC}. For
715: simplicity, in the following we will use the SCC notation to refer to
716: both ASCCs and SCCs.
717: 
718: The completion operation takes place when we backtrack to a generator
719: node that \textbf{(i)} has exhausted all its alternatives and that
720: \textbf{(ii)} is a leader node (remember that the youngest
721: generator node which does not depend on older generators is called a
722: leader node). We designed novel algorithms to quickly determine
723: whether a generator node is a leader node. The key idea in our
724: algorithms is that each dependency frame holds a pointer to the
725: resulting leader node of the SCC that includes the corresponding
726: consumer node. Using the leader node pointer from the dependency
727: frames, a generator node can quickly determine whether it is a leader
728: node. More precisely, in our algorithm, a generator ${\cal L}$ is a
729: leader node when either \textbf{(a)} ${\cal L}$ is the youngest tabled
730: node, or \textbf{(b)} the youngest consumer says that ${\cal L}$ is
731: the leader.
732: 
733: Our algorithm thus requires computing leader node information whenever
734: creating a new consumer node ${\cal C}$. We proceed as follows. First,
735: we hypothesize that the leader node is ${\cal C}$'s generator, say
736: ${\cal G}$. Next, for all consumer nodes older than ${\cal C}$ and
737: younger than ${\cal G}$, we check whether they depend on a generator
738: node older than ${\cal G}$. If at least one of them does, let
739: ${\cal G'}$ be the oldest such generator; then ${\cal G'}$ is
740: the leader node. Otherwise, our hypothesis was correct and the leader
741: node is indeed ${\cal G}$. Leader node information is implemented as a
742: pointer to the choice point of the newly computed leader node.
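
The leader computation just described can be sketched as follows. This is a minimal sketch under stated assumptions, not the engine's actual code: nodes are identified by hypothetical integer creation indices (a larger index means a younger node), dependency frames are kept in a youngest-first Python list, and the names \texttt{DepFrame}, \texttt{new\_consumer} and \texttt{is\_leader} are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class DepFrame:
    consumer: int   # creation index of the consumer choice point
    generator: int  # creation index of the consumer's generator
    leader: int     # creation index of the leader computed at creation time


def new_consumer(dep_frames, consumer, generator):
    """Compute the leader for a new consumer and push its dependency frame.

    dep_frames is the dependency list, youngest frame first.
    """
    leader = generator                      # hypothesis: the generator leads
    for frame in dep_frames:
        if frame.consumer < generator:      # older than the generator: stop
            break
        # a consumer between generator and new consumer depends on an
        # older generator: the oldest such generator becomes the leader
        leader = min(leader, frame.leader)
    frame = DepFrame(consumer, generator, leader)
    dep_frames.insert(0, frame)
    return frame


def is_leader(generator, dep_frames, youngest_tabled):
    """Test (a) youngest tabled node, or (b) youngest consumer names it."""
    if youngest_tabled == generator:
        return True
    return bool(dep_frames) and dep_frames[0].leader == generator
```

As a usage example, assuming the node subscripts used in the running example reflect creation order, pushing consumers $1$, $7$, $14$, $21$ and $25$ with generators $0$, $6$, $13$, $6$ and $13$ respectively yields leaders $0$, $6$, $13$, $6$ and $6$ in this sketch.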
743: 
744: Figure~\ref{fig_spotting_current_leader} uses the example from
745: Figure~\ref{fig_finite_SLG_tree} to illustrate the leader node
746: algorithm. For compactness, the figure presents calls to
747: \texttt{path(a,Z)}, \texttt{path(b,Z)}, \texttt{path(c,Z)} and
748: \texttt{arc(a,Z)}, as \texttt{pa}, \texttt{pb}, \texttt{pc}, and
749: \texttt{aa}, respectively. Figure~\ref{fig_spotting_current_leader}(a)
750: shows the initial configuration. The generator node ${\cal N}_0$ is
751: the current leader node because it is the only subgoal. Figure
752: \ref{fig_spotting_current_leader}(b) shows the dependency graph after
753: creating node ${\cal N}_2$. First, we called a variant of
754: \texttt{path(a,Z)}, and allocated the corresponding dependency frame.
755: As ${\cal N}_0$ is the generator node for the variant call
756: \texttt{path(a,Z)}, ${\cal N}_0$ is the leader node for ${\cal
757:   N}_1$. ${\cal N}_1$ then suspended, and we backtracked to ${\cal N}_0$
758: and called \texttt{arc(a,Z)}. As \texttt{arc(a,Z)} is not tabled, we
759: had to allocate an interior node for ${\cal N}_2$.
760: 
761: \begin{figure}[!ht]
762: \centerline{
763: \epsfxsize=12cm
764: \epsffile{spotting_leader.eps}
765: }
766: \caption{Spotting the current leader node.}
767: \label{fig_spotting_current_leader}
768: \end{figure}
769: 
770: Figure \ref{fig_spotting_current_leader}(c) shows the graph after we
771: created node ${\cal N}_{14}$. We have already created first and
772: variant calls to subgoals \texttt{path(b,Z)} and \texttt{path(c,Z)}.
773: Two new dependency frames were allocated and initialized. We thus have
774: three SCCs on stack: one per generator. The youngest SCC on stack is
775: for subgoal \texttt{path(c,Z)}. As a result, the current leader node
776: for the new set of nodes becomes ${\cal N}_{13}$. This is the one
777: referred to in the youngest dependency frame.
778: 
779: Figure \ref{fig_spotting_current_leader}(d) shows the interesting case
780: where tabled nodes exist between a consumer and its generator. In the
781: example, consumer node ${\cal N}_{21}$ has two consumers, ${\cal
782:   N}_7$ and ${\cal N}_{14}$, separating it from its generator, ${\cal
783:   N}_6$.  As neither consumer depends on nodes older than ${\cal
784:   N}_6$, the leader node for ${\cal N}_{21}$ is still ${\cal N}_6$,
785: and ${\cal N}_6$ becomes the current leader node. This situation
786: represents the point at which subgoal \texttt{path(c,Z)} starts
787: depending on subgoal \texttt{path(b,Z)} and their SCCs are merged
788: together. Next, we allocated consumer node ${\cal N}_{25}$. Nodes
789: ${\cal N}_{14}$ and ${\cal N}_{21}$ are between ${\cal N}_{25}$ and
790: the generator ${\cal N}_{13}$. Our algorithm says that since ${\cal
791:   N}_{21}$ depends on an older generator node, ${\cal N}_6$, the
792: leader node information for ${\cal N}_{25}$ is also ${\cal N}_6$. As a
793: result, ${\cal N}_6$ remains the current leader node.
794: 
795: Finally, Figure \ref{fig_spotting_current_leader}(e) shows the point
796: after the subgoals \texttt{path(b,Z)} and \texttt{path(c,Z)} have
797: completed and the segments belonging to their SCC have been
798: released. The computation switches back to ${\cal N}_1$, consumes the
799: next answer and calls \texttt{path(c,Z)}. At this point,
800: \texttt{path(c,Z)} is already completed, and thus we can avoid
801: consumer node allocation and instead perform what is called the
802: \emph{completed table optimization}~\cite{Sagonas-98}. This
803: optimization allocates a node, similar to an interior node, that will
804: consume the set of found answers executing compiled code directly from
805: the trie data structure associated with the completed
806: subgoal~\cite{Ramakrishnan-99}.
807: 
808: 
809: 
810: \subsection{Completion and Answer Resolution}
811: \label{section_completion_answer_resolution}
812: 
813: After backtracking to a leader node, we must check whether all younger
814: consumer nodes have consumed all their answers. To do so, we walk the
815: chain of dependency frames looking for a frame which has not yet
816: consumed all the generated answers. If there is such a frame, we
817: should resume the computation of the corresponding consumer node. We
818: do this by restoring the stack pointers and backtracking to the node.
819: Otherwise, we can perform completion. This includes \textbf{(i)}
820: marking as complete all the subgoals in the SCC; \textbf{(ii)}
821: deallocating all younger dependency frames; and \textbf{(iii)}
822: backtracking to the previous node to continue the execution.
823: 
824: Backtracking to a consumer node results in executing the answer
825: resolution operation. The operation first checks the table space for
826: unconsumed answers. If there are new answers, it loads the next
827: available answer and proceeds. Otherwise, it backtracks again. If this
828: is the first time that backtracking from that consumer node takes
829: place, then it is performed as usual. Otherwise, we know that the
830: computation has been resumed from an older generator node ${\cal G}$
831: during an unsuccessful completion operation. Therefore, backtracking
832: must be done to the next consumer node that has unconsumed answers and
833: that is younger than ${\cal G}$. If no such consumer node can be
834: found, backtracking must be done to the generator node ${\cal G}$.
835: 
836: The process of resuming a consumer node, consuming the available set
837: of answers, suspending and then resuming another consumer node can be
838: seen as an iterative process which repeats until a fixpoint is
839: reached. This fixpoint is reached when the SCC is completely
840: evaluated.
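
The fixpoint iteration above can be illustrated with a deliberately simplified sketch. It assumes a left-recursive \texttt{path/2} definition and a small hypothetical \texttt{arc/2} relation, models only consumer nodes (no stacks, freezing, or backtracking), and uses invented names throughout; it is not the engine's completion code.

```python
from dataclasses import dataclass


@dataclass
class DependencyFrame:
    name: str
    answers: list      # answers of the subgoal, in insertion order
    consumed: int = 0  # how many of them this consumer has consumed


def find_unconsumed(dep_frames):
    """Walk the dependency list looking for a consumer with pending answers."""
    for frame in dep_frames:
        if frame.consumed < len(frame.answers):
            return frame
    return None


def run_to_completion(dep_frames, resume):
    """Resume consumers until none has unconsumed answers (the fixpoint)."""
    while True:
        frame = find_unconsumed(dep_frames)
        if frame is None:
            return                       # completion can be performed
        while frame.consumed < len(frame.answers):
            answer = frame.answers[frame.consumed]
            frame.consumed += 1
            resume(frame, answer)        # may add new answers to the table


# Hypothetical facts, chosen only for this example.
ARCS = {"a": ["b"], "b": ["c"], "c": ["b"]}


def evaluate_path(start):
    """Answers Z for path(start, Z), assuming
    path(X,Z) :- path(X,Y), arc(Y,Z).   path(X,Z) :- arc(X,Z)."""
    answers = []

    def new_answer(z):
        if z not in answers:             # repeated answers are discarded
            answers.append(z)

    for z in ARCS.get(start, []):        # second clause, run by the generator
        new_answer(z)

    consumer = DependencyFrame("path(%s,Z)" % start, answers)

    def resume(frame, y):                # first clause, continued per answer
        for z in ARCS.get(y, []):
            new_answer(z)

    run_to_completion([consumer], resume)
    return answers
```

With the facts assumed here, \texttt{evaluate\_path("a")} terminates with the answers \texttt{b} and \texttt{c} even though the graph is cyclic, because a repeated answer adds nothing new for any consumer to resume on.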
841: 
842: 
843: 
844: \section{Or-Parallelism within Tabling}
845: 
846: The first step in our research was to design a model that would allow
847: concurrent execution of all available alternatives, be they from
848: generator, consumer or interior nodes. We researched two designs: the
849: TOP (Tabling within Or-Parallelism) model and the OPT (Or-Parallelism
850: within Tabling) model.
851: 
852: Parallelism in the TOP model is supported by considering that a
853: parallel evaluation is performed by a set of independent WAM engines,
854: each managing a unique branch of the search tree at a time. These
855: engines are extended to include direct support for the basic table
856: access operations, that allow the insertion of new subgoals and
857: answers. When exploiting parallelism, some branches may be
858: \emph{suspended}. Generator and interior nodes suspend alternatives
859: because we do not have enough processors to exploit them all. Consumer
860: nodes may also suspend because they are waiting for more answers.
861: Workers move in the search tree, looking for points where they can
862: exploit parallelism.
863: 
864: Parallel evaluation in the OPT model is done by a set of independent
865: tabling engines that \emph{may} share different common branches of the
866: search tree during execution. Each worker can be considered a
867: sequential tabling engine that fully implements the tabling
868: operations: access the table space to insert new subgoals or answers;
869: allocate data structures for the different types of nodes; suspend
870: tabled subgoals; resume subcomputations to consume newly found
871: answers; and complete private (not shared) subgoals. As most of the
872: computation time is spent in exploiting the search tree involved in a
873: tabled evaluation, we can say that tabling is the base component of
874: the system.
875: 
876: The or-parallel component of the system is triggered to allow
877: synchronized access to the shared parts of the execution tree, in
878: order to get new work when a worker runs out of alternatives to
879: exploit, and to perform completion of shared subgoals. Unexploited
880: alternatives should be made available for parallel execution,
881: regardless of whether they originate from generator, consumer or
882: interior nodes. From the viewpoint of SLG resolution, the OPT
883: computational model generalizes Warren's multi-sequential engine
884: framework for the exploitation of or-parallelism. Or-parallelism stems
885: from having several engines that implement SLG resolution, instead of
886: implementing Prolog's SLD resolution.
887: 
888: We have already seen that the SLG-WAM presents several opportunities
889: for parallelism. Figure~\ref{fig_opt_example} illustrates how this
890: parallelism can be specifically exploited in the OPT model. The
891: example assumes two workers, ${{\cal W}_1}$ and ${{\cal W}_2}$, and
892: the program code and query goal from Figure~\ref{fig_finite_SLG_tree}.
893: For simplicity, we use the same abbreviation introduced in
894: Figure~\ref{fig_spotting_current_leader} to denote the subgoals.
895: 
896: \begin{figure}[!ht]
897: \centerline{
898: \epsfxsize=10cm
899: \epsffile{opt_example.eps}
900: }
901: \caption{Exploiting parallelism in the OPT model.}
902: \label{fig_opt_example}
903: \end{figure}
904: 
905: Consider that worker ${\cal W}_1$ starts the evaluation. It first
906: allocates a generator and a consumer node for tabled subgoal
907: \texttt{path(a,Z)}. Because there are no available answers for
908: \texttt{path(a,Z)}, it backtracks. The next alternative leads to a
909: non-tabled subgoal \texttt{arc(a,Z)} for which we create an interior
910: node. The first alternative for \texttt{arc(a,Z)} succeeds with the
911: answer \texttt{Z=b}. The worker inserts the newly found answer in the
912: table and starts exploiting the next alternative for
913: \texttt{arc(a,Z)}. This is shown in the left sub-figure. At this
914: point, worker ${\cal W}_2$ requests work. Assume that worker
915: ${\cal W}_1$ decides to share all of its private nodes. The two
916: workers will share three nodes: the generator and consumer nodes for
917: \texttt{path(a,Z)}, and the interior node for \texttt{arc(a,Z)}.
918: Worker ${\cal W}_2$ takes the next unexploited alternative of
919: \texttt{arc(a,Z)} and from now on, either worker can find further
920: answers for \texttt{path(a,Z)} or resume the shared consumer node.
921: 
922: The OPT model offers two important advantages over the TOP model.
923: First, OPT reduces to a minimum the overlap between or-parallelism and
924: tabling. Namely, as the example shows, in OPT it is straightforward to
925: make nodes public only when we want to share them. This is very
926: important because execution of private nodes is almost as fast as
927: sequential execution. Second, OPT enables different data structures
928: for or-parallelism and for tabling. For instance, one can use the
929: SLG-WAM for tabling, and environment copying or binding arrays for
930: or-parallelism.
931: 
932: The question now is whether we can achieve an implementation of the
933: OPT model, and whether that implementation is \emph{efficient}. We
934: implemented OPTYap in order to answer this question. In OPTYap,
935: tabling is implemented by freezing the whole stacks when a consumer
936: blocks. Or-parallelism is implemented through copying of stacks. More
937: precisely, we optimize copying by using \emph{incremental copying},
938: where workers only copy the differences between their stacks. We
939: adopted this framework because environment copying and the SLG-WAM
940: are, respectively, two of the most successful or-parallel and tabling
941: engines. In our case, we already had the experience of implementing
942: environment copying in Yap Prolog, the YapOr system, with
943: excellent performance results~\cite{Rocha-99b}. Adopting YapOr for the
944: or-parallel component of the combined system was therefore our first
945: choice.
946: 
947: Regarding the tabling component, an alternative to freezing the stacks
948: is copying them to a separate storage as in CHAT~\cite{Demoen-00}. We
949: found two major problems with CHAT. First, to take best advantage of
950: CHAT we need to have separate environment and choice point stacks, but
951: Yap has an integrated local stack. Second, and more importantly, we
952: believe that CHAT is less suitable than the SLG-WAM for an efficient
953: extension to or-parallelism because of its incremental completion
954: technique. CHAT implements incremental completion through an
955: incremental copying mechanism that saves intermediate states of the
956: execution stacks up to the nearest generator node. This works fine for
957: sequential tabling, because leader nodes are always generator nodes.
958: However, as we will see, for parallel tabling this does not hold
959: because any public node can be a potential leader node. To preserve
960: incremental completion efficiency in a parallel tabling environment,
961: incremental saving should be performed up to the parent node, as
962: potentially it can be a leader node. Obviously, this node-to-node
963: segmentation of the incremental saving technique will degrade the
964: efficiency of any parallel system.
965: 
966: 
967: 
968: \section{The Or-Parallel Tabling Engine}
969: 
970: The OPT model requires changes to both the initial designs for
971: parallelism and tabling. As we enumerate next, supporting or-parallelism
972: plus tabling requires changes to memory allocation, table access, and the
973: completion algorithm. We must further ensure that environment copying
974: and tabling suspension do not interfere. Or-parallelism issues refer
975: to scheduling and to speculative work. In more detail:
976: 
977: \begin{enumerate}
978: \item We must support parallel memory allocation and deallocation of
979:   the several data structures we use. Fortunately, most of our data
980:   structures are fixed-sized and parallel memory allocation can be
981:   implemented efficiently.
982: \item We must allow for several workers to concurrently read and
983:   update the table. To do so workers need to be able to lock the
984:   table. As we shall see, finer locking allows for more parallelism,
985:   but coarser locking has lower overheads.
986: \item OPTYap uses the copying model, where workers do not see the
987:   whole search tree, but instead only the branches corresponding to
988:   their current SLG-WAM. It is thus possible that a generator may not
989:   be in the stacks for a consumer (and vice-versa). We show that one
990:   can generalize the concept of leader node for such cases, and that
991:   such a generalization still gives a conservative approximation for a
992:   SCC. Completion can thus be performed when we are the last worker
993:   backtracking to the generalized leader nodes, and there is no more
994:   work below. The first condition can be easily checked through the
995:   or-parallel machinery. The second condition uses the sequential
996:   tabling machinery.
997: \item Or-parallelism and tabling are not strictly orthogonal. More
998:   precisely, naively sharing or-parallel work might result in
999:   overwriting suspended stacks. Several approaches may be used to
1000:   tackle this problem; we have proposed and implemented a suspension
1001:   mechanism that gives maximum scheduling flexibility.
1002: \item Scheduling or-parallel work in our system is based on the Muse
1003:   scheduler~\cite{Ali-90b}. Intuitively this corresponds to a form of
1004:   hierarchical scheduling, where we favor tabled scheduling
1005:   operations, and resort to the more expensive or-parallel scheduling
1006:   when no tabling operations are available. Other approaches are
1007:   possible, but this one has served OPTYap well so far. We also
1008:   discuss how moving around the shared parts of the search tree
1009:   changes in the presence of parallelism.
1010: \item Last, we briefly discuss pruning issues. Although pruning in the
1011:   presence of tabling is a complex issue~\cite{Guo-02,Castro-03}, we
1012:   still should execute correctly for non-tabled regions of the search
1013:   tree (interior nodes).
1014: \end{enumerate}
1015: 
1016: We next discuss these issues in some detail, presenting the general
1017: execution framework.
1018: 
1019: 
1020: 
1021: \subsection{Memory Organization}
1022: 
1023: In OPTYap, memory is divided into a \emph{global} addressing space and
1024: a collection of \emph{local} spaces, as illustrated in
1025: Figure~\ref{fig_optyap_memory}. The global space includes the code
1026: area and a parallel data area that consists of all the data structures
1027: required to support concurrent execution. Each local space represents
1028: one system worker and it contains the four WAM execution stacks
1029: inherited from Yap: global stack, local stack, trail, and auxiliary
1030: stack.
1031: 
1032: \begin{figure}[!ht]
1033: \centerline{
1034: \epsfxsize=7cm
1035: \epsffile{optyap_memory.eps}
1036: }
1037: \caption{Memory organization in OPTYap.}
1038: \label{fig_optyap_memory}
1039: \end{figure}
1040: 
1041: The parallel data area includes the table and dependency spaces
1042: inherited from YapTab, and the \emph{or-frame space}~\cite{Ali-90b}
1043: inherited from YapOr to synchronize access to shared nodes.
1044: Additionally, we have an extra data structure to preserve the stacks
1045: of suspended SCCs (further details in section~\ref{section_scc}).
1046: Remember that we use specific extra fields in the choice points to
1047: access the data structures in the parallel data area. When sharing
1048: work, the execution stacks of the sharing worker are copied from its
1049: local space to the local space of the requesting worker. The data
1050: structures from the parallel data area associated with the shared
1051: stacks are automatically inherited by the requesting worker in the
1052: copied choice points.
1053: 
1054: The efficiency of a parallel system largely depends on how concurrent
1055: handling of shared data is achieved and synchronized. Page faults and
1056: memory cache misses are a major source of overhead regarding data
1057: access or update in parallel systems. OPTYap tries to avoid these
1058: overheads by adopting a page-based organization scheme to split memory
1059: among different data structures, in a way similar to Bonwick's Slab
1060: memory allocator~\cite{Bonwick-94}. Each memory page of the parallel
1061: data area only contains data structures of the same type. Whenever a
1062: new request for a data structure of type ${\cal T}$ appears, the next
1063: available structure on one of the ${\cal T}$ pages is returned. If
1064: there are no available structures in any ${\cal T}$ page, then one of
1065: the free pages is made to be of type ${\cal T}$. A page is freed when
1066: all its data structures are released. A free page can be immediately
1067: reassigned to a different structure type.
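
The page-based scheme can be sketched as below. This is a single-threaded illustration under stated assumptions: all structure types are given the same number of slots per page, handles are plain \texttt{(page, slot)} pairs, and the class and method names are hypothetical. A real parallel allocator would additionally need to synchronize access to these tables.

```python
class PageAllocator:
    def __init__(self, n_pages, slots_per_page):
        self.slots_per_page = slots_per_page
        self.free_pages = list(range(n_pages))  # pages with no assigned type
        self.page_type = {}    # page id -> structure type stored in the page
        self.free_slots = {}   # page id -> free slot indices in that page
        self.pages_of = {}     # structure type -> set of page ids

    def alloc(self, typ):
        """Return a (page, slot) handle for a structure of type `typ`."""
        for page in self.pages_of.get(typ, ()):
            if self.free_slots[page]:           # next available structure
                return (page, self.free_slots[page].pop())
        if not self.free_pages:
            raise MemoryError("no free pages left")
        page = self.free_pages.pop()            # make a free page of this type
        self.page_type[page] = typ
        self.free_slots[page] = list(range(self.slots_per_page))
        self.pages_of.setdefault(typ, set()).add(page)
        return (page, self.free_slots[page].pop())

    def free(self, handle):
        """Release a structure; free its page once the page is empty."""
        page, slot = handle
        self.free_slots[page].append(slot)
        if len(self.free_slots[page]) == self.slots_per_page:
            typ = self.page_type.pop(page)
            self.pages_of[typ].discard(page)
            del self.free_slots[page]
            self.free_pages.append(page)        # reusable by any type
```

Because every page holds structures of a single type, structures that are accessed together tend to sit close together in memory, and an emptied page can be handed to whichever structure type needs space next.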
1068: 
1069: 
1070: 
1071: \subsection{Concurrent Table Access}
1072: \label{section_concurrent_table_access}
1073: 
1074: Our experience showed that the table space is the major data area open
1075: to concurrent access operations in a parallel tabling environment. To
1076: maximize parallelism, whilst minimizing overheads, accessing and
1077: updating the table space must be carefully controlled. Reader/writer
1078: locks are the ideal implementation scheme for this purpose. In a
1079: nutshell, we can say that there are two critical issues that determine
1080: the efficiency of a locking scheme for the table. One is the
1081: \emph{lock duration}, that is, the amount of time a data structure is
1082: locked. The other is the \emph{lock grain}, that is, the amount of
1083: data structures that are protected through a single lock request. It
1084: is the balance between lock duration and lock grain that determines
1085: the efficiency of different table locking approaches. For instance, if
1086: the lock scheme is short duration or fine grained, then inserting many
1087: trie nodes in sequence, corresponding to a long trie path, may result
1088: in a large number of lock requests. On the other hand, if the lock
1089: scheme is long duration or coarse grain, then going through a trie
1090: path without extending or updating its trie structure, may
1091: unnecessarily lock data and prevent possible concurrent access by
1092: others.
1093: 
1094: Unfortunately, it is impossible to know beforehand which locking
1095: scheme would be optimal. Therefore, in OPTYap we experimented with
1096: four alternative locking schemes to deal with concurrent accesses to
1097: the table space data structures: the \emph{Table Lock at Entry Level}
1098: scheme, TLEL, the \emph{Table Lock at Node Level} scheme, TLNL, the
1099: \emph{Table Lock at Write Level} scheme, TLWL, and the \emph{Table
1100: Lock at Write Level - Allocate Before Check} scheme, TLWL-ABC.
1101: 
1102: The TLEL scheme essentially allows a single writer per subgoal trie
1103: structure and a single writer per answer trie structure. The main
1104: drawback of TLEL is the contention resulting from long lock
1105: duration. The TLNL scheme enables a single writer per chain of sibling nodes
1106: that represent alternative paths from a common parent node. The TLWL
1107: scheme is similar to TLNL in that it enables a single writer per chain
1108: of sibling nodes that represent alternative paths from a common parent
1109: node. However, in TLWL, the common parent node is only locked when
1110: writing to the table is likely. TLWL also avoids the TLNL memory usage
1111: problem by replacing trie node lock fields with a global array of lock
1112: entries. Last, the TLWL-ABC scheme anticipates the allocation
1113: and initialization of nodes that are likely to be inserted in the
1114: table space before locking.
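
To make the idea behind locking only when writing is likely more concrete, the following sketch shows one way such a check-then-lock insertion could be organized. It is an assumption-laden illustration rather than OPTYap's code: the search is first done without any lock, the parent is then locked through a hypothetical global lock array indexed by the parent's identity, and only nodes added to the chain in the meantime are re-checked before inserting. The names and the array size are invented, and the lock-free read relies on new nodes being fully initialized before they are linked at the head of the chain.

```python
import threading

N_LOCKS = 64  # assumed size of the global lock array
LOCKS = [threading.Lock() for _ in range(N_LOCKS)]


class Node:
    def __init__(self, token):
        self.token = token
        self.child = None     # head of the chain of children
        self.sibling = None   # next alternative under the same parent


def find(first, token, stop=None):
    """Search the sibling chain from `first` up to (not including) `stop`."""
    node = first
    while node is not stop:
        if node.token == token:
            return node
        node = node.sibling
    return None


def check_insert(parent, token):
    """Return the child of `parent` for `token`, inserting it if needed."""
    first = parent.child                  # chain head seen without locking
    node = find(first, token)
    if node is not None:
        return node                       # repeated term: no lock taken
    with LOCKS[id(parent) % N_LOCKS]:     # writing is now likely: lock parent
        # other workers may have inserted at the head since our search;
        # only those newer nodes need to be checked
        node = find(parent.child, token, stop=first)
        if node is None:
            node = Node(token)
            node.sibling = parent.child   # initialize before publishing
            parent.child = node
        return node
```

In this sketch a repeated answer is recognized without any locking operation, which is the behaviour that distinguishes write-level locking from schemes that lock on every traversal.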
1115: 
1116: Through experimentation, we observed that the locking schemes, TLWL
1117: and TLWL-ABC, present the best speedup ratios and they are the only
1118: schemes showing scalability. Since neither of these two schemes clearly
1119: outperforms the other, we adopted TLWL as the default. The observed
1120: slowdown with a higher number of workers for the TLEL and TLNL schemes is
1121: mainly due to their locking of the table space even when writing is
1122: not likely. In particular, for repeated answers they pay the cost of
1123: performing locking operations without inserting any new trie node. For
1124: these schemes the number of potential contention points is
1125: proportional to the number of answers found during execution, being
1126: they unique or redundant.
1127: 
1128: 
1129: 
1130: \subsection{Leader Nodes}
1131: \label{section_leader_nodes}
1132: 
1133: Or-parallel systems execute alternatives early. As a result, different
1134: workers may execute the generator and the consumer subgoals. In fact,
1135: it is possible that generators will execute earlier, and in a
1136: different branch than in sequential execution. As
1137: Figure~\ref{fig_guess_leader} shows, this may induce complex
1138: dependencies between workers, therefore requiring a more elaborate
1139: completion algorithm that may involve branches from several workers.
1140: 
1141: \begin{figure}[!ht]
1142: \centerline{
1143: \epsfxsize=7cm
1144: \epsffile{guess_leader.eps}
1145: }
1146: \caption{At which node should we check for completion?}
1147: \label{fig_guess_leader}
1148: \end{figure}
1149: 
1150: In this example, worker ${\cal W}_1$ takes the leftmost alternative
1151: while worker ${\cal W}_2$ takes the rightmost from the youngest common
1152: node. While exploiting their alternatives, ${\cal W}_1$ calls a tabled
1153: subgoal \texttt{a} and ${\cal W}_2$ calls a tabled subgoal \texttt{b}.
1154: As this is the first call to both subgoals, a generator node is stored
for each one. Next, each worker calls the tabled subgoal first
called by the other, and two consumer nodes, one per worker, are
therefore allocated. At this point both workers hold a consumer node
while not having the corresponding generator node in their branches.
Conversely, the owner of each generator node has consumer nodes being
executed by a different worker. The question is: where should we check
for completion? Intuitively, we would like to choose a node that is
common to both branches, and the youngest common node seems the best
choice. But that node is not a generator node!
1164: 
1165: We could avoid this problem by disallowing consumer nodes for
1166: generator nodes on other branches. Unfortunately, such a solution
would severely restrict parallelism. Our solution was therefore to
\emph{allow completion at all kinds of public nodes}.
1169: 
To clarify these new situations we introduce a new concept, the
\emph{Generator Dependency Node} (or \emph{GDN}). Its purpose is to
signal the nodes that are candidates to be leader nodes, therefore
playing a role similar to that of the generator nodes in
sequential tabling. A GDN is calculated whenever a new consumer node,
say ${\cal C}$, is created. We define the GDN ${\cal D}$ for a
consumer node ${\cal C}$ with generator ${\cal G}$ to be \emph{the
1177:   youngest node on ${\cal C}$'s current branch that is an ancestor of
1178:   ${\cal G}$}. Obviously, if ${\cal G}$ belongs to the current branch
1179: of ${\cal C}$ then ${\cal G}$ must be the GDN. Thus GDN reduces to
1180: leader node for sequential computations. On the other hand, if the
1181: worker allocating ${\cal C}$ is not the one that allocated ${\cal G}$
1182: then the youngest node ${\cal D}$ is a public node, but not
1183: necessarily ${\cal G}$. Figure~\ref{fig_public_generator_dependency}
1184: presents three different situations that better illustrate the GDN
1185: concept. ${\cal W_G}$ is always the worker that allocated the
1186: generator node ${\cal G}$, and ${\cal W_C}$ is the worker that is
1187: allocating a consumer node ${\cal C}$.
1188: 
1189: \begin{figure}[!ht]
1190: \centerline{
1191: \epsfxsize=11cm
1192: \epsffile{public_generator_dependency.eps}
1193: }
1194: \caption{Spotting the generator dependency node.}
1195: \label{fig_public_generator_dependency}
1196: \end{figure}
1197: 
1198: In situation (a), the generator node ${\cal G}$ is on the branch of
1199: the consumer node ${\cal C}$, and thus, ${\cal G}$ is the GDN. In
1200: situation (b), nodes ${\cal N}_1$ and ${\cal N}_2$ are on the branch
1201: of ${\cal C}$ and both contain a branch leading to the generator
1202: ${\cal G}$. As ${\cal N}_2$ is the youngest node of the two, it is the
1203: GDN. Situation (c) differs from (b) in that the public nodes represent
1204: more than one branch and, in this case, are interleaved in the
1205: physical stack. In this situation, ${\cal N}_1$ is the unique node
1206: that belongs to ${\cal C}$'s branch and that also contains ${\cal G}$
1207: in a branch below. ${\cal N}_2$ contains ${\cal G}$ in a branch below,
1208: but it is not on ${\cal C}$'s branch, while ${\cal N}_3$ is on ${\cal
1209:   C}$'s branch, but it does not contain ${\cal G}$ in a branch below.
1210: Therefore, ${\cal N}_1$ is the GDN. Notice that in both cases (b) and
1211: (c) the GDN can be a generator, a consumer or an interior node.
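
The definition can be made concrete with a small sketch. It assumes a
simplified search tree in which every node only records its parent;
the class \texttt{Node} and the functions \texttt{branch} and
\texttt{gdn} are hypothetical names, and the physical stack
interleaving of situation (c) is not modelled.

\begin{verbatim}
```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def branch(node):
    """Yield the nodes of the branch ending at node, youngest first."""
    while node is not None:
        yield node
        node = node.parent


def gdn(consumer, generator):
    """Youngest node on the consumer's branch that is the generator
    itself or one of its ancestors."""
    on_branch = set(branch(consumer))
    for candidate in branch(generator):
        if candidate in on_branch:
            return candidate
    return None  # not reached when both nodes share a root


# Situation (a): the generator is on the consumer's own branch.
root = Node("root")
g_a = Node("G", root)
c_a = Node("C", g_a)

# Situation (b): N2 is the youngest node on C's branch leading to G.
n1 = Node("N1", root)
n2 = Node("N2", n1)
g_b = Node("G", n2)  # allocated by the generator's worker
c_b = Node("C", n2)  # allocated by the consumer's worker
```
\end{verbatim}

Under these assumptions \texttt{gdn(c\_a, g\_a)} yields the generator
itself, while \texttt{gdn(c\_b, g\_b)} yields the interior node
\texttt{n2}.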
1212: 
1213: The procedure that computes the leader node information when
1214: allocating a new dependency frame now relies on the GDN
1215: concept. Remember that it is through this information that a node can
1216: determine whether it is a leader node. The main difference from the
1217: sequential algorithm is that now we first hypothesize that the leader
node for the consumer node at hand is its GDN, and not its generator
1219: node. Then, we check the consumer nodes younger than the newly found
1220: GDN for an older dependency. Note that as soon as an older dependency
1221: ${\cal D}$ is found in a consumer node ${\cal C'}$, the remaining
1222: consumer nodes, older than ${\cal C'}$ but younger than the GDN, do
1223: not need to be checked. This is safe because the previous computation
1224: of the leader node information for the consumer node ${\cal C'}$
1225: already represents the oldest dependency that includes the remaining
consumer nodes. We next give an argument for the correctness of the
algorithm.
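
The procedure can be sketched as follows. The sketch assumes that
node age is represented by depth on the worker's current branch (a
smaller depth means an older node) and that the dependency frames are
kept youngest first; \texttt{DependencyFrame},
\texttt{compute\_leader} and \texttt{allocate\_frame} are hypothetical
names rather than OPTYap identifiers.

\begin{verbatim}
```python
from collections import namedtuple

# Depths grow from the root: a smaller depth means an older node.
DependencyFrame = namedtuple("DependencyFrame",
                             "consumer_depth leader_depth")


def compute_leader(frames, gdn_depth):
    """frames: the worker's dependency frames, youngest first."""
    leader = gdn_depth  # first hypothesis: the GDN is the leader
    for frame in frames:
        if frame.consumer_depth <= gdn_depth:
            break  # remaining consumers are not younger than the GDN
        if frame.leader_depth < leader:
            leader = frame.leader_depth
            break  # older consumers need not be checked
    return leader


def allocate_frame(frames, consumer_depth, gdn_depth):
    """Push the frame for a new consumer and return its leader depth."""
    leader = compute_leader(frames, gdn_depth)
    frames.insert(0, DependencyFrame(consumer_depth, leader))
    return leader
```
\end{verbatim}

The early \texttt{break} after the first older dependency corresponds
to the observation that the frame in which it is found already
records the oldest dependency covering the remaining consumer nodes.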
1228: 
1229: Consider a consumer node with GDN ${\cal G}$ and assume that its
1230: leader node ${\cal D}$ is found in the dependency frame for consumer
1231: node ${\cal C}$. Now hypothesize that there is a consumer node ${\cal
1232:   N}$ younger than ${\cal G}$ with a reference ${\cal D'}$ older than
1233: ${\cal D}$. Therefore, when previously computing the leader node for
1234: ${\cal C}$ one of the following situations occurred: \textbf{(i)}
1235: ${\cal D}$ is the GDN for ${\cal C}$ or \textbf{(ii)} ${\cal D}$ was
1236: found in a dependency frame for a consumer node ${\cal C'}$. Situation
1237: \textbf{(i)} is not possible because ${\cal N}$ is younger than ${\cal
1238:   D}$ and it holds a reference older than ${\cal D}$. Regarding
1239: situation \textbf{(ii)}, ${\cal C'}$ is necessarily younger than
${\cal N}$, as otherwise the reference found for ${\cal C}$ would have been
1241: ${\cal D'}$. By recursively applying the previous argument to the
1242: computation of the leader node for ${\cal C'}$ we conclude that our
1243: initial hypothesis cannot hold because the number of nodes between
1244: ${\cal C}$ and ${\cal N}$ is finite.
1245: 
1246: With this scheme, concurrency is not a problem. Each worker views its
1247: own leader node independently from the execution being done by
1248: others. A new consumer node is always a private node and a new
1249: dependency frame is always the youngest dependency frame for a
worker. The leader information stored in a dependency frame denotes
the resulting leader node at the time the corresponding consumer node
was allocated. Thus, once computed, this information remains
unchanged. If the leader changes when a new consumer node is
allocated, the new leader information is only stored in the dependency
frame for the new consumer, therefore not influencing others. Observe,
for example, the situation from Figure~\ref{fig_dependency_frames}.
Two workers, ${\cal W}_1$ and ${\cal W}_2$, exploiting different
alternatives from a common public node, ${\cal N}_4$, are allocating
new private consumer nodes. They compute the leader node information
for the new dependency frames without requiring any explicit
communication between them and without requiring any synchronization
when consulting the common dependency frame for node ${\cal N}_4$. The
1263: resulting dependency chain for each worker is illustrated on each side
1264: of the figure. Note that the dependency frame for consumer node ${\cal
1265: N}_4$ is common to both workers. It is illustrated twice only for
1266: simplicity.
1267: 
1268: \begin{figure}[!ht]
1269: \centerline{
1270: \epsfxsize=9cm
1271: \epsffile{dependency_frames.eps}
1272: }
1273: \caption{Dependency frames in the parallel environment.}
1274: \label{fig_dependency_frames}
1275: \end{figure}
1276: 
1277: Within this scenario, worker ${\cal W}_1$ will check for completion at
1278: node ${\cal N}_1$, its current leader node, and worker ${\cal W}_2$
1279: will check for completion at node ${\cal N}_2$. Obviously, ${\cal
1280:   W}_2$ cannot perform completion when reaching ${\cal N}_2$. If
1281: ${\cal W}_1$ finds new answers for subgoal \texttt{c}, they should be
1282: consumed in node ${\cal N}_6$. Moreover, as ${\cal W}_1$ has a
dependency on an older node, ${\cal N}_1$, the SCCs from both workers
1284: should only be completed together at node ${\cal N}_1$. However,
1285: ${\cal W}_1$ can allocate another consumer node that changes its
1286: current leader node. Therefore, ${\cal W}_2$ cannot know beforehand
1287: the leader where both SCCs should be completed. Determining the leader
1288: node where several dependent SCCs from different workers may be
1289: completed together is the problem that we address next.
1290: 
1291: 
1292: 
1293: \subsection{SCC Suspension}
1294: \label{section_scc}
1295: 
1296: Different paths may be followed when a worker ${\cal W}$ reaches a
1297: leader node for a SCC ${\cal S}$. The simplest case is when the node
1298: is private. In this case, we proceed as for sequential tabling.
1299: Otherwise, the node is public, and other workers can still influence
1300: ${\cal S}$. For instance, these workers may find new answers for a
1301: consumer node in ${\cal S}$, in which case the consumer must be
1302: resumed to consume the new answers. Clearly, in such cases, ${\cal W}$
1303: should not complete. On the other hand, ${\cal W}$ has tried all
1304: available alternatives and would like to move anywhere in the tree,
1305: say to node ${\cal N}$, to try other work. According to the copying
1306: model we use for or-parallelism, we should backtrack to the youngest
1307: node common to ${\cal N}$'s branch, that is, we should reset our
1308: stacks to the values of the common node. According to the freezing
1309: model that we use for tabling, we cannot recover the current consumers
1310: because they are frozen. We thus have a contradiction.
1311: 
1312: Note that this is the only case where or-parallelism and tabling
1313: conflict. One solution would be to disallow movement in this case.
1314: Unfortunately, we would again severely restrict parallelism. As a
1315: result, in order to allow ${\cal W}$ to continue execution it becomes
1316: necessary to \emph{suspend the SCC} at hand. Suspending a SCC includes
1317: saving the SCC's stacks to a proper space, leaving in the leader node
1318: a reference to the suspended SCC. These suspended computations are
considered again when the remaining workers perform completion.
1320: 
1321: In order to find out which suspended SCCs need to be resumed, each
1322: worker maintains a list of nodes with suspended SCCs. The last worker
1323: backtracking from a public node ${\cal N}$ checks if it holds
1324: references to suspended SCCs. If so, then ${\cal N}$ is included in
1325: the worker's list of nodes with suspended SCCs (the nodes are linked
in stack order). If the node already belongs to another worker's list,
1327: it is not collected.
1328: 
1329: A suspended SCC should be resumed if it contains consumer nodes with
1330: unconsumed answers. To resume a suspended SCC a worker needs to copy
1331: the saved stacks to the correct position in its own stacks, and thus,
1332: it has to suspend its current SCC first. Figure~\ref{fig_resuming_scc}
1333: illustrates the management of suspended SCCs when searching for SCCs
1334: to resume. It considers a worker ${\cal W}$, positioned in the leader
1335: node ${\cal N}_1$ of its current SCC ${\cal S}_1$. ${\cal W}$ consults
1336: its list of nodes with suspended SCCs, and starts checking the
1337: suspended SCC ${\cal S}_4$ for unconsumed answers. Assuming that
1338: ${\cal S}_4$ does not contain unconsumed answers, the search continues
1339: in the next node in the list. Here, suppose that SCC ${\cal S}_2$ does
1340: not have consumer nodes with unconsumed answers, but SCC ${\cal S}_3$
does. The current SCC ${\cal S}_1$ is then suspended, and only then is
${\cal S}_3$ resumed.
1343: 
1344: \begin{figure}[!ht]
1345: \centerline{
1346: \epsfxsize=12cm
1347: \epsffile{resuming_scc.eps}
1348: }
1349: \caption{Resuming a suspended SCC.}
1350: \label{fig_resuming_scc}
1351: \end{figure}
1352: 
Notice that node ${\cal N}_3$ was removed from ${\cal W}$'s list of
nodes with suspended SCCs because ${\cal S}_3$ may not include ${\cal
N}_3$ in its stack segments. For simplicity and efficiency, instead of
checking ${\cal S}_3$'s segments, we simply remove ${\cal N}_3$ from
${\cal W}$'s list. Note that this is a safe decision as a SCC only
depends on branches below the leader node. Thus, if ${\cal S}_3$ does
not include ${\cal N}_3$ then no new answers can be found for ${\cal
S}_4$'s consumer nodes. Otherwise, ${\cal W}$ or other workers can
eventually be scheduled to a node held
1362: by ${\cal S}_4$ and find new answers for at least one of its consumer
1363: nodes. In this case, when failing, these workers will necessarily
1364: backtrack through ${\cal N}_3$, ${\cal S}_4$'s leader. Therefore, the
1365: last worker backtracking from ${\cal N}_3$ will collect it for its own
1366: list, which allows ${\cal S}_4$ to be later resumed when executing
1367: completion in an older leader node.
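
The search for a suspended SCC to resume can be sketched as below,
under simplifying assumptions: a consumer is reduced to two counters,
and the worker's list is a plain list of pairs. The classes
\texttt{Consumer} and \texttt{SuspendedSCC} and the function
\texttt{find\_scc\_to\_resume} are hypothetical names. Saving and
copying of stack segments is not modelled.

\begin{verbatim}
```python
class Consumer:
    def __init__(self, answers_in_table, answers_consumed):
        self.answers_in_table = answers_in_table
        self.answers_consumed = answers_consumed

    def has_unconsumed_answers(self):
        return self.answers_consumed < self.answers_in_table


class SuspendedSCC:
    def __init__(self, name, consumers):
        self.name = name
        self.consumers = consumers

    def should_resume(self):
        return any(c.has_unconsumed_answers() for c in self.consumers)


def find_scc_to_resume(nodes_with_suspended_sccs):
    """nodes_with_suspended_sccs: list of (node name, [SuspendedSCC]),
    in the order kept by the worker. Returns (node name, SCC) or None."""
    for node, sccs in nodes_with_suspended_sccs:
        for scc in sccs:
            if scc.should_resume():
                return node, scc
    return None
```
\end{verbatim}

For a list that mirrors the example above, with ${\cal S}_4$ held by
the first node and ${\cal S}_2$ and ${\cal S}_3$ held by the next one,
the function skips ${\cal S}_4$ and ${\cal S}_2$ and returns ${\cal
S}_3$. The caller is then expected to suspend its current SCC before
resuming the returned one.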
1368: 
1369: 
1370: 
1371: \subsection{The Flow of Control}
1372: 
1373: Actual execution control of a parallel tabled evaluation mainly flows
1374: through four procedures. The process of completely evaluating SCCs is
1375: accomplished by the \texttt{completion()} and
1376: \texttt{answer\_resolution()} procedures, while parallel
1377: synchronization is achieved by the \texttt{getwork()} and
1378: \texttt{scheduler()} procedures. Here we focus on the execution in
engine mode, that is, on the \texttt{completion()},
1380: \texttt{answer\_resolution()} and \texttt{getwork()} procedures, and
1381: leave scheduling for the following section.
1382: Figure~\ref{fig_control_flow} presents a general overview of how
1383: control flows between the three procedures and how it flows within
1384: each procedure.
1385: 
1386: \begin{figure}[!ht]
1387: \centerline{
1388: \epsfxsize=12cm
1389: \epsffile{control_flow.eps}
1390: }
1391: \caption{The flow of control in a parallel tabled evaluation.}
1392: \label{fig_control_flow}
1393: \end{figure}
1394: 
1395: A novel completion procedure, \texttt{public\_completion()},
1396: implements completion detection for public leader nodes. As for
1397: private nodes, whenever a public node finds that it is a leader, it
1398: starts to check for younger consumer nodes with unconsumed answers. If
there is such a node, the computation is resumed at it. Otherwise, it
1400: checks for suspended SCCs with unconsumed answers. Remember that to
1401: resume a suspended SCC a worker needs to suspend its current SCC
1402: first.
1403: 
1404: We thus adopted the strategy of resuming suspended SCCs \emph{only
1405:   when the worker finds itself at a leader node}, since this is a
1406: decision point where the worker either completes or suspends the
1407: current SCC. Hence, if the worker resumes a suspended SCC it does not
introduce further dependencies. This would not be the case if the worker
resumed a suspended SCC ${\cal R}$ as soon as it reached the node
where ${\cal R}$ had been suspended. In that situation, the worker would have to
1411: suspend its current SCC ${\cal S}$, and after resuming ${\cal R}$ it
1412: would probably have to also resume ${\cal S}$ to continue its
1413: execution. A first disadvantage is that the worker would have to make
1414: more suspensions and resumptions. Moreover, if we resume earlier,
1415: ${\cal R}$ may include consumer nodes with unconsumed answers that are
1416: common with ${\cal S}$. More importantly, suspending in non-leader
1417: nodes leads to further complexity that can be very difficult to
1418: manage.
1419: 
1420: A SCC ${\cal S}$ is completely evaluated when \textbf{(i)} there are
1421: no unconsumed answers in any consumer node belonging to ${\cal S}$ or
1422: in any consumer node within a SCC suspended in a node belonging to
${\cal S}$; and \textbf{(ii)} there are no other representations of
the leader node ${\cal N}$ in the computational environment, whether
in the execution stacks of a worker or in the suspended stack
segments of a SCC. Completing a SCC includes
1427: \textbf{(i)} marking all dependent subgoals as complete; \textbf{(ii)}
1428: releasing the frames belonging to the complete branches, including the
1429: branches in suspended SCCs; \textbf{(iii)} releasing the frozen stacks
1430: and the memory space used to hold the stacks from suspended SCCs; and
1431: \textbf{(iv)} readjusting the freeze registers and the whole set of
1432: stack and frame pointers.
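
The decisions taken at a public leader node can be summarised by the
following sketch. It is a decision table only: the function name
mirrors \texttt{public\_completion()} but its parameters and return
values are hypothetical, and the choice to suspend the current SCC
when other representations of the leader node still exist is our
reading of the conditions above, not a transcription of the engine.

\begin{verbatim}
```python
def public_completion(younger_consumers, suspended_sccs,
                      other_representations):
    """Decide what a worker does at a public leader node.

    younger_consumers: booleans, True if that consumer (younger than
        the leader) has unconsumed answers.
    suspended_sccs: booleans, True if that SCC, suspended in a node of
        the current SCC, holds consumers with unconsumed answers.
    other_representations: number of other workers or suspended stack
        segments that still represent the leader node.
    """
    if any(younger_consumers):
        return "resume consumer"
    if any(suspended_sccs):
        return "suspend current SCC, then resume suspended SCC"
    if other_representations > 0:
        return "suspend current SCC"  # others may still influence it
    return "complete SCC"
```
\end{verbatim}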
1433: 
1434: The answer resolution operation for the parallel environment
1435: essentially uses the same algorithm as previously described for
1436: private nodes (please refer to
1437: section~\ref{section_completion_answer_resolution}). Initially, the
1438: procedure checks for unconsumed answers to be loaded for execution. If
1439: we have answers, execution will jump to them. Otherwise, we schedule
1440: for a backtracking node. If this is not the first time that
1441: backtracking from that consumer node takes place, we know that the
1442: computation has been resumed from an older leader node ${\cal L}$
1443: during an unsuccessful completion operation. ${\cal L}$ is thus the
oldest node to which we can backtrack. Backtracking must be done to
1445: the next consumer node that has unconsumed answers and that is younger
1446: than ${\cal L}$. Otherwise, if there are no such consumer nodes,
1447: backtracking must be done to ${\cal L}$.
1448: 
1449: The \texttt{getwork()} procedure contributes to the progress of a
1450: parallel tabled evaluation by moving to effective work. The usual way
1451: to execute \texttt{getwork()} is through failure to the youngest
1452: public node on the current branch. We can distinguish two main
1453: procedures in \texttt{getwork()}. One detects completion points and
1454: therefore makes the computation flow to the
1455: \texttt{public\_completion()} procedure. The other corresponds to
1456: or-parallel execution. It synchronizes to check for available
1457: alternatives and executes the next one, if any. Otherwise, it invokes
the scheduler. A completion point is detected when the public node
${\cal N}$ being reached is the leader node pointed to by the youngest
dependency frame. The exception is if ${\cal N}$ is itself a generator
node for a consumer node within the current SCC and it contains
unexploited alternatives. In such cases, the current SCC is not fully
exploited. Hence, we should first exploit the available alternatives,
and only then invoke completion.
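
A minimal sketch of this dispatch is given below. The three boolean
parameters and the returned strings are hypothetical; only the name
\texttt{getwork} and the conditions come from the description above.

\begin{verbatim}
```python
def getwork(is_leader, is_generator_in_scc, has_alternatives):
    """Decide the next step when failing to the youngest public node.

    is_leader: the node is the leader pointed to by the youngest
        dependency frame.
    is_generator_in_scc: the node is a generator for a consumer within
        the current SCC.
    has_alternatives: the node still holds unexploited alternatives.
    """
    exception = is_generator_in_scc and has_alternatives
    if is_leader and not exception:
        return "public_completion"
    if has_alternatives:
        return "execute next alternative"
    return "scheduler"
```
\end{verbatim}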
1465: 
1466: 
1467: 
1468: \subsection{Scheduling Work}
1469: 
1470: Scheduling work is the scheduler's task. It is about efficiently
1471: distributing the available work for exploitation between the running
1472: workers. In a parallel tabling environment we have the extra
1473: constraint of keeping the correctness of sequential tabling semantics.
A worker enters scheduling mode when it runs out of work and
1475: returns to execution whenever a new piece of unexploited work is
1476: assigned to it by the scheduler.
1477: 
1478: The scheduler for the OPTYap engine is mainly based on YapOr's
1479: scheduler. All the scheduler strategies implemented for YapOr were
1480: used in OPTYap. However, extensions were introduced in order to
1481: preserve the correctness of tabling semantics. These extensions allow
1482: support for leader nodes, frozen stack segments, and suspended
1483: SCCs. The OPTYap model was designed to enclose the computation within
a SCC until the SCC is suspended or completely evaluated. Thus,
1485: OPTYap introduces the constraint that the \emph{computation cannot
1486: flow outside the current SCC, and workers cannot be scheduled to
1487: execute at nodes older than their current leader node}. Therefore,
1488: when scheduling for the nearest node with unexploited alternatives, if
1489: it is found that the current leader node is younger than the potential
1490: nearest node with unexploited alternatives, then the current leader
1491: node is the node scheduled to proceed with the evaluation.
1492: 
The next case is when the search for the nearest node
with unexploited alternatives does not return any node from which to
proceed. The scheduler then starts searching for busy\footnote{A
worker is said to be busy when it is in engine mode exploiting
alternatives. A worker is said to be idle when it is in scheduling
mode searching for work.} workers that can be asked for work. If
1499: such a worker ${\cal B}$ is found, then the requesting worker moves up
1500: to the youngest node that is common to ${\cal B}$, in order to become
1501: partially consistent with part of ${\cal B}$. Otherwise, no busy
1502: worker was found, and the scheduler moves the idle worker to a better
1503: position in the search tree. Therefore, we can enumerate three
1504: different situations for a worker to move up to a node ${\cal N}$:
1505: \textbf{(i)} ${\cal N}$ is the nearest node with unexploited
1506: alternatives; \textbf{(ii)} ${\cal N}$ is the youngest node common
1507: with the busy worker we found; or \textbf{(iii)} ${\cal N}$
1508: corresponds to a better position in the search tree.
1509: 
1510: The process of moving up in the search tree from a current node ${\cal
1511: N}_0$ to a target node ${\cal N}_f$ is mainly implemented by the
1512: \texttt{move\_up\_one\_node()} procedure. This procedure is invoked
1513: for each node that has to be traversed until reaching ${\cal
1514: N}_f$. The presence of frozen stack segments or the presence of
1515: suspended SCCs in the nodes being traversed influences and can even
1516: abort the usual moving up process.
1517: 
1518: Assume that the idle worker ${\cal W}$ is currently positioned at
1519: ${{\cal N}}_i$ and that it wants to move up one node. Initially, the
1520: procedure checks for frozen nodes on the stack to infer whether ${\cal
1521: W}$ is moving within a SCC. If so, ${\cal W}$ simply moves up. The
1522: interesting case is when ${\cal W}$ is not within a SCC. If ${{\cal
1523: N}}_i$ holds a suspended SCC, then ${\cal W}$ can safely resume it. If
1524: resumption does not take place, the procedure proceeds to check
1525: whether ${\cal W}$ holds the unique representation of ${\cal
1526: N}_i$. This being the case, the suspended SCCs in ${\cal N}_i$ can be
1527: completed. Completion can be safely performed over the suspended SCCs
1528: in ${\cal N}_i$ not only because the SCCs are completely evaluated, as
1529: none was previously resumed, but also because no more dependencies
1530: exist, as there are no other branches below ${\cal N}_i$. Moreover, if
${\cal N}_i$ is a generator node then its corresponding subgoal can
also be marked as completed. Otherwise, ${\cal W}$ simply moves up.
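
The cases above can be laid out as a small sketch. The function name
follows \texttt{move\_up\_one\_node()}, but its parameters and the
action strings are hypothetical. Two points are assumptions on our
part: a suspended SCC is resumed only when it has unconsumed answers,
and the worker still moves up after completing the suspended SCCs
held by the node.

\begin{verbatim}
```python
def move_up_one_node(within_scc, suspended_sccs,
                     unique_representation, is_generator):
    """Return the list of actions an idle worker takes at node N_i.

    within_scc: frozen nodes on the stack show that the worker is
        moving inside a SCC.
    suspended_sccs: booleans, one per SCC suspended in N_i, True when
        the SCC has consumers with unconsumed answers.
    unique_representation: the worker holds the only representation
        of N_i.
    is_generator: N_i is a generator node.
    """
    if within_scc:
        return ["move up"]
    if any(suspended_sccs):
        return ["resume suspended SCC"]  # the usual move is aborted
    actions = []
    if unique_representation:
        if suspended_sccs:
            actions.append("complete suspended SCCs")
        if is_generator:
            actions.append("mark subgoal as completed")
    actions.append("move up")
    return actions
```
\end{verbatim}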
1533: 
The scheduler extensions described are mainly related to tabling
support. As the scheduling strategies inherited from YapOr's
scheduler were designed for an or-parallel model, and not for an
1537: or-parallel tabling model, further work is still needed to implement
1538: and experiment with proper scheduling strategies that can take
1539: advantage of the parallel tabling environment.
1540: 
1541: 
1542: 
1543: \subsection{Speculative Work}
1544: 
1545: In~\cite{Ciepielewski-91}, Ciepielewski defines speculative work as
1546: \emph{work which would not be done in a system with one
1547: processor}. The definition clearly shows that speculative work is an
1548: implementation problem for parallelism and it must be addressed
1549: carefully in order to reduce its impact. The presence of pruning
1550: operators during or-parallel execution introduces the problem of
1551: speculative work~\cite{Hausman-PhD,Ali-92a,Beaumont-93}. Prolog has an
1552: explicit pruning operator, the \emph{cut} operator. When a computation
1553: executes a cut operation, all branches to the right of the cut are
1554: pruned. Computations that can potentially be pruned are thus
1555: \emph{speculative}. Earlier execution of such computations may result
1556: in wasted effort compared to sequential execution.
1557: 
1558: In parallel tabling, not only the answers found for the query goal may
1559: not be valid, but also answers found for tabled predicates may be
1560: invalidated. The problem here is even more serious because tabled
1561: answers can be consumed elsewhere in the tree, which makes
1562: impracticable any late attempt to prune computations resulting from
1563: the consumption of invalid tabled answers. Indeed, consuming invalid
1564: tabled answers may result in finding more invalid answers for the same
1565: or other tabled predicates. Notice that finding and consuming answers
1566: is the natural way to get a tabled computation going forward. Delaying
1567: the consumption of answers may compromise such flow. Therefore, tabled
1568: answers should be released as soon as it is found that they are safe
from being pruned. Whereas for all-solution queries the requirement is
that, at the end of the execution, we will have the set of valid
answers, in tabling the requirement is to have the set of valid tabled
answers released as soon as possible.
1573: 
1574: Currently, OPTYap implements an extension of the cut scheme proposed
1575: by Ali and Karlsson~\cite{Ali-92a}, that prunes useless work as early
1576: as possible, by optimizing the delivery of tabled answers as soon as
1577: it is found that they are safe from being pruned~\cite{Rocha-PhD}. As
1578: cut semantics for operations that prune tabled nodes is still an open
1579: problem, OPTYap does not handle cut operations that prune tabled nodes
1580: and for such cases execution is aborted.
1581: 
1582: 
1583: 
1584: \section{Related Work}
1585: 
1586: A first proposal on how to exploit implicit parallelism in tabling
1587: systems was Freire's \emph{Table-parallelism}~\cite{Freire-95}. In
1588: this model, each tabled subgoal is computed independently in a single
1589: computational thread, a \emph{generator thread}. Each generator thread
1590: is associated with a unique tabled subgoal and it is responsible for
1591: fully exploiting its search tree in order to obtain the complete set
1592: of answers. A generator thread dependent on other tabled subgoals will
asynchronously consume answers as the corresponding generator threads
make them available. Within this model, parallelism results from
1595: having several generator threads running concurrently. Parallelism
1596: arising from non-tabled subgoals or from execution alternatives to
1597: tabled subgoals is not exploited. Moreover, we expect that scheduling
1598: and load balancing would be even harder than for traditional parallel
1599: systems.
1600: 
More recent work~\cite{Guo-01} proposes a different approach to the
1602: problem of exploiting implicit parallelism in tabled logic
1603: programs. The approach is a consequence of a new sequential tabling
1604: scheme based on \emph{dynamic reordering of alternatives with variant
1605: calls}. This dynamic alternative reordering strategy not only tables
1606: the answers to tabled subgoals, but also the alternatives leading to
variant calls, the \emph{looping alternatives}. Looping alternatives
1608: are reordered and placed at the end of the alternative list for the
1609: call. After exploiting all matching clauses, the subgoal enters a
1610: looping state, where the looping alternatives, if they exist, start
1611: being tried repeatedly until a fixpoint is reached.  An important
1612: characteristic of tabling is that it avoids recomputation of tabled
1613: subgoals. An interesting point of the dynamic reordering strategy is
1614: that it avoids recomputation through performing recomputation. The
1615: process of retrying alternatives may cause redundant recomputations of
1616: the non-tabled subgoals that appear in the body of a looping
1617: alternative. It may also cause redundant consumption of answers if the
1618: body of a looping alternative contains more than one variant subgoal
1619: call. Within this model, parallelism arises if we schedule the
1620: multiple looping alternatives to different workers. Therefore,
parallelism may not come as naturally as for SLD evaluations and
1622: parallel execution may lead to doing more work.
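
The looping state can be illustrated with a toy example. The sketch
below is not the mechanism of~\cite{Guo-01}; it only shows, for a
hypothetical \texttt{path/2} predicate over an assumed edge list, a
non-looping alternative tried once followed by a looping alternative
retried until no new answer is added. The function
\texttt{tabled\_path} and the round counter are our own devices.

\begin{verbatim}
```python
def tabled_path(source, edges):
    """Answers for path(source, Y) under
         path(X, Y) :- edge(X, Y).
         path(X, Y) :- path(X, Z), edge(Z, Y).   % looping alternative
    """
    table = []  # tabled answers, in insertion order

    def add(answer):
        if answer not in table:
            table.append(answer)
            return True
        return False

    # Non-looping alternative: tried once.
    for x, y in edges:
        if x == source:
            add(y)

    # Looping state: retry the looping alternative until a fixpoint.
    rounds = 0
    changed = True
    while changed:
        changed = False
        rounds += 1
        for z in list(table):  # consume the answers tabled so far
            for x, y in edges:
                if x == z and add(y):
                    changed = True
    return table, rounds
```
\end{verbatim}

The last round never adds an answer: it re-derives only known ones,
which is the kind of redundant recomputation discussed above.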
1623: 
1624: There have been other proposals for concurrent tabling but in a
1625: distributed memory context. Hu~\cite{Hu-PhD} was the first to
1626: formulate a method for distributed tabled evaluation termed
1627: \emph{Multi-Processor SLG (SLGMP)}. This method matches subgoals with
1628: processors in a similar way to Freire's approach.  Each processor gets
1629: a single subgoal and it is responsible for fully exploiting its search
tree and obtaining the complete set of answers. One of the main
1631: contributions of SLGMP is its controlled scheme of propagation of
1632: subgoal dependencies in order to safely perform distributed
1633: completion. An implementation prototype of SLGMP was developed, but as
1634: far as we know no results have been reported.
1635: 
1636: A different approach for distributed tabling was proposed by
1637: Damásio~\cite{Damasio-00}. The architecture for this proposal relies
1638: on four types of components: a \emph{goal manager} that interfaces
1639: with the outside world; a \emph{table manager} that selects the
1640: clients for storing tables; \emph{table storage clients} that keep the
1641: consumers and answers of tables; and \emph{prover clients} that
1642: perform evaluation. An interesting aspect of this proposal is the
1643: completion detection algorithm. It is based on a classical credit
1644: recovery algorithm~\cite{Mattern-89} for distributed termination
1645: detection. Dependencies among subgoals are not propagated and,
1646: instead, a controller client, associated with each SCC, controls the
1647: credits for its SCC and detects completion if the credits reach the
1648: zero value. An implementation prototype has also been developed, but
1649: further analysis is required.
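
The credit recovery idea can be sketched in a few lines. The
simulation below is sequential and deliberately simplified: one
controller hands a unit of credit to an initial task, each task
halves its credit whenever it spawns another task, and finished tasks
return what they still hold. The names \texttt{Controller} and
\texttt{run}, and the task tree used to drive them, are hypothetical
and do not describe the prototype of~\cite{Damasio-00}.

\begin{verbatim}
```python
from fractions import Fraction


class Controller:
    """Tracks the credit handed out for one SCC."""

    def __init__(self):
        self.outstanding = Fraction(0)

    def start(self):
        self.outstanding = Fraction(1)
        return Fraction(1)  # credit carried by the initial task

    def give_back(self, credit):
        self.outstanding -= credit

    def completed(self):
        return self.outstanding == 0


def run(controller, spawn_tree):
    """spawn_tree maps a task name to the tasks it spawns.
    Returns the outstanding credit after each task finishes."""
    trace = []
    pending = [("root", controller.start())]
    while pending:
        task, credit = pending.pop()
        for child in spawn_tree.get(task, []):
            credit /= 2  # keep half, send half along with the child
            pending.append((child, credit))
        controller.give_back(credit)  # task done: return what is left
        trace.append(controller.outstanding)
    return trace
```
\end{verbatim}

Exact fractions are used so that the outstanding credit reaches zero
exactly when the last task has returned its share, and not before.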
1650: 
1651: Marques \emph{et al.}~\cite{Marques-00} have proposed an initial
1652: design for an architecture for a multi-threaded tabling engine. Their
1653: first aim is to implement an engine capable of processing multiple
query requests concurrently. The main idea behind this proposal seems
very interesting, but the work is still at an initial stage.
1656: 
1657: Other related mechanisms for sequential tabling have also been
1658: proposed. Demoen and Sagonas proposed a copying approach to deal with
1659: tabled evaluations and implemented two different models, the
1660: CAT~\cite{Demoen-98} and the CHAT~\cite{Demoen-00}. The main idea of
1661: the CAT implementation is that it replaces SLG-WAM's freezing of the
1662: stacks by copying the state of suspended computations to a proper
1663: separate stack area. The CHAT implementation improves the CAT design
1664: by combining ideas from the SLG-WAM with those from the CAT. It avoids
1665: copying all the execution stacks that represent the state of a
1666: suspended computation by introducing a technique for freezing stacks
1667: without using freeze registers.
1668: 
1669: Zhou \emph{et al.}~\cite{Zhou-00,Zhou-01a} developed a linear tabling
1670: mechanism that works on a single SLD tree without requiring
1671: suspensions/resumptions of computations. The main idea is to let
1672: variant calls execute from the remaining clauses of the former first
1673: call. It works as follows: when there are answers available in the
1674: table, the call consumes the answers; otherwise, it uses the predicate
1675: clauses to produce answers.  Meanwhile, if a call that is a variant of
1676: some former call occurs, it takes the remaining clauses from the
1677: former call and tries to produce new answers by using them. The
1678: variant call is then repeatedly re-executed, until all the available
1679: answers and clauses have been exhausted, that is, until a fixpoint is
1680: reached.
1681: 
1682: 
1683: 
1684: \section{Performance Analysis}
1685: \label{section_performance_analysis}
1686: 
1687: To assess the efficiency of our parallel tabling implementation and
1688: address the question of whether parallel tabling is worthwhile, we
1689: present next a detailed analysis of OPTYap's performance. We start by
1690: presenting an overall view of the overheads of supporting the several Yap
1691: extensions: YapOr, YapTab and OPTYap. Then, we compare YapOr's
1692: parallel performance with that of OPTYap for a set of non-tabled
1693: programs. Next, we use a set of tabled programs to measure the
1694: sequential behavior of YapTab, OPTYap and XSB, and to assess OPTYap's
1695: performance when running the tabled programs in parallel.
1696: 
1697: YapOr, YapTab and OPTYap are based on Yap's~4.2.1 engine\footnote{Note
1698: that sequential execution would be somewhat better with more recent
1699: Yap engines.}. We used the same compilation flags for Yap, YapOr,
1700: YapTab and OPTYap. Regarding XSB Prolog, we used version~2.3 with the
1701: default configuration and the default execution parameters. All
1702: systems use batched scheduling for tabling.
1703: 
1704: The environment for our experiments was \emph{oscar}, a Silicon
1705: Graphics Cray Origin2000 parallel computer from the Oxford
1706: Supercomputing Centre. \emph{Oscar} consists of 96 MIPS 195 MHz R10000
1707: processors each with 256 Mbytes of main memory (for a total shared
1708: memory of 24 Gbytes) and running the IRIX~6.5.12 kernel. While
1709: benchmarking, the jobs were submitted to an execution queue
1710: responsible for scheduling the pending jobs through the available
1711: processors in such a way that, when a job is scheduled for execution,
1712: the processors attached to the job are fully available during the
period of time requested for execution. We limited our
experiments to 32 processors because the machine was always under
very high load and we were restricted to a guest account.
1716: 
1717: 
1718: 
1719: \subsection{Performance on Non-Tabled Programs}
1720: 
A fundamental criterion for judging the success of an or-parallel
model, a tabling model, or a combined or-parallel tabling model is the
overhead that the model introduces when running programs that do not
take advantage of the particular extension. Ideally, a program should
1725: not pay a penalty for mechanisms that it does not require.
1726: 
1727: To place our performance results in perspective we first evaluate how
1728: the original Yap Prolog engine compares against the several Yap
1729: extensions and against the most well-known tabling engine, XSB
1730: Prolog. We use a set of standard non-tabled logic programming
1731: benchmarks. All benchmarks find all the answers for the
1732: problem. Multiple answers are computed through automatic failure after
1733: a valid answer has been found. The set includes the following
1734: benchmark programs:
1735: 
1736: \begin{description}
1737: \item[cubes:] solves the N-cubes or instant insanity problem from
1738:   Tick's book~\cite{Tick-91}. It consists of stacking 7 colored cubes
1739:   in a column so that no color appears twice within any given side of
1740:   the column.
1741:   
\item[ham:] finds all Hamiltonian cycles for a graph consisting of 26
  nodes with each node connected to 3 other nodes.
1744:   
1745: \item[map:] solves the problem of coloring a map of 10 countries with
1746:   five colors such that no two adjacent countries have the same color.
1747:   
\item[nsort:] naive sort algorithm. It sorts a list of 10 elements by
  brute force, starting from the reverse-ordered (and worst) case.
1750:   
\item[puzzle:] places numbers 1 to 19 in a hexagon pattern such that
1752:   the sums in all 15 diagonals add to the same value (also taken from
1753:   Tick's book~\cite{Tick-91}).
1754:   
\item[queens:] a non-naive algorithm to solve the problem of placing
  11 queens on an 11x11 chess board such that no two queens attack each
  other.
1758: \end{description}
1759: 
1760: Table~\ref{non_tabled_sequential} shows the base execution time, in
1761: seconds, for Yap, YapOr, YapTab, OPTYap and XSB for the set of
1762: non-tabled benchmarks. In parentheses, it shows the overhead over the
1763: Yap execution time. The timings reported for YapOr and OPTYap
1764: correspond to the execution with a single worker. The results indicate
1765: that YapOr, YapTab and OPTYap introduce, on average, an overhead of
1766: about 10\%, 5\% and 17\% respectively over standard Yap. Regarding
1767: XSB, the results show that, on average, XSB is 2.47 times slower than
1768: Yap, a result mainly due to the faster Yap engine.
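
As a worked illustration of how the values in parentheses in
Table~\ref{non_tabled_sequential} can be read, the overhead of a system
on a benchmark is the ratio of its execution time to that of Yap. The
symbols $T_{\mathit{YapOr}}$ and $T_{\mathit{Yap}}$ are notation
introduced for this example only. For \emph{cubes} we have:
\[
\frac{T_{\mathit{YapOr}}}{T_{\mathit{Yap}}} = \frac{2.06}{1.97} \approx 1.05
\]
The average row is consistent with the arithmetic mean of the six
per-benchmark ratios. For YapOr, for instance:
\[
\frac{1.05 + 1.14 + 1.14 + 1.14 + 1.09 + 1.05}{6} = \frac{6.61}{6} \approx 1.10
\]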
1769: 
1770: \begin{table}[!ht]
1771: \caption{Yap, YapOr, YapTab, OPTYap and XSB execution time on non-tabled programs.}
1772: \label{non_tabled_sequential}
1773: \begin{tabular}{lrrrrr}
1774: \hline\hline
1775:     {\bf Bench}
1776:     & \multicolumn{1}{c}{\bf Yap}
1777:     & \multicolumn{1}{c}{\bf YapOr}
1778:     & \multicolumn{1}{c}{\bf YapTab}
1779:     & \multicolumn{1}{c}{\bf OPTYap}
1780:     & \multicolumn{1}{c}{\bf XSB} \\
1781: \hline
1782: cubes       &  1.97 &  2.06 (1.05) &  2.05 (1.04) &  2.16 (1.10) &  4.81 (2.44) \\
1783: ham         &  4.04 &  4.61 (1.14) &  4.28 (1.06) &  4.95 (1.23) & 10.36 (2.56) \\
1784: map         &  9.01 & 10.25 (1.14) &  9.19 (1.02) & 11.08 (1.23) & 24.11 (2.68) \\
1785: nsort       & 33.05 & 37.52 (1.14) & 35.85 (1.08) & 39.95 (1.21) & 83.72 (2.53) \\
1786: puzzle      &  2.04 &  2.22 (1.09) &  2.19 (1.07) &  2.36 (1.16) &  4.97 (2.44) \\
1787: queens      & 16.77 & 17.68 (1.05) & 17.58 (1.05) & 18.57 (1.11) & 36.40 (2.17) \\
1788: \noalign{\vspace{.5cm}}
1789: \multicolumn{2}{l}{\it Average}
1790:                     &       (1.10) &       (1.05) &       (1.17) &       (2.47) \\
1791: \hline\hline
1792: \end{tabular}
1793: \end{table}
1794: 
1795: YapOr overheads result from handling the work load register and from
1796: testing operations that \textbf{(i)} verify whether a node is shared
1797: or private, \textbf{(ii)} check for sharing requests, and
1798: \textbf{(iii)} check for backtracking messages due to cut
1799: operations. On the other hand, YapTab overheads are due to the
1800: handling of the freeze registers and support of the forward
trail. OPTYap inherits both sources of
overhead. Considering that Yap Prolog is one of the fastest Prolog
1803: engines currently available, the low overheads achieved by YapOr,
1804: YapTab and OPTYap are very good results.
1805: 
1806: Since OPTYap is based on the same environment model as the one used by
1807: YapOr, we then compare OPTYap's performance with that of
1808: YapOr. Table~\ref{non_tabled_parallel} shows the speedups relative to
1809: the single worker case for YapOr and OPTYap with 4, 8, 16, 24 and 32
1810: workers. Each speedup corresponds to the best execution time obtained
in a set of 3 runs. The results show that YapOr and OPTYap achieve
very similar effective speedups on all benchmark programs. These results
allow us to conclude that OPTYap maintains YapOr's behavior in
exploiting or-parallelism in non-tabled programs, despite including
all the machinery required to support tabled programs.
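
To make the reading of Table~\ref{non_tabled_parallel} concrete, the
speedup with $w$ workers is the ratio between the single worker time
and the time with $w$ workers. The efficiency $E(w)$ below is a derived
quantity that is not reported in the table; it is introduced here only
as an illustration, together with the notation $T(w)$ and $S(w)$:
\[
S(w) = \frac{T(1)}{T(w)}, \qquad E(w) = \frac{S(w)}{w}
\]
Using the average speedups for YapOr, this gives
\[
E(4) = \frac{3.97}{4} \approx 0.99, \qquad E(32) = \frac{21.09}{32} \approx 0.66
\]
and the corresponding OPTYap averages (3.96 and 21.10) yield the same
values to two decimal places.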
1816: 
1817: \begin{table}[!ht]
1818: \caption{Speedups for YapOr and OPTYap on non-tabled programs.}
1819: \label{non_tabled_parallel}
1820: \begin{tabular}{lrrrrrrrrrrr}
1821: \hline\hline
1822:     & \multicolumn{5}{c}{\bf YapOr}
1823:     &
1824:     & \multicolumn{5}{c}{\bf OPTYap} \\ \noalign{\vspace{.2cm}}
1825:     {\bf Bench}
1826:     & \multicolumn{1}{c}{\bf 4}
1827:     & \multicolumn{1}{c}{\bf 8}
1828:     & \multicolumn{1}{c}{\bf 16}
1829:     & \multicolumn{1}{c}{\bf 24}
1830:     & \multicolumn{1}{c}{\bf 32}
1831:     &
1832:     & \multicolumn{1}{c}{\bf 4}
1833:     & \multicolumn{1}{c}{\bf 8}
1834:     & \multicolumn{1}{c}{\bf 16}
1835:     & \multicolumn{1}{c}{\bf 24}
1836:     & \multicolumn{1}{c}{\bf 32} \\
1837: \hline
1838: cubes   & 3.99 & 7.81 & 14.66 & 19.26 & 20.55 & & 3.98 & 7.74 & 14.29 & 18.67 & 20.97 \\
1839: ham     & 3.93 & 7.61 & 13.71 & 15.62 & 15.75 & & 3.92 & 7.64 & 13.54 & 16.25 & 17.51 \\
1840: map     & 3.98 & 7.73 & 14.03 & 17.11 & 18.28 & & 3.98 & 7.88 & 13.74 & 18.36 & 16.68 \\
1841: nsort   & 3.98 & 7.92 & 15.62 & 22.90 & 29.73 & & 3.96 & 7.84 & 15.50 & 22.75 & 29.47 \\
1842: puzzle  & 3.93 & 7.56 & 13.71 & 18.18 & 16.53 & & 3.93 & 7.51 & 13.53 & 16.57 & 16.73 \\
1843: queens  & 4.00 & 7.95 & 15.39 & 21.69 & 25.69 & & 3.99 & 7.93 & 15.41 & 20.90 & 25.23 \\
1844: \noalign{\vspace{.5cm}}
1845: {\it Average}
1846:         & 3.97 & 7.76 & 14.52 & 19.13 & 21.09 & & 3.96 & 7.76 & 14.34 & 18.92 & 21.10 \\
1847: \hline\hline
1848: \end{tabular}
1849: \end{table}
1850: 
1851: 
1852: 
1853: \subsection{Performance on Tabled Programs}
1854: 
1855: In order to place OPTYap's results in perspective we start by
1856: analyzing the overheads introduced to extend YapTab to parallel
1857: execution and by measuring YapTab and OPTYap behavior when compared
1858: with XSB. We use a set of tabled benchmark programs from the
1859: XMC\footnote{The XMC system~\cite{Ramakrishnan-00} is a model checker
1860: implemented atop the XSB system which verifies properties written in
1861: the alternation-free fragment of the modal
1862: $\mu$-calculus~\cite{Kozen-83} for systems specified in XL, an
1863: extension of value-passing CCS~\cite{Milner-89}.}~\cite{xmc} and
1864: XSB~\cite{xsb} \emph{world wide web} sites that are frequently used in
1865: the literature to evaluate such systems. The benchmark programs are:
1866: 
1867: \begin{description}
1868: \item[sieve:] the transition relation graph for the \emph{sieve}
1869:   specification\footnote{We are thankful to C. R. Ramakrishnan for
1870:     helping us in dumping the transition relation graph of the
    automata corresponding to each given XL specification, and in
1872:     building runnable versions out of the XMC environment.} defined
1873:   for 5 processes and 4 overflow prime numbers.
1874: 
1875: \item[leader:] the transition relation graph for the \emph{leader
1876:     election} specification defined for 5 processes.
1877:   
1878: \item[iproto:] the transition relation graph for the \emph{i-protocol}
1879:   specification defined for a correct version (fix) with a huge window
1880:   size (w = 2).
1881:   
1882: \item[samegen:] solves the same generation problem for a randomly
1883:   generated 24x24x2 cylinder. This benchmark is very interesting
1884:   because for sequential execution it does not allocate any consumer
1885:   node. Variant calls to tabled subgoals only occur when the subgoals
1886:   are already completed.
1887:   
1888: \item[lgrid:] computes the transitive closure of a 25x25 grid using a
1889:   left recursion algorithm. A link between two nodes, $n$ and $m$, is
1890:   defined by two different relations; one indicates that we can reach
1891:   $m$ from $n$ and the other indicates that we can reach $n$ from $m$.
1892:   
1893: \item[lgrid/2:] the same as \textbf{lgrid} but it only requires half
1894:   the relations to indicate that two nodes are connected. It defines
1895:   links between two nodes by a single relation, and it uses a
1896:   predicate to achieve symmetric reachability. This modification
  alters the order in which answers are found. Moreover, as indexing
  on the first argument is not possible for some calls, the execution
1899:   time increases significantly. For this reason, we only use here a
1900:   20x20 grid.
1901: 
1902: \item[rgrid/2:] the same as \textbf{lgrid/2} but it computes the
1903:   transitive closure of a 25x25 grid and it uses a right recursion
1904:   algorithm.
1905: \end{description}
1906: 
1907: Table~\ref{tabled_sequential} shows the execution time, in seconds,
1908: for YapTab, OPTYap and XSB for the set of tabled benchmarks. In
parentheses, it shows the overhead over the YapTab execution time. The
execution time reported for OPTYap corresponds to the execution with a
single worker.
1912: 
1913: \begin{table}[!ht]
1914: \caption{YapTab, OPTYap and XSB execution time on tabled programs.}
1915: \label{tabled_sequential}
1916: \begin{tabular}{lrrr}
1917: \hline\hline
1918:     {\bf Bench}
1919:     & \multicolumn{1}{c}{\bf YapTab}
1920:     & \multicolumn{1}{c}{\bf OPTYap}
1921:     & \multicolumn{1}{c}{\bf XSB} \\
1922: \hline
1923: sieve   & 235.31 & 268.13 (1.14) & 433.53 (1.84) \\
1924: leader  &  76.60 &  85.56 (1.12) & 158.23 (2.07) \\
1925: iproto  &  20.73 &  23.68 (1.14) &  53.04 (2.56) \\
1926: samegen &  23.36 &  26.00 (1.11) &  37.91 (1.62) \\
1927: lgrid   &   3.55 &   4.28 (1.21) &   7.41 (2.09) \\
1928: lgrid/2 &  59.53 &  69.02 (1.16) &  98.22 (1.65) \\
1929: rgrid/2 &   6.24 &   7.51 (1.20) &  15.40 (2.47) \\
1930: \noalign{\vspace{.5cm}}
1931: {\it Average} &  &        (1.15) &        (2.04) \\
1932: \hline\hline
1933: \end{tabular}
1934: \end{table}
1935: 
The results indicate that, for this set of tabled benchmark programs,
OPTYap introduces, on average, an overhead of about 15\% over
YapTab. This overhead is very close to that observed for non-tabled
programs (11\%). The small difference results from locking requests to
handle the data structures introduced by tabling. Locks are required
to insert new trie nodes into the table space, and to update subgoal
and dependency frame pointers to tabled answers. These locking
operations are all related to the management of tabled
answers. Therefore, the benchmarks that deal with more tabled answers
are the ones that can potentially perform more locking
operations. This causal relation seems to be reflected in the
execution times shown in Table~\ref{tabled_sequential}, because the
benchmarks that show higher overheads are also the ones that find more
answers. The answers found by each benchmark are presented next in
Table~\ref{tabled_stats}.
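
The 11\% figure for non-tabled programs is not derived explicitly in
the text. One reading that is consistent with the averages of
Table~\ref{non_tabled_sequential} is the overhead of OPTYap relative to
YapTab, obtained as the ratio of their average overheads over Yap:
\[
\frac{1.17}{1.05} \approx 1.11
\]
The mean of the six per-benchmark ratios of OPTYap time to YapTab time
gives the same value to two decimal places; for \emph{nsort}, for
instance, the ratio is $39.95 / 35.85 \approx 1.11$.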
1951: 
1952: Table~\ref{tabled_sequential} also shows that YapTab is on average
about twice as fast as XSB for this set of benchmarks. This may be
1954: partly due to the faster Yap engine, as seen in
1955: Table~\ref{non_tabled_sequential}, and also to the fact that XSB
1956: implements functionalities that are still lacking in YapTab and that
1957: XSB may incur overheads in supporting those functionalities. These
1958: results show that we have accomplished our initial aim of implementing
1959: an or-parallel tabling system that compares favorably with current
1960: \emph{state of the art} technology.  Hence, we believe the following
1961: evaluation of the parallel engine is significant and fair.
1962: 
To gain deeper insight into the behavior of each
1964: benchmark, and therefore clarify some of the results presented next,
1965: we first present in Table~\ref{tabled_stats} data on the benchmark
1966: programs. The columns in Table~\ref{tabled_stats} have the following
1967: meaning:
1968: 
1969: \begin{description} 
1970: \item[first:] is the number of first calls to subgoals corresponding
1971:   to tabled predicates. It corresponds to the number of generator
1972:   choice points allocated.
1973:   
1974: \item[nodes:] is the number of subgoal/answer trie nodes used to
1975:   represent the complete subgoal/answer trie structures of the tabled
1976:   predicates in the given benchmark. For the answer tries, in
1977:   parentheses, it shows the percentage of saving that the trie's
1978:   design achieves on these data structures. Given the $total$ number
1979:   of nodes required to represent individually each answer and the
1980:   number of nodes $used$ by the trie structure, the $saving$ can be
1981:   obtained by the following expression:
1982: \[saving~=~\frac{total~-~used}{total}\]
As an example, consider two answers whose individual representations
require 12 and 8 answer trie nodes respectively. If the answer trie
representing both answers requires only 15 nodes, because 5 nodes are
common to both paths, it
achieves a saving of 25\%. Higher percentages of saving reflect higher
1988: probabilities of lock contention when concurrently accessing the table
1989: space.
1990: 
1991: \item[depth:] is the average depth of the whole set of paths in the
1992:   corresponding answer trie structure. In other words, it is the
1993:   average number of answer trie nodes required to represent an answer.
1994:   Trie structures with smaller average depth values are more amenable
1995:   to higher lock contention.
1996:   
1997: \item[unique:] is the number of non-redundant answers found for tabled
1998:   subgoals. It corresponds to the number of answers stored in the
1999:   table space.
2000:   
2001: \item[repeated:] is the number of redundant answers found for tabled
2002:   subgoals. A high number of redundant answers can degrade the
2003:   performance of the parallel system when using table locking schemes
2004:   that lock the table space without taking into account whether
2005:   writing to the table is, or is not, likely.
2006: \end{description}
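
The saving expression given for the \textbf{nodes} column can be
checked on the two-answer example, where $total = 12 + 8 = 20$ and
$used = 15$:
\[
saving = \frac{20 - 15}{20} = 0.25
\]
As a further illustration, and under the assumption (ours, not stated
in the text) that $total$ can be approximated by the number of unique
answers multiplied by the average depth, the \emph{lgrid} figures of
Table~\ref{tabled_stats} give $total \approx 390625 \times 2 = 781250$
and $used = 391251$, so that
\[
saving \approx \frac{781250 - 391251}{781250} \approx 0.499
\]
which is close to the 49\% reported for that benchmark.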
2007: 
2008: \begin{table}[!ht]
2009: \caption{Characteristics of the tabled programs.}
2010: \label{tabled_stats}
2011: \begin{tabular}{lrrrrrrrr}
2012: \hline\hline
2013:     & \multicolumn{2}{c}{\bf Subgoal Tries}
2014:     &
2015:     & \multicolumn{2}{c}{\bf Answer Tries}
2016:     &
2017:     & \multicolumn{2}{c}{\bf New Answers} \\ \noalign{\vspace{.2cm}}
2018:       {\bf Bench}
2019:     & \multicolumn{1}{c}{\bf first}
2020:     & \multicolumn{1}{c}{\bf nodes}
2021:     &
2022:     & \multicolumn{1}{c}{\bf nodes}
2023:     & \multicolumn{1}{c}{\bf depth}
2024:     &
2025:     & \multicolumn{1}{c}{\bf unique}
2026:     & \multicolumn{1}{c}{\bf repeated} \\
2027: \hline
2028: sieve   &   1 &    7 & &    8624(57\%) & 53   & &    380 & 1386181 \\
2029: leader  &   1 &    5 & &   41793(70\%) & 81   & &   1728 &  574786 \\
2030: iproto  &   1 &    6 & & 1554896(77\%) & 51   & & 134361 &  385423 \\
2031: samegen & 485 &  971 & &   24190(33\%) &  1.5 & &  23152 &   65597 \\
2032: lgrid   &   1 &    3 & &  391251(49\%) &  2   & & 390625 & 1111775 \\
2033: lgrid/2 &   1 &    3 & &  160401(49\%) &  2   & & 160000 &  449520 \\
2034: rgrid/2 & 626 & 1253 & &  782501(33\%) &  1.5 & & 781250 & 2223550 \\
2035: \hline\hline
2036: \end{tabular}
2037: \end{table}
2038: 
2039: By observing Table~\ref{tabled_stats} it seems that \emph{sieve} and
2040: \emph{leader} are the benchmarks least amenable to table lock
2041: contention because they are the ones that find the least number of
2042: answers and also the ones that have the deepest trie structures. In
2043: this regard, \emph{lgrid}, \emph{lgrid/2} and \emph{rgrid/2}
2044: correspond to the opposite case. They find the largest number of
2045: answers and they have very shallow trie structures. However,
\emph{rgrid/2} is a benchmark with a large number of first subgoal
calls, which can reduce the probability of lock contention because
2048: answers can be found for different subgoal calls and therefore be
2049: inserted with minimum overlap.  Likewise, \emph{samegen} is a
2050: benchmark that can also benefit from its large number of first subgoal
2051: calls, despite also presenting a very shallow trie structure.
2052: Finally, \emph{iproto} is a benchmark that can also lead to higher
2053: ratios of lock contention. It presents a deep trie structure, but it
2054: inserts a huge number of trie nodes in the table space. Moreover, it
2055: is the benchmark showing the highest percentage of saving.
2056: 
2057: To assess OPTYap's performance when running tabled programs in
parallel, we ran OPTYap with a varying number of workers for the set of
2059: tabled benchmark programs. Table~\ref{tabled_parallel_batched}
2060: presents the speedups for OPTYap with 4, 8, 16, 24 and 32 workers. The
2061: speedups are relative to the single worker case of
2062: Table~\ref{tabled_sequential}. They correspond to the best speedup
obtained in a set of 3 runs. The table is divided into two main blocks:
2064: the upper block groups the benchmarks that showed potential for
2065: parallel execution, whilst the bottom block groups the benchmarks that
2066: do not show any gains when run in parallel.
2067: 
2068: \begin{table}[!ht]
2069: \caption{Speedups for OPTYap on tabled programs.}
2070: \label{tabled_parallel_batched}
2071: \begin{tabular}{lrrrrr}
2072: \hline\hline
2073:     & \multicolumn{5}{c}{\bf Number of Workers} \\ \noalign{\vspace{.2cm}}
2074:     {\bf Bench}
2075:     & \multicolumn{1}{c}{\bf  4}
2076:     & \multicolumn{1}{c}{\bf  8}
2077:     & \multicolumn{1}{c}{\bf 16}
2078:     & \multicolumn{1}{c}{\bf 24}
2079:     & \multicolumn{1}{c}{\bf 32} \\
2080: \hline
2081: sieve   & 3.99 & 7.97 & 15.87 & 23.78 & 31.50 \\
2082: leader  & 3.98 & 7.92 & 15.78 & 23.57 & 31.18 \\
2083: iproto  & 3.05 & 5.08 &  9.01 &  8.81 &  7.21 \\
2084: samegen & 3.72 & 7.27 & 13.91 & 19.77 & 24.17 \\
2085: lgrid/2 & 3.63 & 7.19 & 13.53 & 19.93 & 24.35 \\
2086: \noalign{\vspace{.5cm}}
2087: {\it Average}
2088:         & 3.67 & 7.09 & 13.62 & 19.17 & 23.68 \\
2089: \hline
2090: lgrid   & 0.65 & 0.68 &  0.55 &  0.46 &  0.39 \\
2091: rgrid/2 & 0.94 & 1.15 &  0.72 &  0.77 &  0.65 \\
2092: \noalign{\vspace{.5cm}}
2093: {\it Average}
2094:         & 0.80 & 0.92 &  0.64 &  0.62 &  0.52 \\
2095: \hline\hline
2096: \end{tabular}
2097: \end{table}
2098: 
2099: The results show superb speedups for the XMC \emph{sieve} and the
2100: \emph{leader} benchmarks up to 32 workers. These benchmarks reach
speedups of 31.50 and 31.18 with 32 workers! Two other benchmarks in
the upper block, \emph{samegen} and \emph{lgrid/2}, also show
excellent speedups up to 32 workers. Both reach a speedup of about 24 with
2104: 32 workers. The remaining benchmark, \emph{iproto}, shows a good
2105: result up to 16 workers and then it slows down with 24 and 32 workers.
2106: Globally, the results for the upper block are quite good, especially
2107: considering that they include the three XMC benchmarks that are more
2108: representative of real-world applications.
2109: 
On the other hand, the bottom block shows almost no speedups at all.
Only for \emph{rgrid/2} with 8 workers do we obtain a slight
speedup of 1.15. The worst case is for \emph{lgrid} with 32 workers,
where we are about 2.5 times slower than execution with a single
worker. In this case, surprisingly, we observed that for the whole set
of benchmarks the workers are busy for more than 95\% of the execution
time, even for 32 workers. The actual slowdown is therefore not caused
by workers becoming idle and searching for work, as usually
happens with parallel execution of non-tabled programs. Here the
2119: problem seems more complex: workers do have available work, but there
2120: is a lot of contention to access that work.
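
For the bottom block of Table~\ref{tabled_parallel_batched}, a speedup
$S$ below 1 corresponds to a slowdown factor of $1/S$ relative to the
single worker case. As a quick check of the figures quoted above, for
\emph{lgrid} and \emph{rgrid/2} with 32 workers we obtain:
\[
\frac{1}{0.39} \approx 2.56, \qquad \frac{1}{0.65} \approx 1.54
\]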
2121: 
The parallel execution behavior of each benchmark program can be
better understood through the statistics described in the tables that
follow. The columns in these tables have the following meaning:
2125: 
2126: \begin{description}
2127: \item[variant:] is the number of variant calls to subgoals
2128:   corresponding to tabled predicates. It matches the number of
2129:   consumer choice points allocated.
2130:   
\item[complete:] is the number of variant calls to completed tabled
  subgoals. This is when the \emph{completed table optimization} takes
  place, that is, when the set of found answers is consumed by
  executing compiled code directly from the trie structure associated
  with the completed subgoal.
2136:   
2137: \item[SCC suspend:] is the number of SCCs suspended.
2138:   
2139: \item[SCC resume:] is the number of suspended SCCs that were resumed.
2140:   
\item[contention points:] is the total number of unsuccessful first
  attempts to lock data structures of all types. Note that when a
  first attempt fails, the requesting worker performs an arbitrary
  number of locking requests until it succeeds. Here, we only consider
  the first attempts.
2146: 
2147:   \begin{description}
2148:   \item[subgoal frame:] is the number of unsuccessful first attempts
2149:     to lock subgoal frames. A subgoal frame is locked in three main
2150:     different situations: \textbf{(i)} when a new answer is found
2151:     which requires updating the subgoal frame pointer to the last
2152:     found answer; \textbf{(ii)} when marking a subgoal as completed;
2153:     \textbf{(iii)} when traversing the whole answer trie structure to
2154:     remove pruned answers and compute the code for direct compiled
2155:     code execution.
2156:     
2157:   \item[dependency frame:] is the number of unsuccessful first
2158:     attempts to lock dependency frames. A dependency frame has to be
2159:     locked when it is checked for unconsumed answers.
2160:     
2161:   \item[trie node:] is the number of unsuccessful first attempts to
2162:     lock trie nodes. Trie nodes must be locked when a worker has to
2163:     traverse the subgoal trie structure during a tabled subgoal call
2164:     operation or the answer trie structure during a new answer
2165:     operation.
2166:   \end{description}
2167: \end{description}
2168: 
To gather these statistics it was necessary to introduce into the
system a set of counters to measure the several parameters. Although
the counting mechanism introduces an additional overhead in the
execution time, we assume that it does not significantly influence the
parallel execution pattern of each benchmark program.
2174: 
Tables~\ref{stats_batched_upper} and~\ref{stats_batched_below} show
respectively the statistics gathered for the group of programs with
and without parallelism. We do not include the statistics for the
\emph{leader} benchmark because its execution behavior proved to be
identical to that observed for the \emph{sieve} benchmark.
2180: 
2181: \begin{table}[!ht]
2182: \caption{Statistics of OPTYap using batched scheduling for the group
2183:          of programs with parallelism.}
2184: \label{stats_batched_upper}
2185: \begin{tabular}{lrrrrr}
2186: \hline\hline
2187:     & \multicolumn{5}{c}{\bf Number of Workers} \\ \noalign{\vspace{.2cm}}
2188:     {\bf Parameter}
2189:     & \multicolumn{1}{c}{\bf  4}
2190:     & \multicolumn{1}{c}{\bf  8}
2191:     & \multicolumn{1}{c}{\bf 16}
2192:     & \multicolumn{1}{c}{\bf 24}
2193:     & \multicolumn{1}{c}{\bf 32} \\
2194: \hline
2195: {\bf sieve}         &        &        &        &        &        \\
2196: variant/complete    &    1/0 &    1/0 &    1/0 &    1/0 &    1/0 \\
2197: SCC suspend/resume  &   20/0 &   70/0 &  136/0 &  214/0 &  261/0 \\
2198: contention points   &    108 &    329 &    852 &   1616 &   3040 \\
2199: ~~~subgoal frame    &      0 &      0 &      0 &      0 &      2 \\
2200: ~~~dependency frame &      0 &      0 &      1 &      0 &      4 \\
2201: ~~~trie node        &     96 &    188 &    415 &    677 &   1979 \\
2202: \noalign{\vspace{.5cm}}
2203: {\bf iproto}        &        &        &        &        &        \\
2204: variant/complete    &    1/0 &    1/0 &    1/0 &    1/0 &    1/0 \\
2205: SCC suspend/resume  &    5/0 &    9/0 &   17/0 &   26/0 &   32/0 \\
2206: contention points   &   7712 &  22473 &  60703 & 120162 & 136734 \\
2207: ~~~subgoal frame    &   3832 &   9894 &  21271 &  33162 &  33307 \\
2208: ~~~dependency frame &    678 &   4685 &  25006 &  66334 &  81515 \\
2209: ~~~trie node        &   3045 &   6579 &  10537 &  11816 &  11736 \\
2210: \noalign{\vspace{.5cm}}
2211: {\bf samegen}       &          &          &          &          &          \\
2212: variant/complete    & 485/1067 & 1359/193 & 1355/197 & 1384/168 & 1363/189 \\
2213: SCC suspend/resume  &    187/2 &   991/11 &  1002/20 &  1024/25 &  1020/34 \\
2214: contention points   &      255 &      314 &      743 &     1160 &     1607 \\
2215: ~~~subgoal frame    &        8 &       52 &      112 &      283 &      493 \\
2216: ~~~dependency frame &        0 &        0 &        1 &        0 &        0 \\
2217: ~~~trie node        &      154 &      119 &      201 &      364 &      417 \\
2218: \noalign{\vspace{.5cm}}
2219: {\bf lgrid/2}       &        &        &        &        &        \\
2220: variant/complete    &    1/0 &    1/0 &    1/0 &    1/0 &    1/0 \\
2221: SCC suspend/resume  &    4/0 &    8/0 &   16/0 &   24/0 &   32/0 \\
2222: contention points   &   4004 &  10072 &  28669 &  59283 &  88541 \\
2223: ~~~subgoal frame    &    167 &   1124 &   7319 &  17440 &  27834 \\
2224: ~~~dependency frame &     98 &   1209 &   5987 &  23357 &  35991 \\
2225: ~~~trie node        &   2958 &   5292 &  10341 &  12870 &  12925 \\
2226: \hline\hline
2227: \end{tabular}
2228: \end{table}
2229: 
The statistics obtained for the \emph{sieve} benchmark support the
excellent speedups shown for parallel execution. It shows an
insignificant number of contention points, it calls only one variant
subgoal, and despite the fact that it suspends some SCCs it
successfully avoids resuming them. In this regard, the \emph{samegen}
benchmark also shows an insignificant number of contention points.
However, the number of variant subgoal calls and the number of
suspended/resumed SCCs indicate that it introduces more dependencies
between workers. Curiously, for more than 4 workers, the number of
variant calls and the number of suspended SCCs seem to be stable. The
only parameter that slightly increases is the number of resumed SCCs.
Regarding \emph{iproto} and \emph{lgrid/2}, lock contention seems to
be the major problem. Trie nodes show similar lock contention,
although \emph{iproto} inserts about 10 times more answer trie nodes
than \emph{lgrid/2}. Subgoal and dependency frames show a similar
pattern of contention, but \emph{iproto} presents higher contention
ratios. Moreover, if we recall from Table~\ref{tabled_sequential}
that \emph{iproto} is about 3 times faster than \emph{lgrid/2} to
execute, we can conclude that the contention ratio for \emph{iproto}
is much higher per time unit, which explains its worse
behavior.
2251: 
2252: \begin{table}[!ht]
2253: \caption{Statistics of OPTYap using batched scheduling for the group
2254:          of programs without parallelism.}
2255: \label{stats_batched_below}
2256: \begin{tabular}{lrrrrr}
2257: \hline\hline
2258:     & \multicolumn{5}{c}{\bf Number of Workers} \\ \noalign{\vspace{.2cm}}
2259:     {\bf Parameter}
2260:     & \multicolumn{1}{c}{\bf  4}
2261:     & \multicolumn{1}{c}{\bf  8}
2262:     & \multicolumn{1}{c}{\bf 16}
2263:     & \multicolumn{1}{c}{\bf 24}
2264:     & \multicolumn{1}{c}{\bf 32} \\
2265: \hline
2266: {\bf lgrid}         &        &        &        &        &        \\
2267: variant/complete    &    1/0 &    1/0 &    1/0 &    1/0 &    1/0 \\
2268: SCC suspend/resume  &    4/0 &    8/0 &   16/0 &   24/0 &   32/0 \\
2269: contention points   & 112740 & 293328 & 370540 & 373910 & 452712 \\
2270: ~~~subgoal frame    &  18502 &  73966 &  77930 &  68313 & 115862 \\
2271: ~~~dependency frame &  17687 & 113594 & 215429 & 223792 & 248603 \\
2272: ~~~trie node        &  72751 &  91909 &  61857 &  62629 &  64029 \\
2273: \noalign{\vspace{.5cm}}
2274: {\bf rgrid/2}       &           &           &           &          &           \\
2275: variant/complete    & 3051/1124 & 3072/1103 & 3168/1007 & 3226/949 &  3234/941 \\
2276: SCC suspend/resume  &  1668/465 &  1978/766 & 2326/1107 & 2121/882 & 2340/1078 \\
2277: contention points   &     58761 &    110984 &    133058 &   170653 &    173773 \\
2278: ~~~subgoal frame    &     55415 &    103104 &    122938 &   159709 &    160771 \\
2279: ~~~dependency frame &         0 &         8 &         5 &      259 &       268 \\
2280: ~~~trie node        &      1519 &      3595 &      5016 &     4780 &      4737 \\
2281: \hline\hline
2282: \end{tabular}
2283: \end{table}
2284: 
2285: The statistics gathered for the second group of programs present very
2286: interesting results. Remember that \emph{lgrid} and \emph{rgrid/2} are
2287: the benchmarks that find the largest number of answers per time unit
2288: (please refer to Tables~\ref{tabled_sequential}
2289: and~\ref{tabled_stats}). The statistics for \emph{lgrid} show
2290: high contention ratios in all parameters considered. Closer analysis
2291: of these statistics reveals a pattern similar to
2292: that of \emph{lgrid/2}. The problem is that the
2293: ratio per time unit is significantly worse for \emph{lgrid}. This
2294: reflects the fact that most of \emph{lgrid}'s execution time is spent
2295: \emph{massively} accessing the table space to insert new answers
2296: and to consume found answers.
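
As a rough check of this similarity, one can compute, from the 32-worker
columns of the two statistics tables, the share of all contention points
that each lock type accounts for. This is only an illustrative reading of
the reported figures; the three listed lock types do not sum to the total,
which presumably also counts other locks.
\begin{displaymath}
\begin{array}{lrrr}
 & \mbox{subgoal frame} & \mbox{dependency frame} & \mbox{trie node} \\
\mbox{\emph{lgrid/2}} & 27834/88541 \approx 31.4\% & 35991/88541 \approx 40.6\% & 12925/88541 \approx 14.6\% \\
\mbox{\emph{lgrid}} & 115862/452712 \approx 25.6\% & 248603/452712 \approx 54.9\% & 64029/452712 \approx 14.1\%
\end{array}
\end{displaymath}
In both benchmarks dependency frames account for the largest share at 32
workers, followed by subgoal frames and trie nodes, while the absolute
number of contention points for \emph{lgrid} is about five times that of
\emph{lgrid/2} ($452712/88541 \approx 5.1$).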
2297: 
2298: The sequential order in which answers are accessed in the trie
2299: structure is the key issue behind the high number of contention
2300: points in subgoal and dependency frames. When inserting a new answer
2301: we need to update the subgoal frame pointer to point at the last found
2302: answer. When consuming a new answer we need to update the dependency
2303: frame pointer to point at the last consumed answer. For programs that
2304: find a large number of answers per time unit, this obviously increases
2305: contention when accessing such pointers. Regarding trie nodes, the
2306: small depth of \emph{lgrid}'s answer trie structure (2 trie nodes) is
2307: one of the main factors that contribute to the high number of
2308: contention points when massively inserting trie nodes. Tries
2309: are a compact data structure in which answers share nodes. Therefore, obtaining good parallel
2310: performance in the presence of massive table access will always be a
2311: difficult task.
2312: 
2313: In the statistics for \emph{rgrid/2}, the number of variant
2314: subgoal calls and the number of suspended/resumed SCCs suggest that
2315: this benchmark leads to complex dependencies between workers.
2316: Curiously, despite the large number of consumer nodes that the
2317: benchmark allocates, contention in dependency frames is not a problem.
2318: On the other hand, contention for subgoal frames seems to be a major
2319: problem. The statistics suggest that the large number of SCC resume
2320: operations and the large number of answers that the benchmark finds
2321: are the key aspects that constrain parallel performance. A closer
2322: analysis shows that the number of resumed SCCs is approximately
2323: constant with the increase in the number of workers. This may suggest
2324: that there are answers that can only be found when other answers are
2325: also found, and that the process of finding such answers cannot be
2326: anticipated. In consequence, suspended SCCs always have to be resumed
2327: to consume the answers that cannot be found sooner. We believe that
2328: this sequencing in the order in which answers are found is the other major
2329: problem that restricts parallelism in tabled programs.
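
To make the contrast between the two frame types concrete, the figures
reported for \emph{rgrid/2} in Table~\ref{stats_batched_below} can be turned
into shares of the total number of contention points. The following is a
simple illustration computed from the 4-worker and 32-worker columns, not an
additional measurement:
\begin{displaymath}
\begin{array}{lrr}
 & \mbox{4 workers} & \mbox{32 workers} \\
\mbox{subgoal frame} & 55415/58761 \approx 94.3\% & 160771/173773 \approx 92.5\% \\
\mbox{dependency frame} & 0/58761 = 0\% & 268/173773 \approx 0.15\% \\
\mbox{trie node} & 1519/58761 \approx 2.6\% & 4737/173773 \approx 2.7\%
\end{array}
\end{displaymath}
Under this reading, subgoal frames account for more than nine tenths of the
contention points at both ends of the range, whereas dependency frames remain
negligible.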
2330: 
2331: Another aspect that can negatively influence this benchmark is the
2332: number of completed calls. Before executing the first call to a
2333: completed subgoal we need to traverse the trie structure of the
2334: completed subgoal. While the trie structure is being traversed, the
2335: corresponding subgoal frame is locked. As \emph{rgrid/2} stores a huge
2336: number of answer trie nodes in the table (please refer to
2337: Table~\ref{tabled_stats}) this can lead to longer periods of lock
2338: contention.
2339: 
2340: 
2341: 
2342: \section{Concluding Remarks}
2343: 
2344: We have presented the design, implementation and evaluation of
2345: OPTYap. OPTYap is the first available system that exploits
2346: or-parallelism and tabling from logic programs. A major guideline for
2347: OPTYap was to make the best use of the excellent technology
2348: already developed for previous systems. In this regard, OPTYap uses
2349: Yap's efficient sequential Prolog engine as its starting framework,
2350: and the SLG-WAM and environment copying approaches, respectively, as
2351: the basis for its tabling and or-parallel components.
2352: 
2353: Through this research we aimed at showing that the models developed to
2354: exploit implicit or-parallelism in standard logic programming systems
2355: can also be used to successfully exploit implicit or-parallelism in
2356: tabled logic programming systems. First results reinforced our belief
2357: that tabling and parallelism are a very good match that can contribute
2358: to expanding the range of applications for Logic Programming.
2359: 
2360: OPTYap introduces low overheads for sequential execution and compares
2361: favorably with current versions of XSB. Moreover, it maintains YapOr's
2362: effective speedups in exploiting or-parallelism in non-tabled
2363: programs.  Our best results for parallel execution of tabled programs
2364: were obtained on applications that have a limited number of tabled
2365: nodes, but high or-parallelism. However, we have also obtained good
2366: speedups on applications with a large number of tabled nodes.
2367: 
2368: On the other hand, there are tabled programs where OPTYap may not
2369: speed up execution. Table access has been the main factor limiting
2370: parallel speedups so far. OPTYap implements tables as tries, thus
2371: obtaining good indexing and compression. However, tries are
2372: designed to avoid redundancy and, to do so, they restrict concurrency,
2373: especially when updating. We plan to study whether alternative designs
2374: for the table data structure can obtain scalable speedups even when
2375: frequently updating tables.
2376: 
2377: Our applications do not show the completion algorithm to be a major
2378: factor in performance so far. In the future, we plan to study OPTYap
2379: over a large range of applications, namely, natural language, database
2380: processing, and non-monotonic reasoning. We expect that non-monotonic
2381: reasoning applications, for instance, will raise more complex
2382: dependencies and further stress the completion algorithm. We are also
2383: interested in the implementation of pruning in the parallel
2384: environment.
2385: 
2386: 
2387: 
2388: \section*{Acknowledgments}
2389: 
2390: The authors are thankful to the anonymous reviewers for their valuable
2391: comments. This work has been partially supported by CLoP$^n$ (CNPq),
2392: PLAG (FAPERJ), APRIL (POSI/SRI/40749/2001), and by funds granted to
2393: LIACC through the Programa de Financiamento Plurianual, Funda\c{c}\~ao
2394: para a Ci\^encia e Tecnologia and Programa POSI.
2395: 
2396: 
2397: 
2398: \bibliographystyle{plain}
2399: \bibliography{references}
2400: 
2401: \end{document}
2402: