Parallel Logic Programming Systems

JACQUES CHASSIN DE KERGOMMEAUX
IMAG/LGI, 46 avenue Félix Viallet, F-38031 Grenoble Cedex, France

PHILIPPE CODOGNET
INRIA, Domaine de Voluceau, Rocquencourt, F-78153 Le Chesnay Cedex, France

Parallelizing logic programming has attracted much interest in the research community, because of the intrinsic OR- and AND-parallelisms of logic programs. One research stream aims at the transparent exploitation of parallelism in existing logic programming languages such as Prolog, while the family of concurrent logic languages develops constructs allowing programmers to express the concurrency—that is, the communication and synchronization between processes—within their algorithms. This article concentrates mainly on the transparent exploitation of parallelism and surveys the most mature solutions to the problems to be solved in order to obtain efficient parallel implementations: the exploitation of OR- and AND-parallelism and their combination, and effective speedups over state-of-the-art sequential implementations. These solutions have been implemented in existing systems. The article also addresses the applicability of the most efficient parallel implementation techniques to current research issues, such as the merging of constraint logic programming with the concurrent logic languages, and the prospective use of highly parallel architectures.

Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles—shared memory; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures; D.1.3 [Programming Techniques]: Concurrent Programming—parallel programming; D.1.6 [Programming Techniques]: Logic Programming; D.3.2 [Programming Languages]: Language Classifications—concurrent, distributed, and parallel languages; D.3.4 [Programming Languages]: Processors—compilers; interpreters; preprocessors; F.4.1 [Mathematical Logic and Formal Languages]: Mathematical Logic—logic and constraint programming

General Terms: Design, Languages, Performance

Additional Key Words and Phrases: AND-parallelism, binding arrays, concurrent logic programming languages, constraints, guard, hash windows, implementation techniques, load balancing, massive parallelism, memory management, multisequential Prolog, nondeterminism, OR-parallelism, scheduling of parallel tasks, static analysis, Warren Abstract Machine
1. INTRODUCTION

Logic programs can be computed sequentially or in parallel without changing their declarative semantics. For this reason, they are often considered well suited to programming multiprocessors. At the same time, parallel architectures
represent the most promising way to increase the computing power of computers in the future. Since multiprocessors remain difficult to use efficiently, an implicitly parallel programming language offers a very attractive means of exploiting the parallelism of the multiprocessors of today and tomorrow.
CONTENTS

1. INTRODUCTION
2. PARALLELISMS IN LOGIC PROGRAMMING
   2.1 Logic Programs
   2.2 Sources of Parallelism in Logic Programs
   2.3 OR-Parallelism in Prolog
   2.4 AND-Parallelism in Prolog
3. LANGUAGE ISSUES
   3.1 Parallelizing Prolog
   3.2 Delays, Synchronization, and Concurrency
   3.3 Concurrent Logic Languages
   3.4 Unifying Prolog and Concurrent Logic Languages
4. IMPLEMENTATION ISSUES
   4.1 From Early Models to Multisequential Systems
   4.2 Efficient Sequential Implementation Techniques
   4.3 OR-Parallelism
   4.4 AND-Parallelism
   4.5 Scheduling of Parallel Tasks
5. SYSTEMS EXPLOITING ONE TYPE OF PARALLELISM
   5.1 Systems Exploiting OR-Parallelism
   5.2 A System Exploiting Independent AND-Parallelism: &-Prolog
   5.3 Systems Exploiting Dependent AND-Parallelism
   5.4 Performance of Systems Exploiting One Type of Parallelism
6. SYSTEMS COMBINING SEVERAL TYPES OF PARALLELISM
   6.1 Systems Combining Independent AND- With OR-Parallelism
   6.2 Systems Combining Dependent AND- With OR-Parallelism
7. CURRENT RESEARCH AND PERSPECTIVES
   7.1 Static Analysis of Logic Programs
   7.2 Parallelism and Constraint Logic Programming
   7.3 Concurrent Constraint Languages
   7.4 Use of Massively Parallel Multiprocessors
8. CONCLUSION
ACKNOWLEDGMENTS
REFERENCES
Logic programming languages are very high-level languages enabling programs to be developed more rapidly and concisely than with imperative languages. However, in spite of important progress in compilation techniques for these languages, they remain less efficient than imperative languages, and their use is mainly constrained to prototyping. Increasing the efficiency of logic programming to the level of imperative languages would certainly enlarge their domain of use and raise the productivity of production programmers.
These reasons persuaded the ICOT to choose logic programming as the basic programming language of the Fifth Generation Computer Systems (FGCS) project [Furukawa 1992; Shapiro and Warren 1993]. One aim of this project was to produce multiprocessors delivering more than one giga-lips, one "lips" being one logical inference per second, an inference being similar to a procedure call of an imperative language. The giga-lips level of performance seemed out of reach when the FGCS project started in 1982, since the most efficient Prolog systems of that period were limited to several kilo-lips. This is no longer the case, mainly because of the spectacular progress made by hardware technology. The most efficient Prolog implementations exceed one mega-lips on today's most powerful RISC microprocessors, and massively parallel multiprocessors can include more than 1000 such processors. However, getting giga-lips-level performance out of massively parallel multiprocessors will only be possible if parallelizing techniques capture enough of the parallelism in logic programs while keeping the overhead of parallel execution limited. Additionally, it is not clear whether a large number of logic programs can benefit from massive parallelism. There exist two main schools of thought in the logic programming community: either parallelism should be made explicit, or it should be kept implicit. Explicit parallelism is the approach developed in concurrent logic languages, which have already been successfully used in operating systems and parallel simulators [Foster 1990; Shapiro 1989]. The aim of implicit parallelism is instead to speed up existing or future logic programs without troubling programmers with parallelism. Both approaches will be presented in this article, although more emphasis will be placed on systems exploiting parallelism implicitly. Research in parallel logic programming has resulted in the definition of a large number of parallel computational models [Delgado-Rannauro 1992a; 1992b;
1992c]. Not all these models can be presented in this article, which will concentrate on the most mature ones, already used in efficient parallel implementations. The organization of the article is the following. The first sections are introductory, defining the different types of parallelism that can be exploited in logic programming, introducing the language issues involved in implicit versus explicit parallelism, and raising the issues involved in implementing logic programming efficiently in parallel. The two following sections survey representative systems exploiting one type of parallelism or combining several types of them. Several important research topics and perspectives are then sketched before the conclusion of the article.

2. PARALLELISMS IN LOGIC PROGRAMMING

Due to their declarative and logical essence, logic programs naturally exhibit different kinds of parallelism. Indeed, the parallelization of logic programs can be seen as a direct consequence of the key feature advocated by Kowalski [1979] in his well-known motto "programs = logic + control."
A problem should be decomposed into a logical part (i.e., a specification written as a set of logical formulas), which encodes the declarative meaning of the program, and a control part (implementation dependent), which provides a way of executing this specification. The most popular instance of the latter, proposed for example by the Prolog language, is sequential depth-first search, but different kinds of parallel execution mechanisms are also possible. Each type of parallelism leads to a specific model of execution that forms the foundations for one of the systems that will be presented later.

2.1 Logic Programs

This brief and intuitive presentation of the basic notions of logic programs is intended to make the paper self-contained. A more complete introduction to logic programming can be found in Lloyd [1987].
A logic program is a set of definite Horn clauses, that is, formulas of the form:

    A :- B1, ..., Bn.    (1)

    A.    (2)

and one query:

    :- Q1, ..., Qp.    (3)
The A, Bi, and Qi are atomic formulas such as p(t1, ..., tn), where p is a predicate symbol and t1, ..., tn are compound terms, constants, or simple variables. In (1), A is the head of the clause, while the conjunction of the Bi is called its body. Each of the Bi is called a goal. Logic programs enjoy both a declarative, or logical, semantics and an operational, or procedural, semantics. Therefore (1) can be logically read as "if B1 and B2, ..., and Bn are true, then A is true," and (2) as "A is a fact that is always true." The declarative interpretation of a logic program P amounts to proving that the query (3) is a logical consequence of the conjunction of the clauses of P. On the other hand, the procedural interpretation of Horn clauses forms the basis of the operational semantics of logic programming. The set of all clauses with the same head predicate symbol p can be considered as the definition of the procedure p. Each goal of the body of a clause can be considered as a procedure call. Parameter passing between the caller and the callee is achieved by unification of the arguments. Unifying two terms consists in finding an assignment of the free variables of both terms which makes the two terms equal. For instance, unifying the two terms¹ f(X, a, Y) and f(b, Z, T) yields the substitution {X ← b, Z ← a, Y ← T}.
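In standard Prolog this is directly observable at the top level, since the built-in =/2 performs unification; a small illustration of our own:

    ?- f(X, a, Y) = f(b, Z, T).
    X = b,
    Z = a,
    Y = T.

Note that Y and T are not bound to a value; they are simply made to denote the same (still free) variable, which is what makes parameter passing multidirectional.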
¹Written with Prolog's identifier convention: variables begin with a capital letter, while function symbols and constants begin with a lower-case letter.
Hence parameter passing is multidirectional. At a given step of the execution of a logic program, the set of goals that remain to be executed, i.e., the continuation of the computation, is called the current resolvent. The execution of a logic program starts by taking the query as the initial resolvent and proceeds by transforming the current resolvent into a new one, as described by the following high-level pseudocode:

● Set the current resolvent to the initial query.
● While the current resolvent is nonempty:
  —Select any goal G of the resolvent.
  —If there exists a clause of the program whose head unifies with G, then (resolution step):
    * select a clause of the program whose head unifies with G;
    * replace G by the body of this clause in the resolvent;
    * apply the substitution (i.e., variable bindings) resulting from unification to the resolvent.
  —Else (backtracking step):
    * if there exists a previous resolvent with untried alternatives, then restore this resolvent;
    * else exit with failure.
● If the resolvent is empty, then exit with success.

Let us illustrate this machinery by a simple example.

Example 2.1. Consider the following traditional "genealogy" program:

    grandfather(X, Z) :- father(X, Y), father(Y, Z).    (1)
    father(john, philip).                               (2)
    father(peter, andy).                                (3)
    father(andy, mark).                                 (4)

and query:

    :- grandfather(X, mark).                            (5)

The query is equivalent to the request: "Find a person whose grandfather is Mark." Let us detail the execution of this program by examining the resolvents produced at each step. During program execution, each use of a program clause requires renaming the variables of the clause with fresh variables distinct from those in the resolvent. This will be achieved in the following example by adding subscripts to variables' names.

(1) The first resolvent is the initial query (5):

    :- grandfather(X, mark).

(2) The (unique) goal of the query (5) unifies with the head of clause (1), that is, grandfather(X1, Z1), with substitution {Z1 ← mark, X1 ← X}, giving the resolvent:

    :- father(X, Y1), father(Y1, mark).

(3) Assume that we select the leftmost goal for continuing the resolution process. It unifies with the head of clause (2) with substitution {X ← john, Y1 ← philip}, giving the resolvent:

    :- father(philip, mark).

(2') Since no clause head in the program unifies with the goal of this resolvent, backtracking occurs, and the last choice (i.e., the second resolvent) is restored:

    :- father(X, Y1), father(Y1, mark).

(3') We now consider an alternative unifying clause to resolve with the leftmost goal, such as clause (3), giving the substitution {X ← peter, Y1 ← andy}, and the resolvent:

    :- father(andy, mark).

(4') This goal unifies with the head of clause (4), with an empty substitution, giving the empty clause as final resolvent:

    □ (empty clause).
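This resolution loop is, in essence, what a standard Prolog engine executes. As a concrete, if simplified, illustration (our own sketch, not part of the original presentation), the loop can be written as the classic "vanilla" meta-interpreter, where Prolog's own clause selection and backtracking supply the resolution and backtracking steps:

    % Minimal "vanilla" meta-interpreter sketch. It assumes the program
    % clauses are accessible through clause/2 (in ISO Prolog this requires
    % the predicates to be declared dynamic).
    solve(true).
    solve((A, B)) :- solve(A), solve(B).
    solve(Goal)   :- clause(Goal, Body), solve(Body).

With the genealogy program loaded, the query solve(grandfather(X, mark)) succeeds with X = peter, the answer substitution derived in the trace above.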
Figure 1. Search tree for the tutorial example. The root grandfather(X, mark) is rewritten (with X1 = X, Z1 = mark) into the resolvent father(X, Y1), father(Y1, mark); the three branches corresponding to the alternative father clauses for the first goal end in father(philip, mark) (fail), father(andy, mark) (success), and father(mark, mark) (fail).
The execution of this logic program is successful, and the result is the binding of the variable of the query, that is, X = peter, which is called an answer substitution. This computation can be graphically summarized by depicting the associated search tree, shown in Figure 1. The nodes of the tree are the resolvents occurring in the computation, and the arcs are labeled by the set of bindings of the corresponding resolution step. We underline, when necessary, the selected goal in a resolvent. A sequential computation according to Prolog's strategy simply consists in a depth-first, left-to-right search of the tree. Logic programming languages such as Prolog also include several extensions to this "pure" logical setting, for instance side-effect primitives (e.g., read and write predicates) and the (in)famous cut operator, which is used to make parts of the computation determinate. Cut appears syntactically as a special goal (written "!") in the body of a clause and affects, when executed, the part of the computation performed after the entry of the clause. A cut removes all the choice-points (i.e., backtracking points created when several unifying clauses exist for a goal) that have been created since the entry of the clause, so that no backtracking in this part of the computation is then possible.
Example 2.2. Suppose that clause (3) is replaced by a clause (3') containing a cut:

    father(peter, andy) :- !.    (3')

This cut is executed just before the success node of the search tree, and will not affect this success, but will remove the choice-point created for the goal father(X, Y1), which still has a remaining alternative (clause (4)). The cut will remove this alternative, that is, delete the rightmost branch of the search tree in Figure 1. Here, this has little effect on the overall computation, since the deleted branch ended with failure; but had it ended in success, this solution would have been deleted as well. Therefore, using cuts can destroy the logical meaning of a program, because a cut may delete successes appearing on the branches it prunes.

2.2 Sources of Parallelism in Logic Programs
There are two intrinsic sources of parallelism in the above execution model, which correspond to the two choices that have to be made to perform an inherently parallel algorithm sequentially. The first choice is the selection of a goal in the resolvent in order to perform a resolution step. One may envisage selecting several goals of the resolvent and
performing all resolution steps simultaneously. This is called AND-parallelism, since it consists of pursuing in parallel goals that must all be satisfied. The second one is the selection of a unifying clause in the program when performing a resolution step and proceeding to a new resolvent. If there exist several such unifying clauses, it is possible to perform several alternative resolution steps in parallel, therefore creating several new resolvents upon which the computation proceeds. This is called OR-parallelism, since it consists of pursuing in parallel alternative goals, any one of which may be satisfied.
2.3 OR-Parallelism in Prolog
OR-parallelism consists of the simultaneous development of several resolvents that would be computed successively by backtracking in a sequential execution. Examining the search tree (Figure 1), an OR-parallel search consists of exploring in parallel each branch of the tree, these branches being indeed OR-branches representing alternative computations. In our example, the computation may split into three OR-branches when computing the second resolvent, using the three clauses of the father procedure. Of course, only the second branch succeeds at the next computation step. It is worth noticing that in our definition of OR-parallelism, simultaneous independent execution is not limited to a single resolution step, and thus reduced to a mere database parallelism while searching for a matching clause, but applies to the entire computation of complete resolvents. Thus, if the OR-split occurs early enough in the computation, the remaining alternative resolvents might involve large computations, therefore yielding coarse-grain parallelism. The main problem in OR-parallelism is to manage different independent resolvents in parallel, in each of the OR-branches. This is not a problem in the example above, where the resolvents are very small, but in general, care must be taken to avoid inefficient copying of large
data structures, such as the current set of variable bindings. Sophisticated algorithms and data structures have been designed to overcome this problem, which we present in Section 5. Another issue to be addressed in this "eager evaluation" of alternative choices is tuning the amount of parallelism in order to stop the system from collapsing under too many processes. Efficient solutions mix backtracking and on-demand OR-parallelism (by idle processors) in so-called "multisequential" models.
2.4 AND-Parallelism in Prolog
AND-parallelism consists of the simultaneous computation of several goals of a resolvent. Nevertheless, if each subcomputation is followed independently, one needs to ensure the compatibility of the binding sets produced by the parallel branches. Consider again the computation in Example 2.1, and more precisely the second resolvent :- father(X, Y1), father(Y1, mark), which has potential AND-parallelism. The execution of the two parallel AND-branches results in the following candidate solution sets:

    Branch 1: goal :- father(X, Y1)
    Variable bindings: {X = john, Y1 = philip}, {X = peter, Y1 = andy}, {X = andy, Y1 = mark}

    Branch 2: goal :- father(Y1, mark)
    Variable binding: {Y1 = andy}
Obviously, only the second set of bindings of the first branch is compatible with the unique binding produced by the second branch. One must "join" the binding sets coming from different AND-branches in order to form a valid solution for the entire resolvent.
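Before turning to how real systems handle this, the join itself can be pictured with a small self-contained Prolog sketch of our own (join/1 and its use of findall/3 are purely illustrative, not a mechanism from any of the surveyed systems):

    :- use_module(library(lists)).   % member/2 (autoloaded in some systems)

    % Collect each AND-branch's solutions independently, then keep
    % only the compatible combinations of bindings:
    join(Solutions) :-
        findall(X-Y1, father(X, Y1), Branch1),
        findall(Y1, father(Y1, mark), Branch2),
        findall(X-Y1,
                ( member(X-Y1, Branch1), member(Y1, Branch2) ),
                Solutions).

With the clauses of Example 2.1 loaded, join(S) yields S = [peter-andy], the only combination in which both branches agree on Y1.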
The main difficulty with AND-parallelism is to obtain coherent bindings for the variables shared by several goals executed in parallel. Run-time checks can be very expensive. This has led to the design of execution models where goals that may possibly bind shared variables to conflicting values are either serialized or synchronized. The first solution leads to independent AND-parallelism: only independent goals—those that do not share any variables—are developed in parallel. The second solution leads to dependent AND-parallelism: concurrent execution of goals sharing variables is possible. Such goals are synchronized by producer/consumer relationships on the shared variables. This approach has been used in concurrent logic programming languages such as Parlog, KL1, and Concurrent Prolog.

3. LANGUAGE ISSUES

The first attempts in the early 80s to parallelize Prolog and to define new models of execution for logic programs raised a number of issues, including some key points in language design. The main debate can be summarized by the following question: "Do we need a new language with specific constructs to express concurrency and parallelism, or should we stick to Prolog and exploit parallelism transparently?" Prolog has been in use for nearly two decades and has established itself as a useful tool for a wide range of problems. Declarative languages become more attractive as computing power increases, since it becomes more feasible to displace the complexity of programming from the user to the internal computation mechanism. Efficient parallel implementations of existing languages are thus desirable to speed up applications that are already developed, and they can moreover lead logic programming a step further in covering effective "real-world" applications. On the other hand, one may sometimes prefer a more flexible control of the search mechanism to the simple but somewhat rigid search control of Prolog. This brings up the problem of communication and synchronization in logic programs, which calls for new features and changes in language design. Thus was born the new domain of concurrent logic languages, sometimes called committed-choice languages. It has paved the way to a wide range of new applications that are more easily modeled by multiple cooperating concurrent processes.

3.1 Parallelizing Prolog

Prolog was originally designed with a sequential target machine in mind. However, parallelizing the language without change is an appealing approach for the following reasons:

● Parallelism can be transparently achieved with "minimal" changes to the initial model, at least at the abstract level, as described in the previous section. All the implementation technology developed for sequential Prolog systems, such as the WAM abstract machine [Warren 1983], can thus be reused. The most successful offsprings of this approach are the multisequential models for both OR- and AND-parallelism that will be presented in Section 4.

● Parallel execution does not add any complexity to the programming language; it helps the user to concentrate on declarative statements without bothering with control issues. This view keeps up with the separation of logic and control advocated by Kowalski [1979] in his well-known formula "program = logic + control."

● The corpus of Prolog programs already developed can be executed without any modification to the programs' source code by parallel systems.
Some problems, however, arise when considering the Prolog constructs that are order sensitive. Most side-effect primitives, such as read or write, require what could be called AND-sequentiality, and are thus necessary sequentialization points for AND-parallel models. OR-parallel models, however, which keep the sequential order of execution of goals, do not suffer from AND-sequentiality. Nevertheless, side-effect primitives can also cause problems in these models, since programmers usually expect the OR-sequentiality of Prolog, which requires that alternative clauses
be executed in the textual order of the program. Problems of OR-sequentiality are even stronger when considering "impure" or "nonlogical" features such as the cut operator or the assert/retract primitives, which dynamically manipulate program clauses. These constructs cause problems in OR-parallel execution, because they are sensitive to the order in which OR-branches are explored. Let us consider the cut operator. All the computation corresponding to the development of the alternative clauses that may be pruned by a cut is called speculative work. This work will be pruned if the cut is executed, but will be developed if the branch containing the cut fails before executing it. Scheduling speculative work is therefore a difficult task in OR-parallel models [Hausman 1990], since anticipation may amount to useless computation while conservatism may prevent parallel execution. To avoid the inherent sequentiality of the cut operator, OR-parallel Prolog systems usually introduce a new "symmetric" cut operator: whereas the cut prunes choice-points in the previous alternative clauses of a predicate (with respect to the textual order of the program), a symmetric cut will prune choice-points in all the alternative clauses of a predicate (both before and after the clause containing the symmetric cut in the textual order of the program). This new operator is obviously easier to implement in an OR-parallel system: as soon as one parallel branch reaches a symmetric cut operator, all other branches are killed, without bothering about the order in which they appear in the text of the program. This is close to the "commit" operator of concurrent logic languages, which will be detailed in the next section.

3.2 Delays, Synchronization, and Concurrency

In Prolog, control of forward execution is given by the order of the goals inside a clause and that of the clauses in the program. However, a more flexible control strategy is sometimes needed. Dialects such as IC-Prolog [Clark and McCabe 1981], Prolog-II [Colmerauer 1986], MU-Prolog [Naish 1984], or SICStus Prolog [Carlsson and Widen 1988] provide primitives to explicitly delay the execution of some goals, and therefore introduce a coroutining mechanism between active goals of the resolvent. Roughly speaking, a predicate can be given a wait declaration specifying that it should not be executed before some of its arguments are instantiated, i.e., bound to nonvariable terms. This corresponds to declaring this goal as a consumer-only goal on those arguments. Such delayed goals in a resolvent can be seen as coroutines that will be woken up by some instantiation of their variables. A more direct and precise control mechanism has been introduced by the family of concurrent logic languages. All these languages have some syntactic way to declare that a process (goal) is either a producer or a consumer of some of its arguments. Such declarations are thus used to produce a synchronization mechanism between active processes, i.e., goals of the resolvent. When unification of a goal would instantiate a consumed variable, this goal is suspended until that variable is sufficiently instantiated (by some other goal). Let us illustrate this machinery by a simple example.

Example 3.1. Consider a very simple program consisting of the three clauses below:

    p(a).    (1)
    p(b).    (2)
    q(b).    (3)

and a query:

    :- p(X), q(X).    (4)

Also suppose that we have stated (we will detail the syntax later on) that p is a consumer of its argument (X) while q is a producer. Consider an execution that selects goal p in the query: the unification with either clause (1) or (2) amounts to instantiating X, and therefore goal p is suspended. On the other hand, q may be selected, and the unification with clause (3), which amounts to binding X to b, is possible because q is a producer of X; p can then be woken up, and unification with clause (2) succeeds without any instantiation, since X is now bound to b. Observe how producer/consumer synchronization is achieved between active goals of the resolvent and results, in this example, in a purely determinate computation, while a usual Prolog computation would have been nondeterminate: p would have been selected; unification with clause (1) would have produced the binding of X to a; then selection of q would have led to a failure and backtracking, bringing up clause (2) for p; q would then succeed.
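Several modern Prolog systems expose this kind of delayed execution directly. As an illustration (our own sketch, using the freeze/2 coroutining primitive available in systems such as SICStus Prolog and SWI-Prolog, rather than any syntax from the languages discussed here), the behavior of Example 3.1 can be reproduced as follows:

    p(a).
    p(b).
    q(b).

    % freeze(X, Goal) delays Goal until X is bound, making p a
    % consumer of X; q then acts as the producer.
    demo(X) :- freeze(X, p(X)), q(X).

The query demo(X) first suspends p(X), lets q bind X to b, and only then wakes p(b), which succeeds: the computation is determinate, with no backtracking.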
3.3 Concurrent Logic Languages

The usual logic programming framework is limited to transformational systems, i.e., systems that, given an original input, transform it to yield some output at the end. This framework is not well suited to reactive systems, i.e., "open" systems able to react to a continuous stream of inputs from the "real world," which are more easily modeled by interacting agents. To address these problems, communication and synchronization techniques derived from concurrency theory [Dijkstra 1975] have been introduced into logic programming, giving rise to concurrent logic programming. The basic notions for the integration of concurrency into logic programs can be traced back to the Relational Language [Clark and Gregory 1981], the ancestor of concurrent languages such as Parlog [Clark and Gregory 1986], GHC and KL1 [Ueda and Chikayama 1990], Concurrent Prolog [Shapiro 1983] and FCP [Shapiro 1989], or CP [Saraswat 1987]. The reader should consult Shapiro [1989] for a detailed genealogy, history, and complete presentation of this programming paradigm.

The main programming concepts behind those languages can be summarized as follows:

● The process interpretation of logic programs [Shapiro 1983] replaces the traditional procedural interpretation. Each goal is seen as a distinct process; AND-connected goals are assumed to run concurrently. This form of AND-parallelism is called dependent AND-parallelism, since goals that share variables (hence "linked," or dependent) can run in parallel.

● Communication is achieved through logical variables: each variable shared by several goals/processes acts as a communication channel, leading to a very powerful and elegant communication mechanism.

● Synchronization is achieved by using a producer/consumer relationship between processes. A process may be blocked during a resolution step until the variables that it consumes are sufficiently instantiated (by some other processes).

Several new major language features are therefore introduced (see example below):

(1) Some way of stating a producer/consumer mode for each variable has to be introduced. Each language has its own way to declare the access mode (Read or Write) of a variable in the predicate declaration, which we will not detail here. These modes are used dynamically to induce a producer/consumer relationship between active processes (goals of the resolvent) on each shared variable, i.e., on each communication channel. This technique amounts to replacing the unification procedure involved at each resolution step by a matching procedure for consumer goals, i.e., allowing binding of the variables in the callee only (head of clause) and not in the caller (goal of the resolvent). This is what happened in Example 3.1: when p(X) was first selected, both clauses (1) and (2) unify, but neither matches,
since unification amounts to binding variable X of the goal. This notion can in fact be rephrased and generalized in the Ask-and-Tell mechanism of concurrent constraint languages, which will be presented in Section 7.3.

(2) The concept of guard, introduced by Dijkstra [1975] and used in imperative languages such as Ada and Occam, is adapted to logic programming. A guard is a switch construct of condition/action pairs, which delays execution until one of the conditions is true, and executes the corresponding action as soon as a single condition is verified. This translates to logic programs as follows: each clause is given an additional guard part that appears between the head and the body. A guard is simply a sequence of goals that must be executed successfully before the body of the clause is entered (all body goals are spawned in parallel). Implementation considerations have led the last generation of languages, such as KL1 or FCP (Flat Concurrent Prolog), to accept only flat guards, i.e., guards composed only of built-in predicates. This ensures that no hierarchy of guard systems will ever be created (hence the name) and that guard checking will be reasonably fast.

(3) "Don't care" nondeterminism replaces the traditional "Don't know" nondeterminism of logic languages. This means that at each resolution step, only one alternative is pursued (and we do not care which), as opposed to pursuing all of them (since we do not know which one to choose). The latter is achieved in a language like Prolog by creating a choice-point whenever more than one clause unifies for a resolution step, and by backtracking upon failure of one computation branch. Alternative terminologies are "indeterminism" for "Don't care" nondeterminism and "angelic nondeterminism" for "Don't know" nondeterminism. Under "Don't care" nondeterminism, no
P. Codognet choice-point is ever created, and no backtracking ever takes place, resulting in a deterministic language. The only nondeterminism lies in the guard check, since all guards are evaluated in parallel. Among all satisfiable guards, one of them is chosen, and to the the computation is committed corresponding clause, neglecting alternative ones. This mechanism is syntactically expressed by the presence of a commit operator between the guard and the body of each clause. The commit indeed corresponds to a generalized and symmetric “cut” operator.
Concurrent logic languages hence depart from traditional logic languages in several ways, most notably by the replacement of unification by matching and by the giving up of "Don't know" ("angelic") nondeterminism. The former is the key to the synchronization mechanism and is the price to pay for a new programming style. The latter is, however, given up only for implementation reasons, since a simple backtracking scheme (such as the chronological backtracking of Prolog) is not easy to realize in a concurrent environment. To give a flavor of what a concurrent logic program is, let us consider the following merge program, written following the syntax of KL1 or FCP(|). This program merges the two input streams (lists) in its first two arguments to produce an output stream in its third argument.
Example 3.2.

    merge([X|In1], In2, Out) :- true | Out = [X|Out'], merge(In1, In2, Out').
    merge(In1, [X|In2], Out) :- true | Out = [X|Out'], merge(In1, In2, Out').
    merge([], In2, Out) :- true | Out = In2.
    merge(In1, [], Out) :- true | Out = In1.
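To give a sense of the process's behavior, here is a hypothetical top-level interaction (our own illustration; the exact query syntax varies between KL1 and FCP systems):

    % ?- merge([1, 2|A], [a|B], Out).
    % The process can immediately emit 1, 2, and a on Out (the exact
    % interleaving is not fixed), and then suspends until A or B is
    % instantiated further.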
This program is very simple: all guards being empty (true predicates indicate empty guards), synchronization will be enforced by the compound terms present in the heads of the clauses. Since unification is replaced by matching, the presence of a compound term in the head of a
clause for a given argument place enforces a consumer mode for this argument, meaning that this term cannot be bound to a variable of the caller goal but should only be checked against an incoming nonvariable term. Thus a merge process will wait until some data arrives on one of its input streams (i.e., a "cons" is produced in one input list). When an input stream is reduced to the empty list, merge simply produces what it is fed through the remaining input stream. This program preserves the order of the elements in both input streams, but guarantees nothing about the rate at which each stream will be served. More complex programs, such as a fair merger, which guarantees that every datum produced by one input stream will eventually be written on the output stream, are presented in Shapiro [1989].

3.4 Unifying Prolog and Concurrent Logic Languages

As described above, parallel Prolog and the concurrent logic languages address two distinct application areas:

● Prolog is more suited for higher-level declarative programs, such as nondeterministic, search-intensive problem solving.
● Concurrent logic languages are more suited when explicit control and fine-grain coordination are essential, and when the problem is more easily modeled by communicating agents.

It was thus natural to try to encompass both paradigms in a single language, for the best of both worlds. The Andorra model was proposed by Warren [1988] to combine OR-parallelism and dependent AND-parallelism (cf. Section 2.4), and it has now bred a variety of idioms and extensions developed by different research groups. The essential idea is to execute determinate goals first and in parallel, delaying the execution of nondeterminate goals until no determinate goal can proceed any more. This was inspired by the design of the concurrent language P-Prolog [Yang and
Aiso 1986], where synchronization between goals was based on the concept of determinacy of guard systems. However, the roots of such a concept can be traced back further to the early developments of Prolog, as for instance in the sidetracking search procedure of Pereira and Porto [1979], which favors the development of goals with fewer alternatives (and hence determinate goals as a special case). This is indeed but another instance of the well-known "first fail" heuristic often used in problem solving, which consists of exploring first the branches that are most likely to fail. An interesting aspect of the Andorra model is its capability to reduce the size of the computation when compared to standard Prolog, since early execution of determinate goals can amount to an a priori pruning of the search space (see Example 3.3 and Section 6.2). The ability to delay nondeterminate goals as long as determinate ones can proceed amounts to a synchronization mechanism managed by determinacy. The programming style of concurrent logic languages, such as cooperation between a producer process and a consumer process, can thus be somewhat mimicked by making the consumer nondeterminate (and hence blocked) as long as the producer has not produced a value, which would then wake up the consumer by making it determinate; see Costa et al. [1991b] for an example. Andorra-based models, such as Andorra-I [Costa et al. 1991a], the Andorra Kernel Language (AKL) [Haridi and Janson 1990], and Pandora [Bahgat 1993], thus support both the Prolog and the concurrent logic programming styles. Moreover, the Andorra Kernel Language is an attempt to fully encompass both Prolog and the concurrent logic languages. The main new language feature is the introduction of the guard construct borrowed from the concurrent logic paradigm, and its associated guard operator.
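The pruning effect can be illustrated with a small sketch of our own (a stand-in for the Example 3.3 referred to above):

    member(X, [X|_]).
    member(X, [_|T]) :- member(X, T).

    % Query: :- member(X, [a, b, c]), X = b.

Under Prolog's left-to-right strategy, member/2 first produces the binding X = a, which the goal X = b then rejects, forcing backtracking. Under the Andorra principle, the determinate goal X = b executes first; the goal member(b, [a, b, c]) then begins with a determinate reduction, since only the second clause matches, so part of the search is pruned before any choice-point is created.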
Figure 4. Backtracking in Prolog implementations. This figure illustrates how locations in the local stack are used to store several bindings (the global stack is not used in the example above, but the situation would be similar if it were). Here, the same location is used to store two of the successive bindings of variable Y, initially bound to philip (1), then unbound when backtracking occurs (2), and then bound to andy (3). The program example is taken from Figure 3.
One problem occurs with stack copying: the worker grabbing work needs to restore its stacks to their state when the OR-node was created. In the example of Figure 3 and Figure 4, the binding of Y to philip, a conditional binding, has to be reset in the copy of the stacks of W1 made in the workspace of W2. Two solutions have been proposed to solve this
problem: the first one, called the Kabu-Wake² model, associates with each binding a logical date, such a date being managed by each worker [Masuzawa et al. 1986] (see Figure 5); the second solution uses the
² Kabu-Wake names a transplanting technique used to grow bonsai trees.
Figure 5. Kabu-Wake model: use of logical dates to identify conditional variables. The logical date of a worker is the number of choice-points in the stack of this worker. The binding of Y to philip was performed after the creation of the choice-point father and is therefore not valid for W2. Copying of the stacks of W1 to W2 does not interfere with W1's activities. The program example is taken from Figure 3.
trail stack to restore the state of the stacks of the worker grabbing work [Ali and Karlsson 1990]. In the latter solution, workers W1 and W2 synchronize during the copy operation so that W2 gets coherent copies of the stacks of W1. Then W2 uses its copy of the trail stack to unbind, in its copies of the stacks, all the conditional bindings performed by W1 which are not valid for W2, reproducing part of the operations done during a backtracking operation (see Figure 4). The first solution results in some time
and space overhead to associate counter values with each binding, while, in the second solution, an idle worker copying the stacks from an active worker slows down the latter, since both workers need to synchronize at least during the copying of the trail stack. The overhead of stack copying can be reduced by observing that an idle worker often shares a part of the program search tree with any active worker. In general, when an idle worker W2 gets work from an active one W1, there is no
Figure 6. Incremental copying of the stacks. W2 pops its top stack segments until it reaches the last common choice-point and then copies on the top of its stacks the stack segments of W1 created between CCP and WCP.
need to copy the complete stacks of the active worker W1, since W1 and W2 already share a part of the program search tree [Ali and Karlsson 1990]. When W2 takes work from the node (choice-point) WCP of W1, W2 backtracks to the last common choice-point (CCP) before copying the segments of the stacks of W1 that are younger than CCP and older than the choice-point WCP (see Figure 6). Additionally, bindings performed by W1 in the common parts of the stacks, between the creations of CCP and WCP, must be installed in the stacks of W2. This can also be done by using the trail stack of W1 or a special data structure recording such bindings (also named value trail or binding list).

4.3.2 Sharing of Stacks

In the stack-sharing scheme, parts of the local and global stacks are shared and parts are private: workers share the parts of the stacks representing the portions of the proof tree that they have in common. Also, private data structures are used to
store conflicting bindings to variable locations of the shared parts of the stacks. Two types of data structures have been proposed: binding arrays [Warren 1987] and hash windows [Borgwardt 1984]. A binding array is an array, private to each worker, which contains such bindings. Variable locations of the shared part, which could otherwise be bound simultaneously to conflicting values by several workers, do not store values (bindings) as in Prolog systems but integers. These integers index the bindings of the variables within the binding array of each worker sharing these variable locations (see Figure 7). In the alternative solution, bindings performed to the same variable locations are stored in data structures called hash windows, attached to each branch of the search tree. An entry in a hash window is computed by hashing the variable location. All the bindings performed along the path leading from the root of the search tree to a branch of the computation are valid for the worker computing this branch. Therefore, hash windows are
Figure 7. Use of binding arrays. This simple representation of the computational state of the example of Figure 3 exemplifies the use of binding arrays. Two workers W1 and W2 share the portion of the local stack containing the variable location Y, bound to philip by W1 and to andy by W2. The global stack is not used in this example.
linked, and variable lookup may imply searching a chain of hash windows of unbounded length (see Figure 8). The concept of “hash window” has led to several implementations [Butler et al. 1986; Chassin de Kergommeaux and Robert 1990; Westphal et al. 1987] (see Section 5.1). An extension of the basic scheme restricts the creation of hash windows to cases where parallelism is actually used [Westphal et al. 1987]. The binding-array scheme adds constant overhead to some variable lookup (dereferencing) operations. Another overhead of this scheme is the updating of the binding array of a worker, when it gets work from another worker. On the contrary, starting a new branch is cheap
in the hash window scheme, since it is only necessary to create a new hash window and link it to the previous one. However, the cost of a variable lookup is unbounded, since it may imply searching a chain of hash windows of unbounded length.
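To make the lookup path concrete, here is a toy Prolog rendering of our own (not a technique from any of the surveyed systems, which realize this at the machine level): a shared cell either holds a value directly or holds an index into the consulting worker's private binding array.

    :- use_module(library(lists)).   % member/2

    % deref(Cell, BindingArray, Value): a cell is val(V) or index(I);
    % a worker's binding array is modeled as a list of I-Binding pairs.
    deref(val(V), _, V).
    deref(index(I), BindingArray, V) :-
        ( member(I-B, BindingArray) -> V = B ; V = unbound ).

For instance, deref(index(3), [1-philip, 3-andy], V) yields V = andy: the same shared cell index(3) dereferences to different values through different workers' arrays, which is exactly how conflicting bindings coexist. A hash-window lookup would instead chase a chain of such tables, which is why its cost is unbounded.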
Figure 8. Hash windows. In the original scheme, a new hash window is created each time a new alternative branch of the search tree is entered, and is linked to the previous one. All conditional bindings to variable locations in the shared part of the stacks are done in the hash windows of the path leading to the current branch of the computation. The program example is taken from Figure 3.

4.3.3 Recomputation of Stacks

To avoid the overheads associated with the stack-copying and stack-sharing solutions, it is possible to have workers compute independently complete paths of the search tree [Clocksin and Alshawi 1988]. The parts of the search tree corresponding to portions of the stacks, which would be either copied or shared in one of the previous schemes, are recomputed by each of the active workers. Each worker is given a predetermined path of the search tree described by an "oracle" (a path in the search tree) allocated by a specialized worker called the "controller." Programs are rewritten to obtain an arity-2 search tree so that oracles can be efficiently encoded as bit strings (see Figure 9).
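A toy Prolog rendering of our own of the oracle mechanism (the actual systems encode and dispatch oracles at a much lower level): with an arity-2 tree, a root-to-node path is just a list of bits that a worker replays deterministically.

    % A search tree node is either leaf(Outcome) or node(Left, Right).
    % follow_oracle(+Bits, +Tree, -Outcome): 0 selects the first
    % alternative, 1 the second.
    follow_oracle([], leaf(Outcome), Outcome).
    follow_oracle([0|Bits], node(Left, _), Outcome) :-
        follow_oracle(Bits, Left, Outcome).
    follow_oracle([1|Bits], node(_, Right), Outcome) :-
        follow_oracle(Bits, Right, Outcome).

For the tree of Figure 9, follow_oracle([1,1], Tree, O) replays the path to branch 3, matching the bit-string encoding 11 mentioned in the caption.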
4.4 AND-Parallelism
In systems exploiting AND-parallelism, several workers may bind the same logical variables to conflicting values. To avoid joining dynamically partial AND-solutions, some systems compute only independent goals in parallel. The other AND-parallel systems exploit dependent AND-parallelism instead.

4.4.1 Independent AND-Parallelism

The main problem is the detection of independence between goals. Independence detection can be done during the execution, in advance at compile-time, or partly at compile-time and partly during the execution.

Run-Time Independence Detection

In the AND-OR process model [Conery 1983], dependencies among the goals of each clause are represented by producer/consumer graphs of the variables appearing in the clause, these graphs being updated at run-time by costly operations. If several goals have ground (constant) arguments, they can be executed in AND-parallel since their execution is guaranteed not to bind these arguments to conflicting values. In the APEX system [Lin and Kumar 1988], clauses are compiled into a producer/consumer data flow graph. One token is associated with each variable V of a clause. The leftmost goal of the clause with the variable V is a generator of V. A goal becomes executable when it has received the tokens for all its shared variables. This model was implemented with tokens represented by bit vectors, the ith bit representing the ith goal. Another bit vector is associated with each goal so that it is possible to test whether all the generators (other goals in the same clause) of the variables in the clause have completed by a simple bit vector operation.

Figure 9. Recomputation of stacks. In order for two workers to compute independently the branches 2 and 3 of this search tree, they will each have to compute the path leading from the root to node 1. The program example from Figure 3 was rewritten to obtain an arity-2 search tree. The computation path leading to branch 3 can thus be encoded as the bit string 11.

Restricted AND-Parallelism

To limit the run-time overhead, DeGroot [1984] has proposed performing half of the independence detection work at compile-time. Programs are compiled into graphs called Conditional Graph Expressions (CGEs), including simple run-time independence tests such as testing for the groundness of a variable. For example, the clause:

    f(X) :- p(X), q(X), s(X).

will be compiled into the following graph expression:

    f(X) = (ground(X) p(X), (ground(X) q(X), s(X))).

If the argument of ground(X) is ground, the two following goals are computed in parallel. Otherwise they are executed sequentially. Depending on the instantiation state of X at run-time, three possible execution graphs may occur (see Figure 10).

Figure 10. Possible execution graphs resulting from Conditional Graph Expressions. If X is ground when f(X) is called, all three goals can be computed in parallel. If X is grounded by the computation of p(X) instead, the two last goals q(X) and s(X) can be executed in parallel. Otherwise, all three goals are executed sequentially.
quicksort(
[XIL], Sorted List, Ace)
:- partition(L,
X, Little, Bigs),
quicksort(Littles,
SortedList,
quicksort(Bigs, quicksort(
SortedBigs,
Ace).
[ ], Sorted List, Sorted List).
In this example, no parallelism can be detected automatically between the two in the first recursive calls to quicksort clause, whereas it can be exploited if the example is rewritten as:
quicksort(
[XILI, SortedList,
Ace)
:- partition (L, X, Littles, Bigs), quicksort(Littles, quicksort(Bigs,
Sorted List, [X ITemp] ), SortedBigs,
Ace),
Temp = SortedBIgs. quicksort(
[ ], Sorted List, Sorted List).
Several extensions have been proposed to increase the number of goals eligible
after
of P(X) goals are
for parallel execution [Hermenegildo and Rossi 1990; Winsborough and Waern 1988]. In practice, even “simple” run-time checks can be very expensive, such as testing for groundless of a complex nested Prolog term. To avoid run-time tests, current research is being concentrated on purely static detection of the independence between goals [Muthukumar and Hermenegildo 1990]. Early results indicate that these methods capture almost as much parallelism as the compilation into CGES. 4.4.2
[X ISortedBigs] ),
4.4.2 Dependent AND-Parallelism
In the following we will assume that dependent AND-parallelism is only used between “determinate” goals, for which at most one clause matches. This is the case with flat concurrent logic languages and with the Andorra model of computation (see Sections 3.3 and 3.4). Most systems implementing dependent AND-parallelism use the goal process model. A goal process is responsible for trying each candidate clause until one is found that successfully commits and then creating the goal processes for the body goals of the clause. The goal being determinate, at most one candidate clause head will unify with the goal. Tentative unification of a goal process with a clause head will either fail, succeed, or suspend on a variable. Dependent AND-parallel systems need therefore to create, sus-
pend, restart, and kill a potentially high number of fine-grained processes. These basic operations on processes may be the source of considerable run-time overhead and need to be limited to obtain an efficient implementation [Crammond 1988]. Process-switching overhead can be limited by reducing the use of the general scheduling procedure to assign a process to a worker: after completing the last child of a process, a worker continues with the parent; similarly, after spawning processes, a worker continues with the execution of its last child. Another very important problem arising in concurrent logic programming systems is memory consumption. Since these systems do not support the Prolog "don't know" nondeterminism, the popping of the stacks performed at backtracking in Prolog systems (see Section 4.2 and Figure 4) cannot be done in these systems, resulting in considerable memory consumption and poor data locality. General garbage collection algorithms that compact the stack and the heap require that all workers synchronize on entering and exiting garbage collection [Crammond 1988]. In the PIM project, an incomplete but faster incremental garbage collection is used instead to increase the locality of references and obtain a better cache-hit ratio. In the Multiple Reference Bit (MRB) algorithm, used inside a processing element or a group of processing elements sharing the same memory (a cluster), each pointer has one bit of information to indicate whether it is the only pointer to the referenced data [Chikayama and Kimura 1987]. MRB is supported efficiently in time and space by the hardware of the PIM machine [Taki 1992] while reclaiming more than 60% of the garbage cells of stream parallel programs.
4.5 Scheduling of Parallel Tasks
The aim of the scheduling of parallel tasks in parallel logic systems is to optimize the global behavior of the parallel system. Schedulers therefore aim to balance the load among workers while limiting as much as possible the overhead arising from task switching. This overhead occurs when workers have exhausted their work and need to move in the computation tree to grab a new task. In order to keep this overhead low, schedulers attempt both to limit the number of task switches, by maximizing the granularity of parallel tasks, and to minimize the cost of each task switch. Additionally, to avoid unnecessary computations, schedulers need to manage speculative work carefully. Maximizing the granularity of parallel tasks would require the ability to estimate, either at compile time or at run time, the computation cost of goals (AND-parallelism) or alternative clauses (OR-parallelism). Current research (see Section 7.1) mainly concentrates on compile-time analysis of granularity [Debray et al. 1990; Tick 1988]. Because of the limited results obtained so far in task granularity analysis, most schedulers use heuristics to select work. One common heuristic used by OR-parallel systems [Baron et al. 1988a; Lusk et al. 1988] is that idle workers are given the highest uncomputed alternative in the computation tree, therefore mixing a breadth-first exploration strategy among workers with the depth-first Prolog computation rule within each worker. To limit the number of task switches between workers, several schedulers share the computation of several nodes of the tree among workers [Ali and Karlsson 1991; Beaumont et al. 1991]. These schedulers then attempt to maximize the amount of shared work. In contrast to the granularity of an unexplored alternative (OR-parallelism) or of a goal waiting to be computed, the amount of sharable work in a branch can easily be estimated by the number of unexplored alternatives. Once an active worker has shared all its available work with an idle one, both workers resume the normal depth-first search of Prolog computation. Workers of independent AND-parallel systems give priority to distributing the closest
work from their current position in the computation tree: goals waiting to be solved are put on a stack accessible from other workers [Hermenegildo 1986]. In most OR-parallel logic systems, an idle worker switching to a new task needs to install its binding array or to copy parts of the stacks of the active worker providing work. In these systems, the cost of task switching depends on the relative positions of the workers providing and receiving work. Schedulers designed for these systems attempt to minimize the cost of task switching by selecting the closest possible work in the computation tree [Ali and Karlsson 1991; Butler et al. 1988; Calderwood and Szeredi 1989]. This is not necessary for systems designed to provide task-switching costs independent of the relative positions of the two workers involved in this operation [Baron et al. 1988a]. Speculative work may be pruned by pending cuts. To avoid performing unnecessary work, a scheduler should give preference to nonspeculative work and, if all available work is speculative, to the least speculative work available [Hausman 1989]. Speculative work is better handled by schedulers performing bottommost dispatching based on sharing [Ali and Karlsson 1991; Beaumont et al. 1991] than by those performing topmost dispatching of work [Calderwood and Szeredi 1989], since the control of the former is closer to the depth-first leftmost computation of sequential Prolog systems.

5. SYSTEMS EXPLOITING ONE TYPE OF PARALLELISM

A large number of models have already been proposed to parallelize Prolog. In this section, we concentrate on the systems exploiting one type of parallelism that have been efficiently implemented on multiprocessor systems, based on the multisequential approach.

5.1 Systems Exploiting OR-Parallelism

The Kabu-Wake system (see Section 4.3.1 and Figure 5) was the first to copy stacks when switching tasks and to use logical dates to discard invalid bindings, apparently without the incremental copying optimization. The Kabu-Wake implementation, on special-purpose hardware, provides nearly linear speedups for large problems computed in parallel. However, it is based on a rather slow interpreter. A more efficient implementation of this model was done on a transputer-based architecture [Briat et al. 1992]. The ANL-WAM system from the Argonne National Laboratories [Butler et al. 1986; Disz et al. 1987] was the first parallel Prolog system based on compiling techniques and implemented on a shared-memory multiprocessor. All trailed logical variables are stored in hash windows (see Section 4.3.2 and Figure 8), even when no parallel computation takes place. Being the first efficient parallel Prolog implementation, the ANL-WAM system has been widely used for experimental purposes. Its performance is limited by the rather low efficiency of the sequential WAM engine used and by the overhead arising from its hash window model. In PEPSys [Baron et al. 1988a; Westphal et al. 1987], hash windows are only used when needed: when distinct logical variables having the same WAM identification are bound simultaneously by several workers. Most accesses to the values of variables are as efficient as in a sequential implementation, but some accesses may involve searching hash window chains of unbounded length. The cost of task installation is independent of the respective positions of the work and of the idle worker in the search tree. In spite of the sharing of the stacks, PEPSys does not require a global address space and has been simulated on a scalable architecture combining shared memory in clusters and message passing between clusters [Baron et al. 1988b]. PEPSys has also been efficiently implemented on a shared-memory multiprocessor. On a single processor, the WAM-based PEPSys implementation runs 30 to 40% slower than SICStus Prolog, both systems being interpreters of WAM instructions implemented in C.
Table 1. Execution Times (Seconds) and Relative Speedups of Aurora. The benchmark programs were executed by Aurora, using the Bristol scheduler, and by SICStus 0.6 on Sequent Symmetry. parse1, parse5, and db5 are parsing and database queries of the Chat80 natural-language database front-end program; 8-queens1 computes all the solutions to the 8-queens problem; tina is a tourism information program. Each entry of the table contains the run time of the program in seconds on the upper line and the relative speedup on the lower line.
In parallel, it provides almost linear speedups for programs having a large enough search space [Baron et al. 1988a]. Other experimental results [Chassin de Kergommeaux 1989] indicate that long hash window chains are rare and do not compromise the efficiency of the implementation. The Aurora system is based on the SRI model [Warren 1987], where bindings to logical variables sharing the same identification are performed in binding arrays (see Section 4.3.2 and Figure 7). An Aurora prototype, based on SICStus Prolog, was implemented on several commercial shared-memory multiprocessors [Carlsson 1990]. Four different schedulers have been implemented: three of them use various techniques for the idle worker to
find the closest topmost node of the search tree with available work [Brand 1988; Butler et al. 1988; Calderwood and Szeredi 1989]. Since experimental results [Szeredi 1989] indicate that the work installation overhead remains fairly limited, the Bristol scheduler [Beaumont et al. 1991] performs sharing of work similar to that of the Muse system, the objective being to maximize the granularity of parallel tasks. All schedulers support the full Prolog language, including side effects. The system has successfully run a large number of Prolog programs, and the performance of some of them has been reported in Beaumont et al. [1991] and Szeredi [1989] (see Table 1). On a single processor, Aurora is slightly more efficient than PEPSys, and on average it also provides slightly better speedups in parallel.
In the Muse model [Ali and Karlsson 1990], active workers share, with otherwise idle workers, several choice-points containing unprocessed alternatives. Sharing is performed by an active worker, which creates an image of a portion of its choice-point stack in a workspace shared with a previously idle worker. The stacks of the active worker are then copied from its local memory to the local memory of the idle worker, most of the copying being performed in parallel by the two workers, using incremental copying of the portions of the stacks which are not identical. Unwinding and installation of bindings in the shared section use the trail stack instead of the logical dates of Kabu-Wake. Muse was implemented on several commercial multiprocessors with uniform and nonuniform memory addressing (see Section 7.4) and on the BC-Machine prototype constructed at SICS, which provides both shared and private memory [Ali and Karlsson 1991]. The Muse implementation is based on SICStus Prolog. It combines a very low sequential overhead over SICStus (5%) with parallel speedups similar to those of Aurora. K-LEAF [Bosco et al. 1990] is a parallel Prolog system implemented on a transputer-based multiprocessor. In contrast to other multisequential systems, it creates all possible OR-parallel tasks. Combinatorial explosion of the number of tasks is avoided by language constructs that ensure a sufficient grain size of parallel tasks. K-LEAF was implemented on an experimental transputer-based multiprocessor providing a virtual global address space. This implementation is based on the WAM, using the binding-array technique. The WAM code is either emulated or expanded into C and then compiled using a commercial C compiler, the latter solution being five times more efficient than the former.
5.2 A System Exploiting Independent AND-Parallelism: &-Prolog

The &-Prolog system is the most mature system exploiting independent AND-parallelism, combining a compiler
performing independence detection with an efficient implementation of AND-parallelism on shared-memory multiprocessors [Hermenegildo and Greene 1990]. The &-Prolog system supports both automatic and user-expressed parallelism. Parallelism can be expressed explicitly with the &-Prolog language. &-Prolog is very similar to Prolog, with the addition of the parallel conjunction operator "&" and of several built-in primitives to perform the run-time independence tests of conditional graph expressions (see Section 4.4.1). For example, the Prolog program

    p(X) :- q(X), r(X).

can be written in &-Prolog as

    p(X) :- ( ground(X) => q(X) & r(X) ).
Automatic parallelism, provided by ordinary Prolog programs, can also be exploited, since the &-Prolog compiler performs a number of static analyses to transform Prolog programs into &-Prolog programs. The static analyses aim at detecting the independence of literals, even in the presence of side effects [Muthukumar and Hermenegildo 1989a; 1989b]. &-Prolog programs are compiled into an extension of the WAM called PWAM [Hermenegildo 1986], whose main difference from the WAM is the addition of a goal stack where idle processors pick work. The PWAM assumes a shared-memory multiprocessor. To adjust the computing resources to the amount of parallel work available, the &-Prolog scheduler organizes PWAMs in a ring. A worker searches for an idle PWAM in the ring. If no idle PWAM is found, and enough memory is available, the worker creates a new PWAM, links it to the ring, and looks for work in the goal stacks of the other PWAMs. The &-Prolog run-time system is based on SICStus and was implemented on shared-memory multiprocessors. The overhead of the &-Prolog system running sequentially over SICStus arises mainly from the use of the goal stack and remains very limited (less than 5%) for the majority of the benchmarks using unconditional parallelism, that is, without run-time tests from conditional graph expressions.
Table 2. Execution Times (Seconds) and Relative Speedups of &-Prolog. The benchmark programs were executed by &-Prolog on Sequent Symmetry and by Quintus 2.2 on a Sun 3/110. The first benchmark is a matrix multiplication. The next three benchmarks are different versions of the quicksort of a list of 1000 elements, using the predicate append or difference-lists, exploiting strict independent AND-parallelism (si) and nonstrict independent AND-parallelism (nsi). The last benchmark, annotator, is a "realistic" program of 1800 lines, the annotator being one of the static analysers used in the &-Prolog compiler.
These tests have been shown to be fairly expensive, which explains why an important research activity is dedicated to purely static detection of independence among goals. Parallel speedups depend on the benchmark program, but even when no parallelism is available (see Table 2), the &-Prolog system remains almost as efficient as the sequential SICStus Prolog implementation.
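To see why the run-time checks mentioned above are costly, consider what an independence test must do. The following hedged sketch (independent/2 is illustrative, not an actual &-Prolog primitive) uses the standard Prolog built-ins term_variables/2 and member/2:

    % T1 and T2 are independent if they share no unbound variable.
    % The test must traverse both terms and compare their variable
    % sets, which can be expensive for large nested structures.
    independent(T1, T2) :-
        term_variables(T1, Vs1),
        term_variables(T2, Vs2),
        \+ (member(V1, Vs1), member(V2, Vs2), V1 == V2).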
5.3 Systems Exploiting Dependent AND-Parallelism
The important research activity dedicated to implementing efficiently concurrent logic languages on sequential and parallel computers has already been reported in a survey paper [Shapiro 1989].
In the following, we mention only some of the parallel implementations of these languages.

5.3.1 Parlog
The JAM abstract machine, based on the WAM, was designed to support concurrent logic languages and the full Parlog language in particular. It was implemented on a shared-memory multiprocessor [Crammond 1988], reaching half of the efficiency of SICStus Prolog. In parallel, benchmarks containing substantial dependent AND-parallelism show speedups of between 12 and 15 on 20 processors, while benchmarks containing no dependent AND-parallelism, such as the queens program, run almost at the same speed in parallel as sequentially (see Table 3).
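The flavor of Parlog programs run by the JAM can be suggested by the classic stream merger below. This is a hedged sketch: the mode declaration marks input (?) and output (^) arguments, and clauses are tried concurrently, with the first successful match committing:

    mode merge(?, ?, ^).
    merge([X|Xs], Ys, [X|Zs]) :- merge(Xs, Ys, Zs).
    merge(Xs, [Y|Ys], [Y|Zs]) :- merge(Xs, Ys, Zs).
    merge([], Ys, Ys).
    merge(Xs, [], Xs).
    % A call with both input streams unbound suspends until one of
    % them is instantiated, so merge/3 synchronizes on its producers.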
Table 3. Execution Times (Seconds) and Relative Speedups of the JAM on Sequent Balance B21.

5.3.2 Flat Concurrent Prolog

FCP was implemented on the distributed-memory multiprocessor Intel iPSC 1 [Taylor 1989; Taylor et al. 1987]. The implementation provides limited speedup for fine-grained systolic computations and good speedup for large-grained programs (see Table 4).

5.3.3 GHC and KL1

A large number of GHC and KL1 implementations were done in the framework of the Japanese Fifth Generation Computer Systems project led by the ICOT. It is only possible to mention a few of them in this article. KL1 was implemented on a commercial shared-memory multiprocessor [Sato and Goto 1988], this implementation being 2 to 9 times slower than Aurora for programs executing the same algorithms [Tick 1989]. The most significant prototypes involved both hardware and software development, with the implementation of several highly parallel machines dedicated to the execution of
the concurrent logic language KL1. These machines, known as Multi-PSI and PIM, are presented in Section 7.4.2.
5.3.4 Commercial Concurrent Logic Language: Strand
Strand (STReam AND-parallelism) is a commercial product that was derived from Parlog and FCP [Foster and Taylor 1989]. In the Strand language, unification is replaced by matching of the head of a clause against a goal. The rule guards reduce to simple tests. An assignment operation can be used in the bodies of the rules, but, as in other logic languages, Strand variables are single-assignment variables. With processor allocation pragmas, programmers can control the degree of parallelism of their programs. A foreign-language interface enables the calling of Fortran or C modules from a Strand program. Strand can thus be used to coordinate the execution of existing sequential programs on multiprocessors, therefore avoiding the need
Table 4. Execution Times and Relative Speedups of FCP on the Intel iPSC1 (uniprocessor versus 16 PEs, d-4 hypercube). The merge sort and matrix multiply programs are systolic algorithms, the other benchmarks being large-grain computations. FCPic is a 600-line polygon-based graphic modeling system; triangle is a puzzle program used to evaluate LP systems; OR-parallel Prolog is a parallel interpreter running a Prolog program computing all permutations of a list of six elements.
to rewrite these programs to parallelize them. The implementation of Strand is based on the Strand Abstract Machine, or SAM, designed to limit the target-dependent parts of the implementation. SAM is divided into a reduction component, similar to a simplified concurrent logic language implementation, and a communication component performing matching (read) and assignments (write). All machine-specific features of the implementation are localized in the communication component. Strand was implemented on a variety of shared- and distributed-memory multiprocessors. The Strand uniprocessor implementation
runs approximately two times faster than Parlog and three times faster than FCP on SUN workstations.
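The language features described above — head matching, simple test guards, and single assignment — can be suggested by a small hedged sketch in Strand-like syntax ('|' separates the guard from the body, and ':=' is the single-assignment operation):

    max(X, Y, Z) :- X >= Y | Z := X.
    max(X, Y, Z) :- X <  Y | Z := Y.
    % The head is matched (not unified) against the call, the guard
    % tests X and Y, and the body assigns the result to Z exactly once.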
5.4 Performance of Systems Exploiting One Type of Parallelism
The systems described above have demonstrated that it is possible to efficiently exploit OR- and independent AND-parallelism on shared-memory multiprocessors containing a limited number (up to several tens) of processors. "Efficiently" means that in programs that exhibit the suitable type of parallelism, almost linear speedup can be obtained, whereas in the worst case, when no parallelism
can be exploited, their speeds remain similar to that of an efficient sequential Prolog implementation such as SICStus Prolog. Dependent AND-parallel systems are slower when they execute the same algorithms, but they enable the exploitation of finer-grained parallelism than OR- and independent AND-parallel systems do. The majority of the parallel systems presented in this section compile logic programs into abstract machine instructions that are usually executed by an interpreter written in C, as is the case in SICStus Prolog. Some commercially available Prolog systems execute abstract machine instructions more efficiently by using an abstract machine interpreter written in assembly language (Quintus Prolog) or by generating target machine code from the abstract machine instructions (BIM Prolog). The parallelizing techniques developed in the systems presented above remain valid for parallelizing these efficient commercial systems, although improving the efficiency of the sequential engine will probably decrease the parallel speedup [Maes 1992]. Whether it is cost-effective to use nonscalable shared-memory multiprocessors to solve symbolic problems programmed in logic languages is still questionable. This situation may change when multiprocessor technology becomes more mature and multiprocessor workstations become as common as uniprocessor ones. Then a parallel logic programming system, providing significant speedup in the best cases and no decrease in speed otherwise, will be a good way of exploiting the computing power delivered by the multiprocessor platform. The use of more scalable, highly and massively parallel multiprocessors to solve larger problems, currently intractable sequentially, is discussed in Section 7.4.

6. SYSTEMS COMBINING SEVERAL TYPES OF PARALLELISM
Important benefits can be expected from the combination of several types of parallelism into a single parallel logic
programming system implementation: mainly an increase in the number of programs that can benefit from parallelism and a reduction of the search space of some of them. Although systems exploiting one type of parallelism have demonstrated their ability to reach effective speedups over state-of-the-art sequential Prolog systems, a large number of parallel programs cannot benefit from speed improvements: this is the case for deterministic programs in OR-parallel systems and for large search programs in AND-parallel systems. Additionally, while OR- and independent AND-parallel systems mainly exploit coarse-grained parallelism, dependent AND-parallel systems exploit fine-grained parallelism, present in a large number of applications. Combining OR- with AND-parallelism may also result in significant reductions of the program search space when several recomputations of the same independent AND-branch, due to backtracking into a previous AND-branch, can be avoided by "reusing" the solutions already produced in each AND-branch (see Figure 11). However, the combination of AND- with OR-parallelism raises difficult problems. Control of the execution must guarantee that all solutions to the AND-branches are cross-produced without introducing too much overhead to store partial solutions or to synchronize workers. Merging of partial solutions during the cross-product of AND-branches is a complex operation whatever binding scheme is being used: stack copying or stack sharing using binding arrays or hash windows. The solutions proposed to solve these problems [Conery 1992; Costa et al. 1991a; Delgado-Rannauro 1992b; Gupta and Jayaraman 1990; Gupta and Hermenegildo 1991a; Kalé and Ramkumar 1990; 1992; Westphal et al. 1987] are too complex to be presented in this article. Indeed, the idea of "sharing" solutions between OR-branches is not unique to parallel execution models; it also appears in a few sequential Prolog systems that attempt to "memoize" and "reuse" partial solutions in order to avoid recomputations, such as that of Villemonte de la Clergerie [1992]. Such systems are not yet mature and competitive enough with traditional Prolog systems to serve as a basis for parallel models. Moreover, simulations of a variety of programs [Shen and Hermenegildo 1991] indicate that only a few of them would benefit from the reuse of partial solutions of independent AND-branches.
    and_par(Parameters) :-
        left_hand(Left_Parameters, Left_Results),
        right_hand(Right_Parameters, Right_Results).

    left_hand(X, Y) :- ...    /* First branch  */
    left_hand(X, Y) :- ...    /* Second branch */
    left_hand(X, Y) :- ...    /* Third branch  */

    right_hand(Z, T) :- long_computation(...).

Figure 11. Reduction of the search space by use of cross-product. In a sequential system, the right_hand predicate is computed three times, once for each solution to the left_hand one. In a system combining AND- with OR-parallelism, the solution to the right_hand predicate will be computed once and "cross-produced" with the three solutions to the left_hand predicate, thus saving two (long) computations of the right_hand predicate.
6.1 Systems Combining Independent AND- with OR-Parallelism
The ROLOG system implements the Reduce-OR Process Model (ROPM) [Kalé 1987], which combines independent AND- with OR-parallelism. Programs are compiled into an abstract machine inspired by the WAM [Ramkumar and Kalé 1989]. The implementation uses a machine-independent run-time kernel called the CHARE kernel, which made possible the porting of ROLOG to a variety of shared- and distributed-memory parallel machines. Because of the complexity of the execution model, the sequential efficiency of ROLOG is several times lower than that of efficient parallel logic systems exploiting one type of parallelism. In parallel, it provides linear speedups for benchmarks
where programmer annotations ensure a sufficient granularity of tasks. Experimental results also indicate that significant reductions of the search space can be obtained in several programs by avoiding the recomputation of AND-parallel branches due to backtracking [Ramkumar 1991]. The ACE model [Gupta and Hermenegildo 1991b] aims to combine the independent AND-parallel approach of the &-Prolog system with the OR-parallel approach of the Muse system. No effort is made to reuse results of independent AND-subcomputations in different OR-branches, as described above, but the strength of this model lies in its (re)use of simple techniques whose efficiency has already been proved.

6.2 Systems Combining Dependent AND- with OR-Parallelism
As presented in Section 3.4, execution models based on the Andorra principle have been proposed to combine OR- and dependent AND-parallelism. They intend to subsume both the Prolog and the concurrent programming paradigms. There exist two main streams of research in this area, depending on the language being supported: the first one, based on the Basic Andorra Model, supports the Prolog
language, while the other one, based on the Andorra Kernel Language, integrates concurrent language constructs.

6.2.1 Basic Andorra Model

The Andorra-I system is an implementation of the Basic Andorra Model on shared-memory multiprocessors [Yang et al. 1993]. The Andorra-I system consists of a compiler, an engine, and a scheduler. The compiler performs a program analysis, based on abstract interpretation techniques, in order to determine, for each procedure, the mode of its arguments, i.e., the possible instantiation types. This information is used, together with the analysis of pruning operators and side effects, to generate sequencing instructions that should be taken into account by the execution model. Specialized code is also generated to test efficiently the determinacy of procedures. The compiler produces Andorra-I Abstract Machine code, that is, an extension of Crammond's JAM (see Section 5.3.1). The engine supports OR-parallelism by borrowing techniques from the Aurora system (binding arrays) and exploits dependent AND-parallelism by using techniques derived from the JAM implementation of Parlog. Andorra-I programs are executed by teams of workers (abstract processing agents). Each team is associated with a separate OR-branch in the computation tree and undertakes the parallel execution of determinate goals in that branch, if any. Otherwise a nondeterminate goal is chosen, usually the leftmost goal, to follow Prolog's strategy, and a choice-point is created. The team will then explore one of the OR-branches. If several teams are available, OR-branches are explored in parallel using the SRI model as in Aurora, whose scheduler is also used to distribute work. Within a team, all workers share the same execution stacks but manage their own run queue of goals waiting for execution. When a local queue becomes empty, the worker will try to
find work within the team, as in the implementation of Parlog. Performance evaluation of Andorra-I on Sequent Symmetry shows that the relative OR-parallel speedup of Andorra-I is comparable with that of Aurora and that the relative AND-parallel speedup of Andorra-I is comparable with that of the JAM. Sequentially, Andorra-I runs as fast as the JAM and 2.5 times slower than SICStus Prolog on average, that is, approximately two times slower than Aurora. The ability of Andorra to reduce the amount of computation was estimated by measuring the total number of inferences executed for several problems. The reduction can attain one order of magnitude. When both AND- and OR-parallelism are exploited, the overall speedup is greater than that achievable with one kind of parallelism alone. The Basic Andorra Model was also implemented as an extension of the Parallel NU-Prolog system [Naish 1988; Palmer and Naish 1991], which implements dependent AND-parallelism. A simple variant of the compilation techniques developed for concurrent logic languages [Klinger and Shapiro 1988] is used to construct a decision tree of the conditions under which a goal is determinate. First parallel experiments were limited to 4 processors, giving speedups ranging from 2 to 3.4 for simple test programs.

6.2.2 Extensions to the Basic Andorra Model

Warren [1990] recently proposed a new execution model that extends the Basic Andorra Model by allowing parallel execution between nondeterminate goals as long as they perform "local" computations, i.e., do not instantiate nonlocal variables. In this way nondeterminate goals are synchronized (i.e., blocked) only when they try to "guess" the value of some external variable. Observe that such extra parallelism contains independent AND-parallelism as a subcase. The Extended Andorra Model thus combines all three kinds of parallelism: OR-, dependent AND-, and independent AND-parallelism.
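The determinacy-first principle underlying all of these models can be illustrated with a hedged toy example:

    q(X) :- X = a.
    q(X) :- X = b.
    r(b).
    % ?- q(X), r(X).
    % Prolog tries q/1 first, binds X = a, fails on r(a), and backtracks.
    % Under the Basic Andorra Model, r(X) is determinate (a single
    % clause matches) and runs first, binding X = b; q(X) then becomes
    % determinate as well, so the query succeeds without creating any
    % choice-point.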
The IDIOM model presents another combination of the three kinds of parallelism [Gupta et al. 1991c]. It uses Conditional Graph Expressions (CGEs) to express independent AND-parallelism, as in &-Prolog, and execution proceeds as follows. First, in the dependent AND-parallel phase, all determinate goals are evaluated in parallel. When no more determinate goals exist, the leftmost goal is selected for expansion. If it is a CGE, then an independent AND-parallel phase is entered; otherwise a choice-point is created as in the Basic Andorra Model. Solution sharing is handled by cross-producing partial solutions of independent subcomputations. The two extended models presented above have not yet been implemented.

6.2.3 Andorra Kernel Language

The Andorra Kernel Language (AKL) [Janson and Haridi 1991] extends the Basic Andorra Model by borrowing some concurrent language constructs (see Section 3.4). The semantics of AKL [Haridi and Janson 1990] is given by a set of rewrite rules on AND/OR trees, which has led to the design of a simple abstract machine and a sequential implementation. Parallel implementations are currently under development, based on the integration of a Muse-like mechanism for handling OR-parallelism. Independently, a new abstract machine was developed to accommodate the new constructs of AKL [Palmer and Naish 1991]. It was inspired by the JAM, as was the Andorra-I implementation. Another execution model [van Hecker et al. 1991] exploits mostly OR-parallelism and uses hash windows similar to those of PEPSys, which seem more suited to the OR-parallel execution of AKL than binding arrays.

7. CURRENT RESEARCH AND PERSPECTIVES

An important research activity aimed at extending existing models and systems has already been mentioned in previous sections. In this section we summarize
the development of static analysis techniques for logic programs. Additionally, this section reviews a research domain we deem very promising: the combination of constraints and parallelism and the development of concurrent constraint languages. We also consider the challenge of efficiently using the massively parallel computers that have become commercially available.

7.1 Static Analysis of Logic Programs
An important research area in logic programming is that of static (compile-time) program analysis, which can be used to predict run-time behavior of programs as an aid to optimization. This analysis is usually based on abstract interpretation techniques, introduced by the seminal work of Cousot and Cousot [1977] for flowchart languages and developed in the domain of logic programming since the mid-80s [Bruynooghe 1991; Jones and Sondergaard 1987; Mellish 1986]. Abstract interpretation is a general scheme for static data flow analysis of programs. Such analysis consists primarily of executing a program on some special domain called "abstract" because it abstracts only some desired property of the concrete domain of computation. The use of abstract interpretation for parallel execution falls roughly into three major kinds of analysis:
● dependency analysis, which approximates data dependencies between literals due to shared variables,
● granularity analysis, which approximates the size of computations,
● determinacy analysis, which identifies deterministic parts of programs.

Various abstract domains have been designed that approximate the values of logical variables into only some groundness and sharing information. Let us take a simple example to illustrate this. Consider the substitution:

    {X ← X1, Y ← f(U, b), Z ← g(U), X1 ← a}.

It can be approximated by a groundness component (which variables are bound to terms that do not contain any free variable?)

    {X is ground, X1 is ground},

and a sharing component (which variables share a common variable as a subterm?)

    {Y and Z share}.

Obviously we have lost some information with respect to the original substitution (e.g., the structure of terms), but specific information that can be used for optimization has been extracted. This is currently used in independent AND-parallel models such as that of Hermenegildo and Greene [1990] in order to find out at compile time which predicates will run in parallel [Muthukumar and Hermenegildo 1990], thus avoiding costly run-time tests (see Section 4.4.1). Experimental results indicate that a large part of the potential parallelism of programs is captured, although some potential parallelism may be lost, compared to a method performing precise but costly run-time analysis. In dependent AND-parallel models, the same kind of information can be used to determine an advantageous scheduling order among active processes and to avoid useless process creation and suspension. Another useful application is static detection of deadlocks in concurrent programs [Codognet et al. 1990]. Task granularity analysis, by discriminating large tasks from small tasks not worth parallelizing, strives to improve the scheduling policy of AND- and OR-parallel models. Research has only just begun on this important topic [Debray et al. 1990]. Determinacy analysis is required by Andorra-based models in order to simplify run-time tests. As execution models become more and more complex with the combination of several kinds of parallelism, the need for compile-time analysis increases in order to compile programs more simply and more efficiently.
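For instance, in the &-Prolog notation of Section 5.2 (a hedged sketch), when the analysis proves that X is ground at every call to p/1, the compiler can drop the run-time check of the conditional graph expression:

    % before analysis: the groundness of X must be checked at run time
    p(X) :- ( ground(X) => q(X) & r(X) ).

    % after analysis proves X ground at all call sites
    p(X) :- q(X) & r(X).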
7.2 Parallelism and Constraint Logic Programming

Constraint Logic Programming (CLP) languages provide an attractive paradigm that combines the advantages of logic programming (declarative semantics, nondeterminism, logical variables, partial-answer solutions) with the efficiency of special-purpose constraint-solving algorithms over specific domains such as reals, rationals, finite domains, or booleans. The key point of this paradigm is to extend Prolog with the power of dealing with domains other than (uninterpreted) terms and to replace the concept of unification with the general concept of constraint solving [Jaffar and Lassez 1987]. Several languages, such as CLP(R) [Jaffar and Lassez 1987], CHIP [Dincbas et al. 1990; Van Hentenryck 1989b], and Prolog-III [Colmerauer 1990], show the usefulness of this approach for real industrial applications in various domains: combinatorial problems, scheduling, cutting stock, circuit simulation, diagnosis, finance, etc. The use of constraints indeed leads to a more natural representation of complex problems. A most promising perspective for CLP languages is the parallel execution of such languages, where both OR- and AND-parallelism can be applied. The simultaneous exploration of alternative paths in the search tree provided by OR-parallelism can be particularly useful in problems such as optimization problems. Sequential CLP systems use a branch-and-bound method that involves searching for one solution and then starting over with the added constraint that the next solution should be better than the first one, thus ensuring an a priori pruning of nonoptimal branches. With OR-parallelism, one can move away from a pure depth-first search and mimic a limited breadth-first search (limited by the computing resources available), since several alternative branches are explored simultaneously by parallel processes. Such a scheme allows for "superlinear" relative
speedups, since the simultaneous search is more likely to find a better solution quickly, therefore improving the pruning of the search space and the overall performance. Experiments were done by combining the finite-domain solver of CHIP and the PEPSys OR-parallel system [Van Hentenryck 1989a] and were pursued in the ElipSys system [Veron et al. 1993]. They show that impressive speedups with respect to an efficient sequential implementation can be achieved. The ElipSys system shows that "superlinear" relative speedups can be achieved on big graph-coloring problems (up to a factor of 100 for 12 processors) and that nearly linear speedups can be achieved for a branch-and-bound algorithm on disjunctive scheduling problems (up to a factor of 9 for 10 processors). AND-parallelism can also be used for better performance. Independent AND-parallelism corresponds to partitioning the global constraint system into several smaller (independent) subsystems that can be solved more easily. Since problems tackled by constraint systems are usually very large and NP-hard, breaking a global problem into several subproblems that can be solved independently is an important issue. Andorra-based models, by favoring deterministic computations, can also greatly improve CLP. They provide a good heuristic to guide the constraint-solving process. This is the case, for instance, when solving disjunctive constraints. In CLP languages, disjunctive constraints are treated by using the nondeterminism of the underlying Prolog language. A choice-point is created for each disjunctive constraint, and the different alternatives are then explored by backtracking. The (static) order in which the disjunctive constraints are stated will be the order in which the choice-points will be created and therefore will direct the overall search mechanism. A poor order could lead to bad performance due to useless backtracking. The Andorra principle, by delaying choice-point creation, will make the problem as deterministic as possible before handling the disjunctive constraints.
In this way, computations are factorized between alternatives, and some disjunctions may even be rendered deterministic, thus reducing the complexity of the problem.
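For instance, in a CLP(FD)-style language (a hedged sketch; the #=< notation follows common finite-domain solvers rather than CHIP's exact syntax), the requirement that two tasks sharing a resource must not overlap is naturally stated as a disjunction of two clauses:

    % Either task A finishes before task B starts, or vice versa.
    % Each call creates a choice-point; the Andorra principle would
    % delay it until other constraints make one branch infeasible.
    no_overlap(StartA, DurA, StartB, _DurB) :- StartA + DurA #=< StartB.
    no_overlap(StartA, _DurA, StartB, DurB) :- StartB + DurB #=< StartA.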
7.3 Concurrent Constraint Languages
The current CLP framework is based on Prolog-like languages and hence is limited to transformation systems (see Section 3). New research has begun to extend this to reactive systems by investigating concurrent constraint languages. These languages were recently introduced by Saraswat [1993] and Saraswat and Rinard [1990]. They are based on the ask-and-tell mechanism [Maher 1987] and generalize in an obvious manner both concurrent logic languages and constraint logic languages. The key idea is to use constraints to extend the synchronization and control mechanisms of concurrent logic languages. Briefly, multiple agents (processes) run concurrently and interact by means of a shared store, i.e., a global set of constraints. Each agent can either add a new constraint to the store (Tell operation) or check whether the store entails a given constraint (Ask operation). Synchronization is achieved through a blocking ask: an asking agent may block and suspend if one can state neither that the asked constraint is entailed nor that its negation is entailed—i.e., nothing prevents the constraint from being entailed, but the store does not yet contain enough information. The agent is suspended until other concurrently running agents add (Tell) new constraints to the store to make it strong enough to decide. Obviously, the instantiation of this framework to unification constraints over terms (i.e., equations of the form t1 = t2) nicely rephrases the usual concurrent logic languages of Section 3. Telling such a constraint corresponds to performing unification, while asking a constraint corresponds to performing matching: Ask(X = a) with respect to a store S will suspend as long as S is not strong enough to entail X = a, i.e., does not contain
Logic
Use of Massively Parallel Multiprocessors
Most existing parallel logic programming systems described in the previous sections were implemented on nonsalable shared-memory UMA (Uniform Memory Access) architectures. An ever increasing number of massively parallel multiprocessors (that is, scalable to a maximum number of processing elements on the order of 103, are becoming available. These multiprocessors may provide logically shared memory based on physically distributed memory, access time to the shared memory being nonuniform (NUMA: Non Uniform Memory Access, or COMA: Cache-Only Memory Access). Examples of such multiprocessors include the BBN Butterfly TC2000 (NUMA), composed of up to 128 processing elements, Motorola M881OO, and the Kendall Square Research KSR-1 (COMA), composed of up to 1088 proprietary processors. Most massively parallel multiprocessors have the
Systems
w
329
NORMA (NO Remote Memory Access) architecture, whose interprocessor communication is performed by message passing. The commercial massively parallel multiprocessors appearing on the market include the CS-2 produced by a European consortium of three manufacturers (Meiko, ParSys, and Telmat) with up to 1024 Spare processors, the SP-1 from IBM with up to 128 RS-6000 processors, the CM-5 from Thinking Machines with up to 1024 Spare processors, and the Intel Paragon with up to 4096 processors [Bell 1992; Ramanathan and Oren 1993]. Although intended primarily to solve number-crunching applications, these multiprocessors could quickly help solve large symbolic applications, currently untractable on sequential computers. In order for parallel logic programming systems to efficiently use a large number of processing elements, it is necessary that the techniques used in these systems scale well with the number of processing elements.
7.4.1 7.4
Programming
Processor-Memory
Connection
Issue
Systems performing sharing of computation stacks and designed for UMA shared-memory multiprocessors may not be appropriate for NUMA multiprocessors such as the BBN Butterfly. On this machine, accessing local memory on a node is one order of magnitude faster than accessing shared memory on a remote node. Since cache coherency is not provided, shared memory cannot be cached, and local accesses to shared memory are slower than local accesses to private memory. This is the case for all accesses to the execution stacks, which cannot be cached since they are shared among workers. This problem was exhibited by the Aurora prototype implementation on the Butterfly [Mudambi 1991], which ran sequentially 80% slower than SICStus, in contrast to its 20% slower sequential performance on the UMA shared-memory Sequent Symmetry. In spite of these problems, the experiments performed demonstrated the possibility of obtaining near-linear speedup with 80 processing elements computing large problems.
Table 5. Execution Times and Speedups of the MUSE Prototype on Butterfly TC2000. Execution times and speedups of a knowledge-based system checking the design of a circuit board.
Systems performing stack copying are better suited to NUMA multiprocessors. Experimental results of the MUSE prototype running on the Butterfly TC2000 [Ali and Karlsson 1992] indicate that the overhead of MUSE running sequentially over SICStus Prolog remains limited to 5%, while near-linear speedups are provided up to 40 processors (see Table 5).

7.4.2 Design of Specialized Architectures
Since NUMA architectures are not suitable for models sharing the computation stacks, one track of research is to design a scalable UMA architecture. In the Data Diffusion Machine [Warren and Haridi 1988], the location of a datum is decoupled from its virtual address, and data migrate automatically to where they are needed: memory is actually organized like a very large cache (COMA). The hardware organization consists of a hierarchy of buses and data controllers linking processors, each having a set-associative memory. The machine is scalable since the number of levels in the hierarchy is not fixed, and it should be possible to build a multiprocessor including a large number of processing elements with a limited number of levels. The KSR-1 multiprocessor, whose architecture is similar to that of the DDM, is now commercially available from Kendall Square Research. No result of a parallel Prolog implementation on the KSR-1 was available at the time this article was written.
The most important research activity in this area was performed by the ICOT, in the Fifth Generation Computer Systems project. This project delivered the PIM multiprocessor, including up to 512 specialized processors [Kumon et al. 1992; Nakashima and Nakajima 1992; Taki 1992]. PIM is composed of up to 64 clusters, each cluster being similar to a (physically) shared-memory multiprocessor including 8 processing elements. The implementation techniques of KL1 on the PIM machine have been tried out on the Multi-PSI/V2 distributed-memory multiprocessor [Nakajima et al. 1989]. The intracluster issues of the PIM machine were originally investigated inside a Multi-PSI/V2 processor, while the intercluster issues of PIM were investigated between Multi-PSI/V2 processing elements. This was the case, for example, with the intracluster garbage collection [Chikayama and Kimura 1987] and intercluster garbage collection [Ichiyoshi et al. 1990] techniques.
Scheduhng
Issues
Several research projects address the scalability issues raised by scheduling and load balancing over massively multiprocessors. In order to avoid bottlenecks that could arise in centralized or distributed systems, multilevel hierarchical load-balancing schemes have been proposed [Briat et al. 1991; Furuichi et al. 1990]. Bottlenecks are expected to arise in centralized systems with a large number of processing elements accessing a central scheduler. Fully distributed sys-
Parallel terns require some shared data and result in bottlenecks to access these shared data together with a large amount of interprocessor communication. Experimental results over 64 PEs Multi-PSI NORMA architectures, reported by Furuichi et al. [1990], indicate that multilevel load balancing increases the speedups when using more than 32 PEs: a speedup of 50, using 64 PEs, was obtained for a puzzle-solving program.
8.
CONCLUSION
Considerable research activity is being dedicated to parallelizing logic programming, mainly because of the “intrinsic” parallelism of logic programming languages and of the computational needs of symbolic problems. Parallel logic programming systems exploit essentially OR-parallelism—simultaneous computation of several resolvents—and ANDparallelism—simultaneous computation of several goals of a clause. AND-parallel execution can be restricted to goals that do not share any variable, in which case it is called independent. Otherwise, it is called dependent AND-parallelism. Two main approaches can be distinguished: the first one exploits parallelism transparently by executing Prolog programs in parallel, while the second one develops language constructs allowing programmers to express explicitly the parallelism of their applications. The second approach is mainly represented by the family of concurrent logic languages, which is intended for supporting the programming of applications not addressed by Prolog, such as reactive systems and operating system programming. A large number of parallel computational models have been proposed to parallelize logic programming. Among them, multisequential computing models can use most of the efficient implementation techniques developed for sequential Prolog implementation. In spite of the inherent parallelism of logic languages, implementing them efficiently in parallel raises difficult problems. Optimizations made possible by the Prolog backtracking
Logic
Programming
Systems
●
331
cannot be applied to OR-parallel systems. AND-parallel systems need to check that goals executed in parallel assign their variables to coherent bindings. Independent AND-parallel goals cannot bind the same variables. Independence can be tested at compile time, at run time, or partly at compile time and partly at run time. Most run-time independence tests are time consuming, and the most efficient results are obtained when goal independence can be computed at compile time. Dependent AND-parallel systems only compute determinate goals in parallel, determinacy being enforced by the language in concurrent logic systems or checked partly at compile time and partly at run time in the Andorra model of computation. Some of these computational models have been implemented efficiently on multiprocessors. The most mature systems exploit one type of parallelism. Systems exploiting OR- and independent AND-parallelism obtain linear speedups for programs containing enough parallelism, while programs containing no exploitable parallelism run almost as efficiently as if executed by an efficient Prolog system. Systems exploiting dependent AND-parallelism are usually several times less efficient when they are used to program the same applications, but they can exploit more parallelism of finer grain in logic programs. Concurrent logic languages, which exploit this type of parallelism, have also proved their potential for developing different types of programs, such as the PIMOS operating system for both the Multi-PSI and PIM machines, composed of 100,000 lines of KL1 [Furukawa 1992]. Systems combining several types of parallelism are not yet as efficient as systems exploiting a single type of parallelism, but they may result in a reduction of the search space of programs by avoiding useless computations. Also, these systems can speed up more logic programs than systems exploiting one type of parallelism. Current and future research in the domain of logic programming is active in
several domains. One track of research seeks to improve existing systems, mainly by developing compile-time analysis techniques for logic programs to detect the determinacy of and independence among goals, and to anticipate the granularity of resolvents (OR-parallelism) or goals (AND-parallelism). Another fruitful direction is the combination of constraint logic programming with parallelism. The last direction of research discussed in this article is the use of highly and massively parallel multiprocessors. The most important activity in this area was performed by the Japanese Fifth Generation Computer Systems project, whose PIM/p multiprocessor, demonstrated in 1992, delivers approximately one hundred million reductions per second of peak performance (one reduction is equivalent to one logical inference). In spite of the excellent performance results achieved by parallel logic systems, the cost effectiveness of parallel logic programming is still questionable. Parallel multiprocessors generally do not use the most recent microprocessors already used in the most powerful uniprocessor workstations. In other words, because of the rapid progress in VLSI technology, a sequential Prolog system running on the most recent type of workstation performs almost as well as a parallel logic programming system running on a modestly parallel multiprocessor based on the previous generation of microprocessors (see the figures in Section 5). This situation might change rapidly, since several parallel workstations using state-of-the-art microprocessors are becoming commercially available. There are not yet enough applications likely to benefit from highly and massively parallel systems, since the traditional use of logic programming is constrained to making prototypes. This situation may also change with the advent of highly and massively parallel multiprocessors that can address problems too large to be solved by the most efficient sequential logic programming systems. As the performance of parallel systems becomes radically higher than that of uniprocessors, and as the processor tech-
nology used in parallel systems approaches the state of the art, parallel logic programming systems will become increasingly attractive. Existing systems can adaptively exploit parallelism when it is present, with little sacrifice of sequential performance when it is not. Most importantly, logic programming provides logic programming systems with natural opportunities to exploit parallelism and programmers with convenient ways to express it.

ACKNOWLEDGMENTS
authors the
tions
idea
and
Jacques survey
would
hke
of this
survey
corrections. Cohen
and
in detail
and
to thank
D. H
and
for
We would Paul
also
Hilfinger
proposing
D
helpful hke
for
Warren suggesto thank
reading
numerous
the
improve-
ments.
REFERENCES

AÏT-KACI, H. 1991. Warren's Abstract Machine: A Tutorial Reconstruction. MIT Press, Cambridge, Mass.

ALI, K. A. M. AND KARLSSON, R. 1992. OR-parallel speedups in a knowledge based system: On Muse and Aurora. In Proceedings of the International Conference on Fifth Generation Computer Systems. ICOT, Tokyo.

ALI, K. A. M. AND KARLSSON, R. 1991. Scheduling OR-parallelism in Muse. In Proceedings of the 8th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 807-821.

ALI, K. A. M. AND KARLSSON, R. 1990. The Muse OR-parallel Prolog model and its performance. In Proceedings of the 1990 North American Conference on Logic Programming. MIT Press, Cambridge, Mass.

BAHGAT, R. 1993. Non-Deterministic Concurrent Logic Programming in Pandora. World Scientific Publishing, London, U.K.

BARON, U. C., CHASSIN DE KERGOMMEAUX, J., HAILPERIN, M., RATCLIFFE, M., ROBERT, P., SYRE, J. C., AND WESTPHAL, H. 1988a. The parallel ECRC Prolog system PEPSys: An overview and evaluation results. In Proceedings of the International Conference on Fifth Generation Computer Systems. ICOT, Tokyo.

BARON, U. C., ING, B., RATCLIFFE, M., AND ROBERT, P. 1988b. A distributed architecture for the PEPSys parallel logic programming system. In Proceedings of ICPP'88. Pennsylvania State Univ., 410-413.

BEAUMONT, A., MUTHU RAMAN, S., SZEREDI, P., AND WARREN, D. H. D. 1991. Flexible scheduling of OR-parallelism in Aurora: The Bristol scheduler. In Parallel Architectures and Languages Europe, PARLE '91. Lecture Notes in Computer Science, vol. 506. Springer-Verlag, New York, 403-420.

BELL, G. 1992. Ultracomputers: A teraflop before its time. Commun. ACM 35, 8, 26-47.

BOSCO, P. G., CECCHI, C., MOISO, C., PORTA, M., AND SOFI, G. 1990. Logic and functional programming on distributed memory architectures. In Proceedings of the 7th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 325-339.

BRAND, P. 1988. Wavefront scheduling. Int. Rep., Gigalips project, SICS.

BRIAT, J., FAVRE, M., GEYER, C., AND CHASSIN DE KERGOMMEAUX, J. 1992. OPERA: OR-parallel Prolog system on Supernode. In Implementations of Distributed Prolog. John Wiley and Sons, New York, 45-63.

BRIAT, J., FAVRE, M., GEYER, C., AND CHASSIN DE KERGOMMEAUX, J. 1991. Scheduling of OR-parallel Prolog on a scalable, reconfigurable, distributed memory multiprocessor. In Parallel Architectures and Languages Europe, PARLE '91. Lecture Notes in Computer Science, vol. 506. Springer-Verlag, New York, 385-402.

BRUYNOOGHE, M. 1991. A practical framework for the abstract interpretation of logic programs. J. Logic Program. 10, 2, 98-124.

BUTLER, R., DISZ, T., LUSK, E., OLSON, R., OVERBEEK, R. A., AND STEVENS, R. 1988. Scheduling OR-parallelism: An Argonne perspective. In Proceedings of the 5th International Conference and Symposium on Logic Programming. MIT Press, Cambridge, Mass., 1590-1605.

BUTLER, R., LUSK, E. L., OLSON, R., AND OVERBEEK, R. A. 1986. ANLWAM: A parallel implementation of the Warren Abstract Machine. Tech. Rep., Argonne National Laboratories.

CALDERWOOD, A. AND SZEREDI, P. 1989. Scheduling OR-parallelism in Aurora. In Proceedings of the 6th International Conference on Logic Programming. MIT Press, Cambridge, Mass.

CARLSSON, M. AND WIDEN, J. 1988. SICStus Prolog User Manual. Res. Rep., SICS.

CHIKAYAMA, T. AND KIMURA, Y. 1987. Multiple reference management in flat GHC. In Proceedings of the 4th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 276-293.

CLARK, K. AND GREGORY, S. 1981. A relational language for parallel programming. In Proceedings of the ACM Conference on Functional Programming Languages and Computer Architecture. ACM Press, New York.

CLARK, K. AND MCCABE, F. 1981. The control facilities of IC-Prolog. In Expert Systems in the Micro Electronic Age, D. Michie, Ed. Edinburgh University Press, Edinburgh, U.K.

CLOCKSIN, W. F. AND ALSHAWI, H. 1988. A method for efficiently executing Horn clause programs using multiple processors. New Gen. Comput. 5, 361-376.

CODOGNET, C., CODOGNET, P., AND CORSINI, M.-M. 1990. Abstract interpretation for concurrent logic languages. In Proceedings of the 1990 North American Conference on Logic Programming. MIT Press, Cambridge, Mass., 215-232.

COLMERAUER, A. 1990. An introduction to Prolog-III. Commun. ACM 33, 7.

COLMERAUER, A. 1986. Theoretical model of Prolog II. In Logic Programming and Its Applications, M. van Caneghem and D. Warren, Eds. Ablex, Norwood, N.J., 3-31.

CONERY, J. S. 1992. The OPAL machine. In Implementations of Distributed Prolog. John Wiley and Sons, New York, 159-185.

CONERY, J. S. 1983. The AND/OR process model for parallel execution of logic programs. Ph.D. thesis, Tech. Rep. 204, Dept. of Computer and Information Science, Univ. of California, Irvine, Calif.

COSTA, V. S., WARREN, D. H. D., AND YANG, R. 1991a. The Andorra-I engine: A parallel implementation of the basic Andorra model. In Proceedings of the 8th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 825-839.

COSTA, V. S., WARREN, D. H. D., AND YANG, R. 1991b. The Andorra-I preprocessor: Supporting full Prolog on the basic Andorra model. In Proceedings of the 8th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 443-456.
COUSOT, P. AND COUSOT, R.
1977. Abstract interpretation: A unified framework for static analysis of programs by approximation of fixpoint. In the 4th ACM Symposium on Principles of Programming Languages. ACM Press, New York.
CRAMMOND, J. 1988. Implementation of committed choice logic languages on shared memory multiprocessors. Ph.D. thesis, Dept. of Computer Science, Herroit-Watt Univ., Edinburgh, U. K. CHASSIN DE KERGOMMEAUX, J. the PEPSys implementation Tech. Rep. CA-44, ECRC.
1989. on
Measures of the MX500.
ParPro-
CHASSIN DE KERGOMMEADX, J. AND ROBERT, P. 1990. An abstract machine to implement efficiently OR-AND parallel Prolog. J. Logic Program. 8, 3.
relational In the
VILLEMONTE DE LA CLERGERIE, E. 1992. Dyalog: Complete evaluation of Horn clauses by dynamic programming. Ph.D. thesis, INRIA.
CLARK, K. ANI) GREGORY, S. 1986. PARLOG: allel programming in logic. ACM Trans. gram, Lang. Syst. 8, L CLARK, K. AND GREGORY, language for parallel
Programming
ACM Conference on Functional Computer Architecture. ACM
be-
BORGWARDT, P. 1984. Parallel prolog using stack segments on shared memory multiprocessors. In the 84 International Symposzum on Logic Programmmg. IEEE, New York, 2-11.
Bosco, P. G.,
Logic
ACM
Computing
Surveys,
Vol.
26,
No.
3, September
1994
334
“
J. Chassin
de Kergommeaux
and
DEBRAY, S. K., LIN, N.-W., AND HERMENEGILDO, M. 1990. Task granularity analysis in logic programs. In Proceedings of the ACM SIGPLAN’90 Conference on Programmmg Language Design and Implementation. ACM, New York. DEGROOT, D. 1984. Restricted And-parallelism. In Proceedings of the International Con&ence on Fifth Generation Computer Systems 1984. ICOT, Tokyo, 471-478 DELGADO-RANNAURO, S. A. 1992a. OR-parallel lo~c computational models. In Implementations of D~strlbuted Prolog. John Wdey and Sons, New York, 3-26. DELMDO.R.ANNAURO,
S.
A.
1992b
MOOLENAR, R., VAN HECKER, H., AND DEMOEN. 1991. A parallel implementation of AKL. the ILPS 91 Workshop on Implernentatton Parallel Logw Programming Systems. ILPS.
B. In of
HARIDI, S. AND JANSON, S. 1990. Kernel Andorra Prolog and its computation model. In Proceedings of the 7th International Conference on Logic Program mmg. MIT Press, Cambridge, Mass., 31-46.
HAUSMAN, B. 1989. Pruning and scheduling speculative work in Or-parallel systems. In Parallel Architectures and Languages Europe, PARLE’89. Lecture Notes in Computer Science, vol. 366. Springer-Verlag, Berlin.
DIH.GADO-RANN.AURO, S. A. 1992c. Stream ANDparallel logic computational models. In Implementations of Dlstrlbufed Prolog. John Wiley and Sons, New York, 239–257. DIJKSTRA, E. W. 1975. Guarded commands, nondeterminacy, and formal derivation of programs. Commun. ACM 18, 8. DINCBAS, M., SIMONIS, H., AND VAN HENTENRYCL P. 1990. Solving large combinatorial problems in logic programming ,1. Logic Program. 8, 1-2. DISZ, T , LUSK, E., AND OVERBEEK, R. 1987. Experiments with OR-parallel loglc programs. In the 4th International Conference on Logzc Programming. ACM Press, New York, 576-599. Progranmzzng Prentice-Hall
pendent And-, independent And-j and Or-parallelism. In Proceedings of the 1991 International Logtc Programming Symposmm. MIT Press, Cambridge, Mass., 152-166.
HAUSMAN, B. 1990. Pruning and speculative work in OR-Parallel Prolog. Ph.D. thesis, SICS Res. Rep D-90-9001, Royal Inst of Technology, Stockholm, Sweden.
Restricted
ANDand AND/OR-parallel logic computational models. In Implementations of Distrz b uted Prolog. John Wiley and Sons, New York, 121-141.
FOSTER, I. 1990, Systems allel LogLc Languages. tional, U. K.
P. Codognet
in ParInterna-
HERMENEC+ILDO, M. V. 1986. An abstract machine for restricted AND-parallel execution of logic programs In I%oceedzngs of the 3rd International Con ference on Logic Programming. Lecture Notes in Computer Science, SpringerVerlag, New York, 25-39. HERMENEGILDO, M. V. AND GREENE, K. J. 1990. &-Prolog and its performance: Exploiting independent And-parallelism. In Proceedings of the 7th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 253–268. HERMENEGILDO,
M.
AND
ROSSI,
F.
1990.
Non-
1. AND TAYLOR, S. 1989 Strand New Concepts in Parallel Programming F’rent~ceHall, Englewood Cliffs, N J
strict independent And-parallelism. In Proceedings of the 7th International Conference on Logw Prwgrammmg. MIT Press, Cambridge, Mass., 237-252.
FURUICHI, M., TAKI, K., AND ICHIYOSHI, A. N. 1990. A multi-level load balancing scheme for Orparallel exhaustive search programs on the Multl-PSI. ACM SIGPLAN Not. 25, 3, 50-59.
ICHIYOSHI, N., ROKUSAWA, K., NAKAJIMA, K., AND INAMURA, Y. 1990. A new external reference management and distributed umfication for KL1. New Gen. Cornput. 7, 159-177.
FURUIMWA, K. 1992. Logic programming as the integrator of the Fifth Generation Computer Systems Project Commzm. ACM 35, 3, 83-92.
JAFFAR, J AND LASSEZ, J.-L. 1987. Constraint logic programming. In the 13th ACM Symposium on Principles of Programmmg Languages, POPL 87. ACM Press, New York.
FOSTER.
GUPT,A, G. AND HERMENEGILDO, M. 1991a. ACE And/Or-parallel copying-based execution of logic programs. In ICLP’91 Workshop on Parallel Executton of Logbc Program.?. SpringerVerlag, New York. GUPTA, G AND H~RMENE~lLDO, M. And/Or-parallel copying-based logic programs. Tech. Rep. Polit6cnica de Madrid, Spare GUPTA,
G. AND JAYARAMAN,
B.
1991b. ACE: execution of Universidad
1990.
Optimizing
And-Or parallel Implementations. In Proceedings of the 1990 North Amerzcan Conference on Logic Programming. MIT Press, Cambridge, Mass 605-623. GUPTA, G, COSTA, V. S., YANG, R., AND HERMENI? GILDO, M. V 1991 IDIOM Integrating de-
ACM
Computing
Surveys,
Vol
26,
No
3, September
1994
JANSON, S. AND HARIDI, S. 1991. Programming paradigms of the Andorra Kernel Language. In Proceedings of the 1991 International Logic MIT Press, CamProgramnung S’ymposutm. bridge, Mass., 167-186 ,JoN~s, N. AND SONDERGAARD, H. 1987. A semantic-based framework for the abstract interpretation of Prolog. In Abstract Interpretat~on of Declarative Languages, S. Abramsky and C. Hankin, Eds. Ellis Horwood, 123-142. KALfi, L. V. 1987. The REDUCE-OR process model for parallel evaluation of logic programs. In Proceedings of the 4th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 616-632.
Parallel KALfi, L. V. duce-OR
AND RAMIfUMAR, B. 1992. process model for parallel
gramming on non-shared memory Implementations of Distr~buted Wiley
and
Sons,
New
York,
The Relogic pro-
187-212.
KLINGER, S. AND SHAPIRO, E. 1988. A decision tree compilation algorithm for FCP (—, :, ?). In pz’oceedmgs of the 5th International Conference and Sympostum on Logic Programming. MIT Press, Cambridge, Mass., 1315–1336. ing.
Elsevier
1979. Science,
Logic New
for
Problem
Solw
York.
KUMON, K., ASATO, A., ARAI, S., SHINOGI, T., AND HATTORI, A. 1992. Architecture and implementation of PIM/p. In Proceedings of the In ternatLonal Conference on Fifth Generation Computer Systems 1992. ICOT, Tokyo, 414-424, LfN, Y. J. AND KUMAR, V. 1988. AND-parallel execution of logic programs on a shared memory multiprocessor: A summary of results. In Proceedings of the 5th International Conference and Syrnposlum on Logic Programming. MIT Press, Cambridge Mass., 1123-1141. LLOYD, J, W. gramming. Lus~,
1987. Foundation.9 of Logic Springer-Verlag, Berlin.
Pro-
E.,
BUTLER, R., DISZ, T., OLSON, R., OVERR., STEVENS, R., WARREN, D. H. D., CALDERWODD, A., SZER~DI, P., HARID1, S., BRAND, P., CARLSON, M., CIEPIELEWSKI, A., AND HAUSMAN, B.> 1988. The Aurora Or-parallel pro~og system. In Proceedings of the International Conference on Fifth Generation Computer Systems 1988, ICOT, Tokyo. BEEK,
Programming
Systems
“
335
matic compile-time parallelization of logic programs for independent And-parallelism. In Proceedings of the 7th International Conference MIT Press, Cambridge, on Logic Programming. Mass., 221-236.
machines. In Prolog. John
KALfi, L. V. AND RAMKUMAR, B. 1990. Joining AND parallel solutions in AND/OR parallel systems. In Proceedings of the 1990 North American Conference on Logic Programming. MIT Press, Cambridge, Mass., 624-641.
KOWALSKZ, R. A.
Logic
MUTHUKUMAR, K. AND HERMENEGILDO, M. 1989a. Complete and efficient methods for supporting side-effects in independent/restricted ANDparallelism. In Proceedings of the 6th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 80-97. MUTHUKUMAR, K. AND HERMENEGILDO, M. 1989b. Determination of variable dependence information through abstract interpretation. In Proceedings of the North American Conference on Logic Programming. MIT Press, Cambridge, Mass., 166-188. NAISH, L. 1988. Parallelizing NU-Prolog. In Proceedz ngs of the 5th International Conference and Symposium on Logic Program mmg. MIT Press, Cambridge, Mass., 1546-1564. NAISH, L. 1984. Mu-Prolog 3. ldb Reference ual. Melbourne University, Melbourne, tralia.
ManAus-
NAKAJIMA, K., INAMURA, Y., ICHIYOSHI, N., ROKUSAWA, K., AND CHIKAYAMA, T. 1989. Distributed implementation of KLl on the MultiPSI/V2. In the 6th International Conference on Logic Program ming. NAKASHIMA, H. AND NAKAJIMA, K. 1992. Architecture and implementation of PIM\m. In Proceedings of the In ternatlonal Conference on F~fth Generation Computer Systems 1992. ICOT, Tokyo, 425-435. PALMER, D. AND NAISH, L. 1991. NUA-Prolog: An extension to the WAM for parallel andorra. In Proceedings of the 8th International Conference on Logic Programming. MIT Press, Cambridge, Mass., 429-442.
MAES, L. 1992. Or-parallel speedups in a compiled PROLOG engine: Results of the integration of the Muse scheduler in the ProLog by BIM engine, Draftj BIM, Kwikcstraat 4, B-3078 Everberg, Belgium.
PERCE~OIS, C., SIGNES, N., AND SELLE, L. 1992. An actor-oriented computer for logic and its
MA~ER, M. J. 1987. Logic semantics for a class of committed-choice programs. In Proceedings of the 4th International Conference on Logic Programming, MIT Press, Cambridge, Mass., 858-876.
PER~IRA, L. M. AND PORTZ.), A, 1979. Intelligent backtracking and sidetracking in Horn clause programs. Tech. Rep., CINUL 2/79, Universitade Nuova de Lisboa, Portugal.
MASUZAWA, H. 1986, Kabu Wake parallel inference mechanism and its evaluation. In 1986 FJCC. IEEE, New York, 955–962. MELLISH, C. S. 1986. Abstract interpretation of Prolog Programs. In Proceedings of the 3rd International Conference on Logzc Programming, Springer-Verlag, New York. 463–474.
application. Prolog. 213-235.
In
Implementations of Distributed Wiley and Sons, New York,
John
RAMANATHAN, commercial 21, 3.
G. ANZI OREN, J. 1993. Survey of parallel machines. ACM Szggarch
RAMZiUMAR, B. 1991, Machine independent “AND” and “OR parallel execution of logic programs. Ph.D. thesis, Univ. of Illinois, Urbana-Champaign, Ill.
MUDAMBI, S. 1991. Performances of Aurora on NUMA machmes. In Proceedings of the 8th Intei-natzonal Conference on Logic Programming. MIT Press, Cambridge, Mass., 793–806.
RAMKUMAR, B. AND IQm4, L. V. 1989. Compiled execution of the Reduce-OR process model on multiprocessors. In Proceedings of the North American Conference on Logzc Programming. MIT Press, Cambridge, Mass., 313–331.
MUTHUKUMAR, K. AND HERMENEGILDO, M. V. The DCG, UDG, and MEL methods for
SARASWAT, V. programming
1990. auto-
ACM
Computing
A.
1991. Concurrent languages. In Doctoral
Surveys,
Vol
26,
No,
constraint Disserta-
3, September
1994
336
J. Chassin
●
tzon
MIT
Award Press,
de Kergornmeaux
and P. Codognet VAN HENTENRYCK, P., SARASWAT, V., AND DEVILLE, Y. 1993. Design, implementation and evaluation of the concurrent constraint language cc(fd). Tech. Rep., Brown Univ., Providence, R. I.
and Logic Programming Serzes. Cambridge. Mass. To be published.
SARASWAT, V. A. 1987. The concurrent logic programming language CP Definition and operational semantics. In the 13th ACM Symposium on Principles of Programming Languages, POPL
87. ACM
Press,
New
VAN ROY, P. AND DES~AIN, A. M. 1992. High-performance logic programming with the Aquarius Prolog compiler. (“ompater 25, 1
York.
SARASWAT, V. A. AND RINARD, M. 1990. Concurrent constraint programmmg. In the 16th ACM Symposlurn on Principles of Programmmg Languages, POPL 90. ACM Press, New York. SATO, M. AND GOTO, A. 1988. Evaluation of the KLl parallel system on shared memory multiprocessor. In Proceedings of the IFIP Working Conference on Parallel Processing. North-Holland, Amsterdam. SHAPIRO, E. 1989. The family of concurrent programming languages. ACM C’omput. 21, 3.
logic Sure.
SHAPIRO, E. 1983. A subset of Concurrent Prolog and its interpreter. Tech. Rep , Weizmann Institute, Rehovot, Israel. SHEN, K. AND HERM~NIS~ILDO, M. V. 1991. A simulation study of Or- and independent And- parallelism. In F’roceedzngs of the 1991 International Logic Prograrnrnmg Symposium. MIT Press, Cambridge, Mass., 135-151. SZEREI)I, P. 1989. Performance analysis of the Aurora OR-parallel Prolog system. In Proceedings of the North Amerzcan Conference on Logic Programrnmg. MIT Press, Cambridge, Mass. TAIL
K. 1992. Parallel Inference Machine PIM. In Proceedings of the International Conference on Fzfth Generat~on Computer Systems 1992. ICOT, Tokyo, 50-72.
TAYLO~ S. 1989. Parallel Techniques. Prentice-Hall
Logtc Programming International, U. K.
199x A TAYLOR, S., SAFRA, S., AND SHAPIRO. E parallel implementation of Flat Concurrent Prolog. J. Parall. Program. 15, 3, 245-275. TICK
E. 1989. Comparmg gramming architectures.
two parallel loglc proIEEE Softw. (July)
TICK,
E. 1988 Compile-time granularity analysis for parallel logic programming languages In Proceedings of the Iuternatzonal Conference orL Fl fth Generation Computer Systems 1988. ICOT, Tokyo, 994-1000.
UEDA, K. AND CHIKAYAMA, T. 1990. Kernel language for the Parallel chine Comput. J. 33, 6. VAN HENTENRYCK, tion in Logic bridge, Mass. VAN
Constraint
MIT
SatisfacPress, Cam-
HENTENRYCK, P. 1989b. Parallel constraint satisfaction in logic programming. Preliminary results of Chip within PEPSYS In Prwceedmgs of the 6th Internatz onal Conference on Logzc Programmmg. MIT Press, Cambridge, Mass., 165-180.
Recewed
ACM
P. 1989a. Programming.
Design of the Inference Ma-
May 1992;
Computmg
final
Surveys,
revmon
accepted
Vol
No.
26,
November
3. September
1994
VERON, A., SCHUERMAN, K., REEVE, M., AND LI, L. 1993. Why and how in the ElipSys OR-parallel CLP system. In the 5th Znternationa[ PARLE Conference. Lecture Notes in Computer vol 694. Springer-Verlag, Berlin, Science, 291–304. WARREN, D. H. D. 1990 The extended Andorra model with Implicit control. In the ICLP 90 Workshop on Parallel Logzc Programming. Presentation slides WARREN, D. H. D. 1988. The Int. Rep., Gigalips Group.
Andorra
principle.
WARREN, D. H. D. 1987. The SRI model for Orparallel execution of Prolog. Abstract design and Implementation issues. In proceedings of the 1987 S.vmposum on Logic Programmmg. IEEE Computer Society Press, Los Alamitos, Cahf., 46-53. WARREN, D. H. D. 1983. An abstract Prolog instruction set. Tech. Rep. tn309, SRI, Menlo Park, Calif. WARREN, D. H. D. AND HARIDI, S. 1988. Data Diffusion Machine—A scalable shared virtual memory multiprocessor. In Proceedings of the International Conference on Fifth Generation Computer Systems 1988. ICOT, Tokyo, 943952. WESTPHAL, H , ROBERT, P., CHASSIN, J., AND SYRE, J.-C. 1987 The PEPSys Model: Combimng backtracking, ANDand OR-parallelism. In Proceedings of the 1987 Symposium on Logzc Programming IEEE Computer Society Press, Los Alamitos, Cahf., 436-448 WINSBOROUGH, W AND WAERN, A. 1988. Transparent And-parallelism in the presence of shared free variables. In Proceedings of the 5th Internatmnal Conference and Symposium on Logzc Program mlng. MIT Press, Cambridge, Mass , 749-764. YANG, R. ANU Amo, H. 1986. P-Prolog: A parallel logic language based on exclusive relatlon. In Proceedings of the 3rd Znternatzonal Conference OTLLogic Program mmg Springer-Verlag, New York, 255–269 YANG, R., BEAUMONT, T., DUTRA, I., COSTA, V. S., AND WARREN, D. H. D. 1993. Performance of the compder-based Andorra-I system. In Proceedz ngs of the 10th International Conference on Logzc Programming. D. S. Warren, Ed., MIT Press, Cambridge, Mass., 150-166.
1993