From: AAAI-84 Proceedings. Copyright ©1984, AAAI (www.aaai.org). All rights reserved.
FIVE PARALLEL ALGORITHMS FOR PRODUCTION SYSTEM EXECUTION ON THE DADO MACHINE*
Salvatore J. Stolfo
Department of Computer Science
Columbia University
New York City, N.Y. 10027
Abstract
In this paper we specify five abstract algorithms for the parallel execution of production system programs on the DADO machine. Each algorithm is designed to capture the inherent parallelism in a variety of different production system programs. Ongoing research aims to substantiate our conclusions by empirically evaluating the performance of each algorithm on the DADO2 prototype, presently under construction at Columbia University.
1 Introduction
In this paper we outline five abstract algorithms specifying parallel execution of production system (PS) programs on the DADO machine. Each algorithm offers a number of advantages for particular types of PS programs. We expect to implement these algorithms on the DADO2 prototype and critically evaluate the performance of each on a variety of application programs. Software development is presently underway using the DADO1 prototype that has been operational at Columbia University since April, 1983.
We begin with a brief description of PS’s and identify various possible characteristics of PS programs which may not be immediately apparent from a general description of the basic formalism. These characteristics lead to different algorithms which will be discussed in the remaining sections of this paper.
2 Production Systems
In general, a Production System (PS) [Newell 1973, Davis and King 1975, Rychener 1976, Forgy 1982] is defined by a set of rules, or productions, which form the Production Memory (PM), together with a database of assertions, called the Working Memory (WM). Each production consists of a conjunction of pattern elements, called the left-hand side (LHS) of the rule, along with a set of actions called the right-hand side (RHS). The RHS specifies information that is to be added to (asserted) or removed from WM when the LHS successfully matches against the contents of WM.
Pattern elements in the LHS may have a variety of forms which are dependent on the form and content of WM elements. In the simplest case, patterns are lists composed of constants and variables (prefixed with an equals sign), while WM elements are simple lists of constant symbols (corresponding to tuples of the relational algebra). An example production, borrowed from the blocks world, is illustrated in figure 1.
(Goal Clear-top-of Block)
(Isa =x Block)
(On-top-of =y =x)
(Isa =y Block)
-->
delete(On-top-of =y =x)
assert(On-top-of =y Table)
If the goal is to clear the top of a block, and there is a block (=x) covered by something (=y) which is also a block, then remove the fact that =y is on =x and assert that =y is on the table.
Figure 1: An Example Production.
In operation, the production system repeatedly executes the following cycle of operations:
1. Match: For each rule, determine whether the LHS matches the current environment of WM: each pattern element is matched by some WM element, with variables consistently bound throughout the LHS. All matching instances of the rules are collected in the conflict set of rules.
2. Select: Choose exactly one of the matching rules according to some predefined criterion.
3. Act: Add to or delete from WM all assertions specified in the RHS of the selected rule or perform some operation.
During the selection phase of production system execution, a typical interpreter provides conflict resolution strategies based on the recency of matched data in WM, as well as syntactic discrimination. Other resolution schemes are possible, but for the present paper such issues will not significantly change our analysis, and hence will not be discussed.
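The match, select and act phases described above can be sketched in a few lines of sequential Python. This is only an illustration under stated assumptions: the tuple encoding of WM elements, the rule dictionary layout, the names `match_lhs`, `run` and `CLEAR_TOP`, and the use of list position as a recency stamp are all hypothetical choices made here, not part of any DADO software.

```python
# Illustrative sketch of the recognize/act cycle (all names are assumptions).

def is_var(term):
    # Variables are prefixed with an equals sign, e.g. "=x".
    return term.startswith("=")

def match_pattern(pattern, fact, bindings):
    """Match one pattern element against one WM element.
    Returns the extended bindings, or None on failure."""
    if len(pattern) != len(fact):
        return None
    env = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if env.setdefault(p, f) != f:   # variable already bound differently
                return None
        elif p != f:                        # constant mismatch
            return None
    return env

def match_lhs(lhs, wm):
    """Return every consistent set of bindings for a conjunction of patterns."""
    envs = [{}]
    for pattern in lhs:
        envs = [e2 for e in envs for fact in wm
                if (e2 := match_pattern(pattern, fact, e)) is not None]
    return envs

def substitute(template, env):
    return tuple(env.get(t, t) for t in template)

def run(rules, wm, max_cycles=10):
    wm = list(wm)       # list position doubles as a recency stamp (later = newer)
    fired = []
    for _ in range(max_cycles):
        # Match: collect all rule instances in the conflict set.
        conflict_set = [(rule, env) for rule in rules
                        for env in match_lhs(rule["lhs"], wm)]
        if not conflict_set:
            break
        # Select: prefer the instance whose matched data is most recent.
        def recency(instance):
            rule, env = instance
            return max(wm.index(substitute(p, env)) for p in rule["lhs"])
        rule, env = max(conflict_set, key=recency)
        # Act: apply the RHS deletions and assertions.
        for f in rule["delete"]:
            wm.remove(substitute(f, env))
        for f in rule["assert"]:
            wm.append(substitute(f, env))
        fired.append((rule["name"], env))
    return wm, fired

# The blocks-world production of figure 1, in the assumed encoding.
CLEAR_TOP = {
    "name": "clear-top",
    "lhs": [("Goal", "Clear-top-of", "Block"), ("Isa", "=x", "Block"),
            ("On-top-of", "=y", "=x"), ("Isa", "=y", "Block")],
    "delete": [("On-top-of", "=y", "=x")],
    "assert": [("On-top-of", "=y", "Table")],
}
WM = [("Goal", "Clear-top-of", "Block"), ("Isa", "A", "Block"),
      ("Isa", "B", "Block"), ("On-top-of", "B", "A")]
```

Calling `run([CLEAR_TOP], WM)` fires the rule once with `=x` bound to `A` and `=y` bound to `B`, after which no instance remains and the loop stops.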
*This research has been supported by the Defense Advanced Research Projects Agency through contract N00039-84-C-0165, as well as grants from Intel, Digital Equipment, Hewlett-Packard, Valid Logic Systems, AT&T Bell Laboratories and IBM Corporations and the New York State Science and Technology Foundation. We gratefully acknowledge their support.
We shall only consider the parallel execution of PS programs with the goal of accelerating the rule firing rate of the recognize/act cycle as well as the number of WM transactions performed. In a later section of this paper, we shall consider other possible parallel activities as, for example, the concurrent execution of multiple PS programs.
state
new
of
temporal this
to be always
3. Many
by
may
similar
1. Assign
some
of
WM. algorithm
illustrated
in
of rules
to a set of (distinct)
on
WM.
subset
of WM
elements
processors
of a rule
operation
is not
appropriate.
(possibly
distinct
from
those
in
each
but
step
to
may a
1). until
no rule
6. Global
is active:
LHS a.
Broadcast storing
an
to
in
conflict b.
the
system.
Report
Broadcast
the
step
3.b
to
their
local
2:
of
(non-temporally
action
specifications
have
large
is
global
the
unlikely,
of
not
scope
of
the state.
matches.
which
may
relatively need
in
effects
i.e., saving
pattern
The
potentially Thus,
small.
access
small
to
all
a
of WM
subset
within
its instantiated all
the
of
data
processors,
which end
Production
in
update
System
prior
to
the
cycle.
to only 9 Small
Algorithm.
PM.
This very PS cycle
10. Large
simple view of the parallel implementation forms the basis of our subsequent analysis.
3 Characteristics
of Production
System
thousands 11. Large
Programs
thousands
In this section we enumerate various characteristics of PS programs in general terms. The reader will note that these characteristics are less indicative of a specific PS formalism, but rather are characteristics of various problems whose solutions are encoded in rule form. It should be noted, though, that the “inherent parallelism” of the problems may not be represented by the particular PS formalism used for their solution.
1. Temporal made
on
between need
Redundancy. each
each not
[Forgy
cycle,
be
19821
relatively
Rules.
few
WM
rules
on need
changes
by
saving
matching
the
incorporating to
WM
The
repeated.
Affected changes
previous
is probably
PS interpreter
Few Thus,
cycle.
this Few each
are
algorithm
be matched
are
This
against case
of characteristic On
each
conflict
initiating
the
number
of
rules match
rules
may 5.
cycle
of
may
be
phase
of
is restricted
a few hundred. WM. PM.
Similarly, A
of rules WM.
WM
may
consist
of only
elements. PS
may
Similarly,
of data
consist
of
several
in PM. WM
may
consist
of
elements.
Algorithms
The reader is assumed to be knowledgeable about the Rete match algorithm (see [Forgy 1979] and [Forgy 1982]). We will thus freely discuss the details of the Rete match when needed without prior explication. We begin with a brief description of the DADO architecture. (The reader is encouraged to see [Stolfo 1983] and [Stolfo and Miranker 1984] for complete details of the system.)
of a
strategy. cycle,
elements
which
state
example
rules
WM
of
than
tests
In this section we outline five different algorithms suitable for direct execution on the DADO machine. They will be independently discussed, leading to various conclusions about which characteristics they are most appropriate for capturing. Ongoing research aims to verify our conclusions by empirically evaluating their performance for different classes of PS programs.
operations
Rete
best
4 Five
The
rather
value).
firings.
executed next
of
converse
rule
WM,
in the requiring
example,
threshold
number
8 Small
Repeat;
number
elements
conditions of
(for
a
operation,
test
portions
as the
7 Multiple
entire
reported
the
be viewed
RHS.
to WM
accordingly.
Abstract
instance
the
large
constant
Pattern
may
elements
compare some
rated
to
of WM.
rule
individual
local
compute
rule
changes
WM
a
a
a few hundred
by
rules.
WM
relatively
tests
of
access
phase,
instances.
processor,
rated
processors
match
maximally
each
maximally
all
the
formation
each
within
Figure
to
begin
set of matching
Considering
c.
instruction
rules
resulting
2. Few
where
in many
elements.
3. Repeat
of the
cycle.
to
elements
rule
rule
are
each
in situations
restricting
scope
rules
on
case,
seems
of WM
single
to a set of
WM
appear
may
Thus,
match
match
some
In this
number
processors. 2. Assign
RHS
that
guarantee
Many
to
example,
changes
the
not
case.
elements
5. Restricted
subset
for
arise,
however,
does
Rules. changes
pattern
redundant).
update abstract
the
the
4. Massive
algebraic operation. processed by a parallel by the this approach figure 2.
alone
Affected
affected This
Note,
WM.
redundancy
affected and
against
thus the
4.1 The DADO Machine
DADO is a fine-grain, parallel machine where processing and memory are extensively intermingled. A full-scale production version of the DADO machine would comprise a very large (on the order of a hundred thousand) set of processing elements (PE’s), each containing its own processor, a small amount (16K bytes, in the current design of the prototype version) of local random access memory (RAM), and a specialized I/O switch. The PE’s are interconnected to form a complete binary tree.
Within the DADO machine, each PE is capable of executing in either of two modes under the control of run-time software. In the first, which we will call SIMD mode (for single instruction stream, multiple data stream), the PE executes instructions broadcast by some ancestor PE within the tree. (SIMD typically refers to a single stream of “machine-level” instructions. In DADO, on the other hand, SIMD is generalized to mean a single stream of remote procedure invocation instructions. Thus, DADO makes more effective use of its communication bus by broadcasting more “meaningful” instructions.) In the second, which will be referred to as MIMD mode (for multiple instruction stream, multiple data stream), each PE executes instructions stored in its own local RAM, independently of the other PE’s. A single conventional coprocessor, adjacent to the root of the DADO tree, controls the operation of the entire ensemble of PE’s.
When a DADO PE enters MIMD mode, its logical state is changed in such a way as to effectively “disconnect” it and its descendants from all higher-level PE’s in the tree. In particular, a PE in MIMD mode does not receive any instructions that might be placed on the tree-structured communication bus by one of its ancestors. Such a PE may, however, broadcast instructions to be executed by its own descendants, providing all of these descendants have themselves been switched to SIMD mode. The DADO machine can thus be configured in such a way that an arbitrary internal node in the tree acts as the root of a tree-structured SIMD device in which all PE’s execute a single instruction (on different data) at a given point in time. This flexible architectural design supports multiple-SIMD execution (MSIMD). Thus, the machine may be logically divided into distinct partitions, each executing a distinct task, and is the primary source of DADO’s speed in executing a large number of primitive pattern matching operations concurrently.
Our comments will be directed towards the DADO2 prototype consisting of 1023 PE’s constructed from commercially available chips. Each PE contains an 8 bit Intel 8751 processor, 16K bytes of local RAM, 4K bytes of local ROM and a semi-custom I/O switch. The DADO2 I/O switch, which is being implemented in semi-custom gate array technology, has been designed to support rapid communication. In addition, a specialized global combinational circuit incorporated within the I/O switch will allow for the very rapid selection of a single distinguished PE from a set of candidate PE’s in the tree, a process we call max-resolving. (The max-resolve instruction computes the maximum of a specified register in all PE’s in one instruction cycle, which can then be used to select a distinct PE from the entire set of PE’s taking part in the operation.) Currently, the 15 processing element version of DADO performs these operations in firmware embodied in its off-the-shelf components.
4.2 Algorithm 1: Full Distribution of PM
In this case, a very small number of distinct production rules are distributed to each of the 1023 DADO2 PE’s, as well as all WM elements relevant to the rules in question, i.e., only those data elements which match some pattern in the LHS of the rules. Algorithm 1 alternates the entire DADO tree between MIMD and SIMD modes of operation. The match phase is implemented as an MIMD process, whereas selection and act execute as SIMD operations. In simplest terms, each PE executes the match phase for its own small PS. One such PS is allowed to “fire” a rule, however, which is communicated to all other PE’s. The algorithm is illustrated in figure 3.
1. Initialize: Distribute a simple rule matcher to each PE. Distribute a few distinct rules to each PE. Set CHANGES to initial WM elements.
2. Repeat the following:
3. Act: For each WM-change in CHANGES do:
   a. Broadcast WM-change (add or delete a specific WM element) to all PE’s.
   b. Broadcast a command to locally match. [Each PE operates independently in MIMD mode and modifies its local WM. If this is a deletion, it checks its local conflict set and removes rule instances as appropriate. If this is an addition, it matches its set of rules and modifies its local conflict set accordingly].
   c. end do;
4. Find local maxima: Broadcast an instruction to each PE to rate its local matching instances according to some predefined criteria (conflict resolution strategy, see [McDermott and Forgy, 1978]).
5. Select: Using the high-speed max-RESOLVE circuit of DADO2, identify a single rule for execution from among all PE’s with active rules.
6. Instantiate: Report the instantiated RHS actions. Set CHANGES to the reported WM-changes.
7. end Repeat;
Figure 3: Full Distribution of Production Memory.
4.2.1 Discussion of Algorithm 1
We have left the details of the local match routine unspecified at step 3.b. Thus, a simple precompiled Rete match network and interpreter may be distributed to each processor. However, it is not clear whether a simple naive matching algorithm may be more appropriate since only a very small number of rules is present in each PE. Memory considerations may decide this issue: the overhead associated with linking and manipulating intermediate partial matches in a Rete network may be more expensive than direct pattern matching against the local WM on each cycle.
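To make the naive option concrete, the following Python sketch simulates one way a PE could carry out the local match of step 3.b, followed by a software stand-in for the selection among PE's. It is an illustration only: the class `PE`, the function `select`, and the rating (number of matched facts) are assumptions introduced here, and the sketch recomputes each local conflict set from scratch on every change rather than updating it incrementally.

```python
# Sequential simulation of Algorithm 1's local match and selection.
# All names and the rating function are illustrative assumptions.

def is_var(t):
    return t.startswith("=")

def extend(pattern, fact, env):
    """Bindings after matching one pattern element to one fact, or None."""
    if len(pattern) != len(fact):
        return None
    env = dict(env)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if env.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return env

class PE:
    """One processing element: a few rules plus only the WM relevant to them."""
    def __init__(self, rules):
        self.rules = rules          # {rule name: list of pattern tuples}
        self.wm = []                # local WM, oldest first
        self.conflict_set = []      # list of (rule name, tuple of matched facts)

    def relevant(self, fact):
        # A fact is relevant if it matches some pattern of some local rule.
        return any(extend(p, fact, {}) is not None
                   for lhs in self.rules.values() for p in lhs)

    def apply_change(self, op, fact):
        """Step 3.b: update the local WM, then rematch the local rules naively."""
        if op == "add" and self.relevant(fact) and fact not in self.wm:
            self.wm.append(fact)
        elif op == "delete" and fact in self.wm:
            self.wm.remove(fact)
        self.conflict_set = [(name, facts)
                             for name, lhs in self.rules.items()
                             for facts in self._instances(lhs)]

    def _instances(self, lhs):
        partial = [({}, ())]
        for pattern in lhs:
            partial = [(env2, facts + (fact,))
                       for env, facts in partial for fact in self.wm
                       if (env2 := extend(pattern, fact, env)) is not None]
        return [facts for _, facts in partial]

    def local_maximum(self):
        """Step 4: rate local instances (here: number of matched facts)."""
        return max(((len(facts), name) for name, facts in self.conflict_set),
                   default=None)

def select(pes):
    """Step 5: software stand-in for max-RESOLVE over PE's with active rules."""
    candidates = [(m, i) for i, pe in enumerate(pes)
                  if (m := pe.local_maximum()) is not None]
    return max(candidates, default=None)
```

In this sketch a broadcast is simply a loop calling `apply_change` on every `PE`; `select` then returns the winning rating together with the index of the PE that holds it, or `None` when no rule is active.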
Performance of this algorithm varies with the complexity of the local match. In the best case, the time to match the rule set is bounded by the time to match only a few rules. The worst case is dependent on the maximum number of WM elements accessed during the match of the rules. If a simple naive match is used at each PE, this may require a considerable amount of computation even though the size of the local WM’s is limited. Simple hashing of WM may dramatically improve a local naive matching operation, however.
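As an illustration of the hashing remark, the sketch below indexes a local WM by the leading constant symbol of each element, so that a pattern beginning with a constant inspects a single bucket instead of the whole local WM. The class name `HashedWM` and the choice of hash key are assumptions made for this example; the paper does not specify a hashing scheme.

```python
from collections import defaultdict

class HashedWM:
    """Local WM indexed by leading constant symbol (an assumed key choice)."""
    def __init__(self):
        self.buckets = defaultdict(list)

    def add(self, fact):
        self.buckets[fact[0]].append(fact)

    def delete(self, fact):
        # .get avoids creating empty buckets for unknown symbols.
        if fact in self.buckets.get(fact[0], []):
            self.buckets[fact[0]].remove(fact)

    def candidates(self, pattern):
        """Facts that could match the pattern.
        One bucket if the pattern starts with a constant, otherwise all facts."""
        if pattern[0].startswith("="):
            return [f for bucket in self.buckets.values() for f in bucket]
        return list(self.buckets.get(pattern[0], []))
```

A naive matcher would then iterate over `candidates(pattern)` rather than over every WM element when testing each pattern element.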
suited 1.
We conclude that this algorithm is probably to implementing PS programs characterized by: Temporal would execution
3.
Many
to
though,
5.
WM
would
it
separately.
Note,
also of
WM.
three
or
for
4.
be
PM
of
common quite
depending pattern
elements
large.
Even
is
resident
element number replicated individual
of
on if an in
additional in
other
elements
the
average
between average
of
efficiently
implemented.
3.
above each
PE
saving
state
Many
rules
cycle.
each
PE
local PE’s),
may
rules,
WM
WM
of
may
unique
(while
elements
minimum
be stored
in WM.
of
6.
Global
8.
are 1000
DAD0
each
cycle,
The allow
tests
WM
such
ability
to
to
WM
for
may
also
be
an
environment,
has
few advantages.
by
WM-changes
changes
to
many
on
WM
rules
concurrency many
are
matchings
each
may
may
achieved
rule
is, unfortunately, one
of the
DAD02,
for
4 of
store
the
AA,
be
potentially at
the
PM-
to be achieved
30
Algorithm
with
a few hundred may
be quite configuration each
PE
thousands
has of
WM
for
WM
elements
point
each
we for
32
rules
rule
For
above
would
consisting
allows
may
a 32-way PE.
be accessed
In
system, systems
example, allow
of 32 PE’s. storage
elements
at each
appropriate.
however.
noted
would Rule
more
the
a PM-level PE’s
a thousand
are
large,
Since
storage,
is underutilized.
performance.
PM-level may
PM
envisage
considerable WM
this
Furthermore,
The original DADO algorithm detailed in [Stolfo 1983] makes direct use of the machine’s ability to execute in both MIMD and SIMD modes of operation at the same point in time. The machine is logically divided into three conceptually distinct components: a PM-level, an upper tree and a number of WM-subtrees. The PM-level consists of MIMD-mode PE’s executing the match phase at one appropriately chosen level of the tree. A number of distinct rules are stored in each PM-level PE. The WM-subtrees rooted by the PM-level PE’s consist of a number of SIMD mode PE’s collectively operating as a hardware content-addressable memory. WM elements relevant to the
for
WM-
device.
in size.
for rule
Thus,
rules
by the
parallel
restricted
machine
decreasing
WM-subtrees,
handled mode
is used
example,
WM
DAD02
tree
machine.
roughly
potentially 11
rather
level of the
capacity
efficiently
as an SIMD
PM
level
indeed a later
also
operating
DAD0 2: Original
cycles
the
access
to
For
affected
only full
are to
parallel
massive
affected.
subtrees
In
4.3 Algorithm
PS
efficiently.
be WM
The most serious drawback of this algorithm is the case where a local WM is too large to be conveniently stored in a PE. Clearly, characteristic 5 is appropriate for this algorithm only in the presence of characteristic 9, small WM. Multiple rule firings (characteristic 7) are possible. A discussion of this case is deferred to a later section.
on
level would
a significant
a
for
content-addressable
a
changes
between
Since
permitted
3000-4000
large
are
in
only
tests
number
of one
not
but
be
Similarly,
allows
matching,
rules. 11.
designed
Indeed,
elements
a
at
consisting
redundant. WM
memory
each
the
stored
2
specifically
was as:
Non-temporally distribute
appropriate. rules
of Algorithm
against
global
Given
Discussion
This algorithm programs characterized
cycle.
Clearly, match
4.3.1
suitable, would
on each
Hence,
four a
be
PE’s
matches.
potentially
possible.
possible
Thus, it would
results
be particularly is
characteristics,
pattern to
local
not
PM
make
of
match
WM.
of PM,
may
number
new
small
Large
2
small
required
relatively
rules
WM
sequential
cycle.
distribution
characteristic
scope
is
each
It is probably best to view WM as a distributed relation. Each WM-subtree PE thus stores relational tuples. The PM-level PE’s match the LHS’s of rules in a manner similar to processing relational queries. In terms of the Rete match, intracondition tests of pattern elements in the LHS of a rule are executed as relational selection, while intercondition tests correspond to equi-join operations. Each PM-level PE thus stores a set of relational tests compiled from the LHS of a rule set assigned to it. Concurrency is achieved between PM-level PE’s as well as in accessing PE’s of the WM-subtrees. The algorithm is illustrated in figure 4.
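The relational reading of the match can be sketched as follows: intracondition tests become a selection over WM tuples, and intercondition tests become an equi-join on shared variables. The function names `relational_select`, `equi_join` and `match_lhs` are hypothetical, and the sketch runs sequentially on one processor; it says nothing about how the work is divided among the PE's of a WM-subtree.

```python
def relational_select(relation, pattern):
    """Intracondition tests: keep tuples that agree with the pattern's constants
    (and with repeated variables), returning one binding row per tuple."""
    rows = []
    for tup in relation:
        if len(tup) != len(pattern):
            continue
        env, ok = {}, True
        for p, v in zip(pattern, tup):
            if p.startswith("="):
                if env.setdefault(p, v) != v:
                    ok = False
                    break
            elif p != v:
                ok = False
                break
        if ok:
            rows.append(env)
    return rows

def equi_join(left, right):
    """Intercondition tests: merge rows that agree on every shared variable."""
    return [{**a, **b} for a in left for b in right
            if all(a[k] == b[k] for k in a.keys() & b.keys())]

def match_lhs(wm, lhs):
    """Evaluate a conjunction of pattern elements as select-then-join."""
    rows = [{}]
    for pattern in lhs:
        rows = equi_join(rows, relational_select(wm, pattern))
    return rows
```

Applied to the production of figure 1 and a WM in which block B sits on block A, `match_lhs` yields the single row binding `=x` to `A` and `=y` to `B`.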
best
to
of
its local
on
similar
computing
Restricted rule
9.
partition
relatively
actively
to modify
initial
changes
amount
affected
the
that
a
PE
are
on
massive
considerable
at each
best
but
a
rules
depending be
since
redundancy, require
rules stored at the PM-level root PE are fully distributed throughout the WM-subtree. The upper tree consists of SIMD mode PE’s lying above the PM-level, which implement synchronization and selection operations.
the for
Since
32
each
capacity,
many
easily
stored.
be
parallel total,
in parallel
access nearly at
to 1000
a given
in time.
While attempting to implement temporally redundant systems, Algorithm 2 may recompute much of its match results calculated on previous cycles. This indeed may not be the case if we modify Algorithm 2 to incorporate many of the capabilities of the Rete match.
1. Initialize:
Distribute
partitioned
Set CHANGES 2. Repeat 3. Act:
the For
a.
match
of rules
to the
each
initial
a PE.
elements.
in CHANGES
WM-change
PE’s
and
The
match
level
PE:
phase
PM-level
PE
by
SO,
its
element
addition,
PE
is
at
and
any If
free
identified,
WM-
and
element
occur
of PE
the
Any
are
subsequent
that to the
matching
pattern
(relational
for
bindings
sequentially for
PE
rules
below
variable
reported
PM-level
the
is broadcast
WM-subtree
the
of
elements
equi-join).
A local
conflict
stored
the
an the
deleted. a
a PM-level
matching.
rating
If
on
selection) are
set
of rules
along
with
in a distributed
Algorithm
is formed a
priority
manner
within
WM-subtree.
PM-level
of
the
match
PE’s
synchronize
The
max-RESOLVE
identify
with
the
upper
circuit
maximally
the
operation,
instance 7. Set
tree.
is used
(perhaps
3: Miranker’s
TREAT
Algorithm
TREAT views the pattern elements in the LHS of rules as relational algebra terms, as in Algorithm 2. Thus, the evaluation of such relational algebra tests is also executed within the WM-subtrees. State is saved in a WM-subtree in the form of distributed Rete alpha memories corresponding to partial selections of tuples matching various pattern elements. Rule instances in the conflict set computed on previous cycles are also stored in a distributed manner within the WM-subtrees. These two additions substantially improve the performance of Algorithm 2. (We note that Anoop Gupta of Carnegie-Mellon University independently analyzed a similar algorithm in [Gupta 1983]. Compared to Algorithm 2, TREAT should perform substantially better for temporally redundant systems. We note that Gupta’s analysis of Algorithm 2, however, depends on certain assumptions that derive misleading results.)
Another aspect of TREAT is the clever manner in which relevancy is computed. Pattern elements are first distributed to the WM-subtrees. When a new WM element is added to the system, a simple match at each WM-subtree PE determines the set of rules at the PM-level which are affected by the change. Those rules identified are subsequently matched by the PM-level PE, restricting the scope of the match to a smaller set of rules than would otherwise be possible with Algorithm 2.
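The two ideas in this description, saving state as alpha memories and using the stored pattern elements to find the affected rules, can be illustrated with a small sequential sketch. The class `AlphaStore` and its methods are assumptions for illustration; they are not Miranker's data structures, and the sketch ignores how memories are spread over a WM-subtree.

```python
class AlphaStore:
    """One alpha memory per pattern element, updated incrementally (illustrative)."""
    def __init__(self, rules):
        self.rules = rules      # {rule name: [pattern tuple, ...]}
        self.alpha = {(name, i): []
                      for name, lhs in rules.items() for i in range(len(lhs))}

    @staticmethod
    def passes(pattern, fact):
        # Constant tests only (a relational selection); consistency of
        # variable bindings across patterns is left to the later join.
        return len(pattern) == len(fact) and all(
            p.startswith("=") or p == f for p, f in zip(pattern, fact))

    def change(self, op, fact):
        """Apply one WM change; return the names of the rules it affects."""
        affected = set()
        for (name, i), memory in self.alpha.items():
            if self.passes(self.rules[name][i], fact):
                if op == "add":
                    memory.append(fact)
                    affected.add(name)
                elif fact in memory:
                    memory.remove(fact)
                    affected.add(name)
        return affected
```

Only the rules returned by `change` need to be rematched, and the join for each of them can read the saved alpha memories instead of rescanning WM.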
conflict
rated
the
instantiated
to the CHANGES
root
RHS
of
the
4.4.1 set
the
reported
action
Repeat; Figure
4:
Original
DAD0
algorithm
is outlined
in figure
5.
Discussion
of Algorithm
3
The TREAT algorithm is a refinement of Algorithm 2 incorporating temporal redundancy. Hence, TREAT is best suited for PS programs characterized as:
winning
of DADO. to
TREAT
to
specifications. 8. end
matching
the
instance. 6. Report
for
do;
termination
5. Select:
rules
Daniel Miranker has invented an algorithm which modifies Algorithm 2 to include several of the features of the Rete match for saving state. The TREe Associative Temporally redundant (TREAT) algorithm [Miranker 1984] makes use of the same logical division of the DADO tree as in Algorithm 2. However, the state of the previous match operation is saved in distributed data structures within the WM-subtrees.
The 4. Upon
to select
updated
is performed
an
pattern
stored
and
of
is added.]
ii. Each
. ..
set
routine. is
instances
subtree
if WM-
[If th is is a deletion, (relational
is
4.4
PM-
local
match
probe
matching
to
its
WM-subtree
associative
this
in each
to
a partial
element
PM-level
determines
is relevant
rules
do;
the
selection) can be used faster than hashing).
Consideration of these techniques led us to investigate Rete for direct implementation on DADO2. Algorithms 3 and 4 detail this approach.
to match.
is initiated
accordingly.
111.
to
an instruction
change
end
and
PM-level
WM
WM-change
i. Each
C.
routine
to each
following:
Broadcast
b.
a
subset
1.
Temporally
redundant.
3.
Many
are
6.
Global
8.
Small
PM.
11.
Large
WM.
rules tests
affected
of WM
are
on each also
cycle.
efficiently
handled.
Algorithm.
Simple changes may dramatically improve the situation. For example, rather than iterating over each pattern element in each rule as in step 3.b.ii, we may only execute the match for those rules affected by new WM changes. The selection of affected rules can be achieved quickly using the WM-subtree as an associative memory. By distributing pattern elements as relational tuples in a manner similar to WM, associative probing (relational
We note, though, that minor changes allow TREAT to implement Algorithm 2 directly (by setting L to all of the rules at the PM-level in step 3.d.ii and ignoring step 3.d.i). Thus, TREAT may also efficiently execute: 4.
Non-temporally In
step
3.d.iii,
redundant TREAT
systems. also
implements
a
useful
1. Initialize: simple
Distribute matcher
set of rules.
each
appropriate
the
LHS PE.
to the
pattern
of the
rules
Set
PM-level
below)
Distribute
the level
to
(described
and
WM-subtree
elements
a
in the to
the
compared to Algorithm 2 executing temporally redundant systems. (The implementation, study and detailed analysis of TREAT forms a major part of Daniel Miranker’s Ph.D. thesis.)
PE’s
appearing
appearing
CHANGES
PE
a compiled
root
in PM-
initial
4.5
WM
elements. 2. Repeat 3. Act:
the For
a.
WM-change
Broadcast
in CHANGES
WM-change
to
the
do;
WM-subtree
PE’s. b.
If this
change
instruction elements
d.
At
each
PM-level
PE
to
do;
to WM-subtree to match the
instruction
PE
against
the
local
ii. Report
the
affected
pattern
the
pattern
PE’s
an
WM-change element.
rules
and
Algorithm
store
rules
of
4.5.1
the
each
rule
1. Match rules
in L do;
2. For
remaining
patterns
of the
1.
Temporally
in
as
2.
Few
new
in
priority 3. end
L
in
2.
each
store
Discussion
specified
Algorithm
end
for
instance
found,
WM-subtree
with
10. a
rating.
PM.
We
1023
Rete
nodes
9.
5. Report 6. Set
the
7. end
winning
CHANGES
winning
max-RESOLVE
rule
to
in the
to
find
the
the
instance. the
instantiated
5:
The
TREAT
tree
This
believe
processed overlay
and
changes.
Rete
processed
that
by
only
DADOB.
technique
can
networks
be are
Thus,
in turn.
be achievable. since
(stored storage
at
alpha
capacity the
Rete
for
storage
network
nodes
intermediate and
beta
a
DAD02
of
partial memories), PE
may
size of WM.
Although overlayed Rete networks would be processed sequentially on DADO2, significant performance improvements can be achieved by a natural pipelining effect. Immediately following a successful match and communication at a node, the next two-input test from the overlayed network is initiated. Thus, while the parent node is processing the first network node, its children are proceeding with their tests of the second overlayed network node.
Algorithm.
strategy. When iterating over each of the rules in L affected by recent changes in WM, those pattern elements with the smallest alpha memories are processed first. This technique tends to process the join operations quickly by filtering out many potentially failing partial joins. As algorithm, Miranker
instance,
several
However,
restricting
WM
of the
instance.
Repeat; Figure
of the programs
19791.
be
forward
significant results
require RHS
may
limited
for
may
where
WM.
match
tree.
may,
in the
PM
Small
implementation for PS
by
in [Forgy
a straight
embedded
instance
4
affected
Large
implemented
each;
rated
are
is noted
However,
do;
do;
Use
maximally
6.
redundant
observation
require 4. Select:
of Algorithm
rules
large e.
in figure
Since this algorithm is a direct implementation of the Rete match, it is most suitable for PS programs characterized as:
in L appropriately.
iv. For
v. end
elements
4 is illustrated
in
L. iii. Order
Rete
This observation leads to Algorithm 4 whereby a logical Rete network is embedded on the physical DADO binary tree structure. In the simplest case, leaf nodes of the DADO tree store and execute the initial linear chains of one-input test nodes, whereas internal DADO PE’s execute two-input node operations. The physical connections between processors correspond to the logical data flow links in the Rete network. The entire DADO machine operates in MIMD mode while executing this algorithm, behaving much like a pipelined data flow architecture.
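A toy version of this embedding can be written as a binary tree of objects in which leaves perform the one-input tests and internal nodes perform two-input joins on tokens arriving from their children. The classes `Leaf` and `Join` are assumptions for illustration; the sketch handles additions only, processes one WM element at a time, and does not model the MIMD pipelining of the real machine.

```python
class Leaf:
    """Leaf PE: the chain of one-input (constant) tests for one pattern element."""
    def __init__(self, pattern):
        self.pattern = pattern

    def push(self, fact):
        """Return a list with one binding token if the fact passes, else []."""
        if len(fact) != len(self.pattern):
            return []
        env = {}
        for p, f in zip(self.pattern, fact):
            if p.startswith("="):
                if env.setdefault(p, f) != f:
                    return []
            elif p != f:
                return []
        return [env]

class Join:
    """Internal PE: a two-input node remembering tokens from each child."""
    def __init__(self, left, right):
        self.children = (left, right)
        self.memory = ([], [])      # tokens seen so far from left and right

    def push(self, fact):
        out = []
        for side, child in enumerate(self.children):
            for tok in child.push(fact):
                # Join the new token against the opposite memory on shared variables.
                out += [{**tok, **other} for other in self.memory[1 - side]
                        if all(tok[k] == other[k]
                               for k in tok.keys() & other.keys())]
                self.memory[side].append(tok)
        return out
```

Tokens that emerge from the root `Join` correspond to complete rule instantiations; each physical parent-child link in the tree plays the role of one data flow link of the Rete network.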
set
cycles.
to PM-level
Phase.
i. Broadcast
WM
conflict
on previous
Match
an
delete
affected
an instruction
the
broadcast
and
any
calculated
Broadcast enter
match
and
instances c.
is a deletion, to
4: Fine-grain
A Rete network compiled from the LHS’s of a rule set consists of a number of simple nodes encoding match operations. Tokens, representing WM modifications, flow through the network in one direction and are processed by each node lying on their traversed paths. Fortunately, the maximum fan-in of any node in a Rete network is two. Hence, a Rete network can be represented as a binary tree (with some minimal amount of node splitting).
following:
each
Algorithm
A second source of pipelining can improve performance as well. In this case the entire RHS action specification is broadcast at once to the DADO leaf PE’s at step 3.a. Immediately following the conclusion of the first match operation and communication of the first WM
noted above, Gupta’s analysis of a TREAT-like as well as subsequent analysis performed by [1984], show TREAT to be highly efficient
1. Initialize:
Map
network
on
provided
with
and
the
load
DAD0
the
the
compiled Each
tree.
appropriate
network
information
details).
Set
match
code
[Forgy
(see CHANGES
Rete node
is and
19821
to
for
initial
WM
1. Initialize.
2. Repeat
the For
in CHANGES
WM-change
to
Distribute
do;
of the a. b.
Broadcast
WM-change
the
leaf PE’s.
DAD0
Broadcast
an
one-input The
token. for
root
of
are
tests, the
The to
the
to the
PE
is provided
the
with
control
the
the
chosen Set
processor.
instantiated
completion SIMD
of PE
of PS
(PS-level), Algorithm
program
each
PS-level.
instruction in their
to each
PS-level
PE
MIMD
mode.
(Upon each
respective
reconnects
to
programs, the
tree
above
PE’s
are
in
to
in
mode.)
3. Repeat a.
2.
to
the
following.
Test
if
all
PS-level
SIMD
mode. End
Repeat;
4. Execution
Complete.
Halt.
Figure
7:
Simple
Multiple
PS Program
Execution.
RHS.
In the cases where various PS-level PE’s need to communicate results with each other, step 3 is replaced with appropriate code sequences to report and broadcast values from the PS-level in the proper manner. Each of the programs executed by PS-level PE’s is first modified to synchronize as necessary with the root PE to coordinate the communication acts at, for example, termination of the Act phase.
Repeat; Figure
6:
Fine-grain
Rete
Algorithm.
token, the leaf PE’s initiate processing of the second WM token. Hence, as a WM token flows up the DADO tree, subsequent WM tokens flow close behind at lower levels of the tree in pipeline fashion.
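The benefit of tokens following one another up the tree can be estimated with a deliberately simple timing model. The assumptions here are ours, not the paper's: every tree level takes one unit time step per token, a new token may enter at the leaves on every step, and no token is ever delayed. Under this model a tree with 10 levels (1023 = 2^10 - 1 PE's) handles k tokens in 10 + k - 1 steps rather than 10k.

```python
def sequential_steps(levels, tokens):
    """Each token climbs the whole tree before the next one starts."""
    return levels * tokens

def pipelined_steps(levels, tokens):
    """Closed form for the idealized pipeline: tokens follow one level apart."""
    return levels + tokens - 1 if tokens else 0

def simulate(levels, tokens):
    """Step-by-step check of the closed form under the same assumptions."""
    in_flight = []                  # current level of each token (leaves = 1)
    injected = finished = steps = 0
    while finished < tokens:
        steps += 1
        in_flight = [lvl + 1 for lvl in in_flight]   # every token moves up one level
        if injected < tokens:                        # a new token enters at the leaves
            in_flight.append(1)
            injected += 1
        finished += sum(1 for lvl in in_flight if lvl == levels)
        in_flight = [lvl for lvl in in_flight if lvl < levels]
    return steps
```

Real match times vary from node to node, so these counts are best read as an upper bound on what pipelining alone could contribute.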
4.0
execution
incorporate
DAD0
processor).
root
begin
to
physical
changes in
at the an
the
immediately
until
reports
the to
which
PE’s
appropriate
2. Broadcast
PS-level
PM-level
the the
DAD0
System-level
their
passing
two-input
maintained
new
waiting
by
tokens
repeated
DAD0
execute the
communicated
their
set
from
CHANGES 4. end
idle
to
do;
The
instance
processors
computed
Those
is then
control
Select:
nodes
ancestors
conflict end
on
processing
process
C.
sequences lay
to
PE’s
leaf
tests
immediate begin
all
the
results
descendants.
to
token)
test interior
match
one-input
Rete
instruction
(First,
Match. their
(a
divide
Production
similar
following:
each
Logically
static
a
elements. 3. Act:
controlled by some algorithmic process executed in the upper part of the tree. The simplest case is represented by the procedure illustrated in figure 7, which is similar in some respects to Algorithm 2.
Algorithm 5: Multiple Asynchronous Execution
In our discussion so far, no mention was made about characteristic 7, multiple rule firings. We may view this as
- multiple, independently executing PS programs, or
- executing multiple conflict set rules of the same PS program concurrently.
In this regard we offer not a single algorithm, but rather an observation that may be put to practical use in each of the abovementioned algorithms. We note that any DADO PE may be viewed as a root of a DADO machine. Thus, any algorithm operating at the physical root of DADO may also be executed by some descendant node. Hence, any of the aforementioned algorithms can be executed at various sites in the machine concurrently! (This was noted in [Stolfo and Shaw 1982].) This coarse level of parallelism, however, will need to be
In addition to concurrent execution of multiple PS programs, methods may be employed to concurrently execute portions of a single PS program. These methods are intimately tied to the way rules are partitioned in the tree. Subsets of rules may be constructed by a static analysis of PM separating those rules which do not directly interact with each other. In terms of the match problem-solving paradigm, for example, it may be convenient to think of independent subproblems and the methods implementing their solution (see [Newell 1973]). Each such method may be viewed as a high-level subroutine represented as an independent rule set rooted by some internal node of DADO. Algorithm 1, for example, may be applied in parallel for each rule set in question. Asynchronous execution of these subroutines proceeds in a straightforward manner. The complexity arises when one subset of rules infers data required by other rule sets. The coordination of these communication acts is the focus of our ongoing research. Space does not permit a complete specification of this approach, and thus the reader is encouraged to see [Ishida 1984] for details of our initial thinking in this direction.
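One way to picture the static analysis mentioned above is as a grouping of rules by the WM classes they test and modify. The sketch below is an assumption-laden illustration, not the method of [Ishida 1984]: it treats two rules as interacting when one rule's RHS writes a class that the other rule's LHS reads, and it merges interacting rules transitively with a union-find structure. The rule names and class names in the example are invented.

```python
from itertools import combinations


def interacts(rule_a, rule_b):
    """Hypothetical test: one rule writes a WM class that the other reads."""
    reads_a, writes_a = rule_a
    reads_b, writes_b = rule_b
    return bool(writes_a & reads_b) or bool(writes_b & reads_a)


def partition_rules(rules):
    """Group rules into sets with no interaction across sets.

    `rules` maps a rule name to (classes read by its LHS, classes written by its RHS).
    """
    parent = {name: name for name in rules}

    def find(name):
        # Follow parent links to the representative, halving paths as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(rules, 2):
        if interacts(rules[a], rules[b]):
            parent[find(a)] = find(b)

    groups = {}
    for name in rules:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    example = {
        "clear-block": ({"Goal", "On-top-of"}, {"On-top-of"}),
        "note-clear": ({"On-top-of"}, {"Clear"}),
        "bill-order": ({"Order"}, {"Invoice"}),
        "post-invoice": ({"Invoice"}, {"Ledger"}),
    }
    print(partition_rules(example))
```

Each resulting group could then be placed under its own internal node. Because the grouping is transitive, it is more conservative than separating only rules that directly interact, and it does not address the harder case the text raises, where one rule set infers data required by another.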
5 Conclusion
We have outlined five abstract algorithms for the parallel execution of PS programs on the DADO machine and indicated what characteristics they are best suited for. We summarize our results in tabular form as follows:
Algorithm                    PS Characteristics
1. Fully Distributed PM      1, 3, 5, 7, 9, 11
2. Original DADO             3, 4, 6, 7, 8, 11
3. Miranker's TREAT          1, 3, 4, 6, 7, 8, 11
4. Fine-grain Rete           1, 2, 5, 7, 9, 10
5. Multiple Asynchronous     Applies to all cases.
Of the five reported algorithms, only the original DADO algorithm (number 2) has been carefully studied analytically. The performance statistics of the remaining four algorithms have yet to be analyzed in detail. However, many of the performance statistics cannot be analyzed without specific examples and detailed implementations. Working in close collaboration with researchers at AT&T Bell Laboratories, in the course of the next year of our research we intend to implement each of the stated algorithms on a working prototype of DADO.
In this paper, we have outlined our expectations concerning the suitability of each of the algorithms for a variety of possible PS programs. We expect our findings to substantiate our claims, and to demonstrate this with working examples in the near future.
References
Davis, R. and J. King. An Overview of Production Systems. Technical Report, Computer Science Department, Stanford University, 1975.
Forgy, C. L. On the Efficient Implementation of Production Systems. Ph.D. Thesis. Technical Report, Department of Computer Science, Carnegie-Mellon University, 1979.
Forgy, C. L. Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Matching Problem. Artificial Intelligence, 1982, 19, 17-37.
Gupta, A. Implementing OPS5 Production Systems on DADO. Technical Report, Department of Computer Science, Carnegie-Mellon University, 1983.
Ishida, T. and S. J. Stolfo. Simultaneous Firing of Production Rules on Tree-structured Machines. Technical Report, Department of Computer Science, Columbia University, 1984.
McDermott, J. and C. Forgy. Production System Conflict Resolution Strategies. In Waterman and Hayes-Roth (Eds.), Pattern-directed Inference Systems, Academic Press, 1978.
Miranker, D. P. Performance Estimates for the DADO Machine: A Comparison of TREAT and RETE. Technical Report, Department of Computer Science, Columbia University, April 1984.
Newell, A. Production Systems: Models of Control Structures. In W. Chase (Ed.), Visual Information Processing, Academic Press, 1973.
Rychener, M. Production Systems as a Programming Language for Artificial Intelligence. Ph.D. Thesis. Technical Report, Department of Computer Science, Carnegie-Mellon University, 1976.
Stolfo, S. J. The DADO Parallel Computer. Technical Report, Department of Computer Science, Columbia University, August, 1983. (Submitted to AI Journal).
Stolfo, S. J. and D. P. Miranker. The DADO Production System Machine: System-level Details. Technical Report, Department of Computer Science, Columbia University, 1984. (Submitted to IEEE Transactions on Computers).
Stolfo, S. J. and D. E. Shaw. DADO: A Tree-structured Machine Architecture for Production Systems. Proceedings of the National Conference on Artificial Intelligence, Carnegie-Mellon University, August 1982.