Five Parallel Algorithms for Production System Execution on the ...

From: AAAI-84 Proceedings. Copyright ©1984, AAAI (www.aaai.org). All rights reserved.

FIVE PARALLEL ALGORITHMS FOR PRODUCTION SYSTEM EXECUTION ON THE DADO MACHINE*

Salvatore J. Stolfo
Computer Science Department
Columbia University
New York City, N.Y. 10027

Abstract

In this paper we specify five abstract algorithms for the parallel execution of production systems on the DADO machine. Each algorithm is designed to capture the inherent parallelism in a variety of different production system programs. Ongoing research aims to substantiate our conclusions by empirically evaluating the performance of each algorithm on the DADO2 prototype, presently under construction at Columbia University.

*This research has been supported by the Defense Advanced Research Projects Agency through contract N00039-84-C-0165, as well as grants from Intel, Digital Equipment, Hewlett-Packard, Valid Logic Systems, AT&T Bell Laboratories and IBM Corporations and the New York State Science and Technology Foundation. We gratefully acknowledge their support.

1 Introduction

In this paper we outline five abstract algorithms specifying parallel execution of production system (PS) programs on the DADO machine. Each algorithm offers a number of advantages for particular types of PS programs. We expect to implement these algorithms on the DADO2 prototype and critically evaluate the performance of each on a variety of application programs. Software development is presently underway using the DADO1 prototype that has been operational at Columbia University since April, 1983.

We begin with a brief description of PS’s and identify various possible characteristics of PS programs which may not be immediately apparent from a general description of the basic formalism. These characteristics lead to different algorithms which will be discussed in the remaining sections of this paper.

2 Production Systems

In general, a Production System (PS) [Newell 1973, Davis and King 1975, Rychener 1976, Forgy 1982] is defined by a set of rules, or productions, which form the Production Memory (PM), together with a database of assertions, called the Working Memory (WM). Each production consists of a conjunction of pattern elements, called the left-hand side (LHS) of the rule, along with a set of actions called the right-hand side (RHS). The RHS specifies information that is to be added to (asserted) or removed from WM when the LHS successfully matches against the contents of WM.

Pattern elements in the LHS may have a variety of forms which are dependent on the form and content of WM elements. In the simplest case, patterns are lists composed of constants and variables (prefixed with an equals sign), while WM elements are simple lists of constant symbols (corresponding to tuples of the relational algebra). An example production, borrowed from the blocks world, is illustrated in figure 1.

    (Goal Clear-top-of Block)
    (Isa =x Block)
    (On-top-of =y =x)
    (Isa =y Block)
    -->
    delete(On-top-of =y =x)
    assert(On-top-of =y Table)

    If the goal is to clear the top of a block, and there is a block (=x) covered by something (=y) which is also a block, then remove the fact that =y is on =x and assert that =y is on the table.

Figure 1: An Example Production.

In operation, the production system repeatedly executes the following cycle of operations:

Match: For each rule, determine whether the LHS matches the current environment of WM: each pattern element in the LHS is matched by some WM element, with variables consistently bound throughout the LHS. All matching instances of the rules are collected in the conflict set of rules.

Select: Choose exactly one of the matching rules according to some predefined criterion.

Act: Add to or delete from WM all assertions specified in the RHS of the selected rule or perform some operation.

During the selection phase of production system execution, a typical interpreter provides conflict resolution strategies based on the recency of matched data in WM, as well as syntactic discrimination. Other resolution schemes are possible, but for the present paper such issues will not significantly change our analysis, and hence will not be discussed.

We shall only consider the parallel execution of PS programs with the goal of accelerating the rule firing rate of the recognize/act cycle as well as the number of WM transactions performed. In a later section of this paper, we shall consider other possible parallel activities as, for example, the concurrent execution of multiple PS programs.
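The Match, Select and Act cycle described above can be sketched sequentially in a few lines of Python. This is only an illustrative sketch, not the interpreter used on DADO: the block names `B1` and `B2`, the dictionary layout of `RULE`, the helper names, and the trivial "take the first instance" selection are all assumptions introduced here.

```python
# Working memory (WM): a set of tuples of constant symbols.
# The block names B1 and B2 are illustrative assumptions.
WM = {
    ("Goal", "Clear-top-of", "Block"),
    ("Isa", "B1", "Block"),
    ("Isa", "B2", "Block"),
    ("On-top-of", "B2", "B1"),
}

# The production of figure 1 as data; symbols starting with "=" are variables.
RULE = {
    "lhs": [("Goal", "Clear-top-of", "Block"),
            ("Isa", "=x", "Block"),
            ("On-top-of", "=y", "=x"),
            ("Isa", "=y", "Block")],
    "delete": [("On-top-of", "=y", "=x")],
    "assert": [("On-top-of", "=y", "Table")],
}


def unify(pattern, fact, env):
    """Extend env so that pattern equals fact, or return None on a mismatch."""
    if len(pattern) != len(fact):
        return None
    env = dict(env)
    for p, f in zip(pattern, fact):
        if p.startswith("="):
            if env.setdefault(p, f) != f:   # variable already bound elsewhere
                return None
        elif p != f:                        # constant test fails
            return None
    return env


def match(rule, wm):
    """Naive match: all consistent variable bindings for the rule's LHS."""
    envs = [{}]
    for pattern in rule["lhs"]:
        extended = []
        for env in envs:
            for fact in wm:
                new_env = unify(pattern, fact, env)
                if new_env is not None:
                    extended.append(new_env)
        envs = extended
    return envs


def substitute(pattern, env):
    """Replace bound variables in an RHS pattern by their values."""
    return tuple(env.get(symbol, symbol) for symbol in pattern)


def run(rules, wm, max_cycles=10):
    """Repeat Match, Select, Act until the conflict set is empty."""
    wm = set(wm)
    for _ in range(max_cycles):
        conflict_set = [(r, e) for r in rules for e in match(r, wm)]  # Match
        if not conflict_set:
            break
        rule, env = conflict_set[0]      # Select: placeholder strategy
        for p in rule["delete"]:         # Act
            wm.discard(substitute(p, env))
        for p in rule["assert"]:
            wm.add(substitute(p, env))
    return wm
```

Running `run([RULE], WM)` on this small working memory fires the rule once, moving `B2` from `B1` to the table, after which no rule instance remains.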


state

new

of

temporal this

to be always

3. Many

by

may

similar

1. Assign

some

of

WM. algorithm

illustrated

in

of rules

to a set of (distinct)

on

WM.

subset

of WM

elements

processors

of a rule

operation

is not

appropriate.

(possibly

distinct

from

those

in

each

but

step

to

may a

1). until

no rule

6. Global

is active:

LHS a.

Broadcast storing

an

to

in

conflict b.

the

system.

Report

Broadcast

the

step

3.b

to

their

local

2:

of

(non-temporally

action

specifications

have

large

is

global

the

unlikely,

of

not

scope

of

the state.

matches.

which

may

relatively need

in

effects

i.e., saving

pattern

The

potentially Thus,

small.

access

small

to

all

a

of WM

subset

within

its instantiated all

the

of

data

processors,

which end

Production

in

update

System

prior

to

the

cycle.

to only 9 Small

Algorithm.

PM.

This very PS cycle

10. Large

simple view of the parallel implementation forms the basis of our subsequent analysis.

3 Characteristics

of Production

System

thousands 11. Large

Programs

thousands

In this section we enumerate various characteristics of PS programs in general terms. The reader will note that these characteristics are less indicative of a specific PS formalism, but rather are characteristics of various problems whose solutions are encoded in rule form. It should be noted, though, that the “inherent parallelism” of the problems may not be represented by the particular PS formalism used for their solution.

1. Temporal made

on

between need

Redundancy. each

each not

[Forgy

cycle,

be

19821

relatively

Rules.

few

WM

rules

on need

changes

by

saving

matching

the

incorporating to

WM

The

repeated.

Affected changes

previous

is probably

PS interpreter

Few Thus,

cycle.

this Few each

are

algorithm

be matched

are

This

against case

of characteristic On

each

conflict

initiating

the

number

of

rules match

rules

may 5.

cycle

of

may

be

phase

of

is restricted

a few hundred. WM. PM.

Similarly, A

of rules WM.

WM

may

consist

of only

elements. PS

may

Similarly,

of data

consist

of

several

in PM. WM

may

consist

of

elements.

Algorithms

The reader is assumed to be knowledgeable about the Rete match algorithm (see [Forgy 1979] and [Forgy 1982]). We will thus freely discuss the details of the Rete match when needed without prior explication. We begin with a brief description of the DADO architecture. (The reader is encouraged to see [Stolfo 1983] and [Stolfo and Miranker 1984] for complete details of the system.)

of a

strategy. cycle,

elements

which

state

example

rules

WM

of

than

tests

In this section we outline five different algorithms suitable for direct execution on the DADO machine. Each will be independently discussed, leading to various conclusions about which characteristics they are most appropriate for capturing. Ongoing research aims to verify our conclusions by empirically evaluating their performance for different classes of PS programs.

operations

Rete

best

4 Five

The

rather

value).

firings.

executed next

of

converse

rule

WM,

in the requiring

example,

threshold

number

8 Small

Repeat;

number

elements

conditions of

(for

a

operation,

test

portions

as the

7 Multiple

entire

reported

the

be viewed

RHS.

to WM

accordingly.

Abstract

instance

the

large

constant

Pattern

may

elements

compare some

rated

to

of WM.

rule

individual

local

compute

rule

changes

WM

a

a

a few hundred

by

rules.

WM

relatively

tests

of

access

phase,

instances.

processor,

rated

processors

match

maximally

each

maximally

all

the

formation

each

within

Figure

to

begin

set of matching

Considering

c.

instruction

rules

resulting

2. Few

where

in many

elements.

3. Repeat

of the

cycle.

to

elements

rule

rule

are

each

in situations

restricting

scope

rules

on

case,

seems

of WM

single

to a set of

WM

appear

may

Thus,

match

match

some

In this

number

processors. 2. Assign

RHS

that

guarantee

Many

to

example,

changes

the

not

case.

elements

5. Restricted

subset

for

arise,

however,

does

Rules. changes

pattern

redundant).

update abstract

the

the

4. Massive

algebraic operation. processed by a parallel by the this approach figure 2.

alone

Affected

affected This

Note,

WM.

redundancy

affected and

against

thus the


4.1 The DADO Machine

DADO is a fine-grain, parallel machine where processing and memory are extensively intermingled. A full-scale production version of the DADO machine would comprise a very large (on the order of a hundred thousand) set of processing elements (PE’s), each containing its own processor, a small amount (16K bytes, in the current design of the prototype version) of local random access memory (RAM), and a specialized I/O switch. The PE’s are interconnected to form a complete binary tree.

Within the DADO machine, each PE is capable of executing in either of two modes under the control of run-time software. In the first, which we will call SIMD mode (for single instruction stream, multiple data stream), the PE executes instructions broadcast by some ancestor PE within the tree. (SIMD typically refers to a single stream of “machine-level” instructions. Within DADO, on the other hand, SIMD is generalized to mean a single stream of remote procedure invocation instructions. Thus, DADO makes more effective use of its communication bus by broadcasting more “meaningful” instructions.) In the second, which will be referred to as MIMD mode (for multiple instruction stream, multiple data stream), each PE executes instructions stored in its own local RAM, independently of the other PE’s. A single conventional coprocessor, adjacent to the root of the DADO tree, controls the operation of the entire ensemble of PE’s.

When a DADO PE enters MIMD mode, its logical state is changed in such a way as to effectively “disconnect” it and its descendants from all higher-level PE’s in the tree. In particular, a PE in MIMD mode does not receive any instructions that might be placed on the tree-structured communication bus by one of its ancestors. Such a PE may, however, broadcast instructions to be executed by its own descendants, providing all of these descendants have themselves been switched to SIMD mode. The DADO machine can thus be configured in such a way that an arbitrary internal node in the tree acts as the root of a tree-structured SIMD device in which all PE’s execute a single instruction (on different data) at a given point in time. This flexible architectural design supports multiple-SIMD execution (MSIMD). Thus, the machine may be logically divided into distinct partitions, each executing a distinct task, and is the primary source of DADO’s speed in executing a large number of primitive pattern matching operations concurrently.

Our comments will be directed towards the DADO2 prototype consisting of 1023 PE’s constructed from commercially available chips. Each PE contains an 8 bit Intel 8751 processor, 16K bytes of local RAM, 4K bytes of local ROM and a semi-custom I/O switch. The DADO2 I/O switch, which is being implemented in semi-custom gate array technology, has been designed to support rapid communication. In addition, a specialized global combinational circuit incorporated within the I/O switch will allow for the very rapid selection of a single distinguished PE from a set of candidate PE’s in the tree, a process we call max-resolving. (The max-resolve instruction computes the maximum of a specified register in all PE’s in one instruction cycle, which can then be used to select a distinct PE from the entire set of PE’s taking part in the operation.) Currently, the 15 processing element version of DADO performs these operations in firmware embodied in its off-the-shelf components.

4.2 Algorithm 1: Full Distribution of PM

In this case, a very small number of distinct production rules are distributed to each of the 1023 DADO2 PE’s, as well as all WM elements relevant to the rules in question, i.e., only those data elements which match some pattern in the LHS of the rules. Algorithm 1 alternates the entire DADO tree between MIMD and SIMD modes of operation. The match phase is implemented as an MIMD process, whereas selection and act execute as SIMD operations. In simplest terms, each PE executes the match phase for its own small PS. One such PS is allowed to “fire” a rule, however, which is communicated to all other PE’s. The algorithm is illustrated in figure 3.

    1. Initialize: Distribute a simple rule matcher to each PE. Distribute a few distinct rules to each PE. Set CHANGES to initial WM elements.
    2. Repeat the following:
    3. Act: For each WM-change in CHANGES do:
       a. Broadcast WM-change (add or delete a specific WM element) to all PE’s.
       b. Broadcast a command to locally match. [Each PE operates independently in MIMD mode and modifies its local WM. If this is a deletion, it checks its local conflict set and removes rule instances as appropriate. If this is an addition, it matches its set of rules and modifies its local conflict set accordingly].
       c. end do;
    4. Find local maxima: Broadcast an instruction to each PE to rate its local matching instances according to some predefined criteria (conflict resolution strategy; see [McDermott and Forgy, 1978]).
    5. Select: Using the high-speed max-RESOLVE circuit of DADO2, identify a single rule for execution from among all PE’s with active rules.
    6. Instantiate: Report the instantiated RHS actions. Set CHANGES to the reported WM-changes.
    7. end Repeat;

Figure 3: Full Distribution of Production Memory.

4.2.1 Discussion of Algorithm 1

We have left the details of the local match routine unspecified at step 3.b. Thus, a simple precompiled Rete match network and interpreter may be distributed to each processor. However, it is not clear whether a simple naive matching algorithm may be more appropriate since only a very small number of rules is present in each PE. Memory considerations may decide this issue: the overhead associated with linking and manipulating intermediate partial matches in a Rete network may be more expensive than direct pattern matching against the local WM on each cycle.


Performance of this algorithm varies with the complexity of the local match. In the best case, the time to match the rule set is bounded by the time to match only a few rules. The worst case is dependent on the maximum number of WM elements accessed during the match of the rules. If a simple naive match is used at each PE, this may require a considerable amount of computation even though the size of the local WM’s is limited. Simple hashing of WM may dramatically improve a local naive matching operation, however.
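The selection step of Algorithm 1 relies on max-resolving over the tree of PE’s. The following Python sketch models only the result of that operation, as a tournament over an implicit complete binary tree; it is not a model of the combinational circuit itself, which the text says completes in one instruction cycle. The array layout (children of PE `i` at `2i+1` and `2i+2`), the use of `None` for a PE with no candidate, and the tie-breaking are assumptions made for illustration.

```python
def max_resolve(ratings):
    """Return (best_rating, pe_index) over all PE's, or None if no PE has a candidate.

    ratings[i] is the rating of PE i's best local rule instance, or None.
    PE's are laid out as an implicit complete binary tree: the children of
    PE i are PE 2*i + 1 and PE 2*i + 2, and PE 0 is the root.
    """
    best = [None if r is None else (r, i) for i, r in enumerate(ratings)]
    # Walk from the last PE towards the root, passing each subtree's best upward.
    for i in range(len(ratings) - 1, 0, -1):
        parent = (i - 1) // 2
        if best[i] is not None and (best[parent] is None or best[i][0] > best[parent][0]):
            best[parent] = best[i]
    return best[0]


# A DADO2-sized tree: 1023 = 2**10 - 1 PE's, i.e. ten levels.
ratings = [None] * 1023
ratings[700] = 5
ratings[12] = 9
ratings[1022] = 7
winner = max_resolve(ratings)   # PE 12 holds the maximally rated instance
```

In this sketch a tie is resolved in favour of the PE nearer the root; the paper does not say how ties are handled, so that choice is arbitrary.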

suited 1.

We conclude that this algorithm is probably to implementing PS programs characterized by: Temporal would execution

3.

Many

to

though,

5.

WM

would

it

separately.

Note,

also of

WM.

three

or

for

4.

be

PM

of

common quite

depending pattern

elements

large.

Even

is

resident

element number replicated individual

of

on if an in

additional in

other

elements

the

average

between average

of

efficiently

implemented.

3.

above each

PE

saving

state

Many

rules

cycle.

each

PE

local PE’s),

may

rules,

WM

WM

of

may

unique

(while

elements

minimum

be stored

in WM.

of

6.

Global

8.

are 1000

DAD0

each

cycle,

The allow

tests

WM

such

ability

to

to

WM

for

may

also

be

an

environment,

has

few advantages.

by

WM-changes

changes

to

many

on

WM

rules

concurrency many

are

matchings

each

may

may

achieved

rule

is, unfortunately, one

of the

DAD02,

for

4 of

store

the

AA,

be

potentially at

the

PM-

to be achieved

30

Algorithm

with

a few hundred may

be quite configuration each

PE

thousands

has of

WM

for

WM

elements

point

each

we for

32

rules

rule

For

above

would

consisting

allows

may

a 32-way PE.

be accessed

In

system, systems

example, allow

of 32 PE’s. storage

elements

at each

appropriate.

however.

noted

would Rule

more

the

a PM-level PE’s

a thousand

are

large,

Since

storage,

is underutilized.

performance.

PM-level may

PM

envisage

considerable WM

this

Furthermore,

The original DADO algorithm detailed in [Stolfo 1983] makes direct use of the machine’s ability to execute in both MIMD and SIMD modes of operation at the same point in time. The machine is logically divided into three conceptually distinct components: a PM-level, an upper tree and a number of WM-subtrees. The PM-level consists of MIMD-mode PE’s executing the match phase at one appropriately chosen level of the tree. A number of distinct rules are stored in each PM-level PE. The WM-subtrees rooted by the PM-level PE’s consist of a number of SIMD mode PE’s collectively operating as a hardware content-addressable memory. WM elements relevant to the

for

WM-

device.

in size.

for rule

Thus,

rules

by the

parallel

restricted

machine

decreasing

WM-subtrees,

handled mode

is used

example,

WM

DAD02

tree

machine.

roughly

potentially 11

rather

level of the

capacity

efficiently

as an SIMD

PM

level

indeed a later

also

operating

DAD0 2: Original

cycles

the

access

to

For

affected

only full

are to

parallel

massive

affected.

subtrees

In

4.3 Algorithm

PS

efficiently.

be WM

The most serious drawback of this algorithm is the case where a local WM is too large to be conveniently stored in a PE. Clearly, characteristic 5 is appropriate for this algorithm only in the presence of characteristic 9, small WM. Multiple rule firings (characteristic 7) are also possible. A discussion of this case is deferred to a later section.

on

level would

a significant

a

for

content-addressable

a

changes

between

Since

permitted

3000-4000

large

are

in

only

tests

number

of one

not

but

be

Similarly,

allows

matching,

rules. 11.

designed

Indeed,

elements

a

at

consisting

redundant. WM

memory

each

the

stored

2

specifically

was as:

Non-temporally distribute

appropriate. rules

of Algorithm

against

global

Given

Discussion

This algorithm programs characterized

cycle.

Clearly, match

4.3.1

suitable, would

on each

Hence,

four a

be

PE’s

matches.

potentially

possible.

possible

Thus, it would

results

be particularly is

characteristics,

pattern to

local

not

PM

make

of

match

WM.

of PM,

may

number

new

small

Large

2

small

required

relatively

rules

WM

sequential

cycle.

distribution

characteristic

scope

is

each

It is probably best to view WM as a distributed relation. Each WM-subtree PE thus stores relational tuples. The PM-level PE’s match the LHS’s of rules in a manner similar to processing relational queries. In terms of the Rete match, intracondition tests of pattern elements in the LHS of a rule are executed as relational selection, while intercondition tests correspond to equi-join operations. Each PM-level PE thus stores a set of relational tests compiled from the LHS of a rule set assigned to it. Concurrency is achieved between PM-level PE’s as well as in accessing PE’s of the WM-subtrees. The algorithm is illustrated in figure 4.
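The relational reading of the match can be illustrated with a small sequential Python sketch: intracondition tests become a selection over WM tuples, and intercondition tests become an equi-join on shared variables. The function names `select` and `equi_join`, the block names, and the representation of bindings as dictionaries are assumptions made here; on DADO the selection would be carried out associatively inside a WM-subtree rather than by a loop.

```python
def select(wm, pattern):
    """Relational selection: one binding dict per WM tuple matching the pattern.

    Constants in the pattern must be equal to the tuple's fields; symbols
    starting with "=" are variables and are bound to the tuple's fields.
    """
    results = []
    for fact in wm:
        if len(fact) != len(pattern):
            continue
        env = {}
        for p, f in zip(pattern, fact):
            if p.startswith("="):
                if env.setdefault(p, f) != f:   # same variable, different value
                    break
            elif p != f:                        # constant test fails
                break
        else:
            results.append(env)
    return results


def equi_join(left, right):
    """Equi-join: merge pairs of bindings that agree on every shared variable."""
    return [{**l, **r} for l in left for r in right
            if all(l[v] == r[v] for v in l.keys() & r.keys())]


# Illustrative WM contents (block names are assumptions).
wm = [("Isa", "B1", "Block"), ("Isa", "B2", "Block"), ("On-top-of", "B2", "B1")]
blocks = select(wm, ("Isa", "=x", "Block"))
stacked = select(wm, ("On-top-of", "=y", "=x"))
joined = equi_join(blocks, stacked)
```

Here `joined` keeps only the binding in which the `=x` of both pattern elements refers to the same block.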

best

to

of

its local

on

similar

computing

Restricted rule

9.

partition

relatively

actively

to modify

initial

changes

amount

affected

the

that

a

PE

are

on

massive

considerable

at each

best

but

a

rules

depending be

since

redundancy, require

rules stored at the PM-level root PE are fully distributed throughout the WM-subtree. The upper tree consists of SIMD mode PE’s lying above the PM-level, which implement synchronization and selection operations.

the for

Since

32

each

capacity,

many

easily

stored.

be

parallel total,

in parallel

access nearly at

to 1000

a given

in time.

While attempting to implement temporally redundant systems, Algorithm 2 may recompute many of the match results calculated on previous cycles. This need not be the case if we modify Algorithm 2 to incorporate many of the capabilities of the Rete match.


    1. Initialize: Distribute a match routine and a partitioned subset of rules to each PM-level PE. Set CHANGES to the initial WM elements.
    2. Repeat the following:
    3. Act: For each WM-change in CHANGES do;
       a. Broadcast WM-change to the PM-level PE’s and an instruction to match.
       b. The match phase is initiated in each PM-level PE:
          i. Each PM-level PE determines if this WM-change is relevant to its local set of rules by a partial match routine. If so, its WM-subtree is updated accordingly. [If this is a deletion, an associative probe is performed on the element (relational selection) and any matching instances are deleted. If this is an addition, a free WM-subtree PE is identified, and the element is added.]
          ii. Each pattern element of the rules stored at a PM-level PE is broadcast to the WM-subtree below for matching. Any variable bindings that occur are reported sequentially to the PM-level PE for matching of subsequent pattern elements (relational equi-join).
          iii. A local conflict set of rules is formed and stored along with a priority rating in a distributed manner within the WM-subtree.
       c. end do;
    4. Upon termination of the match operation, the PM-level PE’s synchronize with the upper tree.
    5. Select: The max-RESOLVE circuit is used to identify the maximally rated conflict set instance.
    6. Report the instantiated RHS of the winning instance to the root of DADO.
    7. Set CHANGES to the reported action specifications.
    8. end Repeat;

Figure 4: Original DADO Algorithm.

Simple changes may dramatically improve the situation. For example, rather than iterating over each pattern element in each rule as in step 3.b.ii, we may only execute the match for those rules affected by new WM changes. The selection of affected rules can be achieved quickly using the WM-subtree as an associative memory. By distributing pattern elements as relational tuples in a manner similar to WM, associative probing (relational selection) can be used (perhaps faster than hashing).

Consideration of these techniques led us to investigate Rete for direct implementation on DADO2. Algorithms 3 and 4 detail this approach.

4.4 Algorithm 3: Miranker’s TREAT Algorithm

Daniel Miranker has invented an algorithm which modifies Algorithm 2 to include several of the features of the Rete match for saving state. The TREe Associative Temporally redundant (TREAT) algorithm [Miranker 1984] makes use of the same logical division of the DADO tree as in Algorithm 2. However, the state of the previous match operation is saved in distributed data structures within the WM-subtrees.

TREAT views the pattern elements in the LHS of rules as relational algebra terms, as in Algorithm 2. Thus, the evaluation of such relational algebra tests is also executed within the WM-subtrees. State is saved in a WM-subtree in the form of distributed Rete alpha memories corresponding to partial selections of tuples matching various pattern elements. Rule instances in the conflict set computed on previous cycles are also stored in a distributed manner within the WM-subtrees. These two additions substantially improve the performance of Algorithm 2. (We note that Anoop Gupta of Carnegie-Mellon University independently analyzed a similar algorithm in [Gupta 1983]. Compared to Algorithm 2, TREAT should perform substantially better for temporally redundant systems. We note that Gupta’s analysis of Algorithm 2, however, depends on certain assumptions that derive misleading results.)

Another aspect of TREAT is the clever manner in which relevancy is computed. Pattern elements are first distributed to the WM-subtrees. When a new WM element is added to the system, a simple match at each WM-subtree PE determines the set of rules at the PM-level which are affected by the change. Those rules identified are subsequently matched by the PM-level PE, restricting the scope of the match to a smaller set of rules than would otherwise be possible with Algorithm 2. The TREAT algorithm is outlined in figure 5.

4.4.1 Discussion of Algorithm 3

The TREAT algorithm is a refinement of Algorithm 2 incorporating temporal redundancy. Hence, TREAT is best suited for PS programs characterized as:

1. Temporally redundant.
3. Many rules are affected on each cycle.
6. Global tests of WM are also efficiently handled.
8. Small PM.
11. Large WM.

We note, though, that minor changes allow TREAT to implement Algorithm 2 directly (by setting L to all of the rules at the PM-level in step 3.b.ii and ignoring step 3.d.i). Thus, TREAT may also efficiently execute:

4. Non-temporally redundant systems.

In step 3.d.iii, TREAT also implements a useful

1. Initialize: simple

Distribute matcher

set of rules.

each

appropriate

the

LHS PE.

to the

pattern

of the

rules

Set

PM-level

below)

Distribute

the level

to

(described

and

WM-subtree

elements

a

in the to

the

compared to Algorithm 2 executing temporally redundant systems. (The implementation, study and detailed analysis of TREAT forms a major part of Daniel Miranker’s Ph.D. thesis.)

PE’s

appearing

appearing

CHANGES

PE

a compiled

root

in PM-

initial

4.5

WM

elements. 2. Repeat 3. Act:

the For

a.

WM-change

Broadcast

in CHANGES

WM-change

to

the

do;

WM-subtree

PE’s. b.

If this

change

instruction elements

d.

At

each

PM-level

PE

to

do;

to WM-subtree to match the

instruction

PE

against

the

local

ii. Report

the

affected

pattern

the

pattern

PE’s

an

WM-change element.

rules

and

Algorithm

store

rules

of

4.5.1

the

each

rule

1. Match rules

in L do;

2. For

remaining

patterns

of the

1.

Temporally

in

as

2.

Few

new

in

priority 3. end

L

in

2.

each

store

Discussion

specified

Algorithm

end

for

instance

found,

WM-subtree

with

10. a

rating.

PM.

We

1023

Rete

nodes

9.

5. Report 6. Set

the

7. end

winning

CHANGES

winning

max-RESOLVE

rule

to

in the

to

find

the

the

instance. the

instantiated

5:

The

TREAT

tree

This

believe

processed overlay

and

changes.

Rete

processed

that

by

only

DADOB.

technique

can

networks

be are

Thus,

in turn.

be achievable. since

(stored storage

at

alpha

capacity the

Rete

for

storage

network

nodes

intermediate and

beta

a

DAD02

of

partial memories), PE

may

size of WM.

Although overlayed Rete networks would be processed sequentially on DADO2, significant performance improvements can be achieved by a natural pipelining effect. Immediately following a successful match and communication at a node, the next two-input test from the overlayed network is initiated. Thus, while the parent node is processing the first network node, its children are proceeding with their tests of the second overlayed network node.

Algorithm.

strategy. When iterating over each of the rules in L affected by recent changes in WM, those pattern elements with the smallest alpha memories are processed first. This technique tends to process the join operations quickly by filtering out many potentially failing partial joins. As algorithm, Miranker

instance,

several

However,

restricting

WM

of the

instance.

Repeat; Figure

of the programs

19791.

be

forward

significant results

require RHS

may

limited

for

may

where

WM.

match

tree.

may,

in the

PM

Small

implementation for PS

by

in [Forgy

a straight

embedded

instance

4

affected

Large

implemented

each;

rated

are

is noted

However,

do;

do;

Use

maximally

6.

redundant

observation

require 4. Select:

of Algorithm

rules

large e.

in figure

Since this algorithm is a direct implementation of the Rete match, it is most suitable for PS programs characterized as:

in L appropriately.

iv. For

v. end

elements

4 is illustrated

in

L. iii. Order

Rete

This observation leads to Algorithm 4 whereby a logical Rete network is embedded on the physical DADO binary tree structure. In the simplest case, leaf nodes of the DADO tree store and execute the initial linear chains of one-input test nodes, whereas internal DADO PE’s execute two-input node operations. The physical connections between processors correspond to the logical data flow links in the Rete network. The entire DADO machine operates in MIMD mode while executing this algorithm, behaving much like a pipelined data flow architecture.
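The embedding described above can be illustrated by a small sequential Python sketch in which leaves run one-input tests and interior nodes run two-input (join) tests on tokens arriving from their two children. It is a simplified illustration, not the DADO implementation: the class names `Leaf` and `Join`, the dictionary tokens, and the restriction to additions (no deletions, no pipelining) are all assumptions made here.

```python
def bind(pattern, fact):
    """One-input tests: bindings if the fact fits the pattern, else None."""
    if len(pattern) != len(fact):
        return None
    env = {}
    for p, f in zip(pattern, fact):
        if p.startswith("="):
            if env.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return env


def consistent(a, b):
    """Two-input test: shared variables must carry equal values."""
    return all(a[v] == b[v] for v in a.keys() & b.keys())


class Leaf:
    """Leaf PE: runs the one-input test chain and keeps an alpha memory."""
    def __init__(self, pattern):
        self.pattern = pattern
        self.memory = []

    def add(self, fact):
        env = bind(self.pattern, fact)
        new = [] if env is None else [env]
        self.memory.extend(new)
        return new                       # tokens passed up to the parent


class Join:
    """Interior PE: joins the tokens flowing up from its two children."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.memory = []

    def add(self, fact):
        old_left = list(self.left.memory)
        new_left = self.left.add(fact)   # the fact is broadcast to both subtrees
        new_right = self.right.add(fact)
        pairs = [(l, r) for l in new_left for r in self.right.memory]
        pairs += [(l, r) for l in old_left for r in new_right]
        new = [{**l, **r} for l, r in pairs if consistent(l, r)]
        self.memory.extend(new)
        return new


# Three pattern elements of figure 1 mapped onto a small tree (block names assumed).
root = Join(Join(Leaf(("Isa", "=x", "Block")), Leaf(("On-top-of", "=y", "=x"))),
            Leaf(("Isa", "=y", "Block")))
for fact in [("Isa", "B1", "Block"), ("Isa", "B2", "Block"), ("On-top-of", "B2", "B1")]:
    root.add(fact)
```

After the three additions, the memory at the root holds the single complete instance, which plays the role of the conflict set change reported at the physical root.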

set

cycles.

to PM-level

Phase.

i. Broadcast

WM

conflict

on previous

Match

an

delete

affected

an instruction

the

broadcast

and

any

calculated

Broadcast enter

match

and

instances c.

is a deletion, to

4: Fine-grain

A Rete network compiled from the LHS’s of a rule set consists of a number of simple nodes encoding match operations. Tokens, representing WM modifications, flow through the network in one direction and are processed by each node lying on their traversed paths. Fortunately, the maximum fan-in of any node in a Rete network is two. Hence, a Rete network can be represented as a binary tree (with some minimal amount of node splitting).

following:

each

Algorithm

A second source of pipelining can improve performance as well. In this case the entire RHS action specification is broadcast at once to the DADO leaf PE’s at step 3.a. Immediately following the conclusion of the first match operation and communication of the first WM

As noted above, Gupta’s analysis of a TREAT-like algorithm, as well as subsequent analysis performed by Miranker [1984], show TREAT to be highly efficient


1. Initialize:

Map

network

on

provided

with

and

the

load

DAD0

the

the

compiled Each

tree.

appropriate

network

information

details).

Set

match

code

[Forgy

(see CHANGES

Rete node

is and

19821

to

for

initial

WM

1. Initialize.

2. Repeat

the For

in CHANGES

WM-change

to

Distribute

do;

of the a. b.

Broadcast

WM-change

the

leaf PE’s.

DAD0

Broadcast

an

one-input The

token. for

root

of

are

tests, the

The to

the

to the

PE

is provided

the

with

control

the

the

chosen Set

processor.

instantiated

completion SIMD

of PE

of PS

(PS-level), Algorithm

program

each

PS-level.

instruction in their

to each

PS-level

PE

MIMD

mode.

(Upon each

respective

reconnects

to

programs, the

tree

above

PE’s

are

in

to

in

mode.)

3. Repeat a.

2.

to

the

following.

Test

if

all

PS-level

SIMD

mode. End

Repeat;

4. Execution

Complete.

Halt.

Figure

7:

Simple

Multiple

PS Program

Execution.

RHS.

In the cases where various PS-level PE’s need to communicate results with eachother, step 3 is re laced with appropriate code sequences to report and broa a cast values from the PS-level in the proper manner. Each of the programs executed by PS-level PE’s are first modified to synchronize as necessary with the root PE to coordinate the communication acts, at, for example, termination of the Act phase.

Repeat; Figure

6:

Fine-grain

Rete

Algorithm.

token, the leaf PE’s initiate processing of the second WM Hence, as a WM token flows up the DAD0 tree, token. subsequent WM tokens flow close behind at lower levels of the tree in pipeline fashion.

4.0

execution

incorporate

DAD0

processor).

root

begin

to

physical

changes in

at the an

the

immediately

until

reports

the to

which

PE’s

appropriate

2. Broadcast

PS-level

PM-level

the the

DAD0

System-level

their

passing

two-input

maintained

new

waiting

by

tokens

repeated

DAD0

execute the

communicated

their

set

from

CHANGES 4. end

idle

to

do;

The

instance

processors

computed

Those

is then

control

Select:

nodes

ancestors

conflict end

on

processing

process

C.

sequences lay

to

PE’s

leaf

tests

immediate begin

all

the

results

descendants.

to

token)

test interior

match

one-input

Rete

instruction

(First,

Match. their

(a

divide

Production

similar

following:

each

Logically

static

a

elements. 3. Act:

algorithmic process executed in the controlled by some case is represented upper part of the tree. The simplest by the procedure illustrated in figure 7, which is similar in some respects to Algorithm 2.

Algorithm

In our characteristic as

5: Multiple

discussion so far, 7, multiple rule

- multiple,

independently

Asynchronous

Execution

no mention was made about We may view this firings.

executing

PS

programs,

or - executing PS program

multiple

conflict

set

rules

of the

same

concurrently.

In this regard we offer not a single algorithm, but rather an observation that may be put to practical use in each of the abovementioned algorithms. We note that any DAD0 PE may be viewed as a Thus, any algorithm operating root of a DAD0 machine. at the physical root of DAD0 may also be executed by Hence, any of the aforementioned some descendant node. algorithms can be executed at various sites in the machine concurrently! (Th is was noted in [Stolfo and Shaw 1982 .) This coarse level of parallelism, however, will need to IIe
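The control procedure of figure 7 can be pictured with ordinary threads standing in for PS-level PE's. This is only a sketch under stated assumptions: the function name `run_ps_programs`, the use of Python threads for MIMD execution, and the event flags that model "reconnecting in SIMD mode" are all hypothetical and say nothing about how DADO hardware signals mode changes.

```python
import threading
from typing import Callable, Dict


def run_ps_programs(programs: Dict[str, Callable[[], object]]) -> Dict[str, object]:
    """Run each PS program independently, then wait until all have finished."""
    results: Dict[str, object] = {}
    # One flag per PS-level PE; set when that PE is back "in SIMD mode".
    reconnected = {name: threading.Event() for name in programs}

    def ps_level_pe(name: str, program: Callable[[], object]) -> None:
        try:
            results[name] = program()      # independent (MIMD-style) execution
        finally:
            reconnected[name].set()        # reconnect to the tree above

    # Step 2: "broadcast" the instruction to begin execution.
    workers = [threading.Thread(target=ps_level_pe, args=(name, program))
               for name, program in programs.items()]
    for worker in workers:
        worker.start()

    # Step 3: repeatedly test whether all PS-level PE's have reconnected.
    while not all(flag.is_set() for flag in reconnected.values()):
        for flag in reconnected.values():
            flag.wait(timeout=0.01)

    # Step 4: execution complete.
    for worker in workers:
        worker.join()
    return results
```

When the PS-level programs must exchange results, the polling loop is the place where reporting and broadcasting of values would be added, in line with the replacement of step 3 described in the text.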

In addition to concurrent execution of multiple PS programs, methods may be employed to concurrently execute portions of a single PS program. These methods are intimately tied to the way rules are partitioned in the tree. Subsets of rules may be constructed by a static analysis of PM, separating those rules which do not directly interact with each other. In terms of the match problem-solving paradigm, for example, it may be convenient to think of independent subproblems and the methods implementing their solution (see [Newell 1973]). Each such method may be viewed as a high-level subroutine represented as an independent rule set rooted by some internal node of DADO. Algorithm 1, for example, may be applied in parallel for each rule set in question. Asynchronous execution of these subroutines proceeds in a straightforward manner. The complexity arises when one subset of rules infers data required by other rule sets. The coordination of these communication acts is the focus of our ongoing research. Space does not permit a complete specification of this approach, and thus the reader is encouraged to see [Ishida 1984] for details of our initial thinking in this direction.
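One way to picture the static analysis of PM is as a grouping of rules by interaction. The sketch below is an assumption-laden illustration, not the method of [Ishida 1984]: it treats two rules as interacting when the RHS of one changes a class of WM element that the LHS of the other matches, and it groups rules by the transitive closure of that relation using a union-find structure. The names `interacts` and `partition_rules` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# (classes of WM element matched by the LHS, classes changed by the RHS)
Rule = Tuple[FrozenSet[str], FrozenSet[str]]


def interacts(a: Rule, b: Rule) -> bool:
    """True if either rule's RHS touches a class the other rule's LHS matches."""
    return bool(a[1] & b[0]) or bool(b[1] & a[0])


def partition_rules(rules: Dict[str, Rule]) -> List[Set[str]]:
    """Group rule names into subsets with no interaction across subsets."""
    parent = {name: name for name in rules}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    names = list(rules)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if interacts(rules[a], rules[b]):
                parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())
```

Each resulting subset could then be rooted at its own internal node of the tree. Rules whose inferences feed another subset would be merged into one group by this sketch, which sidesteps rather than solves the coordination problem the text identifies.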

5 Conclusion

References

We have outlined five abstract algorithms for the parallel execution of PS programs on the DADO machine and indicated what characteristics they are best suited for. We summarize our results in tabular form as follows:

Algorithm
1. Fully Distributed PM
2. Original DADO
3. Miranker's TREAT
4. Fine-grain Rete
5. Multiple Asynchronous

Davis,

and

J.

Forgy,

1, 3, 5, 7, 9, 11

C.

L.

Mellon

1, 3, 4, 6, 7, 8, 11 1, 2, 5, 7, 9, 10 Forgy

to all cases.

the

On

Ph.D.

C.

L.

Gupta,

A

Science,

Carnegie-Mellon and

University, and

Report,

Columbia

University,

University, J.,

and

J.,

and

Report,

Department

University, Computers).


D. P.

Machine: 1984.

a

Shaw.

Computer. Science,

DADO:

of

Technical Columbia

A

Tree-structured

Production

Computer

Artificial

August,

The DAD0

System-level

Systems.

on

University,

(Submitted

Technical

to AI Journal).

Conference

of

Programming

Thesis.

for

Miranker.

Control

Department

(Submitted

Carnegie-Mellon

System

as

Computer

National

Intelligence, S.

E.

of

Intelligence.

Parallel

1983.

RETE. Science,

1973.

Ph.D.

of

DAD0

Information

Visual

University,

Architecture

Proceedings

Stolfo

D.

Computer

Models

(Ed.),

Systems

DAD0

the and

1984.

Press,

1976.

August

Machine

of

Artificial

Department

for

TREAT

of

Systems:

Academic

The

J.

Systems,

Estimates

Chase

Science,

Report,

Conflict

Hayes-Roth

Inference

Carnegie-Mellon

Computer

S.

System and

April

W.

for

Report,

Science,

Production

Department

Production

M.

Computer

Waterman

Comparison

In

Language

S.

In

Production

A.

Rychener,

of

of

Machines.

1978.

Technical

Processing,

Stolfo

Forgy.

A

Structures.

Firing

1984.

Machine.

Newell,

on

Computer

1983.

Simultaneous

Performance

P.

Systems of

Tree-structured

on

Many

Problem.

Production Department

Pattern-directed

D.

the

19, 17-37.

Stolfo.

C.

Press,

for

University,

Strategies.

(Eds.), Academic

Science,

Matching

Department

Report,

Columbia

Resolution

Stolfo

J.

Technical

Miranker

1982,

Rules

J.

Carnegie-

Computer

Algorithm

Report,

S.

Production

McDermott,

of

OPS5

Technical

T.,

Report,

Pattern

Intelligence,

DADO.

Ishida

In this paper, we have outlined our expectations concerning the suitability of each of the algorithms for a variety of possible PS programs. We expect our reported findings to substantiate our claims, and to demonstrate this with working examples in the near future.

Fast

Object

Implementing

A.

of

Implementation

Technical Department

Rete:

Artificial

Computer

Thesis.

Pattern/Many

Of the five reported algorithms, only the original DADO algorithm (number 2) has been carefully studied analytically. The performance statistics of the remaining four algorithms have yet to be analyzed in detail. However, many of the performance statistics cannot be analyzed without specific examples and detailed implementations. Working in close collaboration with researchers at AT&T Bell Laboratories, in the course of the next year of our research we intend to implement each of the stated algorithms on a working prototype of DADO.

of

1975.

Efficient

Systems.

Production

of

Department

University,

University,

1979.

Overview

Report,

Stanford

Production

3, 4, 6, 7, 8, 11

An

King.

Technical

Science,

Characteristics

Applies

R. Systems.

1982.

Production

Details.

Technical

Science,

Columbia

to IEEE

Transactions

on
