Transitive Closure Algorithms Based on Graph Traversal

YANNIS IOANNIDIS, RAGHU RAMAKRISHNAN, and LINDA WINGER
University of Wisconsin, Madison

Several graph-based algorithms have been proposed in the literature to compute the transitive closure of a directed graph. We develop two new algorithms (Basic_TC and Global_DFTC) and compare the performance of their implementations in a disk-based environment with that of a well-known graph-based algorithm proposed by Schmitz. All of these algorithms traverse the graph in a depth-first fashion and compute descendent sets by adding the descendent sets of children. While the details of these algorithms differ considerably, one important difference among them is the time at which the descendent sets of children are added to those of their parents: Global_DFTC performs these additions as early as possible, Basic_TC defers them to a second pass that processes nodes in reverse topological order, and Schmitz performs them at an intermediate point. Contrary to our expectations, deferring additions turns out to be superior: early additions result in working with larger descendent sets over the duration of the execution, thereby causing more I/O, and information collected in the first pass can be used to apply certain optimizations in the second pass. We also adapt the algorithms to compute path queries, and compare them with nongraph-based algorithms such as Seminaive.

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs and Features—recursion; D.4.2 [Operating Systems]: Storage Management—main memory, secondary storage, swapping; E.1 [Data]: Data Structures—graphs, trees; H.2.4 [Database Management]: Systems—query processing

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Depth-first search, node reachability, path computations, transitive closure

Some of the results in this paper appeared, in a preliminary form, in "Efficient Transitive Closure Algorithms," Proceedings of the 14th International VLDB Conference (Long Beach, Calif., Aug. 1988). Most of the analysis in that paper has been revised and replaced by an implementation-based performance evaluation.
Y. Ioannidis was partially supported by the National Science Foundation under grant IRI-8703592 and by a grant from IBM. R. Ramakrishnan was partially supported by the National Science Foundation under grant IRI-8804319, a Presidential Young Investigator Award, a David and Lucile Packard Foundation Fellowship in Science and Engineering, and an IBM Faculty Development Award.
Authors' address: Computer Sciences Department, University of Wisconsin, Madison, WI 53706.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1993 ACM 0362-5915/93/0900-0512 $01.50
ACM Transactions on Database Systems, Vol. 18, No. 3, September 1993, Pages 512-576.

1. INTRODUCTION

Several transitive closure algorithms have been presented in the literature. These include the Warshall and Warren algorithms [28, 29], which use a bit-matrix representation of the graph; the Schmitz [25], Ebert [10], and Eve and Kurki-Suonio [11] algorithms, which use Tarjan's algorithm [26] to identify the strong components of the graph in reverse topological order; the Seminaive [5]

and Smart/Logarithmic algorithms [12, 27], which view the graph as a binary relation and compute the transitive closure by a series of relational joins, and recently, a hybrid algorithm combining matrix-based algorithms and graph-based algorithms [1]. We develop two new algorithms based on depth-first traversal, and compare their performance in a disk-based environment with the well-known graph-based algorithm proposed by Schmitz. Basic_TC, the simplest of our algorithms, consists of a first pass that yields a topological sort of the graph and a second pass that iteratively processes nodes in the reverse topological order, building descendent sets by adding the descendent sets of children. Global_DFTC, the second of our algorithms, seeks to combine the two passes of Basic_TC by adding descendent sets that must be added whenever they are simultaneously in memory during the first pass, instead of waiting until a second pass to do so. Hereafter, we refer to these algorithms as BTC and GDFTC, respectively. Specialized versions of the algorithms are applicable on acyclic graphs and are named Dag_BTC and Dag_DFTC, respectively, where the prefix "Dag_" stands for "Directed acyclic graph." We have implemented and compared the algorithms BTC, GDFTC, and the Schmitz algorithm over randomly generated graphs. The result of this comparison is rather surprising, given the following characteristic difference among the algorithms: GDFTC performs additions as soon

as possible, BTC performs them as late as possible, and Schmitz performs them at an intermediate stage. Counter to the intuition that early additions are better (since descendent sets that have been added together need not be brought back into memory for this addition later), BTC outperforms both Schmitz and GDFTC. The first reason is that early additions result in working with larger descendent sets for a longer time during the execution of the algorithm, leading to more additions and I/O; this results in buffers being filled up quicker, thereby causing extra retrievals. Overall, the two effects—avoiding extra retrievals and the faster growth of sets—appear to balance out, with the faster growth of sets being perhaps a little more dominant. The second reason is that information collected in the first pass can be used to apply several optimizations in the second pass.

We have adapted BTC, GDFTC, and Schmitz to compute path queries, such as the set of paths from a given node to the nodes reachable from it, or queries such as the shortest path between each pair of nodes in the transitive closure. GDFTC and Schmitz are adapted for acyclic graphs only: an important optimization of these algorithms is the merging of the descendent sets of all nodes in a strong component, and since they therefore do not maintain information about the paths within a strong component, they are inapplicable to path queries over cyclic graphs. (Only BTC is also applicable to cyclic graphs.) We compare the performance of BTC,

GDFTC, and Schmitz for path queries over acyclic graphs, and show that the results for reachability extend to this case as well. We also present a comparison of the various versions of BTC for path computations on cyclic graphs.

This paper differs in many respects from its preliminary version [13]. First, the algorithms have been revised and are presented differently, and full proofs of correctness have been included. Second, we have increased the emphasis on BTC, due to its simplicity and superior performance.¹ The most important difference, however, is that the analysis presented in the preliminary version has been replaced by a performance evaluation based upon actual implementations of the algorithms. Indeed, this has caused us to revise some of our conclusions about the relative merits of the algorithms. The performance evaluation brought out the fact that the algorithms were affected significantly by the impact on buffer management of the growth of descendent sets. This was not reflected in our analysis, which was based upon the assumption that only "minimal" buffer space was available; under that assumption, all algorithms were similarly affected. Also, the analysis was for worst-case performance, whereas the average-case behavior of the algorithms differs considerably. Thus, in our implementation-based evaluation we were able to experiment with buffer management strategies and specialized data organizations, which could not be captured in the analysis.

The paper is organized as follows. We introduce some notation and present

a summary of the new and the existing graph-based algorithms in Section 2. Section 3 presents the new algorithms in detail, starting with some simple versions and subsequently refining them. We describe the implementation of the

algorithms in Section 4, and the testbed for the performance evaluation in Section 5. We present a performance comparison for reachability queries in Section 6 (acyclic graphs) and Section 7 (cyclic graphs). Path queries are considered in Section 8, and the algorithms presented earlier are adapted to compute them. In Section 9, we present a performance comparison of the algorithms for path queries. We discuss selection queries in Section 10. Graph-based algorithms are compared to nongraph-based ones in Section 11. Finally, our conclusions are presented in Section 12.

¹In fact, we did implement one of the algorithms presented in that paper, called DFTC. The algorithm, which performed uniformly worse than BTC and GDFTC, did not present any points of additional interest, and we have deleted it.

2. GRAPH-BASED ALGORITHMS

A large body of literature exists on algorithms for main-memory transitive closure. Recently, with the realization of the importance of recursion in new database applications, transitive closure has been revisited and reexamined in a data-intensive environment. In this section, we concentrate

on graph-based algorithms, i.e., ones that take into account the graph structure and its properties and compute the transitive closure by traversing the graph.

Almost all such algorithms have the following common characteristics: (a) they are based on a depth-first traversal of the graph, (b) they identify the strong components of the graph using Tarjan's algorithm [26], and (c) they take advantage of the fact that nodes in the same component have exactly the same descendants and that they are descendants of each other.

Based on (c), the graph-based algorithms can compute the transitive closure of a graph so that only a pointer is associated with each node in a strong component, pointing to a common descendent set. In this section, we discuss the graph-based algorithms proposed by Purdom [20], Ebert [10], Schmitz [25], and Eve and Kurki-Suonio [11], and compare them with our algorithms. These algorithms have been proposed primarily for reachability computations, and the basic focus of the discussion is on the computation of the entire transitive closure. Algorithms that have been proposed for path computations or for partial transitive closure, e.g., [14], have not been graph-based and so are not being discussed. We first present some definitions.

2.1 Notation and Basic Definitions

In this paper we discuss only directed graphs, so we use the term graph to refer to a directed graph. We assume that a graph G is specified as a set of nodes V and, for each node i, the set of its children, E_i = {j | (i, j) is an arc of G}. For an arc (i, j), node i is called the source (or tail) and node j the destination (or head) of the arc. Without loss of generality, we assume that G has no self-loops, i.e., for all i, i ∉ E_i. The transitive closure of G is denoted by G*: for all i and j, (i, j) is an arc of G* if and only if there is a path from i to j in G. The strong component of a node i is defined as V_i = {i} ∪ {j | (i, j) and (j, i) are arcs of G*}; a strong component is nontrivial if it contains at least two nodes. The condensation of G, G_c, has the strong components of G as its nodes, and there is an arc from V_i to V_j in the condensation if and only if there is an arc from i to j in G. The set of descendants of a node i in the transitive closure is S_i = {j | (i, j) is an arc of G*}.

As mentioned in point (a) above, most graph-based algorithms perform a depth-first traversal of graphs, so we review some definitions relevant to it. Depth-first traversal induces a spanning forest on the graph based on the order in which nodes are visited. If we assume that the main routine of depth-first traversal is visit(i) for a node i, then there is an arc (i, j) in the spanning forest if there is a call to visit(j) during the execution of the call visit(i). An arc (i, j) in the graph G is called a tree arc if it belongs in the spanning forest. An arc (i, j) in G but not in the spanning forest is called a forward arc, a back arc, or a cross arc, if in the spanning forest j is a descendant of i, j is an ancestor of i, or j is not related to i with an ancestor-descendant relationship, respectively. For every strong component, the node r on which visit(r) is first called is the root of the strong component.

2.2 Summary of Algorithms

The goal of this subsection is not to present any of the original algorithms in detail, but rather to give an abstract description of them and their main differences, so that the implications on performance can be understood. BTC and GDFTC are described in detail in later sections of the paper. Detailed expositions of the remaining algorithms can be found by the interested reader in the references. In the descriptions that follow, one should pay special attention to the fact that BTC and GDFTC form two extreme points in a spectrum of possibilities for when descendent sets of children are added to those of parents, with Schmitz somewhere in the middle. This is important because it allows the conclusions of a performance evaluation of these three algorithms to be used as a basis for a qualitative understanding of how other graph-based algorithms are likely to perform as well.
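As a concrete reference for the definitions of Section 2.1, the following sketch (ours, not from the paper; the example graph and function name are illustrative) shows the children-set representation E_i and the descendent sets S_i of G* computed by a plain traversal. Note that nodes 2 and 3 below form a strong component, so each is a descendant of the other, as point (c) above exploits.

```python
# Hypothetical illustration of Section 2.1: children sets E_i and the
# descendent set S_i = {j | there is a path from i to j}.

def descendants(E, i):
    """Compute S_i by an iterative depth-first search over the children sets."""
    seen = set()
    stack = list(E.get(i, ()))
    while stack:
        j = stack.pop()
        if j not in seen:
            seen.add(j)
            stack.extend(E.get(j, ()))
    return seen

# A small graph: 1 -> 2, 1 -> 4, 2 -> 3, 3 -> 2 (so {2, 3} is a strong component).
E = {1: {2, 4}, 2: {3}, 3: {2}, 4: set()}
print(descendants(E, 1))   # S_1
print(descendants(E, 2))   # S_2: contains 2 itself, since 2 lies on a cycle
```

Nodes in the same strong component get identical descendent sets, which is why the algorithms surveyed here can share one physical set per component.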

BTC. This algorithm uses Tarjan's algorithm as a first pass to construct a topological ordering of nodes and to identify the strong components of the graph. Additionally, that pass can be used to physically cluster the relation in reverse topological order, with nodes in descendent sets arranged in topological order. This improves the performance of a second pass, in which the descendants of all nodes are found in reverse topological order. An optimization called "marking" is used to avoid the addition of a descendent set in many cases where earlier descendent set additions are guaranteed to have added the given set. Because of the two-pass structure of the algorithm, descendent set additions are deferred as much as possible.

GDFTC. This algorithm defines the opposite end of the spectrum from BTC in that descendent set additions are performed as early as possible. When returning from a child along a tree arc or an intercomponent cross-arc, the child's descendent set is added to the parent's set immediately, thereby eliminating the need to retrieve these sets subsequently to perform this addition. A rather complicated stack mechanism ensures that, for all nodes in a strong component, when returning from a child (along a forward or intercomponent cross-arc), the child's descendent set is added to the set of a representative of the strong component. Thus, additions are never deferred.

Purdom. Purdom subsequently proposed an algorithm that is similar to BTC [20]. It is

It is

based upon computing a topological sort of the condensation graph prior to computing the closure. The main difference with respect to BTC is the absence of marking; the implementation of BTC also incorporates some important optimizations that increase the effectiveness of marking and take advantage ACM

of the topological

TransactIons

on Database

sort for physical

Systems,

Vol

18, No

clustering.

3, September

1993,

Transitive Closure Algorithms Eve and Kurki-Suonio. to a node i and

i after

j are in the

the children above the

the

same

root

strong

comprise

the

component

descendants

of j

are

[11].

component, in the

to Tarjan’s

First,

if

i and

added

observed

j, node j is still

nodes

modifications

closure.

(similarly identified,

a child

of the root of a strong

following

transitive

Eve and Kurki-Suonio

processing

on returning

after

if and only if processing

the nodes on the stack

strong

component.

algorithm

in

j are in different

to the

that

on the stack

Further,

descendants

strong

of i after

to GDFTC). Second, when the root of the descendants of each node in the strong

all

that

They

order

517

.

are

proposed

to compute

the

components, visiting

j

the from

i

a strong component component are added

is to

the descendants of the root (similarly to Schmitz). There are two potential redundancies in the algorithm that affect performance. First, the algorithm propagates descendent sets even when returning from forward arcs although this is unnecessary. Second, if there is an arc (j, k) such that j is in a nontrivial strong component and k is in a different component, k is added to S1, by the first for

the

root

modification Ebert.

modification

of j’s

suggested

another

traversal

but when

of the graph

returning

from

a child,

cross arc, the descendants parent [10]. This algorithm ing no additions identical dant ing

operations

gated

via

the

addition

set constructed

of S1, by the

second

in Ebert

every

tree

of

Tarjan’s

to identify

if the arc is a tree

arcs. For

For

cyclic

in that

arc in the

to the descendent

modification

is performed

algorithm:

strong

a

components,

arc or an intercomponent

of the child are added to the descendants of the improves upon Eve and Kurki-Suonio by perform-

on forward

to Dag _ DFTC.

from

and also to the descendent

component,

above. Ebert

depth-first

above,

strong

acyclic

graphs,

graphs,

the Ebert

however,

there

descendent

sets are propagated

component

until

they

are

algorithm

are many after

is

redunreturn-

eventually

propa-

set of the root.

Schmitz. Schmitz's modification of Tarjan's algorithm is based upon the fact that strong components are identified in reverse topological order, and that all nodes in the strong component are on the stack above the root node when the root is identified [25]. The modification to compute the transitive closure is essentially to construct the descendent set of the root by adding the descendent sets of all children of nodes in the component when the root is identified. Schmitz's algorithm also detects forward arcs and ignores them; thus, the first redundancy of Eve and Kurki-Suonio's algorithm is avoided. Finally, Schmitz uses an optimization that is similar to marking over the condensation graph, although it is not in general as flexible as marking in BTC, due to the two-pass nature of BTC versus Schmitz's one-pass structure. Further, since no additions are made to the descendent sets of nodes in a strong component until the root is identified, the second redundancy of Eve and Kurki-Suonio's algorithm is also avoided, as well as the redundancy of Ebert's algorithm. Like Eve and Kurki-Suonio's algorithm, however, Schmitz has the potential cost of retrieving descendent sets that may not be in memory. For Schmitz, these are sets of children of nodes on the stack above the root node; for Eve and Kurki-Suonio, these are sets of nodes on the stack above the root. Thus, both algorithms are intermediate between GDFTC and BTC in terms of when descendent set additions are carried out: additions are not done eagerly, but they are not deferred to a second pass either. Instead, they are deferred until the entire strong component containing the node is identified.

Schmitz also proposed a variant of his algorithm in which what he refers to as an arc basis of the graph is computed and used, i.e., a minimal subset of the arcs whose transitive closure is equal to that of the original graph. We have not explicitly studied this variant. Nevertheless, one of the techniques used in the implementation of BTC achieves essentially the same effect; thus, an upper bound on the cost improvement of this variant can still be derived indirectly (Section 6.1).

2.3 Comparison of Algorithms

The basic BTC algorithm can be seen as a refinement of Purdom's seminal algorithm. The marking optimization and the physical clustering that we propose yield significant improvements; BTC clearly dominates Purdom's algorithm. The Eve and Kurki-Suonio algorithm is uniformly dominated by the Schmitz algorithm: both perform identical additions for acyclic graphs, and for cyclic graphs Eve and Kurki-Suonio performs all the additions performed by Schmitz and usually more. (A comparison in terms of CPU time for a main-memory implementation, carried out by Schmitz, corroborates this observation.) The Ebert algorithm is identical to Dag_DFTC for acyclic graphs, and it is almost identical to GDFTC for cyclic graphs. For cyclic graphs, it is not clear how the two compare in terms of additions; however, since more of the work in GDFTC is carried out in terms of vector operations (despite its more complex stack mechanism), we expect that the I/O performance of GDFTC is better than that of Ebert. Based on the above observations, we have limited ourselves to a comprehensive comparison of BTC, GDFTC, and Schmitz only. The detailed results of this performance comparison are presented in Sections 6 and 7.

3. THE NEW TRANSITIVE CLOSURE ALGORITHMS

In this section we present in detail several new transitive closure algorithms based upon depth-first graph traversal.

3.1 A Marking Algorithm

We first present a simple transitive closure algorithm that introduces a technique called marking. Intuitively, if a descendent set contains a marked node, it also contains the children (but not necessarily all descendants) of that node. In the following, descendent set S_i is partitioned in two sets M_i and U_i that can be thought of as the marked and unmarked subsets of S_i.

proc Closure (G)
Input: A graph G represented by children sets E_i, i = 1 to n.
Output: S_i = M_i (U_i = ∅), i = 1 to n, denoting G*.
(1) {for i = 1 to n do U_i := E_i; M_i := ∅ od
(2) for i = 1 to n do
(3)   while there is a node j ∈ U_i − {i} do M_i := M_i ∪ M_j ∪ {j}; U_i := U_i ∪ U_j − M_i od
(4) od}

LEMMA 3.1. If j ∈ M_i, then E_j ⊆ M_i ∪ U_i.

PROOF. Whenever a node j is added to M_i, M_j ∪ U_j is also added to M_i ∪ U_i. The claim follows by noting that initially U_j = E_j, and that for all j, M_j ∪ U_j is monotonically increasing, so E_j ⊆ M_j ∪ U_j. □

THEOREM 3.2. Algorithm Closure correctly computes the transitive closure of a graph G.

PROOF. We note that initially U_i = E_i and M_i = ∅, and that a node j is added to M_i ∪ U_i only if j ∈ E_i or there is some node k such that k ∈ E_i and j ∈ M_k ∪ U_k; it follows that all such nodes are in S_i. Conversely, by applying Lemma 3.1, we obtain that every descendant of i is eventually added to M_i ∪ U_i, and the computation for node i is completed when U_i ⊆ {i}. To see that the algorithm terminates, we note that for all i, M_i ∪ U_i is monotonically increasing and bounded by the set of nodes of the graph. □

We note that Schmitz's algorithm achieves the effect of marking only over the condensation of the graph. (For the interested reader, this is the optimization that derives from Lemma 2 of [25].)

3.2 Depth-First Traversal to Number Nodes

In the following sections, we need a numbering of the nodes of the graph with the property that all descendants of a node numbered m have a lower number than m, i.e., a topological order. In the presence of cycles no such numbering exists, so we introduce an approximation to it that is obtained by ignoring back arcs. That is, in the acyclic graph obtained from G by ignoring back arcs, all descendants of a node numbered m have a lower number than m. The depth-first numbering algorithm is presented below. The numbering is stored in a global array popped[ ].

proc Number (G)
Input: A graph G represented by children sets E_i, i = 1 to n.
Output: Graph G with nodes numbered.
(1) {vis := 1;
(2) for i = 1 to n do visited[i] := 0; popped[i] := 0 od
(3) while there is some node i s.t. visited[i] = 0 do visit(i) od}
(4) proc visit(i)
(5) {visited[i] := 1;
(6) while there is j ∈ E_i s.t. visited[j] = 0 do visit(j) od
(7) popped[i] := vis; vis := vis + 1;}

The following lemma [4] identifies an important property of the spanning forest induced by depth-first traversal.

LEMMA 3.3. Let G1 be a strong component of a graph G. Then, the vertices of G1 together with those of its arcs that are common to the spanning forest of G form a tree. The root of this tree is the root of the strong component.

While we have presented the simpler algorithm Number for ease of exposition, in the sequel we use Tarjan's algorithm [26], suitably modified, to compute the numbering. Tarjan's algorithm is easily modified to this end, and it can also identify (the root of) the strong component of each node in another array, root. We refer to the modified algorithm as Modified_Tarjan.

3.3 Algorithm BTC

A simple-minded version of our first algorithm is a straightforward combination of the two ideas presented in the previous sections; algorithm Closure is simply run after numbering the nodes in reverse topological order (modulo back arcs). This version is denoted by BTC'.² For ease of presentation, we assume the existence of a procedure node_popped(i), which returns the node k such that popped[k] = i. Note that, when Closure is run on an acyclic graph G following such an ordering, the equalities S_j = M_j and U_j = ∅ hold whenever a descendent set S_j is added to a set S_i. This is not true, however, when the graph is cyclic. For example, when a set S_j is added to S_i and (i, j) is a back arc, there exists a node k such that k ∈ U_j and k is an ancestor of i in the spanning forest imposed by the numbering.

proc BTC' (G)
Input: A graph G represented by children sets E_i, i = 1 to n.
Output: S_i = M_i (U_i = ∅), i = 1 to n, denoting G*.
(1) {Modified_Tarjan(G); /* First Pass */
(2) for i = 1 to n do U_i := E_i; M_i := ∅ od
(3) for i = 1 to n do /* Second Pass */
(4)   I := node_popped(i);
(5)   while there is a node j ∈ U_I − {I} do M_I := M_I ∪ M_j ∪ {j}; U_I := U_I ∪ U_j − M_I od
(6) od}

Modified_Tarjan also computes the array root, which enables an important optimization: since all nodes in a strong component have the same set of descendants, we can construct the descendent set for the root node alone.

²In our earlier paper [13], we referred to the algorithm presented above as algorithm BTC'. The only difference is the use of algorithm Modified_Tarjan rather than algorithm Number to number the nodes; the resulting algorithm is essentially algorithm Closure run on the suitably renumbered graph.
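The two-pass structure of BTC' can be illustrated with a small executable sketch. This is our own simplification, not the paper's implementation: it assumes an acyclic graph (so marking and strong components can be ignored), numbers nodes by depth-first finishing order as in algorithm Number, and then builds descendent sets in that order, so each child's set is complete before any parent uses it. The names number and btc_prime are ours.

```python
# Sketch of BTC' for the acyclic case: pass 1 numbers nodes by DFS
# finishing order ("popped"); pass 2 builds descendent sets in that order.

def number(E, nodes):
    """popped[i] = depth-first finishing order of node i (1-based)."""
    popped, visited, counter = {}, set(), [0]
    def visit(i):
        visited.add(i)
        for j in E.get(i, ()):
            if j not in visited:
                visit(j)
        counter[0] += 1
        popped[i] = counter[0]
    for i in nodes:
        if i not in visited:
            visit(i)
    return popped

def btc_prime(E, nodes):
    """Second pass: process nodes in increasing popped order (children first)."""
    popped = number(E, nodes)
    S = {i: set() for i in nodes}
    for i in sorted(nodes, key=lambda i: popped[i]):
        for j in E.get(i, ()):
            S[i] |= S[j] | {j}      # child's set is already final
    return S

# Acyclic example: 1 -> 2 -> 3, 1 -> 4 -> 3
E = {1: {2, 4}, 2: {3}, 3: set(), 4: {3}}
print(btc_prime(E, [1, 2, 3, 4]))
```

Because every descendant finishes before its ancestors in an acyclic depth-first traversal, each set S_j is read exactly once per parent, which is the deferred-addition behavior whose I/O benefits the paper measures.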

Consider the processing of a node I in the algorithm above. Instead of adding the descendent set of a child j and the child itself (i.e., S_j ∪ U_j ∪ {j}) to S_I, when j ∈ U_I − I, as is carried out in BTC′, we can add it to S_root[I]. By excluding j = I, we avoid the addition of a descendent set to itself.³ After processing a root node (a node I such that I = root[I]), we copy its descendent set to all nodes in its strong component.

If we carry out the addition in this manner, an additional optimization is possible, based upon the following observations. Consider a child j of node I. If j and I are in different strong components, then j has already been marked and processed, so S_j = M_j and U_j = Ø. If j is in the same strong component as I, the nodes in U_j are all in the same strong component too, and the set U_j is either empty or a subset of the root's descendent set: the nodes in U_j are added to the root's set when we process the first node in the marked component from which they are reachable, and every subsequent addition of U_j has no effect. In essence, the use of the subsets of descendent sets is to control the while loop, i.e., to determine for which nodes j the addition is to be carried out. Here, we can use E_I − M_root[I] instead of U_I, since U_I is initialized to E_I and is not added to (based upon the preceding discussion), and all nodes for which the addition is carried out are included in M_root[I].

³This results in adding the descendent set of the root to itself for nontrivial strong components or singleton nodes with self-arcs. This unnecessary addition can be avoided, but we have chosen to keep the presentation simple instead.

We refer to the above changes henceforth as the "root optimization"; collectively, they can improve performance significantly. The root optimization is directly applicable only when we consider reachability computations, since there is no need to distinguish between nodes in a strong component and they can be treated identically. On the contrary, for path computations the optimization is not applicable. Thus, when we consider path computations, we adapt BTC′ rather than BTC (Section 8). Based on the above observations, the resulting algorithm, BTC, is presented below.

proc BTC(G)
Input: A graph G represented by children sets E_i, i = 1 to n.
Output: S_i, i = 1 to n, denoting G*.
(1) {Modified_Tarjan(G);                        /* First Pass */
(2)  for i := 1 to n do S_i := Ø od
     /* Second Pass */
(3)  for i := 1 to n do
(4)    I := node_popped(i);
(5)    while there is a node j ∈ E_I − S_root[I] do
         S_root[I] := S_root[I] ∪ S_j ∪ {j} od
(6)    if I = root[I] then for all k ≠ I s.t. root[k] = I do S_k := S_I od
     od}
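As an illustration, the two passes can be sketched in Python as follows. This is a hypothetical rendering, not the paper's implementation: the function names are ours, the first pass is a standard Tarjan-style traversal producing the pop order and the root array, and the block-structured disk storage of Section 4 is ignored.

```python
# Sketch of the two-pass scheme of BTC. The first pass records, for
# each node, the root of its strong component and the order in which
# nodes are popped; the second pass accumulates descendent sets into
# each component's root and copies the finished set to the members.

def modified_tarjan(E, n):
    """First pass: return (pop_order, root) for graph E on nodes 1..n."""
    index, low, root = {}, {}, {}
    pop_order, stack, counter = [], [], [0]

    def visit(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v)
        for w in E.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w not in root:          # w is still on the stack
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:           # v is the root of its component
            while True:
                w = stack.pop()
                root[w] = v
                pop_order.append(w)
                if w == v:
                    break

    for v in range(1, n + 1):
        if v not in index:
            visit(v)
    return pop_order, root

def btc(E, n):
    """Second pass: descendent sets, accumulated into component roots."""
    pop_order, root = modified_tarjan(E, n)
    S = {v: set() for v in range(1, n + 1)}
    for I in pop_order:                  # children's components are done
        r = root[I]
        for j in E.get(I, ()):
            if j not in S[r]:
                S[r] |= S[root[j]] | {j}
        if I == r:                       # copy the root's set to members
            for k in range(1, n + 1):
                if root[k] == r and k != r:
                    S[k] = set(S[r])
    return S
```

Note that, as in statement (6), only the root's set is constructed; the other members of a strong component receive a copy after the root is processed.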

Note that in the above algorithm, S_j is empty in line (5) whenever j is a nonroot node in the same component as I; further, S_j contains all descendants of j if j and I are in different components.

THEOREM 3.4. Algorithm Basic_TC computes the transitive closure of a directed graph G.

PROOF. We prove the theorem by showing that upon termination of execution of the algorithm, for the root r of every strong component, S_r contains all descendants of r in the transitive closure of G. Note that a node may constitute a trivial strong component by itself. The proof is by induction upon popped[r], where r is the root of the strong component.

Basis. Consider the strong component whose root r has the least value of popped[r] over all strong components. Every child of every node in this component must also be in the component; otherwise, it would belong to a strong component with a root r′ such that popped[r′] < popped[r]. If this is a trivial strong component, r is the only node in it and has no children, so the initial value of S_r = Ø is not modified and the claim holds. If this is a nontrivial strong component, then after the while loop of statement (5) has been executed for I = r (and, before that, for every other node in the component), S_r contains every node in the component; by statement (6), this descendent set is propagated to every node m ≠ r in the component. Thus, the claim holds in this case as well.

Induction Step. Consider a strong component with a root r such that popped[r] = P, and let the inductive hypothesis hold for all strong components with root r′ such that popped[r′] < P. As in the basis proof, we can show that after processing the while loop for I = r, S_r contains all nodes in its strong component. Further, if node k is not in the strong component but is a child of a node j in the strong component, S_r contains S_k ∪ {k}. The reason is that the root of the strong component containing k is processed before j, which is processed before r. Since k is in a strong component with root r′ such that popped[r′] < P, by the induction hypothesis, S_k includes all descendants of k when we process node j; thus, S_k ∪ {k} is added to S_root[j], i.e., to S_r, when j is processed. Hence, S_r contains all descendants of r, and by statement (6) the set is propagated to every node in the component. This concludes the proof of the theorem. □

3.4 Algorithm Dag_DFTC

In algorithm BTC, we numbered the nodes in a first pass before computing the descendent sets in a second pass. In this subsection, we attempt to improve the performance of the algorithm by combining some of the work of the two passes, i.e., numbering the nodes and computing descendent sets simultaneously in a single pass. The following simple algorithm illustrates the idea, although it only works for acyclic graphs.

proc Dag_DFTC(G)
Input: An acyclic graph G represented by children sets E_i, i = 1 to n.
Output: S_i, i = 1 to n, denoting G*.
(1) {for i := 1 to n do visited[i] := 0; S_i := Ø od
(2)  while there is some node i s.t. visited[i] = 0 do visit(i) od
    }
(3) proc visit(i)
(4) {visited[i] := 1;
(5)  while there is some j ∈ E_i − S_i do
       if visited[j] = 0 then visit(j);
       S_i := S_i ∪ S_j ∪ {j}
     od
    }

We state the following theorem without proof.

THEOREM 3.5. Algorithm Dag_DFTC computes the transitive closure of an acyclic graph G.
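A direct Python rendering of the listing above may make the recursion clearer. This is our sketch, not the paper's code, and it assumes the input graph is acyclic, as the algorithm requires.

```python
# Dag_DFTC: depth-first transitive closure of an acyclic graph.
# visit(i) completes the descendent sets of unvisited children before
# folding them into S[i], so each set is built exactly once.

def dag_dftc(E, n):
    visited = {i: False for i in range(1, n + 1)}
    S = {i: set() for i in range(1, n + 1)}

    def visit(i):
        visited[i] = True
        for j in E.get(i, ()):
            if j in S[i]:            # j already accounted for, skip it
                continue
            if not visited[j]:
                visit(j)
            S[i] |= S[j] | {j}       # eager addition of the child's set

    for i in range(1, n + 1):
        if not visited[i]:
            visit(i)
    return S
```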

The basic intuition in the above algorithm is that, when both the descendent set of the parent and the descendent set of the child are in memory, the descendent set of the child can be added to that of the parent at that time, i.e., when we pop up from the processing of the child. If the descendent set of the child is complete at that point, we avoid possibly fetching one or both of these descendent sets later, in the second phase of BTC, to perform the addition. The above intuition of performing an "eager addition" is used to derive an algorithm that works on arbitrary graphs as well, which is presented in the next subsection.

3.5 Algorithm GDFTC

In this section, we develop an algorithm, GDFTC, that generalizes Dag_DFTC to arbitrary graphs. Like BTC, it avoids duplication of effort by generating the descendants of only one of the nodes in a strong component. The basic mechanism of the algorithm is a stack,⁴ which is updated as the nodes of the graph are visited.

⁴Our version of the stack differs from the stacks used in other graph-based algorithms for transitive closure in that it is a stack of descendent sets of nodes in nontrivial strong components, as opposed to a stack of nodes.

The algorithm constructs a descendent set for (the root of) each strong component. During the processing of some nontrivial strong component, the stack contains descendent sets associated with (potentially distinct) "components" of nodes. If we discover that some of these are in fact part of the same strong component, the corresponding stack frames are merged to reflect this. Every stack frame f contains two lists: nodes[f] contains the nodes that are known to belong to (part of) some nontrivial strong component, and list[f] contains descendants of the nodes in nodes[f]. When the root of the strong component is identified, the root frame is at the top of the stack and is popped, and the descendent set of the root is assigned to all the members of the component.

The algorithm traverses the graph in depth-first order, through the process of calls to the visit( ) routines, visiting each node once. The action taken on each arc (i, j) depends on its type with respect to the depth-first spanning tree of the graph, i.e., on whether it is a tree, cross, or back arc (forward arcs are ignored). The arc type is identified with the help of the values of visited[ ] and popped[ ] for i and j. (Note that the array visited contains integer elements in this algorithm.) In all cases, however, the action is based on two additional pieces of information: first, whether i and j are in the same or in different strong components; second, whether j is the first child of i through which i is known to be part of a (nontrivial) strong component. Both questions are resolved based on the values of root[ ] for i and j. For any node i, root[i] ≤ n while i is known to be part of a strong component whose processing has not finished yet, and root[i] = n + 1 otherwise; the value of root[i] is used to pass this information to the later stages of the processing. Thus, for the first question, the value of root[j] should be equal to n + 1 if i and j are in different strong components (processing of the component of j is over). For the second question, the value of root[i] should be equal to n + 1 if i is not known to be part of a strong component.

Based on the specific case identified from the above pieces of information, the algorithm takes the following actions. For all tree and cross arcs, if i and j are in different strong components, the descendants of j are propagated to the descendants of i. This is the action when operating on acyclic parts of the graph and is straightforward. The bulk of the algorithm, which involves stack manipulation, addresses the case when i and j are in the same strong component. Tree arcs are the most interesting in this case. The top stack frame always corresponds to j. If j is the first child of i through which i is detected to be part of a strong component, then i is incorporated in the top stack frame. Otherwise, the second stack frame from the top corresponds to i and is merged with the top frame. In both cases, root[i] is updated appropriately. Cross and back arcs are treated almost identically. If j is the first child of i through which i is detected to be part of a strong component, then a new stack frame is pushed on the stack and becomes associated with i. Otherwise, only root[i] is updated appropriately (the top stack frame is the one corresponding to i), in slightly different ways for cross and back arcs.

Algorithm GDFTC is given below. The notation L1 := L1 ∘ L2 is used to indicate that list L2 is concatenated to list L1 by switching a pointer, at O(1) cost. For the special case when L1 is Ø (that is, when list L2 is to be assigned to the empty list L1) we use the notation L1 := •L2. In contrast, the notation L1 := L1 ∪ L2 is used to denote that a copy of L2 is inserted into L1.
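The distinction between the three list operations can be sketched as follows. This is our illustration with a simple singly linked list, not the paper's code; the class and method names are ours.

```python
# L1 := L1 ∘ L2 splices L2 onto L1 by switching a pointer (O(1));
# L1 := •L2 makes an empty L1 share L2's cells; L1 := L1 ∪ L2 inserts
# a *copy* of L2's elements into L1 (O(|L2|), with duplicates skipped).

class Cell:
    def __init__(self, val, nxt=None):
        self.val, self.nxt = val, nxt

class List:
    def __init__(self):
        self.head = self.tail = None

    @staticmethod
    def of(*vals):
        lst = List()
        for v in vals:
            c = Cell(v)
            if lst.tail:
                lst.tail.nxt = c
            else:
                lst.head = c
            lst.tail = c
        return lst

    def concat(self, other):        # L1 := L1 ∘ L2: one pointer switch
        if other.head is None:
            return
        if self.tail:
            self.tail.nxt = other.head
        else:
            self.head = other.head
        self.tail = other.tail

    def union(self, other):         # L1 := L1 ∪ L2: copy, no duplicates
        seen = set(self.values())
        node = other.head
        while node:
            if node.val not in seen:
                self.concat(List.of(node.val))
                seen.add(node.val)
            node = node.nxt

    def values(self):
        out, node = [], self.head
        while node:
            out.append(node.val)
            node = node.nxt
        return out
```

After concat the two lists share cells, which is why GDFTC reserves it for lists that are merged permanently, while union is used where an independent copy is needed.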

proc GDFTC(G)
Input: A graph G represented by children sets E_i, i = 1 to n.
Output: S_i, i = 1 to n, denoting G*.
/* list[f]     descendants of nodes in the strong comp. of stack frame f. */
/* nodes[f]    nodes in the strong comp. of stack frame f.                */
/* top         pointer to the top of the stack.                           */
/* visited[i]  order in which visit(i) is called.                         */
/* root[i]     potential root of the strong comp. in which i belongs.     */
/* popped[i]   1 if the call to visit(i) has returned.                    */
(1) {vis := 1; top := 0; visited[n + 1] := n + 1;
(2)  for i := 1 to n do visited[i] := popped[i] := 0; root[i] := n + 1;
       list[i] := nodes[i] := S_i := Ø od
(3)  while there is some i s.t. visited[i] = 0 do visit(i) od}

proc visit(i)
(4) {visited[i] := vis; vis := vis + 1;
(5)  for each j ∈ E_i do             /* each j considered exactly once */
(6)    if j ∉ S_i then               /* body of loop not executed when j ∈ S_i */
(7)      if visited[j] = 0 then {    /* (i, j) is a tree arc */
           visit(j);
(8)        if root[j] = n + 1        /* i, j in different strong components */
(9)        then S_i := S_i ∪ S_j ∪ {j}
(10)       elseif root[i] = n + 1
             /* first detection of i being in a strong comp. (through j) */
(11)       then add_in_top_frame(i, j)
           else merge_top_two_frames(i, j)}
(12)     elseif popped[j] = 1 then   /* (i, j) is a cross arc */
(13)       if root[j] = n + 1        /* i, j in different strong components */
(14)       then S_i := S_i ∪ S_j ∪ {j}
(15)       else {if root[i] = n + 1 then
               /* first detection of i being in a strong comp. */
(16)           push_new_stack_frame(i, j);
(17)         update_root_non_back(i, j)}
(18)     else {if root[i] = n + 1 then  /* (i, j) is a back arc */
               /* first detection of i being in a strong comp. */
               push_new_stack_frame(i, j);
(19)         update_root_back(i, j)}
     od
(20)  if i = root[i] then {
        /* Propagate descendants of root to the nodes of the strong comp.
           of the root. */
(21)    for each j ∈ nodes[top] do S_j := S_j ∪ nodes[top]; root[j] := n + 1 od;
        top := top − 1}
(22)  popped[i] := 1
     }

proc add_in_top_frame(i, j)
(23) {list[top] := list[top] ∘ S_i; S_i := •list[top];
(24)  nodes[top] := nodes[top] ∪ {i}; root[i] := root[j]}

proc merge_top_two_frames(i, j)
(25) {list[top − 1] := list[top − 1] ∪ list[top];
(26)  nodes[top − 1] := nodes[top − 1] ∘ nodes[top]; top := top − 1;
(27)  update_root_non_back(i, j)}

proc push_new_stack_frame(i, j)
(28) {top := top + 1; list[top] := •S_i; nodes[top] := {i}}

proc update_root_non_back(i, j)
(29) {if visited[root[j]] < visited[root[i]] then root[i] := root[j]}

proc update_root_back(i, j)
(30) {if visited[j] < visited[root[i]] then root[i] := j}

Fig. 1. A strongly connected graph.

We prove that GDFTC is correct in an appendix. As mentioned above, one important aspect of the algorithm is that duplication of effort is avoided by constructing the descendent list of just one node (the root) of a strong component and subsequently copying this list for each node in the component. One of the reasons for the complexity of the algorithm is the need to keep track of strong component information while constructing descendent lists on the fly.

We illustrate the operation of this algorithm on an example in which a single strong component is discovered in a piecemeal fashion. Figure 1 shows the input graph. The whole graph is one strong component. Assume that the nodes are visited in the order a, b, c, d, e, f, g, and h. Thus, the back arcs (d, b) and (g, e) are discovered before (h, a) is. This results in two potentially independent components being pushed on the stack, namely, {b, c, d} and {e, f, g}. After (h, a) is discovered, a third level is added to the stack, because there is no way of knowing that all of the nodes belong to the same component. This is discovered when we pop up back to f again; statement (12) in the algorithm is executed, and the two frames at the top (corresponding to a and e, respectively) are merged into one. When c is reached, similar actions are taken, so that when a, the root, is reached, all its descendants are correctly found in the top list.
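The net effect of the piecemeal discovery can be checked against the specification: when the whole graph is one strong component, every node's descendent set must equal the full node set. The sketch below uses a hypothetical eight-node stand-in in the spirit of Figure 1 (a ring a..h plus two back arcs), not the figure's exact arc set, and computes reachability by plain BFS, since the point here is the expected result rather than the stack mechanism.

```python
# Every node of a strongly connected graph must reach every node
# (including itself, through the cycle).

from collections import deque

def descendants(E, src):
    seen, q = set(), deque(E.get(src, ()))
    while q:
        v = q.popleft()
        if v not in seen:
            seen.add(v)
            q.extend(E.get(v, ()))
    return seen

nodes = list("abcdefgh")
E = {a: [b] for a, b in zip(nodes, nodes[1:])}   # chain a -> b -> ... -> h
E["h"] = ["a"]                                   # back arc closing the ring
E["d"] = ["e", "b"]                              # extra back arc (d, b)
E["g"] = ["h", "e"]                              # extra back arc (g, e)

all_nodes = set(nodes)
assert all(descendants(E, v) == all_nodes for v in nodes)
```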

4. IMPLEMENTATION OF ALGORITHMS

This section describes the main aspects of our implementation of the algorithms, analyzing the specific choices that we made when multiple alternatives were available, so that the results of a performance evaluation may be clearly interpreted. These aspects include storage structures for graphs, physical clustering of descendent lists, memory management, and duplicate elimination. Some of the techniques that we present below, or closely related ones, have also been used by others for implementing transitive closure algorithms [3, 14, 22].

4.1 Storage Structures

We represent and store graphs in several forms. First, both the input and output graphs of the algorithms are stored in a plain tuple format, as compactly as possible. Tuples with the same source attribute (arcs with the same tail) are stored consecutively in the file, but otherwise no special structure is assumed.

Second, during the course of the execution of all algorithms, graphs are represented and stored as descendent lists. The restructuring from arc-tuples to descendent lists occurs as part of the first pass of BTC, whereas it is the first step of all other algorithms; we refer to it as the restructuring phase.

To accommodate descendent lists, every page is divided into some number of blocks. Each block can store a constant number of node names (equal to the blocking factor), representing arcs from a common source to the stored nodes. In addition, there is a source index in each page, with one entry for each block in the page; the entry contains the common source of the arcs in the block, a pointer to the block, and a bit indicating whether the block is empty or not. Given a page of fixed size, choosing the blocking factor implies the following trade-off: a high blocking factor saves space for a long descendent list, since more arcs can be stored in each block and the common source is factored out and stored only once for each set of descendants that fit in a block; on the other hand, a high blocking factor wastes space for a short descendent list, since a large portion of a block remains empty and unused. This trade-off will become clear from the results of our experiments.
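The trade-off can be made concrete numerically. The sketch below is our illustration, with hypothetical list lengths; a list of length L occupies ceil(L / B) blocks of B slots each, so a large B wastes the unused slots of the last block on short lists, while a small B pays the per-block index entry more often on long ones.

```python
# Space behavior of the blocking factor B for a set of descendent lists.

from math import ceil

def blocks_used(list_lengths, B):
    """Total blocks (hence source-index entries) needed."""
    return sum(ceil(L / B) for L in list_lengths if L > 0)

def wasted_slots(list_lengths, B):
    """Slots allocated but left empty in partially filled blocks."""
    return sum(ceil(L / B) * B - L for L in list_lengths if L > 0)

short_lists = [2, 3, 1, 4]      # short descendent lists
long_lists = [40, 55, 60]       # long descendent lists

# A high blocking factor wastes space on short lists...
assert wasted_slots(short_lists, 15) > wasted_slots(short_lists, 5)
# ...while needing fewer blocks (fewer index entries) on long ones.
assert blocks_used(long_lists, 15) < blocks_used(long_lists, 5)
```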

Third, whenever a descendent list is processed in memory, i.e., whenever nodes are copied from it or into it, its contents are also replicated in the form of an adjacency vector. The vector has an entry for every node in the graph, which is equal to 1 if the corresponding node has been identified as a descendant of the source of the corresponding descendent list and is equal to 0 otherwise. This allows for fast duplicate elimination, since the descendent list does not have to be searched before adding a node to it: a straight lookup at the adjacency vector is enough (Section 4.4). The size of the adjacency vectors is calculated in the first steps of each algorithm, when the graph is transformed from tuples to descendent lists, at which time the number of nodes is counted.

In addition to the above, some useful information is maintained in memory, in an array with an entry for each node in the graph. Each such entry contains the following items: (a) the outdegree of the node, (b) the root of the strong component to which the corresponding node belongs, (c) the rank of the node in the topological order obtained by the depth-first traversal of the graph, (d) an indication of whether the node has been visited and processed or not, (e) a pointer to the adjacency vector of the node (if it is in memory), and (f) the page number of the file on disk where the descendent list of the node is stored. For leaves, the last entry is equal to a particular reserved value, making it unnecessary to store useless empty descendent lists, thus saving space and also many disk accesses.
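The adjacency-vector lookup described above can be sketched as follows; this is our rendering of the mechanism (detailed further in Section 4.4), with hypothetical names.

```python
# Copying descendent list S_j into S_i at O(1) cost per node: one bit
# test in i's adjacency vector instead of a search of the list itself.

def copy_list(S_i, vec_i, S_j):
    """Append the nodes of S_j not already in S_i, updating i's vector."""
    for k in S_j:
        if not vec_i[k]:       # straight lookup, no list search
            vec_i[k] = 1
            S_i.append(k)

n = 8
S_i, S_j = [2, 5], [1, 2, 7]
vec_i = [0] * (n + 1)
for k in S_i:                  # vector replicates the list's contents
    vec_i[k] = 1
copy_list(S_i, vec_i, S_j)
assert S_i == [2, 5, 1, 7]     # 2 was skipped as a duplicate
```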

4.2 Descendent List Ordering

In the implementation of the BTC algorithm, we take advantage of the information obtained in the first pass to expedite the second pass, in which the transitive closure is computed. This is not possible with the remaining algorithms that we have considered. For cyclic graphs, strong components are identified in the first pass, and the second pass essentially computes the transitive closure of the condensation graph of the input, improving performance significantly. This effect cannot be realized by Schmitz or GDFTC.

Optimization is also achieved in two other ways, both made possible by the ordering of nodes obtained in the first pass of BTC. First, descendent lists are constructed and stored to disk in the reverse topological order of their source nodes, which is the order in which the nodes are processed in the second pass. This has the effect that a parent and its children are often processed consecutively and their lists are likely to be on the same page; lists that contain many nodes in common are likely to be close in the file as well. Hence, this interdescendent list ordering results in very high hit ratios in the buffer pool and thus in less I/O. The above technique does not help when the number of children of each node in the graph is so large that only one children list or less fits on a page.

Another benefit of the first pass of BTC is that the topological ordering of the nodes can be used to reduce the production of duplicates. Specifically, consider an arc (i, k) in G and assume that there is also a path between i and k whose first arc is (i, j). Clearly, the inequalities i < j < k hold in the topological order of G. If j is processed before k when dealing with the children list of i (statement (5) of BTC), then k will be found in S_i when its turn comes, and no action will be taken on it. If k is processed first, however, then j will have to be processed as well, and the descendants of k will essentially be derived twice for i. To avoid this unnecessary computation, the nodes in each descendent list produced by the first pass of BTC are stored (and processed) in topological order, i.e., j is stored first in the above example. This intradescendent list ordering has a considerable effect on I/O and CPU performance. The above ordering has also been used by Agrawal and Jagadish in their Hybrid algorithm [3].

As we mentioned earlier, the above data orderings cannot be used in Schmitz or GDFTC, because of their "on-the-fly" type of processing, which constructs descendent lists before the necessary information is available. The effect of the intradescendent list ordering, however, can also be achieved by computing the arc basis of a graph and using that for the actual transitive closure computation. As mentioned in Section 2, Schmitz proposed that as a variant of his algorithm. Such a preprocessing step, however, adds some nontrivial cost to the overall execution of either Schmitz or GDFTC, as opposed to the first pass of BTC, whose added cost is negligible, since it is a by-product of the algorithm and is accounted for in the numbers that we present. We also note that the first pass of BTC adds either nothing or very little to the complexity of the algorithm, whereas the arc-basis preprocessing step potentially costs more.
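The duplicate-reduction effect of the topological ordering described above can be sketched on the arc pattern just discussed: arcs (i, j), (j, k), and (i, k), with i < j < k in topological order. This is our illustration with hypothetical node names; processing i's children in topological order finds k already derived and skips it, while the reverse order folds k's descendants in twice.

```python
# One execution of statement (5) of BTC for a node's children list,
# counting how many child sets are actually folded in.

def process_children(children, S):
    S_i, additions = set(), 0
    for j in children:
        if j not in S_i:               # skip nodes already derived
            S_i |= S[j] | {j}
            additions += 1
    return S_i, additions

# Completed descendent sets of the children (k reaches x and y; j
# reaches k and hence x and y).
S = {"k": {"x", "y"}, "j": {"k", "x", "y"}}

topo, topo_adds = process_children(["j", "k"], S)   # topological order
rev, rev_adds = process_children(["k", "j"], S)     # reverse order

assert topo == rev == {"j", "k", "x", "y"}          # same result...
assert topo_adds == 1 and rev_adds == 2             # ...but less work
```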

4.3 Memory Management

Several data structures are assumed to remain in main memory throughout

the execution of all algorithms. The most important such structure is the array mentioned in Section 4.1. This is in addition to the buffer pool, which is used to store the following: (a) arc-tuples, for the initial input and final output of the algorithms, (b) descendent lists, and (c) adjacency vectors. For all algorithms, depending on the execution phase, a buffer pool of size M is divided among the above types of data as follows.

Restructuring:
  M − 1 pages for input arc-tuples
  1 page for the constructed descendent lists

Main algorithm:
  1 page for output arc-tuples
  M − 2 pages for descendent lists
  1 page for adjacency vectors

During the restructuring, LRU is used as the page replacement policy among the arc-tuple pages. During the main algorithm, LRU is used among the adjacency vectors to manage the space in the single page devoted to them. With respect to the pages storing descendent lists, we have experimented

with two replacement algorithms: LRU and a specialized algorithm that we introduce below, called Least Unprocessed Node Degree (LUND).

LUND works as follows. The descendent lists that are in main memory at any point are divided into two classes. The first class contains complete lists, whose source is a node that has been processed already, i.e., all its descendants have been found. Clearly, any future reference to such a list S_j is via an arc pointing to j, with the goal of copying S_j into the descendent list of the source of that arc. The second class contains incomplete lists, i.e., those whose source is either still being processed or has not yet started being processed. Future references to such a list can be due to both arcs coming into the source of the list (for every incoming arc, the list is requested once so that it can be added to the descendent list of the tail of the arc) and arcs going out of the source of the list (for every outgoing arc, the descendants of the head of the arc are requested once so that they can be added to the list).

For every list, a quantity called the Unprocessed Node Degree (UND) is computed: it is the number of requests for the list that are yet to be made, or equivalently, the number of unprocessed arcs incident on the corresponding node. For a complete list, only a fraction of the arcs incident on the source node, namely the unprocessed incoming arcs, contributes to the UND. LUND adds the UNDs of all lists in each page and then chooses the page with the least sum as the victim for replacement.

The intuition behind the LUND policy is twofold. First, most recently used incomplete lists are likely to be needed in memory again soon and should not be paged out; pages that contain a large fraction of incomplete lists are thus unlikely candidates for replacement. Second, the fewer the unprocessed arcs incident on a node, the further away in the future its list should be requested, so the corresponding pages can be paged out without much penalty. Thus, without any further information about the structure of the graph, the UND of a list is used as an indication of how soon the list will be needed again. These assumptions are justified by the results of several

experiments, some of which are discussed in Section 6.2.2. A final issue to consider is related to page splits. If all blocks

of a page are occupied and one of them is full and needs to be expanded, the page must split into two pages. At that point, a decision must be made on how the descendent lists will be divided between the new pages. We have experimented with two approaches. The first one is to randomly divide them between the pages; the second one is to take into account the UND of the source nodes of the lists in the original page and separate those with large UND from those with small UND. (The specific criterion for the UND-based separation is not important; we have experimented with several of them with no major effects on the performance.) The intuition behind the second approach is that nodes with high UND are expected to be accessed frequently in the future. Hence, combining all of them together in a page increases the chances that the page will stay in main memory long enough for much of the processing of these lists to be done without additional I/O. Results of some initial experiments showed that, as expected, the first approach is the preferred one when using LRU, whereas the second approach is the preferred one when using LUND. Therefore, all experiments presented in the results sections are for these combinations.
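The LUND victim selection described above can be sketched as follows; this is our rendering, with hypothetical page contents.

```python
# Each in-memory descendent list carries its UND, the number of
# unprocessed arcs still incident on its node. LUND evicts the page
# whose resident lists have the least UND sum.

def choose_victim(pages):
    """pages: {page_id: [UND of each resident list]} -> page to evict."""
    return min(pages, key=lambda p: sum(pages[p]))

pages = {
    1: [0, 1, 0],    # mostly complete lists, few pending requests
    2: [4, 3],       # incomplete lists, many pending requests
    3: [2, 0, 1],
}
assert choose_victim(pages) == 1   # UND sums: 1, 7, 3
```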

4.4 Duplicate Elimination

In this section, we briefly describe the algorithm that is used for duplicate elimination at linear CPU cost. As mentioned in Section 4.1, when copying nodes from a descendent list S_j to another list S_i, adjacency vectors exist in main memory for both lists. For every node k in S_j, the corresponding entry in the adjacency vector of S_i is checked: if it is equal to 1, no action is taken on k, since it already exists in S_i; if it is equal to 0, the bit is switched and k is added to S_i. This corresponds to O(1) cost for each node in S_j, i.e., to a cost that is linear in the length of S_j. The cost of constructing the adjacency vectors is also linear in the length of the lists, so duplicate elimination is very efficient.

5. PERFORMANCE EVALUATION TESTBED

We implemented

5. PERFORMANCE EVALUATION TESTBED

We implemented several versions of all three graph-based algorithms, BTC, GDFTC, and Schmitz, in C on a VAXstation 3200 running UNIX. The page size and the buffer sizes in our experiments were chosen to match the corresponding sizes of the machine: the page size was chosen to be 2 Kbytes, and the buffer size was chosen based on the UNIX-provided amount of available main memory. With this choice, each page can fit 256 arc-tuples for the input and output representation of graphs, and 30 and 72 blocks with block size 15 and 5, respectively, for the descendent-list representation. Since we do our own buffer replacement, UNIX-provided elapsed times are not meaningful; in all our experiments, therefore, we relied on UNIX-provided CPU times and on our own counting of I/O, based on the buffer management strategy implemented in each case.

For writing the output, we assume that users have the same option in all the algorithms: the descendent set of the root node of a strong component can be written out once, with pointers to it for each other node in the component, or a single copy can be written out for each node in the strong component. This means that the cost of writing the output is the same for all the algorithms, as is the cost of reading the input. With respect to I/O cost, therefore, the numbers presented in this paper do not include the initial cost of reading in the original graph once and the final cost of writing out the transitive closure once. (The only exception is in Section 11.3, where we compare graph-based with nongraph-based algorithms.) All other reads and writes performed by the algorithms during execution are unavoidable and are included in the numbers presented.

There are several interesting parameters that affect the performance of the algorithms. They can be divided into parameters of the implementations of the algorithms and parameters of the data. These are discussed in the following two subsections; a third subsection explains how our input graphs were generated.

5.1 Parameters of Algorithm Implementations

There are three interesting parameters of the algorithm implementations: the number of buffer pages, the buffer replacement policy, and the blocking factor.

ACM Transactions on Database Systems, Vol. 18, No. 3, September 1993.
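Counting I/O inside one's own buffer manager, as the testbed does, can be sketched as follows. This is only an illustrative LRU pool with names of our own choosing; the paper additionally tests the LUND policy, whose definition is given elsewhere in the paper and is not reproduced here:

```python
from collections import OrderedDict

class LRUBufferPool:
    """Fixed pool of M pages with LRU replacement, counting page faults.

    Each fault stands for one page I/O.  This mirrors the testbed's
    practice of counting I/O in its own buffer manager rather than
    trusting OS-level elapsed times.
    """
    def __init__(self, m_pages):
        self.m = m_pages
        self.pages = OrderedDict()   # page id -> frame (most recent last)
        self.faults = 0

    def fetch(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # hit: refresh recency
        else:
            self.faults += 1                     # miss: one page read
            if len(self.pages) >= self.m:
                self.pages.popitem(last=False)   # evict least recently used
            self.pages[page_id] = object()       # stand-in for a page frame
        return self.pages[page_id]

pool = LRUBufferPool(m_pages=3)
for p in [1, 2, 3, 1, 4, 2]:
    pool.fetch(p)
```

In this reference trace the pool of three pages takes five faults, and pages 1, 4, and 2 remain resident at the end.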

532 · Y. Ioannidis et al.

Table I. Parameters of Algorithm Implementations

    Parameter                   Symbol    Values Tested
    Buffer size (pages)         M         10, 20, 50
    Buffer replacement policy             LRU and LUND (f = 0.25)
    Blocking factor             B         5 and 15

The buffer size (denoted by M) was varied considerably. In the results discussed in the following sections, the only values presented are M = 10, 20, and 50, because no interesting phenomena occur beyond 50 pages (the I/O cost drops sharply, as expected). A minimum of ten pages is necessary because all algorithms require at least that much memory to run: all algorithms require that at least two descendent lists can fit in memory at the same time, and for 2000-node graphs this accounts for eight pages in the worst case. In addition, two more pages are needed, one for the adjacency vectors and one for the input arc-tuples.

We experimented with several buffer replacement policies, in particular, LRU and four versions of LUND, with the fraction f being equal to 0.25, 0.5, 0.75, and 1.0, respectively. Among these versions of LUND, the one with f = 0.25 was almost always the best or close to it. Hence, we show the results for LRU and LUND with f = 0.25 only. In each case, the costs presented are for the better of the two replacement policies for that case. Finally, we experimented with two blocking factors, B = 5 and 15. The effect of the value of B depended on the input graph type; this is discussed in detail in Section 6.2.3. However, the relative performance of the algorithms remained unaffected by B, so all results presented in the sections below are for B = 15. The space of parameters tested is summarized in Table I.

5.2 Parameters of Data

All relations used in our experiments contained integer node identifiers, which represent the best case for efficiency. This is without any loss of generality, however, because even if a given relation is not in this form, it can be transformed to it by a single pass over its tuples. In addition, the integers used were random numbers in a specific range, so that the actual values that represented the nodes would not bias the performance of the algorithms. Also, for any specific setting of the values of the parameters described below, all algorithms under comparison were run on the same input graph, so that no differences in the specific choice of node identifiers or other secondary characteristics could affect the results. We show results for both acyclic and arbitrary graphs below. We also experimented with trees, but since those tests did not offer any additional insights beyond what was observed for acyclic graphs, we do not present them.

The following are the parameters that were used to characterize graphs, with the symbols that denote them in parentheses: the number of nodes (N), the outdegree or branching factor of each node (b), and the depth (d). Preliminary experiments with several values of N showed that the main conclusions of this study seem to be unaffected by N. Hence, we only present results for the value N = 2000. (We also studied N = 1000; the trends and analysis with other values of N are the same as for N = 2000.) We experimented exhaustively with several values of b, and we present results for some of them, because graphs at the extremes, such as simple path graphs and complete graphs, are rather well understood in many cases. The depth d of a graph is defined as follows.
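A generator for test graphs characterized by N, b, and d can be sketched as follows. This is an illustration under our own assumptions (a layered construction that guarantees acyclicity); the paper's actual generation procedure is specified in a subsection not fully preserved here and may differ in detail:

```python
import random

def layered_random_dag(n_nodes, b, d, seed=0):
    """Sketch of a generator for acyclic test graphs with N nodes,
    branching factor b, and depth parameter d.

    Nodes are spread over levels 0..d; each node draws up to b arcs to
    nodes at strictly deeper levels, which guarantees acyclicity.  All
    names here are ours, not the paper's.
    """
    rng = random.Random(seed)
    level = {v: rng.randint(0, d) for v in range(n_nodes)}
    arcs = set()
    for v in range(n_nodes):
        deeper = [u for u in range(n_nodes) if level[u] > level[v]]
        for _ in range(b):
            if deeper:
                arcs.add((v, rng.choice(deeper)))
    return arcs, level

arcs, level = layered_random_dag(n_nodes=50, b=3, d=5)
# Every arc goes to a strictly deeper level, so the graph is acyclic.
acyclic = all(level[u] < level[v] for (u, v) in arcs)
```

Arbitrary (cyclic) inputs, which the paper also studies, would additionally require arcs back to shallower levels.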
... visited[j] > 0 and popped[j] = 0 when they are examined, i.e., all arcs (i, j) are back arcs. (In both cases B_i^Ei = E_i and POP_i^Ei = T_i^Ei = ∅; moreover, in the former case E_i = ∅.) Part (b) of the lemma does not apply here, so we only prove part (a). Within the call visit(i), let V denote the set of children of i that have been iterated through in statements (5)-(19) of the algorithm at any time. Modify (a) in the statement of the lemma into (a') so that it reads as follows.

(a') For every node i, after examining all children of i in V ⊆ E_i, one of the following holds:
(a1') frame[i] = n + 1, root[i] = n + 1, and S_i = ∅, or
(a2') frame[i] = top, root[i] = r such that visited[r] = min_{k in V} {visited[k]}, nodes[frame[i]] = {i}, and S_i = list[frame[i]] = ∅.

We prove the above by induction on the size of V, i.e., the number of children of i that have been examined at any time.

Basis. Let |V| = 0. Then the for-loop of statements (5)-(19) has not been executed at all, and therefore frame[i] and root[i] remain as they were before, i.e., equal to n + 1. Moreover, S_i = ∅. Thus, (a1') holds.

Induction Step. Assume that the claim is true after examining c ≥ 0 children. We prove it for c + 1, i.e., |V| = c + 1. Let j be the (c + 1)th child of i, i.e., V_new = V_old ∪ {j}. Then, (i, j) is a back arc. By the induction hypothesis, before examining j in statement (5), either (a1') or (a2') holds. In the former case (j is the first child of i to be examined and c + 1 = 1), the condition in statement (18) is satisfied and its then part is executed, establishing the following:

    frame[i] = top
    nodes[frame[i]] = nodes[top] = {i}
    S_i = list[frame[i]] = ∅

In addition, statement (19) is executed, and because visited[n + 1] = n + 1 and 0 < visited[j] < n, the then part of statement (30) establishes root[i] = j, where visited[j] = min_{k in V} {visited[k]} (vacuously, since V is singleton). Thus (a2'), and therefore (a'), holds. In the latter case (j is not the first child of i to be examined), the test in statement (18) fails, and only statement (19) is executed, possibly updating root[i] as (a2') requires. The remaining clauses of (a2') remain valid by the induction hypothesis. Thus, in this case also, (a2'), and therefore (a'), holds.

After examining all the children of i, (a') holds for V = E_i. Since by (a') i ≠ root[i], statement (21) is skipped, and (a') still holds after visit(i) returns. We have already mentioned that in this case B_i^Ei = E_i and POP_i^Ei = T_i^Ei = ∅, so (a') reduces to (a). Thus the lemma holds for the first node i for which visit(i) returned, and the basis case of the outer induction is proved.

Induction Step. Assume that the lemma is true for all nodes i such that pop[i] ≤ pop for some pop ≥ 1, i.e., for the first pop nodes i for which visit(i) returned. We prove it for the (pop + 1)th. Let h be the popth node for which the call to visit returned and let i be the (pop + 1)th such node. By the depth-first traversal structure of the algorithm, either i is a leaf in the spanning forest of calls to visit (a descendent of a sibling of h, or a member of a different tree in the forest), or i is the father of h and visit(h) was called from within visit(i). We examine the two cases separately.

Assume that i is a leaf in the spanning forest. Then all arcs (i, j), if any, are either back arcs or cross arcs; thus, T_i^Ei = ∅. As in the basis case, since no call visit(j) is issued for any child j of i, part (b) of the lemma does not apply, so we only prove part (a). Within the call visit(i), let V denote the set of children of i that have been iterated through in statements (5)-(19) of the algorithm at any time. Modify (a) in the statement of the lemma into (a'') so that it reads as follows:

(a'') For every node i, after the call visit(i) returns, one of the following holds:
(a1'') frame[i] = n + 1, root[i] = n + 1, and S_i = D_i^V, or
(a2'') frame[i] = top, root[i] = r such that visited[r] = min_{k in B_i^V} {visited[k]}, nodes[frame[i]] = {i}, and S_i = list[frame[i]] = {j | j in POP_i^V and frame[j] = n + 1}.

As in the basis case, we prove the above by induction on the size of V, i.e., the number of children of i that have been examined at any time.

Basis. Let |V| = 0. Then the for-loop of statements (5)-(19) has not been executed at all, and therefore frame[i] and root[i] remain as they were initially, i.e., equal to n + 1. Moreover, S_i = ∅. Thus, (a1'') holds.

Induction Step. Assume that the claim is true after examining c ≥ 0 children. We prove it for c + 1, i.e., |V| = c + 1. Let j be the (c + 1)th child of i, i.e., V_new = V_old ∪ {j}. Arc (i, j) can be a back arc or a cross arc. We treat the two cases separately.

Assume that (i, j) is a back arc, i.e., the condition of statement (17) is satisfied, by Lemma A.2. By the induction hypothesis, before examining j in statement (5), either (a1'') or (a2'') holds for i. If (a1'') holds (j is the first child of i to be examined and to reveal that i is a member of a nontrivial strong component), the condition in statement (18) is satisfied and its then part is executed, establishing the following:

    frame[i] = top
    nodes[frame[i]] = nodes[top] = {i}
    S_i = list[frame[i]] = {j | j in POP_i^V and frame[j] = n + 1}

The last equality is justified by the fact that the addition of the head of a back arc (i, j) into V does not affect the contents of POP_i^V. In addition, statement (19) is executed, and because visited[n + 1] = n + 1 and 0 < visited[j] < n, the then part of statement (30) establishes root[i] = j, where visited[j] = min_{k in B_i^V} {visited[k]} (vacuously, since B_i^V is singleton). Thus (a2''), and therefore (a''), holds. If (a2'') holds before examining j, the test in statement (18) fails, and only statement (19) is executed, possibly updating root[i] as (a2'') requires. The remaining clauses of (a2'') remain valid by the induction hypothesis. (The values of S_i and list[frame[i]] have to remain the same, since the addition of a back arc in V does not affect the contents of POP_i^V.) Thus, in this case also, (a2''), and therefore (a''), holds.

Assume that (i, j) is a cross arc, i.e., the condition of statement (13) is satisfied, by Lemma A.2. Since popped[j] = 1, the call to visit(j) has already returned. Thus, by the induction hypothesis of the outer induction, the lemma holds for j. If root[j] = n + 1, (a1) holds for j, and the complete set of descendants of j is stored in S_j and propagated to S_i at statement (14). If (a1'') holds for i before examining j, S_i is correctly updated to D_i^V (with V containing j also). If (a2'') holds for i before examining j, since all nodes in S_j have been popped before i, they are members of POP_i^V; thus, S_i and list[frame[i]] are updated correctly also. The values of frame[i], root[i], and nodes[frame[i]] correctly remain unchanged. (Specifically for root[i], the addition to V of the head of a cross arc that is in a different strong component (root[j] = n + 1) cannot have any effect on the contents of B_i^V.) On the other hand, if root[j] ≠ n + 1, the test in statement (14) fails, and control reaches statement (15). Recall that frame[j] ≠ n + 1. By the induction hypothesis, before examining j in statement (5), either (a1'') or (a2'') holds for i. If (a1'') holds, the condition in statement (15) is satisfied and its then part is executed, establishing the following:

    frame[i] = top
    nodes[frame[i]] = nodes[top] = {i}
    S_i = list[frame[i]] = {j | j in POP_i^V and frame[j] = n + 1}

The last equality is justified since, for a cross arc (i, j) with root[j] ≠ n + 1, the addition of j to V does not affect {j | j in POP_i^V and frame[j] = n + 1}. In addition, statement (16) is executed, establishing root[i] = j, where visited[j] = min_{k in B_i^V} {visited[k]} (vacuously, since B_i^V is singleton, by the induction hypothesis of the outer induction). Thus (a2''), and therefore (a''), holds. If (a2'') holds for i before examining j, statement (16) is executed, possibly updating root[i] as (a2'') requires. The remaining clauses of (a2'') remain valid by the induction hypothesis, since {j | j in POP_i^V and frame[j] = n + 1} is not affected by the addition of j to V. Thus, in this case also, (a2''), and therefore (a''), holds.

Note that (a'') reduces to (a) when V becomes equal to E_i. (Recall that for a leaf of the spanning forest, T_i^Ei = ∅.) For a leaf of the spanning forest of calls to visit, it can never be true that i = root[i]. This is because root[i] is either made equal to a child of i (statement (19)), but i is never examined as a child of itself (statement (5)), or it is made equal to the root of a child j of i (statements (11), (12), and (16)). By the induction hypothesis of the outer induction, (a) holds for j; statements (11), (12), and (16) are only executed when root[j] ≠ n + 1, thus (a2) must hold for j. If i = root[j] at some point, this means that there is a path from i to j of tree and cross arcs only, starting with a tree arc from i (and finishing with a back arc to i). This, however, contradicts the hypothesis that i is a leaf in the spanning forest. Thus i cannot be equal to root[i], statement (21) is skipped, and after the return of visit(i), (a) holds. This completes the proof of the lemma for the case that i is not the father of h.

Assume that i is the father of h. For the first time since the call to visit(h) was made, we need to prove both (a) and (b) for i. We first prove (b). Node i is never placed in an entry of nodes, except within the top-level call of visit(i) (statements (11), (12), and (16)). Thus, if frame[i] = n + 1 before visit(j) is called for a child j of i, this will not be changed after the return of the call to visit(j). When i is inserted into some entry of nodes, it is always true that frame[i] = top (statements (11), (15), and (18)). Thus, consider the case where frame[i] = top = TOP, for some value TOP, before the call to visit(j) for some child j of i. During the call visit(j), top may be increased and decreased multiple times. Assume that at some point top was increased from TOP to TOP + 1 without being decreased back to TOP before the call to visit(j) returns, and that this last happened within a call visit(l). Clearly, l is a descendent of i and j in the spanning forest of calls to visit. We prove that, for any node k in the path from j to l in the forest, frame[k] = TOP + 1 when the corresponding call to visit returns. This will be done by induction on the distance of k from l.

Basis. Consider l itself. Consider any call visit(m) to a child m of l, after top is increased to TOP + 1 and frame[l] is set to TOP + 1 (statement (15) or (18)). Since l and m have been popped before i, by the induction hypothesis of the outer induction, (a) and (b) hold for l and m. Thus, when visit(m) returns, either root[m] = frame[m] = n + 1, in which case statement (9) is executed and frame[l] remains TOP + 1 = top, or root[m] ≠ n + 1 and frame[m] = top = TOP + 2, in which case, since root[l] ≠ n + 1, statement (12) is executed, and frame[l] (also frame[m]) is set to TOP + 1 = top. This covers the basis case.

Induction Step. Assume that the claim is true for an arbitrary node k' in the path between j and l. We prove it for its father k in this path. Again, by the induction hypothesis of the outer induction, (a) and (b) hold for both k and k'. When the call visit(k') returns, by the induction hypothesis of the inner induction, frame[k'] = TOP + 1 = top. Moreover, root[k] ≠ n + 1; otherwise, it should be frame[k] = TOP or frame[k] = n + 1 (outer induction hypothesis (b) for k). If that were the case, statement (12) would be executed, and top would be decreased to TOP, which contradicts our assumption that this would not happen between the call visit(l) and the return of the call to visit(j). Thus, root[k] ≠ n + 1 and frame[k] = n + 1, statement (11) is executed, which sets frame[k] equal to frame[k'] = TOP + 1. After any other call visit(k''), following the return of visit(k'), within the call visit(k), by the outer induction hypothesis for k, either frame[k''] = n + 1, in which case frame[k] remains equal to top = TOP + 1, or frame[k''] = TOP + 2 = top, in which case, after the execution of statement (12), frame[k] is set back again to top = TOP + 1. Thus, in all cases the claim holds for k.

By the above induction, after visit(j) returns within visit(i), frame[j] = TOP + 1 = top. This concludes the proof of (b) for i.

The proof of (a) is straightforward. Recall that i is the next node for which visit(i) returns after the return of visit(h). If root[h] = n + 1, by the induction hypothesis (a1), S_h contains all the descendants of h, which are correctly propagated to S_i, or to both S_i and list[frame[i]] (statement (9)). Nothing else is modified: frame[i] should remain n + 1 or top; root[i] should retain its value, because root[h] = n + 1; nodes[frame[i]] should also retain its value, since for any member k of T_i^(h), which is the set of new members of T_i^V, frame[k] = n + 1 by the induction hypothesis and the fact that root[h] = n + 1. Thus, in this case, (a) holds. If root[h] ≠ n + 1, either statement (11) or statement (12) will be executed. Addressing each case is similar to previous parts of this proof and is omitted. In all cases, (a) is seen to hold.

Node i may have other children to examine after h, all of which must be heads of cross or back arcs. It is easily seen that the claim still holds after examining these nodes also, as was done before. If (a1) holds when control reaches statement (20), i.e., root[i] = n + 1, then statement (21) is skipped, and (a1), and therefore (a) also, holds after visit(i) returns. If (a2) holds when control reaches statement (20), i.e., root[i] ≠ n + 1 but root[i] ≠ i, then again (a2) remains valid after visit(i) returns. Finally, if (a2) holds but i = root[i] ≠ n + 1, then statement (21) is executed. In all cases, (a) holds after the call visit(i) returns, and the lemma is established. ∎

THEOREM A.4. Algorithm GDFTC terminates and correctly computes the transitive closure of G.

PROOF. Clearly, GDFTC terminates, due to the finiteness of G and the fact that any edge of G is considered only once in the algorithm (statement (5)). Consider a node i that satisfied (a1) after the call visit(i) returned. By Lemma A.3, its descendants are correctly computed and stored in S_i. Consider a node i that satisfied (a2) after the call visit(i) returned. (These are the nodes that are members of nontrivial strong components but are not roots.) This means that root[i] was equal to a node r for which visit(r) had not returned yet. Since there is a path in the spanning forest from r to i, and a path from i ending in a back arc whose head is r, r is a member of B_i^Ei, and therefore r is an ancestor of i in the spanning forest of calls to visit. If visited[root[r]] < visited[r], the same argument applies to root[r]; due to the finiteness of the nodes, there must be a node r', an ancestor of i, such that root[r'] = r' after all calls to r''s children have returned. Before visit(r') returns, all members of nodes[frame[r']] have their descendant sets made equal to S_r'. Since i is in T_r'^E and frame[i] ≠ n + 1 (by Lemma A.3), i is in nodes[frame[r']], and i's descendants are appropriately updated. The correctness of the update is straightforward, since there is a cycle that involves both r' and i, and therefore the two nodes have the same descendants. ∎
is and

REFERENCES L

AGRAWAL, R., DAR, S., AND JAGADISH, H. V. performance

evaluation.

ACM

Trans.

2. AGRAWAL, R., AND JAGADISH, H. V. database

relations.

England,

Sept.

1987),

In

Direct

Proceedings

Direct

Database

transitive

Syst.

algorithms

of the

13th

ment,

algorithms:

1990),

for computing

International

the transitive

VLDB

Design

and

427-458.

Conference

closure

of

(Brighton,

255-266.

3. AGRAWAL, R., AND JAGADISH, H. V. Hybrid transitive the 16th International VLDB Conference (Brisbane, 4. AHo,

closure

15, 3 (Sept.

closure algorithms. In Proceedings of Australia, Aug. 1990). VLDB Endow-

326–334. A. V.,

Algorithms.

HOPCROFT,

J. E.,

Addison-Wesley,

AND ULLMAN, Reading,

Mass.,

ACM Transactions

J. D.

The

Design

and

Analysis

of Computer

1974.

on Database

Systems,

Vol. 18, No. 3, September

1993.

576

Y. Ioannidis et al.

.

5. BANCILHON,

F.

Management Eds.,

Naive

evaluation

Springer-Verlag,

6. CARRE, B.

New

Graphs

7. CRUZ, I.,

and

Proceedings

York,

T. S.

of the 5th IEEE Ph.D.

9. DIJKSTRA,

thesis,

E. W.

Database Clarendon

AI

relations.

Systems,

Press,

Aggregative

Data

Univ.

A note

defined

and

In

On Knowledge

M. Brodie

and

Base

J. Mylopoulos,

1985.

Networks.

AND NORVELL,

8. DAR, S.

of recursively

Systems—Integrating

Oxford,

closure:

Engmeermg

An

Conference

of Wisconsin-Madison,

on two problems

England,

extension

1979. of transitive

(Los Angeles,

Aug.

closure.

Feb. 1989),

In

384-391.

1993.

in connection

with

graphs.

Numer.

Math.

1 (1959),

269-271. 10. EBERT, J.

11.EVE,

A sensitive

transitive

J., AND KURKI-SUONIO,

(1977),

closure

R.

Y. E.

Proceedings

On

the

of the 12th

computation

of the

International

VLDB

13. IOANNIDIS, Y. E., AND RAMAKRISHNAN, of the 14th

14. JIANG,

B.

algorithm

I\ O-efficiency

Data

16. KABLER,

Process.

transitive

R., R.,

Heuristic

HANSON, search

International

Y.

E.,

Znf. Syst.

Lett.

12, 5 (1981).

closure

path

E.,

transitive

of a relation.

partial

Acts

Znf.

operators.

In

Feb.

1989).

algorithms.

In

382–394.

1988),

closures

in

Feb.

In

IEEE,

Performance

403-411.

Aug.

(Los Angeles,

An analysis.

Ariz.,

1986),

Calif.,

transitive

algorithms: M.

of relational Aug.

closure

Beach,

Conference

AND CAREY,

databases.

1990),

Proceedings New

York,

evaluation

ProceedIn

264-271. of the 8th

12-19.

of algorithms

for

(Sept. 1992), 415-441.

IOANNIDIS,

Workshop,

closure (Kyoto,

(Long

computing

(Tempe,

17, 5

in database

Efficient

Engineering

Conference

IOANNIDIS,

closure.

for

Data

transitive

Conference

Conference

of shortest

Engineering

transitive

R.

VLDB

of the 6th IEEE

15. JIANG, B.

17. KUNG,

International

A suitable

Proceedings IEEE

hf.

the

303-314.

12. 10ANNIDIS,

ings

algorithm.

On computing

Y.

E.,

systems.

SHAPIRO,

In

L. Kerschberg,

Expert

Ed.,

L.,

SELLIS,

Database

T.,

AND STONEBRAKER,

M.

Proceedings

1st

Systems,

Benjamin-Cummings,

Menlo

Park,

of the Calif.,

1986,

537-548. 18. Lu,

H.

New

strategies

Proceedings

of the

for

13th

computing

the

International

transitive

VLDB

closure

Conference

of a database

(Brighton,

England,

relation. Sept.

In

1987),

267-274. 19. Lu,

H.,

MIKKILINENI,

compute Data

K.,

thetransitive

Engineering

20. PURDOM, P. 21. NAUGHTON,

AND RICHARDSON,

closure Conference

Atransitive J. F.

J. P.

ofadatabase (Los Angeles,

closure

One-sided

Feb.

algorithm.

recursions.

In

closure

23. RICH, E.

Artificial

24. ROSENTHAL, approach

queries.

A.,

recursive

An improved

26. TARJAN, R. E.

Depth-first

J. J.

Softw.

U.,

D. C., May search

of the

6th

to

International

New

York,

In

ACM-PODS

algorithms

17, 3 (Mar.

AND MANOLA,

1986),

transitive

of algorithms

of the3rd

112-119.

Efficient

Eng.

applications.

evaluation

10(1970),76-94.

Proceedings

McGraw-Hill,

S., DAYAL,

(Washington,

25. SCHMITZ, L.

Trans.

Intelligence. HEILER,

to supporting

Conference

IEEE

and

In Proceedings

1987),

BZT,

(San Diego, Calif., Mar. 1987), 340-348. 22. QADAH, G. Z., HENSCHEN, L. J., AND KIM, transitive

Design

relation.

for

1991),

the

Conference instantiated

296-309.

1983.

F.

Traversal

Proceedings

recursion:

of the

1986

A practical

ACM-SIGMOD

166-176.

closure and linear

algorithm. graph

Computing

algorithms.

30 (1983),

SIAM

359-371.

J. Comput.

1, 2 (1972),

146-160. 27. VALDURIEZ, ings

P., AND BORAL, H.

of the

1986),

1st International

relations.

A modification

Commun.

29. WARSHALL,

ACM

of recursive

Database

queries

using

Systems

Conference

algorithm

for the

join

indices.

(Charleston,

In ProceedS. C., April

197-208.

28. WARREN, H. S.

Received

Evaluation Expert

August

Transactions

S.

of Warshall’s

ACM

18, 4 (April

A theorem

on Boolean

1989;

revised

on Database

September

Systems,

1975), matrices.

1991;

transitive

closure

218-220. JACM,

accepted

9, 1 (Jan.

July

Vol. 18, No. 3, September

1992

1993.

1962),

11-12.

of binary

Suggest Documents