Transitive Closure Algorithms Based on Graph Traversal

YANNIS IOANNIDIS, RAGHU RAMAKRISHNAN, and LINDA WINGER
University of Wisconsin, Madison

Several graph-based algorithms have been proposed in the literature to compute the transitive closure of a directed graph. We develop two new algorithms (Basic_TC and Global_DFTC) and compare the performance of their implementations in a disk-based environment with a well-known algorithm proposed by Schmitz. All graph-based algorithms traverse the graph and compute descendent sets by adding the descendent sets of children. While the details of these algorithms differ considerably, one important difference among them is the time at which these additions are performed. Global_DFTC adds the descendent set of a child to that of a parent as early as possible, thereby eliminating the need to fetch certain descendent sets again in a second pass. Basic_TC uses a first pass to compute a topological order of the nodes and performs all additions in a second pass, in reverse topological order, thereby deferring additions as much as possible; in addition, a technique called marking is used to avoid certain additions, and information collected in the first pass is used to apply several optimizations in the second pass. Schmitz is intermediate in this respect, in that additions are performed when the strong component containing a node is identified. Contrary to our expectations, deferring additions turns out to be superior: early additions result in larger descendent sets over the duration of the execution, causing more additions and more I/O, and this more than offsets the gains of not having to fetch descendent sets again in a second pass. Our results indicate that Basic_TC often outperforms the other algorithms very significantly. We also adapt the algorithms to path computations, such as reachability queries, and the trends seen in the reachability study are confirmed there as well.

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs and Features—recursion; D.4.2 [Operating Systems]: Storage Management—main memory, secondary storage, swapping; E.1 [Data]: Data Structures—graphs, trees; H.2.4 [Database Management]: Systems—query processing

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Depth-first search, node reachability, path computations, transitive closure

Some of the results in this paper appeared in a preliminary form in "Efficient Transitive Closure Algorithms," in Proceedings of the 14th International VLDB Conference (Long Beach, Calif., Aug. 1988). The paper has been revised, and most of the analysis has been replaced by the results of an implementation-based performance evaluation.

Y. Ioannidis was partially supported by the National Science Foundation under grant IRI-8703592 and by a grant from IBM. R. Ramakrishnan was partially supported by the National Science Foundation under grant IRI-8804319, a Presidential Young Investigator Award, a David and Lucile Packard Foundation Fellowship in Science and Engineering, an IBM Faculty Development Award, and a grant from IBM.

Authors' address: Computer Sciences Department, University of Wisconsin, Madison, WI 53706.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1993 ACM 0362-5915/93/0900-0512 $01.50
ACM Transactions on Database Systems, Vol. 18, No. 3, September 1993, Pages 512-576.
1. INTRODUCTION

Several transitive closure algorithms have been presented in the literature. These include the Warshall and Warren algorithms [28, 29], which use a bit-matrix representation of the graph, the Schmitz [25], Ebert [10], and Eve and Kurki-Suonio [11] algorithms, which use Tarjan's algorithm [26] to identify the strong components of the graph in reverse topological order, the Seminaive [5]
and Smart/Logarithmic algorithms [12, 27], which view the graph as a binary relation and compute the transitive closure by a series of relational joins, and recently, a hybrid algorithm combining matrix-based algorithms and graph-based algorithms [1]. We develop two new algorithms based on depth-first traversal, and compare their performance in a disk-based environment
with the well-known algorithm proposed by Schmitz. The first of our algorithms, Basic_TC, consists of a first pass that builds a topological order of the nodes, and a second pass that processes nodes in that order and computes descendent sets by adding the descendent sets of children. The second of our algorithms, Global_DFTC, seeks to combine the two passes of Basic_TC by adding descendent sets that must be added whenever they are simultaneously in memory during the first pass, instead of waiting until a second pass to do so. Hereafter, we refer to these algorithms as BTC and GDFTC, respectively. Specialized versions of the algorithms are applicable on acyclic graphs; we have implemented and compared versions of the algorithms named Dag_BTC and Dag_DFTC, respectively, where the prefix "Dag_" stands for "Directed acyclic graph."

The result of a comparison of our algorithms and the Schmitz algorithm over randomly generated graphs is rather surprising. The three algorithms have the following characteristic: GDFTC performs additions as soon as possible, BTC performs them as late as possible, and Schmitz performs them at an intermediate stage. Counter to the intuition that early additions are better (since descendent sets that have been added together need not be brought back into memory for this addition later), BTC outperforms both Schmitz and GDFTC. The first reason is that early additions result in the algorithm working with larger descendent sets for a longer time during the execution, leading to more additions and more I/O; the faster growth of the sets results in buffers being filled up quicker, thereby causing extra retrievals. Overall, the two effects—avoiding extra retrievals of descendent sets versus the faster growth of sets—appear to balance out, with the faster growth of sets perhaps a little more dominant. The second reason is that information collected in the first pass can be used to apply several optimizations in the second pass.
We have adapted BTC, GDFTC, and Schmitz to compute path queries, such as the set of nodes reachable from a given node, or queries such as the shortest path between each pair of nodes in the transitive closure. We have adapted only BTC for path computations on cyclic graphs: GDFTC and Schmitz do not maintain the information within a strong component that is needed to compute such queries, since an important optimization that they rely upon, the merging of the descendent sets of related nodes in a strong component, is inapplicable for path queries. (For acyclic graphs, all of these algorithms are applicable.) We compare the performance of BTC,
GDFTC, and Schmitz for path queries over acyclic graphs, and show that the results for reachability extend to this case as well. We also present a comparison of the various versions of BTC for path computations on cyclic graphs. This paper
differs in many respects from its preliminary version [13]. First, the algorithms have been revised and presented differently, and full proofs of correctness have been included. Second, BTC has replaced one of the algorithms presented there, due to its simplicity and superior performance.1 The most important difference, however, is that the analysis in the preliminary version has been replaced by a performance evaluation based upon actual implementations of the algorithms. Indeed, this has caused us to revise some of our conclusions about the relative merits of the algorithms. The performance evaluation brought out the fact that the algorithms were affected significantly by the impact on buffer management of the growth of descendent sets. This
was not reflected in our analysis, due to the assumption that only "minimal" buffer space was available. Thus, the analysis was for the worst case, whereas the performance evaluation is for the average case, and the behavior of the algorithms in the two cases differs considerably; similarly, the analysis assumed that all algorithms were affected by buffer management strategies in the same way, which was not the case. Also, by implementing the algorithms, we were able to experiment with specialized data organizations, which could not be captured in the analysis.

The paper is organized as follows. We introduce some notation and present
a summary of the new and the existing graph-based algorithms in Section 2. Section 3 presents the new algorithms in detail, starting with some simple versions and subsequently refining them. We describe the implementation of the
algorithms in Section 4, and the testbed for performance evaluation in Section 5. We present a performance comparison of the algorithms for reachability queries in Section 6 (acyclic graphs) and Section 7 (cyclic graphs). Path queries are considered in Section 8, and the algorithms presented earlier are adapted to compute them. In Section 9, we present a performance comparison of the algorithms for path queries, and selection queries are discussed in Section 10. Graph-based algorithms are compared to nongraph-based ones in Section 11. Finally, our conclusions are presented in Section 12.

1 In fact, we did implement one of the algorithms presented in that paper, called DFTC. The algorithm, which performed uniformly worse than BTC and GDFTC, did not present any points of additional interest, and we have deleted it.
2. GRAPH-BASED ALGORITHMS

A large body of literature exists for main-memory-based algorithms for the transitive closure. Recently, with the realization of the importance of recursion in new database applications, transitive closure has been revisited and reexamined in a data-intensive environment. In this section, we concentrate
on graph-based algorithms, i.e., ones that take into account the graph structure and its properties and compute the transitive closure by traversing the graph.
Almost all such algorithms have the following common characteristics: (a) they are based on a depth-first traversal of the graph, (b) they identify the strong components of the graph using Tarjan's algorithm [26], and (c) they take advantage of the fact that nodes in the same component have exactly the same descendants and that they are descendants of each other. Based on (c),
graph-based algorithms can compute the transitive closure of a graph so that only a pointer is associated with each node in a strong component pointing to a common by Purdom compare
descendent set. In this section, we discuss graph-based algorithms that have been proposed in the literature, e.g., by Purdom [20], Ebert [10], Schmitz [25], and Eve and Kurki-Suonio [11, 14], and compare them with our algorithms. The focus of the discussion is primarily on the basic reachability computation over entire graphs; graph-based algorithms that have been proposed for path computations or for partial transitive closure are not discussed. We first present some definitions.
2.1 Notation and Basic Definitions

In this paper we discuss the computation of the transitive closure of a directed graph G, which we denote by G*. We assume that G is specified by a set of children sets, i.e., for each node i, E_i = {j | (i, j) is an arc of G}. Without loss of generality, we assume that G has no self-loops, i.e., for all i, i ∉ E_i. For an arc (i, j), node i is called the source (or tail) and node j the destination (or head) of the arc. We use the term strong component (or simply component) for a strongly connected component of G, and we denote the component of node i as V_i = {i} ∪ {j | (i, j) and (j, i) are arcs of G*}. A component is nontrivial if it contains more than one node. The condensation graph of G, G_c, has the components of G as its only nodes, and there is an arc from V_i to V_j in the condensation graph if and only if there is an arc from i to j in G. The set of descendants in the transitive closure for a node i is S_i = {j | (i, j) is an arc of G*}.

As mentioned in point (a) above, most graph-based algorithms perform a depth-first traversal of graphs, so we review some definitions relevant to it. Depth-first traversal induces a spanning forest on the graph based on the order
in which nodes are visited. If we assume that the main routine of the depth-first traversal is visit(i) for a node i, then there is an arc (i, j) in the spanning forest if there is a call to visit(j) during the execution of the call visit(i). An arc (i, j) in the graph G is called a tree arc if it belongs in the spanning forest. An arc (i, j) in the graph G but not in the spanning forest is called a forward arc, a back arc, or a cross arc, if, in the spanning forest, j is
a descendant of i, j is an ancestor of i, or j is not related to i with an ancestor-descendant relationship, respectively. For every strong component, the node r on which visit(r) is first called is the root of the strong component.
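The arc classification just defined can be illustrated with a small sketch; this is our own illustration, not code from the paper, and the function name and representation are assumptions. A depth-first traversal records, for each node, whether it is unvisited, on the recursion stack, or finished, and each non-tree arc is then labeled by comparing positions in the induced spanning forest.

```python
# Illustrative sketch (ours): classify arcs of a directed graph as
# tree, forward, back, or cross arcs via one depth-first traversal.

def classify_arcs(E, n):
    """E[i] is the children set of node i; nodes are 0..n-1."""
    color = [0] * n          # 0 = unvisited, 1 = on recursion stack, 2 = finished
    disc = [0] * n           # discovery times
    clock = [0]
    kinds = {}

    def visit(i):
        color[i] = 1
        clock[0] += 1
        disc[i] = clock[0]
        for j in E[i]:
            if color[j] == 0:
                kinds[(i, j)] = "tree"       # (i, j) joins the spanning forest
                visit(j)
            elif color[j] == 1:
                kinds[(i, j)] = "back"       # j is an ancestor of i
            elif disc[j] > disc[i]:
                kinds[(i, j)] = "forward"    # j is a descendant of i
            else:
                kinds[(i, j)] = "cross"      # i and j are unrelated in the forest
        color[i] = 2

    for i in range(n):
        if color[i] == 0:
            visit(i)
    return kinds
```

For example, in the graph with arcs (0, 1), (1, 2), (0, 2), and (2, 0), the traversal starting at node 0 labels (0, 1) and (1, 2) as tree arcs, (0, 2) as a forward arc, and (2, 0) as a back arc.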
2.2 Summary of Algorithms

The goal of this subsection is not to present the algorithms in detail but rather to give an abstract description of them and their implications on performance, so that the main differences among them can be understood. Detailed expositions of the original algorithms can be found by the interested reader in the references; BTC and GDFTC are described in detail in later sections of this paper. In the descriptions that follow, one should pay special attention to the fact that BTC and GDFTC form the two extreme points in a spectrum of possibilities for when descendent sets of children are added to those of parents, with Schmitz somewhere in the middle. This understanding is important because it allows the conclusions of a performance evaluation of these three algorithms to be used as a basis for a qualitative comparison with other graph-based algorithms as well, since the remaining graph-based algorithms are likely to perform similarly to one of the three.
BTC. This algorithm uses Tarjan's algorithm as a first pass to construct a topological ordering of nodes and to identify the strong components of the graph. Additionally, that pass can be used to physically cluster the relation in reverse topological order, with nodes in descendent sets arranged in topological order. This improves the performance of a second pass in which the descendants of all nodes are found in reverse topological order. An optimization called "marking" is used to avoid the addition of a descendent set in many cases where earlier set additions are guaranteed to have added the given descendent set. Because of the two-pass structure of the algorithm, descendent set additions are deferred as much as possible.
GDFTC. This algorithm defines the opposite end of the spectrum from BTC in that descendent set additions are performed as early as possible. When returning from a child along a tree arc or an intercomponent cross-arc, the child’s descendent set is added to the parent’s set immediately, thereby eliminating
the need to retrieve these sets subsequently to perform this addition. A rather complicated stack mechanism ensures that, for all nodes in a strong component, when returning from a child (along a forward or intercomponent cross-arc), the child's descendent set is added to the set of a representative of the strong component. Thus, additions are never deferred.

Purdom. Purdom proposed an algorithm that is similar to BTC [20]. It is
based upon computing a topological sort of the condensation graph prior to computing the closure. The main difference with respect to BTC is the absence of marking; the implementation of BTC also incorporates some important optimizations that increase the effectiveness of marking and take advantage of the topological sort for physical clustering.
Eve and Kurki-Suonio. Eve and Kurki-Suonio proposed the following modifications to Tarjan's algorithm in order to compute the transitive closure [11]. First, on returning to a node i after visiting a child j, the descendants of j are added to the descendants of i. They observed that i and j are in the same strong component if and only if node j is still on the stack after processing j, and that the nodes on the stack above the root comprise the strong component; thus, if i and j are in different strong components, the descendants of j are added to the descendants of i after processing j (similarly to GDFTC). Second, when the root of a strong component is identified, the descendants of each node in the strong component are added to the descendants of the root (similarly to Schmitz). There are two potential redundancies in the algorithm that affect performance. First, the algorithm propagates descendent sets even when returning from forward arcs, although this is unnecessary. Second, if there is an arc (j, k) such that j is in a nontrivial strong component and k is in a different component, k is added to S_j by the first modification, and also to the descendent set constructed for the root of j's component via the addition of S_j by the second modification.

Ebert. Ebert suggested another modification of Tarjan's algorithm: a depth-first traversal of the graph is performed to identify strong components, but when returning from a child, if the arc is a tree arc or an intercomponent cross arc, the descendants of the child are added to the descendants of the parent [10]. This algorithm improves upon Eve and Kurki-Suonio by performing no additions on forward arcs. For acyclic graphs, the Ebert algorithm is identical to Dag_DFTC. For cyclic graphs, however, there are many redundant operations in Ebert, in that descendent sets are propagated via every tree arc in a strong component until they are eventually propagated to the descendent set of the root.
Schmitz. Schmitz’s modification of Tarjan’s algorithm is based upon the fact that strong components are identified in reverse topological order, and that
all nodes in the strong component are on the stack above the root node when the root is identified [25]. The modification to compute the transitive closure is essentially to construct the descendent set of the root by adding the descendent sets of all children of nodes in the component when the root is identified. Schmitz's algorithm also detects forward arcs and ignores them; thus, the first redundancy of Eve and Kurki-Suonio's algorithm is avoided. Further, since no additions are made to the descendent sets of nodes in a strong component until the root is identified, the second redundancy of that algorithm is avoided, as well as the redundancy of Ebert's algorithm. Finally, Schmitz uses an optimization that is similar to marking over the condensation graph, although it is not in general as flexible as marking in BTC, due to the two-pass structure of BTC versus Schmitz's one-pass structure. Like Eve and Kurki-Suonio's algorithm, however, Schmitz has the potential cost of retrieving descendent sets that may not be in memory. For Schmitz, these are the sets of children of nodes on the stack above the root node; for Eve and Kurki-Suonio, these are the sets of nodes on the stack
above the root. Thus, both algorithms are intermediate between GDFTC and BTC in terms of when descendent set additions are carried out: additions are not done eagerly, but they are not deferred to a second pass either. Instead, they are deferred until the entire strong component containing the node is identified.

Schmitz also proposed a variant of his algorithm in which what he refers to as an arc basis of the graph is computed and used, i.e., a minimal subset of the arcs of the graph whose transitive closure is equal to that of the original graph. We have not explicitly studied this variant. Nevertheless, one of the techniques used in the implementation of BTC achieves essentially the same effect; thus, an upper bound on the cost improvement of this variant over the basic Schmitz algorithm can still be derived indirectly (Section 6.1).

2.3 Comparison of Algorithms

The Schmitz algorithm can be seen as a refinement of Purdom's seminal algorithm, and the marking optimization and the physical clustering that we propose yield further significant improvements; BTC clearly dominates Purdom's algorithm. The Eve and Kurki-Suonio algorithm performs all the additions carried out by the Schmitz algorithm, and usually more, and it is almost identical to it in terms of when the addition operations are carried out; Schmitz is therefore seen to be uniformly superior. The Ebert algorithm is identical to Dag_DFTC (and thus to GDFTC) for acyclic graphs, and for cyclic graphs it performs strictly more work than GDFTC in terms of additions. While it is not clear how the run-times of Ebert and GDFTC compare in terms of CPU time for cyclic graphs (since GDFTC uses more complex stack operations), we expect that the I/O performance of GDFTC is uniformly better than that of Ebert. (A comparison for a main-memory implementation corroborates this observation.) Based on the above observations, we have limited ourselves to a comprehensive comparison of BTC, GDFTC, and Schmitz only. The detailed results of the performance comparison are presented in Sections 6 and 7.

3. THE NEW TRANSITIVE CLOSURE ALGORITHMS

In this section we present in detail several new transitive closure algorithms based upon depth-first traversal of the graph.

3.1 A Marking Algorithm
We first present a simple transitive closure algorithm that introduces a technique called marking. Intuitively, if a descendent set contains a marked node, it also contains the children (but not necessarily all descendants) of that node. In the following, descendent set S_i is partitioned in two sets M_i and U_i that can be thought of as the marked and unmarked subsets of S_i.

proc Closure (G)
Input: A graph G represented by children sets E_i, i = 1 to n.
Output: S_i = M_i (U_i = ∅), i = 1 to n, denoting G*.
(1) {for i = 1 to n do U_i := E_i; M_i := ∅ od
(2) for i = 1 to n do
(3)   while there is a node j ∈ U_i − {i} do M_i := M_i ∪ M_j ∪ {j}; U_i := U_i ∪ U_j − M_i od
(4) od}

LEMMA 3.1. If j ∈ M_i, then E_j ⊆ M_i ∪ U_i.

PROOF. We note that initially U_i = E_i and M_i = ∅ for all i, and that M_i ∪ U_i is monotonically increasing. Whenever a node j is added to M_i, M_j ∪ U_j is added to M_i ∪ U_i as well, and E_j ⊆ M_j ∪ U_j. The claim follows. ❑

THEOREM 3.2. Algorithm Closure correctly computes the transitive closure G* of a graph G.

PROOF. If j ∈ M_i ∪ U_i, then either j ∈ E_i, or there is some k such that k ∈ E_i and j ∈ M_k ∪ U_k when j is added; it follows by induction that all such nodes are descendants of i in G. To see that the algorithm terminates, we note that for all i, M_i ∪ U_i is monotonically increasing and bounded, and that a node added to M_i is never returned to U_i; the processing of node i is completed when U_i − {i} = ∅. The proof that all descendants of i are in M_i ∪ U_i is obtained by noting that E_i ⊆ M_i ∪ U_i and by applying Lemma 3.1 to the nodes in M_i. ❑

We note that the effect of marking, captured by Lemma 3.1, is essentially an optimization that Schmitz's algorithm also achieves over the condensation graph. (For the interested reader, this is the optimization seen in Theorem 2 of [25].)
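As an illustration, algorithm Closure can be transcribed into a set-valued Python sketch. This is our own rendering, not the paper's code: the function name and dictionary representation are assumptions, and the sketch returns the union of the marked and unmarked parts of each set so that a node lying on a cycle through itself is reported among its own descendants.

```python
# Sketch (ours) of algorithm Closure: M[i] and U[i] play the roles of
# the marked and unmarked subsets of the descendent set S_i.

def closure(E):
    """E[i] is the children set of node i; returns S with S[i] = M[i] | U[i]."""
    U = {i: set(children) for i, children in E.items()}  # unmarked part
    M = {i: set() for i in E}                            # marked part
    for i in E:
        while U[i] - {i}:                  # pick any unmarked node j != i
            j = (U[i] - {i}).pop()
            M[i] |= M[j] | {j}             # add j and j's marked part
            U[i] = (U[i] | U[j]) - M[i]    # inherit j's unmarked part
    return {i: M[i] | U[i] for i in E}
```

On the acyclic graph with arcs (0, 1), (0, 2), (1, 2), the result is S_0 = {1, 2}, S_1 = {2}, S_2 = ∅; on the two-cycle 0 ↔ 1, both nodes obtain {0, 1}.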
3.2 Depth-First Traversal to Number Nodes

In the following sections, we need a numbering of the nodes of the graph with the property that all descendants of a node numbered m have a lower number than m, i.e., a topological order. In the presence of cycles, only an approximation to such a numbering is possible, obtained by ignoring back arcs. That is, in the acyclic graph obtained from G by ignoring back arcs, all descendants of a node numbered m have a lower number than m. The depth-first numbering algorithm is presented below.
proc Number (G)
Input: A graph G represented by children sets E_i, i = 1 to n.
Output: Graph G with nodes numbered. The numbering is stored in a global array popped[ ].
(1) {vis := 1;
(2) for i = 1 to n do visited[i] := 0; popped[i] := 0 od
(3) while there is some node i s.t. visited[i] = 0 do visit(i) od
}
(4) proc visit(i)
(5) {visited[i] := 1;
(6)  while there is j ∈ E_i s.t. visited[j] = 0 do visit(j) od
(7)  popped[i] := vis; vis := vis + 1;
}
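Algorithm Number transcribes almost directly into Python (the function name is ours; popped[i] records the order in which the call visit(i) finishes, so every descendant reached through tree arcs receives a lower number than its ancestors).

```python
# Direct transcription (ours) of algorithm Number: depth-first
# traversal that records pop order in popped[].

def number(E, n):
    """E[i] is the children set of node i (0..n-1); returns popped[]."""
    visited = [False] * n
    popped = [0] * n
    vis = 1

    def visit(i):
        nonlocal vis
        visited[i] = True
        for j in E[i]:
            if not visited[j]:
                visit(j)
        popped[i] = vis      # i is numbered when its call finishes
        vis += 1

    for i in range(n):
        if not visited[i]:
            visit(i)
    return popped
```

For the acyclic graph with arcs (0, 1), (0, 2), (1, 2), the traversal pops node 2 first and node 0 last, so popped = [3, 2, 1]: descendants have lower numbers, as required.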
The following lemma derives an important property of the spanning forest induced by depth-first traversal.

LEMMA 3.3 [4]. Let G_1 be a strong component of a graph G. Then, the vertices of G_1 together with those of its arcs that are common to the spanning forest of G form a tree. Further, the root of this tree is the root of the strong component.

For identifying strong components, we use Tarjan's algorithm [26]. While, for ease of exposition, we have presented the simpler algorithm Number, Tarjan's algorithm is easily modified to compute the array popped as well; in addition, it can identify the root (of the strong component) of each node in another array, root. We refer to this suitably modified algorithm as Modified_Tarjan.
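Under our reading, Modified_Tarjan can be sketched as Tarjan's strong-component algorithm extended to fill both arrays: popped[] with the pop order (as in Number) and root[] with the root of each node's component. The Python rendering and all names below are our assumptions, not code from the paper.

```python
# Sketch (ours) of Modified_Tarjan: Tarjan's SCC algorithm computing
# popped[] (pop order) and root[] (component root of each node).

def modified_tarjan(E, n):
    index = [None] * n       # discovery order
    lowlink = [0] * n
    on_stack = [False] * n
    stack = []
    popped = [0] * n
    root = [None] * n
    counter = [0]
    vis = [1]

    def visit(i):
        index[i] = lowlink[i] = counter[0]
        counter[0] += 1
        stack.append(i)
        on_stack[i] = True
        for j in E[i]:
            if index[j] is None:
                visit(j)
                lowlink[i] = min(lowlink[i], lowlink[j])
            elif on_stack[j]:
                lowlink[i] = min(lowlink[i], index[j])
        popped[i] = vis[0]               # numbered on pop, as in Number
        vis[0] += 1
        if lowlink[i] == index[i]:       # i is the root of its component
            while True:
                j = stack.pop()
                on_stack[j] = False
                root[j] = i              # record the component root
                if j == i:
                    break

    for i in range(n):
        if index[i] is None:
            visit(i)
    return popped, root
```

For the graph with the cycle 0 → 1 → 2 → 0 and the extra arc 2 → 3, nodes 0, 1, 2 form one component rooted at 0, node 3 is a trivial component, and the root (here node 0) is popped last within its component, as Lemma 3.3 suggests.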
3.3 Algorithm BTC

A simple-minded version of our first algorithm is a straightforward combination of the two ideas presented in the previous sections; algorithm Closure is simply run after numbering the nodes in reverse topological order (modulo back arcs). This version is denoted by BTC'.2 For ease of presentation, we assume the existence of a procedure node_popped(i), which returns the node k such that popped[k] = i; the global array popped is constructed by Modified_Tarjan. Note that when Closure is run on an acyclic graph G following such an ordering, the equalities S_j = M_j and U_j = ∅ hold when a descendent set S_j is added to a set S_i. This is not true, however, when BTC' is run on a cyclic graph. For example, when S_j is added to S_i and (i, j) is a back arc, there exists a node k such that k ∈ U_j and k is an ancestor of i in the spanning forest imposed by Number.

proc BTC' (G)
Input: A graph G represented by children sets E_i, i = 1 to n.
Output: S_i = M_i (U_i = ∅), i = 1 to n, denoting G*.
(1) {Modified_Tarjan(G);   /* First Pass */
(2) for i = 1 to n do U_i := E_i; M_i := ∅ od
(3) for i = 1 to n do      /* Second Pass */
(4)   I := node_popped(i);
(5)   while there is a node j ∈ U_I − {I} do M_I := M_I ∪ M_j ∪ {j}; U_I := U_I ∪ U_j − M_I od
(6) od}
Modified_ Tarjan also computes the array root, which enables an important optimization: since all nodes in a strong component have the same set of descendants, we can construct the descendent set for the root node alone.
2 In our earlier paper [13], we referred to the algorithm presented above as algorithm BTC. The algorithm proposed there is essentially algorithm Closure run on the nodes numbered appropriately; the only difference in the resulting algorithm is the use of algorithm Modified_Tarjan to number the nodes rather than algorithm Number.
Consider the processing of a node I in the algorithm above. Instead of adding the descendent set of a child j and the child itself (i.e., M_j ∪ U_j ∪ {j}) to S_I when j ∈ U_I − {I}, as is carried out in BTC', we can add it to S_root[I].3 This is based upon the following observations. In considering a child j of node I, if j and I are in different strong components, j must have already been processed, and the addition of U_j has no effect: the set U_j is either empty or the nodes in U_j are all in the same strong component as j, in which case they are already in the marked subset of the descendent set computed for the root of j's component. If j is in the same strong component as I, the nodes in U_j are all reachable from the root, and the root's descendent set is subsequently copied to every node in the component. In essence, every addition is made to the descendent set of the root, which is the first node of the component to have its descendent set completed; by excluding j = I, we avoid the addition of a descendent set to itself. After we process a root node (a node I such that I = root[I]), we copy its descendent set to all nodes in its strong component.

If we carry out the additions based upon these observations, the only use of the marked and unmarked subsets of descendent sets is to control the while loop, i.e., to determine for what nodes j the addition is to be carried out. Here, we can use E_I − M_root[I] instead of U_I, since U_I is initialized to E_I and is not added to (based upon the preceding discussion), and all nodes for which the addition has already been carried out are included in M_root[I]. Thus, there is no need to distinguish between the subsets M_i and U_i. We refer to the algorithm with the above changes, presented below, as BTC, and to the optimization collectively as the "root optimization"; it can improve performance significantly. On the contrary, for path computations, nodes in a strong component cannot be treated identically, and the root optimization is not applicable. Thus, when we consider path computations, we adapt BTC' rather than BTC (Section 8).

3 This results in adding the descendent set of the root to itself for nontrivial strong components, or for singleton strong components with self-arcs. This unnecessary addition can be avoided, but we have chosen to keep the presentation simple instead.

proc BTC (G)
Input: A graph G represented by children sets E_i, i = 1 to n.
Output: S_i, i = 1 to n, denoting G*.
(1) {Modified_Tarjan(G);   /* First Pass */
(2) for i = 1 to n do S_i := ∅ od
(3) for i = 1 to n do      /* Second Pass */
(4)   I := node_popped(i);
(5)   while there is a node j ∈ E_I − S_root[I] do S_root[I] := S_root[I] ∪ S_j ∪ {j} od
(6)   if I = root[I] then for all k ≠ I s.t. root[k] = I do S_k := S_I od
    od}
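Under our reading of the pseudocode, the second pass of BTC can be sketched as follows; the Python rendering and names are ours, and the arrays popped and root are assumed to come from a first pass such as Modified_Tarjan.

```python
# Sketch (ours) of the second pass of BTC with the root optimization:
# nodes are processed in pop order, every addition targets the set of
# the node's component root, and a finished root's set is copied to the
# other members of its component.

def btc_second_pass(E, popped, root):
    """E[i]: children set of node i; popped/root: first-pass arrays."""
    n = len(E)
    S = [set() for _ in range(n)]
    order = sorted(range(n), key=lambda i: popped[i])   # node_popped(1..n)
    for I in order:
        r = root[I]
        for j in set(E[I]) - S[r]:      # additions go to the root's set
            S[r] |= S[j] | {j}
        if I == r:                      # root: copy its set to members
            for k in range(n):
                if k != I and root[k] == I:
                    S[k] = set(S[I])
    return S
```

On the graph with cycle 0 → 1 → 2 → 0 and arc 2 → 3 (popped = [4, 3, 2, 1], root = [0, 0, 0, 3]), nodes 0, 1, 2 each obtain the descendent set {0, 1, 2, 3}, and node 3 obtains the empty set.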
that in the above algorithm, node in the same component
dants
of j if j and
THEOREM
directed
I are in different Algorithm
3.4.
graph
PROOF.
S~ is empty in line (5) whenever j is a as 1; further, SJ contains all descencomponents.
Basic _ TC computes
the transitive
closure
of a
G.
PROOF. We prove the theorem by showing that, upon termination of execution of the algorithm, S_m contains all descendants of m in the transitive closure of G, for every node m in every strong component. Note that a node may constitute a trivial strong component by itself. The proof is by induction on popped[r], where r is the root of the strong component.

Basis. Consider the strong component whose root r has the least value of popped[r]. Every child of a node in the component must belong to the component itself; otherwise, it would belong to a strong component with a root r' such that popped[r'] < popped[r]. If this is a trivial strong component, then r is the only node in it and has no children, so the initial value of S_r = ∅, which is not modified during the execution, is correct, and the claim holds. If this is a nontrivial strong component, then, since every node in the component is a child of some other node in the component, every node plays the role of j in statement (5) for some I in the component at some point of execution, and all additions are made to S_r. Hence, after processing the while loop for I = r, S_r contains all nodes in the component, and by statement (6), this descendent set is propagated to every node m ≠ r in the component. Thus, the claim holds.

Induction Step. Consider a root r with popped[r] = P, and let the claim hold for every strong component with a root r' such that popped[r'] < P. We must examine two cases. If r is a trivial strong component, every child m of r is in a strong component with a root whose popped number is less than P. By the induction hypothesis, S_m contains all descendants of m for each such child m. While processing the while loop for I = r, each child m of r, together with S_m, is added to S_r. Thus, S_r contains all descendants of r, and the claim holds. If r is the root of a nontrivial strong component, we can show, as in the basis proof, that after processing the while loop for I = r, S_r contains all nodes in its strong component. Further, if node k is not in the strong component but is a child of a node j in the strong component, S_r contains S_k ∪ {k}. The reason is that the root of the strong component containing k is processed before j, which is processed before r. Since k is in a strong component with root r' such that popped[r'] < P, by the induction hypothesis, S_k includes all descendants of k when we process node j. Thus, S_k ∪ {k} is added to S_root[j], i.e., to S_r, when j is processed. Hence, S_r contains all descendants of r, and by statement (6), this descendent set is propagated to every node m ≠ r in the component. Thus, the inductive claim holds for this case as well. This concludes the proof of the theorem. □

3.4 Algorithm Dag_DFTC

BTC computes the transitive closure in two passes; in the first pass it only numbers the nodes, and the descendent sets are computed in the second pass. In this subsection, we attempt to improve the performance of the algorithm by combining the work of the two passes, i.e., by numbering the nodes and computing the descendent sets simultaneously, while the appropriate descendent sets are in memory. The following simple algorithm illustrates the idea, although it only works for acyclic graphs.
proc Dag_DFTC(G)
Input: An acyclic graph G represented by children sets E_i, i = 1 to n.
Output: S_i, i = 1 to n, denoting G*.
(1) {for i = 1 to n do visited[i] := 0; S_i := ∅ od
(2)  while there is some node i s.t. visited[i] = 0 do visit(i) od}

(3) proc visit(i)
(4) {visited[i] := 1;
(5)  while there is some j ∈ E_i − S_i do
        if visited[j] = 0 then visit(j);
        S_i := S_i ∪ S_j ∪ {j}
     od}
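The listing translates almost directly into Python. The names below are ours, not the paper's; a for loop plays the role of the while loop of statement (5), with an explicit membership test for the E_i − S_i condition.

```python
def dag_dftc(children):
    """Depth-first transitive closure for an acyclic graph; children[i]
    stands in for E_i. Recursion depth is bounded by the longest path."""
    n = len(children)
    visited = [False] * n
    S = [set() for _ in range(n)]

    def visit(i):
        visited[i] = True
        for j in children[i]:
            if j in S[i]:          # j already a known descendant: skip
                continue
            if not visited[j]:
                visit(j)           # "eager addition" as the child completes
            S[i] |= S[j]
            S[i].add(j)

    for i in range(n):
        if not visited[i]:
            visit(i)
    return S
```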
We state the following theorem without proof.

THEOREM 3.5. Algorithm Dag_DFTC computes the transitive closure of an acyclic graph G.

The basic intuition in the above algorithm is that, when the processing of a child is complete, the descendent set of the child must be added to the descendent set of the parent, and at that time both descendent sets are likely to be in memory. By performing this addition as early as possible, i.e., when we pop up from the visit of the child, we avoid possibly fetching one or both of these descendent sets later, as in the second pass of BTC, to perform the addition. This intuition is used to derive an "eager addition" algorithm for arbitrary graphs as well, which is presented in the next subsection.

3.5 Algorithm GDFTC

In this section, we develop an algorithm, GDFTC, that generalizes Dag_DFTC to work on arbitrary graphs. Like BTC, it avoids duplication of effort by essentially constructing the descendants of only one of the nodes in a strong component. A stack mechanism⁴ is used when processing nontrivial strong components.

⁴Our stack differs from the stacks used in other graph-based algorithms for transitive closure in that it is a stack of descendent sets of nodes, as opposed to a stack of nodes, and in the way its frames are updated.
The algorithm maintains a stack and constructs a descendent set for (the part of) each strong component that is being processed. Every stack frame f is associated with (potentially part of) a strong component and contains two (potentially distinct) sets of nodes: nodes[f] contains the nodes that are known to be members of the same strong component, and list[f] contains the nodes that are known to be descendants of the members of nodes[f]. During the processing of some nontrivial strong component, the stack contains roughly one frame for each piece of the component discovered so far; when some of these pieces are found to belong, in fact, to the same component, the corresponding frames are merged. When the root of the component is identified, all its pieces have been merged into the top frame; the frame is then popped, and the concluding descendent set list[top] is assigned to all the nodes in nodes[top].

The algorithm works as follows. It traverses the graph in depth-first order, visiting each node once. The action taken on each traversed arc (i, j) depends on its type with respect to the spanning tree of the depth-first process of calls to the visit() routines, i.e., on whether it is a tree, cross, or back arc (forward arcs are ignored). The arc type is identified with the help of the values of visited[ ] and popped[ ] for i and j. (Note that the array visited contains integer elements in this algorithm.) In all cases, however, the action is based on two additional pieces of information: first, whether i and j are in different strong components or in the same (nontrivial) strong component; second, whether j is the first child of i through which i is known to be part of a nontrivial strong component. Both questions are resolved based on the values of root[ ] for i and j. For any node i that is known to be in a strong component whose processing has not finished yet, root[i] ≤ n, whereas root[i] = n + 1 otherwise; the value of root[i] is also used to pass this information on during the processing of the component. Thus, for the first question, the value of root[j] should be equal to n + 1 if i and j are in different strong components (processing of the component of j is over). For the second question, the value of root[i] should be equal to n + 1 if i is not yet known to be part of a strong component.

Based on the specific case identified from the above pieces of information, the algorithm takes the following actions. For all tree and cross arcs, if i and j are in different strong components, the descendants of j are propagated to the descendants of i. This is the action taken when operating on acyclic parts of the graph and is straightforward. The bulk of the algorithm, which involves stack manipulation, addresses the case when i and j are in the same strong component. Tree arcs are the most interesting in this case. The top stack frame always corresponds to j. If j is the first child of i through which i is detected to be part of a strong component, then i is incorporated in the top stack frame. Otherwise, the second stack frame from the top corresponds to i and is merged with the top frame. In both cases, root[i] is updated appropriately. Cross and back arcs are treated almost identically. If j is the first child of i through which i is detected to be part of a strong component, then a new stack frame is pushed on the stack and becomes associated with i. Otherwise, only root[i] is updated appropriately (the top stack frame is the one corresponding to i), in slightly different ways for cross and back arcs.

Algorithm GDFTC is given below. The notation L1 := L1 ∘ L2 is used to indicate that list L2 is concatenated to list L1 by switching a pointer, at O(1) cost. For the special case when L1 is ∅ (that is, when list L2 is to be assigned to the empty list L1), we use the notation L1 := •L2. In contrast, the notation L1 := L1 ∪ L2 is used to denote that a copy of L2 is inserted into L1.
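The pointer-switching concatenation can be illustrated with a minimal singly linked list that keeps a tail pointer; this data structure is our own sketch, not the paper's implementation.

```python
class NodeList:
    """Singly linked list with a tail pointer, so that concatenation
    (the paper's L1 := L1 ∘ L2) is a pointer switch at O(1) cost."""

    class _Cell:
        __slots__ = ("val", "nxt")
        def __init__(self, val, nxt=None):
            self.val, self.nxt = val, nxt

    def __init__(self, vals=()):
        self.head = self.tail = None
        for v in vals:
            self.append(v)

    def append(self, v):
        c = self._Cell(v)
        if self.tail is None:
            self.head = self.tail = c
        else:
            self.tail.nxt = c
            self.tail = c

    def concat(self, other):
        """L1 := L1 ∘ L2: no element is copied, only pointers change."""
        if other.head is None:
            return
        if self.head is None:          # the L1 := •L2 special case
            self.head = other.head
        else:
            self.tail.nxt = other.head
        self.tail = other.tail

    def to_list(self):
        out, c = [], self.head
        while c is not None:
            out.append(c.val)
            c = c.nxt
        return out
```

By contrast, the copying insertion L1 := L1 ∪ L2 would cost time proportional to the length of L2.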
proc GDFTC(G)
Input: A graph G represented by children sets E_i, i = 1 to n.
Output: S_i, i = 1 to n, denoting G*.
/* list[f]      descendants of nodes in the strong comp. of stack frame f. */
/* nodes[f]     nodes in the strong comp. of stack frame f.                */
/* top          pointer to the top of the stack.                           */
/* visited[i]   order in which visit(i) is called.                         */
/* root[i]      potential root of the strong comp. in which i belongs.     */
/* popped[i]    1 if the call to visit(i) has returned.                    */
(1) {vis := 1; top := 0; visited[n + 1] := n + 1;
(2)  for i := 1 to n do visited[i] := popped[i] := 0; root[i] := n + 1;
        list[i] := nodes[i] := S_i := ∅ od
(3)  while there is some i s.t. visited[i] = 0 do visit(i) od}

proc visit(i)
(4) {visited[i] := vis; vis := vis + 1;
(5)  for each j ∈ E_i do                 /* each j considered exactly once */
(6)     if j ∉ S_i then                  /* body of loop not executed when j ∈ S_i */
(7)        if visited[j] = 0 then {      /* (i, j) is a tree arc */
(8)           visit(j);
(9)           if root[j] = n + 1         /* i, j in different strong components */
              then S_i := S_i ∪ S_j ∪ {j}
(10)          elseif root[i] = n + 1     /* first detection of i being in a strong comp. (through j) */
(11)          then add_in_top_frame(i, j)
(12)          else merge_top_two_frames(i, j)}
(13)       elseif popped[j] = 1 then     /* (i, j) is a cross arc */
(14)          if root[j] = n + 1         /* i, j in different strong components */
              then S_i := S_i ∪ S_j ∪ {j}
(15)          else {if root[i] = n + 1   /* first detection of i being in a strong comp. */
(16)             then push_new_stack_frame(i, j);
(17)             update_root_non_back(i, j)}
(18)       elseif popped[j] = 0 then {   /* (i, j) is a back arc */
              if root[i] = n + 1         /* first detection of i being in a strong comp. */
              then push_new_stack_frame(i, j);
(19)          update_root_back(i, j)}
     od
(20) if i = root[i]
     then {   /* Propagate descendants of the root to the rest of the nodes in the strong comp. */
        root[i] := n + 1;
(21)    for each j ∈ nodes[top] do S_j := •list[top]; root[j] := n + 1 od;
        top := top − 1}
(22) popped[i] := 1}

proc add_in_top_frame(i, j)
(23) {list[top] := list[top] ∘ S_i; S_i := •list[top];
(24)  nodes[top] := nodes[top] ∪ {i}; root[i] := root[j]}

proc merge_top_two_frames(i, j)
(25) {list[top − 1] := list[top − 1] ∘ list[top];
(26)  nodes[top − 1] := nodes[top − 1] ∘ nodes[top]; top := top − 1;
(27)  update_root_non_back(i, j)}

proc push_new_stack_frame(i, j)
(28) {top := top + 1; list[top] := •S_i; nodes[top] := {i}}

proc update_root_non_back(i, j)
(29) {if visited[root[j]] < visited[root[i]] then root[i] := root[j]}

proc update_root_back(i, j)
(30) {if visited[j] < visited[root[i]] then root[i] := j}

Fig. 1. A strongly connected graph.
We prove that GDFTC is correct in an appendix. As mentioned above, one important aspect of the algorithm is that duplication of effort is avoided by constructing the descendent list of just one node (the root) of a strong component and subsequently copying this list for each node in the component. One of the reasons for the complexity of the algorithm is the need to keep track of strong component information while constructing descendent lists on the fly.

We illustrate the operation of the algorithm on an example in which a single strong component is discovered in a piecemeal fashion. Figure 1 shows the input graph. The whole graph is one strong component. Assume that the nodes are visited in the order a, b, c, d, e, f, g, and h. Thus, the back arcs (d, b) and (g, e) are discovered before (h, a) is. This results in two potentially independent components being pushed on the stack, namely, {b, c, d} and {e, f, g}. After (h, a) is discovered, a third level is added to the stack, because there is no way of knowing that all of the nodes belong to the same component. This is discovered when we pop up back to f again: statement (12) in the algorithm is executed, and the two frames at the top (corresponding to a and e, respectively) are merged into one. When c is reached, similar actions are taken, so that when a, the root, is reached, all its descendants are correctly found in the top list.
4. IMPLEMENTATION OF ALGORITHMS

This section describes the main aspects of our implementation of the algorithms, analyzing the specific choices that we made when multiple alternatives were available, so that the results of the performance evaluation can be clearly interpreted. These aspects include storage structures for graphs, physical clustering of descendent lists, memory management, and duplicate elimination. Some of the techniques that we present below, or closely related ones, have also been used by others for implementing transitive closure algorithms [3, 14, 22].

4.1 Storage Structures
We represent and store graphs in several forms. First, both the input and output graphs of the algorithms are stored in a plain tuple format, as compactly as possible. Tuples with the same source attribute (arcs with the same tail) are stored consecutively in the file, but otherwise no special structure is assumed.

Second, during the course of execution of all algorithms, graphs are represented as descendent lists. The restructuring from arc-tuples to descendent lists occurs as part of the first pass of BTC, whereas it is the first step of all other algorithms; we refer to it as the restructuring phase.

To accommodate descendent lists, every page is divided into some number of blocks. Each block can store a constant number of node names (equal to the blocking factor), representing arcs from a common source to the stored nodes. In addition, there is an index in each page with one entry for each block; the entry contains the common source of the arcs in the block, a pointer to the block, and a bit indicating whether the block is empty or not. Given a fixed-size page, increasing the blocking factor implies that fewer blocks fit in each page, so that more arcs with a common source can be stored in a page. Thus, choosing the blocking factor implies the following trade-off: a high blocking factor saves space for a long descendent list, since its source is factored out and stored only once for each set of descendants that fit in a block; on the other hand, a high blocking factor wastes space for a short descendent list, since a large portion of a block remains empty and unused. This trade-off will become clear from the results of our experiments.
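The trade-off can be made concrete with a small calculation. The slot accounting below is our simplification (it ignores the per-block index entry), and block sizes 5 and 15 are the values used later in Section 5.

```python
from math import ceil

def wasted_slots(d, B):
    """Unused node slots when a list of d descendants is stored in blocks
    holding B node names each (the common source lives in the page index,
    not in the blocks). Empty lists are not stored at all."""
    if d == 0:
        return 0
    return ceil(d / B) * B - d
```

A long list wastes little space under a large block, while a short list leaves most of its single block empty; the small block size reduces that waste.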
Third, whenever a descendent list is processed in memory, i.e., whenever nodes are copied from it or into it, its contents are also replicated in the form of an adjacency vector. The vector has an entry for every node in the graph, which is equal to 1 if the corresponding node has been identified as a descendant of the source of the corresponding descendent list and is equal to 0 otherwise. This allows for fast duplicate elimination, since the descendent list does not have to be searched before adding a node to it: a straight lookup at the adjacency vector is enough (Section 4.4). The size of the adjacency vectors is calculated in the first steps of each algorithm, when the graph is transformed from tuples to descendent lists, at which time the number of nodes is counted.

In addition to the above, an array with an entry for each node in the graph is maintained in memory. Each such entry contains the following items: (a) the outdegree of the node, (b) the root of the strong component containing the corresponding node, (c) the rank of the node in the topological order obtained by the depth-first traversal of the graph, (d) an indication of whether the node has been visited and processed or not, (e) a pointer to the adjacency vector of the node (if it is in memory), and (f) the page number of the file on disk where the descendent list of the node is stored. For leaves, the last entry is equal to a particular reserved value, making it unnecessary to store empty descendent lists, thus saving space and avoiding many useless disk accesses.

4.2 Descendent List Ordering

In the implementation of the BTC algorithm, we take advantage of the information obtained from the first pass to expedite the second pass, in which the transitive closure is computed.
This is not possible with the remaining algorithms that we have considered. For cyclic graphs, strong components are identified in the first pass, and this allows us to essentially compute the transitive closure of the condensation graph in the second pass, thereby improving performance significantly. The effect of computing the closure of the condensation graph cannot be realized by Schmitz or GDFTC. Two more optimizations are achieved in BTC by the ordering of nodes obtained in the first pass; they cannot be made part of Schmitz or GDFTC either, and, for the most part, they result in significant performance gains. We elaborate on these optimizations below.

In BTC, the descendent lists are constructed in the second pass in reverse topological order of their source nodes and are stored consecutively, in the order of their construction, on disk. The children lists obtained in the first pass are also stored in reverse topological order of their sources. This has the effect that nodes that are close together in the graph, for example, a parent and its children, are likely to have their lists on the same page; since such nodes are, for the most part, also processed close together in time, the above interdescendent list ordering results in very high hit ratios in the buffer pool and thus in less I/O. The technique does not help when the number of children of each node in the graph is so large that only one children list or less fits on a page.

Another benefit of the first pass of BTC is that the topological ordering of the nodes can be used to reduce the production of duplicates. Specifically, consider an arc (i, k) in G and assume that there is also a path between i and k whose first arc is (i, j). Clearly, the inequalities i < j < k hold in the topological order of G. If j is processed before k when dealing with the children list of i (statement (5) of BTC), then k will be found in S_i when its turn comes, and no action will be taken on it. If k is processed first, however, then j will have to be processed as well, and the descendants of k will essentially be derived twice for i. To avoid this unnecessary computation, the nodes in each descendent list produced by the first pass of BTC are stored (and processed) in topological order, i.e., j is stored first in the above example. This intradescendent list ordering has a considerable effect on I/O and CPU performance. The above ordering has also been used by Agrawal and Jagadish in their Hybrid algorithm [3].
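The duplicate-avoidance argument above can be demonstrated with a toy merge routine (names and cost accounting are ours): merging the children of i in topological order copies fewer nodes than the reverse order, with the same final result.

```python
def merge_children(children_of_i, S):
    """Union the complete descendant sets of i's children into a fresh set,
    counting node copies; the 'if j in s_i' test mirrors the E_I - S check
    of statement (5) of BTC."""
    s_i, copied = set(), 0
    for j in children_of_i:
        if j in s_i:
            continue               # already a known descendant: no action
        for x in S[j] | {j}:
            copied += 1
            s_i.add(x)
    return s_i, copied

# Graph fragment: i -> j -> k and i -> k, with complete sets for j and k.
S = {"j": {"k"}, "k": set()}
good_set, good_copied = merge_children(["j", "k"], S)   # topological order
bad_set, bad_copied = merge_children(["k", "j"], S)     # reverse order
```

Processing j first makes k a known descendant before it is examined, so the work for k's descendants is done only once.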
As we mentioned earlier, the above data orderings cannot be used in Schmitz or GDFTC, because of their "on-the-fly" type of processing, which takes place before the necessary information is available. The effect of the intradescendent list ordering, however, can also be achieved by computing the arc basis of a graph and using that for the actual transitive closure computation; as mentioned in Section 2, Schmitz proposed that as a variant of his algorithm. Such preprocessing, however, adds some nontrivial cost to the overall execution, as opposed to the intradescendent list ordering, whose cost is negligible, since it is a by-product of the first pass of BTC (and of the algorithms we present). We also note that, with respect to these complexity gains, accounting for the added first pass does not change the comparison: the first pass of BTC either adds nothing that the other algorithms would not also have to perform, or a preprocessing step simulating its effect potentially costs more.
4.3 Memory Management

Several data structures are assumed to remain in main memory throughout the execution of all algorithms. The most important such structure is the array mentioned in Section 4.1. This is in addition to the buffer pool, which is used to store the following types of data: (a) arc-tuples, for the initial input and the final output of the algorithms, (b) descendent lists, and (c) adjacency vectors. For all algorithms, depending on the execution phase, a buffer pool of size M is divided among the above as follows.

Restructuring:
    M - 1 pages for input arc-tuples
    1 page for the constructed descendent lists

Main algorithm:
    1 page for output arc-tuples
    M - 2 pages for descendent lists
    1 page for adjacency vectors

During the restructuring, LRU is used as the page replacement policy among the arc-tuple pages. During the main algorithm, LRU is used among the adjacency vectors to manage the space in the single page devoted to them. With respect to the pages storing descendent lists, we have experimented with two replacement algorithms: LRU and a specialized algorithm that we introduce below, called Least Unprocessed Node Degree (LUND).

LUND works as follows. The descendent lists that are in main memory at any point are divided into two classes. The first class contains lists, called complete lists, whose source is a node that has been processed already, i.e., all its descendants have been found. Clearly, any future reference to such a list S_j is via an arc pointing to j, with the goal of copying S_j into the descendent list of the source of that arc. The second class contains incomplete lists, i.e., those whose source is either still being processed or has not yet started being processed. Future references to such a list can be due to both arcs coming into the source node (the list is requested once for each such arc, so that it can be copied out) and arcs going out of the source node (the list is requested once for each such arc, so that the descendants of the head of the arc can be added to it). For every list, an Unprocessed Node Degree (UND) is computed: for a complete list, it is the in-degree of the source minus the number of times the list has already been requested, i.e., the number of requests that are yet to be made; for an incomplete list, the requests due to outgoing arcs are counted as well. Equivalently, the UND of a list is the number of unprocessed arcs incident on the corresponding node. In LUND, a fraction f of the incomplete lists, namely the most recently used ones, is protected, and the pages that contain them are not candidates for replacement. Among the candidate pages, LUND adds the UNDs of the lists in each page and then chooses the page with the least sum as the victim for replacement.
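As we read the policy, it can be sketched as follows; the treatment of the fraction f (protecting the most recently used incomplete lists) is our interpretation of the description above, and all names are hypothetical.

```python
def choose_victim(pages, und, incomplete_by_recency, f=0.25):
    """LUND victim selection sketch.
    pages: dict page_id -> node ids whose lists live on that page.
    und: dict node -> number of unprocessed incident arcs (the UND).
    incomplete_by_recency: sources of incomplete lists, most recent first."""
    k = int(f * len(incomplete_by_recency))
    protected = set(incomplete_by_recency[:k])   # likely needed again soon
    best, best_sum = None, None
    for pid, nodes in pages.items():
        if protected & set(nodes):
            continue                 # page holds a protected incomplete list
        s = sum(und[v] for v in nodes)
        if best_sum is None or s < best_sum:
            best, best_sum = pid, s
    return best
```

Pages whose lists have few pending requests lose little by being evicted, which is exactly what the least-sum rule encodes.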
The intuition behind the LUND policy is twofold. First, the most recently used incomplete lists are likely to be needed in memory again soon and should not be paged out; these are the lists that are not included among the candidates for replacement. Second, the algorithm assumes that lists that need fewer further requests will not be needed until further away in the future, so the pages that contain them can be paged out. This assumption is made without any additional information about the structure of the graph, and it is justified by the results of several experiments, some of which are discussed in Section 6.2.2.

A final issue to consider is related to page splits. If all blocks of a page are occupied and one of them is full and needs to be expanded, the page must split into two pages. At that point, a decision must be made on how the descendent lists will be divided between the new pages. We have experimented with two approaches. The first one is to randomly divide them between the two pages; the second one is to take into account the UND of the source nodes of the lists in the original page and separate those with small UND from those with large UND. The specific criterion for the UND-based separation is not important; we have experimented with several criteria, with no major effects on the performance. The intuition behind the second approach is that nodes with high UND are expected to be accessed frequently in the future. Hence, combining all of them together in a page increases the chances that the page will stay in main memory long enough for much of the processing of these lists to be done without additional I/O. Results of some initial experiments showed that, as expected, the first approach is the preferred one when using LRU, whereas the second approach is the preferred one when using LUND. Therefore, all experiments presented in the results sections are for these combinations.
4.4 Duplicate Elimination

In this section, we briefly describe the algorithm that is used for duplicate elimination at linear CPU cost. As mentioned in Section 4.1, when copying nodes from a descendent list S_j to another list S_i, adjacency vectors for both lists exist in main memory. For every node k in S_j, the corresponding entry in the adjacency vector of S_i is checked: if it is equal to 0, then k is added to S_i and the bit is switched to 1; otherwise, no action is taken on k, since it already exists in S_i. This corresponds to O(1) cost for each node in S_j, i.e., to a cost that is linear in the length of S_j. The cost of constructing the adjacency vectors is also linear in the length of the lists, so duplicate elimination is very efficient.
cost that is linear in the length of S~. The cost of constructing the adjacency vectors is also linear in the length of the lists, so duplicate elimination is very efficient. 5. PERFORMANCE We implemented Schmitz, the
several
using
buffer
match
Although times CPU
size in our
all
descendent
strategy
UNIX. were
BTC,
GDFTC,
The file
chosen
sizes of the machine.
have In our counting
and the
algorithms,
set of the root
node in the component,
been
page
and
size and
to be 2 Kbytes
With
this
are the
other
for
the
can be copied out, with that
output: pointers
to it
choice
the input
the
to 1/0 cost, therefore, the numbers presented in this paper do not the initial cost of reading in the original graph once and the final cost
rithms.)
All
included There
in the numbers presented. are several interesting parameters
algorithms.
other
They
reads
and
can be divided
the algorithms and parameters following two subsections. The graphs were generated. 5,1
Parameters
of Algorithm
only exception is in Section with nongraph-based algo-
performed
into
With
during
that
affect
parameters
the
execution
the performance
of the
are of the
implementations
of
of the data. They are discussed in the third subsection explains how our input
Implementations
There are three interesting the number of buffer pages, ACM
writes
unavoidable.
is
and
writing
once. (The algorithms
and
the
once for each
the same
the cost of reading
algorithms,
the
respect include
of writing out the transitive closure 11.3, where we compare graph-based
all
on
elapsed
in each case. For
writing
We assume that
users
and list
on UNIX-provided implemented page
memory in
copy can be written
means
same
option
component
component. this
no
to
size, each page
the UNIX-provided
of available
is an
of a strong
by all the algorithms;
with
experiments, we relied of 1/0 based on the
amount
or a single
run
management
there
each node in the strong output
algorithms
under
implementation
experiments
graph-based
made
3200
for the input and output representation of graphs, block size 15 and 5, respectively, for the descendent
are not meaningful. times and our own
from
of all three
since we do our own buffer
replacement all
versions
the corresponding
can fit 256 arc-tuples 30 and 72 blocks with representation. machine,
TESTBED
C on a VAXstation
page
with
EVALUATION
parameters the buffer
Transactions
of the algorithm replacement policy,
on Database
Systems,
implementations: and the blocking
Vol. 18, No. 3, September
1993.
532
Y. Ioannidis
. Table
I.
et al,
Parameters
of Algorithm
Implementations
Parameter
Symbol
Buffer size (pages) Buffer replacement policy Blocking factor
factor.
The buffer
sections,
the
discussed
at least
10 pages
Values
02050 :RU Ad LUND 5 and 15
was
varied
are
M = 10,
by M)
values
Tested
Values
M B
size (denoted
only
and their
@O.25)
considerably.
In the results
and
all
phenomena A minimum
occur beyond 50 pages (the 1/0 cost drops sharply, as expected). of ten pages are necessary because all algorithms require that at lists
can fit in memory
and
because
need
two descendent
to run,
50,
algorithms
least
of memory
20,
no interesting
at the same time.
For 2000 node
graphs, this accounts for eight pages are needed, one for the
pages in the worst case. In addition, adjacency vectors and one for input
arc-tuples.
We experimented with several buffer replacement policies, in particular, LRU and four versions of LUND, with the fraction f being equal to 0.25, 0.5, 0.75, and 1.0, respectively. Among the versions of LUND, the one with f = 0.25 was almost always either the best or close to it. Hence, we show the results for LRU and LUND with f = 0.25 only. In each case, the results presented are for the best of the two replacement policies. Finally, we experimented with two blocking factors, B = 5 and 15. The effect of the value of B depended on the input graph type and is discussed in detail in Section 6.2.3. However, the relative performance of the algorithms remained unaffected by B, so the results in the other sections are for B = 15. The above space of parameters of the algorithm implementations is summarized in Table I.

5.2 Parameters of Data

All relations used in our experiments contained integer node identifiers, which represent the best case for efficiency. This is without any loss of generality, however, because even if a given relation is not in this form, it can be transformed to it by a single pass over its tuples. In addition, the integers used were random numbers in a specific range, so that the actual values that represented the nodes would not bias the performance of the algorithms. Also, for any specific setting of the values of the parameters described below, all algorithms
under comparison were run on the same input graph, so that no differences in the specific choice of node identifiers or other secondary characteristics could affect the results. We show results for both acyclic and arbitrary graphs below. We also experimented with trees, but since those tests did not offer any additional insights beyond what was observed for acyclic graphs, we do not present
them. The following are the parameters that were used to characterize graphs, with the symbols that denote them in parentheses: the number of nodes (N), the outdegree or branching factor of each node (b), and the depth (d). Preliminary experiments with several values of IV showed that the main conclusions of this study seem to be unaffected by lV. Hence, we only present ACM
Transactions
on Database
Systems,
Vol. 18, No. 3, September
1993,
Transitive Closure Algorithms results
for the value
with
J/ = 1000, with
with
IV = 2000.
N = 2000. values
(We also studied
of other
The trends
and analysis
for IV = 2000.) We experimented and we present tensively, sures
because
that
stood
in many
path
as follows.
cases,
complete,
the depth
simple
with
some
of graphs
exhaustively
of b were
graphs
extremes
are rather
to be equal
Its
importance
point
during
have
the
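The single-pass transformation mentioned above, replacing arbitrary node labels by random integer identifiers so that the values themselves cannot bias the algorithms, can be sketched as follows. This is our own minimal illustration; the function name `relabel` and its parameters are ours, not the paper's.

```python
import random

def relabel(tuples, id_range=10**9):
    """One pass over the tuples of a binary relation, replacing each
    distinct node label by a distinct random integer in [0, id_range)."""
    ids = {}        # label -> integer identifier
    used = set()    # identifiers assigned so far

    def ident(label):
        if label not in ids:
            x = random.randrange(id_range)
            while x in used:        # re-draw on a (rare) collision
                x = random.randrange(id_range)
            used.add(x)
            ids[label] = x
        return ids[label]

    return [(ident(u), ident(v)) for (u, v) in tuples]
```

Because the mapping is built lazily inside the list comprehension, a single scan over the tuples suffices, which is what makes the assumption of integer identifiers lossless.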
visited[j] > 0 and popped[j] = 0 when they are examined, i.e., all arcs (i, j) are back arcs. (In both cases B_i^{E_i} = E_i and POP_i^{E_i} = T_i^{E_i} = ∅. Moreover, in the former case E_i = ∅.) Part (b) of the lemma does not apply here, so we only prove part (a). Within the call visit(i), let V denote the set of children of i that have been iterated through in statements (5)–(19) of the algorithm at any time. Modify (a) in the statement of the lemma into (a') so that it reads as follows.

(a') For every node i, after examining all children of i in V ⊆ E_i, one of the following holds:
(a1') frame[i] = n + 1, root[i] = n + 1, and S_i = ∅, or
(a2') frame[i] = top, root[i] = r such that visited[r] = min_{k∈V}{visited[k]}, nodes[frame[i]] = {i}, and S_i = list[frame[i]] = ∅.
We prove the above by induction on the size of V, i.e., the number of children of i that have been examined at any time.
Basis. Let |V| = 0. Then, the for-loop of statements (5)–(19) has not been executed at all, and therefore frame[i] and root[i] remain as they were before, i.e., equal to n + 1. Moreover, S_i = ∅. Thus, (a1') holds.

Induction Step. Assume that the claim is true after examining c ≥ 0 children. We prove it for c + 1, i.e., |V| = c + 1. Let j be the (c + 1)th child of i, i.e., V_new = V_old ∪ {j}. Then, (i, j) is a back arc. By the induction hypothesis, before examining j in statement (5), either (a1') or (a2') holds. In the former case (j is the first child of i to be examined and c + 1 = 1), the condition in statement (18) is satisfied and its then part is executed, establishing the following.
frame[i] = top
nodes[frame[i]] = nodes[top] = {i}
S_i = list[frame[i]] = ∅

In addition, statement (19) is executed, and because visited[n + 1] = n + 1 and 0 < visited[j] < n, the then part of statement (30) establishes root[i] = j, where visited[j] = min_{k∈V}{visited[k]} vacuously (since V is singleton).
Thus (a2'), and therefore (a'), holds. In the latter case (j is not the first child of i to be examined), the test in statement (18) fails, and only statement (19) is executed, possibly updating root[i] as (a2') requires. The remaining clauses of (a2') remain valid by the induction hypothesis. Thus, in this case also, (a') holds.

After examining all the children of i, (a') holds for V = E_i. Since i is the first node for which visit(i) returned, by (a') i ≠ root[i]; thus statement (21) is skipped, and (a') still holds after visit(i) returns. We have already mentioned that in this case B_i^{E_i} = E_i and POP_i^{E_i} = T_i^{E_i} = ∅. Thus, (a') reduces to (a), and the basis case of the outer induction is proved.

Induction Step. Assume that the lemma is true for all nodes i such that
pop[i] ≤ pop for some pop ≥ 1, i.e., for the first pop nodes i for which visit(i) returned. We prove it for the (pop + 1)th. Let h be the popth node for which the call to visit returned, and let i be the (pop + 1)th such node. By the depth-first traversal structure of the algorithm, either i is a leaf in the spanning forest of calls to visit (a descendent of a sibling of h or a member of a different tree in the forest) or i is the father of h and visit(h) was called from within visit(i). We examine the two cases separately.

Assume that i is a leaf in the spanning forest. Then all arcs (i, j), if any, are either back arcs or cross arcs; thus, T_i^{E_i} = ∅. As in the basis case, since no call visit(j) is issued for any child j of i, part (b) of the lemma does not apply, so we only prove part (a). Within the call visit(i), let V denote the set of children of i that have been iterated through in statements (5)–(19) of the algorithm at any time. Modify (a) in the statement of the lemma into (a'') so that it reads as follows:

(a'') For every node i, after the call visit(i) returns, one of the following holds:
(a1'') frame[i] = n + 1, root[i] = n + 1, and S_i = D_i^V, or
(a2'') frame[i] = top, root[i] = r such that visited[r] = min_{k∈V}{visited[k]}, nodes[frame[i]] = {i}, and S_i = list[frame[i]] = {j | j ∈ POP_i^V and frame[j] = n + 1}.

As in the basis case, we prove the above by induction on the size of V, i.e., the number of children of i that have been examined at any time.

Basis. Let |V| = 0. Then, the for-loop of statements (5)–(19) has not been executed at all, and therefore frame[i] and root[i] remain as they were initially, i.e., equal to n + 1. Moreover, S_i = D_i^V = ∅. Thus, (a1'') holds.
Induction Step. Assume that the claim is true after examining c ≥ 0 children. We prove it for c + 1, i.e., |V| = c + 1. Let j be the (c + 1)th child of i, i.e., V_new = V_old ∪ {j}. Arc (i, j) can be a back arc or a cross arc. We treat the two cases separately.

Assume that (i, j) is a back arc, i.e., the condition of statement (17) is satisfied, by Lemma A.2. By the induction hypothesis, before examining j in statement (5), either (a1'') or (a2'') holds for i. If (a1'') holds (j is the first child of i to be examined and reveals that i is a member of a nontrivial strong component), the condition in statement (18) is satisfied and its then part is executed, establishing the following:

frame[i] = top
nodes[frame[i]] = nodes[top] = {i}
S_i = list[frame[i]] = {j | j ∈ POP_i^V and frame[j] = n + 1}

The last equality is justified by the fact that, for a back arc (i, j), the addition of j into V does not affect the contents of POP_i^V. In addition, statement (19) is executed, and because visited[n + 1] = n + 1 and 0 < visited[j] < n, the then part of statement (30) establishes root[i] = j, where visited[j] = min_{k∈B_i^V}{visited[k]} vacuously (since B_i^V is singleton). Thus (a2''), and therefore (a''), holds. If (a2'') holds before examining j, the
test in statement (18) fails, and only statement (19) is executed, possibly updating root[i] as (a2'') requires. The remaining clauses of (a2'') remain valid by the induction hypothesis. (The values of S_i and list[frame[i]] have to remain the same, since the addition of a back arc in V does not affect the contents of POP_i^V.) Thus, in this case also, (a2''), and therefore (a''), holds.

Assume that (i, j) is a cross arc, i.e., the condition of statement (13) is satisfied, by Lemma A.2. Since popped[j] = 1, the call to visit(j) has already returned. Thus, by the induction hypothesis of the outer induction, the lemma holds for j. If root[j] = n + 1, (a) holds for j, and the complete set of descendants of j is stored in S_j and propagated to S_i at statement (14). If (a1'') holds for i before examining j, S_i is correctly updated to D_i^V (with V
containing j also). If (a2'') holds for i before examining j, since all nodes in S_j have been popped before i, they are members of POP_i^V. Thus, S_i and list[frame[i]] are updated correctly also. The values of frame[i], root[i], and nodes[i] correctly remain unchanged. (Specifically for root[i], the addition to V of the head of a cross arc that is in a different strong component (root[j] = n + 1) cannot have any effect on the contents of B_i^V.) On the other hand, if root[j] ≠ n + 1, the test in statement (14) fails, and control reaches statement (15). Recall that, by the induction hypothesis of the outer induction, frame[j] ≠ n + 1. Before examining j in statement (5), either (a1'') or (a2'') holds for i. If (a1'') holds, the condition in statement (15) is satisfied and its then part is executed, and the following will again be established:

frame[i] = top
nodes[frame[i]] = nodes[top] = {i}
S_i = list[frame[i]] = {j | j ∈ POP_i^V and frame[j] = n + 1}

The last equality is justified by the fact that, for a cross arc (i, j), if root[j] ≠ n + 1 then the addition of j to V does not affect {j | j ∈ POP_i^V and frame[j] = n + 1}. In addition, statement (16) is executed, establishing root[i] = j, where visited[j] = min_{k∈B_i^V}{visited[k]} vacuously (since B_i^V is singleton). Thus, (a2''), and therefore (a''), holds.

If (a2'') holds for i before examining j, the test in statement (15) fails, and only statement (16) is executed, possibly updating root[i] as (a2'') requires. The remaining clauses of (a2'') remain valid by the induction hypothesis, since the contents of {j | j ∈ POP_i^V and frame[j] = n + 1} are not affected by the addition of j to V. Thus, in this case also, (a2''), and therefore (a''), holds.
Note that (a'') reduces to (a) when V becomes equal to E_i. (Recall that for a leaf of the spanning forest, T_i^{E_i} = ∅.) For a leaf of the spanning forest of calls to visit, it can never be true that i = root[i]. This is because root[i] is either made equal to a child of i (statement (19)) (but i is never examined as a child of itself (statement (5))), or it is made equal to the root of a child of i (statements (11), (12), and (16)). By the induction hypothesis of the outer induction, (a) holds for j. Statements (11), (12), and (16) are only executed when root[j] ≠ n + 1; thus, (a2) must hold for j. If i = root[j] at some point,
this means that there is a path from i to j of tree and cross arcs only, starting with a tree arc from i (and finishing with a back arc to i). This, however, contradicts the hypothesis that i is a leaf in the spanning forest. Thus i cannot be equal to root[i], statement (21) is skipped, and after the return of visit(i), (a) holds. This completes the proof of the lemma for the case that i is not the father of h.

Assume that i is the father of h. For the first time since the call to visit(h) was taken, we need to prove both (a) and (b) for i. We first prove (b). Node i is never placed in an entry of nodes, except within the top-level call of visit(i)
(statements (11), (12), and (16)). Thus, if frame[i] = n + 1 before visit(j) is called for a child j of i, this will not be changed after the return of the call to visit(j). When i is inserted into some entry of nodes, it is always true that frame[i] = top (statements (11), (15), and (18)). Thus, consider the case where frame[i] = top = TOP, for some value TOP, before the call to visit(j) for some child j of i. During the call visit(j), top may be increased and decreased multiple times. Consider the last time that top was increased from TOP to TOP + 1 without being decreased to TOP before the call to visit(j) returns. Assume that this happened within the call visit(l). Clearly, l is a descendent of i and j in the spanning forest of calls to visit. We prove that, for any node k in the path from j to l in the forest, frame[k] = TOP + 1 when the recursive call visit(k) returns. This will be done by induction on the distance of k from l.
Basis. Consider l itself. Consider any call visit(m) to a child m of l, after top is increased to TOP + 1 and frame[l] is set to TOP + 1 (statement (15) or (18)). Since l and m have been popped before i, by the induction hypothesis of the outer induction, (a) and (b) hold for l and m. Thus, when visit(m) returns, either root[m] = frame[m] = n + 1, statement (9) is executed, and frame[l] remains equal to TOP + 1 = top, or root[m] ≠ n + 1 and frame[m] = top = TOP + 2, in which case, since root[l] ≠ n + 1, statement (12) is executed, and frame[l] (also frame[m]) is set to TOP + 1 = top. This covers the basis case.

Induction Step. Assume that the claim is true for an arbitrary node k' in the path between j and l. We prove it for its father k in this path. Again, by the induction hypothesis of the outer induction, (a) and (b) hold for both k and k'. When the call visit(k') returns, by the induction hypothesis of the inner induction, frame[k'] = TOP + 1 = top. Moreover, it should be root[k] = frame[k] = n + 1. Otherwise, frame[k] = TOP (outer induction hypothesis (b) for k). If that were the case, statement (12) would be executed, and top would be decreased to TOP, which contradicts our assumption that this would not happen between the call visit(l) and the return of the call visit(j) to visit(i). Thus, root[k] = frame[k] = n + 1, statement (11) is executed, which sets frame[k] equal to frame[k'] = TOP + 1. After any other call visit(k''), following the return of visit(k'), within the call visit(k), by the outer induction hypothesis for k, either frame[k''] = n + 1, in which case frame[k] remains equal to top = TOP + 1, or frame[k''] = TOP + 2 = top, in which case after the execution of statement (12), frame[k] is set back again to top = TOP + 1. Thus, in all cases the claim holds for k.

By the above induction, after visit(j) returns, frame[j] = TOP + 1 = top. This concludes the proof of (b) for i.
The proof of (a) is straightforward. Recall that i is the next node for which visit(i) returns after the return of visit(h). If root[h] = n + 1, by the induction hypothesis (a1), S_h contains all the descendants of h, which are correctly propagated to S_i, or to both S_i and list[frame[i]] (statement (9)). Nothing else is modified: frame[i] should remain n + 1 or top; root[i] should retain its value, because root[h] = n + 1; nodes[frame[i]] should also retain its value, since for any member k ∈ T_i^{(h)}, which is the set of new members of T_i^V, frame[k] = n + 1 by the induction hypothesis and the fact that root[h] = n + 1. Thus, in this case, (a) holds. If root[h] ≠ n + 1, either statement (11) or statement (12) will be executed. Addressing each case is similar to previous parts of this proof and is omitted. In all cases, (a) is seen to hold.

Node i may have other children to examine after h, all of which must be heads of cross or back arcs. It is easily seen that the claim still holds after examining these nodes also, as was done before. If (a1) holds when control reaches statement (20), i.e., root[i] = n + 1, then statement (21) is skipped, and (a1), and therefore (a) also, holds after visit(i) returns. If (a2) holds when control reaches statement (20), i.e., root[i] ≠ n + 1 but root[i] ≠ i, then again (a2) remains valid after visit(i) returns. Finally, if (a2) holds but i = root[i] ≠ n + 1, then statement (21) is executed. In all cases, (a) holds after the call visit(i) returns. ❑
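The arc taxonomy on which the proof rests (a tree arc leads to an unvisited node, a back arc to a node with visited[j] > 0 and popped[j] = 0, and a cross arc to a node with popped[j] = 1) can be sketched with a plain depth-first traversal. This is our own minimal illustration in Python, not the paper's numbered algorithm:

```python
def classify_arcs(graph):
    """Label every arc of `graph` (dict: node -> list of children)
    as 'tree', 'back', or 'cross' during a depth-first traversal."""
    visited, popped, kind = {}, {}, {}
    clock = [0]

    def visit(i):
        clock[0] += 1
        visited[i] = clock[0]        # visited[i] > 0 once i is entered
        popped[i] = 0
        for j in graph.get(i, []):
            if visited.get(j, 0) == 0:   # unvisited child: tree arc
                kind[(i, j)] = 'tree'
                visit(j)
            elif popped[j] == 0:         # visited but not popped: back arc
                kind[(i, j)] = 'back'
            else:                        # already popped: cross arc
                kind[(i, j)] = 'cross'
        popped[i] = 1                    # i is popped when visit(i) returns

    for i in list(graph):
        if visited.get(i, 0) == 0:
            visit(i)
    return kind
```

On the cycle 1 → 2 → 1 with the extra arc 3 → 2, the sketch labels (2, 1) a back arc and (3, 2) a cross arc, exactly the two cases the inner induction treats separately.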
THEOREM A.4. Algorithm GDFTC terminates and correctly computes the transitive closure of G.

PROOF. GDFTC terminates by the finiteness of G and since every node i is considered only once in the algorithm (statement (5)). Clearly, every node i that satisfied (a1) after the call visit(i) returned has its descendants correctly computed and stored in S_i, by Lemma A.3. Consider a node i that satisfied (a2) after the call visit(i) returned. (These are the nodes that are members of nontrivial strong components but are not equal to a root.) When visit(i) returned, root[i] was equal to a node r for which visit(r) had not returned yet. Since r is an ancestor of i in the spanning forest of calls to visit, there is a path from r to i, and a path from i ending in a back arc whose head is r. If visited[root[r]] < visited[r] when all calls to r's children have returned, the same argument applies to r; due to the finiteness of G, there must be a node r' in the spanning forest of calls to visit such that root[r'] = r'. Before visit(r') returns, all members of nodes[frame[i]] have their descendants set equal to S_{r'}. Since i ∈ T_{r'}^{E_{r'}} and frame[i] ≠ n + 1 (by Lemma A.3), i ∈ nodes[frame[r']], and i's descendants are appropriately updated. The correctness of the update is straightforward, since there is a cycle that involves both r' and i, and therefore the two nodes have the same descendants. ❑
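The property that Theorem A.4 ultimately rests on, that all members of a cycle share the same descendant set, can be illustrated with a compact Tarjan-style sketch. This is our own simplified rendering of a depth-first transitive closure with strong-component collapsing, not the paper's GDFTC code; all names are ours.

```python
def transitive_closure(graph):
    """Descendant sets of a directed graph (dict: node -> children),
    computed by one DFS that collapses strongly connected components,
    so every member of a cycle ends up with the same descendant set."""
    visited, low, on_stack = {}, {}, set()
    stack, desc, clock = [], {}, [0]

    def visit(i):
        clock[0] += 1
        visited[i] = low[i] = clock[0]
        stack.append(i); on_stack.add(i)
        d = set()
        for j in graph.get(i, []):
            if j not in visited:              # tree arc: recurse
                d |= visit(j) | {j}
                low[i] = min(low[i], low[j])
            elif j in on_stack:               # back arc into an open component
                low[i] = min(low[i], visited[j])
                d.add(j)
            else:                             # cross arc: j is finished
                d |= desc[j] | {j}
        desc[i] = d
        if low[i] == visited[i]:              # i is the root of its component
            comp = []
            while True:
                k = stack.pop(); on_stack.discard(k)
                comp.append(k)
                if k == i:
                    break
            full = set().union(*(desc[k] for k in comp))
            for k in comp:                    # the whole cycle shares descendants
                desc[k] = full
        return desc[i]

    for i in list(graph):
        if i not in visited:
            visit(i)
    return desc
```

On the cycle 1 → 2 → 3 → 1 with an extra node 4 → 2, every cycle member receives the descendant set {1, 2, 3}, and 4 inherits it through its cross arc, mirroring the update at the component root in the proof above.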
Received August 1989; revised September 1991; accepted July 1992