Optimizing Equijoin Queries in Distributed Databases Where Relations Are Hash Partitioned
DENNIS SHASHA and TSONG-LI WANG
New York University
Consider the class of distributed database systems consisting of a set of nodes connected by a high bandwidth network. Each node consists of a processor, a random access memory, and a slower but much larger memory such as a disk. There is no shared memory among the nodes. The data are horizontally partitioned often using a hash function. Such a description characterizes many parallel or distributed database systems that have recently been proposed, both commercial and academic. We study the optimization problem that arises when the query processor must repartition the relations and intermediate results participating in a multijoin query. Using estimates of the sizes of intermediate relations, we show (1) optimum solutions for closed chain queries; (2) the NP-completeness of the optimization problem for star, tree, and general graph queries; and (3) effective heuristics for these hard cases. Our general approach and many of our results extend to other attribute partitioning schemes, for example, sort-partitioning on attributes, and to partitioned object databases.

Categories and Subject Descriptors: F.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity; H.2.4 [Database Management]: Systems–distributed systems; query processing; H.2.6 [Database Management]: Database Machines

General Terms: Algorithms, Performance, Theory

Additional Key Words and Phrases: Equijoin, hashing, NP-complete problems, relational data models, spanning trees, database systems
This work was partially supported by the National Science Foundation under grants DCR-8501611 and IRI-8901699, and by the Office of Naval Research under grant N00014-85-K-0046.
Authors' addresses: D. Shasha, Courant Institute of Mathematical Sciences, New York University, New York, NY 10012; J. T. L. Wang, Department of Computer and Information Science, New Jersey Institute of Technology, University Heights, Newark, New Jersey 07102.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1991 ACM 0362-5915/91/0600-0279 $01.50
ACM Transactions on Database Systems, Vol. 16, No. 2, June 1991, Pages 279-308.

1. INTRODUCTION
We consider the class of distributed database systems consisting of a collection of nodes (sites), each having a processor, local memory, and local disk. The nodes communicate with one another across a high bandwidth (and high latency) network. Each relation is horizontally partitioned to different nodes by hashing: use a hash function h whose domain is the domain of some attribute A of relation R and whose range is the number of processing nodes. For each tuple t of R, we put t in processing node h(t.A).¹ When performing joins, we use the same function for any pair of relations that could possibly be joined together.

[Figure 1: relation R (Emp #, Salary) stored as fragments R1 and R2, and relation S (Emp #, Name, Age) stored as fragments S1 and S2; R1 and S1 reside at node 1, R2 and S2 at node 2.]
Fig. 1. Example of two partitioned relations.
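The partitioning scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hash is a simple modulo function of the kind the paper uses in its examples, the function names `h`, `partition`, and `local_joins` are hypothetical, and the employee tuples are made-up values rather than the contents of Figure 1.

```python
def h(x, n_nodes=2):
    """Modulo hash mapping an attribute value to a node number in 1..n_nodes."""
    return (x % n_nodes) + 1


def partition(relation, attr, n_nodes=2):
    """Place each tuple t of the relation at node h(t[attr])."""
    fragments = {node: [] for node in range(1, n_nodes + 1)}
    for t in relation:
        fragments[h(t[attr], n_nodes)].append(t)
    return fragments


def local_joins(r_frags, s_frags, attr):
    """Join co-located fragments node by node; no tuple crosses the network."""
    return [
        (r, s)
        for node in r_frags
        for r in r_frags[node]
        for s in s_frags[node]
        if r[attr] == s[attr]
    ]


# Hypothetical employee data (not the values shown in Figure 1).
R = [{"emp": e, "salary": sal}
     for e, sal in [(1, 38000), (2, 24000), (3, 42000), (4, 55000)]]
S = [{"emp": e, "age": a} for e, a in [(1, 35), (2, 25), (4, 40)]]

r_frags = partition(R, "emp")
s_frags = partition(S, "emp")
pairs = local_joins(r_frags, s_frags, "emp")
print(sorted(r["emp"] for r, _ in pairs))  # prints [1, 2, 4]
```

Because both relations are hashed with the same function on the join attribute, every matching pair of tuples lands on the same node, so the node-local joins together produce the full join result.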
Example 1. Suppose we have a system composed of two nodes, node 1 and node 2, and a hash function h(x) = (x mod 2) + 1. Figure 1 shows two relations that are partitioned on the join attribute emp #: R (S) is partitioned into R1 and R2 (S1 and S2), which are stored at node 1 and node 2, respectively.

Such a loosely coupled hash-partitioned architecture characterizes many parallel or distributed database systems that have recently been developed, including commercially available ones such as the Teradata machine [33], as well as research prototypes such as Bubba [9], Gamma [11], Grace [18], and Sabre [34].

In this paper we are concerned with optimizing equijoin queries in such systems. We restrict ourselves to queries that retrieve all fields of joined relations.² For our purposes, we assume that relations are not replicated, that is, each node holds a unique fragment for each relation. We also assume that there is no answer site in the systems. Thus we mainly concentrate on join processing, ignoring unioning results at the answer site. For example, consider R and S in Figure 1, and doing the join between R and S based on the condition R.emp # = S.emp #. Since both relations are partitioned on emp #, no communication is required. On the other hand, if R is partitioned on emp #, whereas S on age, each S tuple t needs to be sent to processing node h(t.emp #). Such data transmission is called repartitioning. The communication overhead caused by repartitioning is not negligible. If communication among nodes is slow, then most time will be spent in routing data in a network when repartitionings occur. Even if communication is fast, repartitioning may reduce throughput because it entails more disk accesses.

¹ Though one may partition on a set of attributes, for the purpose of this paper, we assume partitioning on a single attribute only. Some partitioning schemes (see, e.g., [18]) also allow the range to be k times the number of processing nodes, with each node having k buffers that can receive data. In that case, h(t.A) designates a buffer as well as a node. The results in this paper hold for that generalization.
² This restriction is made mainly for simplifying cost computation. It is not difficult, however, to extend our techniques to handle queries whose outputs contain only certain fields of joined relations.

To understand the importance of such overhead, let us review an experiment by DeWitt et al. on the Teradata machine [12]. The machine consisted of 20 nodes and 40 disk storage units, which were interconnected with a 12-Mbyte/second-bandwidth network. The data consisted of 200-byte tuples. DeWitt et al. performed joins between two relations having different sizes (10,000 tuples with 1,000 tuples, 100,000 tuples with 10,000 tuples, and 1,000,000 tuples with 100,000 tuples, respectively) and recorded the time for executing these joins with and without repartitionings. Table I shows the result of their experiment. By comparing the values listed in the second and third columns of the table, we see that there exist ratios of between 2 and 3 to 1 for the same size joins. This indicates that when processing queries in a loosely coupled multiprocessor system, the repartitioning cost can dominate the local processing cost. The authors also measured the influence of the bandwidth of the network. They observed that repartitioning a relation of 200 Mbyte took 2,000 seconds and used less than 1 percent of network capacity on the average. This result reveals that repartitioning large relations is expensive even when a network is fast. But can we control the repartitionings necessary to process a query, thereby minimizing response time in such an environment?

Table I. Effect of Repartitioning on Join Performance

Join Size                  With Repartitioning (seconds)   Without Repartitioning (seconds)
10,000 with 1,000          35.4                            18.0
100,000 with 10,000        286.4                           122.0
1,000,000 with 100,000     3,144.0                         1,143.0

Source: DeWitt et al. [12]

Example 2. Suppose we have a suppliers-and-parts database containing relations SUP(S #, sname, city), PART(P #, pname, city), and SUP-PART(S #, P #, city). Suppose that SUP is partitioned on s #, PART on p #, and SUP-PART on s #. Consider the following SQL query:

SELECT SUP.*, PART.*, SUP-PART.*
FROM SUP, PART, SUP-PART
WHERE SUP.s# = SUP-PART.s#
AND SUP-PART.p# = PART.p#

One way to process this query is to join SUP-PART with PART first, obtaining a result T1, and then join T1 with SUP. Assume that joining two relations takes 100 seconds and repartitioning a relation takes 100 seconds, as indicated in Table I. The cost of this strategy would be (the cost of repartitioning SUP-PART) + (the cost of joining SUP-PART with PART) + (the cost of repartitioning T1) + (the cost of joining T1 with SUP) = 100 + 100 + 100 + 100 = 400 seconds. On the other hand, joining SUP with SUP-PART first requires no repartitioning. Then joining the intermediate result T2 with PART requires repartitioning T2. The cost of this strategy is thus 100 + 100 + 100 = 300 seconds. Clearly, the second strategy is better for processing the given query. This example demonstrates that by properly arranging the order of joins among relations, better performance can be achieved. Exactly how to do this to minimize the cost for various important queries is the main concern of this paper.

In Section 2, we present an overview of previous work in the area.
Section 3 discusses basic assumptions and formally defines the optimization problem. Section 4 then gives a dynamic programming algorithm to solve the problem in the context of closed chain queries. Section 5 establishes the NP-completeness of the problem for star, tree, and general graph queries. Section 6 presents and analyzes heuristic procedures. In Section 7, we report the results of computational experiments. Finally we conclude the paper with some directions for future research in Section 8.

2. RELATED WORK
A considerable amount of work has been performed in the area of query processing in the context of distributed database management systems (DDBMSs). In these systems, horizontal partitioning (fragmentation) is used to increase parallelism, thus increasing throughput and saving response time. Bernstein et al. [3], Stonebraker and Neuhold [31], and Williams et al. [35] discussed processing strategy using these fragmentation schemes. A common objective adopted by researchers in the area of DDBMS is to minimize the communication cost incurred as part of processing a query. Segev [25], for example, introduced the notion of remote semijoin, which can significantly reduce the communication cost while incurring higher processing cost. Fragments may also be replicated at different sites. Yu et al. [38] distinguished queries as locally processable and nonlocally processable. If a query is locally processable, then the query can be processed without data transfer. They proposed a linear time "fragment and replicate" algorithm for nonlocally processable queries. Relevant work has also been discussed by Apers et al. [1], Wong [36], Pramanik and Vineyard [22], Stamos and Young [30], and others. What distinguishes us from these workers is that we seek the optimal join order when processing queries, starting with partitioned data and ending with partitioned data. Moreover, our main concern is on the environment without data replication. We defer to Section 8 the possibility of replicating data.

The idea of using hash partitioning for joins in multiprocessor systems comes from [2]. This scheme is then adopted by Kitsuregawa et al. in the Grace database machine project [18], though their emphasis is on the speed of the sort engine rather than on the performance of the join algorithm. Valduriez and Gardarin [34] also described various join and semijoin strategies, but hashing is applied only during the partition phase. DeWitt and Gerber [10] proposed two algorithms that exploit hashing thoroughly in both partition and join phases. They applied the algorithms to a number of simple join operations and concluded that both algorithms have satisfactory performance.

In
contrast to the previous work on hash partitioning, we consider multijoin queries and the intermediate results arising in processing such queries. Thus, we are concerned not with the particular hash-join algorithm used to execute a join, but rather with finding join strategies that minimize query processing time. We developed a model that takes communication, I/O, and processing cost into account, and we use it to explore methods for optimizing queries with various topologies. In [28], we address a simpler view of the problem, without considering the size of the relations participating in a join. Here, we consider the more general problem, taking size into account. We find a polynomial time algorithm for queries whose closure is a chain. We then show that the problem of finding optimal solutions is NP-complete for star, tree, and general queries. For these cases we present many heuristics and examine their performance; a heuristic related to Kruskal's spanning tree algorithm [20] achieves the best performance over a wide range of assumptions.

3. FORMULATION OF THE PROBLEM
3.1 Basic Assumptions
Our objective is to minimize the response time for a query. We approximate this goal by considering both processing and repartitioning costs. Processing cost refers to the total time spent in performing joins at each node. Repartitioning cost refers to the time spent in repartitionings, which is determined by (1) the amount of data transmitted for a given network, and (2) system parameters of the architecture. System parameters include the bandwidth of the network, as well as the I/O speed and processing speed of the processors. To simplify the cost model developed in this paper, we make the following assumptions.

Assumption 1. The I/O and processing speeds at different nodes are the same. The bandwidth of the network is a constant.

Assumption 2. Each node uses the same hashing technique to execute joins (e.g., DeWitt and Gerber's). Data transfer is in pipelined fashion. (The data could be transferred one tuple or one batch at a time.)

Assumption 3. For each repartition (or partition) attribute of a relation, the hash function applied to the attribute distributes tuples evenly to each node.

Assumptions 1 and 2 deal with parameters related to the system. These two assumptions, together with Assumption 3, ensure that joins at different processors complete at the same time. Assumption 3 is reasonable when the attribute is a key and the hash function is "good," that is, it acts as a random function from the set of tuples to the set of n nodes such that each tuple has probability 1/n of being sent to any node. This has been established analytically in [29], where we show that the probability that any of the n nodes has more than twice the average number of tuples is very small, provided that the number of tuples is > 4 n log n. If the attribute is not a key or the hash function is biased, our results should be taken as heuristics.
3.2 Terminology
We assume the reader is familiar with the standard terms, attribute, tuple, join, used in relational database systems. Define a relation schema R as a finite set of attributes {A1, A2, ..., An} [16]. Associated with each attribute Ai, 1 ≤ i ≤ n, is a domain, denoted dom(Ai). A relation instance (or simply relation) R on schema R is a finite set of mappings {t1, t2, ..., tm} from R to the set of domains such that for each t ∈ R, t(Ai) ∈ dom(Ai), 1 ≤ i ≤ n. Let B be an attribute in R. We write t.B for t(B). We define R.B as a set {t1.B, t2.B, ..., tm.B}.

An equijoin clause, or clause for short, is a pair {R.B, S.C}, where B and C are attributes of R and S, respectively. This clause represents the join condition R.B = S.C. We are only interested in queries whose qualification is a conjunction of such join conditions. Therefore, we write a query as a set of clauses.³ To represent a join that links two different tuples of the same relation, we consider the arguments of the join to be different instances of the relation. For example, given the schema EMP(name, sal, manager), to find the salary of each employee's manager, we would consider two instances of EMP, EMP1 and EMP2. We assume these two instances would be represented as distinct relations in the query.

³ Because all attributes of joined relations appear in the output, as remarked in Section 1, we exclude the target list in defining a query.

In order to find an optimal join strategy, it is helpful to define the query graph QGq = (QVq, QEq) for a query q, where

QVq = {relations referenced by the clauses in q}
QEq = {{R, S} | some member of q references both R and S}.

An example of a query graph is given in Figure 2a. Notice that a query graph is never a multigraph, though there may be many clauses associated with an edge. We write clauses({R, S}, q) to denote the set of equijoin clauses referencing both R and S in q. We assume that the query graph of a query is connected. Otherwise we can always execute the connected components as individual queries; the result would then be a Cartesian product of the results of these queries.

Observe that in executing a query, it is enough to consider a spanning tree of the query graph, treating the clauses on the other edges as selections. Furthermore, for each edge of the tree, though there can be many clauses associated with it, we take only one clause as the join clause, applying the other clauses as selections (see Figure 2b). This motivates us to introduce the following.

SELECT S.*, T.*, R.*
FROM S, T, R
WHERE S.C = T.B
AND S.G = T.H
AND S.F = R.E
AND T.B = R.D

C1 = {S.C, T.B}
C2 = {S.G, T.H}
C3 = {S.F, R.E}
C4 = {T.B, R.D}

[Query graph with nodes S, T, R: edge S-T carries C1 and C2, edge S-R carries C3, edge T-R carries C4.]
(a)

(1) Consider the query graph in Figure 2a just above. We can execute the graph by joining S with T first, and then ST with R. (The juxtaposition of two relation names represents the join result of the relations.)
(2) While joining S and T based on, say {S.C, T.B}, the selection condition S.G = T.H is also processed.
(3) Then when joining ST with R based on, say {S.F, R.E}, the clause {T.B, R.D} enforces another selection condition on tuples of ST and R (see Figure 2c).
(b)

[Single-clause spanning tree with nodes S, T, R: edge S-T labeled C1, edge S-R labeled C3.]
(c)

(4) Consider the query graph in Figure 2a. When joining S and T on the clause {S.C, T.B}, S needs to be partitioned on C and T on B. Then in joining ST and R,
(i) if {S.F, R.E} is used as the join clause, ST needs to be repartitioned on F; such a repartitioning causes part(T) to be empty;
(ii) if {T.B, R.D} is used as the join clause, since no repartitioning of ST is needed, part(T) = {B}.
(d)

Fig. 2. Example of query graph and its single-clause spanning tree.

Definition 1. A spanning tree QG_Treeq = (QV_Treeq, QE_Treeq) of a query graph QGq = (QVq, QEq) is a graph with the following properties:
(i) QE_Treeq ⊆ QEq and QV_Treeq = QVq.
(ii) (QV_Treeq, QE_Treeq) is a tree.
(iii) For each e ∈ QE_Treeq, clauses(e, Treeq) ⊆ clauses(e, q) and clauses(e, Treeq) ≠ ∅.
If clauses(e, Treeq) is a singleton set for each e ∈ QE_Treeq, QG_Treeq is called a single-clause spanning tree (ss tree) of QGq.

Figure 2c shows an ss tree for the query graph in Figure 2a, which corresponds to the execution sequence in Figure 2b.
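To make the definitions of query graph and ss tree concrete, here is a minimal Python sketch. It is an illustration, not part of the paper: the representation of a clause as a pair of (relation, attribute) tuples and the function names `query_graph`, `is_spanning_tree`, and `ss_trees` are assumptions, and the brute-force enumeration is only meant for small examples. It assumes a connected, loop-free query graph.

```python
from itertools import combinations, product


def query_graph(clauses):
    """Map each edge {R, S} to the list of clauses referencing both R and S."""
    edges = {}
    for clause in clauses:
        (r, _), (s, _) = clause
        edges.setdefault(frozenset((r, s)), []).append(clause)
    return edges


def is_spanning_tree(nodes, tree_edges):
    """True if tree_edges connects all nodes without forming a cycle."""
    if len(tree_edges) != len(nodes) - 1:
        return False
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    for edge in tree_edges:
        a, b = (find(n) for n in edge)
        if a == b:  # this edge would close a cycle
            return False
        parent[a] = b
    return True


def ss_trees(clauses):
    """Enumerate every single-clause spanning tree as a frozenset of clauses."""
    edges = query_graph(clauses)
    nodes = set().union(*edges)
    trees = []
    for tree_edges in combinations(edges, len(nodes) - 1):
        if is_spanning_tree(nodes, tree_edges):
            # Pick exactly one join clause on each tree edge.
            for choice in product(*(edges[e] for e in tree_edges)):
                trees.append(frozenset(choice))
    return trees


# The four clauses of the query in Figure 2.
C1 = (("S", "C"), ("T", "B"))
C2 = (("S", "G"), ("T", "H"))
C3 = (("S", "F"), ("R", "E"))
C4 = (("T", "B"), ("R", "D"))
trees = ss_trees([C1, C2, C3, C4])
print(len(trees))  # prints 5
```

The query graph here is a triangle with two clauses on the S-T edge, so it has three spanning trees and five ss trees; the tree made of C1 and C3 is the one described for Figure 2c.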
3.3 Cost Model
We denote as part(R) the set of attributes on which a relation R is partitioned sometime in the query execution. By the assumption stated in Section 1, the initial value of part(R) is always a singleton set. However, when repartitionings occur, part(R) may become empty. Figure 2d illustrates such a case.

In order to evaluate a join strategy effectively, we need to have a measure that reflects the cost. We begin with the definition of the cost for a clause.

Definition 2. Let |R| denote the number of tuples in relation R and |t_R| the width (in bytes) of a tuple in R. The cost of a clause {R.A, S.B} equals PC + RC; PC = α × (|R| × |t_R| + |S| × |t_S|) is the processing cost, and RC = β × Move is the repartitioning cost, where α, β are nonzero constants (e.g., in the Teradata case β = 2α), and

\[
\mathit{Move} =
\begin{cases}
0 & \text{if } (A \in part(R)) \wedge (B \in part(S)) \\
|S| \times |t_S| & \text{if } (A \in part(R)) \wedge (B \notin part(S)) \\
|R| \times |t_R| & \text{if } (A \notin part(R)) \wedge (B \in part(S)) \\
|R| \times |t_R| + |S| \times |t_S| & \text{otherwise.}
\end{cases}
\]

Thus we are assuming that both PC and RC are linearly proportional to the sizes of the relations. This assumption is consistent with the Teradata measurements shown in Section 1 and the Grace benchmarks [18, 19]; it is also consistent with the cost equations derived elsewhere [27]. Move represents the amount of data transmitted in the network, where case 1 refers to the situation in which neither R nor S will be repartitioned; cases 2 and 3 represent the situation where one of the relations will be repartitioned, and case 4 represents the situation where both relations need to be repartitioned. Also, we consider the measure of processing cost to be the number of bytes moved from disks, as opposed to the number of page fetches from disks often assumed in the literature [4, 17, 24], though the two measures differ by only a constant additive factor. By including factors such as page size, I/O speed, and bandwidth, one can convert the byte units to corresponding time unit.

Next, as noted in Section 1, in executing the same ss tree of a graph, joining nodes in different order can yield different costs.⁴ We thus introduce the following.

Definition 3. A clause string, or string for short, associated with an ss tree QG_Treeq = (QV_Treeq, QE_Treeq) is an ordered clause sequence c_i1 c_i2 ... c_in, where QE_Treeq = {e_i | 1 ≤ i ≤ n}, i1, i2, ..., in is a permutation of 1, 2, ..., n, and c_ij ∈ clauses(e_ij, Treeq).

The order of clauses in a string indicates the order in which nodes of the associated tree are joined. There are n! strings associated with QG_Treeq if there are n edges in it.

⁴ Since the results obtained in this paper depend on the properties of graphs, we use the terms node and relation interchangeably.
We define the cost of a string as the sum of the costs of its component clauses. However, since the cost of a clause is dependent on the sizes and partition attributes of its two corresponding relations, which in turn are influenced by the order in which relations of the query tree are joined, it is required that, when calculating the cost of a string, one follow the exact order specified in the string.

Finally, we define the cost of an ss tree QG_Treeq as

\[
\mathit{Cost}(QG_{Tree_q}) = \min_{cs \in CS} \{ \mathit{Cost}(cs) \}
\]

where CS is the set of all strings associated with QG_Treeq. A minimum cost spanning tree of a query graph QGq is an ss tree of QGq such that no other ss tree of QGq has a smaller cost. For a given query graph, executing its minimum cost spanning tree yields minimum response time.

3.4 Exploiting Redundant Clauses to Optimize Queries
Our goal is to minimize response time when processing a query. Thus far we have concentrated on query graphs, rather than on queries. However, different query graphs may refer to the same query. To have a clear view of this, we need another graph model. Define the join graph for a query q, denoted JGq, as a pair (JVq, JEq), where

JVq = {R.A | R.A is an attribute of R that occurs in some clause of q}
JEq = {{R.A, S.B} | {R.A, S.B} is some clause in q}.

Intuitively a join graph represents an equivalence relation on attributes from different relations. (This is because we consider only equijoin clauses.) Each equivalence class of attributes is a connected component of the join graph. Figure 3 shows the join graph of the query in Figure 2a.

[Figure 3: join graph whose nodes are the attributes appearing in the clauses of the query of Figure 2a.]
Fig. 3. Join graph of query shown in Figure 2a.

We say two queries are equivalent if their join graphs have the same connectivity. For any state of a database, equivalent queries give the same result. A clause c of a query q is redundant if c is an edge in a cycle of JGq. The closure of a query q is an equivalent query, denoted q+, to which one cannot add more redundant clauses. Figure 4 illustrates the above concepts.

A query is a chain query (star query, tree query, respectively) if it is equivalent to some query whose query graph is a chain (star, tree, respectively).
(1) Consider the query q = {{R.A, T.B}, {T.B, S.C}, {R.A, S.C}, {T.B, U.D}}.
(2) q's join graph: [graph on the attribute nodes R.A, T.B, S.C, U.D with one edge per clause of q]
(3) q' = {{R.A, T.B}, {T.B, S.C}, {T.B, U.D}} is equivalent to q.
(4) {R.A, S.C} is redundant.
(5) q+ = {{R.A, T.B}, {T.B, S.C}, {R.A, S.C}, {T.B, U.D}, {R.A, U.D}, {S.C, U.D}}.
Fig. 4. Join graph representation of a query and its closure.
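The notions of join graph, equivalence, redundancy, and closure can be checked mechanically on the query of Figure 4. The following Python sketch is illustrative only: representing attributes as (relation, attribute) tuples and the function names `closure` and `is_redundant` are assumptions, and the sketch simply skips pairs of attributes from the same relation instead of applying the paper's renaming treatment of self-loops.

```python
from itertools import combinations


def closure(query):
    """Return q+ as a set of clauses: all pairs of attributes of different
    relations that lie in the same connected component of the join graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in query:          # each clause is an edge of the join graph
        parent[find(a)] = find(b)

    classes = {}
    for attr in list(parent):   # group attributes by equivalence class
        classes.setdefault(find(attr), []).append(attr)

    return {
        frozenset((a, b))
        for members in classes.values()
        for a, b in combinations(members, 2)
        if a[0] != b[0]         # simplification: ignore self-loops
    }


def is_redundant(clause, query):
    """A clause lies on a cycle of the join graph exactly when removing it
    leaves the connectivity, and hence the closure, unchanged."""
    rest = [c for c in query if c != clause]
    return closure(rest) == closure(query)


RA, TB, SC, UD = ("R", "A"), ("T", "B"), ("S", "C"), ("U", "D")
q = [(RA, TB), (TB, SC), (RA, SC), (TB, UD)]
q_prime = [(RA, TB), (TB, SC), (TB, UD)]

q_plus = closure(q)
print(len(q_plus))                   # prints 6
print(closure(q_prime) == q_plus)    # prints True: q' is equivalent to q
print(is_redundant((RA, SC), q))     # prints True
```

All four attributes fall into one equivalence class, so the closure contains every pair of them, which matches item (5) of Figure 4.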
(1) Consider the query q and its closure (expressed in terms of their query graphs):
C1 = {R.A, S.B}
C2 = {S.B, T.D}
C3 = {S.C, T.E}
C4 = {T.E, U.F}
C5 = {R.A, T.D}
C6 = {S.C, U.F}
[Query graphs QGq and QGq+ over the nodes R, S, T, U.]
(2) Assume that all relations and intermediate results have the same size, K. Moreover, part(R) = A, part(S) = C, part(T) = D, and part(U) = F.
(3) If we only consider QGq, we get the best strings (e.g., {R.A, S.B}{S.B, T.D}{T.E, U.F}) with cost 10K.
(4) If we consider the redundant clauses in QGq+, we can get the string {R.A, T.D}{T.E, S.C}{T.E, U.F} with cost 8K.
Fig. 5. Example showing the need to compute the query closure.

The importance of these queries and their processing in distributed database systems have been discussed extensively in the literature [5-7, 16, 37].
In order to minimize costs when processing a query, one might conjecture that it is enough to look at the spanning trees of its query graph. But this is wrong. Figure 5 illustrates such a case.⁵ This example highlights an important idea: redundant clauses may lead to cheaper strings (and ss trees). It also suggests that when processing a query, we should begin with the closure of the query.
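The costs quoted for Figure 5 can be reproduced with a short Python sketch of the cost model of Definition 2 and the string and ss tree costs of Section 3.3. Several things here are assumptions made for illustration: the function names, the representation of a clause as a pair of (relation, attribute) tuples, the `join_size` callback that supplies the estimated size of an intermediate result, and the bookkeeping rule that repartitioning an intermediate result on one attribute empties part() for every other relation it contains (as in Figure 2d). The constants α = 1 and β = 2 are the values the paper uses in its examples.

```python
from itertools import permutations

ALPHA, BETA = 1, 2  # constants of the cost model, as set in the paper's examples


def string_cost(string, part, size, join_size):
    """Cost of executing the clauses of a string in the given order.

    part: relation -> set of attributes it is initially partitioned on
    size: relation -> size in bytes (|R| x |t_R|)
    join_size: hypothetical estimator of the size of a join result
    """
    part = {rel: set(attrs) for rel, attrs in part.items()}
    comp = {rel: frozenset([rel]) for rel in part}   # intermediate result holding rel
    comp_size = {comp[rel]: size[rel] for rel in part}
    total = 0
    for (r, a), (s, b) in string:
        left, right = comp[r], comp[s]
        move = 0
        for rel, attr, operand in ((r, a, left), (s, b, right)):
            if attr not in part[rel]:
                # Repartition the whole operand on attr.
                move += comp_size[operand]
                for member in operand:
                    part[member] = set()
                part[rel] = {attr}
        total += ALPHA * (comp_size[left] + comp_size[right]) + BETA * move
        merged = left | right
        comp_size[merged] = join_size(comp_size[left], comp_size[right])
        for member in merged:
            comp[member] = merged
    return total


def ss_tree_cost(tree, part, size, join_size):
    """Cost of an ss tree: the minimum cost over all n! strings of its clauses."""
    return min(string_cost(p, part, size, join_size) for p in permutations(tree))


# Figure 5 with K = 1: every relation and intermediate result has size K.
K = 1
part = {"R": {"A"}, "S": {"C"}, "T": {"D"}, "U": {"F"}}
size = {rel: K for rel in part}
same_size = lambda left, right: K

chain_string = [(("R", "A"), ("S", "B")), (("S", "B"), ("T", "D")), (("T", "E"), ("U", "F"))]
closure_string = [(("R", "A"), ("T", "D")), (("T", "E"), ("S", "C")), (("T", "E"), ("U", "F"))]

print(string_cost(chain_string, part, size, same_size))    # prints 10
print(string_cost(closure_string, part, size, same_size))  # prints 8
```

The second string uses the redundant clause {R.A, T.D}, which needs no repartitioning at all, so only one intermediate result has to be moved instead of two.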
⁵ For concreteness, in this and subsequent examples, we set the values of constants α and β in the cost model (cf. Definition 2) to 1 and 2, respectively (two constants that are consistent with the Teradata benchmark result).
Computing closures may sometimes introduce self-loops (i.e., edges from nodes to themselves) in the query graph. For example, suppose a given query q has clauses {R.A, S.B}, {S.B, T.C}, and {T.C, R.D}. Then QGq+ will contain {R.A, R.D}, an edge from R to R. In general, one can eliminate a loop {R.X, R.Y} by replacing either R.X by R.Y or R.Y by R.X in all clauses of the query. The choice is arbitrary unless one of them, say R.Y, is the initial partition attribute of R, in which case we replace R.X by R.Y. {R.X, R.Y} will then be removed from the query graph and used as a selection condition. Eliminating self-loops is useful, since by renaming identical attributes we may reduce costs. Henceforth, we restrict our attention to loop-free query graphs.

Based on the above formulation, the problem of minimizing response time (MRP) becomes

Given a query q,
Find (1) QGq+;
(2) a minimum cost spanning tree 𝒯 of QGq+ and the clause string cs that yields 𝒯's cost.

In the following sections, the problem is addressed for chain, star, tree, and general graph queries. It is important to note that when processing these queries, once the spanning tree 𝒯 and associated string cs are found, we can process them by executing 𝒯 based on the clauses and join order specified in cs.

4. CLOSED CHAIN QUERIES
As shown in Figure 5, the closure of a chain query graph does not necessarily remain a chain. If it is not a chain, it will contain a cycle.⁶ In this section, we focus on queries q whose QGq+ is a chain. (Such a class of queries are called closed chain queries.) More general queries are treated in the next section.

A dynamic programming technique is used to solve the MRP problem for closed chain queries. The dynamic programming algorithm computes the best order of joins. The algorithm, in its style, is similar to that proposed independently by Sun et al. [32] for optimizing chain queries in object-oriented database systems.

The following notation is needed to describe our algorithm.
For
1
1,
examine
all
CHAIN
runs
in 0(n3
in joining
we
have
or 0(n3) chain
when and
1 is
two cases. When k = 1, we are joining two results. Thus in order to find the lowest
examine
R,p and them
in such
in a closed
on an edge.
component
between
because
result,
minimum,
+ n21) time,
of nodes
all
clauses
attributes
of the two relations.
clauses
p + 1 = j),
intermediate
to
their
known
of the algorithm.
n is the number
of clauses
need
checking
the
El
Consider the following rather than intermediate
PROOF.
relations
When
Cost(&(
j).
The following
but
CHAIN
+ l,j).
Now,
cost
R, and
algorithm
Es(i, j)
by concatenating
remaining
constraint that the order By induction hypothesis,
by
This
Rp+l~ only
for
situations
some
the
the
requires
when
and we need to check
between are
same
relations,
as the
initial
0( nl) time. p,
i = p but
we are joining whether
two
i s p
[Figure 13: plot over the range of selectivity factors 0.00001, 0.00005, 0.0001, 0.001, 0.01, 0.1, 0.5.]
Fig. 13. Effect of selectivity factors (relation's cardinality was drawn from [10,000, 20,000], tuple's width from [1, 10], graph size = 12, chain size = 4, number of clauses on an edge = 2, chain numbers = 2).

To avoid the mutual influence of parameters, the analysis was carried out by fixing the parameters related to query graphs, which were set as follows: |QG| = 12, Num = 2, |en| = 4, |clauses(e, QG)| = 2 for any edge e in QG. The chains have been included in order to distinguish the simple heuristics from their hybrid counterparts.

Figure 13 illustrates the behavior of the algorithms for varying the selectivity factors.¹³ Examining the figure, we see that all the heuristics have close behavior for low selectivity factors (α_{R,S} ≤ 0.00005). Notice also that the performance of algorithm PH deteriorates significantly as the selectivity factors increase. When the selectivity factors are high, the more nodes are joined, the larger the pivot becomes. Consequently, joining the pivot with even more nodes will incur an intolerable cost.

Figure 14 shows the effect of varying the sizes of relations on the relative performance of the algorithms. The figure shows clearly that algorithm PH becomes unattractive when the sizes of relations increase.

7.4 Effect of Query Parameters
The main distinction between the hybrid heuristics and simple ones lies in the way they handle chains. The objective of this subsection is to explore the effect of varying chains in a query graph on the behavior of the heuristics. The following parameters' values were assumed throughout the experiments: |t_R| was drawn from [1, 10], |R| was drawn from the range [100, 1,000,000], α_{R,S} was fixed at 0.1, and |clauses(e, QG)| was fixed at 2.
¹³ In subsequent experiments, ten query graphs were tested for each algorithm and the average value of the solutions' costs produced by each algorithm was plotted.
[Figure 14: cost (in billions of bytes) of algorithms KH, HKH, PH, and HPH against the range of relation sizes; x-axis labels [10, 100][1, 10], [100, 1,000][10, 20], [1,000, 10,000][20, 30], [10,000, 100,000][30, 40].]
Fig. 14. Impact of relation size, where relation size = relation's cardinality × tuple's width (the first item of each label on the x axis gives the range of relation's cardinality and the second item gives the range of tuple's width; selectivity factor = 0.1, graph size = 12, chain size = 4, number of clauses on an edge = 2, chain numbers = 2).

[Figure 15: cost of the heuristics against chain size.]
Fig. 15. Cost of heuristics as a function of chain size (graph size = 20, number of chains = 1, selectivity factor = 0.1, number of clauses on an edge = 2, relation's cardinality was drawn from [100, 1,000,000], tuple's width from [1, 10]).
Figure 15 shows the effect of changing the chain size in a graph, where the size of the graph was fixed at 20 and Num (the number of chains) at 1. It is apparent that as the chain becomes a major portion of a graph, the hybrid heuristics become better than the simple ones. One expects this because the hybrid heuristics guarantee that an optimal solution can be achieved inside a chain. The poor performance of algorithm PH when the chain size becomes large is due to the way it picks clauses, which is limited by its inherently local view. This situation tends to be worse if it starts searching on a chain: very few choices can be made in determining which of the clauses should be picked at each step.
Fig. 16. Cost of heuristics as a function of graph size (chain size = 1/2 x graph size, number of chains = 1, selectivity factor = 0.1, number of clauses on an edge = 2, relation’s cardinality was drawn from [100, 1,000,000], tuple’s width from [1, 10]).
On the other hand, the size of a graph has very little impact on the performance of the heuristics. Figure 16 illustrates the effect of varying the size of a graph while fixing the portion of the chain in it (this portion was fixed at 1/2). It is clear that increasing the graph size alone does not affect the relative performance of the heuristics; the gaps between these algorithms remain and gradually become larger. Figure 17 shows the effect of varying the number of chains (where the graph size |QG| was fixed at 20 and the chain size at 4). Although the number of chains in a graph also influences the performance of the hybrid heuristics, it has a less significant effect than the size of a chain. The big gap between algorithms HPH and KH (HKH) may be due to the fact that the chosen size of chains is too small, as compared with that of the whole graph. Notice that there is a slight increase of the gap between algorithms HKH and KH when the number of chains is 4. This indicates that the existence of many short chains in a graph can have a negative effect on the performance of algorithm HKH. A possible explanation for this is that these chains decrease the number of nonchain edges in the graph
and consequently restrict the total number of choices that can be made when globally picking a clause.
In conclusion, when selectivity factors are low, the proposed heuristics have close performance. In the presence of high selectivity factors, algorithm PH becomes less competitive while algorithms KH and HKH remain good. If a graph contains long chains (with size being over half of the number of nodes in the graph), we suggest using the hybrid Kruskal heuristic (HKH).
Fig. 17. Cost of heuristics as a function of number of chains (graph size = 20, chain size = 4, selectivity factor = 0.1, number of clauses on an edge = 2, relation’s cardinality was drawn from [100, 1,000,000], tuple’s width from [1, 10]).
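The recommendation that closes this section can be written as a small decision rule. This is only an illustrative sketch: the function name and arguments are hypothetical, and falling back to KH for graphs without long chains is an assumption read from the remark that KH remains good and that the Kruskal-like algorithm performs well when chains are small.

```python
def suggest_heuristic(num_nodes, longest_chain_size):
    """Suggest a heuristic from the size of the longest chain in a query graph.

    A chain counts as "long" when its size exceeds half of the number of
    nodes in the graph, following the rule of thumb stated above.
    """
    if num_nodes <= 0:
        raise ValueError("a query graph needs at least one node")
    if longest_chain_size > num_nodes / 2:
        return "HKH"   # hybrid Kruskal heuristic, suggested for long chains
    return "KH"        # assumed default when chains are short

# Example: a 20-node graph with a 12-node chain falls in the HKH regime.
choice = suggest_heuristic(20, 12)
```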
8. SUMMARY
In this paper, we investigated ways of minimizing response time for various important queries in loosely coupled multiprocessor systems using a hash-partitioned data distribution scheme. First, we developed a dynamic programming algorithm for closed chain queries. Next, we proved the NP-completeness for more general queries and proposed four heuristics for them. Our simulation results show that an algorithm similar to Kruskal’s spanning tree algorithm performs well when chains in a query graph are small. When chains are long, then a hybrid algorithm using the chain algorithm with Kruskal’s is best. Like other relevant work [15, 25], the heuristics presented in the paper relied on the knowledge of selectivity factors, which were used to find out the sizes of intermediate results. This information, however, is not known a priori in actual environments.14 Fortunately, our qualitative conclusions
were independent of the selectivity factors.
14 A practical approach would be to estimate the selectivity factors like those used in most real systems (see, e.g., [26]). Readers may also refer to Gardy and Puech [13] and Lipton and Naughton [21] for theoretical analyses of the sizes of intermediate relations.
Our results have all assumed that relations were partitioned on a single attribute. Generalizing beyond this is not difficult. For example, consider a join between R and S based on the clauses {{R.A, S.C}, {R.B, S.D}}, where R is partitioned on A and S on C. One can process the join based on {R.A, S.C}, treating {R.B, S.D} as a selection condition. On the other hand, if the join were based on {R.A, S.C} and R were partitioned on AB and S on CD, both relations would have to be repartitioned to A and C, respectively. Therefore, the set of repartitionings required when R is partitioned on A alone is always a subset of those required when R is partitioned on AB. To handle multiattribute partitions, we can treat the partition attributes (e.g., AB) as if they were a single attribute. Thus, to perform the join having clauses {R.A, S.C}, {R.B, S.D}, and {R.E, S.F}, given AB and CD as partition attributes, we process {{R.A, S.C}, {R.B, S.D}} and use {R.E, S.F} as a selection condition.
We have not discussed the possibility of using multiple copies of relations, perhaps partitioned on different attributes. Often, keeping multiple copies at each node reduces the cost of repartitionings considerably. As an example, consider again the suppliers-and-parts database and SQL query presented in Section 1. Suppose one copy of SUP is partitioned on s#, one copy of PART on p#, one copy of SUP-PART on s#, and another copy of SUP-PART on p#. When repartitioning T2 to SUP-PART.p#, instead of sending its complete tuples, the second strategy could send the surrogates [8] of tuples in SUP-PART, along with the complete tuples in SUP. These surrogates would then be materialized at each node by consulting the local fragment of SUP-PART that is partitioned on p#. Suppose sending surrogates takes 10 seconds. Then the cost of this scheme would be 100 + 10 + 100 = 210 seconds, which achieves a significant improvement over the original scheme. Investigating how queries can be optimized in such a highly parallel environment with replicated data seems to be a very interesting problem for future research.
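The treatment of multiattribute partitions described earlier in this section (processing the clauses that match the partition attributes and using the rest as a selection condition) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function name is hypothetical, clauses are modeled as (R attribute, S attribute) pairs, and the sketch assumes that the two partition-attribute tuples are aligned position by position and hashed with the same function.

```python
def plan_local_join(clauses, r_partition, s_partition):
    """Split join clauses into (join_clauses, selection_clauses).

    clauses      -- list of (r_attr, s_attr) pairs, e.g. [("A", "C"), ("B", "D")]
    r_partition  -- tuple of attributes R is partitioned on, e.g. ("A", "B")
    s_partition  -- tuple of attributes S is partitioned on, e.g. ("C", "D")

    Returns None when the join cannot be processed on the current
    partitioning, i.e. when some relation would have to be repartitioned.
    """
    if len(r_partition) != len(s_partition):
        return None
    # Treat the partition attributes as if they were a single attribute:
    # every aligned pair must appear among the join clauses.
    needed = list(zip(r_partition, s_partition))
    available = set(clauses)
    if not all(pair in available for pair in needed):
        return None
    selection = [c for c in clauses if c not in needed]
    return needed, selection

# The three cases discussed in the text:
case1 = plan_local_join([("A", "C"), ("B", "D")], ("A",), ("C",))
case2 = plan_local_join([("A", "C")], ("A", "B"), ("C", "D"))
case3 = plan_local_join([("A", "C"), ("B", "D"), ("E", "F")], ("A", "B"), ("C", "D"))
```

In the first case {R.B, S.D} becomes a selection condition, in the second both relations need repartitioning (the function returns None), and in the third {R.E, S.F} is the selection condition.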
ACKNOWLEDGMENTS
The authors are deeply indebted to the anonymous referees and the editor, Won Kim, for their encouragement and thoughtful recommendations. They also wish to thank Richard Cole, Zvi Kedem, Bud Mishra, and Paul Spirakis for helpful discussions in the formative stages of this work.
REFERENCES
1. APERS, P. M. G., HEVNER, A. R., AND YAO, S. B. Optimization algorithms for distributed queries. IEEE Trans. Softw. Eng. SE-9, 1 (Jan. 1983), 57-68.
2. BABB, E. Implementing a relational database by means of specialized hardware. ACM Trans. Database Syst. 4, 1 (Mar. 1979), 1-29.
3. BERNSTEIN, P. A., GOODMAN, N., WONG, E., REEVE, C. L., AND ROTHNIE, J. B., JR. Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst. 6, 4 (Dec. 1981), 602-625.
4. BITTON, D., BORAL, H., DEWITT, D. J., AND WILKINSON, W. K. Parallel algorithms for the execution of relational database operations. ACM Trans. Database Syst. 8, 3 (Sept. 1983), 324-353.
5. CHEN, A. L. P., AND LI, V. O. K. An optimal algorithm for processing distributed star queries. IEEE Trans. Softw. Eng. SE-11, 10 (Oct. 1985), 1097-1107.
6. CHIU, D. M., BERNSTEIN, P. A., AND HO, Y. C. Optimizing chain queries in a distributed database system. SIAM J. Comput. 13, 1 (Feb. 1984), 116-134.
7. CHIU, D. M., AND HO, Y. C. A methodology for interpreting tree queries into optimal semi-join expressions. In Proceedings of the ACM-SIGMOD International Conference on the Management of Data (May 1980). ACM, New York, 1980, pp. 169-178.
8. CODD, E. F. Extending the database relational model to capture more meaning. ACM Trans. Database Syst. 4, 4 (Dec. 1979), 397-434.
9. COPELAND, G., ALEXANDER, W., BOUGHTER, E., AND KELLER, T. Data placement in Bubba. In Proceedings of the ACM-SIGMOD International Conference on the Management of Data (1988). ACM, New York, 1988, pp. 99-108.
10. DEWITT, D. J., AND GERBER, R. Multiprocessor hash-based join algorithms. In Proceedings of the 11th International Conference on Very Large Data Bases (Stockholm, Aug. 1985), pp. 151-164.
11. DEWITT, D. J., GERBER, R. H., GRAEFE, G., HEYTENS, M. L., KUMAR, K. B., AND MURALIKRISHNA, M. GAMMA - A high performance dataflow database machine. In Proceedings of the 12th International Conference on Very Large Data Bases (Kyoto, Japan, Aug. 1986), pp. 228-237.
12. DEWITT, D. J., SMITH, M., AND BORAL, H. A single user performance evaluation of the Teradata Database Machine. MCC Tech. Rep. DB-081-87, 1987.
13. GARDY, D., AND PUECH, C. On the effect of join operations on relation sizes. ACM Trans. Database Syst. 14, 4 (Dec. 1989), 574-603.
14. GAREY, M. R., AND JOHNSON, D. S. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco, Calif., 1979.
15. GAVISH, B., AND SEGEV, A. Set query optimization in distributed database systems. ACM Trans. Database Syst. 11, 3 (Sept. 1986), 265-293.
16. GOODMAN, N., AND SHMUELI, O. Tree queries: A simple class of relational queries. ACM Trans. Database Syst. 7, 4 (Dec. 1982), 653-677.
17. KIM, W. A new way to compute the product and join of relations. In Proceedings of the ACM-SIGMOD International Conference on the Management of Data (Santa Monica, Calif., May 1980). ACM, New York, 1980, pp. 179-187.
18. KITSUREGAWA, M., ET AL. Relational algebra machine GRACE. In RIMS Symposium on Software Science and Engineering (1982). Lecture Notes in Computer Science. Springer-Verlag, New York, 1983, pp. 191-212.
19. KITSUREGAWA, M., TANAKA, H., AND MOTO-OKA, T. Application of hash to data base machine and its architecture. New Generation Comput. 1 (1983), 62-74.
20. KRUSKAL, J. B., JR. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Amer. Math. Soc. 7, 1 (1956), 48-50.
21. LIPTON, R. J., AND NAUGHTON, J. F. Query size estimation by adaptive sampling. In Proceedings of the 9th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Apr. 1990). ACM, New York, 1990, pp. 40-46.
22. PRAMANIK, S., AND VINEYARD, D. Optimizing join queries in distributed databases. IEEE Trans. Softw. Eng. 14, 9 (Sept. 1988), 1319-1326.
23. PRIM, R. C. Shortest connection networks and some generalizations. Bell Syst. Tech. J. (1957), 1389-1401.
24. SACCO, G. M. Fragmentation: A technique for efficient query processing. ACM Trans. Database Syst. 11, 2 (Jun. 1986), 113-133.
25. SEGEV, A. Optimization of join operations in horizontally partitioned database systems. ACM Trans. Database Syst. 11, 1 (Mar. 1986), 48-80.
26. SELINGER, P. G., ASTRAHAN, M. M., CHAMBERLIN, D. D., LORIE, R. A., AND PRICE, T. G. Access path selection in a relational database system. In Proceedings of the ACM-SIGMOD International Conference on the Management of Data (Boston, Mass., May 1979). ACM, New York, 1979, pp. 23-34.
27. SHAPIRO, L. Join processing in database systems with large main memories. ACM Trans. Database Syst. 11, 3 (Sept. 1986), 239-264.
28. SHASHA, D. Query processing in a symmetric parallel environment. In Proceedings of the International Advanced Database Symposium (Tokyo, Japan, Aug. 1986), pp. 183-192.
29. SHASHA, D., AND SPIRAKIS, P. Fast parallel algorithms for processing of joins. In Proceedings of the International Conference on Supercomputing (Athens, Greece, June 1987).
30. STAMOS, J. W., AND YOUNG, H. C. A symmetric fragment and replicate algorithm for distributed joins. Res. Rep. RJ7188, IBM Corp., San Jose, Dec. 1989.
31. STONEBRAKER, M., AND NEUHOLD, E. A distributed database version of INGRES. In Proceedings of the 3rd Berkeley Workshop on Distributed Data Management and Computer Networks (May 1977).
32. SUN, W., MENG, W., AND YU, C. T. Query optimization in object-oriented database systems. Unpublished manuscript, Department of Electrical Engineering and Computer Science, University of Illinois at Chicago, 1990.
33. TERADATA CORPORATION. Database computer system concepts and facilities. Document C02-0001-01, Teradata Corporation, Los Angeles, Oct. 1984.
34. VALDURIEZ, P., AND GARDARIN, G. Join and semijoin algorithms for a multiprocessor database machine. ACM Trans. Database Syst. 9, 1 (Mar. 1984), 133-161.
35. WILLIAMS, R., ET AL. R*: An overview of the architecture. Res. Rep. RJ3325, IBM Corp., San Jose, Dec. 1981.
36. WONG, E. Dynamic rematerialization: Processing distributed queries using redundant data. IEEE Trans. Softw. Eng. SE-9, 3 (May 1983), 228-232.
37. YU, C. T., OZSOYOGLU, M. Z., AND LAM, K. Distributed query optimization for tree queries. J. Comput. Syst. Sci. 29 (1984), 409-445.
38. YU, C. T., GUH, K. C., ZHANG, W., TEMPLETON, M., BRILL, D., AND CHEN, A. L. P. Algorithms to process distributed queries in fast local networks. IEEE Trans. Comput. C-36, 10 (Oct. 1987), 1153-1164.
Received January 1988; revised May 1989 and April 1990; accepted April 1990