Optimizing Equijoin Queries In Distributed Databases Where ...

4 downloads 0 Views 2MB Size Report
DENNIS SHASHA and TSONG-LI WANG. New York University. Consider the class of distributed database systems consisting of a set of nodes connected by a.
Optimizing Equijoin Queries In Distributed Databases Where Relations Are t-lash Partitioned DENNIS

SHASHA

and TSONG-LI

WANG

New York University

Consider the class of distributed database systems consisting of a set of nodes connected by a high bandwidth network. Each node consists of a processor, a random access memory, and a slower but much larger memory such as a disk. There is no shared memory among the nodes. The data are horizontally partitioned often using a hash function. Such a description characterizes many parallel or distributed database systems that have recently been proposed, both commercial and academic. We study the optimization problem that arises when the query processor must repartition the relations and intermediate results participating in a multijoin query. Using estimates of the sizes of intermediate relations, we show (1) optimum solutions for closed chain queries; (2) the NP-completeness of the optimization problem for star, tree, and general graph queries; and (3) effective heuristics for these hard cases. Our general approach and many of our results extend to other attribute partitioning schemes, for example, sort-partitioning on attributes, and to partitioned object databases. Categories and Subject Descriptors: F.2 [Theory of Computation]: Analysis Problem Complexity; H.2.4 [Database Management]: Systems–distributed processing; H.2.6 [Database Management]: Database Machines General

Terms: Algorithms,

Performance,

Additional Key Words and Phrases: models, spanning trees, systems

of Algorithms systems;

and query

Theory

Equijoin,

hashing,

NP-complete

database

systems

problems,

relational

data

1. INTRODUCTION We consider tion

of nodes

the

class

(sites),

of distributed each having

The nodes communicate latency)

network.

Each

with

a processor,

one another

relation

local

consisting

memory,

across a high

is horizontally

and

bandwidth

partitioned

of a colleclocal

disk.

(and high

to different

nodes

This work was partially supported by the National Science Foundation under grantsDCR8501611 and IRI-8901699, and by the Office of Naval Research under grant NOO014-85-K-O046. Authors’ addresses: D. Shasha, Courant Institute of Mathematical Sciences, New York University, New York, NY 10012; J. T. L. Wang, Department of Computer and Information Science, New Jersey Institute of Technology, University Heights, Newark, New Jersey, 07102 Permission to copy without fee all or part of this material not made or distributed for direct commercial advantage,

is granted provided that the copies are the ACM copyright notice and the title

of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. @ 1991 ACM 0362-5915/91/0600-0279 $01.50 ACM

Transactions

on Database

Systems,

Vol. 16, No

2, June 1991, Pages 279-308.

280

.

D Shasha

and T

L Wang

RI

Emp #

Salary

Node 1:

R,

24,000

Emp #

Salary

1

38,000 \

Fig. 1.

by

hashing:

use

a hash

S2

Age

John Mary

35 25

Name

Age

4

-i-

Emp #

+

h whose

50

Andy I Ann

13

of two partitioned

function

Name

2

1

42,000

Example

Emp #

I

55,000

2 4

Node 2:

s,

I

40

relations.

domain

is the

domain

of some

attribute A of relation R and whose range is the number of processing nodes. node h( t. A). 1 When performing For each tuple t of R, we put t in processing joins,

we use the same function

joined

for any pair

of relations

that

could

possibly

be

together.

1. Suppose we have a system composed of two nodes, node 1 and Example node 2, and a hash function h(x) = ( x mod 2) + 1. Figure 1 shows two relations that are partitioned on the join attribute emp #: R(S) is partitioned RI and into respectively. Such parallel including well

Rz

(Sl

a loosely

coupled

or distributed commercially

as research

Sabre [34]. In this paper

and

Sz ), which

hash-partitioned

database available

prototypes we are

are

stored

at

node

architecture

as Bubba

concerned

with

[91, Gamma

optimizing

node

characterizes

systems that have recently ones such as the Teradata

such

1 and

many

been developed, machine [33], as

[111, Grace

equijoin

2,

queries

[18],

and

in such

systems. We restrict ourselves to queries that retrieve all fields of joined relations. 2 For our purposes, we assume that relations are not replicated, that is, each node holds a unique fragment for each relation. We also assume that there is no answer site in the systems. Thus we mainly concentrate on join processing, For example, on the condition emp #,

ignoring consider R. emp #

no communication

on emp #,

whereas

unioning results at the answer R and doing the join between = S.emp

#.

is required.

S on age, each

Since both

relations

On the other S tuple

site. S in Figure

hand,

1, based

are partitioned

on

if R is partitioned

t needs to be sent to processing

‘Though one may partition on a set of attributes, for the purpose of this paper, we assume partitioning on a single attribute only. Some partitioning schemes (see, eg., [181) also allow the range to be k times the number of processing nodes, with each node having k buffers that can receive data. In that case, h(t. A) designates a buffer as well as a node. The results in this paper hold for that generalization, 2This restriction is made mainly for simplifying cost computation. It M not difficult, however, to extend our techniques to handle queries whose outputs contain only certain fields of joined relations ACM Transactions

on Database Systems, Vol. 16, No 2, June 1991

Optimizing Equijoin Queries Table I.

Effect of Repartitioning With

10,000 with 1,000 100,000 with 10,000 1,000,000 with 100,000

Source: DeWitt

281

on Join Performance

Repartitioning (seconds)

Join Size

.

Without (seconds)

35.4 2864 3,144,0

180 122.0 1,143.0

et al. [121

node

h( t.emp # ). Such

head

caused

by

data

transmission

repartitioning

is not

is called negligible,

repartitioning. If

The

communication

nodes is slow, then most time will be spent in routing data in a network repartitioning occur. Even if communication is fast, repartitioning reduce throughput because it entails more disk accesses. To understand ment by DeWitt

overamong when may

the importance of such overhead, let us review an experiet al. on the Teradata machine [12]. The machine consisted

of 20 nodes and 40 disk storage units, which were interconnected with a 12-Mbyte/second-bandwidth network. The data consisted of 200 byte tuples. DeWitt et al. performed joins between two relations having different sizes (10,000

tuples

with

1,000 tuples,

100,000

tuples

with

10,000

tuples,

1,000,000

tuples with 100,000 tuples, respectively) and recorded the time for executing these joins with and without repartitionings. Table I shows the result of their experiment. By comparing the values listed in the second and third columns of the table, we see that there size joins. This indicates that multiprocessor

system,

exist ratios of between 2 or 3 to 1 for the same when processing queries in a loosely coupled

the repartitioning

cost can dominate

the local

process-

ing cost. The authors also measured the influence of the bandwidth of the network. They observed that repartitioning a relation of 200 Mbyte took 2,000 seconds and used less than 1 percent of network capacity on the average.

This

result

reveals

that

repartitioning

even when a network is fast. But can we control the repartitionings thereby

minimizing

Example relations

2.

response

Suppose

time

necessary

a suppliers-and-parts

process

a query,

database and

containing

SUP-PART(S

#,

SUP.*, PART.*, SUP-PART. * SUP, PART, SUP-PART SUP.s# = SUP-PART.s# SUP-PART. P# = PART. P#

in Table

SUP-PART)

is expensive

PART on p #, and SUP-PART

One way to process this query is to join SUP-PART with a result Tl, and then join TI with SUP. Assume that takes 100 seconds and repartitioning a relation takes cated

to

PART(P #, pname, city),

P #, city). that SUP is partitioned ons #, . Suppose -. the following SQL query: on s #. Consider SELECT FROM WHERE AND

relations

in such an environment?

we have

SUP(S #, sname, city),

large

I. The cost of this

+ (the

cost

of

strategy

joining ACM

would

SUP-PART

TransactIons

PART first, obtaining joining two relations 100 seconds, as indi-

be (the cost of repartitioning with

on Database

PART)

Systems,

Vol.

+ (the

cost

of

16, No. 2, June 1991

282

D. Shasha and T. L. Wang

.

repartitioning

7’1) + (the

cost of joining

7’1 with

SUP) = 100 + 100 + 100 +

100 = 400 seconds. On the other hand, joining SUP with SUP-PART first no repartitioning. Then joining the intermediate result Tz with requires PART requires repartitioning Tz. The cost of this strategy is thus 100 + 100 + 100 = 300 seconds. Clearly,

the

example relations,

second

is better

strategy

for processing

the

given

query.

This

demonstrates that by properly arranging the order of joins among can be achieved. Exactly how to do this to better performance

minimize

the

paper.

cost for various

In Section

important

2, we present

queries

overview

an

is the

main

of previous

concern

work

of this

in the

area.

Section 3 discusses basic assumptions and formally defines the optimization problem. Section 4 then gives a dynamic programming algorithm to solve the problem in the context of closed chain queries. Section 5 establishes the NP-completeness Section

of the

6 presents

problem

the results of computational some directions for future 2. RELATED

Neuhold

amount

using [311,

(DDBMSS).

in the In

A common

of work

these

objective

has

been

schemes.

Williams

to increase time.

minimize

and

tree,

general

procedures.

experiments. Finally research in Section 8.

partitioning and

(fragmentation) strategy response

star,

heuristic

graph

In Section

we conclude

queries.

7, we report

the paper

with

WORK

A considerable processing

for

and analyzes

et

al.

context

of distributed fragmentation

adopted

the communication

in

thus

database

of query

by researchers

as part

in

area

the

systems

of processing

throughput

processing

and

partitioning

management

used

is

in

area

horizontal

increasing

cost incurred

the

et al. [31, Stonebraker

[351 discussed

systems, parallelism,

performed

Bernstein

and

saving

of DDBMS

a query.

is to

Segev [251,

can signififor example, introduced the notion of remote semijoin, which cantly reduce the communication cost while incurring higher processing cost. Fragments may also be replicated at different sites. Yu et al. [38] distinguished queries as locally processable and nonlocally processable. If a query data transfer. is locally processable, then the query can be processed without

They cally

proposed a linear time “fragment processable queries. Relevant work

al. [1], Wong

[36], Pramanik

and replicate” algorithm has also been discussed

and Vineyard

[22], Stamos

for nonloby Apers et

and Young

[30], and

us from these workers is that we seek the optimal others. What distinguishes queries, starting with partitioned data and ending join order when processing is on the environment with partitioned data. Moreover, our main concern without data replication. We defer to Section 8 the possibility of replicating data. in multiprocessor systems The idea of using hash partitioning for joins comes

from

Grace

database

the

sort

Valduriez

[2].

engine and

ACM Transactions

This machine rather Gardarin

scheme project than [34]

is

then

adopted

[181, though on

the

described

on Database Systems, Vol

by their

Kitsuregawa emphasis

performance also

various

16, No. 2, June 1991

of join

et

al.

is on the the

join

and

semijoin

on

the

speed

of

algorithm, strate

-

Optimizing Equijoin Queries is applied only gies, but hashing Gerber [10] proposed two algorithms

partition join

and join

operations

phases.

and

They

during that

applied

concluded

that

the partition exploit hashing

the

both

algorithms

.

phase. 13eWitt and thoroughly in both

to a number

algorithms

283

have

of simple

satisfactory

perfor-

mance. In

contrast

multijoin and

to the

queries

partition

strategy joins

Thus,

but

rather

used,

minimizing

developed

a model and

more

We

simpler

then

tree

we

finding

optimal

tree

cost

view

for

with the

we

into

the

of

address

account.

whose

We

closure

is

problem

general

is

for

the

problem.

related

performance

we

size

NP-hardness

a heuristic

best

[281,

the

to a simpler

of the

heuristics

algorithm

size

many

In

Here,

queries

solutions

In

various

taking

optimizing

queries.

[201 achieves

and

for

involving time.

a join.

joins

hash-join

considering

in

strategies

chain

algorithm

topologies

without

examines

execute

particular

of a query

participating for

the

to

processing

queries

algorithm

consider our

FORMULATION

3.1

and

cost and

paper

methods

with

the

such

this

hash

concerned

1/0,

explore that

star

use

reducing

results

and

combining

of query

not

processing

show

problem,

spanning

are

with

time

for

that

3.

for

cases

NP-complete find

we

partitioning,

that

communication,

a polynomial

a chain.

on hash

systems

intermediate

general

present

work in

data.

by

relations

previous

arising

We

to Kruskal’s

over

a wide

range

assumptions.

OF THE PROBLEM

Basic Assumptions

Our

objective

is to minimize

approximate

this

costs.

Processing

node.

Repartitioning

determined (2)

system bandwidth To

cost

by

the

goal

(1) the

the

to

the

refers

total

parameters

simplify

considering

refers cost

of

of the

response

by

model

when

both

processing

time

to the

amount

spent

time

developed

in

reparti in

System speed

this

paper,

joins

at

the

we

1/0

each

which

network,

parameters

of the

We

repartitioning

tionings,

transmitted

as the

a query.

and

performing

for

architecture.

as well

processing

for

spent

of data

a given

network,

time

include

and

make

is and

processors.

the

following

assumptions.

Assumption

1.

the same.

are

Assumption execute

2.

joins

complete

that

For to its

and

either it

processing

when acts

as

speeds

at different

same hashing

with

distributes

parameters

nodes

technique

at

fashion is

function

the

of a relation, tuples

related

Assumption

data

attribute

a random

(or partition)

evenly

pipelined the

1/O

is a constant.

node uses the

each repartition attribute

transfer

in

and

to

and Gerber’s).

together

joins

reasonable is,

speeds

of the network

1 and 2 deal with

assumptions,

transferred is

Each

3.

applied

Assumptions two

processing

(e. g., DeWitt

Assumption function

The

The bandwidth

to a given

3, ensure same

or one

time.

batch

to each

that

(Tlhe

and

the

hash

from

the

set

of tuples

ACM Transactions

The

processors

data

could

be

Assumption

function

on Database Systems, Vol

hash

system.

the

at a tinne.)

a key

the node.

to

is the

3

“good,” set

of

n

16, No. 2, June 1991.

284

D. Shasha and T. L. Wang

.

nodes

such

This

has

that

each

been

probability

tuple

has probability

established any of the

that

in analytically n nodes has more

of tuples

is very

small,

provided

attribute

is not

a key

or

reader

is familiar

taken

I/n

that

the

hash

the

of being

[291

where

than

twice

number

function

sent we

show

the

the

number

n log n. If the

>4

our

node.

that

average

of tuples

is biased,

any

to

results

should

be

as heuristics.

3.2

Terminology

we

assume

join,

used

finite

the in

relational

with

database

set of attributes

{ Al,

the

standard

systems.

terms,

Define

AZ, . . . . A.}

attribute,

relation

a

[16]. Associated

tuple,

schema

with

R as a

each attribute

AZ, 1 s i s n, is a domain,

denoted dom( A ,). A relation instance (or simply R is a finite set of mappings { tl, t2, . , tm} from R to

R on schema

reldiorz)

the set of domains

such that

in R. be an attribute {tl. B,t2. B,..., tn. B}. An equijoin clause, or

B and

where sents

the

C are

join

write

clause

for

short,

is a conjunction

of such

join

as a set of clauses. 3 To represent

same

relation,

we

consider

instances

salary

of each

for

the

the

employee’s

In

order

to find

QV,

= {relations

QEq = {{R,

optimal

join

each

clauses query

referenced

S} f some

An example graph is never with

an

relation strategy,

of a query

edge.

we

is connected.

connected Observe

executing

product that tree

though

in

of these

to

be

a

of the

different

to

find

the

EM P(name,

sal, manager)

EM PI and

EMP2.

These

the

query

to consider

both

Figure may

a spanning

tree,

consider

query

graph;

individual

processing treating

on Database

define

q}

in

in

there

we

of the

of it,

TransactIons

we tuples

example,

it is helpful

R and

S}

2a. Notice

be many

a all

we

that

clauses

a query associated

queries

Systems,

Vol.

to

resulting

correspond

query

to

would

then

the be

queries.

query

other can

the the graph,

clauses

take

only

one

16, No

can

always

as selections. one

clause,

~Because all attributes of joined relations appear in the output 1, we exclude the target list in defining a query, ACM

whose

R, S}, q) to denote the set of equijoin clauses({ R and S in q. We assume that the query graph of a

Otherwise

components

a Cartesian spanning

write

both

join

repre-

queries

q, where

given

is

the

be

{R. B, S. C; clause

in

B

to

query.

of q references

graph

a Multigraph,

referencing

the

by the clauses

member

an

instances in

form

different

For

given

relations

QGq = ( QV~, QEQ) for a query

of

query.

manager,

two employee schema, we assume would be represented as distinct

graph

the

R. B

Therefore, links

arguments of

current

that

i s n. Let

This

interested

conditions.

a join

two

purpose

of the

S, respectively.

R. B = S. C. We are only

query relation

is a pair

of R and

attributes

condition

qualification

te R, t(Al) e dom(AZ), 1< t.B for t(B). We define

for each

We

2, June

1991

for

execute

Furthermore, each

edge,

of a query, as marked

a in

as the

in Section

Optimizing Equijoin Queries s.*,

SELECT FROM WHERE AND AND AND

T.+,

R.*

S.C= s.G=

cl!

s

S, T, R T.B T.H

T

Ci = {S. G, T.Hj

~4 R

S.F=RE T. B=R.

285

c1 = {S. C, T.I?}

C2

C3

.

v

C3 = {SF,

R.E}

Cb = {TB,

R.D}

D

(a) (1) Consider the query graph in Figure 2a just above. We can execute the graph by joining S of two relation names represents the with T first, and then S2’ with R. (The juxtaposition join result of the relations.) (2) While joining S and T based on, say {S. C, T. B}, the selection condition S. G = T. El is also processed. ST with R based on, say {S. F, R. E}, the clause {T. B, R. D) enforces (3) Then when joining another selection condition on tuples of ST and R (see Figure 2c). (b)

c1

T s

T

C3

R

(c)

(4) Consider

the query graph in Figure 2a. When joining S and T on the clause {S.C, 2’. B}, S ST and R, needs to be partitioned on C and T on B. Then in joining (i) if {S. F, R. E} is used as the join clause, ST needs to be repartitioned on F, such a repartitioning causes part(T) to be empty; of ST is needed, (ii) if {7’. B, R. D. } is used as the join clause, since no repartitioning part(T) = {1?}. Fig. 2.

clause,

join

motivates

applying

(i) (ii) (iii)

the

1. A

QGT,,.q

other

=

the

spanning

(QvT,,.q,

clauses

tree

QET,..q

~ QEq and (QvTrceq,

Treeq) If clauses(e, single-clause

e G

QETr,eq,

as selections

of a query )

QETreeq

= Qvq.

each

spanning

tree.

(see Figure

2b).

This

following.

QvT,..q For

with

QET,,eq)

clauses(e,

graph

QG~ = (QVq, QEq) is a

the following

properties:

is a tree.

~reeq)

G clauses(

e, q)

and

clauses(e,

e ~ QETre.q,

QGT,.,q

is called

graph

Figure

# ~. Treeq)

is a singleton

spanning

Figure

2C shows

corresponds

to the

3.3

of query graph and its single-clause

us to introduce

Definition ~aph

Example

tree an

ss

execution

set for each

(ss tree) tree

for

of

the

sequence

a

QGq.

in

query Figure

in

2a,

which

2b.

Cost Model

on which a relation R is partiWe denote as part(R) the set of attributes in the query execution. By the assumption stated in Section tioned sometime ACM

Transactions

on Database

Systems,

Vol

16, No. 2, June 1991.

286

.

1, the

D Shasha and T L Wang

initial

value

repartitionings

of part(R)

occur,

is always

part(R)

may

a join

strategy

a singleton

become

empty.

set.

Figure

However,

when

2d illustrates

such

a case. In order that

to evaluate

reflects

the

cost.

we

begin

effectively,

with

the

we

definition

need of the

to have cost

for

a measure a clause.

Definition 2. Let I R I denote the number of tuples in relation R and I tR I the width (in bytes) of a tuple in R. The cost of a clause {R. A, S. B} equals PC+ RC; PC=ax(l Rlxlt~l +lSlxlts l)isthe processing cost, and cost, where a, @ are nonzero constants RC = P x Move is the repartitioning (e.g.,

in the Teradata

case ~ = 2 a), and

P Moue =

if (Aepart( if (A

lslxlt~l 1 /Rlx/t~l

~ (Bepart(S))

cpart(R))

~ (B ~part(S)) A (Be

if (A~part(R))

II Rlxlt~l+l

Sl

Thus

we are assuming

sizes

of the

benchmarks

R))

relations. shown

that

xltsl

both

This

otherwise.

PC and RC are linearly

assumption

in Section

part(S))

1 and

is

the

proportional

consistent

Grace

with

the

measurements

to the Teradata

[18,

191; it is

also consistent with the cost equations derived elsewhere [271. Moue represents the amount of data transmitted in the network, where case 1 refers to the situation in which represent the situation case 4 represents

neither where

the situation

R nor S will be repartitioned; cases 2 and 3 one of’ the relations will be repartitioned, and where

both

relations

need to be repartitioned.

Also, we consider the measure of processing cost to be the number of bytes moved from disks, as opposed to the number of page fetches from disks often assumed

in the literature

[4, 17, 24], though

the two measures

a constant additive factor. By including factors and bandwidth, one can convert the byte units

Next, as noted in Section 1, in executing the joining nodes in different order can yield different the foHowing. Definition tree

3.

QGTr,P~

A clause

= (QvT,.eq,

string,

or string

QETreeq

)

C,l CLZ. .

QE~r,.q where 1,2, . . . . n, and

= {e, j 1 s i s c,, ~clauses(eL,,

n}, il, Treeq).

The order of clauses in a string associated tree are joined. There there are n edges in it.

for

is an ordered

on Database Systems, Vol

by only

same ss tree of a graph, costs.4 We thus introduce

short,

associated

clause

with

an ss

sequence

C?,n

i2, . . . , in

indicates the are n ! strings

is

a

permutation

of

order in which nodes of the associated with QGT,e.~ if

4Since the results obtained in this paper depend on the properties node and relation interchangeably, ACM Transactions

differ

such as page size, 1/0 speed, to corresponding time unit.

16, No 2, June 1991

of graphs,

we use the terms

Optimizing Equijoln Queries

El--@

Fig. 3.

287

.

Join graph of query shown in Figure

2a.

@---@

We

define

clauses.

the

cost

However,

partition influenced

of a string

since

the

as the

sum

that,

specified Finally,

in the string. we define the cost of an ss tree

when

calculating

the cost of a string,

Cost( QG~,,eq) CS is the

spanning tree

of

set of all strings

tree of a query QG~

minimum 3.4

costs

has

graph

a smaller

cost spanning

Exploiting

cost.

tree

Redundant

associated For

with

yields

Clauses

minimum

QG~,,,~.

A

Define

the

may refer to the graph model.

join

graph

for

same

a query

order

minimum

query

graph,

cost

no other

ss

executing

its

time.

Queries

Our goal is to minimize response time when processing a query. Thus have concentrated on query graphs, rather than on queries. However, ent query graphs we need another

are it is

}

response

to Optimize

the exact

of QG~ such that

a given

sizes and

which in turn tree are joined,

as

{COSt(cs)

QG~ is an ss tree

component

on the

one follow

QG~,,,~

= ,&&

of its

is dependent

attributes of its two corresponding relations, by the order in which relations of the query

required

where

of the

cost of a clause

query.

To have

a clear

view

JGq, as a pair

q, denoted

far we differof this,

( JV,,

JE,),

where

JVq = {R. A I R. A is an attribute JEq=

{{ R. A, S. B}l{R.

Intuitively from equijoin nent We

a join

different

clauses. ) Each

of the

graph.

say two

A, S. B}issome

graph

relations.

represents (This

3 shows are

occurs clause

in some clause

we

relation

consider

class of attributes the join

equivalent

graph

if their

connectivity y. For any state of a database, result. A clause c of a query q is redundant

of q)

in q}.

an equivalence

is because

equivalence

Figure

queries

of R that

on attributes

only

queries

is a connected

of the join

query

graphs

with compo-

in Figure have

the

2a. same

equivalent queries give the same if c is an edge in a cycle of JGq.

The closure of a query q is an equivalent query, denoted q+, to which one cannot add more redundant clauses. Figure 4 illustrates the above concepts. tree query, respectively) if it is A query is a chain query (star query, equivalent

to

some

query ACM

whose

query

Transactions

graph

on Database

is

a

Systems,

chain Vol

(star,

tree,

16, NO 2, June 1991

288

D. Shasha and T. L. Wang

.

(1) Consider the query q =

{{ R. A,’l’

B},

{2’ B,S

C},

{R.

A,S

C},

{T. B, U. D}}

(2) q’s Jom graph

(3) q’ = {{R (4)

A, T.B},

{7’ B,S

C},

{T, B,U

D}}

is equivalent

{R

C},

{T, B, U, D},

to

q,

{ It, A,S. C} is redundant,

(5) q+=

{{ R, A, T. B},

Fig. 4,

(1) Consider

the query

{T, B, S. C},

A,S

Join graph representation

{R

A, U. D},

{S C, U. D}}

of a query and its closure

q and Its closure (expressed in terms of them query graphs):

c1 = {R. A,S. B} QGV:

Cz = {S C3=

‘G”+: T

B,T

{SC,

D}

T E}

Cj = {T. E,UF} C5 = {R

A,T.

Dj

CG= {S. C, U.F]

(2) Assume

that all relations and intermediate results have the same size, K. Moreover, = C, part( ?’) = D, and part(U) = F part(R) = A, part(S) (3) lf we only co&ider QG,, we get the best .&-mgs (e,g., {R. A,S B}{S. B, T. D}{7’ E, U. F}) with cost 10K. the redundant clauses In (4) If we consider QG, +, we can get the string {R A, T. D}{ T. E, S. C}{ T.E, U F} with cost SK. Fig. 5.

respectively). tributed [5-7,

The

database

Example

showing

importance systems

the need to compute the query closure

of these

have

been

queries discussed

and

their

processing

in

in the

literature

extensively

dis-

16,371.

In order to minimize costs when processing a query, one might conjecture that it is enough to look at the spanning trees of its query graph. But this is wrong. Figure 5 illustrates such a case. 5 This example highlights an important idea: redundant clauses may lead to cheaper strings (and ss trees). It also suggests that when processing a query, we should begin with the closure of the query.

5For model

concreteness, (cf. Defimtion

benchmark ACM

in thm

and

subsequent

2) to 1 and 2, respectively

examples, (two

we set the values

that

2, June

1991

result)

TransactIons

on Database

Systems,

Vol

16, No

constants

are consistent

a and with

~ in the

cost

the Teradata

Optimizing Equijoin Queries Computing

closures

nodes to themselves) q has

clauses

may

sometimes

in the query

{R. A, S. B},

introduce

graph.

selfiloops

For example,

{S. B, T. C},

and

(i.e.,

suppose

{T. C, R. D}.

289

. edges

a given

Then

from query

QG~+

will

one can eliminate a contain { R. A, R. D}, an edge from R to R. In general, by replacing either R.X by R.Y or R.Y by R.X in all loop {R. X, R.Y) clauses the

of the

query.

The choice

partition {R. X, R. Y} will selection condition.

cal attributes we may loop-free query graphs. Based

unless

reduce

above

costs.

Henceforth,

say R. Y, is

one of them,

we restrict

formulation,

the problem

Given a query q, Find (1) QG~+; (2) a minimum string es that yields ~–’s cost.

cost spanning

(MRP)

on the

is arbitrary

R. X by R. Y. attribute of R, in which case we replace then be removed from the query graph and used as a Eliminating self-loops is useful, since by renaming identi -

initial

our

of minimizing

attention

response

to time

becomes

In the

following

general

graph

sections,

queries, process

once the spanning them by executing

queries.

the It

problem

tree :?- of QGq+ and the clause

is addressed

is important

to note

for chain,

star,

when

processing

that

tree,

and these

tree .7 and associated string cs are found, we can Y- based on the clauses and join order specified

in cs. 4. CLOSED As shown

CHAIN in Figure

necessarily remain section, we focus queries are called the next section. A dynamic

queries.

of joins.

closure

following

of a chain

technique

The dynamic

The algorithm,

dently by Sun et database systems. The

5, the

in the

query

graph

does not

a chain. If it is not a chain, it will contain a cycle.6 In this on queries q whose QG~+ is a chain. (Such a class of closed chain queries.) More general queries are treated in

programming

closed chain order

QUERIES

al.

in its style,

[321 for

notation

is used to solve

programming

is similar

optimizing

is needed

the

algorithm

chain

to describe

to that queries our

MRP

problem

computes proposed in

for

the best indepen-

object-oriented

algorithm.

For

1
1,

examine

all

CHAIN

runs

in 0(n3

in joining

we

have

or 0(n3) chain

when and

1 is

two cases. When k = 1, we are joining two results. Thus in order to find the lowest

examine

R,p and them

in such

in a closed

on an edge.

component

between

because

result,

minimum,

+ n21) time,

of nodes

all

clauses

attributes

of the two relations.

clauses

p + 1 = j),

intermediate

to

their

known

of the algorithm.

n is the number

of clauses

need

checking

the

El

Consider the following rather than intermediate

PROOF.

relations

When

Cost(&(

j).

The following

but

CHAIN

+ l,j).

Now,

cost

R, and

algorithm

Es(i, j)

by concatenating

remaining

constraint that the order By induction hypothesis,

by

This

Rp+l~ only

for

situations

some

the

the

requires

when

and we need to check

between are

same

relations,

as the

initial

0( nl) time. p,

i = p but

we are joining whether

two

i s p

0.00001 0.00005

0.0001

0.001

Range Fig.

13.

tuple’s

Effect

width

on an edge

factors

[1, 10], graph

of selectivity

(relation’s

cardinality

size = 12, chain

numbers

To avoid

out by fixing

as follows:

the

mutual

influence

the parameters

\ QG I = 12,

Notice

Num

= 2,

drawn

to query

I en I = 4,

also that increase.

algorithm

When

figure,

from

[10,000,

size = 4, number

selectivity

performance

of the

unattractive

Effect

of Query

algorithms. when

The

figure

in

20,000], of clauses

order

analysis

which

were

was set

QG) I = 2 for

any

to distinguish

the

algorithms

we

factors

factors

the

graphs,

I clauses(e,

for

see that

all

varying

join

the

heuristics

as the

selectivity

(LYR ~ s 0.00005).

PH deteriorates

the larger the pivot becomes. Consequently, nodes will incur an intolerable cost. Figure 14 shows the effect of varying

7.4

was

of parameters,

related

the selectivity factors. 13 Examining have close behavior for low selectivity

becomes

factors

= 2, chain

edge e in QG. The chains have been included simple heuristics from their hybrid counterparts. Figure 13 illustrates the behavior of the

factors

0.5

= 2)

procedures. carried

of selectivity

from

0.1

0.01

significantly

are high,

the more

joining the

the pivot

relation

shows

the sizes of relations

nodes are joined,

sizes

clearly

that

with

even more

on the

relative

algorithm

PH

increase.

Parameters

The main distinction between the hybrid heuristics and simple ones lies in the way they handle chains. The objective of this subsection is to explore the effect of varying chains in a query graph on the behavior of the heuristics. The following parameters’ values were assumed throughout the experiments: I t~ I was drawn from [1, 10], I R I was drawn from the range [100, 1,000,000], CYR,~ was fixed

at 0.1, and

I clauses(e,

QG) I was fixed

at 2.

131n subsequent experiments, ten query graphs were tested for each algorithm value of the solutions’ costs produced by each algorithm was plotted. ACM

Transactions

on Database

Systems,

Vol

16, No

2, June

1991

and the average

Optimizing Equijoin Queries

A

~

303

.

KH

70 -— ——

~~H ~~~~~ HPH

50 cost (in billions of bytes)

30 -

10

I [10,

100][1,

[1 ,000,

Range Fig.

14.

Impact

(the first item

item

gives

2, chain

of relation

of each label

the range

size,

of tuple’s

size = 4, number

where

on the

width;

A

H —

10,000][20, 30]

of relation

relation

x axis gives

of clauses

I

I

[100,1 ,000][10, 20]

10]

sizes

size = relation’s

the range

selectivity

factor

on an edge

= 2).

>

[10,000, 100,000][30, 40]

cardinality

of relation’s

x tuple’s

cardinality

= 0.1, graph

width

and the second

size = 12, chain

numbers

=

KH PH HI

size

Fig. 15. Cost of heuristics as a function of chain size (graph size = 20, number of chains = 1, selectivity factor = 0.1, number of clauses on an edge = 2, relation’s cardinality was drawn from [100, 1,000,0001, tuple’s

width

from [1, 101).

Figure 15 shows the effect of changing the chain size in a graph, where the size of the graph was fixed at 20 and Num at 1. It is apparent that as the chain becomes a major portion of a graph, the hybrid heuristics become better than

the simple

ones. One expects ACM

this

Transactions

because

the hybrid

on Database

Systems,

heuristics Vol.

guaran-

16, No. 2, June 1991.

304

D. Shashaand

.

A

T. L. Wang

~

90 -

KH

HPH 70 –

cost (in billions

50

,,

of bytes)

I

14

12

Graph

Fig. 16

Cost ofheuristics

chains

= 1, selectivity

drawn

from

tee that mance

an optimal clauses,

tuple’s

solution

which

fixed

at 1/2).

by its inherently inside

Figure

fixing

It is clear

I QG I was fixed

also influences

size, number cardmality

of was

large

local

The

poor

perfor-

is due to the way

view.

a chain—very

This

situation

few choices

can

which of the clauses should be picked at each step. the size of a graph has very little impact on the

that

the

16 illustrates

portion

increasing

of the only

relative performance of the heuristics—the remain and gradually become larger. Figure 17 shows the effect of varying (where

on a chain.

size becomes

searching

of the heuristics. while

x graph

= 2, relation’s

[1, 10])

the chain

is limited

size = 1/2

on an edge

can be achieved

if it starts

be made in determining On the other hand, size of a graph

from

size

(chain

of clauses

width

PH when

to be worse

performance

of~aphslze

= 0.1, number

[100, 1,000,0001,

of algorithm

it picks tends

asafunction

factor

20

18

16

at 20 and

the performance

the

chain

the graph gaps

the

of varying portion

size does not affect

between

number

these

of chains

I en I at 4). Although

of the hybrid

effect

in it (this

the number

heuristics,

the was the

algorithms in

a graph of chains

it has a less signifi-

cant effect than the size of a chain. The big gap between algorithms HPH and KH (HKH) may be due to the fact that the chosen size of chains is too small, as compared with that of the whole graph. Notice that there is a slight increase of the gap between algorithms HKH and KH when the number of chains is 4. This indicates that the existence of many short chains in a graph can have a negative effect on the performance of algorithm HKH. A possible explanation for this is that these chains decrease the number of nonchain edges in the graph

and consequently

can be made when globally picking In conclusion, when selectivity

restrict

the total

a clause. factors are

low,

number the

of choices

proposed

that

heuristics

have close performance. In the presence of high selectivity factors, algorithm PH becomes less competitive while algorithm KH (and HKH) remain good. If ACM

TransactIons

on Database

Systems,

Vol

16, No

2, June

1991

Optimizing Equijoin Queries

o

1

2 Number

Fig. 17. selectivity

.

3

305

4

of chains

Cost of heuristics as a function of number of chains (graph size = 20, chain size = 4, factor = 0.1, number of clauses on an edge = 2, relation’s cardinality was drawn from

[1OO, 1,000,0001, tuple’s

width

from [1, 10]).

a graph contains long chains (with size being over half nodes in the graph), we suggest to use the hybrid Kruskal

of the number of heuristic (HKH).

8. SUMMARY In this

paper,

important

we investigated

queries

partitioned

in

data

ways

loosely

of minimizing

coupled

distribution

response

multiprocessor

scheme.

First,

time

systems

we

for various

using

developed

a hash-

a

dynamic

programming algorithm for closed chain queries. Next, we proved the NPcompleteness for more general queries and proposed four heuristics for them. Our simulation results show that an algorithm similar to Kruskal’s spanning tree algorithm performs well when chains in a query graph are small. When chains are long, then Kruskal’s is best. Like other relevant relied sizes priori

a hybrid work

on the knowledge of intermediate in

actual

algorithm

[15,

25],

of selectivity results.

the

using heuristics

factors,

which

This information, 14 Fortunately,

environments.

the

chain

algorithm

presented were

with

in the

used to find

paper out the

however, is not known a our qualitative conclusions

were independent of the selectivity factors. Our results have all assumed that relations were partitioned on a single attribute. Generalizing beyond this is not difficult. For example, consider a R and S based on the clauses {{R. A, S. C}, { R. B, S. D} }, join between where

“A

R is partitioned

practical

systems Naughton

approach

(see,

e.g.,

on A and

would

[26]).

[21] for theoretical

S on C. One can process

be to estimate

Readers

may

analyses

also of the

the

selectivity

refer

to

factors

Gardy

sizes of intermediate

ACM Transactions

and

like

the join

those

Puech

used

[13]

and

based

on

in most

real

Lipton

and

relations.

on Database Systems, Vol

16, No. 2, June 1991.

306

D. Shasha and T. L. Wang

.

{R. A, S. C}, treating hand, if the join were

{R. B, S. D} as a selection condition. On based on {R. A, S. C} and R were partitioned

the

other AB

on

would have to be repartitioned to A and C’, and S on CD, both relations respectively. Therefore, the set of repartitionings required when R is partiwhen R is partitioned tioned on A alone is always a subset of those required on AB. attributes

To handle multiattribute partitions, we can treat the partition Thus, to perform the (e. g., AB) as if they were a single attribute. join having clauses {R. A, S. C}, { R.B, S. D}, and {R. E, S. F}, given AB and CD as partition attributes, we process { {R. A, S. C}, {R. B, S. D} ) and use condition. {R. E, S. F} as a selection We have perhaps

not discussed

partitioned

the possibility

on different

of using

attributes.

multiple

Often,

copies

keeping

of relations,

multiple

copies

at

each node reduces the cost of repartitionings considerably. As an example, consider again the suppliers-and-parts database and SQL query presented in Section 1. Suppose one copy of SUP is partitioned on s #, one copy of PART on p #, one copy of SUP-PART on s #, and another copy of SUP-PART on p #. Tz to SUP-PART. p #, instead of sending its complete When repartitioning tuples, the second strategy could send the surrogates [8] of tuples in SUP-PART, along

with

the

materialized

complete

tuples

in

SUP.

at each node by consulting

is partitioned

on p #.

Suppose

cost of this

scheme

would

significant

improvement

These

surrogates

the local

sending

would

fragment

surrogates

takes

10 seconds.

be 100 + 10 + 100 = 210 seconds, over the original

scheme.

then

of SUP-PART Then

which

Investigating

can be optimized in such a highly parallel environment with seems to be a very interesting problem for future research.

be that the

achieves

a

how queries replicated

data

ACKNOWLEDGMENTS

The

authors

Won

Kim,

are deeply for their

also wish

to thank

for helpful

indebted

to the

encouragement Richard

discussions

Cole,

and Zvi

anonymous

referees

thoughtful

recommendations.

Kedem,

in the formative

Bud

stages

Mishra,

of this

and

the

editor, They

and Paul

Spirakis

work.

REFERENCES 1. APERS, P, M. queries.

2. BABB, E. Trans. 3

Trans.

6, 4 (Dec. 1981),

SE-9,

a relational

P. A , Goomwmv,

SIN,

is a system

Optimization

algorithms

for

19S3), 57-68. database by means of specialized

1979),

N.,

S. B

distributed

1 (Jan,

hardware.

ACM

1-29.

Worm>

E , CHIUSTOFHEE,

for distributed

databases

L.

(SDD-1).

R.

AND ROTHNIE,

ACM

Trans.

J, B,,

Database

Jxz. Syst,

602-625.

D., BORAL,

execution

R., AND YAO, Eng.

1 (Mar.

Syst. 4,

processing

4. BITTON,

A.

Soflw.

Implementing

Database

~SItNS,

Query

G., HEVNER,

IEEE

H , DEWITT,

of relational

D. J., AND WILKINSON,

database

operations,

ACM

W

Trans.

K.

Parallel

Database

algorithms

Syst.

for the

1983)

8, 3 (Sept

324-353. An optimal algorithm for processing 5. CHEN, A. L. P , AND LI, V, O. K. queries. IEEE Trans. Softw. Eng. SE-11, 10 (Oct. 1985), 1097-1107.

distributed

6

in a distributed

CHILI, D. M., database

7. CHIU, ACM

D.

BERNSTEIN,

system. M.,

Transactions

SIAM

AND Ho, on Database

P. A.,

AND Ho,

J. Comput. Y,

C.

A

Systems,

13,

Y.

C.

1 (Feb.

methodology Vol.

16, No

Optimizing 1984), for

chain

queries

star

116-134 interpreting

2, June

1991

tree

queries

mto

optimal

Optimizing Equijoin Queries semi-join

expressions.

Management 8. CODD, E. Trans.

In Proceedings of the A CM-81GMOD International (May 1980). ACM, New York, 1980, pp. 169-178.

of Data F.

Extending

Database

the

Proceedings

W.,

BOUGHTER,

of the ACM-SIGMOD

ACM,

(1988).

New York,

model

to

capture

Conference

on the

meaning.

ACM

more

E., AND KELLER,

International

T.

Data

Conference

placement

in Bubba.

on the Management

of Data

1988, pp. 99-108.

10, DEWITT, D. J., AND GERBER, R, of th e 11th

relational

307

1979), 397-434.

Syst. 4, 4 (Dec.

9. COPELAND, G., ALEXANDER, In

database

.

International

Multiprocessor

Conference

hash-based join algorithms.

on Very

Large

GRAEFE,

G.,

Data

Bases

In Proceedings

(Stockholm,

Aug.

1985),

pp.

151-164. 11. DEWITT,

D.

J.,

GERBER,

MURALIKRISHNA, ceedings 1986),

M.

R.

H.,

GAMMA—A

of the 12th

high

International

Conference

D.

Teradata

J.,

SMITH,

Database

M.,

AND BORAL,

Machine.

MCC

13. GARDY, D,, AND PUECH, C. Database

Syst.

On the

Freeman,

Database

Syst.

Database

17. KIM,

W,

Syst.

A new

ACM-SIGMOD

effect

Set query

1980).

RIMS

way

ACM,

York,

Springer-Verlagj

Generation

KRUSKAL,

21. LIPTON,

B.,

R,

J,,

Database

Tree

(1957),

the

pp.

York,

1 (1983), On

the

Shortest

and 1983,

and join

ACM

Trans.

to the Theory

database

of

systems.

ACM

queries

ACM

of relational In

of Data

Proceedings

(Santa

of the

Monica,

Calif.,

T.

Relational

pp.

algebra

(1982),

machine

Lecture

GRACE,

Notes

In

in Computer

191-212. to data

base

machine

and

its

architecture,

62-74.

J.

ACM

of relations.

Engineering

of hash spanning

Math. F,

subtree

of

1990). ACM,

connection

a graph

and

the

traveling

7, 1 (1956), 48-50.

Sot.

Query

size

estimation

SIGACT-SIGMOD-SIGART

SELINGER,

New York,

networks

by

adaptive

Symposium

on

sampling,

In

Principles

of

1990, pp. 40-46. queries

m distributed

and some generalizations,

P. G,,

Syst.

databases. Bell

Syst.

IEEE

Tech.

J.

pp

SHAPIRO, L. Syst.

operations

1 (Mar.

M

M.,

in a relational

Conference

1979,

for

efficient

query

processing,

ACM

Trans.

113-133.

11,

ASTRAHAN,

selection

International

A technique

1986), of join

Database

path

Database

11, 2 (Jun.

Optimization

Trans.

Access

Fragmentation:

Syst.

SEGEV, A. ACM

of the

1389-1401,

SACCO, G. M.

in

1986),

horizontally

CHAMBERLAIN, D. database

on the Management

partitioned

database

systems,

48-80. system, of Data

D., In

LORIE,

R. A.,

Proceedings

(Boston,

AND PRICE,

T.

G.

of the ACM-SIGMOD

Mass.,

May

1979).

ACM,

New

23-34.

Join

processing

11, 3 (Sept.

SHASHA, D. Query Advanced Database of the International

in database 1986),

processing Symposium

SHASHA, D., AND SPIR~KIS, ings

Aug.

179-187.

shortest

Amer.

Stth

(Apr.

sizes.

– A Guide

class

on the Management

Application

Proc.

Systems

Database

29,

Japan,

evaluation

on relation

in distributed A simple

product

Science

AND NAUGHTON,

of

23. PRIM, R. C.

28.

Pro-

1979.

22. PRAMANIK, S., AND VINEYARD, D. Optimizing join Trans. Softw, Eng. 14, 9 (Sept. 1988), 1319-1326.

27.

AND

In

(Kyoto,

performance

and Intractability

queries:

the

1980,

New

JR,

problem.

Proceedings

York,

B.,

1987,

operations

Calif.,

H., AND MOTO-OKA,

ET AL,

Comput.

J.

salesman

26.

Bases

K.

machine.

1982), 653-677.

on Sof%ware

M.,

of join

optimization

Conference

TANAKA,

Symposium

Science.

O.

to compute

New

M.,

19. KITSUREGAWA,

25.

KUMAR,

1986), 265-293.

7, 4 (Dec.

International

18. KITSUREGAWA,

24.

Data

user

DB-081-87,

Computers

11, 3 (Sept.

16. GOODMAN, N , AND SHMUELI, Trans.

A single

Rep.

San Francisco,

GAWSH, B., AND SEGEV, A. Trans.

20,

L.,

database

1989), 574-603.

14, 4 (Dec.

NP-Completeness.

New

on Very Large

H.

Tech.

14. GAREY, M. R., AND JOHNSON, D. S,

May

M.

dataflow

pp. 228-237.

12, DEWITT,

15,

HEYTENS,

performance

P.

systems

in a symmetric (Tokyo, Japan, Fast

Conference ACM

with

large

main

memories.

ACM

Trans.

239-264.

parallel

parallel environment, In Aug. 1986), pp. 183-192, algorithms

on Supercompu Transactions

ting

for processing (Athens,

on Database

Greece,

Systems,

Vol

proceedings

of the

of joins,

In

June

1987).

Proceed-

16, No, 2, June 1991.

308 W

D. Shasha and T. L. Wang

. STAMOS,

J.

W.,

distributed 31.

STONEBRAKER, ceedings Networks

M,,

Berkelebv

W , AND

manuscript, of Illinois

TERADATA

CORPORATION.

database

A

symmetric

IBM

E.

Corp.,

fragment

San Jose,

A distributed

Workshop

on

P.,

Query

at Chicago,

Teradata

and

Dee

database

Dwtrlbuted

replicate

algorithm

for

1989, version

Data

of INGRES

Management

In

Pro-

and

Computer

Database

Trans.

in object-oriented

Electrical

computer

Engineering

system

Los Angeles,

GARDARIN,

ACM

optimization of

and

database

systems

Computer

Science,

1990,

Corporation,

AND

machine.

Yu, C.

Department

University

VALDURIEZ,

C

1977).

(May

C02-0001-01,

H.

RJ7188,

AND NEUHOL~,

3rd

MENG,

Unpublished

34.

Res. Rep

of the

32. SUN, W.,

33

AND YOUNG,

joins.

G,

Jom

Database

and Syst.

Oct.

concepts

semijoin 9,

and

facilities,

Document

1984 algorithms

for

a multiprocessor

1 (Mar. 1984), 133-161.

35

WILLI.4M, R., ET AL. R*: An overview of the architecture Res, Rep. RJ3325, IBM Corp., San Jose, Dec. 1981. 36. WoN~, E, Dynamic rematerlalization: Processing distributed queries using redundant Softw. Eng. SE-9, 3 (May 1983), 228-232. data IEEE Trans

37

Yu,

C, T,, OZSOYOGLU, M

J. Comput. 38.

Yu,

C.

T.,

Algorithms 10 (Ott

Received

ACM

Syst,

Z., AND LAM, K,

Sci. 29 (1984),

GUH,

K.

to process

C.,

Distributed

query

optimization

ZHAN~,

distributed

W.,

TEMPLETON.

queries

in fast

M.,

local

BRILL,

networks

D.,

AND

TransactIons

1988;

revised

on Database

May

Systems,

1989

Vol

and April

16, No

1990;

2. June

accepted

1991

CHEN,

IEEE

Trans.

April

1990

1987), 1153-1164,

January

for tree

queries

409-445. A.

Comput.

L.

P

C-36,