STR: A Simple and Efficient Algorithm for R-Tree Packing*

Scott T. Leutenegger, Jeffrey M. Edgington, Mario A. Lopez
Mathematics and Computer Science Department
University of Denver
Denver, CO 80208-0189
{leut,jedgingt,mlopez}@cs.du.edu

NASA Contractor Report 201661
ICASE Report No. 97-14
NASA Contract No. NAS1-19480
February 1997

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center, Hampton, VA 23681-0001
Operated by Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23681-0001

Abstract

In this paper we present the results from an extensive comparison study of three R-tree packing algorithms, including a new easy to implement algorithm. The algorithms are evaluated using both synthetic and actual data from various application domains including VLSI design, GIS (tiger), and computational fluid dynamics. Our studies also consider the impact that various degrees of buffering have on query performance. Experimental results indicate that none of the algorithms is best for all types of data. In general, our new algorithm requires up to 50% fewer disk accesses than the best previously proposed algorithm for point and region queries on uniformly distributed or mildly skewed point and region data, and approximately the same for highly skewed point and region data.

*The work of Leutenegger and Lopez was supported in part by Colorado Advanced Software Institute grant number TT-97-05. Leutenegger was additionally supported in part by the National Aeronautics and Space Administration under NASA contract NAS1-19480 while visiting the Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, VA 23681-0001.

1

Introduction

R-trees [5] are a common indexing technique for spatial data and are widely used in spatial and multi-dimensional databases. Typical applications include computer-aided design, geographic information systems, computer vision and robotics, multi-keyed indexing for traditional databases, temporal databases, and scientific databases. By storing the bounding boxes of arbitrary geometric objects, such as points, polygons, or more complex objects, R-trees can be used to determine which objects intersect a given query region.

R-trees are dynamic data structures, in the sense that their contents can be modified without reconstructing the entire tree, and Guttman [5] provides efficient routines for insertion and deletion of objects. Unfortunately, building an R-tree by inserting one object at a time as specified by Guttman has several disadvantages: (a) high load time, (b) sub-optimal space utilization, and, most important, (c) poor R-tree structure requiring the retrieval of an unduly large number of nodes in order to satisfy a query. Other dynamic algorithms [1, 13] improve the quality of the R-tree, but still are not competitive with regard to query time when compared to loading algorithms that are allowed to preprocess the data to be stored.

Preprocessing is particularly attractive for applications where the data is fairly static (i.e., does not change often) or available a priori and, when done properly, results in R-trees with nearly 100% space utilization and improved query times (due to the fact that fewer nodes need to be accessed while performing a query). Such packing algorithms were first proposed by Roussopoulos [12] and later by Kamel and Faloutsos [6]. Kamel and Faloutsos [6] propose a packing algorithm based on the Hilbert Curve ordering and compare it with the Nearest-X algorithm proposed by Roussopoulos [12] with regard to area and perimeter metrics. The Hilbert-based algorithm of [6] significantly outperforms Nearest-X of [12] for region queries.

In this paper we propose a new packing algorithm, STR (Sort-Tile-Recursive), that is simple to implement, and compare it with the Hilbert-based algorithm of [6] and Nearest-X of [12] for point and region queries. STR outperforms the Hilbert-based algorithm in some cases while remaining competitive in the others. In addition to area and perimeter metrics, we provide experimental evidence based on real implementations utilizing an LRU buffer on VLSI design, GIS, computational fluid dynamics, and synthetic data, for a wide range of buffer sizes. We know of no other work that has considered such a wide range of data set types and the effect they have on packing performance.

In real databases some portion of the tree is buffered in main memory. This buffering of portions of the tree can significantly affect performance as shown in [8]. Consequently, our experimental studies utilize a buffer as described in Section 3.

The rest of the paper is organized as follows. In Section 2 we provide background information on R-trees and describe the three packing algorithms considered in this paper. In Section 3 we present our experimental methodology. Section 4 contains results from our experiments and Section 5 concludes.

Figure 1: A sample R-tree. Input rectangles are shown solid.

2

Overview of R-trees and Packing Algorithms

In this section we provide a brief overview of the R-tree and describe several packing algorithms, including Nearest-X [12], Hilbert [6], as well as Sort-Tile-Recursive (STR), a new packing algorithm which we are proposing. Detailed knowledge of Nearest-X and Hilbert packing is useful but not required for understanding the remainder of this paper. Readers interested in more detailed descriptions should refer to [12, 6].

2.1

R-trees

An R-tree is a hierarchical data structure derived from the B-tree and designed for efficient execution of intersection queries. R-trees store a collection of rectangles which can change over time through insertions and deletions. Arbitrary geometric objects are handled by representing each object by its minimum bounding rectangle (MBR), i.e., the smallest upright rectangle which encloses the object. R-trees generalize easily to dimensions higher than two, but for notational simplicity we review only the two dimensional case.

Each node of the R-tree stores a maximum of n entries. Each entry consists of a rectangle R and a pointer P. For nodes at the leaf level, R is the bounding box of an actual object pointed to by P. At internal nodes, R is the minimum bounding rectangle (MBR) of all rectangles stored in the subtree pointed to by P. Note that every path down through the tree corresponds to a sequence of nested rectangles, the last of which contains an actual data object. Note also that rectangles at any level may overlap and that an R-tree created from a particular set of objects is by no means unique.

Figure 1 illustrates a 3-level R-tree where a maximum of 4 rectangles fit per node. We assume that the levels are numbered 0 (root), 1, and 2 (leaf level). There are 64 rectangles represented by the small dark boxes. The 64 rectangles are grouped into 16 leaf level nodes, numbered 1 to 16. The MBR enclosing each leaf node is the smallest box that fully contains the rectangles within the node. These MBRs are the rectangles stored at the next higher level of the tree. For example, leaf nodes 1 through 4 are placed in node 17 which is at level 1. The MBR of node 17 (and nodes 18, 19, 20) is purposely drawn slightly larger than needed for clarity. The root node contains the four level 1 nodes: 17, 18, 19, and 20.

To perform a query Q, all rectangles that intersect the query region must be retrieved and examined (regardless of whether they are stored in an internal node or a leaf node). This retrieval is accomplished by using a simple recursive procedure that starts at the root node and which may follow several paths down through the tree. A node is processed by first retrieving all rectangles stored at that node which intersect Q. If the node is an internal node, the subtrees corresponding to the retrieved rectangles are searched recursively. Otherwise, the node is a leaf node and the retrieved rectangles (or the data objects themselves) are simply reported.

For illustration, consider the query Q in the example of Figure 1. After examining the root node, we determine that nodes 19 and 20 of level 1 must be searched. The search then proceeds to each of these nodes. It is then determined that the query region does not intersect any rectangles stored in node 19 or node 20 and each of these two subqueries is terminated.

The R-tree shown in Figure 1 is fairly well structured. Inserting these same rectangles into an R-tree using the insertion algorithms of Guttman [5] would likely result in a less well structured tree. Algorithms to create well structured trees have been developed and are described in Section 2.2. These algorithms cluster rectangles in an attempt to minimize the number of nodes visited while processing a query.

For the rest of the paper we will assume that exactly one node fits per disk page, and hereafter we use the two terms interchangeably.

2.2

Packing Algorithms

In this section we describe three packing algorithms. All of the algorithms use a similar framework. In the following text we assume that the data file consists of r rectangles and that each R-Tree node can hold n rectangles. The general process is similar to building a B-tree from a collection of keys by creating the leaf level first and then creating each successively higher level until the root node is created [11].

General Algorithm:

1. Preprocess the data file so that the r rectangles are ordered in $\lceil r/n \rceil$ consecutive groups of n rectangles, where each group of n is intended to be placed in the same leaf level node. Note that the last group may contain fewer than n rectangles.

2. Load the $\lceil r/n \rceil$ groups of rectangles into pages and output the (MBR, page-number) for each leaf level page into a temporary file. The page-numbers are used as the child pointers in the nodes of the next higher level.

3. Recursively pack these MBRs into nodes at the next level, proceeding upwards, until the root node is created.

The three algorithms differ only in how the rectangles are ordered at each level.
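The general framework can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `pack`, `search`, and `center_x` are hypothetical, rectangles are assumed to be `(xmin, ymin, xmax, ymax)` tuples, nodes are kept in memory instead of being written to disk pages, the input is assumed non-empty, and the default ordering simply sorts by the x-coordinate of each rectangle's center. The recursive `search` follows the query procedure reviewed in Section 2.1.

```python
def center_x(r):
    """x-coordinate of the center of rectangle r = (xmin, ymin, xmax, ymax)."""
    return (r[0] + r[2]) / 2.0


class Node:
    def __init__(self, entries, is_leaf):
        self.entries = entries      # rectangles in a leaf, child Nodes otherwise
        self.is_leaf = is_leaf
        rects = entries if is_leaf else [c.mbr for c in entries]
        # Minimum bounding rectangle of everything stored below this node.
        self.mbr = (min(r[0] for r in rects), min(r[1] for r in rects),
                    max(r[2] for r in rects), max(r[3] for r in rects))


def pack(rects, n, key=center_x):
    """Bottom-up packing: order, cut into groups of n, repeat on the MBRs."""
    ordered = sorted(rects, key=key)
    # Steps 1 and 2: ceil(r/n) leaf groups; the last may hold fewer than n.
    level = [Node(ordered[i:i + n], True) for i in range(0, len(ordered), n)]
    # Step 3: pack the MBRs of the current level until a single root remains.
    while len(level) > 1:
        ordered = sorted(level, key=lambda node: key(node.mbr))
        level = [Node(ordered[i:i + n], False)
                 for i in range(0, len(ordered), n)]
    return level[0]


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node, q, stats=None):
    """Report all stored rectangles intersecting q; count visited nodes."""
    if stats is not None:
        stats["nodes"] = stats.get("nodes", 0) + 1
    if node.is_leaf:
        return [r for r in node.entries if intersects(r, q)]
    hits = []
    for child in node.entries:
        if intersects(child.mbr, q):
            hits.extend(search(child, q, stats))
    return hits
```

Swapping the `key` argument (or the sorting step as a whole) is the only change needed to move between orderings, which mirrors the statement that the algorithms differ only in how the rectangles are ordered at each level. The `stats` counter is a rough in-memory stand-in for the number of nodes visited by a query; it does not model buffering.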

Nearest-X (NX): This algorithm was proposed in [12]. The rectangles are sorted by x-coordinate. No details are given in the paper so we assume that the x-coordinate of the rectangle's center is used. The rectangles are then packed into the nodes, in groups of size n, using this ordering.

Hilbert Sort (HS): The algorithm was proposed in [6]. The algorithm orders the rectangles using the Hilbert (fractal) space filling curve. The center points of the rectangles are sorted based on their distance from the origin, measured along the Hilbert Curve. This determines the order in which the rectangles are placed into the nodes of the R-Tree.

Kamel and Faloutsos [6] only provide details on how to handle integer coordinates. We now briefly describe how the algorithm can be extended to arbitrary floating point values. Floating point numbers are usually stored as a sign, a mantissa, and a signed exponent (which may or may not be normalized). When viewed in this form, it is clear that all floating point numbers can be represented using $2^{\mathrm{sizeof(Exponent)}} + \mathrm{sizeof(Mantissa)}$ bits. For example, for 32-bit floating point numbers on the Sun Sparc architecture, $2^{8} + 23$ bits would be required. Conceptually, once coordinates are viewed in this binary representation, the method used for integers can be applied. (This is a conceptual representation only. In practice, the native float representation is used, and one does not store or compute all $2^{\mathrm{sizeof(Exponent)}} + \mathrm{sizeof(Mantissa)}$ bits.)

Consider two center points $(x_1, y_1)$ and $(x_2, y_2)$ that need to be compared. The first bit of the x- and y-coordinates of each point determines which quadrant of a hypothetical 2-dimensional grid contains it. Successive bits determine successively smaller subquadrants in which the point lies. The bits are examined one bit position at a time, starting with the first, until discrimination is possible, i.e., until the two points lie in different subquadrants. Once this position is found, the information gathered while processing the bits, together with the rotation tables described in [6], is used to decide which point is closer to the origin (along the Hilbert Curve).
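For integer grid coordinates, the distance of a cell along the Hilbert Curve can be computed with the widely used bit-by-bit conversion sketched below. This is an illustration only: it is not the floating point comparison procedure described above, and the names `hilbert_index` and `hilbert_sort`, the `order` parameter, and the quantization of centers in the unit square onto a $2^{order} \times 2^{order}$ grid are all assumptions introduced for the example.

```python
def hilbert_index(order, x, y):
    """Distance of integer cell (x, y) from the origin along a Hilbert curve
    covering a 2**order by 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        # Which quadrant (at the current scale) contains the cell?
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip so the sub-curve has the canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_sort(rects, order=16):
    """Sort rectangles (xmin, ymin, xmax, ymax) in the unit square by the
    Hilbert index of their quantized center points."""
    side = 1 << order

    def key(r):
        cx = (r[0] + r[2]) / 2.0
        cy = (r[1] + r[3]) / 2.0
        gx = min(side - 1, max(0, int(cx * side)))
        gy = min(side - 1, max(0, int(cy * side)))
        return hilbert_index(order, gx, gy)

    return sorted(rects, key=key)
```

Quantizing to a fixed grid loses resolution for very dense data, which is one reason a comparison on the native floating point representation, as outlined in the text, may be preferable in practice.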

Sort-Tile-Recursive (STR): Consider a k-dimensional data set of r hyper-rectangles. A hyper-rectangle is defined by k intervals of the form $[A_i, B_i]$ and is the locus of points whose i-th coordinate falls inside the i-th interval, for all $1 \le i \le k$.

STR is best described recursively with k = 2 providing the base case. (The case k = 1 is already handled well by regular B-trees.) Accordingly, we first consider a set of rectangles in the plane. The basic idea is to "tile" the data space using $\sqrt{r/n}$ vertical slices so that each slice contains enough rectangles to pack roughly $\sqrt{r/n}$ nodes. Once again we assume coordinates are for the center points of the rectangles. Determine the number of leaf level pages $P = \lceil r/n \rceil$ and let $S = \lceil \sqrt{P} \rceil$. Sort the rectangles by x-coordinate and partition them into S vertical slices. A slice consists of a run of $S \cdot n$ consecutive rectangles from the sorted list. Note that the last slice may contain fewer than $S \cdot n$ rectangles. Now sort the rectangles of each slice by y-coordinate and pack them into nodes by grouping them into runs of length n (the first n rectangles into the first node, the next n into the second node, and so on).

The case k > 2 is a simple generalization of the approach described above. First, sort the hyper-rectangles according to the first coordinate of their center. Then divide the input set into $S = \lceil P^{1/k} \rceil$ slabs, where a slab consists of a run of $n \cdot \lceil P^{(k-1)/k} \rceil$ consecutive hyper-rectangles from the sorted list. Each slab is now processed recursively using the remaining k - 1 coordinates (i.e., treated as a (k - 1)-dimensional data set).

To aid in visualizing the result of these packing algorithms, consider the leaf level nodes obtained by using the Long Beach Tiger data set, which we use as part of our experimental comparison of the algorithms, assuming 100 rectangles fit per node. Figures 2, 3, and 4 show the resultant leaf level MBRs for the same data set for each of the three algorithms. Note the slices in Figure 4 for the STR packing algorithm.
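For k = 2, the STR tiling step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `str_leaf_groups` is hypothetical, rectangles are `(xmin, ymin, xmax, ymax)` tuples, and center coordinates are used for both sorts. The function returns only the leaf level grouping; the same step would be applied again to the MBRs of each level to build the upper levels.

```python
from math import isqrt


def str_leaf_groups(rects, n):
    """Group rectangles into leaf nodes of at most n entries using 2-D STR."""
    if not rects:
        return []
    P = (len(rects) + n - 1) // n      # number of leaf pages, ceil(r/n)
    S = isqrt(P - 1) + 1               # ceil(sqrt(P)), computed exactly
    by_x = sorted(rects, key=lambda r: (r[0] + r[2]) / 2.0)
    groups = []
    # Each vertical slice is a run of S*n consecutive rectangles in x order.
    for i in range(0, len(by_x), S * n):
        vertical_slice = sorted(by_x[i:i + S * n],
                                key=lambda r: (r[1] + r[3]) / 2.0)
        # Within a slice, pack runs of n rectangles in y order into nodes.
        for j in range(0, len(vertical_slice), n):
            groups.append(vertical_slice[j:j + n])
    return groups
```

On 16 points arranged in a 4 by 4 grid with n = 4, this yields four groups that each cover a 2 by 2 tile, whereas sorting by x alone would produce four tall, thin columns with larger perimeters.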

3

Experimental Methodology

In this section we describe our experimental methodology. Our goal is to provide a solid experimental comparison of the algorithms through actual R-tree implementations, using both "real world" data and synthetic data sets. We intend to provide insight into how well R-trees would perform as part of a typical database system which supports spatial queries over a wide range of data. Thus, in order to provide realistic and meaningful performance measurements, the effect of buffering must be taken into consideration. Our primary comparison metric is the number of disk accesses required to satisfy a query of a given size. Note that this metric also allows us to get an accurate indication of performance even using non-dedicated workstations. If we had focused on retrieval time, interference from other users would have clouded our results.

We assume an LRU may

arguably

be to pin the

LRU scheme pinning, which

except

in unusual

case it should

disk partition.

Many than

sets

performance of the

is the

to be used

of the

time.

four Sun Sparc For each

query

the

is made.

percent

should

and

is often

then

use an

no gain from

the root just fits into the buffer (regardless

of their

data

manager.

by near

term

future

the depth

this

pool, in

level) to simplify

of the R-tree, R-trees

using

written

a raw

to disk and

we can easily

we consider

Data

have

several

small

buffer

data

data

set

vary

a fan out

are still smaller

size affects

and 2) it decreases

Since

using larger

smaller

yet they

Thus, one of the experimental

can be buffered.

be obtained

with these

manager

Thus,

studies,

applications.

Since most

as the second.

as would

memory

used in previous

50,000 rectangles),

Even

virtual

those

set that

our buffer

the node is immediately

than

1) it increases

of our experiments,

considered.

curves

levels

routine

the percentage

of 25 to 100 the first parameters

of interest

of our experiments sizes.

use small

This aJlows us to obtain

sets and buffers,

sets our experiments

R-tree

at a great

took two months

savings utilizing

5 workstations.

being

data

management

the OS or hardware.

will fit in the buffer.

of results

in experimental

buffer

in [8] there

size, we implement

system's

reconfiguring

is not as significant

type

As shown

out of the buffer

sets are larger

sets (approximately

the same

better

first few R-tree

a level near

of buffer

operating

size without

likely

percentage

data

by the

set that

consideration

where

of the

We use LRU for all the nodes

the impact

in two ways:

data

A slightly

number

of the R-tree.

a node is pushed

of our data

data.

some

circumstances

assess

When

buffer

routine.

of our experiments.

"false-buffered"

the actual

and

nodes

be pinned.

space

To accurately

management

root

for the remaining

the parameter

not

buffer

we build

In each experiment set with 2,000 Thus,

queries.

to err on the

not be considered

the R-tree

the exact

to the specific

packing

same data. set is used for all algorithms.

No attempt side of caution

significant.

according

at collecting we advise

confidence that

intervals

differences

algorithm We then

or smoothing

of less than

a few

To provide a uniform experiment space we normalize all data sets to the unit square. Point queries are uniformly distributed in the unit square. We consider region queries whose region equals 1% and 9% of the unit square. The lower left hand corner is uniformly distributed in the unit square. The upper right hand corner is computed by adding $\epsilon$ to the x- and y-coordinates, where $\epsilon = 0.1$ or $0.3$ for region queries of 1% and 9% respectively. If the x- or y-coordinate is larger than 1.0 we set the coordinate to 1.0. For uniformly distributed data, a query region of 9% will return roughly 9% of the data, but for highly skewed data (as in the data sets described below) a query region of 9% of the unit square may return much more or much less than 9% of the data; the variance on the amount of output is large.

Our secondary metrics are the area and perimeter of the MBRs of the R-tree nodes [6]. We include these measures as additional comparison information, and present the sum of the area and perimeter for both the whole tree (summed over all nodes at all levels) and also only for the leaf level. These measures are good indicators of the number of nodes accessed by a query if buffering is not considered, but can be misleading in a general setting [8]. We argue that the leaf level metric is of most interest, since the non-leaf level nodes will likely be buffered.

The applicability of our conclusions is dependent on how representative our data sets are. This is a non-trivial question. We address this issue by considering many different types of real data as well as synthetic data. In particular we consider the following data sets:

1. (GIS): As a data set representative of geographic information systems we chose the Long Beach data set obtained from the TIGER system of the U.S. Bureau of Census. This data set has been used extensively in past studies. The data set contains 53,145 line segments.

2. (VLSI): We consider a CIF data set of 453,994 rectangles, provided by Bell Labs, of a chip design [9]. This data set is interesting because the input is highly skewed, both in location and in size. For example, the largest rectangle covers an area roughly 40,000 times larger than the smallest one. Similarly, there are regions of the chip covered by several thousand rectangles and some covered by no rectangles at all.

3. (CFD): We consider data sets from Computational Fluid Dynamics. One of the primary motivations of this work is to apply the techniques to scientific data. A system of equations is used to model the air flows over and around aero-space vehicles [10]. The data sets are for a cross section of a Boeing 737 wing with flaps out in landing configuration at MACH 0.2, a 2-dimensional problem. The data space consists of a collection of points (nodes) of varying density. Nodes are dense in areas of great change in the solution of the equations and sparse in areas of little change. To help the reader understand the nature of the data we include a plot of a data set with 5088 nodes (see Figure 5). The experimental results use a data set with 52,510 nodes, which is similar but looks like a black smudge when plotted due to the density of points. Note that the black region in the middle of Figure 5 accounts for the majority of the data. In Figure 6 we plot only the area around the centroid of the data set. The blank oval-ish areas are parts of the wing. It is evident that the data set is highly skewed. These CFD data sets and two other (smaller and larger) plus the tiger data sets will soon be available from http://www.cs.du.edu/~leut/MultiDimensionalData.html.

4. (Synthetic): Uniformly distributed squares. Data sets containing between 10,000 and 300,000 squares were created. For each square, the lower left corner was uniformly distributed over the unit square. The area of each square is determined by the density of the data set, where the density [6] of the data set is the sum of the areas of all the squares in the data set. Specifically, let r equal the number of squares in the data set and d equal the density. Then, the average area of a square equals $d/r$. For each square, the actual area is chosen uniformly distributed between 0 and 2 times the average area. The upper right corner is chosen to give the desired area unless it exceeds the bounds of the unit square, in which case the coordinate(s) that exceeds 1.0 is set to 1.0. All squares are fully contained in the unit square. We considered data densities of 0 (point data), 1.0, 2.5, and 5.0. We present results for densities of 0 and 5.0.
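The synthetic data sets and the region queries can be generated along the following lines. This is a sketch under stated assumptions, not the generator used for the experiments: the function names `synthetic_squares` and `region_query` and the explicit `rng` argument are hypothetical, and Python's `random.Random` stands in for whatever random number source was actually used.

```python
import random


def synthetic_squares(r, d, rng):
    """r squares in the unit square with data density d (sum of areas ~ d)."""
    avg_area = d / r
    squares = []
    for _ in range(r):
        x, y = rng.random(), rng.random()          # lower left corner
        area = rng.uniform(0.0, 2.0 * avg_area)    # between 0 and 2 * average
        side = area ** 0.5
        # Clip the upper right corner to the bounds of the unit square.
        squares.append((x, y, min(1.0, x + side), min(1.0, y + side)))
    return squares


def region_query(eps, rng):
    """Query rectangle: uniform lower left corner, side eps, clipped at 1.0.
    eps = 0.1 gives a 1% query and eps = 0.3 a 9% query."""
    x, y = rng.random(), rng.random()
    return (x, y, min(1.0, x + eps), min(1.0, y + eps))
```

With `d = 0` every square degenerates to a point, which corresponds to the point data case. Because of the clipping at 1.0, queries and squares near the upper and right borders cover slightly less than their nominal area.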

4

Results

In this section we present the results of our experimental methodology. We present results for point and region queries on 2-D synthetic, GIS (tiger), VLSI, and CFD data sets. All results are obtained from R-trees with 100 rectangles per node, with a range of buffer sizes examined. The curves for the NX algorithm are not included in the figures since the NX algorithm is not competitive, except for point queries on point data, requiring 2 - 8 times as many disk accesses as the STR algorithm. To be complete we do include NX results for all experiments in the tables of this section.

4.1

Synthetic Data

We first consider synthetic data sets. We consider buffer sizes of 10 and 250 pages. In Table 1 we show the number of R-tree pages (including non-leaf) and what percent of the R-tree a buffer of 10 and 250 pages can hold for the different sizes of data. The first column is the number of rectangles, the second is the number of R-tree pages assuming 100 rectangles per page, the third is the percent of the R-tree that a buffer of 10 pages can hold, and the fourth is the percent of the R-tree that a buffer of 250 pages can hold.

Data Size   R-Tree Pages   Buffer = 10   Buffer = 250
10,000      101            9.90%         100%
25,000      254            3.94%         98.43%
50,000      506            1.98%         49.41%
100,000     1011           0.99%         24.73%
300,000     3031           0.33%         8.25%

Table 1: Percent of R-Tree Held By Buffer

Figures 7 and 8 plot the number of disk accesses versus data set size (in thousands of rectangles) for point queries using a buffer of 10 pages and 250 pages respectively. The top two curves in each figure are for a data density of 5 (i.e., the expected sum of the areas of the input rectangles equals 5) and the bottom two curves are for a density of 0 (i.e., point data). The solid lines are for STR and the dashed lines are for HS. The legends in the figures show the ordering of the lines from top to bottom. For a buffer of size 10, HS requires 31 - 42% more disk accesses than STR for point data, and 26 - 32% more disk accesses than STR for region data of density 5. For a buffer of size 250, HS requires 33 - 41% more disk accesses than STR for point data and 26 - 32% more disk accesses for region data of density 5. Note, we do not consider a data size of 10,000 for a buffer of 250 pages as the entire R-tree fits in the buffer, and for the 25,000 rectangle data set almost the entire tree fits in the buffer, hence the comparison is not particularly meaningful.

In Figure 9 we plot the number of disk accesses versus data set size (in thousands of rectangles) for region queries of 1% of the data space using a buffer of 10 pages. For point data, the bottom two curves, HS requires 6 - 22% more disk accesses than STR. For region data of density 5, the top two curves, HS requires 6 - 16% more disk accesses than STR. The plot for a buffer of 250 pages is similar. Note that the difference between STR and HS becomes smaller as the query region size increases (but that STR always requires fewer disk accesses). This result is not surprising since the more of the data space that needs to be retrieved, the more naive the search algorithm can afford to be [3].
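The page counts and percentages in Table 1 follow from simple arithmetic on a fully packed tree with 100 rectangles per page. The sketch below reproduces them; the helper names `tree_pages` and `percent_buffered` are hypothetical and introduced only for this illustration, and full packing of every level except possibly the last node is assumed.

```python
def tree_pages(r, n):
    """Total pages (leaf plus non-leaf) of a packed R-tree holding r
    rectangles with n entries per page."""
    total = 0
    level = -(-r // n)              # ceil(r / n) leaf pages
    while True:
        total += level
        if level == 1:              # the root has been counted
            return total
        level = -(-level // n)      # pages needed at the next higher level


def percent_buffered(pages, buffer_pages):
    """Percent of the tree that fits in a buffer of the given size."""
    return min(100.0, 100.0 * buffer_pages / pages)


for r in (10_000, 25_000, 50_000, 100_000, 300_000):
    p = tree_pages(r, 100)
    print(r, p, f"{percent_buffered(p, 10):.2f}%",
          f"{percent_buffered(p, 250):.2f}%")
```

For example, 25,000 rectangles need 250 leaf pages, 3 pages at the next level, and 1 root page, for 254 pages in total, of which a 250 page buffer can hold 98.43%.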

Point

Data

Region

Data. Size

STR

HS

NX

10 25 5O 100 30O

0.89 1.03 1.27 1.61 1.95

1.26 1.41 1.74 2.18 2.55

0.87 1.04 1.27 1.57 1.83

10 25 5O 100 30O

3.27 6.85 11.48 18.21 41.46

3.99 8.00 12.81 19.93 44.02

Region 10.86 26.61 50.64 98.47 290.05

Queries, 1.22 1.17 1.12 1.09 1.06

Query Region 3.33 3.89 4.41 5.41 7.00

10 25 50 100 300

11.73 26.40 46.20 84.54 229.75

13.02 28.07 48.74 87.51 234.82

Region 26.86 67.26 131i96 261.35 779.96

Queries, 1.11 1.06 1.05 1.04 1.02

Table

Carrying needs

this

2:

argument

Rtree

results

are

if the

packing

Density

= 5.0

STR

HS

NX

1.40 1.67 1.97 2.31 2.60

1.85 2.19 2.57 2.99 3.27

3.52 6.11 8.43 12.45 19.70

1.32 1.31 1.30 1.29 1.26

2.51 3.67 4.28 5.39 7.56

= 1% of Data 4.25 4.97 8.53 9.87 13.12 14.55 20.40 22.14 44.73 47.26

13.78 31.44 57.52 108.37 307.38

1.17 1.16 1.11 1.09 1.06

3.24 3.69 4.38 5.31 6.87

Query Region = 9% of Data 2.29 13.57 14.80 2.55 29.01 30.76 2.86 49.48 51.97 3.09 89.25 92.18 3.39 237.42 242.41

29.91 72.17 48.74 271.58 797.94

1.09 1.06 1.05 1.03 1.02

2.20 2.49 0.98 3.04 3.36

of Disk Accesses,

extreme,

all

NX/STR Point Queries 0.99 1.01 1.00 0.97 0.94

1.42 1.38 1.37 1.35 1.31

Number

to an

to be performed:

HS/STR

Data,

Synthetic

Data,

regions

encloses

query

schemes

exhibit

the

Buffersize

NX/STR

-- 10

all input

same

HS/STR

data.

performance

then

no

search

as all leaves

need

to be examined. More

exhaustive

respectively.

The

first

column

colunms

are

the

the

fifth

and

sixth

columns

are

the

same

as 2-6

7-11

for point the

queries

query We

include The

while

the

smMler produces

is the

of disk

on point

are but

number

accesses the

for

data,

in Tables

ratio region

and,

2 and

of data

items

3 for buffer in thousands,

to satisfy

the

query

of HS and

NX

relative

data.

of density

5.

as expected,

the

information

for the

sizes the

for STIR., HS, to STR

Note

difference

that

of 10 and second

and

NX

between

is not

pages,

through

NX

for point

250

on

data,

fourth

point and

data,

columns

competitive

except

STR

and

HS diminishes

data

sets

in Table

level

and

all

as

size increases. present the

tree.

number

presented

area.

and

perimeter

sun:

of both

area

second

through

fourth

fifth area.

through and

a slightly

and

seventh

perimeter smMler

perimeter

colunms colunm

than

the

total

area.

for

both

are for the are

for

the

HS algorithm for the

the

data.

300K

data

11

data

and

MBRs

50K

for

point

50K

both but

300K at leaf

set for STR, set. data the

The sets, same

HS, and STR

leaf

MBRs

level

We in the

NX respectively,

algorithm

whereas

4.

the

produces NX

area..

a

algorithm Note

that

the NX algorithm has much larger perimeters than the other two algorithms, accounting for its poor performance on region queries.

Table 3: Number of Disk Accesses, Synthetic Data, Buffersize = 250

Point Data

DataSize (K)      STR       HS       NX   HS/STR   NX/STR
Point Queries
25               0.13     0.14     0.13     1.03     1.00
50               0.52     0.69     0.52     1.33     1.01
100              0.74     1.04     0.77     1.41     1.04
300              0.91     1.23     0.92     1.35     1.01
Region Queries, Query Region = 1% of Data
25               0.16     0.17     0.49     1.06     3.05
50               4.74     5.30    26.24     1.12     5.54
100             12.14    13.27    76.60     1.09     6.31
300             36.72    38.92   279.67     1.06     7.62
Region Queries, Query Region = 9% of Data
25               0.20     0.21     1.29     1.08     6.51
50              20.11    21.20    76.11     1.05     3.78
100             61.78    64.18   228.98     1.04     3.71
300            218.61   224.09   769.74     1.03     3.52

Region Data, Density = 5.0

DataSize (K)      STR       HS       NX   HS/STR   NX/STR
Point Queries
25               0.14     0.14     0.20     1.05     1.43
50               0.79     1.00     3.85     1.27     4.88
100              1.16     1.53     8.05     1.32     6.95
300              1.45     1.83    16.98     1.26    11.67
Region Queries, Query Region = 1% of Data
25               0.17     0.19     1.01     1.14     6.04
50               5.57     6.15    29.85     1.10     5.36
100             13.84    14.94    84.53     1.08     6.11
300             39.84    42.04   296.80     1.06     7.45
Region Queries, Query Region = 9% of Data
25               0.25     0.25     2.88     1.00    11.57
50              22.12    23.10    81.78     1.04     3.70
100             65.52    68.12   239.43     1.04     3.65
300            226.77   231.62   787.30     1.02     3.47

Table 4: Synthetic Data, Areas and Perimeters

                   STR 50K    HS 50K    NX 50K   STR 300K   HS 300K   NX 300K
Point Data
leaf area             0.97      1.33      0.97       0.97      1.31      0.97
total area            3.05      3.64      2.97       3.12      3.76      2.97
leaf perimeter       88.21    106.26    982.49     216.24    258.36   5882.38
total perimeter     101.74    120.76    998.48     243.85    289.45   5948.36
Region Data, Density = 5.0
leaf area             1.53      1.96      7.58       1.54      1.96     17.47
total area            3.65      4.31      9.63       3.74      4.46     19.63
leaf perimeter      110.82    127.46   1000.77     272.79    312.57   5937.22
total perimeter     124.51    142.09   1016.88     300.78    344.03   6003.54

4.2 GIS tiger data

We now present results for the Long Beach County TIGER data set. In Figure 10 we plot the number of disk accesses versus buffer size for point queries. The data set requires 532 leaf level nodes and 7 index nodes for a total of 539 pages. Thus, a buffer of size (10 25 50 100 250) holds (1.86% 4.64% 9.28% 18.55% 46.38%) of the Rtree. The HS algorithm requires 20 - 50% more disk accesses than STR. Again, the relative difference increases as the buffer size decreases.

In Table 5 we present the number of disk accesses, and the ratio relative to STR, for point and region queries. For region queries of sizes up to 9% of the space the two algorithms are similar, with HS requiring 2 - 6% more disk accesses. We present area and perimeter information in Table 6. The STR algorithm produces significantly smaller areas than both HS and NX, and slightly smaller perimeters than HS.

Table 5: Number of Disk Accesses, Long Beach Data, Point and Region Queries and Different Buffer Sizes

BufferSize        STR       HS       NX   HS/STR   NX/STR
Point Queries
10               0.72     1.07     3.54     1.49     4.90
25               0.52     0.73     3.08     1.41     5.94
50               0.48     0.63     2.76     1.33     5.78
100              0.42     0.54     2.31     1.27     5.49
250              0.31     0.38     1.39     1.20     4.45
Region Queries, Query Region = 1% of Data
10              10.51    11.11    35.89     1.06     3.41
25               9.90    10.40    35.59     1.05     3.59
50               8.98     9.38    34.44     1.04     3.83
100              7.83     8.10    31.61     1.04     4.04
250              5.12     5.34    19.25     1.04     3.76
Region Queries, Query Region = 9% of Data
10              51.17    52.13   107.51     1.02     2.10
25              50.82    51.72   107.32     1.02     2.11
50              49.52    50.54   106.23     1.02     2.15
100             45.67    46.60   101.47     1.02     2.22
250             30.50    31.11    77.17     1.02     2.53

Table 6: Tiger Long Beach Data, Areas and Perimeters

                      STR       HS       NX
leaf area            0.53     0.76     2.85
total area           2.00     2.51     4.27
leaf perimeter      74.11    76.67   544.30
total perimeter     86.04    89.77   557.07
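The buffer percentages quoted for the Long Beach tree follow from dividing each buffer size by the total page count (532 leaf level nodes plus 7 index nodes). The short Python sketch below reproduces that arithmetic; the helper name `buffer_fraction` is ours for illustration and does not come from the paper.

```python
def buffer_fraction(buffer_pages, leaf_nodes, index_nodes):
    """Return the percentage of the R-tree's pages that fit in the buffer."""
    total_pages = leaf_nodes + index_nodes
    return 100.0 * buffer_pages / total_pages


# Long Beach TIGER tree: 532 leaf nodes + 7 index nodes = 539 pages.
for size in (10, 25, 50, 100, 250):
    share = buffer_fraction(size, 532, 7)
    print(f"buffer {size:>3}: {share:.2f}% of the tree")
```

The same helper applies to any of the other trees once their leaf and index node counts are known.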

4.3 VLSI

The VLSI data set consists of approximately 453,994 rectangles which vary considerably in both size and location. In Table 7 we present the experimental results for both point and region queries as the buffer size is varied. As can be seen in Figure 11, the HS algorithm performs slightly better than STR for point queries (by a factor of 3% - 11%). HS and STR perform almost the same for region queries regardless of buffer size (HS performs slightly better). The NX algorithm is significantly worse for both point and region queries. In Table 8 we present area and perimeter information; the values for HS and STR are practically the same, which is consistent with the results.

Table 7: Number of Disk Accesses, VLSI Data, Buffer Size Varied for Point and Region Queries

Buffer Size       STR       HS        NX   HS/STR   NX/STR
Point Queries
10              14.13    13.67    197.57     0.97    14.45
25              12.80    11.84    197.15     0.93    16.65
50              11.54    10.36    196.27     0.90    18.94
100              9.57     8.48    193.50     0.89    22.81
250              6.46     5.78    177.10     0.90    30.63
500              4.26     4.01    134.57     0.94    33.55
Region Queries, Query Region = 1% of Data
10              93.98    92.98    605.61     0.99     6.51
25              93.68    92.71    605.48     0.99     6.53
50              93.11    92.07    604.96     0.99     6.57
100             91.53    90.34    602.17     0.99     6.67
250             85.53    84.05    593.18     0.98     7.06
500             76.50    75.51    570.91     0.99     7.56
Region Queries, Query Region = 9% of Data
10             398.78   396.26   1243.05     0.99     3.14
25             398.44   396.01   1242.93     0.99     3.14
50             398.07   395.63   1242.54     0.99     3.14
100            396.97   394.76   1241.10     0.99     3.14
250            389.71   389.00   1235.85     1.00     3.18
500            369.43   366.31   1216.26     0.99     3.32

Table 8: VLSI Data, Areas and Perimeters

                      STR       HS        NX
leaf area            9.81     8.40    181.06
total area          14.79    14.33    194.54
leaf perimeter     707.92   686.92   7733.60
total perimeter    769.40   753.46   7852.27

4.4 Computational Fluid Dynamics

For the experiments in this section we restricted the point and region queries to the box bounded by (0.48,0.48) and (0.6,0.6). When we allowed the queries to range over the entire area there was a large variance in the number of nodes accessed, as the remaining area is extremely sparse. Note that the region considered is also highly skewed. Point queries are uniformly distributed in the reduced space. Region query size is also reduced to fit within the reduced space: the upper right corner of the region queries was obtained by adding 0.01 or 0.03 to the lower left corner coordinates and truncating at 0.6 if needed. This corresponds roughly to the region queries covering 1% and 9% of the data area used in the other experiments.

In Table 9 we present detailed results of all our experiments for the 52,510 node data set. The data set requires 526 leaf level nodes and 7 index nodes for a total of 533 pages. Thus, a buffer of size (10 15 20 25 50 100 250) holds (1.88% 2.81% 3.75% 4.69% 9.38% 18.76% 46.90%) of the Rtree. In Figure 12 we plot the number of disk accesses required for point queries. As can be seen in Figure 12, for point queries STR requires significantly fewer disk accesses than HS, especially for small buffer sizes. For region queries STR and HS perform similarly. The NX algorithm requires significantly more disk accesses than the other two. The area and perimeter information is given in Table 10.
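The CFD query workload described in this section (queries confined to the box from (0.48,0.48) to (0.6,0.6), region queries built by adding 0.01 or 0.03 to the lower left corner and truncating at 0.6) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' generator: the function names are ours, and we assume the lower left corner of a region query is drawn uniformly from the box, which the text states only for point queries.

```python
import random

BOX_LO, BOX_HI = 0.48, 0.6  # reduced query space given in the text


def point_query(rng):
    """Draw a point uniformly from the reduced box."""
    return (rng.uniform(BOX_LO, BOX_HI), rng.uniform(BOX_LO, BOX_HI))


def region_query(rng, side):
    """Build a square query from a lower left corner plus `side`.

    Assumption: the lower left corner is uniform in the box.
    The upper right corner is truncated at BOX_HI, as in the text.
    """
    x0, y0 = point_query(rng)
    x1 = min(x0 + side, BOX_HI)
    y1 = min(y0 + side, BOX_HI)
    return (x0, y0, x1, y1)


rng = random.Random(0)  # fixed seed so runs are repeatable
queries = [region_query(rng, side)
           for side in (0.01, 0.03)
           for _ in range(1000)]
```

An untruncated query has area 0.01 x 0.01 = 0.0001 or 0.03 x 0.03 = 0.0009, which are the two query region areas listed in Table 9.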

Table 9: Number of Disk Accesses, CFD 52,510 Node Data, Buffer Size Varied for Point and Region Queries

Buffer Size       STR       HS       NX   HS/STR   NX/STR
Point Queries
250              0.19     0.21     0.26     1.11     1.38
100              0.25     0.28     0.38     1.15     1.56
50               0.41     0.47     0.53     1.15     1.30
25               0.69     0.81     0.72     1.18     1.05
20               0.79     0.95     0.79     1.20     1.00
15               0.89     1.23     0.88     1.38     0.99
10               1.05     1.76     1.06     1.68     1.01
Region Queries, Query Region Area = 0.0001
250              2.37     2.42    14.19     1.02     6.00
100              4.88     4.79    23.73     0.98     4.86
50               6.08     5.83    26.81     0.96     4.41
25               6.88     6.66    28.32     0.97     4.12
20               7.15     6.98    28.84     0.98     4.03
15               7.43     7.50    29.27     1.01     3.94
10               7.87     8.41    29.64     1.07     3.77
Region Queries, Query Region Area = 0.0009
250             11.83    11.72    44.48     0.99     3.76
100             19.05    18.95    66.23     0.99     3.48
50              20.90    20.82    72.68     1.00     3.48
25              22.80    22.31    74.97     0.98     3.29
20              23.45    22.75    75.53     0.97     3.22
15              24.46    23.42    75.91     0.96     3.10
10              25.55    25.02    76.32     0.98     2.99

Table 10: CFD 52,510 Node Data Set, Areas and Perimeters

                      STR       HS       NX
leaf area            0.93     1.73     0.88
total area           2.93     4.68     2.87
leaf perimeter      62.15    30.78   206.69
total perimeter     75.54    45.23   223.61

5 Conclusions

All three algorithms studied are based on heuristics and provide no performance guarantees. Thus, it is not surprising that none of them is best for all data sets. By studying the performance of the algorithms on the different types of data, we can gain insight into when a specific algorithm performs well and when it does not. We considered four general classes of data: 1) Uniformly distributed point and region data (synthetic); 2) Mildly skewed line segment data (tiger); 3) Highly skewed, in terms of location and size, region data (VLSI); 4) Highly skewed, in terms of location, point data (CFD).

Consider first the uniformly distributed data. For this type of data the HS algorithm requires up to 42% more disk accesses than the STR algorithm for both point and region queries. The NX algorithm performs as well as STR for point queries on point data but much worse for point queries on region data or region queries. As previously pointed out [6], by ignoring all but one dimension the NX algorithm packs nodes in long skinny rectangles (see Figure 2), resulting in a large perimeter and hence poor performance for region queries. For all other types of data the NX algorithm does not compete and we drop it from subsequent discussion.

Consider now the mildly skewed tiger data set. The HS algorithm requires up to 49% more disk accesses than STR for both point and region queries. As expected, the difference is more noticeable for smaller buffer sizes.

For highly skewed data, firm conclusions are more difficult to draw. For the VLSI (region) data, HS performed 3% - 11% faster than STR for point queries, and roughly the same for region queries. For the CFD (point) data the situation is reversed: HS required 11% - 68% more disk accesses than STR for point queries, and roughly the same for region queries. As expected, the importance of choosing a packing algorithm is diminished as either the query size or the buffer size increases.

In summary, no single algorithm is best under all cases. It is clear that the NX algorithm is not competitive, but which of STR or HS to use depends on the situation at hand. It appears that STR outperforms HS for mildly skewed or uniform data. For highly skewed data, the choice is a toss up between STR and HS, particularly for region queries. Developing a new algorithm that works well for all types of data is a challenge that should be pursued. In the future we plan to continue our search for a better packing algorithm, investigate dynamic R-tree variants based on the STR packing algorithm, and also extend our results to a shared-nothing parallel platform.
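The remark that NX packs nodes into long skinny rectangles with large perimeters can be checked on synthetic points. The Python sketch below assumes the usual description of the two packings: NX sorts on x only and cuts the order into leaves, while STR sorts on x, cuts the order into about sqrt(P) vertical slices (P being the number of leaves), then sorts each slice on y before cutting it into leaves. Function names, the seed, the data size and the leaf capacity are illustrative choices of ours, not values taken from the experiments.

```python
import math
import random


def mbr_perimeter(points):
    """Perimeter of the minimum bounding rectangle of a group of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return 2.0 * ((max(xs) - min(xs)) + (max(ys) - min(ys)))


def pack_nx(points, capacity):
    """NX-style leaves: sort on x only, then cut into runs of `capacity`."""
    by_x = sorted(points, key=lambda p: p[0])
    return [by_x[i:i + capacity] for i in range(0, len(by_x), capacity)]


def pack_str(points, capacity):
    """STR-style leaves: vertical slices on x, each slice sorted on y."""
    leaf_count = math.ceil(len(points) / capacity)
    slice_count = math.ceil(math.sqrt(leaf_count))
    per_slice = slice_count * capacity
    by_x = sorted(points, key=lambda p: p[0])
    leaves = []
    for i in range(0, len(by_x), per_slice):
        by_y = sorted(by_x[i:i + per_slice], key=lambda p: p[1])
        for j in range(0, len(by_y), capacity):
            leaves.append(by_y[j:j + capacity])
    return leaves


rng = random.Random(1)  # fixed seed, uniform points in the unit square
points = [(rng.random(), rng.random()) for _ in range(10_000)]

nx_total = sum(mbr_perimeter(leaf) for leaf in pack_nx(points, 100))
str_total = sum(mbr_perimeter(leaf) for leaf in pack_str(points, 100))
print(f"NX leaf perimeter sum:  {nx_total:.1f}")
print(f"STR leaf perimeter sum: {str_total:.1f}")
```

Both packings produce the same number of full leaves here; only the shape of the leaf rectangles differs, which is what drives the perimeter gap.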

Acknowledgements

We would like to thank Ken Sevcik for useful discussions about a preliminary version of this work. We would also like to thank Dimitri Mavriplis for providing the CFD data sets.

References

[1] Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B., "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles," Proc. ACM SIGMOD, p. 323-331, May 1990.

[2] Bhide, A., Dan, A., Dias, D., "A Simple Analysis of LRU Buffer Replacement Policy and Its Relationship to Buffer Warm-up Transient," Proc. IEEE Data Engineering, 1993.

[3] Chazelle, B., "Filtering Search: A New Approach to Query Answering," SIAM J. Comput., vol. 15, p. 703-724, 1986.

[4] Faloutsos, C., Roseman, S., "Fractals for Secondary Key Retrieval," Proc. Eighth Symposium on Principles of Database Systems (PODS-89), p. 247-252, March 1989.

[5] Guttman, A., "R-trees: A Dynamic Index Structure for Spatial Searching," Proc. ACM SIGMOD, p. 47-57, 1984.

[6] Kamel, I., Faloutsos, C., "On Packing R-trees," Proc. 2nd International Conference on Information and Knowledge Management (CIKM-93), p. 490-499, Arlington, VA, November 1993.

[7] Kamel, I., Faloutsos, C., "Hilbert R-tree: An Improved R-tree Using Fractals," Proc. International Conference on Very Large Databases, 1995 (VLDB-95).

[8] Leutenegger, S.T., Lopez, M.A., "The Effect of Buffering on the Performance of R-Trees," University of Denver Technical Report number 96-2, submitted for publication.

[9] Lopez, M.A., Janardan, R., Sahni, S., "Efficient Net Extraction for Restricted Orientation Designs," to appear in IEEE Transactions on CAD.

[10] Mavriplis, D.J., "An Advancing Front Delaunay Triangulation Algorithm Designed for Robustness," Journal of Computational Physics, vol. 117, p. 90-101, 1995.

[11] Rosenberg, A.L., Snyder, L., "Time and Space Optimality in B-Trees," ACM Transactions on Database Systems, vol. 6, no. 1, March 1981.

[12] Roussopoulos, N., Leifker, D., "Direct Spatial Search on Pictorial Databases Using Packed R-trees," Proc. ACM SIGMOD, May 1985.

[13] Sellis, T., Roussopoulos, N., Faloutsos, C., "The R+ Tree: A Dynamic Index for Multidimensional Objects," Proc. 13th International Conference on Very Large Databases (VLDB-87), p. 507-518, September 1987.

Figure 2: Leaf Bounding Rectangles for Long Beach Data using NX

Figure 3: Leaf Bounding Rectangles for Long Beach Data using HS

Figure 4: Leaf Bounding Rectangles for Long Beach Data using STR

Figure 5: Full Data for 5088 Node Data Set

Figure 6: Data Around Center for 5088 Node Data Set

[Plot legend: HS density = 5.0; STR density = 5.0; HS density = 0; STR density = 0]
