Scott T. Leutenegger. Jeffrey M. Edgington. Mario A. Lopez. NASA. Contract. No. NAS1-19480. February. 1997. Institute for Computer. Applications in Science.
NASA Contractor Report 201661
ICASE Report No. 97-14

STR: A Simple and Efficient Algorithm for R-Tree Packing*

Scott T. Leutenegger, Jeffrey M. Edgington, Mario A. Lopez
Mathematics and Computer Science Department
University of Denver
Denver, CO 80208-0189
{leut, jedgingt, mlopez}@cs.du.edu

NASA Contract No. NAS1-19480
February 1997

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681-0001
Operated by Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23681-0001
Abstract

In this paper we present the results from an extensive comparison study of three R-tree packing algorithms, including a new easy to implement algorithm. The algorithms are evaluated using both synthetic and actual data from various application domains including VLSI design, GIS (tiger), and computational fluid dynamics. Our studies also consider the impact that various degrees of buffering have on query performance. Experimental results indicate that none of the algorithms is best for all types of data. In general, our new algorithm requires up to 50% fewer disk accesses than the best previously proposed algorithm for point and region queries on uniformly distributed or mildly skewed point and region data, and approximately the same for highly skewed point and region data.

*The work of Leutenegger and Lopez was supported in part by Colorado Advanced Software Institute grant TT-97-05. Leutenegger additionally was supported in part by the National Aeronautics and Space Administration under NASA contract NAS1-19480 while visiting the Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, VA 23681-0001.
1 Introduction
R-trees
[5] are a common
multi-dimensional points,
databases.
polygons,
a given
query
region.
computer
scientific
databases.
deletion
of objects.
Unfortunately, disadvantages: R-tree
not competitive preprocess
in R-trees
Other
need
[12] and later
it with the
of the bounding
preferred
and
and buffer
in better at internal
a new packing
utilizing
and synthetic
We know of no other
competitive
(due
has
most
important, in order but
to
still are
are allowed
to
where
the
results
to the fact that
algorithms
[12].
queries.
fewer
were first proposed
packing
an LRU buffer on VLSI design,
The
for point
latter
and
is simpler
to
due to the smaller
algorithm
outperforms is usually
the
queries.
algorithms
that
is simple
for a wide range
experimental
GIS, computational
has considered
ordering,
of [6] significantly
packing
we provide
Curve
However,
(Sort-Tile-Recursive)
metrics,
work that
by Guttman
that
on the Hilbert
the algorithm
Hilbert-based
algorithm
to area. and perimeter
based
for point
and Nearest-X
on real implementations data sets.
the
and
for applications
times
by Roussopoulos
nodes,
while remaining
it with the Hilbert
In addition
trees
for insertion
[6].
algorithm
proposed
query
and
can be modified
of nodes
algorithms
intersect
a priori and, when done properly,
Such packing
and Faloutsos
and,
such as
information
of the R-tree,
reasonable
and improved a query).
Consequently,
queries
we propose
compare
sizes.
results
queries.
is particularly
a packing
Mgorithln
rectangles
choice for region
In this paper
by Kan:el
Nearest-X
of [12] for region
pertbrming
routines
and
temporal
contents
number
quality
to loading
or available
their
utilization,
the
time when con:pared
often)
geographic
at a time as specified
large
objects
databases,
efficient
space
[1, 13] improve
utilization
[6] propose
and in some cases
perimeter
plement
while
one object
of an unduly
Preprocessing
100% space
and Faloutsos
implement
to query
[5] provides
objects,
which
design,
in the sense that
used in spatial
geometric
for traditionM
(b) sub-optimal
Mgorithms
are widely
of arbitrary
indexing
by inserting
the retrieval
and
computer-aided
structures,
(i.e., does not change
to be accessed
Kamel con:pare
requiring dynamic
with nearly
boxes
tree, and Guttman
to be stored.
static
include
time,
data
can be used to determine
multi-keyed
an R-tree
with regard
by Roussopoulos
that
structure
R-trees
applications
(a) high load
the data
data. is fairly
objects,
the entire
for spatial
the bounding
are dynamic
building
a query.
nodes
Typical
R-trees
reconstructing
satisfy
By storing
vision and robotics,
without
(c) poor
technique
or more complex
systems,
several
indexing
evidence
to imof data. based
fluid dynamics,
such a wide range
of data. set
Figure 1: A sample R-tree. Input rectangles are shown solid.
types and the effect they have on packing performance.

In real databases some portion of the tree is buffered in main memory. This buffering of portions of the tree can significantly affect performance, as shown in [8]. Consequently, our experimental studies utilize a buffer as described in Section 3.

The rest of the paper is organized as follows. In Section 2 we provide background information on R-trees and describe the three packing algorithms considered in this paper. In Section 3 we present our experimental methodology. Section 4 contains results from our experiments and Section 5 concludes.
2 Overview of R-trees and Packing Algorithms

In this section we provide a brief overview of the R-tree and describe several packing algorithms, including Nearest-X [12], Hilbert [6], as well as Sort-Tile-Recursive (STR), a new packing algorithm which we are proposing. Detailed knowledge of the R-tree is useful but not required for understanding the remainder of this paper. Readers interested in more detailed descriptions of Nearest-X and Hilbert packing should refer to [12, 6].

2.1 R-trees
An R-tree is a hierarchical data structure derived from the B-tree and designed for efficient execution of intersection queries. R-trees store a collection of rectangles which can change over time through insertions and deletions. Arbitrary geometric objects are handled by representing each object by its minimum bounding rectangle, i.e., the smallest upright rectangle which encloses the object. R-trees generalize easily to dimensions higher than two, but for notational simplicity we review only the two dimensional case.

Each node of the R-tree stores a maximum of n entries. Each entry consists of a rectangle R and a pointer P. For nodes at the leaf level, R is the bounding box of an actual object pointed to by P. At internal nodes, R is the minimum bounding rectangle (MBR) of all rectangles stored in the subtree pointed to by P. Note that every path down through the tree corresponds to a sequence of nested rectangles, the last of which contains an actual data object. Note also that rectangles at any level may overlap and that an R-tree created from a particular set of objects is by no means unique.

Figure 1 illustrates a 3-level R-tree where a maximum of 4 rectangles fit per node. We assume that the levels are numbered 0 (root), 1, and 2 (leaf level). There are 64 rectangles represented by the small dark boxes. The 64 rectangles are grouped into 16 leaf level nodes, numbered 1 to 16. The MBR enclosing each leaf node is the smallest box that fully contains the rectangles within the node. The MBRs are the rectangles stored at the next higher level of the tree. For example, leaf nodes 1 through 4 are placed in node 17, which is at level 1. The MBR of node 17 (and nodes 18, 19, 20) is purposely drawn slightly larger than needed for clarity. The root node contains the four level 1 nodes: 17, 18, 19, and 20.

To perform a query Q, all rectangles that intersect the query region must be retrieved and examined (regardless of whether they are stored in an internal node or a leaf node). This retrieval is accomplished by using a simple recursive procedure that starts at the root node and which may follow several paths down through the tree. A node is processed by first retrieving all rectangles stored at that node which intersect Q. If the node is an internal node, the subtrees corresponding to the retrieved rectangles are searched recursively. Otherwise, the node is a leaf node and the retrieved rectangles (or the data objects themselves) are simply reported.

For illustration, consider the query Q in the example of Figure 1. After examining the root node, we determine that nodes 19 and 20 of level 1 must be searched. The search then proceeds to each of these nodes. It is then determined that the query region does not intersect any rectangles stored in node 19 or node 20, and each of these two subqueries is terminated.

The R-tree shown in Figure 1 is fairly well structured. Inserting these same rectangles into an R-tree using the insertion algorithms of Guttman [5] would likely result in a less well structured tree. Algorithms to create well structured trees have been developed and are described in Section 2.2. These algorithms cluster rectangles in an attempt to minimize the number of nodes visited while processing a query.

For the rest of the paper we will assume that exactly one node fits per disk page, and hereafter we use the two terms interchangeably.

2.2 Packing Algorithms
In this section we describe three packing algorithms. All of the algorithms use a similar framework. In the following text we assume that the data file consists of r rectangles and that each R-Tree node can hold n rectangles. The general process is similar to building a B-tree from a collection of keys by creating the leaf level first and then creating each successively higher level until the root node is created [11].

General Algorithm:

1. Preprocess the data file so that the r rectangles are ordered in $\lceil r/n \rceil$ consecutive groups of n rectangles, where each group of n is intended to be placed in the same leaf level node. Note that the last group may contain fewer than n rectangles.

2. Load the $\lceil r/n \rceil$ groups of rectangles into pages and output the (MBR, page-number) for each leaf level page into a temporary file. The page-numbers are used as the child pointers in the nodes of the next higher level.

3. Recursively pack these MBRs into nodes at the next level, proceeding upwards, until the root node is created.

The three algorithms differ only in how the rectangles are ordered at each level.
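The general framework can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the names `Node`, `pack`, `nearest_x_order`, `str_order`, and `search` are hypothetical, nodes are kept as Python objects rather than disk pages, and only two of the three orderings are shown (sorting by center x-coordinate, and the two-dimensional Sort-Tile-Recursive tiling). The Hilbert ordering is omitted. A small recursive intersection search in the style of Section 2.1 is included so the packed tree can be exercised.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class Node:
    mbr: Rect
    children: list  # child Nodes, or input rectangles at the leaf level
    is_leaf: bool


def mbr_of(rects):
    """Smallest upright rectangle enclosing all given rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def center(r):
    return ((r[0] + r[2]) / 2.0, (r[1] + r[3]) / 2.0)


def nearest_x_order(items, rect_of, n):
    """Nearest-X style ordering: sort by x-coordinate of the center."""
    return sorted(items, key=lambda it: center(rect_of(it))[0])


def str_order(items, rect_of, n):
    """2-D STR ordering: S vertical slices of S*n items, each sorted by y."""
    pages = math.ceil(len(items) / n)
    s = math.ceil(math.sqrt(pages))
    by_x = sorted(items, key=lambda it: center(rect_of(it))[0])
    ordered = []
    for i in range(0, len(by_x), s * n):
        vertical_slice = by_x[i:i + s * n]
        ordered.extend(sorted(vertical_slice,
                              key=lambda it: center(rect_of(it))[1]))
    return ordered


def pack(rects, n, order=str_order):
    """Bottom-up packing: order the level, cut into runs of n, repeat."""
    if not rects:
        raise ValueError("need at least one rectangle")
    level, rect_of, is_leaf = list(rects), (lambda r: r), True
    while True:
        ordered = order(level, rect_of, n)
        nodes = []
        for i in range(0, len(ordered), n):
            group = ordered[i:i + n]
            nodes.append(Node(mbr_of([rect_of(g) for g in group]),
                              group, is_leaf))
        if len(nodes) == 1:
            return nodes[0]  # the root
        level, rect_of, is_leaf = nodes, (lambda nd: nd.mbr), False


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node, q, out=None):
    """Report every stored rectangle that intersects query rectangle q."""
    out = [] if out is None else out
    for child in node.children:
        r = child if node.is_leaf else child.mbr
        if intersects(r, q):
            if node.is_leaf:
                out.append(child)
            else:
                search(child, q, out)
    return out
```

Because only the `order` argument changes between `pack(rects, n, nearest_x_order)` and `pack(rects, n, str_order)`, the sketch mirrors the observation that the algorithms share one framework and differ only in the ordering step.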
Nearest-X (NX): This algorithm was proposed in [12]. The rectangles are sorted by x-coordinate. given in the paper
so we assume
rectangles
packed
Hilbert
are then
Sort
Hilbert
based
distance
in which
space
extended
fi'om the origin,
Faloutsos
point
may or may mantissa
Sun Sparc
to the
representation,
order
_pon ent)
:"
is used.
The
this ordering.
point.
[6] to accomplish
at a time,
until
that
in this
Successive two center
than
the
other
discrimination grid.
Once
the
x- and
Curve). is possible.
and
the
based
on
the order
which
float
of the
be represented
using
numbers
numbers
Conceptually, In practice,
(which
position
native
is a conceptual are
in the
representation viewed
using
this
can be applied. data
set.
Consider
of a point
successively
smaller
(x2, Y2) need that
can use the sense and gathered
can be
a mantissa
starting
could
y-coordinates
it can be determined (one
the
which
a grid
for this grid is used to produce
(x_, y_) and
The information
the Hilbert
on the hypothetical
points until
(This
of a 2-dimensional
bits determine
coordinates,
exponent,
for 32-bit
required.
Curve
integer
nmnbers
used for integers
The first bit of the
using
below.
point
form.)
the method
are sorted
This determines
determines
For example,
The Hilbert
this task).
(along
exponent
all floating
bits.
stored
are examined
subquadrant
to the origin
point,
the processing
When
coordinate
different
as a sign, signed
the
the rectangles
of the R-Tree.
stored Since
orders
Curve.
are usually
+sizeof(Mantissa).
it.
the Hilbert
as described
not
describe
contains
the
of each
values
are
of the rectangles.
quadrant tain
center
of the rectangles
on how to handle
2s + 23 bits would
it is dear
We now briefly
along
points
values
binary
architecture,
center
algorithm
point
q- sizeof(Mantissa)
Coordinates
"
floating
The
details
not be normalized).
relative
slzeof(E
of size n, using
in [6]. The
into the nodes
only provide
numbers
2 size°f(Exp°nent)
22
of the rectangle's
in groups
measured
are placed
to arbitrary
Floating
was proposed
filling curve.
the rectangles
and
only.
x-coordinate
into the nodes,
algorithm
(fractal)
Kamel
the
are
(HS):
A fractM
their
that
No details
the packing
determine
which
subquadrants
con-
to be compared, one of the
rotation
is used to decide
points
tables which
the process
computes
one does
not store
of size
the bits lies in a
described point
is closer
bit positions, or compute
in
one all bit
Sort-Tile-Recursive (STR): Consider a k-dimensional data set of r hyper-rectangles. A hyper-rectangle is defined by k intervals of the form $[A_i, B_i]$ and is the locus of points whose i-th coordinate falls inside the i-th interval, for all $1 \le i \le k$.

STR is best described recursively, with k = 2 providing the base case. (The case k = 1 is already handled well by regular B-trees.) Accordingly, we first consider a set of rectangles in the plane. The basic idea is to "tile" the data space using $\sqrt{r/n}$ vertical slices so that each slice contains enough rectangles to pack roughly $\sqrt{r/n}$ nodes. Once again we assume coordinates are for the center points of the rectangles. Determine the number of leaf level pages $P = \lceil r/n \rceil$ and let $S = \lceil \sqrt{P} \rceil$. Sort the rectangles by x-coordinate and partition them into S vertical slices. A slice consists of a run of $S \cdot n$ consecutive rectangles from the sorted list. Note that the last slice may contain fewer than $S \cdot n$ rectangles. Now sort the rectangles of each slice by y-coordinate and pack them into nodes by grouping them into runs of length n (the first n rectangles into the first node, the next n into the second node, and so on).

The case k > 2 is a simple generalization of the approach described above. First, sort the hyper-rectangles according to the first coordinate of their center. Then divide the input set into $S = \lceil P^{1/k} \rceil$ slabs, where a slab consists of a run of $n \cdot \lceil P^{(k-1)/k} \rceil$ consecutive hyper-rectangles from the sorted list. Each slab is now processed recursively using the remaining k - 1 coordinates (i.e., treated as a (k - 1)-dimensional data set).

To aid in visualizing the result of these packing algorithms, consider the leaf level nodes obtained by using the Long Beach Tiger data set, assuming 100 rectangles fit per node. Figures 2, 3, and 4 show the resultant leaf level MBRs for the same data set for each of the three algorithms. Note the slices in Figure 4 for the STR packing algorithm.

3 Experimental Methodology

In this section we describe our experimental methodology. Our goal is to provide a solid experimental comparison of the algorithms through actual R-tree implementations over a wide range of data, using both "real world" and synthetic data sets. We intend to provide insight into how well R-trees would perform as part of a typical database system which supports spatial queries. Thus, in order to
provide realistic and meaningful performance measurements, the effect of buffering must be taken into consideration. Our primary comparison metric is the number of disk accesses required to satisfy a query of a given size. Note that this metric also allows us to get an accurate indication of performance even using non-dedicated workstations. If we had focused on retrieval time, interference from other users would have clouded our results. We assume an LRU may
arguably
be to pin the
LRU scheme pinning, which
except
in unusual
case it should
disk partition.
Many than
sets
performance of the
is the
to be used
of the
time.
four Sun Sparc For each
query
the
is made.
percent
should
and
is often
then
use an
no gain from
the root just fits into the buffer (regardless
of their
data
manager.
by near
term
future
the depth
this
pool, in
level) to simplify
of the R-tree, R-trees
using
written
a raw
to disk and
we can easily
we consider
Data
have
several
small
buffer
data
data
set
vary
a fan out
are still smaller
size affects
and 2) it decreases
Since
using larger
smaller
yet they
Thus, one of the experimental
can be buffered.
be obtained
with these
manager
Thus,
studies,
applications.
Since most
as the second.
as would
memory
used in previous
50,000 rectangles),
Even
virtual
those
set that
our buffer
the node is immediately
than
1) it increases
of our experiments,
considered.
curves
levels
routine
the percentage
of 25 to 100 the first parameters
of interest
of our experiments sizes.
use small
This aJlows us to obtain
sets and buffers,
sets our experiments
R-tree
at a great
took two months
savings utilizing
5 workstations.
being
data
management
the OS or hardware.
will fit in the buffer.
of results
in experimental
buffer
in [8] there
size, we implement
system's
reconfiguring
is not as significant
type
As shown
out of the buffer
sets are larger
sets (approximately
the same
better
first few R-tree
a level near
of buffer
operating
size without
likely
percentage
data
by the
set that
consideration
where
of the
We use LRU for all the nodes
the impact
in two ways:
data
A slightly
number
of the R-tree.
a node is pushed
of our data
data.
some
circumstances
assess
When
buffer
routine.
of our experiments.
"false-buffered"
the actual
and
nodes
be pinned.
space
To accurately
management
root
for the remaining
the parameter
not
buffer
we build
In each experiment set with 2,000 Thus,
queries.
to err on the
not be considered
the R-tree
the exact
to the specific
packing
same data. set is used for all algorithms.
No attempt side of caution
significant.
according
at collecting we advise
confidence that
intervals
differences
algorithm We then
or smoothing
of less than
a few
To provide a uniform experiment space we normalize all data sets to the unit square. Point queries are uniformly distributed in the unit square. We consider region queries whose region equals 1% and 9% of the unit square. The lower left hand corner is uniformly distributed in the unit square. The upper right hand corner is computed by adding e 0.3 for region
queries
the coordinate of the
much
Our
but
secondary
nodes.
The
are good
data
sets are.
types
(GIS): data
(VLSI): the
of a chip
highly skewed, larger
thousand
.
(CFD):
than
than
1.0 we set
is larger
of 9% will return data
sets described
9% of the unit
and
number
[8].
perimeter of nodes
We include
for both
We argue
in a general
that
square
of the accessed
roughly
9%
below)_
the
may
return
MBRs
of the
by a query
[6]
these
measures
as additional
tree
(summed
over all nodes
the whole
the leaf level metric
data.
In particular
of geographic
is of most
interest
the smallest and
in past
we consider
the following
Similarly,
rectangles
Fluid
This
there
the largest
are regions
Beach
53,145
line
Dynamics.
by Bell Labs and rectangle rectangle
of the
used in
distribution is roughly
chip covered
is
40,000
by several
at all.
of this work is to apply
of a Boeing
sets:
set contains
the input
the techniques
We consider
is used to model the air flows over and around
data. sets are for a cross section
data
different
we chose the Long
data
provided
because
by no rectangles
motivations
systems
many
our
studies.
is interesting
some covered
from Computational
by considering
and in size. For example, one.
on how representative
this issue
of Census.
set of 453,994
This data
is dependent
information
of the U. S. Bureau
a CIF data [9].
setting
We address
used extensively
One of the primary
of equations
area
of the
metrics
question.
both in location
rectangles
sets obtained system
system
We consider
e = 0.1 or
will likely be buffered.
set representative
has been
CFD
covers
sum of the
indicators
and perimeter
as well as synthetic
and
design
times
is the
of our conclusion
of the TIGER
segments .
area
As a data
query
and
that
is not considered
This is a non-trivial
of real data
.
if buffering
level nodes
applicability
VLSI
a query
and also only for the leaf level.
since the non-leaf
a region
where
9% of the data.
measures
and present
at all levels)
and
metric
These
data
(as in the
comparison
can be misleading
information
data
is large
less than
If the x- or y-coordinate
distributed
skewed
of output
or much
9% respectively.
For uniformly
but for highly
on amount
more
R-tree
to 1.0.
data,
variance
of 1% and
to the x- and y-coordinates
to scientific
a 2-dimensional aero-space
737 wing with flaps out in landing
problem.
vehicles
data A
[10]. The
configuration
at
MACH 0.2. The data space consists of a collection of points (nodes) of varying density. Nodes are dense in areas of great change in the solution of the equations and sparse in areas of little change. To help the reader understand the nature of the data we include a plot of a data set with 5088 nodes (see Figure 5). The experimental results use a data set with 52,510 nodes, which is similar but looks like a black smudge when plotted due to the density of points. Note that the black region in the middle of Figure 5 accounts for the majority of the data. In Figure 6 we plot only the area around the centroid of the data set. The blank oval-ish areas are parts of the wing. It is evident that the data set is highly skewed. These CFD data sets and two others (smaller and larger) plus the tiger data sets will soon be available from http 4. (Synthetic): 300,000 lower
corner
squares
a square
is determined
between
the actual
Then,
uniformly
corner
is chosen
which
case the coordinate(s)
(point
data),
that
the average
Specifically, the average
between
to give the desired
the unit
data
area exceeds
1.0, 2.5, and 5.0. We present
containing unit
area.
between
square.
square.
The
10,000
and
For
each square
the
area
of the
density
let r equal
the number
area. of a square
it exceeds
equals
equals
the average the
bounds
for densities
is
area of
the sum of the of squares
in the
_a For each square,
area.. The upper of the unit
1.0 is set to 1.0. We considered results
square
The value of the average
set, where
0 and two times unless
a.html.
created in the
over
[6] of the
data. set.
the density.
were
contained
0 and 2 times
in the
area is chosen
sets
distributed
by the density
of all the squares set a.nd d equal
data.
are fully
was uniformly
distributed
data
cs. du. edu/-:ieut/Mu].tiD±mensiona:iDat
distributed
All
uniformly
areas
4
Uniformly
squares.
left
://www.
right
square,
data. densities
in of 0
of 0 and 5.0.
Results
In this section and region
we present
queries
from R-trees
on 2-D synthetic,
with 100 rectangles
NX algorithm
are not included
2 - 8 times
as many
point
To be complete
data.
the results
disk accesses
of our experimental GIS (tiger),
VLSI,
per node, with a range in the
figures
since
NX results
and CFD data of buffers
We present sets.
is not
for all experiments in the tables
results
All results
sizes examined.
the NX algorithm
as the STR algorithm
we do include
methodology.
are obtained
The curves
competitive, except
of this section.
for point
point
for the
requiring queries
on
Data Size 10,000 25,000 50,000 100,000 300,000
R-Tree Pages 101 254 506 1011 3031
Synthetic
We first what
Data
consider
percent
synthetic
of the
the nmnber 100 rectangles
is the percent
buffer
of 250 pages
queries
the ordering
using
density
R-tree
that
are for a density
sets.
(including
a buffer
The
1 we show
first column
n0n-leaf
is
) assuming
of 10 pages can hold,
We do not consider
data
and the
a data. size of 10,000 for a
data. set size (in thousands respectively.
The
of 0 (i.e., point The
data).
31 - 42% more disk accesses
and
5. For a buffer
26 - 32% more
set ahnost
in the figures
data,
of size 250, HS requires for region
data
hence
5)
show
lines are
STR for point
tree fits and
in each equals
and the dashed
than
disk access
the entire
rectangles
The legends
solid lines are for STR
of rectangles)
top two curves
sum of the areas of the input
of density
data
versus
250 pages
top to bottom.
for the 25,000 rectangle
and
33 - 41% of density
the comparison
meaningful. 9 we plot
for region
the
queries
buffer
of 250 pages
access
than
STR.
For region
access
than
STR.
Note that
data
size of 10 and
data,
smaller
pages
In Table
fits in the buffer.
STR for point
In Figure
of the R-tree
can hold.
250 pages.
sizes of data
of R-tree
of disk accesses
disk access for region
is not particularly
the more
is the percent
of size 10, HS requires
than
sizes of l0 and
for the different
of 5 (i.e., the expected
of the lines from
disk access
buffer
is the number
of 250 pages
a buffer
two curves
26 - 32% more
becomes
second
as the entire
for HS. For a buffer
rectangles)
the
7 and 8 plot the number
and the bottom
5. Note,
fits in the buffer
a buffer
figure are for a data
more
We consider
per page, the third
fourth
tbr point
data.
R-tree
of rectangles,
Figures
4.1
Table 1: Percent of R-Tree Held By Buffer
Buffer = 10 9.90% 3.94% 1.98% 0.99% 0.33%
Buffer = 250 100% 98.43% 49.41% 24.73% 8.25%
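The page counts and percentages in Table 1 can be checked with a short calculation. The sketch below assumes a fully packed tree with 100 rectangles per node, so that each level holds the ceiling of the level below divided by 100, and it caps the buffered fraction at 100%. The function names `packed_rtree_pages` and `percent_buffered` are hypothetical and serve only this illustration.

```python
import math


def packed_rtree_pages(num_rects: int, fanout: int = 100) -> int:
    """Total pages of a fully packed R-tree, summed over all levels."""
    if num_rects < 1:
        raise ValueError("need at least one rectangle")
    total = 0
    count = num_rects
    while True:
        count = math.ceil(count / fanout)  # nodes needed at this level
        total += count
        if count == 1:  # reached the root
            return total


def percent_buffered(num_rects: int, buffer_pages: int,
                     fanout: int = 100) -> float:
    """Percent of the tree's pages that fit in a buffer of the given size."""
    pages = packed_rtree_pages(num_rects, fanout)
    return min(100.0, 100.0 * buffer_pages / pages)


if __name__ == "__main__":
    for size in (10_000, 25_000, 50_000, 100_000, 300_000):
        print(size, packed_rtree_pages(size),
              f"{percent_buffered(size, 10):.2f}%",
              f"{percent_buffered(size, 250):.2f}%")
```

For example, 25,000 rectangles need 250 leaf pages, 3 pages at the next level, and 1 root page, giving the 254 pages listed in the table.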
is similar.
(but that
number
of disk accesses
of 1% of the For point data
data,
of density
as the query
STR always
requires
needs to be retrieved
data
space
the bottom
versus using
the
a buffer
two curves,
5, the top two curves,
region fewer
size increases, disk accesses).
the more naive
10
data. set
the
(in thousands
of 10 pages. HS requires
HS requires difference
This result
the search
size
_lgorithm
The
of
plot for a
6 - 22% more
disk
6 - 16% more
disk
between
STR
and
is not surprising can afford
HS since
to be [3].
Point
Data
Region
Data. Size
STR
HS
NX
10 25 50 100 300
0.89 1.03 1.27 1.61 1.95
1.26 1.41 1.74 2.18 2.55
0.87 1.04 1.27 1.57 1.83
10 25 50 100 300
3.27 6.85 11.48 18.21 41.46
3.99 8.00 12.81 19.93 44.02
Region 10.86 26.61 50.64 98.47 290.05
Queries, 1.22 1.17 1.12 1.09 1.06
Query Region 3.33 3.89 4.41 5.41 7.00
10 25 50 100 300
11.73 26.40 46.20 84.54 229.75
13.02 28.07 48.74 87.51 234.82
Region 26.86 67.26 131.96 261.35 779.96
Queries, 1.11 1.06 1.05 1.04 1.02
Table
Carrying needs
this
2:
argument
Rtree
results
are
if the
packing
Density
= 5.0
STR
HS
NX
1.40 1.67 1.97 2.31 2.60
1.85 2.19 2.57 2.99 3.27
3.52 6.11 8.43 12.45 19.70
1.32 1.31 1.30 1.29 1.26
2.51 3.67 4.28 5.39 7.56
= 1% of Data 4.25 4.97 8.53 9.87 13.12 14.55 20.40 22.14 44.73 47.26
13.78 31.44 57.52 108.37 307.38
1.17 1.16 1.11 1.09 1.06
3.24 3.69 4.38 5.31 6.87
Query Region = 9% of Data 2.29 13.57 14.80 2.55 29.01 30.76 2.86 49.48 51.97 3.09 89.25 92.18 3.39 237.42 242.41
29.91 72.17 48.74 271.58 797.94
1.09 1.06 1.05 1.03 1.02
2.20 2.49 0.98 3.04 3.36
of Disk Accesses,
extreme,
all
NX/STR Point Queries 0.99 1.01 1.00 0.97 0.94
1.42 1.38 1.37 1.35 1.31
Number
to an
to be performed:
HS/STR
Data,
Synthetic
Data,
regions
encloses
query
schemes
exhibit
the
Buffersize
NX/STR
-- 10
all input
same
HS/STR
data.
performance
then
no
search
as all leaves
need
to be examined. More
exhaustive
respectively.
The
first
column
colunms
are
the
the
fifth
and
sixth
columns
are
the
same
as 2-6
7-11
for point the
queries
query We
include The
while
the
smMler produces
is the
of disk
on point
are but
number
accesses the
for
data,
in Tables
ratio region
and,
2 and
of data
items
3 for buffer in thousands,
to satisfy
the
query
of HS and
NX
relative
data.
of density
5.
as expected,
the
information
for the
sizes the
for STIR., HS, to STR
Note
difference
that
of 10 and second
and
NX
between
is not
pages,
through
NX
for point
250
on
data,
fourth
point and
data,
columns
competitive
except
STR
and
HS diminishes
data
sets
in Table
level
and
all
as
size increases. present the
tree.
number
presented
area.
and
perimeter
sun:
of both
area
second
through
fourth
fifth area.
through and
a slightly
and
seventh
perimeter smMler
perimeter
colunms colunm
than
the
total
area.
for
both
are for the are
for
the
HS algorithm for the
the
data.
300K
data
11
data
and
MBRs
50K
for
point
50K
both but
300K at leaf
set for STR, set. data the
The sets, same
HS, and STR
leaf
MBRs
level
We in the
NX respectively,
algorithm
whereas
4.
the
produces NX
area..
a
algorithm Note
that
DataSize STR
Region Data, HS NX Density= HS/STR 5.0 NX/STR
0.14 0.79 1.16 1.45
0.14 1.00 1.53 1.83
0.20 3.85 8.05 16.98
1.05 1.27 1.32 1.26
1.43 4.88 6.95 11.67
0.14 0.69 1.04 1.23
25 50 100 300
0.16 4.74 12.14 36.72
0.17 5.30 13.27 38.92
Region 0.49 26.24 76.60 279.67
Queries, 1.06 1.12 1.09 1.06
Query Region = 1% of Data 3.05 0.17 0.19 5.54 5.57 6.15 6.31 13.84 14.94 7.62 39.84 42.04
1.01 29.85 84.53 296.80
1.14 1.10 1.08 1.06
6.04 5.36 6.11 7.45
25 50 100 300
0.20 20.11 61.78 218.61
0.21 21.20 64.18 224.09
Region 1.29 76.11 228.98 769.74
Queries, 1.08 1.05 1.04 1.03
Query Region = 9% of Data 6.51 0.25 0.25 3.78 22.12 23.10 3.71 65.52 68.12 3.52 226.77 231.62
2.88 81.78 239.43 787.30
1.00 1.04 1.04 1.02
11.57 3.70 3.65 3.47
NX
algorithm
performance
4.2
We now of disk
has
oll regions
GIS
tiger
1.03 1.33 1.41 1.35
STR
0.13 0.52 0.74 0.91
3:
0.13 0.52 0.77 0.92
NX/STR Point Queries 1.00 1.01 1.04 1.01
25 50 100 300
Table
the
PointData NX HS/STR
HS
Number
much
larger
of Disk Accesses,
Synthetic
Data,
perimeters
the
two
than
other
Buffersize
algorithms
= 250
accounting
for its poor
queries.
data
present
results
for the
accesses
versus
buffer
Long
Beach
size for point
County
TIGER
queries.
Point leaf area. total area leaf perimeter total perimeter leaf area, total area leaf perimeter total perimeter
STR 50K 0.97 3.05 88.21 101.74
4:
Synthetic
data
set. set
In Figure
requires
10 we plot
532 leaf
level
the nodes
Data
NX 50K 0.97 2.97 982.49 998.48
Region Data, Density 1.96 7.58 4.31 9.63 127.46 1000.77 142.09 1016.88
1.53 3.65 110.82 124.51
Table
HS 50K 1.33 3.64 106.26 120.76
The
data
Data
12
Areas
STR 300K 0.97 3.12 216.24 243.85
HS 300K
NX 300K
1.31 3.76 258.36 289.45
0.97 2.97 5882.38 5948.36
1.96 4.46 312.57 344.03
17.47 19.63 5937.22 6003.54
= 5.0 1.54 3.74 272.79 300.78
and Perimeters
number and
7
BufferSizeI ST l 10 25 50 100 250 Region
100 250 Region 10 25 50 100 250
5:
Number
of Disk
Point 1.07 0.73 0.63 0.54 0.38
0.72 0.52 0.48 0.42 0.31
10 25 50
Table
nS I NX
Queries, Query 10.51 11.11 9.90 10.40 8.98 9.38
Region 35.89 35.59 34.44
7.83 5.12
31.61 19.25
8.10 5.34
Queries, Query 51.17 52.13 50.82 51.72 49.52 50.54 45.67 46.60 30.50 31.11
Accesses,
Long
Beach
[ HS/STR
Queries 3.54 3.08 2.76 2.31 1.39
1.49 1.41 1.33 1.27 1.20 = 1% of Data 1.06 1.05 1.04 1.04 1.04 9% of Data = 1.02 1.02 1.02 1.02 1.02
Region 107.51 107.32 106.23 101.47 77.17
Data,
[ NX/STR
Point
4.90 5.94 5.78 5.49 4.45 3.41 3.59 3.83 4.04 3.76 2.10 2.11 2.15 2.22 2.53
and Region
Queries
and Different
Buffer
Sizes
leaf area total area leaf perimeter total perimeter
Table
index
nodes
9.28%
18.55%
STR. up
for a total
Again,
to 9%
In Table
46.38%) the
space
5 we present
Tiger
Long
of 539
pages.
Thus,
of the
Rtree.
The
relative
of the
6:
difference the
the
two
number
of disk
HS 0.76 2.51
NX 2.85 4.27
74.11 86.04
76.67 89.77
544.30 557.07
Beach
Data,
a buffer
Areas
of size
HS algorithm
increases
algorithms
STR 0.53 2.00
as the are
accesses
and
25 50
requires
buffer
similar,
(10
and Perimeters
with the
100
20 - 50%
size decreases.
relative
holds
,nore For
HS requiring ratio
250)
disk
region
2 - 6% more to STR
(1.86%
4.64%
accesses queries disk
for point
than of sizes
accesses. and
region
queries. We present smaller
areas
area than
and both
perimeter HS and
information NX,
and
slightly
in Table smaller
13
6. The
STR
perimeters
algorithm than
HS.
produces
significantly
Buffer
Size
I STRI
HS
14.13 12.80 11.54 9.57 6.46 4.26
Point 13.67 11.84 10.36 8.48 5.78 4.01
10 25 50 100 250 500
4.3
7:
Number
NX Queries 197.57 197.15 196.27 193.50 177.10 134.57
10 25 50 100 250 500
Region Queries, Query 93.98 92.98 93.68 92.71 93.11 92.07 91.53 90.34 85.53 84.05 76.50 75.51
10 25
Region Queries, Query 398.78 396.26 398.44 396.01
50 100 250 500
Table
r
398.07 396.97 389.71 369.43
VLSI
NX/STR
0.97 0.93 0.90 0.89 0.90 0.94
14.45 16.65 18.94 22.81 30.63 33.55
= 1% of Data 0.99 0.99 0.99 0.99 0.98 0.99
6.51 6.53 6.57 6.67 7.06 7.56
Region = 9% of Data 1243.05 0.99 1242.93 0.99 1242.54 0.99 1241.10 0.99 1235.85 1.00 1216.26 0.99
3.14 3.14 3.14 3.14 3.18 3.32
Region 605.61 605.48 604.96 602.17 593.18 570.91
395.63 394.76 389.00 366.31
of Disk Accesses,
HS/STR
Data,
Buffer
Size Varied
for Point
and Region
Queries
VLSI
The
VLSI
and
location.
slightly
data
As
better)
In Table The
set consists
region
4.4
seen point
7 we present
the
the
performs same
queries.
experimental
be
for both
HS algorithm
practically
can
of approximately
In Table
11,
and
queries
region
slightly
HS
results better
queries.
8 we present
area
and
STR
which
perform
regardless and
STR
for point
NX and
region
algorithm
vary
considerably
almost
of buffer
tot point
than The
rectangles
the
size
same
(HS
performs
buffer
size is varied.
size.
queries
queries
as the
of 3% - 11%)
(by a factor
is significantly
perimeter
in both
information
worse which
for both
point
is consistent
with
and and the
results.
Computation
Fluid
For
the
experiments
in this
the
box
" ' (0.48,0.48)
(0.6,0.6).
variance
in Figure
detailed
for region
453,994
in the
number
Dynamics
section
of nodes
When
we restricted allowed
accessed
to
as the
point
and
region
range
over
the
remaining
14
area
queries entire
to
data.
is extremely
the set
area. there
sparse.
bounded was
Note
by
a large that
the
STR 9.81 14.79 707.92 769.40
leaf area total area leaf perimeter total perimeter Table
region
considered
are uniformly reduced
is also highly
distributed
space.
in the
The upper
the lower left corner
right
the 1% and 9% of the data In Figure In Table
9 we present
information nodes
3.75% 4.69% STR
For region accesses
5
the
queries
than
node
reduced
space.
Region
query
of the region
Thus,
The
a buffer
of region
This
area
queries
to fit within
by adding
for point
and
roughly
this
0.01 or 0.03 to corresponds
to
in Table
queries
for STR and
HS.
10 the
area
and perimeter
526 leaf level nodes
of size (10 15 20 25 50 100 250) holds As can be seen in Figure
fewer disk accesses
perform
were obtained
data. set requires
of the Rtree.
significantly
]eft corner
size is also reduced
required
of all our experiments set.
lower
experiments.
of disk accesses
data
STR
queries
the
at 0.6 if needed.
used in the other
46.90%)
HS and
the other
and
results
9.38% 18.76% requires
queries
number
of 533 pages.
algorithm
Point
region
for the 52,510
for a total
skewed.
and truncating
12 we plot the
NX 181.06 194.54 7733.60 7852.27
8: VLSI Data, Areas and Perimeters
corner
coordinates
HS 8.40 14.33 686.92 753.46
similarly.
than
The
for small
requires
7 index
(1.88%
12, for point
HS, especially
NX a.lgorithm
and
2.81%
queries buffer
the sizes.
significantly
more
two.
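The query generation described for the CFD experiments can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that lower left corners are drawn uniformly from the box (0.48,0.48) to (0.6,0.6), and the names `point_query`, `region_query`, and the fixed seed are hypothetical. A side of 0.01 or 0.03 gives a query area of 0.0001 or 0.0009 before truncation.

```python
import random

LO, HI = 0.48, 0.6  # query box from the text


def point_query(rng):
    """Draw a point uniformly from the reduced space (assumed distribution)."""
    x = min(rng.uniform(LO, HI), HI)  # clamp guards against float rounding
    y = min(rng.uniform(LO, HI), HI)
    return (x, y)


def region_query(rng, side):
    """Lower left corner as for a point query; upper right corner is the
    lower left corner plus `side`, truncated at HI if needed."""
    x, y = point_query(rng)
    return (x, y, min(x + side, HI), min(y + side, HI))


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
queries = [region_query(rng, 0.03) for _ in range(1000)]
```

Truncation means that queries whose lower left corner lies near the upper or right edge of the box cover less than the nominal area.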
Point Queries
Buffer Size    STR      HS       NX       HS/STR   NX/STR
250            0.19     0.21     0.26     1.11     1.38
100            0.25     0.28     0.38     1.15     1.56
50             0.41     0.47     0.53     1.15     1.30
25             0.69     0.81     0.72     1.18     1.05
20             0.79     0.95     0.79     1.20     1.00
15             0.89     1.23     0.88     1.38     0.99
10             1.05     1.76     1.06     1.68     1.01

Region Queries, Query Region Area = 0.0001
Buffer Size    STR      HS       NX       HS/STR   NX/STR
250            2.37     2.42     14.19    1.02     6.00
100            4.88     4.79     23.73    0.98     4.86
50             6.08     5.83     26.81    0.96     4.41
25             6.88     6.66     28.32    0.97     4.12
20             7.15     6.98     28.84    0.98     4.03
15             7.43     7.50     29.27    1.01     3.94
10             7.87     8.41     29.64    1.07     3.77

Region Queries, Query Region Area = 0.0009
Buffer Size    STR      HS       NX       HS/STR   NX/STR
250            11.83    11.72    44.48    0.99     3.76
100            19.05    18.95    66.23    0.99     3.48
50             20.90    20.82    72.68    1.00     3.48
25             22.80    22.31    74.97    0.98     3.29
20             23.45    22.75    75.53    0.97     3.22
15             24.46    23.42    75.91    0.96     3.10
10             25.55    25.02    76.32    0.98     2.99

Table 9: Number of Disk Accesses, CFD 52,510 Node Data, Buffer Size Varied for Point and Region Queries

                   STR       HS        NX
leaf area          0.93      1.73      0.88
total area         2.93      4.68      2.87
leaf perimeter     62.15     30.78     206.69
total perimeter    75.54     45.23     223.61

Table 10: CFD 52,510 Node Data Set, Areas and Perimeters

5 Conclusions

All three algorithms studied are based on heuristics and provide no performance guarantees. Thus, it is not surprising that none of them is best for all data sets. By studying the performance of the algorithms on the different types of data we can gain insight into when a specific algorithm performs well and when it does not. We considered the following general classes of data: 1) Uniformly distributed point and region data (synthetic); 2) Mildly skewed line segment data (tiger); 3) Highly skewed, in terms of location and size, region data (VLSI); 4) Highly skewed, in terms of location, point data (CFD).

Consider first the uniformly distributed data. For this type of data the HS algorithm requires up to 42% more disk accesses than the STR algorithm for both point and region queries. The NX algorithm performs as well as STR for point queries on point data, but much worse for point queries on region data or region queries. As previously pointed out [6], by ignoring all but one dimension the NX algorithm packs nodes in long skinny rectangles (see Figure 2), resulting in a large perimeter and hence poor performance for region queries. For all other types of data the NX algorithm does not compete and we drop it from subsequent discussion.

Consider now the mildly skewed tiger data set. The HS algorithm requires up to 49% more disk accesses than STR for both point and region queries. As expected, the difference is more noticeable for smaller buffer sizes.

For highly skewed data, firm conclusions are more difficult to draw. For the VLSI (region) data, HS performed 3% - 11% faster than STR for point queries, and roughly the same for region queries. For the CFD (point) data the situation is reversed: HS required 11% - 68% more disk accesses than STR for point queries, and roughly the same for region queries.

In summary, STR outperforms HS for mildly skewed or uniform data. For highly skewed data, which of STR or HS to use is a toss up and depends on the situation at hand. The NX algorithm is not competitive, particularly for region queries. It is clear that no single packing algorithm is best under all cases. It appears that the importance of choosing a packing algorithm, either STR or HS, is diminished as either the query size or the buffer size increases.

Developing a new algorithm that works well for all types of data is a challenge that should be pursued. In the future we plan to continue our search for a better packing algorithm, investigate dynamic R-tree variants based on the STR packing algorithm, and also extend our results to a parallel shared-nothing platform.
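The remark above that ordering on a single dimension yields long skinny leaf rectangles can be illustrated with a small sketch. It is an illustration under stated assumptions, not the implementation evaluated in this report: `pack_nx` sorts by the x-coordinate of the rectangle center and fills leaves in that order, while `pack_tiled` is a hypothetical slice-then-tile grouping used only as a contrast and should not be read as the exact STR procedure. All function names are introduced here for illustration.

```python
import math


def mbr(rects):
    """Minimum bounding rectangle of rectangles given as (x1, y1, x2, y2)."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def center(r):
    return ((r[0] + r[2]) / 2.0, (r[1] + r[3]) / 2.0)


def chunks(seq, n):
    return [seq[i:i + n] for i in range(0, len(seq), n)]


def pack_nx(rects, n):
    """Order by x only and fill leaves of capacity n (assumed NX-style)."""
    ordered = sorted(rects, key=lambda r: center(r)[0])
    return [mbr(g) for g in chunks(ordered, n)]


def pack_tiled(rects, n):
    """Hypothetical contrast: cut into vertical slices by x, then order each
    slice by y before filling leaves."""
    pages = math.ceil(len(rects) / n)
    slices = math.ceil(math.sqrt(pages))
    by_x = sorted(rects, key=lambda r: center(r)[0])
    leaves = []
    for s in chunks(by_x, slices * n):
        by_y = sorted(s, key=lambda r: center(r)[1])
        leaves.extend(mbr(g) for g in chunks(by_y, n))
    return leaves


def totals(mbrs):
    """Sum of leaf areas and sum of leaf perimeters."""
    area = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in mbrs)
    perim = sum(2 * ((x2 - x1) + (y2 - y1)) for x1, y1, x2, y2 in mbrs)
    return area, perim


# A 4 x 4 grid of points, stored as degenerate rectangles, 4 per leaf.
points = [(x, y, x, y) for x in range(4) for y in range(4)]
nx_area, nx_perim = totals(pack_nx(points, 4))
tiled_area, tiled_perim = totals(pack_tiled(points, 4))
print(nx_area, nx_perim, tiled_area, tiled_perim)
```

On this toy input the x-only ordering gives zero leaf area but the larger total perimeter, which matches the qualitative pattern for point data: competitive point queries, poor region queries.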
Acknowledgements

We would like to thank Ken Sevcik for useful discussions about a preliminary version of this work. We would also like to thank Dimitri Mavriplis for providing the CFD data sets.
References

[1] Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B., "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles," Proc. ACM SIGMOD, p. 323-331, May 1990.
[2] Bhide, A., Dan, A., Dias, D., "A Simple Analysis of LRU Buffer Replacement Policy and Its Relationship to Buffer Warm-up Transient," Proc. IEEE Data Engineering, 1993.
[3] Chazelle, B., "Filtering Search: A New Approach to Query Answering," SIAM J. Comput., vol. 15, p. 703-724, 1986.
[4] Faloutsos, C., Roseman, S., "Fractals for Secondary Key Retrieval," Proc. Eighth Symposium on Principles of Database Systems (PODS-89), p. 247-252, March 1989.
[5] Guttman, A., "R-trees: A Dynamic Index Structure for Spatial Searching," Proc. ACM SIGMOD, p. 47-57, 1984.
[6] Kamel, I., Faloutsos, C., "On Packing R-trees," Proc. 2nd International Conference on Information and Knowledge Management (CIKM-93), p. 490-499, Arlington, VA, November 1993.
[7] Kamel, I., Faloutsos, C., "Hilbert R-tree: An improved R-tree Using Fractals," Proc. International Conference on Very Large Databases, 1995 (VLDB-95).
[8] Leutenegger, S.T., Lopez, M.A., "The Effect of Buffering on the Performance of R-Trees," University of Denver Technical Report number 96-2, submitted for publication.
[9] Lopez, M.A., Janardan, R., Sahni, S., "Efficient Net Extraction for Restricted Orientation Designs," to appear in IEEE Transactions on CAD.
[10] Mavriplis, D.J., "An Advancing Front Delaunay Triangulation Algorithm Designed for Robustness," Journal of Computational Physics, vol. 117, p. 90-101, 1995.
[11] Rosenberg, A.L., Snyder, L., "Time and Space Optimality in B-Trees," ACM Transactions on Database Systems, vol. 6, no. 1, March 1981.
[12] Roussopoulos, N., Leifker, D., "Direct Spatial Search on Pictorial Databases Using Packed R-trees," Proc. ACM SIGMOD, May 1985.
[13] Sellis, T., Roussopoulos, N., Faloutsos, C., "The R+ Tree: A Dynamic Index for Multidimensional Objects," Proc. 13th International Conference on Very Large Databases (VLDB-87), p. 507-518, September 1987.
Figure 2: Leaf Bounding Rectangles for Long Beach Data using NX
Figure 3: Leaf Bounding Rectangles for Long Beach Data using HS
Figure 4: Leaf Bounding Rectangles for Long Beach Data using STR
Figure 5: Full Data for 5088 Node Data Set (axes: X, Y)
Figure 6: Data Around Center for 5088 Node Data Set (axes: X, Y)
Legend: HS, density 5.0; STR, density 5.0; HS, density 0; STR, density 0