SLAC-PUB-2189
August 1978
A SURVEY OF ALGORITHMS AND DATA STRUCTURES FOR RANGE SEARCHING*

Jon Louis Bentley
Departments of Computer Science and Mathematics
Carnegie-Mellon University
Pittsburgh, Pennsylvania 15213

Jerome H. Friedman
Computation Research Group
Stanford Linear Accelerator Center
Stanford, California 94305
ABSTRACT

An important problem in database systems is answering queries quickly. This paper surveys a number of algorithms for efficiently answering range queries. First a set of "logical structures" is described and then their implementation in primary and secondary memories is discussed. The algorithms included are of both "practical" and "theoretical" interest. Although some new results are presented, the primary purpose of this paper is to collect together the known results on range searching and to present them in a common terminology.

(Submitted to ACM Transactions on Database Systems)

*Work partially supported by the Department of Energy.
Table of Contents

1. Introduction
2. Logical Structures
   2.1 Brute Force
   2.2 Projection
   2.3 Cells
   2.4 k-d Trees
   2.5 Range Trees
   2.6 k-ranges
   2.7 Other Structures
   2.8 Comparison of Methods
3. Implementations
   3.1 Internal Memory
   3.2 Disk
   3.3 Tape
4. Further Work
5. Conclusions
1. Introduction

Researchers in database systems have recently identified and investigated many fundamental areas of study in their field; among these are issues such as database security, reliability, and integrity. One area which has not received much attention, however, is that of algorithmic efficiency, which is the study of "best possible" algorithms and data structures for answering different kinds of queries. In this paper we apply the tools of algorithm design and analysis to database problems by examining algorithms and data structures for answering a particular type of query.

We need some definitions to describe this searching problem. A file is a collection of records, each containing several attributes or keys. A query asks for all records satisfying certain characteristics. An orthogonal range query asks for all records with key values each within specified ranges. The process of retrieving the appropriate records is called range searching.

The problem of range searching can be cast in geometric terms. One can regard the record attributes as coordinates, and the k values for each record as representing a point in a k-dimensional coordinate space. The intersection of the ranges can be represented as a k-dimensional hyperrectangle in this space. The problem is then to find all points lying inside this hyperrectangle. We will often cast range searching in this geometric framework as an aid to intuition.

Range searching arises in many applications. A university administrator may wish to know those students whose age is between 21 and 24 years and whose grade point average is greater than 3.5. In a geographic database of U.S. cities one might seek a list of all those for which the latitude is between 37° and 41° and the longitude is between 102° and 109° (defining the state of Colorado). In statistics range searching can be employed to determine the empirical probability content of a hyperrectangle, to determine empirical cumulative distributions, and to perform density estimation. In data analysis it is often useful to carry out separate analyses on sets of data lying in different regions (ranges) of the observation space and then compare (or contrast) the respective results. (At the Stanford Linear Accelerator Center, for example, over ten hours per week of IBM 370/168 time is devoted to this application.)

In this paper we survey the algorithms and data structures for range searching. In Section 2 we study the "logical" structures and then turn to their implementations in Section 3. Directions for further work and conclusions are offered in Sections 4 and 5. Because this is a survey, we have omitted the more mathematical analyses in favor of presenting a more intuitive description. Readers interested in the analyses of the various structures are referred to the works in which they appear.

There are several problems which are closely related to range searching and on which there has been considerable research. One is the nearest neighbor problem: Friedman, Bentley, and Finkel [1977] investigate the problem of finding the k nearest neighbors of a point in a file of N points, and Bentley [1976] discusses the problem of finding the nearest neighbor to each of the N points in the file. Bentley [1975a] and Bentley, Stanat, and Williams [1977] discuss the problem of finding all points within a fixed radius of a given point; Yuval [1975] discusses the special case of this problem for the $L_\infty$ metric. Domination problems are also closely related to range searching; a point is said to dominate another if all of its coordinates are larger. Kung, Luccio, and Preparata [1975] discuss the determination of whether a given point is dominated by any other point. Bentley and Shamos [1977] investigate the calculation of how many points a given point dominates, which is the empirical cumulative distribution evaluated at the point. In the future, these methods might be usefully applied to the problem of range searching.
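The geometric formulation from the Introduction can be made concrete with a short sketch. The Python below is only an illustration: the function name `in_hyperrectangle`, the closed-interval convention, and the city coordinates (rough approximations) are assumptions introduced here, not part of the original report. It tests whether a record, viewed as a point in k-space, lies inside the hyperrectangle defined by k ranges.

```python
from typing import Sequence, Tuple

Point = Sequence[float]
Range = Tuple[float, float]  # closed interval (low, high) on one attribute


def in_hyperrectangle(point: Point, ranges: Sequence[Range]) -> bool:
    """Return True if every coordinate lies inside its closed range."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, ranges))


# Illustrative records only: approximate (latitude N, longitude W) pairs.
cities = {
    "Denver": (39.7, 105.0),
    "Boulder": (40.0, 105.3),
    "Chicago": (41.9, 87.6),
}

# The Colorado query from the text: 37 <= latitude <= 41, 102 <= longitude <= 109.
colorado = [(37.0, 41.0), (102.0, 109.0)]
inside = sorted(name for name, p in cities.items() if in_hyperrectangle(p, colorado))
```

Every structure discussed in Section 2 must return exactly the records accepted by this predicate; the structures differ only in how many records they have to examine to do so.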
2. Logical Structures

In this section we discuss the various methods for range searching in terms of their logical structures; that is, the logical structure of the data at the level of "adjacency" and "pointers" without regard to implementation. In Section 3 we will study the problem of how one implements these logical structures on specific storage media.

A search method is specified by a data structure for storing the data and algorithms for building and searching the structure. One usually analyzes a search structure (say S) by giving three cost functions: 1) the cost of building a search structure for N points in k-space (which we call preprocessing), $P_S(N,k)$; 2) the storage required, $S_S(N,k)$; and 3) the search time or query cost, $Q_S(N,k)$. These costs can be analyzed in terms of their average or their worst-case cost. We will usually speak of the worst-case cost, explicitly mentioning the average whenever we employ it. There may also be various utility operations such as insertion and deletion.
2.1 Brute Force

The simplest approach to range searching is to store each of the N points in a sequential list. As each query arrives all members of the list are scanned and all records that satisfy the query are enumerated. If the queries do not have to be handled immediately then they can be batched so that many queries can be processed with one sequential pass through the file. It is easy to see that the brute force structure, B, possesses the properties

$P_B(N,k) = O(Nk)$, $S_B(N,k) = O(Nk)$, and $Q_B(N,k) = O(Nk)$.

Brute force searching has the advantage of being trivial to implement on any storage medium. It is competitive with the more sophisticated methods described below when the file is small and the number of attributes is large; or when a large fraction of the records in the file satisfy the query (or queries, if they are batched).
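The batching idea can be sketched in a few lines of Python. This is a minimal illustration under assumptions made here (the function name `brute_force_batch`, points as tuples, and queries as lists of closed `(low, high)` ranges are all hypothetical): a single sequential pass over the file answers every query in the batch.

```python
def brute_force_batch(points, queries):
    """Answer a batch of range queries with one sequential pass over the file.

    points  : list of k-tuples
    queries : list of queries, each a list of k closed (low, high) ranges
    Returns a list whose i-th entry holds the points satisfying queries[i].
    """
    answers = [[] for _ in queries]
    for p in points:                          # one pass through the file
        for i, ranges in enumerate(queries):  # test the record against each query
            if all(lo <= x <= hi for x, (lo, hi) in zip(p, ranges)):
                answers[i].append(p)
    return answers


points = [(1, 1), (2, 5), (4, 3), (6, 2)]
queries = [[(0, 3), (0, 6)], [(3, 7), (1, 3)]]
answers = brute_force_batch(points, queries)
```

Each record is read once regardless of the number of queries, which is the property that matters when the file sits on a sequential storage medium.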
2.2 Projection

The projection technique is referred to as inverted lists by Knuth [1973]. Projection was applied by Friedman, Baskett, and Shustek [1976] in their solution of the nearest neighbor problem, and by Lee, Chin, and Chang [1976] to a number of database problems. This technique involves keeping, for each attribute, a sequence of the records in the file sorted by that attribute. One can view this geometrically as the projection of the points on each coordinate. The k lists representing the projections can be obtained by using a standard sorting algorithm k times. After preprocessing, a range query can be answered by the following search procedure: choose one of the attributes, say the i-th. Look up the two positions in the i-th sequence (using a binary search) of the extreme values defining the range on the i-th attribute of the query. All records satisfying the query will be in the list between these two positions. This smaller list is then searched by brute force.

The projection technique is illustrated in Figure 2.1. The points represent a set of sixteen records of two keys each, represented by x and y coordinates. The dashed lines are the projection of the points onto the x coordinate (that is, the records sorted into x order). The vertical slab is the x-range of the query, the horizontal slab is the y-range, and the rectangle which is their intersection contains just those records which satisfy the query. To answer this query we need only investigate the six points which are inside the vertical slab, marked by the 45° lines.

Figure 2.1. Illustration of projection.

One can apply the projection technique with only one sorted list. If the distributions of values of the various attributes are more or less uniform over similar ranges and the query ranges are similar, then one list is sufficient. If not, then it can pay to keep sorted sequences on all k attributes. The positions of the extremes of each attribute of the query range are found in each of the k lists. The list for which the difference in positions is smallest is searched between the two positions.

Analysis of the projection technique, P, for nearest neighbor searching is reported in Friedman, Baskett, and Shustek [1976]. Most of this analysis carries over directly to the problem of range searching. It is clear that

$P_P(N,k) = O(kN \log N)$, and $S_P(N,k) = O(kN)$.

For searches that find a small number of records (and are therefore similar to near neighbor searches) one has

$Q_P(N,k) = O(N^{1-1/k})$ [Average Case].

The projection technique is most effective when the number of records satisfying each query is usually close to zero.
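The search procedure for projection with k sorted sequences might be sketched as follows. This is an illustration, not the cited authors' implementation: the class name `Projection`, its method names, and the closed-range convention are assumptions made here, and the file is assumed to be non-empty. Python's standard `bisect` module supplies the binary searches.

```python
from bisect import bisect_left, bisect_right


class Projection:
    """k sorted sequences, one per attribute (assumes a non-empty file)."""

    def __init__(self, points):
        self.points = list(points)
        k = len(self.points[0])
        # Preprocessing: sort (key, record index) pairs once per attribute.
        self.lists = [sorted((p[j], i) for i, p in enumerate(self.points))
                      for j in range(k)]
        self.keys = [[key for key, _ in lst] for lst in self.lists]

    def query(self, ranges):
        """ranges: one closed (low, high) pair per attribute."""
        best = None
        for j, (lo, hi) in enumerate(ranges):
            a = bisect_left(self.keys[j], lo)    # position of the low extreme
            b = bisect_right(self.keys[j], hi)   # position past the high extreme
            if best is None or b - a < best[1] - best[0]:
                best = (a, b, j)                 # keep the shortest candidate list
        a, b, j = best
        out = []
        for _, i in self.lists[j][a:b]:          # brute force over the short list
            p = self.points[i]
            if all(lo <= x <= hi for x, (lo, hi) in zip(p, ranges)):
                out.append(p)
        return out


proj = Projection([(1, 9), (2, 2), (3, 7), (5, 5), (8, 1), (9, 6)])
found = proj.query([(2, 8), (4, 8)])
```

In this example the x-range selects four candidates and the y-range only three, so the y-sorted sequence is the one scanned.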
2.3 Cells

"There are two ways they can search [for the murder weapon]: from the body outward in a spiral, or divide the room up into squares--that's the grid method." [1]

Cartographers as well as detectives use the grid (or cell) method: street maps of entire metropolitan areas are often printed in the form of books. The first page of the book shows the entire area and the remaining pages are detailed maps of (say) one-mile-square regions. To find (for example) all schools in a specified rectangle one would look in the first page to find which squares overlap the rectangle and then check only on those pages to find the schools. This approach can be mechanized immediately. A square of the map corresponds to a cell in k-space, and the points of the file within the cell are stored as a linked list. The first page of the map book corresponds to a directory which allows one to take a hyperrectangle and look up the set of cells. Knuth [1973] has discussed this scheme for the two-dimensional case. Levinthal [1966] used a cell technique in three-dimensional Euclidean space for determining all atoms within five angstroms of every atom in a protein molecule; he referred to this as "cubing". Yuval [1975] and Rabin [1976] apply an overlapping cell structure to the closest-pair problem.

[1] From the CBS series Kojak, "Death Is Not a Passing Grade".

The directory can be implemented in two ways. If the points are (say) uniformly distributed on $[0,10]^2$ and we have chosen 1 x 1 cells, then we can use a two-dimensional array as the directory. In DIRECT(i, j) we would keep a pointer to a list of all points in the cell [i, i+1] x [j, j+1]. If we then wanted to find all points in [5.2, 6.3] x [1.2, 3.4] then we would only have to examine cells (5,1), (5,2), (5,3), (6,1), (6,2), (6,3). The multidimensional array works very well when the points are known a priori all to be in some given rectangle. When this is not known to be the case one would probably use a search method such as hashing for the directory. In this method we name each cell as before, so cell (i, j) is a pointer to the points in [i, i+1] x [j, j+1]. Instead of storing all cells, however, we store only cells which contain points of the file. To process a query we "decode" the rectangle into a set of cell id's, look up those id's, and check the points in the occupied cells for inclusion in the rectangle. The storage required is the storage locations for the directory plus the linked lists representing points in cells; the size of the directory is usually much smaller than N.

The cell technique is illustrated in Figure 2.2. The sixteen points in that figure represent sixteen records containing two keys each. The points in each cell are stored together in an implementation. The query is given by the rectangle in the upper part of the figure, and to answer it only those points in the four dashed cells need be investigated.

Figure 2.2. Illustration of the cell technique.

Basic parameters of the cell technique are the size and shape of each cell. In analyzing a search there are two costs to count: cell accesses (the number of directory look-ups) and inclusion tests (testing whether a point satisfies the query). If the cell size is very large, then there will be few cell accesses and many inclusion tests. If the cells are extremely small, on the other hand, then there will be very many cell accesses and very few inclusion tests. Clearly either extreme is bad. The best cell size and shape depends upon the size and shape of the query hyperrectangle. Bentley, Stanat, and Williams [1977] show that if the query hyperrectangles have constant size and shape, so that only their location (in the coordinate space) is unspecified, then a nearly optimum size and shape for the cells are the same as that for the query hyperrectangle. For this case the number of cells accessed is $2^k$ and the expected search time is proportional to $2^k$ times the number of points in the range. In this context the performance of cells is given by

$P_C(N,k) = O(Nk)$, $S_C(N,k) = O(Nk)$, and $Q_C(N,k) = O(2^k F)$ [Average]

where F is the number of records found. In most applications the queries will vary in size and shape as well as their location, so that there is little information available for making a good choice of cell size and shape.
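The hashed-directory variant of the cell technique can be sketched with a Python dictionary standing in for the hash table. The class `CellGrid`, its method names, and the rule that a point on a cell boundary belongs to the higher-numbered cell are assumptions made for this illustration; only occupied cells are stored, as the text describes.

```python
from collections import defaultdict
from itertools import product
from math import floor


class CellGrid:
    """Grid of cubical cells with a hashed directory of occupied cells only."""

    def __init__(self, points, cell_size=1.0):
        self.size = cell_size
        self.directory = defaultdict(list)       # cell id -> points in that cell
        for p in points:
            self.directory[self._cell_id(p)].append(p)

    def _cell_id(self, p):
        return tuple(floor(x / self.size) for x in p)

    def cell_ids(self, ranges):
        """Decode a query hyperrectangle into the ids of the cells it overlaps."""
        spans = [range(floor(lo / self.size), floor(hi / self.size) + 1)
                 for lo, hi in ranges]
        return list(product(*spans))

    def query(self, ranges):
        out = []
        for cid in self.cell_ids(ranges):             # one cell access per id
            for p in self.directory.get(cid, ()):     # empty cells cost a look-up only
                if all(lo <= x <= hi for x, (lo, hi) in zip(p, ranges)):
                    out.append(p)                     # passed the inclusion test
        return out


grid = CellGrid([(5.5, 2.5), (6.1, 1.0), (6.2, 3.3), (0.5, 0.5), (5.9, 3.9)])
ids = grid.cell_ids([(5.2, 6.3), (1.2, 3.4)])
hits = grid.query([(5.2, 6.3), (1.2, 3.4)])
```

With 1 x 1 cells the query [5.2, 6.3] x [1.2, 3.4] decodes to the six cells listed in the text, and the two costs discussed above correspond to the outer loop (cell accesses) and the inner test (inclusion tests).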
2.4 k-d Trees

The k-d tree was introduced by Bentley [1975b]. The application of k-d trees to database problems is discussed by Bentley [1978]. Friedman, Bentley, and Finkel [1977] introduced adaptive k-d trees and showed that this structure is very effective for nearest neighbor searching. This data structure has the property that it adapts to the distribution of the points in k-space. This overcomes the problem of empty cells, which severely limits the performance of regular cubical grids. The adaptive k-d tree is in effect a method of dividing the k-space into a collection of irregular hyperrectangles (cells) which all contain nearly the same number of points and are approximately cubical.

The k-d tree is a generalization of the binary tree used for sorting and searching. The k-d tree is a binary tree in which each node represents a subcollection of the points in the space and a partitioning of that subcollection. The root of the tree represents the entire collection. Each nonterminal node has two sons; these son nodes represent the two subcollections defined by the partitioning. The terminal nodes represent mutually exclusive small subsets of the points, which collectively form a partition of k-space. These terminal subsets of records are called buckets.

In the case of one-dimensional searching, a point is represented by a single value and a partition is defined by some value of that coordinate. All records in a subcollection with key values less than or equal to the partition value belong to the left son while those with a larger value belong to the right son. In k-space a point is represented by k coordinate values. Any one of these can serve as a discriminator for partitioning the subcollection represented by a particular node in the tree; that is, the discriminator can range from 1 to k.

The prescription for constructing an adaptive k-d tree is to choose as discriminator at each node that coordinate j for which the spread of attribute values (as measured by any convenient statistic) is maximum for the subcollection represented by the node. The partitioning value is chosen to be the median value of this attribute. This prescription is then applied recursively to the two subcollections represented by the two sons of the node just partitioned. The partitioning is stopped, creating a terminal node (or bucket), when the cardinality of the subcollection is less than a prespecified maximum, which is a parameter of the procedure. Friedman, Bentley, and Finkel [1977] found empirically that values ranging from 8 to 16 work well for nearest neighbor searching. The result of this procedure is that the space is divided into a number of buckets, each containing approximately the same number of points (by the stopping criteria) and each approximately "cubical" in shape (by the choice of discriminator).

Range searching with k-d trees is straightforward. Starting at the root, the k-d tree is recursively searched in the following manner. When visiting a node that discriminates by the j-th key (which we call a j-discriminator), one compares the j-th range of the query with the discriminating key value of the node. If the query range is entirely above (or below) the key's value then one need only search the right subtree (respectively left) of that node; the other son can be pruned from the search because no point in that subtree satisfies the query in that particular key. If the query range overlaps the node's key (that is, the key is between the low and high bounds of the range), then both sons need be searched. This can be accomplished by searching both sons recursively, making use of a stack.

A modification can increase the speed of this basic recursive algorithm if it is suspected that many large subtrees will be contained entirely within the query region. The "bounds-array" technique described by Friedman, Bentley, and Finkel [1977] can be employed to detect that the region represented by a subtree is totally within the query range; the points in that subtree can then be listed without the inclusion-test overhead of the tree traversal.

The application of k-d trees to range searching is illustrated in Figure 2.3. The k-d tree in 2-space is depicted in two ways; Figure 2.3.a shows the structure in the plane and Figure 2.3.b shows the structure as an abstract tree. The root of the tree is internal node A; it is an x-discriminator, and is represented by the vertical line labelled A. That is, every point to the left of that vertical line is in the left subtree of A (whose root is B), and every point to the right is in the right subtree of A (whose root is C). This partitioning continues, and the cells in this tree each contain two points. The query rectangle is illustrated in Figure 2.3.a, and the search for all points within the rectangle is illustrated in Figure 2.3.b. The search starts at the root, and since the query rectangle is entirely to the right of the vertical line defined by A, the left subtree of A (whose root is B) can be pruned from the search. This is illustrated in Figure 2.3.b by the perpendicular line through the son link from A to B. The search continues recursively, visiting both sons of C, both sons of F, and only the left son of G. A total of three buckets are searched; these buckets are dashed in the planar representation and are marked by an S in the tree representation.

Figure 2.3. Illustration of k-d trees. a.) Planar representation. b.) Tree representation.

Analysis of k-d trees for range searching has been considered by several researchers. The work required to construct a k-d tree and its storage requirements are

$P_K(N,k) = O(N \log N)$, and $S_K(N,k) = O(Nk)$.

The search cost depends upon the nature of the query. In the very worst case Lee and Wong [1976] show that

$Q_K(N,k) \le O(N^{1-1/k})$ [Worst Case].

If the number of records that satisfies the query is small, so that the range query is similar to a nearest neighbor search, then one has from Friedman, Bentley, and Finkel [1977] that

$Q_K(N,k) = O(\log N + F)$ [Average Case for small answer]

where F is the number of points found in the region. For the case where a large fraction of the file satisfies the query Bentley and Stanat [1975] show that

$Q_K(N,k) = O(F)$ [Average Case for large answer].

The k-d tree structure is most effective in situations where little is known about the nature of the queries or a wide variety of queries are expected. They are also useful if other types of queries (in addition to range queries) are anticipated.
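The construction prescription and the pruned recursive search for adaptive k-d trees can be sketched compactly. The Python below is an illustration under assumptions made here, not the cited authors' code: spread is measured as maximum minus minimum (the text allows any convenient statistic), the median is taken as the lower middle element, splitting stops at `bucket_size` or fewer points, and the names `Node`, `build`, and `search` are hypothetical.

```python
class Node:
    def __init__(self, disc=None, value=None, left=None, right=None, bucket=None):
        self.disc, self.value = disc, value   # discriminator index and partition value
        self.left, self.right = left, right
        self.bucket = bucket                  # list of points for a terminal node


def build(points, bucket_size=8):
    """Build an adaptive k-d tree over a list of k-tuples."""
    if len(points) <= bucket_size:
        return Node(bucket=list(points))

    def spread(d):                            # spread statistic: max - min (an assumption)
        vals = [p[d] for p in points]
        return max(vals) - min(vals)

    j = max(range(len(points[0])), key=spread)
    pts = sorted(points, key=lambda p: p[j])
    value = pts[(len(pts) - 1) // 2][j]       # median value of the j-th attribute
    left = [p for p in pts if p[j] <= value]
    right = [p for p in pts if p[j] > value]
    if not right:                             # cannot split further (ties): make a bucket
        return Node(bucket=pts)
    return Node(j, value, build(left, bucket_size), build(right, bucket_size))


def search(node, ranges, out):
    """Append to `out` every point inside the closed ranges; return `out`."""
    if node.bucket is not None:
        out.extend(p for p in node.bucket
                   if all(lo <= x <= hi for x, (lo, hi) in zip(p, ranges)))
        return out
    lo, hi = ranges[node.disc]
    if lo <= node.value:                      # range reaches keys <= partition value
        search(node.left, ranges, out)
    if hi > node.value:                       # range reaches keys > partition value
        search(node.right, ranges, out)
    return out


pts = [((i * 7) % 20, (i * 3) % 17) for i in range(40)]
tree = build(pts, bucket_size=2)
query = [(3, 12), (2, 9)]
found = search(tree, query, [])
```

A son is skipped whenever the query range lies entirely on the other side of the partition value, which is the pruning step described above; the bounds-array refinement is not included in this sketch.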
2.5 Range Trees

In this section we describe the range tree, a structure introduced by Bentley [1977]. It achieves the best (worst-case) search time of all the structures discussed so far, but has relatively high preprocessing and storage costs. For most applications the high storage will be prohibitive, but the range tree is very interesting from a theoretical viewpoint. Since the range tree is defined recursively we will begin our discussion by looking at a one-dimensional structure, and then generalize that structure to higher dimensions.

The simplest structure for one-dimensional range searching is a sorted array. The preprocessing sorts the N elements to be in ascending order by key. To answer a range query we do two binary searches to find the positions in the array of the low and high end of the range. After these two positions have been found we can list all the points in that part of the array as the answer to the range query. For this structure we use linear storage and O(N lg N) preprocessing time. The two binary searches will each cost O(lg N), and the cost of listing the points found in the region will, of course, be proportional to the number of such points. Letting F be the number of points found in the region, we have

$P_R(N,1) = O(N \lg N)$, $S_R(N,1) = O(N)$, and $Q_R(N,1) = O(\lg N + F)$.

We will now build a two-dimensional range tree, using as a tool the one-dimensional sorted arrays we described above (which we abbreviate SA's). The range tree is similar to the "binary search trees" described by Knuth [1973, Section 6.2], so we will use his terminology in our discussions. The range tree will be a rooted binary tree in which every node has a left son, a right son, a discriminating value (all nodes in the left subtree have a discriminating value less than the node's), and (unlike in a regular binary search tree) every node contains an SA. The root of the range tree represents all points: it contains an SA of all N points sorted by y-coordinate, and has as discriminating value the median x-value. The left son of the root has an SA containing the N/2 points with x-value less than the median x-value, sorted by y-coordinate. Likewise the right son of the root has an SA of the N/2 points with x-value greater than the median, sorted by y-coordinate. This partitioning continues so that i levels away from the root we have $2^i$ subtrees, each representing $N/2^i$ points contiguous in the x-coordinate, and each containing an SA of those points sorted by y-coordinate. This continues for a total of (approximately) lg N levels; we handle small point sets (say less than a dozen points) by brute force.

The search algorithm for a range tree is most easily described recursively. Each node in the tree represents a range in the x-dimension. When visiting a node we compare the x-range of the query to the range of the node, and if the node's range is entirely within the query's x-range then we search that node's SA for all points in the query's y-range. Otherwise we compare the query's x-range to the discriminator value. If the range is entirely below the discriminator we recursively visit the left subtree; if it is above we visit the right; and if the range overlaps the discriminator then we visit both subtrees.

Analysis of the planar range tree is somewhat complicated. Since there are lg N levels in the tree and N points are stored on each level, the total storage required is O(N lg N). The preprocessing can be performed in O(N lg N) time if clever techniques are employed. Analysis shows that at most two SA searches (each of cost approximately lg N) are done on each level of the tree, so the total cost for a search is $O(\lg^2 N)$ plus the time for listing the points in the region. Letting F stand, as before, for the total number of points found in the region we have

$P_R(N,2) = O(N \lg N)$, $S_R(N,2) = O(N \lg N)$, and $Q_R(N,2) = O(\lg^2 N + F)$.

If we step back for a moment we can see how we built the structure: we constructed a two-dimensional structure by building a tree of one-dimensional structures. We can perform essentially the same operation to yield a three-dimensional structure: we construct a tree containing two-dimensional structures in the nodes. This process can be continued to yield a structure for k dimensions, which will be a tree containing (k-1)-dimensional structures. This will yield a structure with performances

$P_R(N,k) = O(N \lg^{k-1} N)$, $S_R(N,k) = O(N \lg^{k-1} N)$, and $Q_R(N,k) = O(\lg^k N + F)$.

The range tree structure is very interesting from a theoretical viewpoint. The asymptotic search time is very fast, but the amount of storage used is probably prohibitive in practice. Although the application of this structure to practical problems will probably be limited to cases when k = 2 or 3, it does provide an important theoretical benchmark. It also gives us an interesting method that might yield fruit in practice. (Indeed, there are some very interesting relationships between range trees and the k-d trees of Section 2.4.)
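A planar range tree along the lines described above might be sketched as follows. Several choices here are assumptions of this illustration rather than details from Bentley [1977]: each node stores the smallest and largest x-value it represents instead of a single discriminating value, small sets (at most `leaf_size` points) are searched by brute force, and each node's SA is built by re-sorting, so preprocessing costs O(N lg² N) instead of the O(N lg N) attainable with the "clever techniques" the text mentions. The class and method names are hypothetical.

```python
from bisect import bisect_left, bisect_right


class RangeTree2D:
    """Planar range tree over a non-empty list of (x, y) tuples."""

    def __init__(self, points, leaf_size=4):
        self.root = self._build(sorted(points), leaf_size)   # sort by x once

    def _build(self, pts, leaf_size):
        sa = sorted(pts, key=lambda p: p[1])                 # SA: points sorted by y
        node = {"xmin": pts[0][0], "xmax": pts[-1][0],
                "sa": sa, "ys": [p[1] for p in sa],
                "left": None, "right": None}
        if len(pts) > leaf_size:
            mid = len(pts) // 2                              # split at the median x
            node["left"] = self._build(pts[:mid], leaf_size)
            node["right"] = self._build(pts[mid:], leaf_size)
        return node

    def query(self, xlo, xhi, ylo, yhi):
        out = []
        self._visit(self.root, xlo, xhi, ylo, yhi, out)
        return out

    def _visit(self, node, xlo, xhi, ylo, yhi, out):
        if node["xmax"] < xlo or node["xmin"] > xhi:
            return                                           # x-ranges are disjoint
        if xlo <= node["xmin"] and node["xmax"] <= xhi:
            a = bisect_left(node["ys"], ylo)                 # two binary searches in the SA
            b = bisect_right(node["ys"], yhi)
            out.extend(node["sa"][a:b])
            return                                           # do not descend further
        if node["left"] is None:                             # small point set: brute force
            out.extend(p for p in node["sa"]
                       if xlo <= p[0] <= xhi and ylo <= p[1] <= yhi)
            return
        self._visit(node["left"], xlo, xhi, ylo, yhi, out)
        self._visit(node["right"], xlo, xhi, ylo, yhi, out)


pts = [((i * 7) % 20, (i * 3) % 17) for i in range(40)]
rt = RangeTree2D(pts)
found = rt.query(3, 12, 2, 9)
```

Once a node's x-range is found to lie inside the query's, its SA is searched and the recursion stops there; descending further would report the same points again from the sons' arrays.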
2.6 k-ranges

The k-range is an efficient worst-case structure for range searching introduced by Bentley and Maurer [1978]. They developed two types of k-ranges: overlapping and nonoverlapping. Both of these structures involve storing sets of lists of points sorted by different coordinates; additional dimensions are added recursively, much like the range trees of the last section. The overlapping k-ranges can be made to have performance

$P_O(N,k) = S_O(N,k) = O(N^{1+\epsilon})$, $Q_O(N,k) = O(\lg N + F)$

for any $\epsilon > 0$. It is pleasing to note that the constants "hidden" in the big-ohs of the above equations are just $k/\epsilon$. Overlapping k-ranges have very efficient retrieval times, but somewhat high preprocessing and storage costs; their dual, nonoverlapping k-ranges, have very efficient preprocessing and storage costs but increased query times. Their performance is

$P_N(N,k) = O(N \lg N)$, $S_N(N,k) = O(N)$, and $Q_N(N,k) = O(N^{\epsilon} + F)$

for any fixed $\epsilon > 0$. The details of these structures can be found in Bentley and Maurer [1978]. Although these structures were developed primarily as a theoretical device, they might prove efficient in some implementations.
2.7 Other Structures In this
section
competitive
we briefly
with those discussed
hope that someone those
mention
several
that we feel
by one of them to invent
when
divided
into
multidimensional imposed
tree
has more than some certain subcubes
of
yet
with, multiway
smaller
number
superior
In terms
branching.
on the space and the ease of implementation,
Finkel
and
Bent1e.y [1974]
generalization
of the binary
Stanat
analyzed
[1975]
uniform
planar
point
“search-sort
synchronized
to
sets. k trees”)
system.
cumulative
of
points.
is required
can be used to obtain performance
only
over
a
partitioning
It
is a
Bentley
and
tree.
range searches
trees
when
aside, however,
in
(which used
he in a
the quad tree
the k-d tree.
of a point
(the ECDF tree)
(in k-dimensional
the
of
space)
among
in the- query
and, not a listing of the points, then several
ECDF searches
This structure
number
for finding
points
storage
of
the
the quad
binary
a data structure
a count
that count.
but requires
of Section
has very desirable
and preprocessing
requirements
worst-case
a
query
similar to the range
2.5.
2.8 Comparison In Sections
If
implies
the fact that quad trees
successor,
describe
distribution
of both
node has 2k sons.
This application
by its historical
and Shamos [1977]
hyperrectangle
performance
have advantages
is
scheme
of quad trees for “square” discussed
the cube
by the k-d tree.
called
in which every
Lin [1973]
multiprocessor
empirical
collection
tree
of points,
That is,
this idea seems to be dominated
a structure
the performance
seems to be dominated Bentley
describe
recursively.
This
size.
by the quad tree (see below), which is in turn dominated
trees
techniques
points out that the notion of .cells can be applied
one of the cubes
further
the
and in the
we have discussed.
Knuth [1973]
called
are no longer
We include them for completeness
above.
might be inspired
structures
of Methods
2.1 to 2.6 we have discussed of these six structures
six structures
(seven including
for range searching.
the two .variants
of k-ranges)
The is ’
- 15 -
summarized
All of the functions
.costs, costs
in Table 2.1, which shows for each the preprocessing,
which
are
assumptions
marked
in that table reflect with
are described
For those
an asterisk.
:
P(N,k)
Force
Projection
O(N Ie N)
Cells
ON
k-d trees k-ranges
those
the
query
probabilistic
O(N)
O(N)
.O(N1-‘/k+F)
*
O(N)
O(F)
*
O(N)
O(N, lg N)
ON
O(N
l-l/k O(N
+F) O(lg N + F)
1+c O(N )
)
*:
O(NE + F)
O(N lgk-IN)
ltc k-ranges
QNk)
O(N)
O(N Ig NJ
O(N k k-lN)
Range Trees Overlapping
functions
. S(N,k)
O(N)
Nonoverlappping
costs, except
and .query
in the notes.
Structure
Brute
worst-case
storage,
O(lgk N + F) O(lg i
+ F)
*'ed query times indicate average-case analysis. Probabilistic assumptions: 1. Smooth data sets, very small query region. 2. Any data set, cell size equals query size. 3. Smooth data set.

Table 2.1. Performance of data structures for range searching.

Four of these six structures (brute force, projection, cells, and k-d trees) have been presented as providing practical solutions to the range searching problem. For each of the four methods there are situations in which it performs badly and other situations in which it is clearly superior. In this section we will mention these various situations and compare the performance of the methods.

If the file is small and the number of attributes is large, if the file is to be searched only a few times, or if the queries can be batched so that nearly all of the records in the file satisfy at least one query, then brute force is the method of choice. Otherwise one of the other methods is likely to be more efficient.

Projection does best when the query range on only one of the attributes is sufficient to eliminate nearly all of the file records. For this case the low overhead of searching with the projection technique allows it to dominate the others. In situations where the range query restricts several or many of the attributes, the performance of projection is poor, and the cell and k-d tree structures, which restrict the search on several or many of the attributes, are advantageous.

The cell technique is most advantageous in those situations where the size and shape of the queries are roughly constant. If the approximate size and shape of the expected queries are known in advance, then cells defined by a grid with that size and shape are appropriate. The fixed cell design, however, performs relatively poorly on queries with sizes and shapes that differ considerably from that common to the grid.

The k-d tree structure is characterized by its robustness to wildly varying queries. The cell design adapts to the distribution of the attribute values of the file records in the k-dimensional coordinate space. The cells all contain very nearly the same number of records; there are no empty cells. In dense regions there are many cells and a fine division of the coordinate space; in sparse regions there are fewer cells and there is a coarser division. If a wide variety of queries are expected then the k-d tree structure should serve best.
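The trade-off between brute force and projection can be made concrete with a small sketch. The following Python is an illustration only, not code from the paper: the function names and the representation of a query as a list of (lo, hi) pairs are assumptions made for this example. Projection keeps one sorted table per attribute, uses binary search to find the attribute whose range admits the fewest records, and then tests only those candidates.

```python
from bisect import bisect_left, bisect_right

def brute_force(records, query):
    """Scan every record; query holds one (lo, hi) pair per attribute."""
    return [r for r in records
            if all(lo <= x <= hi for x, (lo, hi) in zip(r, query))]

def build_projection(records):
    """One table per attribute: (value, record index) pairs sorted by value."""
    k = len(records[0])
    return [sorted((r[j], i) for i, r in enumerate(records)) for j in range(k)]

def projection_search(records, tables, query):
    """Search the attribute whose range admits the fewest records."""
    best = None
    for j, (lo, hi) in enumerate(query):
        # Sentinel indices -1 and len(records) bracket all real indices.
        start = bisect_left(tables[j], (lo, -1))
        stop = bisect_right(tables[j], (hi, len(records)))
        if best is None or stop - start < best[1] - best[0]:
            best = (start, stop, j)
    start, stop, j = best
    candidates = [records[i] for _, i in tables[j][start:stop]]
    # Candidates satisfy one range; test them against all k ranges.
    return brute_force(candidates, query)
```

When one range admits only a handful of records, the final filter touches very few candidates; when every range admits most of the file, the work approaches that of brute force.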
3. Implementations

In Section 2 we discussed the various structures for range searching in a more or less abstract way without regard to implementation. We now turn our attention to how one implements these structures efficiently on real computers.

3.1 Internal Memory

If the file is small enough so that it can be contained in the internal memory of the computer then implementation of these structures is straightforward. The brute force structure is implemented as a two dimensional (N x k) array. For projection one has k tables of pointers to records; each table is sorted on a different coordinate.

As discussed in Section 2.3, there are two possible ways to associate records with cells when implementing the grid method. If the points are uniformly distributed in a more or less rectangular area (so that there are few empty cells) then the grid can be represented as a multidimensional array. If there are many empty cells then the k attribute values defining a cell can be treated collectively as a key, and a well known search method such as binary searching or hashing can be employed.

The k-d tree can be implemented as any other binary tree; see Knuth [1973] or Bentley [1975b]. It is easy to store for each node a pair of pointers defining the subcollection of the records associated with the node (the descendants of that node). This facilitates enumeration of all records below that node without traversing the descendants of that node (if this is the case for the query, that is, if all records below that node satisfy the query).
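The pair of pointers per node can be sketched as follows. This Python is a minimal illustration under stated assumptions, not the implementation of Knuth [1973] or Bentley [1975b]: the records are permuted in place so that every node owns a contiguous slice `points[lo:hi]`, and each node also stores the bounding box of its slice (an addition made here so that the "all records below this node satisfy the query" test is simple to state). The names `build`, `search`, and `leaf_size` are hypothetical.

```python
def build(points, lo, hi, depth=0, leaf_size=2):
    """Arrange points[lo:hi] in place; return the node owning that slice."""
    k = len(points[0])
    box = [(min(p[j] for p in points[lo:hi]),
            max(p[j] for p in points[lo:hi])) for j in range(k)]
    node = {"lo": lo, "hi": hi, "box": box, "left": None, "right": None}
    if hi - lo > leaf_size:
        axis = depth % k  # cycle through the k attributes
        points[lo:hi] = sorted(points[lo:hi], key=lambda p: p[axis])
        mid = (lo + hi) // 2
        node["left"] = build(points, lo, mid, depth + 1, leaf_size)
        node["right"] = build(points, mid, hi, depth + 1, leaf_size)
    return node

def inside(p, query):
    return all(qlo <= x <= qhi for x, (qlo, qhi) in zip(p, query))

def search(points, node, query, out):
    """Append to out every record of the node's slice that lies in query."""
    box = node["box"]
    if any(bhi < qlo or blo > qhi
           for (blo, bhi), (qlo, qhi) in zip(box, query)):
        return  # slice is disjoint from the query: prune
    if all(qlo <= blo and bhi <= qhi
           for (blo, bhi), (qlo, qhi) in zip(box, query)):
        # Whole subtree qualifies: enumerate the slice, no traversal needed.
        out.extend(points[node["lo"]:node["hi"]])
        return
    if node["left"] is None:
        out.extend(p for p in points[node["lo"]:node["hi"]] if inside(p, query))
        return
    search(points, node["left"], query, out)
    search(points, node["right"], query, out)
```

Because a subtree's records are contiguous, reporting a fully covered subtree costs time proportional to the number of records reported rather than to the number of nodes beneath it.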
3.2 Disk

Implementing these structures on random access disks is only slightly less straightforward than on central memory. For the most part disk addresses simply replace memory addresses. For brute force one simply performs a sequential scan of the records. With projection the sorted lists contain disk pointers to the corresponding records. The lists for each attribute can themselves be stored on the disk, and only one list at a time need reside in central memory. With the cell technique records are stored on disk, with all records in a cell stored on the same disk page. The hash tables themselves contain pointers to the disk address of the cell and reside in central memory. Only pages containing those cells overlapping the query rectangle need be read into central memory.
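As a rough illustration of the cell technique on disk, the sketch below keeps a directory (a Python dict standing in for the in-memory hash table) that maps each nonempty cell to a page number, and a list of pages standing in for the disk. It assumes square cells of side `size` and that every cell fits on one page; these assumptions and all names are introduced for the example only.

```python
from itertools import product
from math import floor

def cell_of(point, size):
    """Cell coordinates of a point for a grid of square cells of side size."""
    return tuple(floor(x / size) for x in point)

def build_cells(records, size):
    directory = {}  # held in central memory: cell -> page number
    pages = []      # stands in for the disk: one page per nonempty cell
    for r in records:
        c = cell_of(r, size)
        if c not in directory:
            directory[c] = len(pages)
            pages.append([])
        pages[directory[c]].append(r)
    return directory, pages

def range_query(directory, pages, size, query):
    """Return (matching records, number of pages read)."""
    ranges = [range(floor(lo / size), floor(hi / size) + 1)
              for lo, hi in query]
    hits, pages_read = [], 0
    for c in product(*ranges):  # every cell overlapping the query rectangle
        page_no = directory.get(c)
        if page_no is None:
            continue  # empty cell: no disk access at all
        pages_read += 1
        hits.extend(r for r in pages[page_no]
                    if all(lo <= x <= hi for x, (lo, hi) in zip(r, query)))
    return hits, pages_read
```

The count returned alongside the hits shows the point made above: only pages for nonempty cells that overlap the query rectangle are touched.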
Tree structures lend themselves nicely to implementation on random access disks; this is discussed by Knuth [1973, p. 472]. Figure 3.1 shows how the nodes of the tree can be grouped (as shown by the dotted lines) onto disk pages. The size of each page is chosen as some convenient unit of disk memory (such as a track or sector). While the tree is searched in the usual manner, only a few pages at a time need reside in central memory. If the records satisfying the query represent a small fraction of the file, then on the order of lg (N/b) disk accesses are required, where b is the number of records per page. Bentley [1978] describes this implementation in more detail, and Williams et al. [1975] have actually implemented k-d trees for range searching on a random access disk system.

Figure 3.1. Disk pages denoted by dashed lines.
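To give a feel for the lg (N/b) estimate, the following short Python sketch evaluates it for sample file sizes. It assumes that lg denotes the base-2 logarithm and ignores the constant factor hidden in "on the order of"; the function name and the sample values of N and b are illustrative, not figures from the paper.

```python
from math import log2

def approx_disk_accesses(n_records, records_per_page):
    """Order-of-magnitude estimate lg(N/b) for a small-answer range query."""
    return log2(n_records / records_per_page)

# A file of one million records with 100 records per page:
estimate = approx_disk_accesses(1_000_000, 100)
print(round(estimate, 1))
```

Under these assumptions a million-record file needs on the order of a dozen page reads for a query with a small answer, compared with the 10,000 page reads of a sequential scan.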
3.3 Tape

By its nature magnetic tape is a sequential storage medium, and therefore ideal for the brute force approach. Even within this limitation, though, it is possible to employ to advantage the other range searching methods described above.

In order to read a record from a magnetic tape it is necessary to pass over all records from the beginning of the tape to it. It is not necessary, however, to read all of those records into central memory or even to transmit them from the controller to the channel. On most computing systems it is possible to issue instructions to skip one or several blocks without transmitting any data. Although the real time to read a tape is nearly the same whether blocks are skipped or read, the CPU requirement, memory interference, and channel activity can be substantially reduced. This is important in a multiprogramming environment.

The abstract projection method of Section 2.2 calls for storing the set of records in k sorted lists, each sorted by a different key. Magnetic tape is an ideal medium for storing sorted lists--just store each of the k lists sequentially on the tape. Some mechanism is needed for deciding which list to search when answering a particular query. When all sorted lists were in main memory this was accomplished by inspecting each; on tape one can store a sample of each of the k sorted lists before storing any list in its entirety. This tape layout is illustrated in Figure 3.2. To answer a particular query one counts how many sample records it overlaps in each of the k samples, and then searches that sorted list which has fewest intersections. The search therefore skips over all records until arriving at the desired sorted list, and in addition skips over records in that list until finding a record within the desired range. At that point it starts reading all records and testing to see whether they satisfy all k ranges.

[Figure 3.2 layout, in tape order: sample sorted by all k coordinates; records sorted by first key; records sorted by second key; ...; records sorted by k-th key.]
Figure 3.2. Tape layout for projection.

The cell method is implemented similarly. Here the directory comprises the first few blocks of the tape with the data following, arranged so that points within each cell comprise one block of data. The cells overlapping the query are determined from the directory, and then those cells are read sequentially from the tape, skipping unwanted blocks. This tape layout is illustrated in Figure 3.3.

[Figure 3.3 layout, in tape order: cell directory; records in cell 1; records in cell 2; ...]
Figure 3.3. Tape layout for cells.

The hierarchical nature of tree searching allows for a natural implementation on a sequential storage medium such as magnetic tape. The nodes of the tree are stored on the tape in the order of a preorder (node, left son, right son) traversal of the tree. Each node comprises a record on the tape. The terminal nodes are the data blocks. Associated with each node is the number, D, of its descendants. Figure 3.4 illustrates this method; Figure 3.4.a is the abstract tree and Figure 3.4.b is its implementation on tape. Notice that the nodes of the tree appear in "preorder" on the tape: a node appears before all of its descendants, and all descendants of the left son appear before the right son and all descendants of the right son.

With this arrangement the tree search can proceed directly from the tape. A search starts at the root (beginning of the tape). At each node visited in the tree search a determination is made as to whether it is necessary to search one or both of its sons; the outcome of this test yields three cases. The easiest is when both sons are to be visited--continue reading the tape. If only the right son is to be visited, this is also easy--skip the number of blocks occupied by the left subtree. The case of visiting only the left subtree is slightly more complicated--stack the number of records in the right subtree, and when control is returned to this node skip that many blocks. With this method the number of blocks read into main memory is equal to the number of nodes visited in the tree search.

This technique can be applied to a wide variety of trees, and the same behavior will be obtained. In particular this method can be used with the range trees of Section 2.5, in that magnetic tape can often accommodate their large storage requirements.

The process of searching is depicted in the tree of Figure 3.4.a. Node 1 is visited, the search "prunes" the left son (number 2) from consideration and visits the right son (number 5), and then must recursively search both of that node's sons (numbers 6 and 7); at that point external nodes e and h are pruned, and nodes f and g are visited. This same search is described on tape in Figure 3.4.b; nodes which are visited are underlined by solid lines and those which are skipped are underlined by dashed lines. After node 1 is read it is determined that the left son can be pruned, so the seven records following it are skipped and the search proceeds to node 5. Notice that the nodes underlined by solid lines on the tape are exactly those visited by the search in the abstract tree.

[Figure 3.4 panels: a.) Binary tree; b.) Corresponding tape.]

Figure 3.4. Tape layout for trees.
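The three cases of the tape search can be sketched in Python as follows. This is an illustrative model, not the authors' implementation: the tape is a Python list of records in preorder, a k-d tree is used as the example tree, and each internal record stores, besides the descendant count D described above, the number of blocks in its left subtree (`left_blocks`), which is an assumption added here so that the left subtree can be skipped without reading it. The recursion stack plays the role of the explicit stack mentioned in the text.

```python
def build_tape(points, depth=0, block_size=2):
    """Return the subtree for points as a list of tape records in preorder."""
    if len(points) <= block_size:
        return [{"leaf": True, "block": list(points), "D": 0}]
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    left = build_tape(pts[:mid], depth + 1, block_size)   # values <= cut
    right = build_tape(pts[mid:], depth + 1, block_size)  # values >= cut
    node = {"leaf": False, "axis": axis, "cut": pts[mid][axis],
            "D": len(left) + len(right), "left_blocks": len(left)}
    return [node] + left + right

def search_tape(tape, query):
    """Return (matching records, number of blocks read into memory)."""
    hits, reads = [], 0

    def visit(pos):
        """Read the record at pos; return the position just past its subtree."""
        nonlocal reads
        rec = tape[pos]
        reads += 1
        if rec["leaf"]:
            hits.extend(p for p in rec["block"]
                        if all(lo <= x <= hi for x, (lo, hi) in zip(p, query)))
            return pos + 1
        lo, hi = query[rec["axis"]]
        nxt = pos + 1
        if lo <= rec["cut"]:
            nxt = visit(nxt)                        # keep reading the tape
        else:
            nxt += rec["left_blocks"]               # skip the left subtree
        if hi >= rec["cut"]:
            nxt = visit(nxt)
        else:
            nxt += rec["D"] - rec["left_blocks"]    # skip the right subtree
        return nxt

    visit(0)
    return hits, reads
```

The second value returned counts only records actually transferred, which matches the observation above that the number of blocks read equals the number of nodes visited.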
4. Further Work

Our discussion of range searching so far has in many respects just scratched the surface, and there are many avenues open for further research.

All files that we have discussed have been static, that is, unchanging. Many applications require dynamic structures, in which insertions and deletions can be made. Dynamic versions of brute force, projection, and cell structures are easily obtained. Dynamic k-d trees are discussed by Bentley [1975b] and Bentley [1978]. Considerable work remains to be done in the dynamic analysis of all of these structures.

Considerable research also remains in the development of heuristics for aiding the design of these search methods. For example, if in a seven dimensional range search problem the queries almost always involve only two of the attributes, then the search structure should involve only these two attributes. Heuristics for detecting these and other similar situations would be very helpful. Bentley and Burkhard [1976] might prove useful in such an investigation.
5. Conclusions

The problem of range searching arises in many database applications. In Section 1 of this paper we mentioned some of those applications and defined an "abstract" problem which models the real problems. In Section 2 we used the techniques of "algorithm design and analysis" to describe and analyze a number of data structures for range searching; these structures are interesting from a theoretical viewpoint. In Section 3 we saw how these abstract structures can be efficiently implemented on a number of different storage media, showing that the structures are also practical. Avenues open for further research were mentioned in Section 4.

In 1973 Knuth [1973, p. 554] was able to write that "no really nice data structures seem to exist" for the problem of range searching. In this paper we have tried to show that this situation has changed in the interim, and that these changes can have a substantial impact on database systems.
Bibliography

Bentley, J. L. [1975a]. A survey of techniques for fixed radius near neighbor searching, Stanford Linear Accelerator Center Report SLAC-186, August 1975, 33 pp.

Bentley, J. L. [1975b]. "Multidimensional binary search trees used for associative searching," Communications of the ACM 18, 9, September 1975, pp. 509-517.

Bentley, J. L. [1976]. Divide and conquer algorithms for closest-point problems in multidimensional space. Ph.D. Thesis, University of North Carolina, Chapel Hill, North Carolina, 101 pp.

Bentley, J. L. [1977]. Decomposable searching problems, extended abstract, Carnegie-Mellon University Computer Science Department.

Bentley, J. L. [1978]. "Multidimensional binary search trees in database applications," to appear in the Second Computer Software and Applications Conference Proceedings.

Bentley, J. L. and W. A. Burkhard [1976]. "Heuristics for partial match retrieval data base design," Information Processing Letters 4, 5, February 1976, pp. 132-135.

Bentley, J. L. and H. A. Maurer [1978]. "Efficient worst-case data structures for range searching," submitted for publication.

Bentley, J. L. and M. I. Shamos [1976]. "Divide and conquer in multidimensional space," Proceedings of the Eighth Symposium on the Theory of Computing, ACM, May 1976, pp. 220-230.

Bentley, J. L. and M. I. Shamos [1977]. "A problem in multivariate statistics: algorithm, data structure, and applications," Proceedings of the Fifteenth Allerton Conference on Communication, Control and Computing, pp. 193-201.

Bentley, J. L. and D. F. Stanat [1975]. "Analysis of range searches in quad trees," Information Processing Letters 3, 6, July 1975, pp. 170-173.

Bentley, J. L., D. F. Stanat, and E. H. Williams, Jr. [1977]. "The complexity of near neighbor searching," Information Processing Letters 6, 6, December 1977, pp. 209-212.

Dobkin, D. and R. J. Lipton [1976]. "Multidimensional searching problems," SIAM Journal of Computing 5, 2, pp. 181-186.

Finkel, R. A. and J. L. Bentley [1974]. "Quad trees--a data structure for retrieval on composite keys," Acta Informatica 4, 1, pp. 1-9.

Friedman, J. H., F. Baskett, and L. J. Shustek [1975]. "An algorithm for finding nearest neighbors," IEEE Transactions on Computers C-24, 10, October 1975, pp. 1000-1006.

Friedman, J. H., J. L. Bentley, and R. A. Finkel [1977]. "An algorithm for finding best matches in logarithmic time," ACM Transactions on Mathematical Software 3, 3, September 1977, pp. 209-226.

Knuth, D. E. [1973]. The Art of Computer Programming, vol. 3, Sorting and Searching, Addison-Wesley, Reading, Massachusetts.

Kung, H. T., F. Luccio, and F. P. Preparata [1975]. "On finding the maxima of a set of vectors," Journal of the ACM 22, 4, October 1975, pp. 469-476.

Lee, D. T. and C. K. Wong [1978]. "Worst-case analysis for region and partial region searches in multidimensional binary search trees and quad trees," Acta Informatica 9, 1, pp. 23-29.

Lee, R. C. T., Y. H. Chin, and S. C. Chang [1975]. "Application of principal component analysis to multi-key searching," IEEE Transactions on Software Engineering SE-2, 3, September 1976, pp. 185-193.

Rabin, M. O. [1976]. "Probabilistic algorithms," in Algorithms and complexity: New directions and recent results, J. F. Traub (ed.), Academic Press, pp. 21-39.

Williams, E. H. Jr. et al. [1976]. PABST Program Logic Manual. Unpublished class project, University of North Carolina, Chapel Hill, North Carolina.

Yuval, G. [1975]. "Finding near neighbors in k-dimensional space," Information Processing Letters 3, 4, March 1975, pp. 113-114.