A SURVEY OF ALGORITHMS AND DATA STRUCTURES FOR ...

SLAC-PUB-2189 August 1978

A SURVEY OF ALGORITHMS AND DATA STRUCTURES FOR RANGE SEARCHING*

Jon Louis Bentley
Departments of Computer Science and Mathematics
Carnegie-Mellon University
Pittsburgh, Pennsylvania 15213

Jerome H. Friedman
Computation Research Group
Stanford Linear Accelerator Center
Stanford, California 94305

ABSTRACT

An important problem in database systems is answering queries quickly. This paper surveys a number of algorithms for efficiently answering range queries. First a set of "logical structures" is described and then their implementation in primary and secondary memories is discussed. The algorithms included are of both "practical" and "theoretical" interest. Although some new results are presented, the primary purpose of this paper is to collect together the known results on range searching and to present them in a common terminology.

(Submitted to ACM Transactions on Database Systems)

*Work partially supported by Department of Energy.

Table of Contents

1. Introduction
2. Logical Structures
   2.1 Brute Force
   2.2 Projection
   2.3 Cells
   2.4 k-d Trees
   2.5 Range Trees
   2.6 k-ranges
   2.7 Other Structures
   2.8 Comparison of Methods
3. Implementations
   3.1 Internal Memory
   3.2 Disk
   3.3 Tape
4. Further Work
5. Conclusions

1. Introduction

Researchers in database systems have recently identified and investigated many fundamental areas of study in their field; among these are issues such as database security, reliability, and integrity. One area which has not received much attention, however, is that of algorithmic efficiency, which is the study of "best possible" algorithms and data structures for answering different kinds of queries. In this paper we apply the tools of algorithm design and analysis to database problems by examining algorithms and data structures for answering a particular type of query.

We need some definitions to describe this searching problem. A file is a collection of records, each containing several attributes or keys. A query asks for all records satisfying certain characteristics. An orthogonal range query asks for all records with key values each within specified ranges. The process of retrieving the appropriate records is called range searching.

The problem of range searching can be cast in geometric terms. One can regard the record attributes as coordinates, and the k values for each record as representing a point in a k-dimensional coordinate space. The intersection of the query ranges can be represented as a k-dimensional hyperrectangle in this space. The problem of range searching is then to find all points lying inside this hyperrectangle. We will often cast range searching in this geometric framework as an aid to intuition.

Range searching arises in many applications. A university administrator may wish to know those students whose age is between 21 and 24 years and whose grade point average is greater than 3.5. In a geographic database of U.S. cities one might seek a list of all those for which the latitude is between 37° and 41° and the longitude between 102° and 109° (defining the state of Colorado). In statistics range searching can be employed to determine the empirical probability content of a hyperrectangle, to determine empirical cumulative distributions, and to perform density estimation. In data analysis it is often useful to separate (or contrast) analyses on sets of data lying in different regions (ranges) of the observation space and then compare the respective results. (At the Stanford Linear Accelerator Center, for example, over ten hours per week of IBM 370/168 time is devoted to this application.)

In this paper we survey the various algorithms and data structures for range searching. In Section 2 we study the "logical" structures and then turn to their implementations in Section 3. Directions for further work and conclusions are offered in Sections 4 and 5. Because this is a survey, we have omitted the more mathematical analyses of the various structures in favor of presenting a more intuitive description.

Readers interested in the analyses are referred to the works in which they appear.

There are several problems closely related to range searching on which there has been considerable research. Bentley [1975a] and Bentley, Stanat, and Williams [1977] discuss the problem of finding all points within a fixed radius of a given point. Yuval [1975] discusses the special case of this problem for the $L_\infty$ metric. Friedman, Bentley, and Finkel [1977] investigate the problem of finding the k nearest neighbors of a point in a file of N points. Bentley [1976] discusses the problem of finding the nearest neighbor to each of the N points in the file. Domination problems are closely related to range searching; a point is said to dominate another if all of its coordinates are larger. Kung, Luccio, and Preparata [1975] discuss the determination of whether a given point is dominated by any other point. Bentley and Shamos [1977] investigate the problem of how many points a given point dominates, which is the calculation of the empirical cumulative distribution evaluated at the point. In the future, these methods might be usefully applied to the problem of range searching.
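The domination relation mentioned above is easy to state in code. The following is a minimal sketch, not taken from any of the cited works; the function names `dominates` and `domination_count` are hypothetical, and "larger" is read as strictly larger in every coordinate.

```python
from typing import Iterable, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """True if every coordinate of p is strictly larger than that of q."""
    return all(a > b for a, b in zip(p, q))


def domination_count(p: Point, points: Iterable[Point]) -> int:
    """Number of points that p dominates (a naive O(Nk) scan).

    Dividing this count by N gives the empirical cumulative
    distribution evaluated at p.
    """
    return sum(1 for q in points if dominates(p, q))
```

This scan is only the definition made executable; the structures cited in the text aim to answer the same question faster.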

2. Logical Structures

In this section we discuss the various methods for range searching in terms of their logical structures; that is, the logical structure of the data at the level of "adjacency" and "pointers" without regard to implementation. In Section 3 we will study the problem of how one implements these logical structures on specific storage media.

A search method is specified by a data structure for storing the data and algorithms for building (which we call preprocessing) and searching the structure. There may also be various utility operations such as insertion and deletion. One analyzes a search structure (say S) by giving three cost functions: 1) the cost of preprocessing N points in k-space, $P_S(N,k)$; 2) the storage required, $S_S(N,k)$; and 3) the search time or query cost, $Q_S(N,k)$. These costs can be analyzed in terms of their average or their worst-case. We will usually speak of the worst-case cost, explicitly mentioning the average cost whenever we employ it.

2.1 Brute Force

The simplest approach to range searching is to store each of the N points in a sequential list. As each query arrives all members of the list are scanned and all records that satisfy the query are enumerated. If the queries do not have to be handled immediately then they can be batched so that many queries can be processed with one sequential pass through the file. It is easy to see that the brute force structure, B, possesses the properties

$P_B(N,k) = O(Nk)$, $S_B(N,k) = O(Nk)$, and $Q_B(N,k) = O(Nk)$.

Brute force searching has the advantage of being trivial to implement on any storage medium. It is competitive with the more sophisticated methods described below when the file is small and the number of attributes is large; or when a large fraction of the records in the file satisfy the query (or queries, if they are batched).
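As an illustration of the sequential scan and of batching, here is a minimal sketch. It assumes records are tuples of k numeric keys and that a query is a sequence of inclusive (low, high) pairs, one per attribute; the names `in_range`, `brute_force_search`, and `batched_search` are hypothetical and not from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Query = Sequence[Tuple[float, float]]  # one inclusive (low, high) pair per attribute


def in_range(point: Point, query: Query) -> bool:
    """True if every key of the record lies within its query range."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, query))


def brute_force_search(points: List[Point], query: Query) -> List[Point]:
    """Scan every record once: O(Nk) work per query, no preprocessing."""
    return [p for p in points if in_range(p, query)]


def batched_search(points: List[Point], queries: List[Query]) -> List[List[Point]]:
    """Answer many queries with a single sequential pass through the file."""
    answers: List[List[Point]] = [[] for _ in queries]
    for p in points:                       # the file is read exactly once
        for i, q in enumerate(queries):
            if in_range(p, q):
                answers[i].append(p)
    return answers
```

For example, the student query from the Introduction (age between 21 and 24, grade point average above 3.5) could be written as `brute_force_search(students, [(21, 24), (3.5, 4.0)])`, assuming a 4.0 grading scale and treating the lower grade bound as inclusive.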

2.2 Projection

The projection technique is referred to as inverted lists by Knuth [1973]. This technique was applied by Friedman, Baskett, and Shustek [1976] in their solution of the nearest neighbor problem, and by Lee, Chin, and Chang [1976] to a number of database problems. Projection involves keeping, for each attribute, a sequence of the records in the file sorted by that attribute. One can view this geometrically as a projection of the points on each coordinate. The k lists representing the projections can be obtained by using a standard sorting algorithm k times. After preprocessing, a range query can be answered by the following search procedure: choose one of the attributes, say the i-th. Look up the two positions in the i-th sequence (using a binary search) of the extreme values defining the range on the i-th attribute of the query. All records satisfying the query will be in the list between these two positions. This smaller list is then searched by brute force.

The projection technique is illustrated in Figure 2.1. The points represent a set of sixteen records of two keys each, represented by x and y coordinates. The dashed lines are the projection of the points onto the x coordinate (that is, the records sorted into x order). The vertical slab is the x-range of the query, the horizontal slab is the y-range, and the rectangle which is their intersection contains just those records which satisfy the query. To answer this query we need only investigate the six points which are inside the vertical slab, marked by the 45° lines.

Figure 2.1. Illustration of projection.

One can apply the projection technique with only one sorted list. If the distributions of values of the various attributes are more or less uniform over similar ranges and the query ranges of each attribute are similar, then one list is sufficient. If not, then it can pay to keep sorted sequences on all k attributes. The positions of the corresponding extremes of the query range are found in each of the k lists. The list for which the difference in positions is smallest is searched between the two positions.

Analysis of the projection technique, P, for nearest neighbor searching is reported in Friedman, Baskett, and Shustek [1976]. Most of this analysis carries over directly to the problem of range searching. It is clear that

$P_P(N,k) = O(kN \log N)$, and $S_P(N,k) = O(kN)$.

For searches that find a small number of records (and are therefore similar to near neighbor searches) one has

$Q_P(N,k) = O(N^{1-1/k})$ [Average Case].

The projection technique is most effective when the number of records satisfying each query is usually close to zero.
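The search procedure of this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation analyzed in the cited work: the class name `Projection` and its methods are hypothetical, records are tuples of k numeric keys, and query ranges are inclusive (low, high) pairs. It keeps sorted sequences on all k attributes and searches the one whose candidate slab is smallest.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Query = Sequence[Tuple[float, float]]  # inclusive (low, high) per attribute


class Projection:
    def __init__(self, points: List[Point]) -> None:
        self.points = list(points)
        k = len(self.points[0]) if self.points else 0
        # order[j]: record indices sorted by attribute j (k sorts in total)
        self.order = [
            sorted(range(len(self.points)), key=lambda i, j=j: self.points[i][j])
            for j in range(k)
        ]
        # keys[j]: the attribute-j values in that same sorted order
        self.keys = [[self.points[i][j] for i in self.order[j]] for j in range(k)]

    def search(self, query: Query) -> List[Point]:
        if not self.points:
            return []
        best = None
        for j, (lo, hi) in enumerate(query):
            a = bisect_left(self.keys[j], lo)    # position of the low extreme
            b = bisect_right(self.keys[j], hi)   # position past the high extreme
            if best is None or b - a < best[2] - best[1]:
                best = (j, a, b)
        j, a, b = best
        # Brute-force check of the records between the two positions.
        return [
            self.points[i]
            for i in self.order[j][a:b]
            if all(lo <= x <= hi for x, (lo, hi) in zip(self.points[i], query))
        ]
```

Results come back in the sort order of whichever attribute was chosen, so callers that need a fixed order should sort the answer themselves.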

2.3 Cells

    There are two ways they can search [for the murder weapon]: from the body outward in a spiral, or divide the room up into squares--that's the grid method.[1]

Cartographers as well as detectives use the grid (or cell) method: street maps of metropolitan areas are often printed in the form of books. The first page of the book shows the entire area and the remaining pages are detailed maps of (say) one-mile-square regions. To find (for example) all schools in a specified rectangle one would look in the first page to find which squares overlap the rectangle and then check only on those pages to find the schools. This approach can be mechanized immediately. A square of the map corresponds to a cell in k-space, and the points of the file within the cell are stored as a linked list. The first page of the map book corresponds to a directory which allows one to take a hyperrectangle and look up the set of cells. Knuth [1973] has discussed this scheme for the two-dimensional case. Levinthal [1966] used a cell technique in three-dimensional Euclidean space for determining all atoms within five angstroms of every atom in a protein molecule--he referred to this as "cubing". Yuval [1975] and Rabin [1976] apply an overlapping cell structure to the closest-pair problem.

[1] From the CBS series Kojak, "Death Is Not a Passing Grade".

The directory can be implemented in two ways. If the points are (say) uniformly distributed on $[0,10]^2$ and we have chosen 1 x 1 cells, then we can use a two-dimensional array as the directory. In DIRECT(i, j) we would keep a pointer to a list of all points in the cell [i, i+1] x [j, j+1]. If we then wanted to find all points in [5.2, 6.3] x [1.2, 3.4] then we would only have to examine cells (5,1), (5,2), (5,3), (6,1), (6,2), (6,3).

The multidimensional array works very well when the points are known a priori all to be in some given rectangle. When this is not known to be the case one would probably use a search method such as hashing for the directory. In this method we name each cell as before, so cell (i, j) is a pointer to the points in [i, i+1] x [j, j+1]. Instead of storing all cells, however, we store only cells which contain points of the file. To process a query we "decode" the rectangle into a set of cell id's, look up those id's, and check the points in the occupied cells for inclusion in the rectangle. The storage required for the cell technique is the storage for the directory plus the storage locations for the linked list representing points in cells; the size of the directory is usually much smaller than N.

The cell technique is illustrated in Figure 2.2. The sixteen points in that figure represent sixteen records containing two keys each. The points in each cell are stored together in an implementation. The query is given by the rectangle in the upper part of the figure, and to answer it only those points in the four dashed cells need be investigated.

Figure 2.2. Illustration of the cell technique.

Basic parameters of the cell technique are the size and shape of each cell. In analyzing a search there are two costs to count: cell accesses (the number of directory look-ups) and inclusion tests (testing whether a point satisfies the query). If the cell size is very large, then there will be few cell accesses and many inclusion tests. If the cells are extremely small, on the other hand, then there will be very many cell accesses and very few inclusion tests. Clearly either extreme is bad.

The best cell size and shape depends upon the size and shape of the query hyperrectangle. Bentley, Stanat, and Williams [1977] show that if the query hyperrectangles have constant size and shape, so that only their location (in the coordinate space) is unspecified, then for a single grid of cells a nearly optimum size and shape for the cells are the same as that for the query hyperrectangle. For this case the number of cells accessed is $2^k$ and the expected search time is proportional to $2^k$ times the number of points in the range. In this context the performance of cells is given by

$P_C(N,k) = O(Nk)$, $S_C(N,k) = O(Nk)$, and $Q_C(N,k) = O(2^k F)$ [Average]

where F is the number of records found. In most applications the queries will vary in size and shape as well as their location, so that there is little information available for making a good choice of cell size and shape.
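The hashed directory described in this section might be sketched as below. This is a minimal sketch under stated assumptions: cells are cubes of a single side length `cell_size`, a Python dict stands in for the hash directory, and the names `CellDirectory`, `decode`, and `search` are hypothetical. Only occupied cells are stored.

```python
from itertools import product
from math import floor
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Query = Sequence[Tuple[float, float]]  # inclusive (low, high) per coordinate
CellId = Tuple[int, ...]


class CellDirectory:
    def __init__(self, points: List[Point], cell_size: float = 1.0) -> None:
        self.cell_size = cell_size
        self.cells: Dict[CellId, List[Point]] = {}   # occupied cells only
        for p in points:
            self.cells.setdefault(self._cell_id(p), []).append(p)

    def _cell_id(self, p: Point) -> CellId:
        return tuple(floor(x / self.cell_size) for x in p)

    def decode(self, query: Query) -> List[CellId]:
        """Turn a query hyperrectangle into the ids of the cells it overlaps."""
        spans = [
            range(floor(lo / self.cell_size), floor(hi / self.cell_size) + 1)
            for lo, hi in query
        ]
        return list(product(*spans))

    def search(self, query: Query) -> List[Point]:
        found: List[Point] = []
        for cid in self.decode(query):               # one cell access per id
            for p in self.cells.get(cid, ()):        # inclusion tests
                if all(lo <= x <= hi for x, (lo, hi) in zip(p, query)):
                    found.append(p)
        return found
```

With 1 x 1 cells, decoding the query [5.2, 6.3] x [1.2, 3.4] yields exactly the six cells listed in the text. Note that `decode` enumerates every overlapped cell, so a large query over very small cells costs many directory look-ups, which is the trade-off discussed above.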

2.4 k-d Trees

This data structure was introduced by Bentley [1975b]. The application of k-d trees to database problems is discussed by Bentley [1978]. Friedman, Bentley, and Finkel [1977] introduced adaptive k-d trees for the problem of nearest neighbor searching and showed that this structure is very effective for that application. The cell pattern induced by k-d trees has the property that it adapts to the distribution of the points in k-space. This overcomes the problem of empty cells which severely limits the performance of regular grids. The k-d tree has the effect of dividing the k-space into a collection of irregular hyperrectangles, all of which are approximately cubical and all contain nearly the same number of points.

The k-d tree is a generalization of the binary tree used for sorting and searching. The k-d tree is a binary tree in which each node represents a subcollection of the points in the space and a partitioning of that subcollection. The root of the tree represents the entire collection. Each nonterminal node has two sons; these son nodes represent the two subcollections defined by the partitioning. The terminal nodes represent mutually exclusive small subsets of the points, which collectively form a partition of k-space. These terminal subsets are called buckets.

In the case of one-dimensional searching, a point is represented by a single value and a partition is defined by some value of that coordinate. All records in a subcollection with key values less than or equal to the partition value belong to the left son while those with a larger value belong to the right son. That coordinate is thus a discriminator for assigning records to the two subcollections. In k-space a point is represented by k coordinate values. Any one of these can serve as a discriminator for partitioning the subcollection represented by a particular node in the tree; that is, the discriminator can range from 1 to k.

The prescription for constructing an adaptive k-d tree is to choose at each node the coordinate j for which the spread of attribute values (as measured by any convenient statistic) is maximum for the subcollection represented by the node. The partitioning value is chosen to be the median value of this attribute. This prescription is then applied recursively to the two subcollections represented by the two sons of the node just partitioned. The partitioning is stopped, creating a terminal node (or bucket), when the cardinality of the subcollection is less than a prespecified maximum, which is a parameter of the procedure. Friedman, Bentley, and Finkel [1977] found empirically that values ranging from 8 to 16 work well for nearest neighbor searching. The result of this procedure is that the k-space is divided into a number of buckets, each containing approximately the same number of points and each approximately "cubical" in shape (by the choice of discriminator).

Range searching with k-d trees is straightforward. Starting at the root the k-d tree is searched recursively in the following manner. When visiting a node that discriminates by the j-th key (which we call a j-discriminator) one compares the j-th range of the query with the discriminator value of that node. If the query range is totally above (respectively below) the node's value then one need only search the right subtree (respectively left) of that node; the other son can be pruned from the search because it does not satisfy the query in the j-th key. If the query range overlaps the node's value (that is, the key's value is between the low and high bounds of the range), then both sons need be searched. This can be accomplished by searching both sons recursively, making use of a stack.

A modification can increase the speed of this basic algorithm if it is suspected that many large subtrees will be contained entirely within the query region. The "bounds-array" technique described by Friedman, Bentley, and Finkel [1977] can be employed to detect that a subtree is entirely within a region, and the points in that subtree can be listed without the overhead of the tree traversal.

The application of k-d trees to range searching is illustrated in Figure 2.3. The k-d tree is depicted in two ways; Figure 2.3.a shows the structure in 2-space and Figure 2.3.b shows the abstract tree. The query rectangle is illustrated in both figures. The root of the tree is internal node A; it is an x-discriminator, and the discriminating line is the vertical line labelled A. That is, every point to the left of that vertical line is in the left subtree of A (which is B), and every point to the right is in the right subtree of A (which is C). This partitioning continues recursively, and the cells in this tree each contain two points. The search for all points within the rectangle starts at the root, and since the query rectangle is entirely to the right of the vertical line defined by A the left subtree (which is B) can be pruned from the search. This is illustrated in Figure 2.3.b by the perpendicular line through the son link from A to B. The search continues, visiting both sons of C, both sons of F, and only the left son of G. A total of three buckets are searched; these buckets are dashed in the planar representation and are marked by an S in the tree representation.

Figure 2.3. Illustration of k-d trees. a.) Planar representation. b.) Tree representation.

Analysis of k-d trees for range searching has been considered by several researchers. The work required to construct a k-d tree and its storage requirements are

$P_K(N,k) = O(N \log N)$, and $S_K(N,k) = O(Nk)$.

The search cost depends upon the nature of the query. In the very worst case Lee and Wong [1976] show that

$Q_K(N,k) = O(N^{1-1/k})$ [Worst Case].

If the number of records that satisfies the query is small so that the range query is similar to a nearest neighbor search then one has from Friedman, Bentley, and Finkel [1977] that

$Q_K(N,k) = O(\log N + F)$ [Average Case for small answer]

where F is the number of points found in the region. For the case where a large fraction of the file satisfies the query Bentley and Stanat [1975] show that

$Q_K(N,k) = O(F)$ [Average Case for large answer].

The k-d tree structure is most effective in situations where little is known about the nature of the queries or a wide variety of queries are expected. They are also useful if other types of queries (in addition to range queries) are anticipated.
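The construction prescription and the pruned recursive search of this section can be sketched as follows. This is an illustrative sketch, not the authors' code: the names `KDNode`, `build_kd`, and `search_kd` are hypothetical, spread is measured by max minus min (one "convenient statistic" among many), partitioning stops at `bucket_size` or fewer points, and the build re-sorts at each level for brevity, so it does not achieve the O(N log N) preprocessing bound quoted above.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Query = Sequence[Tuple[float, float]]  # inclusive (low, high) per coordinate


@dataclass
class KDNode:
    disc: int = -1                        # index j of the discriminating key
    value: float = 0.0                    # partition value (median of key j)
    left: Optional["KDNode"] = None       # records with key j <= value
    right: Optional["KDNode"] = None      # records with key j > value
    bucket: Optional[List[Point]] = None  # set only in terminal nodes


def build_kd(points: List[Point], bucket_size: int = 8) -> KDNode:
    if len(points) <= bucket_size:
        return KDNode(bucket=list(points))
    k = len(points[0])
    # Choose the coordinate with the largest spread (max - min here).
    j = max(range(k), key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    value = sorted(p[j] for p in points)[(len(points) - 1) // 2]  # median
    left = [p for p in points if p[j] <= value]
    right = [p for p in points if p[j] > value]
    if not right:  # too many ties on key j to split: keep one oversized bucket
        return KDNode(bucket=list(points))
    return KDNode(j, value, build_kd(left, bucket_size), build_kd(right, bucket_size))


def search_kd(node: KDNode, query: Query, out: Optional[List[Point]] = None) -> List[Point]:
    if out is None:
        out = []
    if node.bucket is not None:           # terminal node: brute-force the bucket
        out.extend(p for p in node.bucket
                   if all(lo <= x <= hi for x, (lo, hi) in zip(p, query)))
        return out
    lo, hi = query[node.disc]
    if lo <= node.value:                  # query range reaches the left son
        search_kd(node.left, query, out)
    if hi > node.value:                   # query range reaches the right son
        search_kd(node.right, query, out)
    return out
```

When the query range on the discriminating key lies wholly on one side of the node's value, exactly one of the two tests succeeds and the other son is pruned, as in the walk through Figure 2.3.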

2.5 Range Trees

In this section we describe the range tree, a structure introduced by Bentley [1977]. It achieves the best (worst-case) search time of all the structures discussed so far, but has relatively high preprocessing and storage costs. For most applications the high storage will be prohibitive, but the range tree is very interesting from a theoretical viewpoint. Since the range tree is defined recursively we will begin our discussion by looking at a one-dimensional structure, and then generalize that structure to higher dimensions.

The simplest structure for one-dimensional range searching is a sorted array. The preprocessing sorts the N elements to be in ascending order by key. To answer a range query we do two binary searches to find the positions in the array of the low and high end of the range. After these two positions have been found we can list all the points in that part of the array as the answer to the range query. For this structure we use linear storage and O(N lg N) preprocessing time. The two binary searches will each cost O(lg N), and the cost of listing the points found in the region will, of course, be proportional to the number of such points. Letting F be the number of points found in the region, we have

$P_R(N,1) = O(N \lg N)$, $S_R(N,1) = O(N)$, and $Q_R(N,1) = O(\lg N + F)$.

We will now build a two-dimensional range tree, using as a tool the one-dimensional sorted arrays we described above (which we abbreviate SA's). The range tree will be a rooted binary tree similar to the "binary search trees" described by Knuth [1973, Section 6.2] so we will use his terminology in our discussions. Every node has a left son, a right son, a discriminating x-value (all nodes in the left subtree have a discriminating value less than the node's) and (unlike a regular binary search tree) an SA. The root of the range tree contains an SA (sorted by y-coordinate) of all points, and has as discriminating value the median x-value. The left son of the root has an SA containing the N/2 points with x-value less than the median, sorted by y-coordinate. Likewise the right son of the root represents the N/2 points with x-value greater than the median and has an SA of those points sorted by y-coordinate. This partitioning continues so that i levels away from the root we have $2^i$ subtrees, each representing $N/2^i$ points contiguous in the x-value and each containing an SA of the points sorted by y-coordinate. This partitioning continues for a total of (approximately) lg N levels; we handle small point sets (say less than a dozen points) by brute force.

The search algorithm for a range tree is most easily described recursively. Each node in the tree represents a range in the x-dimension. When visiting a node we compare the x-range of the query to the range of the node, and if the node's range is entirely within the query's then we search that structure's SA for all points in the query's y-range. Otherwise we compare the query's x-range to the node's discriminator value. If the range is entirely below the discriminator we recursively visit the left subtree; if it is above we visit the right; and if the range overlaps the discriminator then we visit both subtrees.

Analysis of the planar range tree is somewhat complicated. Since there are lg N levels in the tree and N points are stored on each level, the total storage required is O(N lg N). The preprocessing can be performed in O(N lg N) time if clever techniques are employed. Analysis shows that at most two SA searches are done on each level of the tree (each of cost approximately lg N) so the total cost for a search is $O(\lg^2 N)$ plus the time for listing the points in the region. Letting F stand, as before, for the total number of points found in the region we have

$P_R(N,2) = O(N \lg N)$, $S_R(N,2) = O(N \lg N)$, and $Q_R(N,2) = O(\lg^2 N + F)$.

If we step back for a moment we can see how we built the structure: we constructed a two-dimensional structure by building a tree of one-dimensional structures. We can perform essentially the same operation to yield a three-dimensional structure: we construct a tree containing two-dimensional structures in the nodes. This process can be continued to yield a structure for k dimensions, which will be a tree containing (k-1)-dimensional structures. This will yield a structure with performances

$P_R(N,k) = O(N \lg^{k-1} N)$, $S_R(N,k) = O(N \lg^{k-1} N)$, and $Q_R(N,k) = O(\lg^k N + F)$.

The range tree structure is very interesting from a theoretical viewpoint. The asymptotic search time is very fast, but the amount of storage used is probably prohibitive in practice. Although the application of this structure to practical problems will probably be limited to cases when k = 2 or 3, it does provide an important theoretical benchmark. It also gives us an interesting method that might yield fruit in practice. (Indeed, there are some very interesting relationships between range trees and the k-d trees of Section 2.4.)
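A two-dimensional range tree along the lines of this section can be sketched as below. This is a sketch under stated assumptions rather than the structure exactly as analyzed: the names `RangeNode`, `build_range_tree`, and `range_search` are hypothetical; each node stores its x-extent and prunes on that instead of comparing against a single discriminating value (which keeps the code correct when x-values repeat); small sets of `leaf_size` or fewer points are handled by brute force; and every node re-sorts its points by y, so preprocessing here costs more than the O(N lg N) attainable with the "clever techniques" the text mentions.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Tuple

Point2 = Tuple[float, float]


class RangeNode:
    def __init__(self, pts_by_x: List[Point2], leaf_size: int = 4) -> None:
        # pts_by_x must be non-empty and sorted by x-coordinate.
        self.xmin, self.xmax = pts_by_x[0][0], pts_by_x[-1][0]
        self.sa = sorted(pts_by_x, key=lambda p: p[1])    # the node's SA, by y
        self.ys = [p[1] for p in self.sa]
        self.left: Optional["RangeNode"] = None
        self.right: Optional["RangeNode"] = None
        if len(pts_by_x) > leaf_size:
            mid = len(pts_by_x) // 2                      # split at the median x
            self.left = RangeNode(pts_by_x[:mid], leaf_size)
            self.right = RangeNode(pts_by_x[mid:], leaf_size)

    def search(self, xlo, xhi, ylo, yhi, out: List[Point2]) -> List[Point2]:
        if xhi < self.xmin or xlo > self.xmax:
            return out                                    # disjoint in x: prune
        if xlo <= self.xmin and self.xmax <= xhi:
            # Node's x-range lies inside the query's: two binary searches in the SA.
            a, b = bisect_left(self.ys, ylo), bisect_right(self.ys, yhi)
            out.extend(self.sa[a:b])
        elif self.left is None:
            # Small point set: brute force on both coordinates.
            out.extend(p for p in self.sa
                       if xlo <= p[0] <= xhi and ylo <= p[1] <= yhi)
        else:
            self.left.search(xlo, xhi, ylo, yhi, out)
            self.right.search(xlo, xhi, ylo, yhi, out)
        return out


def build_range_tree(points: List[Point2], leaf_size: int = 4) -> Optional[RangeNode]:
    pts = sorted(points)                                  # sort by x (ties by y)
    return RangeNode(pts, leaf_size) if pts else None


def range_search(root: Optional[RangeNode], xlo, xhi, ylo, yhi) -> List[Point2]:
    return root.search(xlo, xhi, ylo, yhi, []) if root is not None else []
```

Because each point is stored in one SA per level, the storage grows as O(N lg N), matching the analysis above; a node whose x-range is covered by the query reports its points from the SA and does not descend further.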

2.6 k-ranges

The k-range is an efficient worst-case structure for range searching introduced by Bentley and Maurer [1978]. They developed two types of k-ranges: overlapping and nonoverlapping. Both of these structures involve storing sets of lists of points sorted by different coordinates; additional dimensions are added recursively, much like the range trees of the last section. The overlapping k-ranges can be made to have performance

$P_O(N,k) = S_O(N,k) = O(N^{1+\epsilon})$, $Q_O(N,k) = O(\lg N + F)$

for any $\epsilon > 0$. It is pleasing to note that the constants "hidden" in the big-ohs of the above equations are just $k/\epsilon$. Overlapping k-ranges have very efficient retrieval times but somewhat high preprocessing and storage costs. Their dual, nonoverlapping k-ranges, have very efficient preprocessing and storage costs, but increased query times. Their performance is

$P_N(N,k) = O(N \lg N)$, $S_N(N,k) = O(N)$, and $Q_N(N,k) = O(N^{\epsilon} + F)$

for any fixed $\epsilon > 0$. The details of these structures can be found in Bentley and Maurer [1978]. Although these structures were developed primarily as a theoretical device, they might prove efficient in some implementations.

2.7 Other Structures In this

section

competitive

we briefly

with those discussed

hope that someone those

mention

several

that we feel

by one of them to invent

when

divided

into

multidimensional imposed

tree

has more than some certain subcubes

of

yet

with, multiway

smaller

number

superior

In terms

branching.

on the space and the ease of implementation,

Finkel

and

Bent1e.y [1974]

generalization

of the binary

Stanat

analyzed

[1975]

uniform

planar

point

“search-sort

synchronized

to

sets. k trees”)

system.

cumulative

of

points.

is required

can be used to obtain performance

only

over

a

partitioning

It

is a

Bentley

and

tree.

range searches

trees

when

aside, however,

in

(which used

he in a

the quad tree

the k-d tree.

of a point

(the ECDF tree)

(in k-dimensional

the

of

space)

among

in the- query

and, not a listing of the points, then several

ECDF searches

This structure

number

for finding

points

storage

of

the

the quad

binary

a data structure

a count

that count.

but requires

of Section

has very desirable

and preprocessing

requirements

worst-case

a

query

similar to the range

2.5.

2.8 Comparison In Sections

If

implies

the fact that quad trees

successor,

describe

distribution

of both

node has 2k sons.

This application

by its historical

and Shamos [1977]

hyperrectangle

performance

have advantages

is

scheme

of quad trees for “square” discussed

the cube

by the k-d tree.

called

in which every

Lin [1973]

multiprocessor

empirical

collection

tree

of points,

That is,

this idea seems to be dominated

a structure

the performance

seems to be dominated Bentley

describe

recursively.

This

size.

by the quad tree (see below), which is in turn dominated

trees

techniques

points out that the notion of .cells can be applied

one of the cubes

further

the

and in the

we have discussed.

Knuth [1973]

called

are no longer

We include them for completeness

above.

might be inspired

structures

In Sections 2.1 to 2.6 we have discussed six structures (seven including the two variants of k-ranges) for range searching. The performance of these six structures is summarized in Table 2.1, which shows for each the preprocessing, storage, and query costs. All of the functions in that table reflect worst-case costs, except those which are marked with an asterisk. For those functions the probabilistic assumptions are described in the notes.

Structure                 P(N,k)            S(N,k)            Q(N,k)

Brute Force               O(N)              O(N)              O(N)
Projection                O(N lg N)         O(N)              O(N^(1-1/k) + F) *
Cells                     O(N)              O(N)              O(F) *
k-d trees                 O(N lg N)         O(N)              O(N^(1-1/k) + F)
                                                              O(lg N + F) *
Range Trees               O(N lg^(k-1) N)   O(N lg^(k-1) N)   O(lg^k N + F)
Overlapping k-ranges      O(N^(1+ε))        O(N^(1+ε))        O(lg N + F)
Nonoverlapping k-ranges   O(N lg N)         O(N)              O(N^ε + F)

Starred query times indicate average case analysis.
Probabilistic assumptions:
1. Smooth data sets, very small query region.
2. Any data set, cell size equals query size.
3. Smooth data set.

Table 2.1. Performance of data structures for range searching.

Four of these six structures (brute force, projection, cells, and k-d trees) have been presented as providing practical solutions to the range searching problem. For each there are situations for which it is clearly superior and other situations where it performs badly. In this section we will mention various situations and compare the performance of the four methods.

If the file is small and the number of attributes is large, if the file is to be searched only a few times, or if the queries can be batched so that nearly all of the records in the file satisfy at least one, then brute force is the method of choice. Otherwise one of the other methods is likely to be more efficient.

Projection does best when the range query restricts only one of the attributes, or when restriction on only one of the attributes is sufficient to eliminate nearly all of the file records. For this case the low overhead of searching with the projection technique allows it to dominate the others. Projection performs relatively poorly in those situations where the range query restricts several or many of the attributes.

In situations where several or many of the attributes serve to restrict the range query, the cell and k-d tree structures are appropriate. The cell technique is most advantageous when the size and shape of the queries are roughly constant: if the approximate size and shape of the expected queries are known in advance, then cells can be defined by a fixed grid with size and shape common to that of the queries. With this fixed grid design, however, performance is poor for queries with sizes and shapes that differ considerably from that of the cells.

The k-d tree structure is characterized by its robustness to wildly varying queries. The cell design adapts to the distribution of the attribute values of the file records in the k-dimensional coordinate space. The cells all contain very nearly the same number of records; there are no empty cells. In dense regions there are many cells and a fine division of the coordinate space; in sparse regions there are fewer cells and there is a coarser division. If a wide variety of queries are expected then the k-d tree structure should serve best.


3. Implementations

In Section 2 we discussed the various structures for range searching in a more or less abstract way without regard to implementation. We now turn our attention to how one implements these structures efficiently on real computers.

3.1 Internal Memory

If the file is small enough so that it can be contained in the internal memory of the computer then implementation of these structures is straightforward. The brute force structure is implemented as a two dimensional (N x k) array. For projection one has k tables of pointers to records; each table is sorted on a different coordinate. As discussed in Section 2.3, there are two possible ways to associate records with cells when implementing the grid method. If the points are uniformly distributed in a more or less rectangular area (so that there are few empty cells) then the grid can be represented as a multidimensional array. If there are many empty cells then the k attribute values defining a cell can be treated collectively as a key and a well known search method such as binary searching or hashing can be employed. The k-d tree can be implemented as any other binary tree; see Knuth [1973] or Bentley [1975b]. It is easy to store for each node a pair of pointers defining the subcollection of records associated with the node. This facilitates enumeration of the records satisfying the query without traversing the descendants of that node (if this is the case for all records below that node).
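The k-d tree idea of storing, for each node, a pair of pointers that delimit its subcollection can be sketched as follows. This is a minimal illustration in Python and not the authors' implementation: the names `Node`, `build`, and `range_query`, the bucket size `leaf_size`, and the choice to cycle through the coordinates by depth are all assumptions made for the example. Each node keeps the index pair `(lo, hi)` into one shared record array, so a subtree whose cell lies wholly inside the query is reported as a slice without visiting its descendants.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    lo: int                          # records[lo:hi] is the subcollection below this node
    hi: int
    axis: int = 0                    # discriminating coordinate (internal nodes only)
    split: float = 0.0               # left records are <= split, right records are >= split
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(records: List[Point], lo: int, hi: int, depth: int = 0, leaf_size: int = 2) -> Node:
    """Build a k-d tree over records[lo:hi], reordering that slice in place."""
    node = Node(lo, hi)
    if hi - lo > leaf_size:
        node.axis = depth % len(records[0])
        records[lo:hi] = sorted(records[lo:hi], key=lambda p: p[node.axis])
        mid = (lo + hi) // 2
        node.split = records[mid][node.axis]
        node.left = build(records, lo, mid, depth + 1, leaf_size)
        node.right = build(records, mid, hi, depth + 1, leaf_size)
    return node

def _search(records, node, q_lo, q_hi, c_lo, c_hi, out):
    k = len(q_lo)
    # Cell wholly inside the query: report the whole subcollection via (lo, hi).
    if all(q_lo[i] <= c_lo[i] and c_hi[i] <= q_hi[i] for i in range(k)):
        out.extend(records[node.lo:node.hi])
        return
    if node.left is None:            # leaf bucket: test each record
        out.extend(p for p in records[node.lo:node.hi]
                   if all(q_lo[i] <= p[i] <= q_hi[i] for i in range(k)))
        return
    a, s = node.axis, node.split
    if q_lo[a] <= s:                 # query overlaps the left cell
        _search(records, node.left, q_lo, q_hi, c_lo, c_hi[:a] + (s,) + c_hi[a + 1:], out)
    if q_hi[a] >= s:                 # query overlaps the right cell
        _search(records, node.right, q_lo, q_hi, c_lo[:a] + (s,) + c_lo[a + 1:], c_hi, out)

def range_query(records: List[Point], root: Node, q_lo: Point, q_hi: Point) -> List[Point]:
    """Return all records p with q_lo[i] <= p[i] <= q_hi[i] for every coordinate i."""
    k = len(q_lo)
    c_lo = tuple(min(p[i] for p in records) for i in range(k))   # bounding box of the file
    c_hi = tuple(max(p[i] for p in records) for i in range(k))
    out: List[Point] = []
    _search(records, root, tuple(q_lo), tuple(q_hi), c_lo, c_hi, out)
    return out
```

A typical use is `root = build(recs, 0, len(recs))` followed by `range_query(recs, root, (2, 3), (8, 9))`. Keeping the records in one array and the tree as index pairs is only one possible layout; explicit record pointers would serve equally well.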

3.2 Disk

Implementing these structures on random access disks is only slightly less straightforward than on central memory. For the most part disk addresses simply replace memory addresses. For brute force one simply performs a sequential scan of the records. With projection the sorted lists contain disk pointers to the corresponding records. The lists for each attribute can themselves be stored on disk, and only one list at a time need reside in central memory. With the cell technique the hash tables themselves contain pointers to the disk address of the cell and reside in central memory. The records are stored on disk, with all records in a cell stored on the same disk page. Only pages containing those cells overlapping the query rectangle need be read into central memory.
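The cell technique on disk can be sketched as a directory held in central memory that maps each nonempty cell to the page holding its records. The sketch below only simulates the disk with a Python list. The class name `CellFile`, the square cells of side `cell_size`, and the assumption that every cell fits on exactly one page are illustrative choices and are not taken from the paper.

```python
from collections import defaultdict
from itertools import product
from math import floor

def cell_of(point, cell_size):
    """Integer grid coordinates of the cell containing `point`."""
    return tuple(floor(x / cell_size) for x in point)

class CellFile:
    def __init__(self, points, cell_size):
        self.cell_size = cell_size
        cells = defaultdict(list)
        for p in points:
            cells[cell_of(p, cell_size)].append(p)
        self.pages = []          # simulated disk: one page per nonempty cell
        self.directory = {}      # in central memory: cell -> page number
        for cell, recs in cells.items():
            self.directory[cell] = len(self.pages)
            self.pages.append(recs)
        self.page_reads = 0      # counts simulated disk accesses

    def read_page(self, page_no):
        self.page_reads += 1
        return self.pages[page_no]

    def range_query(self, q_lo, q_hi):
        lo_cell = cell_of(q_lo, self.cell_size)
        hi_cell = cell_of(q_hi, self.cell_size)
        out = []
        # Visit only the cells that overlap the query rectangle.
        for cell in product(*(range(a, b + 1) for a, b in zip(lo_cell, hi_cell))):
            page_no = self.directory.get(cell)
            if page_no is None:
                continue         # empty cell: no page exists, so no disk access
            for p in self.read_page(page_no):
                if all(l <= x <= h for l, x, h in zip(q_lo, p, q_hi)):
                    out.append(p)
        return out
```

For a 10 x 10 grid of points with `cell_size=5`, a query over (1, 1) to (3, 3) touches a single cell, so `page_reads` grows by one.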

Tree structures lend themselves nicely to implementation on random access disks; this is discussed by Knuth [1973, p. 472]. Figure 3.1 shows how the nodes of the tree can be grouped (as shown by the dotted lines) onto disk pages. The size of each page is chosen as some convenient unit of disk memory (such as a track or sector). While the tree is searched in the usual manner only a few pages at a time need reside in central memory. If the records satisfying the query represent a small fraction of the file then on the order of lg (N/b) disk accesses are required, where b is the number of records per page. Bentley [1978] describes this implementation in more detail, and Williams et al. [1975] have actually implemented k-d trees for range searching on a random access disk system.

Figure 3.1. Disk pages denoted by dashed lines.
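To give a feel for the lg (N/b) estimate, the short sketch below evaluates it for an assumed file of one million records stored 100 to a page. The function names are hypothetical, and the figure is only an order-of-magnitude guide because the estimate ignores constant factors.

```python
from math import ceil, log2

def estimated_page_accesses(n_records: int, records_per_page: int) -> float:
    """Order-of-magnitude estimate lg(N/b) for a query that retrieves few records."""
    return log2(n_records / records_per_page)

def pages_in_file(n_records: int, records_per_page: int) -> int:
    """Pages a sequential (brute force) scan would have to read."""
    return ceil(n_records / records_per_page)

n, b = 1_000_000, 100
print(round(estimated_page_accesses(n, b), 1), "page accesses versus", pages_in_file(n, b))
```

Under these assumed values the tree search reads on the order of 13 pages, where a sequential scan reads all 10,000.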

3.3 Tape

By its nature magnetic tape is a sequential storage medium, and therefore ideal for the brute force approach. Even within this limitation, though, it is possible to employ to advantage the other range searching methods described above. In order to read a record from a magnetic tape it is necessary to pass over all records from the beginning to it. It is not necessary, however, to read all of those records into central memory or even transmit them from the controller to the channel. On most computing systems it is possible to issue instructions to skip one or several blocks without transmitting any data. Although the real time to read a tape is nearly the same whether blocks are read or skipped, the CPU requirement, memory interference, and channel activity can be substantially reduced. This is important in a multiprogramming environment.

The abstract projection method of Section 2.2 calls for storing the set of records in k sorted lists, each sorted by a different key. Magnetic tape is an ideal medium for storing sorted lists--just store each of the k lists sequentially on the tape. In addition some mechanism is needed for deciding which sorted list to search when answering a particular query. When all sorted lists were in main memory this was accomplished by inspecting each; on tape one can store a sample of each of the k sorted lists before storing any list in its entirety. This tape layout is illustrated in Figure 3.2. To answer a particular query one counts how many sample records it overlaps in each of the k sorted samples, and then searches that sorted list which has fewest intersections. The search therefore skips over all records until arriving at the desired sorted list, and then skips over records in that list until finding a record within the desired range. At that point it starts reading all records and testing to see whether they satisfy all k ranges.

[Figure 3.2: tape blocks in order: samples sorted by all k coordinates; records sorted by first key; records sorted by second key; ...; records sorted by k-th key.]

Figure 3.2. Tape layout for projection.

The cell method is implemented similarly. Here the directory comprises the first few blocks of the tape with the data following, arranged so that points within each cell comprise one block of data. The cells overlapping the query are determined from the directory and then those cells are read sequentially from the tape, skipping unwanted blocks. This tape layout is illustrated in Figure 3.3.

[Figure 3.3: tape blocks in order: cell directory; records in cell 1; records in cell 2; ...]

Figure 3.3. Tape layout for cells.
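The sample-based choice of which sorted list to search for projection on tape can be sketched as below. This is a simplified model rather than the paper's procedure in full: the tape is represented by in-memory Python lists, the sampling interval `sample_step` and the function names are assumptions, and records outside the key range are simply passed over one at a time where a real tape drive would skip whole blocks.

```python
from bisect import bisect_left, bisect_right

def build_tape(records, sample_step=4):
    """Return (samples, lists): k key samples followed by k fully sorted lists."""
    k = len(records[0])
    lists = [sorted(records, key=lambda r, i=i: r[i]) for i in range(k)]
    samples = [[r[i] for r in lists[i][::sample_step]] for i in range(k)]
    return samples, lists

def choose_list(samples, q_lo, q_hi):
    """Pick the coordinate whose sample has the fewest keys inside the query range."""
    counts = [bisect_right(s, hi) - bisect_left(s, lo)
              for s, lo, hi in zip(samples, q_lo, q_hi)]
    return counts.index(min(counts))

def tape_query(samples, lists, q_lo, q_hi):
    """Answer a range query; returns (matches, chosen coordinate, records read)."""
    i = choose_list(samples, q_lo, q_hi)
    out, read = [], 0
    for rec in lists[i]:             # sequential pass over the chosen sorted list
        if rec[i] < q_lo[i]:
            continue                 # before the range: passed over, not counted as read
        if rec[i] > q_hi[i]:
            break                    # past the range: no later record can qualify
        read += 1
        if all(lo <= x <= hi for lo, x, hi in zip(q_lo, rec, q_hi)):
            out.append(rec)
    return out, i, read
```

With records `(x, 37*x mod 100)` for x in 0..99 and a query that is wide in the first coordinate but restricted to 10..12 in the second, the samples steer the search to the second list and only three records are read.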

The hierarchical nature of tree searching allows for a natural implementation on a sequential medium such as magnetic tape. The nodes of the tree are stored on the tape in the order of a preorder (node, left son, right son) traversal of the tree. Figure 3.4 illustrates this storage method: Figure 3.4.a is the abstract tree and Figure 3.4.b is its implementation on tape. Notice that the nodes of the tree appear in "preorder" on the tape: the root appears first, and all descendants of the left son appear before the right son and all of its descendants. Each node comprises a record. The terminal nodes are the data blocks. Associated with each node is the number, D, of its descendants.

With this arrangement the tree search can proceed directly on the tape. A search starts at the root (beginning of the tape). At each node visited a determination is made as to whether it is necessary to search one or both of its sons; the outcome of this test yields three cases. The easiest is when both sons are to be visited--continue reading from the tape. If only the right son is to be visited, this is also easy--skip the number of blocks occupied by the left subtree. The case of visiting only the left son is slightly more complicated--stack the number of blocks occupied by the right subtree, and when control is returned to this node skip that many blocks. With this method the number of blocks read into main memory is equal to the number of nodes visited in the tree search. This technique can be applied to a wide variety of tree searching methods and the same behavior will be obtained. In particular this method can be used with the range trees of Section 2.5, in that magnetic tape can often accommodate their large storage requirements.

A search in the tree of Figure 3.4 is depicted by lines through son pointers: the search "prunes" the left son (number 2) of the root (number 1), visits the right son (number 5), and then must recursively search both of that node's sons (numbers 6 and 7); at that point it is determined that external nodes e and h are pruned, and nodes f and g are visited. This same search is described on tape in Figure 3.4; nodes which are visited are underlined by solid lines and those which are skipped are underlined by dashed lines. So node 1 is visited, it is determined that the left son can be pruned, the seven records following it are skipped, and the search proceeds to node 5. Notice that the nodes underlined by solid lines on the tape are exactly those visited by the search in the abstract tree.

[Figure 3.4: a.) Binary tree; b.) Corresponding tape.]

Figure 3.4. Tape layout for trees.
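The preorder tape layout and the skip rule can be sketched with a small simulation. Several simplifications here are assumptions and not part of the paper: the tree is a one-dimensional binary search tree over sorted, distinct keys instead of a k-d tree, each internal block stores the size of its left subtree (`left_blocks`) in addition to the descendant count `D`, and Python's recursion plays the role of the explicit stack of skip counts.

```python
class Tape:
    """Sequential medium: blocks can only be read in order or skipped by count."""
    def __init__(self, blocks):
        self.blocks, self.pos, self.reads = blocks, 0, 0

    def read(self):
        block = self.blocks[self.pos]
        self.pos += 1
        self.reads += 1              # block transmitted to main memory
        return block

    def skip(self, n):
        self.pos += n                # passed over without being transmitted

def build_preorder(keys, leaf_size=1):
    """Lay out a balanced tree over sorted, distinct keys in preorder."""
    if len(keys) <= leaf_size:
        return [{"leaf": True, "records": list(keys)}]
    mid = len(keys) // 2
    left = build_preorder(keys[:mid], leaf_size)     # keys < split
    right = build_preorder(keys[mid:], leaf_size)    # keys >= split
    node = {"leaf": False, "split": keys[mid],
            "D": len(left) + len(right),             # number of descendants
            "left_blocks": len(left)}                # blocks occupied by left subtree
    return [node] + left + right

def search(tape, lo, hi, out):
    """Range search for lo <= key <= hi; the tape must be positioned at a subtree root."""
    node = tape.read()
    if node["leaf"]:
        out.extend(r for r in node["records"] if lo <= r <= hi)
        return
    if lo < node["split"]:
        search(tape, lo, hi, out)                    # visit the left son
    else:
        tape.skip(node["left_blocks"])               # prune the left subtree
    if hi >= node["split"]:
        search(tape, lo, hi, out)                    # visit the right son
    else:
        tape.skip(node["D"] - node["left_blocks"])   # prune the right subtree

tape = Tape(build_preorder([1, 2, 3, 4, 5, 6, 7, 8]))
found = []
search(tape, 6, 7, found)
print(found, tape.reads, "of", len(tape.blocks), "blocks read")
```

With eight keys the tape holds 15 blocks, the same shape as the tree described for Figure 3.4. The query 6..7 skips the seven blocks after the root, prunes two leaves, and transmits only the blocks it visits.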


4. Further Work

Our discussion of range searching has in many respects just scratched the surface, and there are many avenues open for further research. All files that we have discussed so far have been static, that is, unchanging. Many applications require dynamic structures, in which insertions and deletions can be made. Dynamic versions of brute force, projection, and cell structures are easily obtained. Dynamic k-d trees are discussed by Bentley [1975b] and Bentley [1978]. Considerable work remains to be done in the dynamic analysis of all of these structures.

Considerable research also remains in the development of heuristics for aiding the design of these range search methods. For example, if in a seven dimensional problem queries almost always involve only two of the attributes, then the search structure should involve only these two attributes. Heuristics for detecting these and other similar situations would be very helpful. Bentley and Burkhard [1976] might prove useful in such an investigation.

5. Conclusions

The problem of range searching arises in many database applications. In Section 1 of this paper we mentioned some of those applications and defined an "abstract" problem which models the real problems. In Section 2 we used the techniques of "algorithm design and analysis" to describe and analyze a number of data structures for range searching; these structures are interesting from a theoretical viewpoint. In Section 3 we saw how these abstract structures can be efficiently implemented on a number of different storage media, showing that the structures are also practical. Avenues open for further research were mentioned in Section 4.

In 1973 Knuth [1973, p. 554] was able to write that "no really nice data structures seem to exist" for the problem of range searching. In this paper we have tried to show that this situation has changed in the interim, and that these changes can have a substantial impact on database systems.


Bibliography

Bentley, J. L. [1975a]. A survey of techniques for fixed radius near neighbor searching, Stanford Linear Accelerator Center Report SLAC-186, August 1975, 33 pp.

Bentley, J. L. [1975b]. "Multidimensional binary search trees used for associative searching," Communications of the ACM 18, 9, September 1975, pp. 509-517.

Bentley, J. L. [1976]. Divide and conquer algorithms for closest-point problems in multidimensional space, Ph.D. Thesis, University of North Carolina, Chapel Hill, North Carolina, 101 pp.

Bentley, J. L. [1977]. Decomposable searching problems, extended abstract, Carnegie-Mellon University Computer Science Department.

Bentley, J. L. [1978]. "Multidimensional binary search trees in database applications," to appear in the Second Computer Software and Applications Conference Proceedings.

Bentley, J. L. and W. A. Burkhard [1976]. "Heuristics for partial match retrieval data base design," Information Processing Letters 4, 5, February 1976, pp. 132-135.

Bentley, J. L. and H. A. Maurer [1978]. "Efficient worst-case data structures for range searching," submitted for publication.

Bentley, J. L. and M. I. Shamos [1976]. "Divide and conquer in multidimensional space," Proceedings of the Eighth Symposium on the Theory of Computing, ACM, May 1976, pp. 220-230.

Bentley, J. L. and M. I. Shamos [1977]. "A problem in multivariate statistics: algorithm, data structure, and applications," Proceedings of the Fifteenth Allerton Conference on Communication, Control and Computing, pp. 193-201.

Bentley, J. L. and D. F. Stanat [1975]. "Analysis of range searches in quad trees," Information Processing Letters 3, 6, July 1975, pp. 170-173.

Bentley, J. L., D. F. Stanat, and E. H. Williams, Jr. [1977]. "The complexity of near neighbor searching," Information Processing Letters 6, 6, December 1977, pp. 209-212.

Dobkin, D. and R. J. Lipton [1976]. "Multidimensional searching problems," SIAM Journal of Computing 5, 2, pp. 181-186.

Finkel, R. A. and J. L. Bentley [1974]. "Quad trees--a data structure for retrieval on composite keys," Acta Informatica 4, 1, pp. 1-9.

Friedman, J. H., F. Baskett, and L. J. Shustek [1975]. "An algorithm for finding nearest neighbors," IEEE Transactions on Computers C-24, 10, October 1975, pp. 1000-1006.

Friedman, J. H., J. L. Bentley, and R. A. Finkel [1977]. "An algorithm for finding best matches in logarithmic time," ACM Transactions on Mathematical Software 3, 3, September 1977, pp. 209-226.

Knuth, D. E. [1973]. The Art of Computer Programming, vol. 3, Sorting and Searching, Addison-Wesley, Reading, Massachusetts.

Kung, H. T., F. Luccio, and F. P. Preparata [1975]. "On finding the maxima of a set of vectors," Journal of the ACM 22, 4, October 1975, pp. 469-476.

Lee, D. T. and C. K. Wong [1978]. "Worst-case analysis for region and partial region searches in multidimensional binary search trees and quad trees," Acta Informatica 9, 1, pp. 23-29.

Lee, R. C. T., Y. H. Chin, and S. C. Chang [1975]. "Application of principal component analysis to multi-key searching," IEEE Transactions on Software Engineering SE-2, 3, September 1976, pp. 185-193.

Rabin, M. O. [1976]. "Probabilistic algorithms," in Algorithms and complexity: New directions and recent results, J. F. Traub (ed.), Academic Press, pp. 21-39.

Williams, E. H. Jr. et al. [1976]. PABST Program Logic Manual. Unpublished class project, University of North Carolina, Chapel Hill, North Carolina.

Yuval, G. [1975]. "Finding near neighbors in k-dimensional space," Information Processing Letters 3, 4, March 1975, pp. 113-114.