efficient bulk-loading of gridfiles - NASA Technical Reports Server ...

3 downloads 2634 Views 1MB Size Report
for bulk-loading ... In section 6 we present our two phase bulk-loading ...... collectionof information Send commentsregardinglh=sburdenestimateor any other ...
NASA ICASE

Contractor Report

Report

194974

No. 94-74

S EFFICIENT

BULK-LOADING

OF GRIDFILES

O" cO _ _ ,U

,._ I u'_

Scott T. Leutenegger David M. Nicol

Z

:3

0 _CO N 0

_0

Contract August

NAS 1994

Z w o'l 0.., I,kl _J

1- 19480

U. u. u.o

CX

_N

Institute NASA Hampton,

for Computer Langley VA

Applications

Research

in Science

and Engineering

Center

_OW 0_

23681-0001

_/_

F.._ b.,d b,a

_C3 ! ...JaJ

Operated

by Universities

Space

Research

Association

_0

Efficient Scott Institute

for Computer

Bulk-Loading

of Gridfiles

* David

T. Leutenegger

Applications

in Science and Engineering

Mail Stop 132c, NASA Langley Research Hampton, VA 23681-0001 leut@icase, edu

Dept.

Center

M. Nicol

of Computer

Science

College of William and Mary Williamsburg, VA 23187-8795

Abstract This paper

considers

gridfile

multi-attribute

tioning needed

algorithm to ensure

the problem indexing

of bulk-loading technique.

large

data

We propose

sets for for the

a rectilinear

parti-

that heuristically seeks to minimize the size of the gridfile no bucket overflows. Empirical studies on both synthetic

data sets and on data sets drawn from computational fluid dynamics applications demonstrate that our algorithm is very efficient, and is able to handle large data sets. In addition, sets too large to fit in main it creates

*This while

the authors

Langley and

research

Research

NASA

grant

a gridfile

was supported were

by the National

in residence

Center,

Hampton,

NAG-l-ll32.

without

we present an algorithm for bulk-loading data memory. Utilizing a sort of the entire data set incurring

Aeronautics

at the Institute VA 23681-0001.

any overflows.

and Space

for Computer Nicol's

work

Administration

Applications was

in Science

also supported

under and in part

NASA

contract

Engineering by NSF

NAS1-19480

(ICASE), grant

NASA

CCR-9210372

1

Introduction

We are developing (CFD)

data

a scientific

sets.

Retrieval

is two or three-dimensional partially

qualified,

technique

[5]. The

basic

idea

query

that

to support

thus requires and point

partitioning

all tuples

sharing

the data

page

and fast retrivial

extensive of existing

multiattribute queries.

Current

data

soon.

sets are common

data.

in files created

In this paper is termed relations

Note

The main

contributions

1. A partitioning

lization.

algorithm

that The efforts.

rest

In section

the existing

sets including technique. and plans

highly

In section for future

will fit on a disk page. to the data

page,

recovery,

Both the size of the data

fast organization

of new data,

a large set of tuples

of the form:

)

but are projected data

to be 2-3 orders

sets, large

physically

of magnitude oriented

data

computations.

data

files into a gridfile from

or during

indexing

relational

structure.

databases

reorganization.

This

for reloading

In a relational

database

up to two to four data

orders

into gridfile

of magnitude

blocks.

less CPU

We provide

time

experimental

under-utilized results

which of large

logical

grid-buckets

demonstrate data

to achieve

the utility

sets (significantly

better

disk uti-

of the aggregation larger

than

main

phase. memory)

overflows.

is organized

as follows:

the general

CFD

problem

algorithm,

compare

6 we present work.

is typically

is required

for partitioning

partitioning

skewed

a are

algorithm.

no bucket

5 we experimentally

load entire

for bulk-loading

3 we present

we provide

floatN

of scientific

requires

expermental

of this paper

inducing

partitions

are:

to aggregate

algorithm

guarantees

4 we present section

algorithm

dimension

indexing

thereby Enough

in

set in our work.

which

algorithm

We provide

3. A complete

during

of this paper

for our partitioning

2. An efficient

space.

one to fetch a pointer

with CFD

functionality

to the data

than the only known results

similar

platforms,

is analogous

data

float2,..,

concerned

we show how to quickly

when changing

the relation

that

only tens of megabytes,

to a wide spectrum

bulk loading.

into subranges,

in each

interested

multi-attribute

by CFD simulations.

sets require

Our two dimensional

we are specifically

outputs

subrange

with two disk accesses,

use of the data

sets measure

Although

We are specifically

multi-attribute

Dynamics

All of our data

itself.

( X, ]I, float1,

larger

range

Fluid

exploration.

are a well known

attribute entire

of Computational

and data

indexing.

Gridfiles

each

the same

The data we wish to store is contained sets and anticipated

of subsets

for visualization

on the

can be then be satisfied

and one to fetch

retrieval

is required

is to partition

rectilinear

to ensure

Any point

and

fully qualified,

multi-dimensional chosen

database of subsets

data

next

in more detail

our new algorithm,

the execution sets.

In the

times

of the

We also demonstrate

our two phase

bulk-loading

section

we relate

and provide

our

an example.

and our aggregation two algorithms, the effectiveness

algorithm.

work

to prior In section

algorithm.

on a variety

In

of data

of our aggregation

We end with our conclusions

2

Previous

Bulk-loading

Work

of B + trees

considered.

The

Their

emphasis

main

multiple

single

[6] has been

paper

on this of which

is bulk-loading

sites in a shared

nothing

among

the sites in the

at one site, into the buckets programming,

For physical for logical number

partitioning,

We show grid files.

that

that

to be created

overflows

is better overflow

focusing of blocks

problem

is zero.

The

use a fast heuristic number

algorithm.

and still guarantee

to fit into main

data

sets than

data

which is typical Our

by Nicol[4] differences

freedom

are not considered in a bucket.

times.

by allowing

work,

of CFD data algorithm

to better

total

number

finds

the issue

of

an equally

is capable

understand

the effects

rather

say,

tuples.

we propose for which

the

The

is reduced much

that

enables

algorithm.

via

it

total us to

possibly

a low cost

larger

grid files

if the data

set is too

we consider

of positionally

than,

and fixed number

but

although

Furthermore,

If the assesses

of handling

utilization,

split.

work inadequately

of partitions

of partitions

For

will need

of partitions number

their

new buckets

approach,

number bucket

grid directory

a fixed partitioning

number

for large

may be significant.

of overflow

programming

first be sorted.

skewed

more

extensive

and

clustered

sets. is a modification

of load-balancing

our algorithm

spaced,

[7, 8] to accelerate

multiple

dynamic

algorithm good

equally

to assume

some block overflows

an arbitrary

an increased

must

and the The earlier

that

find the smallest

while achieving the data

a specific

to be considered

of which

capacity,

is an excellent

of an expensive from

only

be used

dynamically

addresses

sampling

be created

and the average

our partitioning

memory

for the purposes between

introduced

Thus,

suggest

the bucket

formulation

is likely

directly

is too inefficient

in Li et al. , i.e. given

and

instead

resulting

the earlier

partitioning

given

(which

the handling

must

than

could given

Our algorithm

on

gridfiles.

We are concerned

it may be sufficient

and

overflows,

overflow

problem

no overflows

large

approach

bucket

of parallel

of our algorithm

and

is based

is skewed.

will be split multiple

which

algorithm

of buckets

data

themselves,

is larger

programming the

quickly,

as it does on the probability

specification

the dynamic

to reformulate

aggregation

a bucket

more

distributed

may introduce

an additional

within

number

For the

much

programming

and the grid directory

of partitions,

dimension

dynamic

the risks of sampling, the average

in the other

of a gridfile

Their solution

one dimension,

[2]. across

of the gridfile

of the portion

overflow.

version

Srivastava

distributed

as that

partitioning

bucket

partitions

the

of overflows

physical

optimally

this is not the case when data

sampling

and

a modified

but

However,

each bucket

a partitioning

this problem

as that

grid files been

and

are

partitioning

the the Li et al. paper).

For uniformly

Li et al. recognize

algorithm.

are lacking

of Li, Rotem,

of the gridfile.

is to minimize

although

bulk-loaded

grid files that

partitioning portion

partitioning

function

i.e.

define logical

that

Li et al. algorithm

finds

the fixed partition.

number

objective

have

is that

Files,

They

a fixed partitioning

of partitions,

Grid

compose

the logical

on this fixed partition

the number selecting

and

we are aware

and physical

that

at a single site, The

of partitions

spaced

their

partitioning

partitioning.

but details

system,

for both

partitioning

with physical

larger

database

but only recently

of Parallel environment.

located dynamic

investigated,

and this earlier

fixed in the present

context,

of the rectilinear

irregular

data-parallel

partitioning computations.

one are that the number and that

there

algorithm

is an upper

of subranges

The

developed two principle

in each dimension

limit on the number

of tuples

3

General

Problem

Before considering two-dimensional that

algorithmic case;

but

of subranges

issues, let us first examine

all the algorithms

each attributed

necessary,

Description

range

we have

is partitioned

not addressed

Let S be a set of tuples point

information,

into the

(al, a2, q[]) where

In our specific

values

chemical

data

representing

composition,

of subranges.

the desired

and generalized

ranges.

The

matrix

which

k,j = II{tit

empirical

is of the

We also assume

This

is not rigorously

relationship

between

number

of tuples

P = the number

of partitions

B = the maximum U/ = the number

j}

the number

II, 1 _