NASA Contractor Report 194974
ICASE Report No. 94-74

ICASE

EFFICIENT BULK-LOADING OF GRIDFILES

Scott T. Leutenegger
David M. Nicol

Contract NAS1-19480
August 1994

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681-0001

Operated by Universities Space Research Association
Efficient Bulk-Loading of Gridfiles*

Scott T. Leutenegger
Institute for Computer Applications in Science and Engineering
Mail Stop 132c, NASA Langley Research Center
Hampton, VA 23681-0001
leut@icase.edu

David M. Nicol
Dept. of Computer Science
College of William and Mary
Williamsburg, VA 23187-8795
Abstract

This paper considers the problem of bulk-loading large data sets for the gridfile multi-attribute indexing technique. We propose a rectilinear partitioning algorithm that heuristically seeks to minimize the size of the gridfile needed to ensure no bucket overflows. Empirical studies on both synthetic data sets and on data sets drawn from computational fluid dynamics applications demonstrate that our algorithm is very efficient, and is able to handle large data sets. In addition, we present an algorithm for bulk-loading data sets too large to fit in main memory. Utilizing a sort of the entire data set, it creates a gridfile without incurring any overflows.

*This research was supported by the National Aeronautics and Space Administration under NASA contract NAS1-19480 while the authors were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001. Nicol's work was also supported in part by NSF grant CCR-9210372 and NASA grant NAG-1-1132.
1 Introduction

We are developing a scientific database to support retrieval of subsets of Computational Fluid Dynamics (CFD) data sets. Retrieval of subsets is required for visualization and data exploration. All of our data is two or three-dimensional and thus requires multi-attribute indexing. We are specifically interested in partially qualified, fully qualified, and point queries.

Gridfiles are a well known multi-attribute indexing technique [5]. The basic idea is to partition each attribute range into subranges, thereby inducing a multi-dimensional rectilinear partitioning on the entire multi-attribute space. Enough partitions are chosen to ensure that all tuples sharing the same subrange in each dimension will fit on a disk page. Any point query can then be satisfied with two disk accesses, one to fetch a pointer to the data page, and one to fetch the data page itself.

The data we wish to store is contained in files created by CFD simulations. Current data sets measure only tens of megabytes, but are projected to be 2-3 orders of magnitude larger soon. Both the size of the data sets and anticipated extensive use of the data sets require fast organization of new data, and fast retrieval of existing data. Our two dimensional data is typically a large set of tuples of the form:

(X, Y, float1, float2, ..., floatN)

Although we are specifically concerned with CFD data sets, large physically oriented data sets are common outputs to a wide spectrum of scientific computations.

In this paper we show how to quickly load entire data files into a gridfile indexing structure. This is termed bulk loading. Note that similar functionality is required from relational databases for reloading relations when changing platforms, during recovery, or during reorganization. In a relational database the relation is analogous to the data set in our work.

The main contributions of this paper are:

1. A partitioning algorithm which requires up to two to four orders of magnitude less CPU time than the only known algorithm for partitioning data into gridfile blocks. We provide experimental results for our partitioning algorithm.

2. An efficient algorithm to aggregate under-utilized logical grid-buckets to achieve better disk utilization. We provide experimental results which demonstrate the utility of the aggregation phase.

3. A complete algorithm for bulk-loading of large data sets (significantly larger than main memory) that guarantees no bucket overflows.

The rest of this paper is organized as follows: In the next section we relate our work to prior efforts. In section 3 we present the general problem in more detail and provide an example. In section 4 we present the existing partitioning algorithm, our new algorithm, and our aggregation algorithm. In section 5 we experimentally compare the execution times of the two algorithms on a variety of data sets including highly skewed CFD data sets. We also demonstrate the effectiveness of our aggregation technique. In section 6 we present our two phase bulk-loading algorithm. We end with our conclusions and plans for future work.
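To make the two-access point query described in the introduction concrete, the following is a minimal in-memory sketch in Python. It is not the authors' implementation: the class name GridFile2D, its methods, and the use of Python lists as stand-ins for disk pages are all assumptions made for illustration. A real gridfile keeps both the directory and the buckets on disk.

```python
from bisect import bisect_right


class GridFile2D:
    """Toy 2D grid file (hypothetical): two linear scales plus a bucket directory."""

    def __init__(self, x_cuts, y_cuts):
        # Sorted interior split points that divide each attribute range into subranges.
        self.x_cuts = list(x_cuts)
        self.y_cuts = list(y_cuts)
        # Directory with one bucket per grid cell; each list stands in for a disk page.
        self.directory = [
            [[] for _ in range(len(self.y_cuts) + 1)]
            for _ in range(len(self.x_cuts) + 1)
        ]

    def _cell(self, x, y):
        # Binary search on each scale gives the subrange index in that dimension.
        return bisect_right(self.x_cuts, x), bisect_right(self.y_cuts, y)

    def insert(self, record):
        # A record has the form (X, Y, float1, ..., floatN).
        i, j = self._cell(record[0], record[1])
        self.directory[i][j].append(record)

    def point_query(self, x, y):
        i, j = self._cell(x, y)
        bucket = self.directory[i][j]  # access 1: directory entry; access 2: data page
        return [r for r in bucket if r[0] == x and r[1] == y]
```

In this sketch the cut points are supplied by the caller; choosing them so that no bucket exceeds a page is exactly the partitioning problem the paper addresses.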
2 Previous Work

Bulk-loading of B+ trees [6] has been investigated, but only recently have bulk-loaded grid files been considered. The single paper on this of which we are aware is that of Li, Rotem, and Srivastava [2]. Their main emphasis is bulk-loading of Parallel Grid Files, i.e. grid files that are distributed across multiple sites in a shared nothing environment. They define logical partitioning as that of the gridfile among the sites in the database system, and physical partitioning as that of the portion of the gridfile located at one site into the buckets that compose that portion of the gridfile. Their solution is based on dynamic programming for both the logical and physical partitioning. For physical partitioning, their objective function is to minimize bucket overflow. We are concerned with physical partitioning at a single site, although our algorithm could be used for logical partitioning.

We show that the dynamic programming approach is too inefficient to be considered for large grid files. Li et al. recognize this problem and suggest sampling [7, 8] to accelerate their algorithm, but details are lacking. Furthermore, their work inadequately assesses the risks of sampling, focusing as it does on the probability of overflow in a bucket rather than, say, the average number of overflow blocks.

The dynamic programming formulation in Li et al. takes the size of the partitioning as given, i.e. given a specific number of partitions and a fixed bucket capacity, it finds the partitioning that optimally minimizes the total overflow. For uniformly distributed data it may be sufficient to assume a fixed number of equally spaced partitions. However, this is not the case when data is skewed, which is typical of CFD data sets. Each bucket which overflows will be split, possibly multiple times, so new buckets will need to be created dynamically, and the cost of handling the overflows may be significant.

We instead reformulate the problem: find the smallest number of partitions for which the total overflow is zero. The freedom introduced by allowing an arbitrary number of partitions enables us to use a fast heuristic algorithm instead of an expensive dynamic programming algorithm. Our algorithm is capable of handling much larger grid files and still guarantees no overflows, although if the data set is too large to fit into main memory the data must first be sorted. Because this approach may introduce an increased number of buckets, partitioning is followed by a low cost aggregation algorithm to achieve good bucket utilization.

Our algorithm is a modification of the rectilinear partitioning algorithm developed by Nicol [4] for the purposes of load-balancing irregular data-parallel computations. The two principal differences between our partitioning algorithm and this earlier one are that the number of subranges in each dimension is not considered fixed in the present context, and that there is an upper limit on the number of tuples in a bucket.
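The reformulated problem (find the smallest number of partitions for which no bucket overflows) can be illustrated with a deliberately naive sketch. This is not the heuristic proposed in this paper, nor the algorithm of Nicol [4]: it simply tries P = 1, 2, ... with equal-count cut points in each dimension and stops at the first P whose fullest bucket fits the capacity. The function names, the equal-count rule, and the max_parts bound are assumptions made for illustration.

```python
from bisect import bisect_right


def quantile_cuts(values, parts):
    """Return parts-1 split points that place roughly equal counts in each subrange."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // parts] for k in range(1, parts)]


def max_bucket_load(points, x_cuts, y_cuts):
    """Count tuples per grid cell and return the load of the fullest cell."""
    counts = {}
    for x, y in points:
        cell = (bisect_right(x_cuts, x), bisect_right(y_cuts, y))
        counts[cell] = counts.get(cell, 0) + 1
    return max(counts.values(), default=0)


def smallest_overflow_free_grid(points, capacity, max_parts=1024):
    """Try P = 1, 2, ... and return the first P x P equal-count grid with no overflow.

    Illustrative assumption: the same P is used in both dimensions.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    for parts in range(1, max_parts + 1):
        x_cuts = quantile_cuts(xs, parts)
        y_cuts = quantile_cuts(ys, parts)
        if max_bucket_load(points, x_cuts, y_cuts) <= capacity:
            return parts, x_cuts, y_cuts
    raise ValueError("no overflow-free grid found within max_parts")
```

On skewed input such as points along a diagonal, this search needs many more partitions than on uniform input, and most of the resulting cells are empty, which is the kind of under-utilization an aggregation phase is meant to address.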
3 General Problem Description

Before considering algorithmic issues, let us first examine
two-dimensional that
case;
but
of subranges
all the algorithms
each attributed
necessary,
range
we have
is partitioned
not addressed
Let S be a set of tuples point
information,
into the
(al, a2, q[]) where
In our specific
values
chemical
data
representing
composition,
of subranges.
the desired
and generalized
ranges.
The
matrix
which
k,j = II{tit
empirical
is of the
We also assume
This
is not rigorously
relationship
between
number
of tuples
P = the number
of partitions
B = the maximum U/ = the number
j}
the number
II, 1 _