Prototype implementations of system-level extensions to support cooperative indexing were evaluated with a modified version of. Oracle Rdb. Our experimental.
Integrating
Samuel
DeFazio,
IR and RDBMS
Amjad
Daoud,
Bruce University
Lisa
Using Cooperative
Ann
Smith,
Oracle
Corporation
Croft
and Jamie
and Jagannathan
at Amherst
for large
text
databases,
to accommodate The
full
integration
database
of information
management
as both
system
a significant
integration
goal
we mean:
retrieval, operations update,
ii)
and
and
retrieval
of
extensions
operations.
It is
also
retrieval
processes,
To identify
imposed
by
the
a number The
of database
results
of
extensions integration from
our
desired
these
of
and
experiments,
indexing,
and full
that provides
implementations
Oracle Rdb.
that
the
of system-level with
findings
and suggest
level
gained called
and not
technology.
extensions
to support version
aware
improve
information
management goal.
and
system
Many
databases
retrieval
(if
require
the
and recovery
(DBMS)
not
also have
(IR)
most) a major
types
of
features
framework of
the
component guarantees
that are provided
within
is
a highly
applications involving about
in commercial
involving
text
structured
data
concurrency database
transactions.
is scalability, additional
are a number
integration significant
of
IR
obstacles
IR operations,
of technical and
database
include
providing
challenges
with
database
indexing
and query
seem we are
performance
or
information retrieval.
systems
continues
to
that
performance
the initial
for
of our
focus
Tuning
to decrease in cost any performance
for DBMS
capabilities
systems
into
database
scalability
and query evaluation.
information
Of these, the most transaction semantics for
features
to raw performance.
products
and algorithm
support
has major implications
more general
the
systems,
supporting
scaleable
associated
the
be used to improve
However,
related to system-level
design
management
However,
scaleable Thus,
(which
can frequently
(e.g., probabilities) There
deliver
as opposed hardware
of interest.
predicates
systems.
do not
is generally
by [LYNCH88]
indexing either
the
is a problem
database
of text,
and retrieval
Extending
control
collections
indexing
directly
desirable
large
must
metric
a database
in
DBMS
This
requirements,
discuss
virtually
LYNCH88,
the
reported
to and
of these approaches,
dramatically)
1 Introduction
that
difficult
indices
performance
relational
of using
system, is
text
IR systems.
IR system
of any studies
very
HARPER92,
the DBMS-provided satisfy
components
recovery
by
in
maintains
is evaluated
store
In the work
IR functionality
andlor Providing
is used.
provide
work
performance.
of full-text
occurring
database
provided
any commercial
To accommodate
the cooperative
to further
approach,
mechanisms
[HARPER92],
scalability
of
Under
(RDBMS) closely
to
CROFT85,
this
text are
the and
is
[BLAIR88,
the requirements
to more
indices in
control
retrieval.
of a complete
mainly
on the
text
and
the IR system
of a query
stored
alternative
the indexing
system
to achieve both
a modified
alternatives
itself
the case when
the insight
validate
concurrency
MCLEOD83].
of IR
separate
The
satisfy
portion
for IR
the DBMS
consist
the integration
documents
guaranteeing
indexing,
approach,
languages
semantics
enabling
actually
structures
IR
impossible.
when
an approach,
a framework
were evaluated
a
infrastructural
desired
With
layered conducted
with
index the
The
with
database
experiments.
of IR and RDBMS
Our experimental
scheme
requirements
retrieval
both
to and
and
indices.
coordinate
offering
then
separate
database
involves
update,
In this
query
transaction
systems
and a DBMS
completely these
query
storage,
“integrated”
interface.
database
Providing
environment
document
current
a common the
retrieval
we
and
we developed
integration
indexing
Rdb
suggest
performance.
scaleable
database
handle
and extending
ranking.
in a database
IR system
addition,
indexing
integration,
experiments
indexing,
integrated
document
and
scalability
indexing
document
to obtain
cooperative
Prototype
the
are necessary initial
cooperative
for
to
Some
all database
document
for
on Oracle
load
iv)
for
systems
of atomicity,
the implementation
level
IR application
thus
a
By full
concurrent
and
ranking
necessary
performance
into
recognized
storage,
properties iii)
documents,
scaleable
been
semantics,
the ACID
to provide
features
undertaking.
document
durability,
exhibit
representative
for
have
isolation,
has long
transaction
on documents
language
(DBMS)
(IR)
and a challenging
i) support
and update,
consistency,
retrieval
Srinivasan
Callan
of Massachusetts
Abstract
Indexing
document
is
design. retrieval
with respect to query language The incorporation
database
query
of uncertainty
languages
system and is essential
It is difficult,
however,
results
in a
for effective
to change current
database systems. Despite some proposals in this area [FUHR90], most integration efforts impose a Boolean filter on the ranked results of IR queries before incorporating those results
processing
Permission to m:lke digihlhrd copies 01’ all or part CIt’ this matcri:l[ without fee is gr:mtcd prnvidcci llmt [he copies are not m:i(ie or distributed for prolit i)r con]mcrcial Idvan[age, the ACM copyright/ server notice, the title ol’ [he pul>lication anct its d:ttc appwr, :mci notice is given lIM1 ci>py right is by permission o f the Associ:ltion for Computin: Machinery, Inc. (ACM). To copy otherwise, M republish, to post on servers or 10 rcdislril>ulc 10 Iisls, rfquires spccilic pcrmiss]on andlor fee. SIGIR’95 Seattle WA USA[l 1995 ACM 0-89791714-6/95/07.S3.50
in the standard database query processing. An example of this would be. to treat the top N ranked text objects as the retrieved set.
Emerging
[SQLMM], oriented extending
while towards
database
standards
acknowledging the Boolean
SQL to handle
logic view
as SQL/Multimedia of ranking,
of retrieval.
IR query semantics
outside our current focus and will
84
such
the importance
are
Although
is important,
not be discussed further,
it is
The overall
goal of our work
framework IR
that enables
technology.
system-level stolage,
This
paper
extensions
transaction
update,
for
particularly
extensions
different
types of IR systems.
this
to provide
provides
the extraction support
choice
since different
data stored
of index
terms
provided
by
where the IR components database system provides
indexing
systems
approach
in traditional are rarely involving
authoring,
section
we employed
to study
the integration
Oracle Rdb[ORACLERdb].
In Section
load and query
study
described
experiments
in
experimental cooperative
Section
4.
findings
for
indexing
Rdb
2 Experimental
approach
the
integration
of
experimental ●
RDBMS
Characterize 6.1,
we
approach,
which
present
workload These
and
IR
the
generated by indexing were
following
the
transactions.
with
Develop
an Experimental
experimental contains and
retrieval
storage is not
version
extensions
as a mechanism consideration.
Oracle
to improve
transactions,
requirements production
of
Version
SQL-based
Rdb
to
help
was
and
reduce
patterns
from
the News
some of the
It should be noted that the
word usage properties.
can
be
characterized
That is,
with
a Zipf
version
and
served
validate
the
extensions
of the News Collection 686 Mbytes 138,578 4,950 Bytes
Size
788,256
Words
48.526,577
Average
61
Occurrences
per Word
Frequent Words
39,413
Infrequent
748,843
Words
Frequent Word Occurrences
93,6%
Infrequent
6.4%
Word Occurrences
database
RDBMS
Schema design
we
An
products
to build
IR systems.
IR approach
This
text
- varying
maximum)
ID - four byte integer for the News article.
●
maxfreq
selected ●
database
is with
design
and consists
one
string
(64,000
a unique
containing
byte
identifier
the maximum
for the article.
This
value
factor for the probabilistic
modeled.
some other
and
containing
- two byte integer
being
the News articles
Each row represents
character
usage frequency
scheme
of Rdb
This
are used
the content of one News article.
used as the normalizing
essentially
experiments
that
attributes:
length
containing
●
word
our
[CALLAN92]
table contains (meta) data.
article and has the following ●
for
schemes
two tables:
the related descriptive
that
under
employ
of the table-based
DOCUMENTS.
secondary
software,
Properties
Total Word Occurrences
of indexing the
This
Rdb.
developed
the scalability
for text indices. quality
of Oracle
typical
1: Statistical
mapped to an Oracle Rdb table. ●
extracted
1, we present
for this corpus.
supports the probabilistic
programs that load the database and execute For this study, the text index structure was
queries.
of
benchmark
the
the
characterize
and retrieval driven
published
is
Using Oracle Rdb Version to
to
extensions.
distribution.
of the following
experiments
the units
of
facilitate
technology,
the IR Workload.
experiments
application
usage
was
In Table
exhibits
2.2 Database
that
using
experiments
Average Document
El
implementation
extensions
this,
was simply
the proposed
Total Raw Text Total Documents
into
was employed:
conducted
our
representative
system-level
approach
word
Table
that
and discuss the implications.
the
for
Total Unique
Approach
identify
data
properties
The
To help
Given
[DEFAZI093].
statistical
need for real-
we
applications
~OMASIC93].
and
technology
5,
IR
collection
The lessons learned from
Section
our
performance
in relative
of these experiments and evaluate
such,
actual
below are expressed
intent
to “tune”
As
the
the
2.1 Test Data The
3, we discuss our initial
prototype
reflect
were
evaluate
was extended
software.
of
retrieval
work, we plan to measure the performance
for
specifications
with these documents.
indexing
In the
subsequent
Oracle
is significantly
using a representative
on Oracle Rdb.
The
to
Rdb for IR applications.
the workload
During
characteristics
effort
and
adding
be added, modified
of IR
do not
times presented
characterize
workload experiments
application
of Oracle
transaction
[ZIPF49]
applications
the experimental
led to the cooperative
as the
of important
be a significant
The following
system that is layered
consistency
and deletion
will
properties
the
findings
News collection
modification
time updates to the text indices associated
database
for the initial
the situation
and there will
describes
and the
being on periodically
Document
Parts of documents
deleted very rapidly,
this
structure,
not only
In a number
collaborative
too
initial
version
load
indexing,
has been largely overlooked
IR, with the emphasis
more dynamic.
defined
access to the index.
The update problem
mentioned.
be
what is extracted
index
1s important
batches of new documents.
to
the
the modified
database
The resulting to
or
of measure.
approaches
and the type of
not
of the text, but also for maintaining
text is updated.
be
since
Using
the
be noted that minimal Rdb
experimental
implementors.
called cooperative
the related
efficient
would
It should Oracle
that the indexing
of the system define
along with
The cooperative
documents
an approach
scaleable
formulating
text indices,
it is important
database
We describe
from documents
from
index,
experiments.
ran
proposed extensions.
viewpoint
of IR system
again
document
for supporting
IR systems have specific
in a text
narrowly.
A simplistic
support for inverted
is the overwhelming
However,
approach
the
the Extensions.
we
compared
with in
Rdb,
to provide
including along
Evaluate
●
of
identifying
products
interested
a generic
DBMS
integration
on
operations
and retrieval
are
RDBMS
that the RDBMS
concentrates
IR
indexing, We
a general
and effective
that enable RDBMS
semantics
performance.
is to develop
an efficient
Note
normalizing
length of the News article. unqwords - two byte integer
that
containing
unique words in the News article
we could
factor
is Et
have
such as the the number
of
DOCINDEX.
This
database.
Each
composed
token
●
ID
- fixed
containing
index
for
entry
and
index
the is
string
integer
(20 characters
m length)
containing
a unique
identifier
within
the News
length
size)
containing
corresponds
to this
the frequency
character the
token
is a single
This
design
document,
reflects
string
(512
occurrence
say, documents index
within
the associated
as opposed
can be deleted
of 512 bytes.
to the entire
to model
data
and maximum
length
This
or updated
Notice
term frequency
The
character
without
string
with
for one word within
In such cases, the string
length
SORTED
data retrieval
one or more attributes attributes
are termed
one attribute discussed indices
of a table.
we
3 Initial
to
To
on the database
2.3 Driver
RDBMS
that include
technology.
database.
for This
into
a Huffman
the text attribute
table.
This
to
Prior
sorting
load
these
to insertion
improves
insert
experiments
other
designed of interest software entries
SQL
Module
Language
to be representative
part.
from
inserts
This
using
as a list of words.
build
the
IR
database,
100 News articles was selected update
choice,
application,
we
experiments.
The
and
can
distinct
and
from
TextLoad,
have
experiments
transaction
the database.
This
the operation
of an
Clearly,
the selection
is an arbitrary
choice
represent
select documents
queries.
The query
and low use words,
equal
weight.
The
of 6 tokens
where
a compromise
related Before
is compressed After
a document
through
a stop word
text
times
using
index
into
the
entries
are
program,
tokens
include
high,
is assumed
program
of the query
to
located
all
set.
Each
retrieving
retrieval
tokens,
and
the occlist
experiment
attribute
of the
table.
One set of load and retrieval
experiments
operated
with a single-
segment index, called TokenInd, defined on the token attribute of the DOCINDEX table. The other set of experiments had a multisegment index, termed TokenIdFreqInd, defined for the
by increasing
TextRetrieval,
and without
DOCINDEX
One or more
entries
with
answer
more
as a
that were selected
executed the same 10 queries, and started with an empty database buffer. For this set of queries, we measured the elapsed
and
table.
or
the database
the
text
one
and
is processed
each word
TextRetrieval
of
on our
transaction,
from
Each query
and consists
viewed
Each load
into
does
the database vocabulmy.
medium,
be
to model
program.
however,
a set of sample
are
DOCINDEX
table.
This
index
includes
frequency
attributes.
The multi-segment
employed
to study the performance
the
token,
ID,
index configuration
implications
and was
of storing
the
data required for ranking as part of the key for each B-tree entry. For both sets of retrieval experiments, a single-segment index,
is
MaxFreqInd
that selects documents
DOCUMENTS
the database.
m the DOCINDEX
our
between loading each article under a distinct loading all the articles as one transaction.
the set of index
performance
of software
for our experiments
as a sequence of large batch transactions.
per load transaction
the locality of database write operations. Also, the occlist attribute of the DOCINDEX table is compressed using a simple variable length encoding scheme. The
RZ28
containing
index the
across three DEC
and mean access time of
a ranked
scheme.
words are filtered
cache of 128
software
for
“batch”
reads News
to generate the text index entries,
performed
DOCINDEX
adds
workload
100 articles
IR
called
DOCUMENTS
encoding
the resulting
are
action the
the database,
list and stemmed
sorted.
of
[HUFFMAN52]
is tokenized, inserts
insert
row
buffer
produced
table.
one
and
Rdb operated
documents
the text, and builds
a single
included
of memory,
calls to SQL Module
from disk, tokenizes for the DOCINDEX to
a
under
Oracle
database load and retrieval
incremental
that
program
Using
configuration
a database
on
operating
process address space of 256 Mbytes,
for only
two SQL
integrating
articles
attributes
with
programs
entries
procedures,
(e.g. IDFs)
performed
1,024 Mbytes
multiple
multl-segment
were driven
software
were
The hardware
The driver
the
multiple
mode of operation
the experiments
and
One of these programs,
the experimental
insertion
defined
For
single
[ORACLERdb] existing
are
At search time,
Experiments
The retrieval
that we conducted of
the document
720 machine
database was striped
characterize
transaction
on
tables.
Language
representative builds
both
both
Software
The experiments
of
Term
in DEC C Language.
conducted
by using an
that may be defined
single-segment. employ
unit
a single News
Oracle Rdb provides
An index
conducted
processor,
and a maximum
was written
load
Indices
6.1.
about 10 milliseconds.
size
512 bytes
storage may be
terabytes)
indices
multi-segment.
is called
below,
Module
performance,
and HASHED
we
7000, Model
disks, each having 2 Gbytes of capacity
Large Object (BLOB).
(B-tree)
within
to generate ranked output
an I/O size of 8 Kbytes,
conceptually To improve
separate
News collection.
10 Gbytes of disk storage.
Our experimental
is
could be increased
(i.e., multiple
with
Mbytes,
the
list
that
Version
approximately
that
a maximum
however,
of text, additional
Oracle Rdb Binary
the
is,
Environment
experiments
one 275 mhz Alpha
That is to
rebuilding
article.
or even larger
required
dedicated DEC Alpha
News
also that the occurrence
may seem very small,
For other collections
as
That
and stored in the database tables.
2.4 Computing
structures
easily handles the occurrences
64 Kbytes,
frequency
that
database.
update and delete operations.
structure.
stored in a varying
required.
not the entire
byte
data
table, the scope of each index
our desire
support document-level entire
is a document,
IR system in a
indexed
of the collection.
inversion
OpenVMS
that for the DOCINDEX
entry
are
are computed.
of
article. Observe
a probabilistic
documents
article.
varying
maximum
containing
To model
independent
the other statistics
- two byte integer
-
procedures. environment,
pre-computed
for
article.
the token
Language dynamic entities
attributes:
character
byte
the News
occlist
an inverted one
the token itself.
- four
frequency
●
contains
represents
of the following
●
●
table row
A query is presented to this For each word, the corresponding
table are selected using SQL Module
86
was
defined
table.
on
the
maxfreq
attribute
of
the
3.1 Database
Scaling
The load and retrieval the scaling
properties
is usually
on a database
[DEWIT92].
that grows
For example,
For database
as executing
for a transaction
holding
the computing
when the amount
the load transaction
would
time on the same computer
multi-segment nonlinear
of N
then returns
can be explained
However,
exhibits
after the data size exceeds
to a more uniform
growth
by considering
transactions
use a B-tree
index
batch scaleup.
110 Mbytes,
pattern.
the structure
As shown in Figure 2, the B“ tree structure has duplicate
by a factor
to retrieve
that doubling
the amount
approximately
B-tree index
would
to observe
expect
our retrieval
transactions.
3.2 Initial
Experimental
Summary
statistics
in the following
of text increases
This
will
nodes to handle multiple
used by Oracle Rdb
occurrences
of the same
one of
Thus, one
properties
for
Findings database experiments
are given
table.
Statistics
for Each Initial
Case
30,000
Total Raw Text
156 Mbytes
Total Unique
199,433
Words
Figure
2. Logical
Structure
of an Oracle
Rdb B* tree.
Total Word Occurrences
7,639,928
For the multi-segment
Total Load Transachons
300
and leaf
Total Retrieval
120
transactions accessesmany of the related index pages.
Transactions
point, 1 displays
the load transaction
times versus the amount of
source text indexed
under both single and multi-segment
on the DOCINDEX
table.
indices
I
nodes,
case, the B* tree bas many more interior
and no duplicate
the amount of buffer
space is exhausted,
degrades rapidly.
After the buffer
to this situation,
the growth
stabilize.
However,
increased
by nearly
behavior
occurs
Each
nodes.
management
of the
and performance
in load transaction
times
by this point the load transaction an order
because
of
magnitude.
as the B*
number of nodes must be flushed
tree
grows,
from the buffer
case, after the database reaches a certain pool can accommodate
and
than
lw-
Clearly,
size, most of the B* tree growth
B*
30
50
110 N&i
im
1s3
of Scurc:Text
I
1. Initial
Load Transaction
Times.
in Figure
1 the
transaction
times
for
the
single
of new vocabulary
storage database
expected,
3. is
Notice
roughly
the multi-segment
mu] ti-segment index cases are nearly identical for data sizes less than 30 Mbytes. When the amount of text exceeds 110 Mbytes
storage than single-segment multi-segment index entries
in size, elapsed
single-segment
times
for the multi-segment
load transactions
87
the amount for most
[CHAPMAN90].
requirements
shown in Table and
multi-segment
case.
of available
text
buffer size, the
sources tends
to
Thus, we have reason to
larger databases.
the that
the
believe that the sublinear scalability exhibited by the load transaction under single-segment indexing will hold for much
The
Figure
for
nodes.
the interior
after the database reaches a certain
decrease significantly
/ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . .
01 10
longer
also exceed
However,
introduction 50-
far
occurs in the duplicate
as the database grows, at some point the single-segment
tree size must
space.
this
per transaction.
As such, the database buffer
,~.
times have
an increasing
203-
nodes
adjusts seems to
In effect,
In the single-segment
leaf
load
At some
algorithm
250-
Notice
Rdb
of index
less.
batch scaleup
Total Documents
~ ~ w
behavior
of an Oracle
in levels for the
be substantially
sublinear
for the initial
Table 2: Summary —
3 .. : g
of and
If we assume
the number
two fold, then the growth
corresponding
data,
This IS because the number
B-tree levels equals the log of the number of keys.
Figure
in the
an epoch
token.
expects sublinear
entries
transaction
in size,
only if the processing exactly
scaleup when the amount
B’ tree [GRAY93]~
of F!vo. Whm
behavior
is said to be
resources constant.
system increased
Observe that in both cases,
sublinear
case the load
time increase
of database text doubles
scaleup linearly
exhibit
of source text loaded is less than 100 Mbytes.
the same
in size by a factor
the database size and processing
by a factor of N while
the load transactions
on determining
transactions.
defined
The batch scaleup
whenever
hnear
are focused
for the related
systems, batch scaleup transaction
begin to increase much more rapidly. experiments
case.
database
are
that the total storage requirement
for
the experimental
for
3 times
the
source
] ndex configuration
text
size.
requires
As more
index case. This follows since the are much larger m size than for the
Table 3: Initial
Storage
overall
Requirements
for
(table)
occurs with
and
attribute
is not selected.
the best of these schemes, and is also capable of
mechanism
(index)
0.5 Mbytes
(table)
231 Mbytes
returning
occurrence
101 Mbytes
minimum
possible penalty.
(index)
Being
●
160 Mbytes
(index)
about three times
high. 3 shows the average at various
elapsed
time
required
for retrieval
occlist
we
DOCINDEX
with the
the size of the source text, for a table-based
to the multi-segment
attribute),
physical
Rdb that
transactions
By storing the text index structures
B-tree (analogous
database sizes.
Thus, our goal
for Oracle
data to retrieval
amount of storage required
transactions
index retrieval
86 Mbytes
an indexing
MaxFreqInd
Figure
for
when the occlisf
DOCINDEX TokenIdFreqInd
indexing
is to develop combines
TokenInd
a single-segment
multi-segment
156 Mbytes
Total Source (text) DOCUMENTS
performance
loading,
the
text index
entirely
case including
could
eliminate
table
and
the
reduce
is
within
need
a the
for
storage
a
costs
significantly. Retmewl Tran.sadkm
In the following
indexing. —
MuU-segmnl,
indexing
20 Y. . :
~ w
contains
to enable highly -
-
extensions
Smgle-segm3nt,
15
also reduce
4 Cooperative For simple m
25
103
75
153
video reason
Figure
3. Initial
Retrieval
Transaction
Times.
is
Without
returning
execute
much faster
the occlist
the only case where expected time
sublinear
among
the retrieval scaleup
the
DOCINDEX
attribute,
cases
table
is
token,
that
in the B-tree.
Thus, an index-only
single-segment
index,
after
the
The large difference
the query.
of
That
required
identifier,
exhibit
employ
accessing
to
in
DBMS
the
For
is searched
related
the related
The
●
equivalent
segment cases,
For this experiment,
index,
the DOCINDEX
performance
penalty
which
resides only in the
performance
program,
for the single and multi-
even with
Based
Discussion
a multi-segment
table must be accessed and the related
model
is significant.
on our
of Initial
initial
we make
To
multi-segment
the occlist
transaction attribute
indexing,
the
logically
consists
database
and
indexing builds
the database
data
along
with The
attributes.
the DBMS
The
record.
of a key
data
identifies
the
attributes
contain
the index
records,
values.
and interprets
system
provides
access to the index
controls
semantic
content
of the index,
assumes
that
the content
as a tuple.
as a table.
for
any index
More specifically,
The data elements
entry
a cooperative
can be index is
for such tables define the
we observed
the lowest
times and sublinear
scalability
was not selected.
As such, the best
illustrate
this
concept,
consider
the
DOCINDEX
table
described above. The attributes of this table could be mapped into a cooperative index as follows:
times increases nonlinearly in the multi-segment case, and sublinearly when a single-segment index is employed. retrieval
the
the following
Using a physical table to store the text index does not yield scaleable performance for both load and retrieval transactions. As the database grows, load transaction
With
and
and interpreting
structure of the related index entries,
observations: ●
cooperative
for building
referent
application
represented
Results
experimentation
termed
an application
and the database system handles the physical storage of the related data structures. Notice that the cooperative indexing
modeled
3.3
entry
entries. In effect, the application
we
to effectively
In this scheme:
application-specific
table.
to the TextRetrieval
obtain relatively
is
indexing,
referent is a unique value by which
attribute,
table, is returned
requirement
reside
the B-tree
that when the occlist
this
cooperative
index
while Notice
indexing
and token frequency
For the
application-specific
it is necessary to support The approach we techniques.
application-defined
is sufficient.
images, essential
data objects,
ranks;
records must be fetched from the DOCINDEX
DOCINDEX
have
all
This is
The
For the DBMS
for computing
retrieval
of data,
types
share the responsibility
Each
●
data
complex
index structures.
is, in the multi-
than
strings,
of documents,
forms
requirements.
satisfy
indexing.
larger
and small
indexing
complex
complex
application-specific
transactions
remains
IR systems.
can easily be handled by the DBMS.
formats and indexing
In fact, this is
times
a consequence
to satisfy
document
case.
transaction
properties.
segment case, the data attributes namely,
the retrieval
in the multi-segment
file-based
Indexing
and other
accommodate
the
As we shall see, however,
data types such as integers
clips,
These
by eliminating
for our prototype
not the case for content-based
d Source Text
lvlbjtes
that are designed
transactions.
demand
text index table.
aspects of indexing 10
extensions
the storage
the space overhead for efficient
5-
of cooperative
for cooperative
a set of RDBMS
the total storage required 10
the notion
that we propose
we introduce
scaleable load and retrieval
need for a physical
E . ; &
section,
The implementation
●
token
●
ID is the referent attribute,
●
frequency
is the key attribute,
and occlist
For database updates,
when
builds
the index entries,
retrieval,
88
the IR query
and
are the data attributes.
the IR indexing and modifies processing
software
parses the text,
the cooperative software
obtains
index.
On
the index
entries
for
determines
the given
query
the documents
resulting
set of IDs
matching
documents.
terms,
interprets
the content,
that satisfy the selection
is used to retrieve
and
criteria.
and possibly
The
●
The
rank
Working
the
the cooperative
controls
updating
significantly
indexing
the index
from
model,
structures.
most DBMS
updates
addition,
the related
deletion,
application
Notice
provided
That is, when an index is defined DE\MS
the application
structures
as a side-effect
For cooperative
the
of record
scheme
important
Updates to the index structures occur Thus, the level of data transaction control.
under
provided
analogous Data
by the cooperative
to that associated
indexing
with typical
DBMS
Using
It
indexing
That is, inverted
are stored as rows in a table, and the DBMS
maintains
cooperative
indexing,
the IR index
resides
that is managed by the DBMS.
amount of locking enables
during
the DBMS
an
With
enables the application consistency.
cooperative
index
the associated DBMS
of one update
textual
approach
benefits
database framework.
permits
the user to
the structure
of an index
are stored entirely
for the index.
table is a logical
construct
that
our
implementation
More
In effect,
for defining
entry
of
specifically,
index
where
discussed below,
each index
and
we could
data for each
typical
on a per document
scope
Thus,
the occurrence
database (much like
cooperative
the
is not restricted.
For
example,
using
inverted
files).
entry represents
basis.
The reason for
an INDEX
we could
directly
ONLY
delete
table
all entries
mechanisms. Notice
may
within control
operating
fully
backup
from
database
and recovery,
security,
cooperative
the development These
effort.
indexing
of Oracle
of prototype operations
features
will
software
to
was beyond the
be included
of cooperative
was
Rdb discussed
indexing
in the
for
Oracle
Rdb,
the
To ensure an equivalent
evaluation
and retrieval
described
This
and
within
experiments
time,
however,
we used
method along with an INDEX in Section
a
ONLY 4.0.
comparative
of our extensions,
the modified
ONLY
table definition
The results
performance
the load
in Section 2 were employed. SORTED
table for indexing
access the text.
for the database is described
of these experiments
along with
a
analysis are presented below.
Et
features
replication,
version
implementation
The INDEX
system
of
the extended
to
the cooperative
completely
implementation
Unfortunately,
production
by the DBMS. above,
using
scope of this
that in either
under transaction
mentioned from
above.
That is, modify index
Experiments
prototype
support index delete and modification
consistent
Thus, under this model of integration,
can benefit
such as integrated
5.1 Prototype
and
Summary
so on. 4.1
feature
data structures
to DOCINDEX,
evaluated
In this case, an
content.
is guaranteed
Our
queries may not
be logically
transaction.
is updated
to the properties
applications
This
general.
of each index
5 Prototype
a
In such cases the
and the cooperative
database integrity indexing
database
after
Some applications
to always
indexing
content
case, the index In addition
noting
is very
analogous
level of data
can elect to operate in a manner analogous
the typical context
tables
scheme
may update
in time
and full-text
of the database.
the IR indices
application
point
or modified.
index becomes inconsistent, cover portions
an application
at some
content is added, deleted,
database
worth
and updating.
of data
indexing
to select the desired
For example,
cooperative
with
indexing
having a given ID value. The
Consistency.
require
handle
IR
selecting this level of granularity is to model data structures where the access method can directly support document deletion
concurrency, Data
ONLY
the token occurrences
in one data
levels
with
For such tables, the attributes
In the experiments
This reduces the
increased
The primary
the need to use physical
that represents
the physical
key spans the entire
IR index updates and, therefore,
to provide
is
That is, the referent
associated
Tables.
design a cooperative
list entries
index on one or more of the related table attributes, structure
is
indexing
a table to model an IR index
two data structures.
new access method
and accessing index content with SQL statements.
scheme is
content
Concurrency.
requires
(D
a table
an INDEX
mechanisms. !0
ONLY
define record.
Integrity.
integrity
lists
and eliminate
INDEX
within
Data
a proto~pe
for storing IR index structures. ●
has the foIlowing
This
the
extensions:
are stored in the B-tree.
occurrence
technology,
the
assumes this responsibility. indexing
we developed
after a B+ tree [GRAY93].
and data attributes
prc)perties: ID
these assumptions,
for
indexing.
goals of this access method are to more efficiently
the
indexing,
interface
of Oracle Rdb that has the following
modeled
mechanisms.
on some database attribute,
or modification.
The cooperative
under
programming
support for cooperative
New Sorted Access Method.
software
that this differs
indexing
application
must include
implementation ●
Under
native
DBMS
Experimental
statistics
for
the
Findings database
experiments
using
the
prototype version of Oracle Rdb are the same as those presented above in Table 2. Figure 4 displays the load transaction times
RDBMS
Extensions
Given the generality many possible
for Cooperative
of our cooperative
implementations.
based
cooperative
on typical
of handling
B-tree
indexing. or hashing
content-based
such as documents,
versus the amount
model, there are
We have selected
that M based on the following assumptions: 0 The DBMS must provide type-specific supporting
Indexing
indexing
an approach
Rdb
Notice
that after
prototype access methods for
schemes
prototype
are incapable
of complex
of source text
and the initial
what
objects
we
segment
images, video clips, and so on.
observed
in
89
of text is loaded, the
transactions.
elapsed time increased to
index
the multi-segment
version
stabilize.
60 Mbytes,
initial
The
elapsed
times
This behavior experiments
major
for
difference
much less for the prototypes case.
As
with
the
of
experiment.
times begin to increase rapidly,
load transactions load
for the prototype
single-segment
the source data size exceeds
load transaction
after about 90 Mbytes
That is to say, indexing
selection
Oracle
the
Then, for the
is similar the
to
multi-
is that
the
as compared
multi-segment
load
experiments,
this behavior
occurs when it is no longer possible to
cache the entire index structure this case the situation
in database buffers.
is somewhat
For a B+ tree, growth
adds index entries
transaction
pages are flushed
least recently transaction
used (LRU) completes,
in the leaf nodes, and
In the
following
transaction,
reloaded
from disk.
required
per load transaction.
reference there
behavior
of
extensions
pages
3290
4420
Writes
various
elaused
to
source
time
data
sizes,
rectuired
for
pro~otype implem&station
Figure
5 shows
retrieval
the
transactions
and the initial
average
from
multi-segment
the
case.
must
this cyclic
LRU
be
Relri ewd Tm?saclions
more 1/0s are sequential
[HAWTHORN79].
to the
to address this behavior
1816
Average
have been flushed.
these
Fortunately,
understood
157
For
After one load
Thus, as the B+ tree grows, is well
are known
algorithm
many
under a
tree pages required
access entries that occur early in the alphabet
Prototype
Average Reads
As the
the buffer
algorithm.
of the B+
I/O per Load Transaction
of these nodes.
in sorted order. from
replacement
most
Database
Single-segment
accesses a large number
The load transactions proceeds,
Table 5: Average
in
different.
tends to occur mostly
each load transaction
However,
buffer
—
18
Also,
Mulu.segmm.
16
replacement
[GRAY93].
g
14
z
12
E “F
10
Load Transactions & U6
:EEzil
;: .-3 5 ~ G
. . .. . . . . . . . . . . .
20
10
50. 40-
. . . . . . . . . . . .-
; g
4
------
/
..-
- .-
---
150
100
.. . . ...
-. . . . . . . . . . . ..- . . . ..-.
. ... ..
75 50 MLYles&.SourceText
25
30 -. .- ..”- ”..-”...
Figure
M
5. Retrieval
Transaction
Times..
20
From
10
t 10
70
m
30
Figure
90
110
120
150
of Source Text
Mb#es
the curves
attribute retrieval
I
01
When
case.
4 Load Transaction
Times.
that the tx-ototyue
for most regio& speaking, buffer
this result
performance
shown
algorithm,
improvement
With
we
for load transaction
scale linearly
in Figure
is encouraging.
replacement
improvements
load transactions
of ‘the curve
expect
to
transactions
segment
case.
opportunities
The
‘Generally
see
Also,
5 we find
that
when
Notice
exhibit
attribute
require
is
selected,
the
that in all cases, the prototype
sublinear
storage
constructed
to the
requirements
with
in the following
significant
for
the prototype
the
version
experimental of Oracle
table.
there are other
for the prototype
Table 6: Prototype
load
Database
Storage
Requirements
156 Mbytes
DOCUMENTS
86 Mbytes
than
the initial
5, on average
more database writes
single-segment
case.
the prototype
per load transaction
This
additional
1/0
is
directly
related to the large number of B+ tree pages that must be
written
per transaction,
amount
of text index
causes
far fewer
writing
records
end
a contiguous
of
amount known
[GRAY93,
pages
As with
clustering
the buffer
TOMASIC93]. load
causes
techniques
allocations
transactions
enhanced buffer replacement
typically
area.
tree
that adding
However, more
for
reducing
related
Also,
nodes
the higher
can be reduced algorithm
data
adding
in
MaxFreqInd
(index)
0.5 Mbytes
Cooperative
(index)
200 Mbytes
From Table
there amount B+
size.
same
pages
the
cooperative
to the
the
problem, the
table
is because
discontinuous
replacement for
This
appends
(table)
(estimate)
the same
DOCINDEX
to be modified.
table
storage
to a B+
noting
to the physical
to a physical
discontinuous
prototype
database
of data
be modified.
It is worth entries
database
Rdb are shown
Total Source (text)
generates significantly
retrieval
scaleup properties.
experiments.
in Table
prototype
much less time than for the multi-
transactions, the most important of which is I/O reduction. In Table 5, we show the average I/O per load transaction across the As shown
the occlist
or better
4.
enhancements
times.
the occlist
retrieval
transactions Notice
in Figure
is not selected, the prototype exhibits nearly identical transaction times as compared to the multi-segment
6, notice
that the total storage requirement
indexing
This
storage
scheme
savings,
results from eliminating
to
the DOCINDEX
is roughly
2 times
as compared
the
for our
source
to the initial
the need to redundantly
text
case,
store tokens in
table.
are of
Considering
tree
the results Although
of our prototype collectively, we did not achieve the
read rate for the
encouraged. performance
by employing
necessary
changes to Oracle
prototype
efforts.
an
as discussed above.
for
demonstrated
90
the load
that
transactions,
In terms of retrieval the
new
it is only
Rdb are beyond access
because
the
the scope of our
transactions, method
we are desired
we clearly
exhibits
highly
desirable
scalability
for selecting
properties.
occurrence
the multi-segment
the performance
penalty
data is much more reasonable
Also,
than for
studies that we conducted Oracle Rdb yielded
case.
retrieval with
5.2
Lessons Learned
performance
In the process of building made
significant
recpired
mc,st important ●
The
our Oracle
progress
toeffectively
integrate
workload
generated
gaimng
IRand
the
RDBMS
differs
by traditional
RDBMS
technology
RDBMS
text
impose
technology.
database system. by Oracle forvery
Forexample,
Rdb exhibits
reducing
the B*tree
from
databases.
However,
used
magmtude
more index
required.
We have found
possible
toobtain
the tree grows the arrival
●
essentially
of new grows,
indexing
and retrieval
Enhanced
buffer
LRU
RDBMS
buffer
tables
handled
that
However,
require
multiple
this behavior
Pe~ormance obtain This
highly
control,
will
SQL/MM
standard
by
uncertainty
most
non-key
We
decreasing
systems
storage
data
can
demand
the notion
[SQLMM]
In a index
access
are
costs for magnetic handle
“reasonable”
in order
to obtain
the
export
IR functionality
extensions
that
of database query
of uncertainty.
m woefully
The
lacking
in a RDBMS
we will are
context,
the viability
SQL
extended. of physically
now begin
required
evolving
in this regard.
must be appropriately
IR and RDBMS,
addressing
to adequately
the
support
in query processing.
like m
performance
join).
associated
would
assistance
on
to thank building
Rabah
the
Mediouni
test
of Oracle
systems
Also,
experiments.
who reviewed
with
and
we would
the paper for their helpful
for
conducting
like
to thank
his the
those
comments.
References
Naively,
tree can be used to
[BLAIR88]
for text operations.
Model.
products,
is highly
complex
of indexing it
M necessary
methods
to
[LOMET91
to evolve
].
[CALLAN92] It
improvement over time. As one final observation, we mention
and only for
Blair.
An
Extended
Relational
Retrieval
1988.
J.P. Callan,
INQUERY
Retrieval
Conference
on Database
78-83, Valencia,
W.B.
System,
Croft,
and S.M. Harding.
Proceedings,
and Expert
Third
Systems
Spain, Springer-Verlag,
The
International
Applications,
pp.
1992.
the current
accommodate
we can expect to see continual
D.C.
1PM, Vol. 24, No. 3, pp. 349-371,
The
access methods, concurrency this interaction,
text
and,
[CHAPMAN90]
performance
Characteristics
D.
Chapman
that this work
and
of Legal Document
S&PT, Mead Data Central,
S. DeFazio.
Databases.
Dayton,
Statistical
Technical
Ohio, January,
Report,
1990.
has been [CROFT85]
very challenging. Most of this challenge is directly related to developing RDBMS kernel level support for a significantly
Network
different
Retrieval.
indexing
structures.
Acknowledgments
is
methods such as B-trees and
consequently,
concurrent
IR
derives
the cyclic
has taken years to understand Thus,
to include
RDBMS
a limited
access
on
that
languages
essential.
come with time.
performance
RDBMS
and recovery
RDBMS
claim
problem
index
the rapidly
in secondary
integrating
for operations
not the case for RDBMS
between
hashing.
we
primary
the overhead of a complete
impact
With
The of IR
index structures.
desirable
number
the
on
numerous
index structures.
of compressed
and
heavily
to the source text typically
Given that our work has demonstrated
for
This behavior
scans (e.g.,
that some IR-specific
is simply
interaction
used
to handle
is not typically
improvements
one may claim
scheme
products
similar
makes the integration
The
and the database infrastructure
for index pages that result from
by RDBMS
accessing RDBMS ●
are
load transactions.
impossible. rebuild
concentrated
compressed
however,
rebuild
To properly
after as the
rate.
algorithms
management
access patterns
off
tree operations
must be enhanced
a series of document generally
of B+
has
Thus far, we have not addressed the extension
are
Thus,
to fall
increases at asublinear
management
products
sequential
at the leaf nodes.
the number
conducted
benefits of database functionality.
This is because
begins
were
we expect
This work has yielded
highly
database environment,
increases
that by using a B+ tree, it is
vocabulary
of
To obtain
enhancements
scaleable performance,
database
The
access method
design
dynamic
disks,
we observed
keys per byte of data.
performance,
nearly
unacceptable.
properties
that this isnotthe case for text. Thereason being one of scale. That is, documents generate about two orders of scaleable
of text,
the fact that any modification
forces a complete
on the
indexing
for building
and DBMS
Existing
performance
system
use of such data structures,
to handle
demands
IR
storage requirements.
techniques that
The set of operations
excellent
large relational
from
optimized
different
these studies
collection
version
for database load and
for much larger databases.
Historically,
The
include:
operations.
operations.
considerably
Although
small
prototype
performance
system, we
understanding
considerably
has been highly
tables and the related for
Rdb prototype
toward
lessons learned from this undertaking
text
transactions.
a relatively
with an “untuned”
scaleable
paradigm.
W.B.
Structure
Croft
and T.J. Parenty.
and a Database
Information
Systems,
A Comparison
of a
System used for Information Vol.
10, No. 4, pp. 377-390,
1985. 6 (conclusions Commercial highly
RDBMS
integrated
products
support
available
today
for IR technology.
do not provide
In addition
to this,
indexmg,
on IR technology
and scaleable
that enables the effective performance.
S. DeFazio. In Jim Gray , editor,
Full-text
Document
The Benchmark
Database and Transaction Processing Systems, Morgan Kaufmann, Second Edition, 1993.
many of the existing IR systems do not efficiently support document update operations. We have developed a prototype version of Oracle Rdb that implements a generic framework, called cooperative
[DEFAZI093] Benchmark.
integration
The performance
91
Retrieval
Handbook Chapter
for 8.
[FUHR90] Queries VLDB
N. Fuhr.
A Probabilistic
and Imprecise
Information
90, pp. 696-707,
[GRAY93]
J.
Concepts
Framework
in Databases,
for
Vague
Proceedings
of
1990.
Gray,
A.
Reuter.
and Techniques,
Transaction
Morgan
Kaufmann
Processing:
Publishers,
Inc.,
1993. [HARPER92] Extensible Journal,
D. J. Harper Class
Library
and A. D. M. Walker. for
Vol. 35, 3,256-267,
Information
ECLAIR
Retrieval.
an
Computer
June 1992.
[HAWTHORN79]
P.
Performance
of a Relational
Database
ACM
SIGMOD
Conference,
A Method
for the Construction
System.
Analysis
Proceedings
Hawthorn
and
M.
Stonebraker. Management pp.
1-11,
1979. [HUFFMAN52]
D. Huffman.
Minimum-redundancy pp. 1098-1100, [LOMET91]
Codes.
of
Vol. 40, No. 9,
1952.
D, Lomet.
and Future
IRE Proceedings,
Grow and Post Trees: Role, Techniques,
Potential.
Notes in Computer
Advances
in Spatial
Databases,
Science, No. 525, pp. 183-206,
Lecture
Berlin,
1991.
Extended
User-
Springer-Verlag. [LYNCH88] defined
C.A.
Lynch
and M, Stonebraker.
with
Applications
Indexing
Proceedings
VLDB,
[MCLEOD83] Retrieval
pp. 306-317,
I.A.
McLeod
as a Database
[ORACLERdb]
[SQLMM]
P. Cotton. Packages N1679,
~OMASIC93] Incremental
Crawford.
1S0 Working
Technical November,
[ZIPF49]
G.K.
Zipf.
of
Draft
SOU-004,
Development
SQL
Lists
for
Behavior
ISO/IEC
1994. and K. Sheens, Text
STAN-CS_TN-93-l,
Human
Multimedia
Full-text.
March,
H. Garcia-Molina,
Report 1993.
Addison-Wesley
6.1
1995.
- Part 2
Inverted
Document Technology:
1983.
Version
(SQL/MM)
A. Tomasic, Updates
Retrieval. University,
Least Effort,
Rdb
SQL/MM
Databases.
Information
Oracle Corporation,
Application SC21NVG3
and R.G.
Vol. 2, pp. 43-60,
Oracle Kit.
Textual
1988.
Application.
Research and Development,
Documentation
to
Document Stanford
and the Principle
of
Press, 1949.
92