Integrating IR and RDBMS Using Cooperative Indexing

Samuel DeFazio, Amjad Daoud, Lisa Ann Smith, and Jagannathan Srinivasan
Oracle Corporation

Bruce Croft and Jamie Callan
University of Massachusetts at Amherst

for large

text

databases,

to accommodate The

full

integration

database

of information

management

as both

system

a significant

integration

goal

we mean:

retrieval, operations update,

ii)

and

and

retrieval

of

extensions

operations.

It is

also

retrieval

processes,

To identify

imposed

by

the

a number The

of database

results

of

extensions integration from

our

desired

these

of

and

experiments,

indexing,

and full

that provides

implementations

Oracle Rdb.

that

the

of system-level with

findings

and suggest

level

gained called

and not

technology.

extensions

to support version

aware

improve

information

management goal.

and

system

Many

databases

retrieval

(if

require

the

and recovery

(DBMS)

not

also have

(IR)

most) a major

types

of

features

framework of

the

component guarantees

that are provided

within

is

a highly

applications involving about

in commercial

involving

text

structured

data

concurrency database

transactions.

is scalability, additional

are a number

integration significant

of

IR

obstacles

IR operations,

of technical and

database

include

providing

challenges

with

database

indexing

and query

seem we are

performance

or

information retrieval.

systems

continues

to

that

performance

the initial

for

of our

focus

Tuning

to decrease in cost any performance

for DBMS

capabilities

systems

into

database

scalability

and query evaluation.

information

Of these, the most transaction semantics for

features

to raw performance.

products

and algorithm

support

has major implications

more general

the

systems,

supporting

scaleable

associated

the

be used to improve

However,

related to system-level

design

management

However,

scaleable Thus,

(which

can frequently

(e.g., probabilities) There

deliver

as opposed hardware

of interest.

predicates

systems.

do not

is generally

by [LYNCH88]

indexing either

the

is a problem

database

of text,

and retrieval

Extending

control

collections

indexing

directly

desirable

large

must

metric

a database

in

DBMS

This

requirements,

discuss

virtually

LYNCH88,

the

reported

to and

of these approaches,

dramatically)

1 Introduction

that

difficult

indices

performance

relational

of using

system, is

text

IR systems.

IR system

of any studies

very

HARPER92,

the DBMS-provided satisfy

components

recovery

by

in

maintains

is evaluated

store

In the work

IR functionality

and/or Providing

is used.

provide

work

performance.

of full-text

occurring

database

provided

any commercial

To accommodate

the cooperative

to further

approach,

mechanisms

[HARPER92],

scalability

of

Under

(RDBMS) closely

to

CROFT85,

this

text are

the and

is

[BLAIR88,

the requirements

to more

indices in

control

retrieval.

of a complete

mainly

on the

text

and

the IR system

of a query

stored

alternative

the indexing

system

to achieve both

a modified

alternatives

itself

the case when

the insight

validate

concurrency

MCLEOD83].

of IR

separate

The

satisfy

portion

for IR

the DBMS

consist

the integration

documents

guaranteeing

indexing,

approach,

languages

semantics

enabling

actually

structures

IR

impossible.

when

an approach,

a framework

were evaluated

a

infrastructural

desired

With

layered conducted

with

index the

The

with

database

experiments.

of IR and RDBMS

Our experimental

scheme

requirements

retrieval

both

to and

and

indices.

coordinate

offering

then

separate

database

involves

update,

In this

query

transaction

systems

and a DBMS

completely these

query

storage,

“integrated”

interface.

database

Providing

environment

document

current

a common the

retrieval

we

and

we developed

integration

indexing

Rdb

suggest

performance.

scaleable

database

handle

and extending

ranking.

in a database

IR system

addition,

indexing

integration,

experiments

indexing,

integrated

document

and

scalability

indexing

document

to obtain

cooperative

Prototype

the

are necessary initial

cooperative

for

to

Some

all database

document

for

on Oracle

load

iv)

for

systems

of atomicity,

the implementation

level

IR application

thus

a

By full

concurrent

and

ranking

necessary

performance

into

recognized

storage,

properties iii)

documents,

scaleable

been

semantics,

the ACID

to provide

features

undertaking.

document

durability,

exhibit

representative

for

have

isolation,

has long

transaction

on documents

language

(DBMS)

(IR)

and a challenging

i) support

and update,

consistency,

retrieval

Srinivasan

Callan

of Massachusetts

Abstract

Indexing

document

is

design. retrieval

with respect to query language The incorporation

database

query

of uncertainty

languages

system and is essential

It is difficult,

however,

results

in a

for effective

to change current

database systems. Despite some proposals in this area [FUHR90], most integration efforts impose a Boolean filter on the ranked results of IR queries before incorporating those results

processing

Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the Association for Computing Machinery, Inc. (ACM). To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee. SIGIR'95 Seattle WA USA © 1995 ACM 0-89791-714-6/95/07 $3.50

in the standard database query processing. An example of this would be to treat the top N ranked text objects as the retrieved set.
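As a minimal sketch of the top-N treatment described here, a ranked IR result can be cut down to a plain set of document IDs before it is handed to standard Boolean query processing. The function name, the (doc_id, score) pair layout, and the sample scores are illustrative assumptions and are not taken from any of the systems discussed.

```python
import heapq

def top_n_as_retrieved_set(ranked, n):
    """Keep the n highest-scoring (doc_id, score) pairs and return only the IDs."""
    best = heapq.nlargest(n, ranked, key=lambda pair: pair[1])
    return {doc_id for doc_id, _score in best}

# Hypothetical ranked output of an IR query: (document ID, belief score).
ranked_results = [(101, 0.91), (102, 0.15), (103, 0.62), (104, 0.48)]
retrieved = top_n_as_retrieved_set(ranked_results, 2)
```

Once the scores are discarded, the retrieved set behaves like any other Boolean predicate result, which is exactly why this treatment loses the ranking information.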

Emerging

[SQLMM], oriented extending

while towards

database

standards

acknowledging the Boolean

SQL to handle

logic view

as SQL/Multimedia of ranking,

of retrieval.

IR query semantics

outside our current focus and will


such

the importance

are

Although

is important,

not be discussed further,

it is

The overall

goal of our work

framework IR

that enables

technology.

system-level storage,

This

paper

extensions

transaction

update,

for

particularly

extensions

different

types of IR systems.

this

to provide

provides

the extraction support

choice

since different

data stored

of index

terms

provided

by

where the IR components database system provides

indexing

systems

approach

in traditional are rarely involving

authoring,

section

we employed

to study

the integration

Oracle Rdb[ORACLERdb].

In Section

load and query

study

described

experiments

in

experimental cooperative

Section

4.

findings

for

indexing

Rdb

2 Experimental

approach

the

integration

of

experimental ●

RDBMS

Characterize 6.1,

we

approach,

which

present

workload These

and

IR

the

generated by indexing were

following

the

transactions.

with

Develop

an Experimental

experimental contains and

retrieval

storage is not

version

extensions

as a mechanism consideration.

Oracle

to improve

transactions,

requirements production

of

Version

SQL-based

Rdb

to

help

was

and

reduce

patterns

from

the News

some of the

It should be noted that the

word usage properties.

can

be

characterized

That is,

with

a Zipf

version

and

served

validate

the

extensions

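Table 1: Statistical Properties of the News Collection

Total Raw Text                  686 Mbytes
Total Documents                 138,578
Average Document Size           4,950 Bytes
Total Unique Words              788,256
Total Word Occurrences          48,526,577
Average Occurrences per Word    61
Frequent Words                  39,413
Infrequent Words                748,843
Frequent Word Occurrences       93.6%
Infrequent Word Occurrences     6.4%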
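The skew in these collection statistics can be checked with a few lines of arithmetic. The sketch below only recomputes ratios from the reported totals; the variable names are ours.

```python
# Totals reported for the News collection.
total_unique_words = 788_256
total_word_occurrences = 48_526_577
frequent_words = 39_413

# Mean number of occurrences per distinct word.
avg_occurrences = total_word_occurrences / total_unique_words

# Fraction of the vocabulary that is classed as frequent.
frequent_share_of_vocabulary = frequent_words / total_unique_words
```

The average comes out between 61 and 62 occurrences per word, and the frequent words make up about 5% of the vocabulary while accounting for 93.6% of all occurrences, which is the kind of skew a Zipf-like distribution produces.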

database

RDBMS

Schema design

we

An

products

to build

IR systems.

IR approach

This

text

- varying

maximum)

ID - four byte integer for the News article.



maxfreq

selected ●

database

is with

design

and consists

one

string

(64,000

a unique

containing

byte

identifier

the maximum

for the article.

This

value

factor for the probabilistic

modeled.

some other

and

containing

- two byte integer

being

the News articles

Each row represents

character

usage frequency

scheme

of Rdb

This

are used

the content of one News article.

used as the normalizing

essentially

experiments

that

attributes:

length

containing



word

our

[CALLAN92]

table contains (meta) data.

article and has the following ●

for

schemes

two tables:

the related descriptive

that

under

employ

of the table-based

DOCUMENTS.

secondary

software,

Properties

Total Word Occurrences

of indexing the

This

Rdb.

developed

the scalability

for text indices. quality

of Oracle

typical

1: Statistical

mapped to an Oracle Rdb table. ●

extracted

1, we present

for this corpus.

supports the probabilistic

programs that load the database and execute For this study, the text index structure was

queries.

of

benchmark

the

the

characterize

and retrieval driven

published

is

Using Oracle Rdb Version to

to

extensions.

distribution.

of the following

experiments

the units

of

facilitate

technology,

the IR Workload.

experiments

application

usage

was

In Table

exhibits

2.2 Database

that

using

experiments

Average Document


implementation

extensions

this,

was simply

the proposed

Total Raw Text Total Documents

into

was employed:

conducted

our

representative

system-level

approach

word

Table

that

and discuss the implications.

the

for

Total Unique

Approach

identify

data

properties

The

To help

Given

[DEFAZIO93].

statistical

need for real-

we

applications

[TOMASIC93].

and

technology

5,

IR

collection

The lessons learned from

Section

our

performance

in relative

of these experiments and evaluate

such,

actual

below are expressed

intent

to “tune”

As

the

the

2.1 Test Data The

3, we discuss our initial

prototype

reflect

were

evaluate

was extended

software.

of

retrieval

work, we plan to measure the performance

for

specifications

with these documents.

indexing

In the

subsequent

Oracle

is significantly

using a representative

on Oracle Rdb.

The

to

Rdb for IR applications.

the workload

During

characteristics

effort

and

adding

be added, modified

of IR

do not

times presented

characterize

workload experiments

application

of Oracle

transaction

[ZIPF49]

applications

the experimental

led to the cooperative

as the

of important

be a significant

The following

system that is layered

consistency

and deletion

will

properties

the

findings

News collection

modification

time updates to the text indices associated

database

for the initial

the situation

and there will

describes

and the

being on periodically

Document

Parts of documents

deleted very rapidly,

this

structure,

not only

In a number

collaborative

too

initial

version

load

indexing,

has been largely overlooked

IR, with the emphasis

more dynamic.

defined

access to the index.

The update problem

mentioned.

be

what is extracted

index

is important

batches of new documents.

to

the

the modified

database

The resulting to

or

of measure.

approaches

and the type of

not

of the text, but also for maintaining

text is updated.

be

since

Using

the

be noted that minimal Rdb

experimental

implementors.

called cooperative

the related

efficient

would

It should Oracle

that the indexing

of the system define

along with

The cooperative

documents

an approach

scaleable

formulating

text indices,

it is important

database

We describe

from documents

from

index,

experiments.

ran

proposed extensions.

viewpoint

of IR system

again

document

for supporting

IR systems have specific

in a text

narrowly.

A simplistic

support for inverted

is the overwhelming

However,

approach

the

the Extensions.

we

compared

with in

Rdb,

to provide

including along

Evaluate



of

identifying

products

interested

a generic

DBMS

integration

on

operations

and retrieval

are

RDBMS

that the RDBMS

concentrates

IR

indexing, We

a general

and effective

that enable RDBMS

semantics

performance.

is to develop

an efficient

Note

normalizing

length of the News article. unqwords - two byte integer

that

containing

unique words in the News article

we could

factor

is Et

have

such as the the number

of

DOCINDEX.

This

database.

Each

composed

token



ID

- fixed

containing

index

for

entry

and

index

the is

string

integer

(20 characters

in length)

containing

a unique

identifier

within

the News

length

size)

containing

corresponds

to this

the frequency

character the

token

is a single

This

design

document,

reflects

string

(512

occurrence

say, documents index

within

the associated

as opposed

can be deleted

of 512 bytes.

to the entire

to model

data

and maximum

length

This

or updated

Notice

term frequency

The

character

without

string

with

for one word within

In such cases, the string

length

SORTED

data retrieval

one or more attributes attributes

are termed

one attribute discussed indices

of a table.

we

3 Initial

to

To

on the database

2.3 Driver

RDBMS

that include

technology.

database.

for This

into

a Huffman

the text attribute

table.

This

to

Prior

sorting

load

these

to insertion

improves

insert

experiments

other

designed of interest software entries

SQL

Module

Language

to be representative

part.

from

inserts

This

using

as a list of words.

build

the

IR

database,

100 News articles was selected update

choice,

application,

we

experiments.

The

and

can

distinct

and

from

TextLoad,

have

experiments

transaction

the database.

This

the operation

of an

Clearly,

the selection

is an arbitrary

choice

represent

select documents

queries.

The query

and low use words,

equal

weight.

The

of 6 tokens

where

a compromise

related Before

is compressed After

a document

through

a stop word

text

times

using

index

into

the

entries

are

program,

tokens

include

high,

is assumed

program

of the query

to

located

all

set.

Each

retrieving

retrieval

tokens,

and

the occlist

experiment

attribute

of the

table.

One set of load and retrieval

experiments

operated

with a single-

segment index, called TokenInd, defined on the token attribute of the DOCINDEX table. The other set of experiments had a multisegment index, termed TokenIdFreqInd, defined for the

by increasing

TextRetrieval,

and without

DOCINDEX

One or more

entries

with

answer

more

as a

that were selected

executed the same 10 queries, and started with an empty database buffer. For this set of queries, we measured the elapsed

and

table.

or

the database

the

text

one

and

is processed

each word

TextRetrieval

of

on our

transaction,

from

Each query

and consists

viewed

Each load

into

does

the database vocabulary.

medium,

be

to model

program.

however,

a set of sample

are

DOCINDEX

table.

This

index

includes

frequency

attributes.

The multi-segment

employed

to study the performance

the

token,

ID,

index configuration

implications

and was

of storing

the

data required for ranking as part of the key for each B-tree entry. For both sets of retrieval experiments, a single-segment index,

is

MaxFreqInd

that selects documents

DOCUMENTS

the database.

in the DOCINDEX

our

between loading each article under a distinct loading all the articles as one transaction.

the set of index

performance

of software

for our experiments

as a sequence of large batch transactions.

per load transaction

the locality of database write operations. Also, the occlist attribute of the DOCINDEX table is compressed using a simple variable length encoding scheme. The
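The text does not say which variable length encoding is applied to the occlist attribute. One common choice, shown here purely as an assumption, is to store the gaps between successive word positions and pack each gap into 7-bit groups, so that small gaps cost a single byte. The function names and the sample positions are hypothetical.

```python
def encode_occlist(positions):
    """Delta-encode sorted word positions, then pack each gap into 7-bit groups."""
    out = bytearray()
    previous = 0
    for position in positions:
        gap = position - previous
        previous = position
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # high bit set: more bytes follow
            gap >>= 7
        out.append(gap)  # last byte of this gap has the high bit clear
    return bytes(out)

def decode_occlist(data):
    """Invert encode_occlist and return the original sorted positions."""
    positions, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            positions.append(current)
            gap, shift = 0, 0
    return positions

occurrences = [3, 130, 131, 20000]
packed = encode_occlist(occurrences)
```

With a scheme of this kind, a word that occurs a handful of times in an article fits comfortably in an occurrence list of a few bytes, far below the 512 byte maximum.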

RZ28

containing

index the

across three DEC

and mean access time of

a ranked

scheme.

words are filtered

cache of 128

software

for

“batch”

reads News

to generate the text index entries,

performed

DOCINDEX

adds

workload

100 articles

IR

called

DOCUMENTS

encoding

the resulting

are

action the

the database,

list and stemmed

sorted.

of

[HUFFMAN52]

is tokenized, inserts

insert

row

buffer

produced

table.

one

and

Rdb operated

documents

the text, and builds

a single

included

of memory,

calls to SQL Module

from disk, tokenizes for the DOCINDEX to

a

under

Oracle

database load and retrieval

incremental

that

program

Using

configuration

a database

on

operating

process address space of 256 Mbytes,

for only

two SQL

integrating

articles

attributes

with

programs

entries

procedures,

(e.g. IDFs)

performed

1,024 Mbytes

multiple

multi-segment

were driven

software

were

The hardware

The driver

the

multiple

mode of operation

the experiments

and

One of these programs,

the experimental

insertion

defined

For

single

[ORACLERdb] existing

are

At search time,

Experiments

The retrieval

that we conducted of

the document

720 machine

database was striped

characterize

transaction

on

tables.

Language

representative builds

both

both

Software

The experiments

of

Term

in DEC C Language.

conducted

by using an

that may be defined

single-segment. employ

unit

a single News

Oracle Rdb provides

An index

conducted

processor,

and a maximum

was written

load

Indices

6.1.

about 10 milliseconds.

size

512 bytes

storage may be

terabytes)

indices

multi-segment.

is called

below,

Module

performance,

and HASHED

we

7000, Model

disks, each having 2 Gbytes of capacity

Large Object (BLOB).

(B-tree)

within

to generate ranked output

an I/O size of 8 Kbytes,

conceptually To improve

separate

News collection.

10 Gbytes of disk storage.

Our experimental

is

could be increased

(i.e., multiple

with

Mbytes,

the

list

that

Version

approximately

that

a maximum

however,

of text, additional

Oracle Rdb Binary

the

is,

Environment

experiments

one 275 MHz Alpha

That is to

rebuilding

article.

or even larger

required

dedicated DEC Alpha

News

also that the occurrence

may seem very small,

For other collections

as

That

and stored in the database tables.

2.4 Computing

structures

easily handles the occurrences

64 Kbytes,

frequency

that

database.

update and delete operations.

structure.

stored in a varying

required.

not the entire

byte

data

table, the scope of each index

our desire

support document-level entire

is a document,

IR system in a

indexed

of the collection.

inversion

OpenVMS

that for the DOCINDEX

entry

are

are computed.

of

article. Observe

a probabilistic

documents

article.

varying

maximum

containing

To model

independent

the other statistics

- two byte integer

-

procedures. environment,

pre-computed

for

article.

the token

Language dynamic entities

attributes:

character

byte

the News

occlist

an inverted one

the token itself.

- four

frequency



contains

represents

of the following





table row

A query is presented to this For each word, the corresponding

table are selected using SQL Module


was

defined

table.

on

the

maxfreq

attribute

of

the
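The table and index layout used in these experiments can be sketched as follows. This is only an illustration under explicit assumptions: Python's built-in sqlite3 module stands in for Oracle Rdb, SQLite column types replace the fixed and varying character types of the schema design, and the three sample rows are invented. The sketch shows the multi-segment configuration, in which token, ID, and frequency are all part of the key, so a retrieval that does not select occlist can be answered from the index alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        id       INTEGER PRIMARY KEY,  -- unique article identifier
        "text"   TEXT,                 -- article content
        maxfreq  INTEGER,              -- highest term frequency in the article
        unqwords INTEGER               -- number of unique words in the article
    );
    CREATE TABLE docindex (
        token     TEXT,     -- index term
        id        INTEGER,  -- article containing the term
        frequency INTEGER,  -- occurrences of the term in that article
        occlist   BLOB      -- encoded occurrence list
    );
    -- Multi-segment index: token, ID, and frequency all live in the key.
    CREATE INDEX TokenIdFreqInd ON docindex (token, id, frequency);
    CREATE INDEX MaxFreqInd ON documents (maxfreq);
""")
conn.executemany(
    "INSERT INTO docindex VALUES (?, ?, ?, ?)",
    [("index", 1, 3, b""), ("index", 2, 1, b""), ("query", 1, 2, b"")],
)

# Ranking data for one query term; occlist is deliberately not selected.
sql = "SELECT id, frequency FROM docindex WHERE token = ? ORDER BY id"
postings = conn.execute(sql, ("index",)).fetchall()

# The last column of each plan row is SQLite's human-readable plan detail.
plan = " ".join(
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, ("index",))
)
```

In the single-segment configuration the index would instead be declared on token alone (TokenInd), and each matching row would have to be fetched from the table to obtain its ID and frequency.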

3.1 Database

Scaling

The load and retrieval the scaling

properties

is usually

on a database

[DEWIT92].

that grows

For example,

For database

as executing

for a transaction

holding

the computing

when the amount

the load transaction

would

time on the same computer

multi-segment nonlinear

of N

then returns

can be explained

However,

exhibits

after the data size exceeds

to a more uniform

growth

by considering

transactions

use a B-tree

index

batch scaleup.

110 Mbytes,

pattern.

the structure

As shown in Figure 2, the B* tree structure has duplicate

by a factor

to retrieve

that doubling

the amount

approximately

B-tree index

would

to observe

expect

our retrieval

transactions.

3.2 Initial

Experimental

Summary

statistics

in the following

of text increases

This

will

nodes to handle multiple

used by Oracle Rdb

occurrences

of the same

one of

Thus, one

properties

for

Findings database experiments

are given

table.

Table 2: Summary Statistics for Each Initial Case

Total Documents                30,000
Total Raw Text                 156 Mbytes
Total Unique Words             199,433
Total Word Occurrences         7,639,928
Total Load Transactions        300
Total Retrieval Transactions   120

Figure 2. Logical Structure of an Oracle Rdb B* tree.

For the multi-segment

and leaf

transactions accesses many of the related index pages.

point, 1 displays

the load transaction

times versus the amount of

source text indexed

under both single and multi-segment

on the DOCINDEX

table.

indices

I

nodes,

case, the B* tree has many more interior

and no duplicate

the amount of buffer

space is exhausted,

degrades rapidly.

After the buffer

to this situation,

the growth

stabilize.

However,

increased

by nearly

behavior

occurs

Each

nodes.

management

of the

and performance

in load transaction

times

by this point the load transaction an order

because

of

magnitude.

as the B*

number of nodes must be flushed

tree

grows,

from the buffer

case, after the database reaches a certain pool can accommodate

and

than

lw-

Clearly,

size, most of the B* tree growth

B*

Figure 1. Initial Load Transaction Times (horizontal axis: Mbytes of Source Text).

in Figure

1 the

transaction

times

for

the

single

of new vocabulary

storage database

expected,

3. is

Notice

roughly

the multi-segment

multi-segment index cases are nearly identical for data sizes less than 30 Mbytes. When the amount of text exceeds 110 Mbytes

storage than single-segment multi-segment index entries

in size, elapsed

single-segment

times

for the multi-segment

load transactions


the amount for most

[CHAPMAN90].

requirements

shown in Table and

multi-segment

case.

of available

text

buffer size, the

sources tends

to

Thus, we have reason to

larger databases.

the that

the

believe that the sublinear scalability exhibited by the load transaction under single-segment indexing will hold for much

The

Figure

for

nodes.

the interior

after the database reaches a certain

decrease significantly


longer

also exceed

However,

introduction

far

occurs in the duplicate

as the database grows, at some point the single-segment

tree size must

space.

this

per transaction.

As such, the database buffer


times have

an increasing


nodes

adjusts seems to

In effect,

In the single-segment

leaf

load

At some

algorithm


Notice

Rdb

of index

less.

batch scaleup

Total Documents


behavior

of an Oracle

in levels for the

be substantially

sublinear

for the initial

Table 2: Summary —


of and

If we assume

the number

two fold, then the growth

corresponding

data,

This is because the number

B-tree levels equals the log of the number of keys.
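The logarithmic relationship between key count and tree depth can be made concrete with a short calculation. In this sketch the fanout of 100 entries per node is a hypothetical value, and the key count is borrowed from the total word occurrences of the 30,000-document case purely for illustration; neither is a measured property of the Oracle Rdb index.

```python
def btree_levels(num_keys, fanout):
    """Smallest number of levels whose capacity (fanout ** levels) holds num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels

FANOUT = 100  # hypothetical entries per node

small = btree_levels(7_639_928, FANOUT)        # illustrative key count
doubled = btree_levels(2 * 7_639_928, FANOUT)  # the same index after doubling
```

Under these assumptions both trees need the same number of levels: doubling the keys adds only the logarithm of 2 in base fanout to the depth, a small fraction of one level, which is why a lookup that touches one node per level is expected to scale sublinearly.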

Figure

in the

an epoch

token.

expects sublinear

entries

transaction

in size,

only if the processing exactly

scaleup when the amount

B* tree [GRAY93]

of two. When

behavior

is said to be

resources constant.

system increased

Observe that in both cases,

sublinear

case the load

time increase

of database text doubles

scaleup linearly

exhibit

of source text loaded is less than 100 Mbytes.

the same

in size by a factor

the database size and processing

by a factor of N while

the load transactions

on determining

transactions.

defined

The batch scaleup

whenever

linear

are focused

for the related

systems, batch scaleup transaction

begin to increase much more rapidly. experiments

case.

database

are

that the total storage requirement

for

the experimental

for

3 times

the

source

Index configuration

text

size.

requires

As more

index case. This follows since the are much larger in size than for the

Table 3: Initial

Storage

overall

Requirements

for

(table)

occurs with

and

attribute

is not selected.

the best of these schemes, and is also capable of

mechanism

(index)

0.5 Mbytes

(table)

231 Mbytes

returning

occurrence

101 Mbytes

minimum

possible penalty.

(index)

Being



160 Mbytes

(index)

about three times

high. 3 shows the average at various

elapsed

time

required

for retrieval

occlist

we

DOCINDEX

with the

the size of the source text, for a table-based

to the multi-segment

attribute),

physical

Rdb that

transactions

By storing the text index structures

B-tree (analogous

database sizes.

Thus, our goal

for Oracle

data to retrieval

amount of storage required

transactions

index retrieval

86 Mbytes

an indexing

MaxFreqInd

Figure

for

when the occlist

DOCINDEX TokenIdFreqInd

indexing

is to develop combines

TokenInd

a single-segment

multi-segment

156 Mbytes

Total Source (text) DOCUMENTS

performance

loading,

the

text index

entirely

case including

could

eliminate

table

and

the

reduce

is

within

need

a the

for

storage

a

costs

significantly. Retrieval Transactions

In the following

indexing. —

Multi-segment,

indexing


contains

to enable highly -

-

extensions

Single-segment,


also reduce

4 Cooperative For simple m


video reason

Figure 3. Initial Retrieval Transaction Times.

is

Without

returning

execute

much faster

the occlist

the only case where expected time

sublinear

among

the retrieval scaleup

the

DOCINDEX

attribute,

cases

table

is

token,

that

in the B-tree.

Thus, an index-only

single-segment

index,

after

the

The large difference

the query.

of

That

required

identifier,

exhibit

employ

accessing

to

in

DBMS

the

For

is searched

related

the related

The



equivalent

segment cases,

For this experiment,

index,

the DOCINDEX

performance

penalty

which

resides only in the

performance

program,

for the single and multi-

even with

Based

Discussion

a multi-segment

table must be accessed and the related

model

is significant.

on our

of Initial

initial

we make

To

multi-segment

the occlist

transaction attribute

indexing,

the

logically

consists

database

and

indexing builds

the database

data

along

with The

attributes.

the DBMS

The

record.

of a key

data

identifies

the

attributes

contain

the index

records,

values.

and interprets

system

provides

access to the index

controls

semantic

content

of the index,

assumes

that

the content

as a tuple.

as a table.

for

any index

More specifically,

The data elements

entry

a cooperative

can be index is

for such tables define the

we observed

the lowest

times and sublinear

scalability

was not selected.

As such, the best

illustrate

this

concept,

consider

the

DOCINDEX

table

described above. The attributes of this table could be mapped into a cooperative index as follows:

times increases nonlinearly in the multi-segment case, and sublinearly when a single-segment index is employed. retrieval

the

the following

Using a physical table to store the text index does not yield scaleable performance for both load and retrieval transactions. As the database grows, load transaction

With

and

and interpreting

structure of the related index entries,

observations: ●

cooperative

for building

referent

application

represented

Results

experimentation

termed

an application

and the database system handles the physical storage of the related data structures. Notice that the cooperative indexing

modeled

3.3

entry

entries. In effect, the application

we

to effectively

In this scheme:

application-specific

table.

to the TextRetrieval

obtain relatively

is

indexing,

referent is a unique value by which

attribute,

table, is returned

requirement

reside

the B-tree

that when the occlist

this

cooperative

index

while Notice

indexing

and token frequency

For the

application-specific

it is necessary to support The approach we techniques.

application-defined

is sufficient.

images, essential

data objects,

ranks;

records must be fetched from the DOCINDEX

DOCINDEX

have

all

This is

The

For the DBMS

for computing

retrieval

of data,

types

share the responsibility

Each



data

complex

index structures.

is, in the multi-

than

strings,

of documents,

forms

requirements.

satisfy

indexing.

larger

and small

indexing

complex

complex

application-specific

transactions

remains

IR systems.

can easily be handled by the DBMS.

formats and indexing

In fact, this is

times

a consequence

to satisfy

document

case.

transaction

properties.

segment case, the data attributes namely,

the retrieval

in the multi-segment

file-based

Indexing

and other

accommodate

the

As we shall see, however,

data types such as integers

clips,

These

by eliminating

for our prototype

not the case for content-based

(figure axis label: Mbytes of Source Text)

that are designed

transactions.

demand

text index table.

aspects of indexing 10

extensions

the storage

the space overhead for efficient


of cooperative

for cooperative

a set of RDBMS

the total storage required 10

the notion

that we propose

we introduce

scaleable load and retrieval

need for a physical


section,

The implementation



token



ID is the referent attribute,



frequency

is the key attribute,

and occlist

For database updates,

when

builds

the index entries,

retrieval,


the IR query

and

are the data attributes.
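The division of labour behind this mapping can be sketched in a few lines. The class and method names below are hypothetical and the dictionary-backed store is only a toy stand-in for the DBMS side. The point it illustrates is that the storage layer sees a key, a referent, and an opaque data field, while the application alone decides what the data field contains.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexEntry:
    key: str        # token: the value the index is searched by
    referent: int   # ID: identifies the indexed document
    data: bytes     # frequency and occlist, opaque to the storage layer

class CooperativeIndex:
    """Toy stand-in for the DBMS side: stores entries, never inspects `data`."""

    def __init__(self):
        self._entries = defaultdict(dict)

    def put(self, entry):
        self._entries[entry.key][entry.referent] = entry

    def lookup(self, key):
        return sorted(self._entries[key].values(), key=lambda e: e.referent)

    def delete_referent(self, referent):
        # Drop every entry that points at the given document ID.
        for by_referent in self._entries.values():
            by_referent.pop(referent, None)

# Application side: it alone defines the layout of `data` (here one frequency byte).
index = CooperativeIndex()
index.put(IndexEntry("index", 7, bytes([3])))
index.put(IndexEntry("index", 9, bytes([1])))
index.put(IndexEntry("query", 7, bytes([2])))
index.delete_referent(9)
hits = [(entry.referent, entry.data[0]) for entry in index.lookup("index")]
```

Deleting by referent mirrors the document-level update case: all index entries for one document ID can be removed without rebuilding the rest of the index.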

the IR indexing and modifies processing

software

parses the text,

the cooperative software

obtains

index.

On

the index

entries

for

determines

the given

query

the documents

resulting

set of IDs

matching

documents.

terms,

interprets

the content,

that satisfy the selection

is used to retrieve

and

criteria.

and possibly

The



The

rank

Working

the

the cooperative

controls

updating

significantly

indexing

the index

from

model,

structures.

most DBMS

updates

addition,

the related

deletion,

application

Notice

provided

That is, when an index is defined DBMS

the application

structures

as a side-effect

For cooperative

the

of record

scheme

important

Updates to the index structures occur Thus, the level of data transaction control.

under

provided

analogous Data

by the cooperative

to that associated

indexing

with typical

DBMS

Using

It

indexing

That is, inverted

are stored as rows in a table, and the DBMS

maintains

cooperative

indexing,

the IR index

resides

that is managed by the DBMS.

amount of locking enables

during

the DBMS

an

With

enables the application consistency.

cooperative

index

the associated DBMS

of one update

textual

approach

benefits

database framework.

permits

the user to

the structure

of an index

are stored entirely

for the index.

table is a logical

construct

that

our

implementation

More

In effect,

for defining

entry

of

specifically,

index

where

discussed below,

each index

and

we could

data for each

typical

on a per document

scope

Thus,

the occurrence

database (much like

cooperative

the

is not restricted.

For

example,

using

inverted

files).

entry represents

basis.

The reason for

an INDEX

we could

directly

ONLY

delete

table

all entries

mechanisms. Notice

may

within control

operating

fully

backup

from

database

and recovery,

security,

cooperative

the development These

effort.

indexing

of Oracle

of prototype operations

features

will

software

to

was beyond the

be included

of cooperative

was

Rdb discussed

indexing

in the

for

Oracle

Rdb,

the

To ensure an equivalent

evaluation

and retrieval

described

This

and

within

experiments

time,

however,

we used

method along with an INDEX in Section

a

ONLY 4.0.

comparative

of our extensions,

the modified

ONLY

table definition

The results

performance

the load

in Section 2 were employed. SORTED

table for indexing

access the text.

for the database is described

of these experiments

along with

a

analysis are presented below.

Et

features

replication,

version

implementation

The INDEX

system

of

the extended

to

the cooperative

completely

implementation

Unfortunately,

production

by the DBMS. above,

using

scope of this

that in either

under transaction

mentioned from

above.

That is, modify index

Experiments

prototype

support index delete and modification

consistent

Thus, under this model of integration,

can benefit

such as integrated

5.1 Prototype

and

Summary

so on. 4.1

feature

data structures

to DOCINDEX,

evaluated

In this case, an

content.

is guaranteed

Our

queries may not

be logically

transaction.

is updated

to the properties

applications

This

general.

of each index

5 Prototype

a

In such cases the

and the cooperative

database integrity indexing

database

after

Some applications

to always

indexing

content

case, the index In addition

noting

is very

analogous

level of data

can elect to operate in a manner analogous

the typical context

tables

scheme

may update

in time

and full-text

of the database.

the IR indices

application

point

or modified.

index becomes inconsistent, cover portions

an application

at some

content is added, deleted,

database

worth

and updating.

of data

indexing

to select the desired

For example,

cooperative

with

indexing

having a given ID value. The

Consistency.

require

handle

IR

selecting this level of granularity is to model data structures where the access method can directly support document deletion

concurrency, Data

ONLY

the token occurrences

in one data

levels

with

For such tables, the attributes

In the experiments

This reduces the

increased

The primary

the need to use physical

that represents

the physical

key spans the entire

IR index updates and, therefore,

to provide

is

That is, the referent

associated

Tables.

design a cooperative

list entries

index on one or more of the related table attributes, structure

is

indexing

a table to model an IR index

two data structures.

new access method

and accessing index content with SQL statements.
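As an illustration of storing an IR index as table rows under DBMS transaction control, the following minimal sketch uses Python's sqlite3 module rather than Oracle Rdb. The DOCUMENTS and DOCINDEX table names follow the text; the column names and the helper functions are assumptions introduced here and are not part of the prototype.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with a document table and an index table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, body TEXT)"
    )
    # Hypothetical layout: one row per (token, document) pair.
    conn.execute(
        "CREATE TABLE docindex ("
        " token TEXT, doc_id INTEGER, occurrences INTEGER,"
        " PRIMARY KEY (token, doc_id))"
    )
    return conn


def add_document(conn, doc_id, body):
    """Insert the text and its index entries in a single transaction."""
    counts = {}
    for tok in body.lower().split():
        counts[tok] = counts.get(tok, 0) + 1
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO documents VALUES (?, ?)", (doc_id, body))
        conn.executemany(
            "INSERT INTO docindex VALUES (?, ?, ?)",
            [(tok, doc_id, n) for tok, n in counts.items()],
        )


def delete_document(conn, doc_id):
    """Delete a document and all index entries having its ID value."""
    with conn:
        conn.execute("DELETE FROM docindex WHERE doc_id = ?", (doc_id,))
        conn.execute("DELETE FROM documents WHERE doc_id = ?", (doc_id,))


def search(conn, token):
    """Return the IDs of documents containing the token, via plain SQL."""
    rows = conn.execute(
        "SELECT doc_id FROM docindex WHERE token = ? ORDER BY doc_id",
        (token,),
    )
    return [row[0] for row in rows]
```

Because the text and its index rows are written in the same transaction, a query never observes a document without its index entries, which is the consistency property discussed above.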

scheme is

content

Concurrency.

requires

(D

a table

an INDEX

mechanisms. ●

ONLY

define record.

Integrity.

integrity

lists

and eliminate

INDEX

within

Data

a prototype

for storing IR index structures. ●

has the following

This

the

extensions:

are stored in the B-tree.

occurrence

technology,

the

assumes this responsibility. indexing

we developed

after a B+ tree [GRAY93].

and data attributes

properties: ID

these assumptions,

for

indexing.

goals of this access method are to more efficiently

the

indexing,

interface

of Oracle Rdb that has the following

modeled

mechanisms.

on some database attribute,

or modification.

The cooperative

under

programming

support for cooperative

New Sorted Access Method.

software

that this differs

indexing

application

must include

implementation ●

Under

native

DBMS

Experimental

statistics

for

the

Findings database

experiments

using

the

prototype version of Oracle Rdb are the same as those presented above in Table 2. Figure 4 displays the load transaction times

RDBMS

Extensions

Given the generality many possible

for Cooperative

of our cooperative

implementations.

based

cooperative

on typical

of handling

B-tree

indexing. or hashing

content-based

such as documents,

versus the amount

model, there are

We have selected

that is based on the following assumptions: ● The DBMS must provide type-specific supporting

Indexing

indexing

an approach

Rdb

Notice

that after

prototype access methods for

schemes

prototype

are incapable

of complex

of source text

and the initial

what

objects

we

segment

images, video clips, and so on.

observed

in


of text is loaded, the

transactions.

elapsed time increased to

index

the multi-segment

version

stabilize.

60 Mbytes,

initial

The

elapsed

times

This behavior experiments

major

for

difference

much less for the prototypes case.

As

with

the

of

experiment.

times begin to increase rapidly,

load transactions load

for the prototype

single-segment

the source data size exceeds

load transaction

after about 90 Mbytes

That is to say, indexing

selection

Oracle

the

Then, for the

is similar the

to

multi-

is that

the

as compared

multi-segment

load

experiments,

this behavior

occurs when it is no longer possible to

cache the entire index structure this case the situation

in database buffers.

is somewhat

For a B+ tree, growth

adds index entries

transaction

pages are flushed

least recently transaction

used (LRU) completes,

in the leaf nodes, and

In the

following

transaction,

reloaded

from disk.
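The cyclic behavior of LRU replacement described here can be reproduced with a small simulation. This is a sketch under simplifying assumptions introduced for illustration: each load transaction touches every leaf page once in sorted order, and the function name, page counts, and buffer size are hypothetical rather than taken from the experiments.

```python
from collections import OrderedDict


def lru_misses(num_pages, buffer_pages, passes):
    """Count buffer misses when pages 0..num_pages-1 are visited in sorted
    order on each pass, using an LRU buffer of buffer_pages pages."""
    cache = OrderedDict()  # keys are kept in least-recently-used order
    misses = 0
    for _ in range(passes):
        for page in range(num_pages):
            if page in cache:
                cache.move_to_end(page)  # mark as most recently used
            else:
                misses += 1  # the page must be read from disk
                cache[page] = True
                if len(cache) > buffer_pages:
                    cache.popitem(last=False)  # evict least recently used
    return misses
```

With a 10-page buffer, three passes over 8 pages miss only on the first pass (8 misses), whereas three passes over 12 pages miss on every access (36 misses). Once the index no longer fits in the buffers, LRU evicts each page shortly before the next transaction needs it.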

required

per load transaction.

reference there

behavior

of

extensions

pages

3290

4420

Writes

various

elapsed

to

source

time

data

sizes,

required

for

prototype implementation

Figure

5 shows

retrieval

the

transactions

and the initial

average

from

multi-segment

the

case.

must

this cyclic

LRU

be


more I/Os are sequential

[HAWTHORN79].

to the

to address this behavior

1816

Average

have been flushed.

these

Fortunately,

understood

157

For

After one load

Thus, as the B+ tree grows, is well

are known

algorithm

many

under a

tree pages required

access entries that occur early in the alphabet

Prototype

Average Reads

As the

the buffer

algorithm.

of the B+

I/O per Load Transaction

of these nodes.

in sorted order. from

replacement

most

Database

Single-segment

accesses a large number

The load transactions proceeds,

Table 5: Average

in

different.

tends to occur mostly

each load transaction

However,

buffer



Also,

replacement

[GRAY93].

Figure 5. Retrieval Transaction Times.

From

the curves

attribute retrieval

When

case.

Figure 4. Load Transaction Times.

that the prototype

for most regions speaking, buffer

this result

performance

shown

algorithm,

improvement

With

we

for load transaction

scale linearly

in Figure

is encouraging.

replacement

improvements

load transactions

of the curve

expect

to

transactions

segment

case.

opportunities

The

Generally

see

Also,

5 we find

that

when

Notice

exhibit

attribute

require

is

selected,

the

that in all cases, the prototype

sublinear

storage

constructed

to the

requirements

with

in the following

significant

for

the prototype

the

version

experimental of Oracle

table.

there are other

for the prototype

Table 6: Prototype

load

Database

Storage

Requirements

156 Mbytes

DOCUMENTS

86 Mbytes

than

the initial

5, on average

more database writes

single-segment

case.

the prototype

per load transaction

This

additional

I/O

is

directly

related to the large number of B+ tree pages that must be

written

per transaction,

amount

of text index

causes

far fewer

writing

records

end

a contiguous

of

amount known

[GRAY93,

pages

As with

clustering

the buffer

TOMASIC93]. load

causes

techniques

allocations

transactions

enhanced buffer replacement

typically

area.

tree

that adding

However, more

for

reducing

related

Also,

nodes

the higher

can be reduced algorithm

data

adding

in

MaxFreqInd

(index)

0.5 Mbytes

Cooperative

(index)

200 Mbytes

From Table

there amount B+

size.

same

pages

the

cooperative

to the

the

problem, the

table

is because

discontinuous

replacement for

This

appends

(table)

(estimate)

the same

DOCINDEX

to be modified.

table

storage

to a B+

noting

to the physical

to a physical

discontinuous

prototype

database

of data

be modified.

It is worth entries

database

Rdb are shown

Total Source (text)

generates significantly

retrieval

scaleup properties.

experiments.

in Table

prototype

much less time than for the multi-

transactions, the most important of which is I/O reduction. In Table 5, we show the average I/O per load transaction across the As shown

the occlist

or better

4.

enhancements

times.

the occlist

retrieval

transactions Notice

in Figure

is not selected, the prototype exhibits nearly identical transaction times as compared to the multi-segment

6, notice

that the total storage requirement

indexing

This

storage

scheme

savings,

results from eliminating

to

the DOCINDEX

is roughly

2 times

as compared

the

for our

source

to the initial

the need to redundantly

text

case,

store tokens in

table.

are of

Considering

tree

the results Although

of our prototype collectively, we did not achieve the

read rate for the

encouraged. performance

by employing

necessary

changes to Oracle

prototype

efforts.

an

as discussed above.

for

demonstrated


the load

that

transactions,

In terms of retrieval the

new

it is only

Rdb are beyond access

because

the

the scope of our

transactions, method

we are desired

we clearly

exhibits

highly

desirable

scalability

for selecting

properties.

occurrence

the multi-segment

the performance

penalty

data is much more reasonable

Also,

than for

studies that we conducted Oracle Rdb yielded

case.

retrieval with

5.2

Lessons Learned

performance

In the process of building made

significant

required

most important ●

The

our Oracle

progress

to effectively

integrate

workload

generated

gaining

IR and

the

RDBMS

differs

by traditional

RDBMS

technology

RDBMS

text

impose

technology.

database system. by Oracle for very

For example,

Rdb exhibits

reducing

the B+ tree

from

databases.

However,

used

magnitude

more index

required.

We have found

possible

to obtain

the tree grows the arrival



essentially

of new grows,

indexing

and retrieval

Enhanced

buffer

LRU

RDBMS

buffer

tables

handled

that

However,

require

multiple

this behavior

Performance obtain This

highly

control,

will

SQL/MM

standard

by

uncertainty

most

non-key

We

decreasing

systems

storage

data

can

demand

the notion

[SQLMM]

In a index

access

are

costs for magnetic handle

“reasonable”

in order

to obtain

the

export

IR functionality

extensions

that

of database query

of uncertainty.

is woefully

The

lacking

in a RDBMS

we will are

context,

the viability

SQL

extended. of physically

now begin

required

evolving

in this regard.

must be appropriately

IR and RDBMS,

addressing

to adequately

the

support

in query processing.

like m

performance

join).

associated

would

assistance

on

to thank building

Rabah

the

Mediouni

test

of Oracle

systems

Also,

experiments.

who reviewed

with

and

we would

the paper for their helpful

for

conducting

like

to thank

his the

those

comments.

References

Naively,

tree can be used to

[BLAIR88]

for text operations.

Model.

products,

is highly

complex

of indexing it

is necessary

methods

to

[LOMET91

to evolve

].

[CALLAN92] It

improvement over time. As one final observation, we mention

and only for

Blair.

An

Extended

Relational

Retrieval

1988.

J.P. Callan,

INQUERY

Retrieval

Conference

on Database

78-83, Valencia,

W.B.

System,

Croft,

and S.M. Harding.

Proceedings,

and Expert

Third

Systems

Spain, Springer-Verlag,

The

International

Applications,

pp.

1992.

the current

accommodate

we can expect to see continual

D.C.

IP&M, Vol. 24, No. 3, pp. 349-371,

The

access methods, concurrency this interaction,

text

and,

[CHAPMAN90]

performance

Characteristics

D.

Chapman

that this work

and

of Legal Document

S&PT, Mead Data Central,

S. DeFazio.

Databases.

Dayton,

Statistical

Technical

Ohio, January,

Report,

1990.

has been [CROFT85]

very challenging. Most of this challenge is directly related to developing RDBMS kernel level support for a significantly

Network

different

Retrieval.

indexing

structures.

Acknowledgments

is

methods such as B-trees and

consequently,

concurrent

IR

derives

the cyclic

has taken years to understand Thus,

to include

RDBMS

a limited

access

on

that

languages

essential.

come with time.

performance

RDBMS

and recovery

RDBMS

claim

problem

index

the rapidly

in secondary

integrating

for operations

not the case for RDBMS

between

hashing.

we

primary

the overhead of a complete

impact

With

The of IR

index structures.

desirable

number

the

on

numerous

index structures.

of compressed

and

heavily

to the source text typically

Given that our work has demonstrated

for

This behavior

scans (e.g.,

that some IR-specific

is simply

interaction

used

to handle

is not typically

improvements

one may claim

scheme

products

similar

makes the integration

The

and the database infrastructure

for index pages that result from

by RDBMS

accessing RDBMS ●

are

load transactions.

impossible. rebuild

concentrated

compressed

however,

rebuild

To properly

after as the

rate.

algorithms

management

access patterns

off

tree operations

must be enhanced

a series of document generally

of B+

has

Thus far, we have not addressed the extension

are

Thus,

to fall

increases at a sublinear

management

products

sequential

at the leaf nodes.

the number

conducted

benefits of database functionality.

This is because

begins

were

we expect

This work has yielded

highly

database environment,

increases

that by using a B+ tree, it is

vocabulary

of

To obtain

enhancements

scaleable performance,

database

The

access method

design

dynamic

disks,

we observed

keys per byte of data.

performance,

nearly

unacceptable.

properties

that this is not the case for text. The reason being one of scale. That is, documents generate about two orders of scaleable

of text,

the fact that any modification

forces a complete

on the

indexing

for building

and DBMS

Existing

performance

system

use of such data structures,

to handle

demands

IR

storage requirements.

techniques that

The set of operations

excellent

large relational

from

optimized

different

these studies

collection

version

for database load and

for much larger databases.

Historically,

The

include:

operations.

operations.

considerably

Although

small

prototype

performance

system, we

understanding

considerably

has been highly

tables and the related for

Rdb prototype

toward

lessons learned from this undertaking

text

transactions.

a relatively

with an “untuned”

scaleable

paradigm.

W.B.

Structure

Croft

and T.J. Parenty.

and a Database

Information

Systems,

A Comparison

of a

System used for Information Vol.

10, No. 4, pp. 377-390,

1985. 6 Conclusions Commercial highly

RDBMS

integrated

products

support

available

today

for IR technology.

do not provide

In addition

to this,

indexing,

on IR technology

and scaleable

that enables the effective performance.

S. DeFazio. In Jim Gray, editor,

Full-text

Document

The Benchmark

Database and Transaction Processing Systems, Morgan Kaufmann, Second Edition, 1993.

many of the existing IR systems do not efficiently support document update operations. We have developed a prototype version of Oracle Rdb that implements a generic framework, called cooperative

[DEFAZIO93] Benchmark.

integration

The performance


Retrieval

Handbook Chapter

for 8.

[FUHR90] N. Fuhr. A Probabilistic Framework for Vague Queries and Imprecise Information in Databases, Proceedings of VLDB 90, pp. 696-707, 1990.

[GRAY93] J. Gray, A. Reuter. Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, Inc., 1993.

[HARPER92] D. J. Harper and A. D. M. Walker. ECLAIR: an Extensible Class Library for Information Retrieval. Computer Journal, Vol. 35, 3, 256-267, June 1992.

[HAWTHORN79] P. Hawthorn and M. Stonebraker. Performance Analysis of a Relational Database Management System. Proceedings ACM SIGMOD Conference, pp. 1-11, 1979.

[HUFFMAN52] D. Huffman. A Method for the Construction of Minimum-redundancy Codes. IRE Proceedings, Vol. 40, No. 9, pp. 1098-1100, 1952.

[LOMET91] D. Lomet. Grow and Post Trees: Role, Techniques, and Future Potential. Advances in Spatial Databases, Lecture Notes in Computer Science, No. 525, pp. 183-206, Springer-Verlag, Berlin, 1991.

[LYNCH88] C.A. Lynch and M. Stonebraker. Extended User-defined Indexing with Applications to Textual Databases. Proceedings VLDB, pp. 306-317, 1988.

[MCLEOD83] I.A. McLeod and R.G. Crawford. Document Retrieval as a Database Application. Information Technology: Research and Development, Vol. 2, pp. 43-60, 1983.

[ORACLERdb] Oracle Corporation, Oracle Rdb Version 6.1 Documentation Kit. 1995.

[SQLMM] P. Cotton. ISO Working Draft SQL Multimedia and Application Packages (SQL/MM) - Part 2 Full-text. ISO/IEC SC21/WG3 N1679, SQL/MM SOU-004, March, 1994.

[TOMASIC93] A. Tomasic, H. Garcia-Molina, and K. Shoens, Incremental Updates of Inverted Lists for Text Document Retrieval. Technical Report STAN-CS-TN-93-1, Stanford University, November, 1993.

[ZIPF49] G.K. Zipf. Human Behavior and the Principle of Least Effort, Addison-Wesley Press, 1949.
