The Use of Phrases and Structured Queries in Information Retrieval
W. Bruce Croft, Howard R. Turtle,* and David D. Lewis†

Computer and Information Science Department
University of Massachusetts, Amherst, MA 01003

Abstract and

in information tems.

retrieval,

In previous

as a source This

like

in

this

for

show

that

results

improve

performance,

matically

extracted

nearly

in

queries

using

re-

by

this

way are

language selected

can auto-

query

per-

phrases.

the

came

tion

must

obey

some

among

its

showed

significant

use of phrases

as part

of a text

representation

language

has been

investigated

since

of information

retrieval

research.

indexing

collections

ative

low

baseline

days

example,

included

phrase-based

field

studies

riety

of experiments

tem.

Certainly,

phrases,

(1966).

if used

tained

(1968)

using

there

phrases

should

language

and,

representation.

with

phrases

ition.

These

small

improvements

results

in

have

the

been improve

experimental

however,

been

very

in some

we feel

that

support mixed,

collections

this

ranging

intu-

treated

from

swers

to decreases

* Current address: West Publishing Company, St. Paul, Minnesota.

† Current address: Center for Information and Language Studies, University of Chicago, Chicago, Illinois.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1991 ACM 0-89791-448-1/91/0009/0032...$1.50

results

representations

from

words,

between

index

for

of this

paper

in using

phrases

with

systems,

operators

such

are

terms? not

searchers

as AND

(∧),

The

anand

algorithms. the

issues

model.

express

Boolean

to it be

obvious

clarify

a retrieval

using

similar

or should

retrieval is to

phrases, retrieval

For example, term,

single as these

on

to the

examined.

implications

phrases)

of work

as an index

goals

(e.g.

1 we call

is given

and 2A

for Computing

queries,

a fee

assume

linguistic

expressions OR

con-

word-level

(∨),

section

$J .50


Retrieval effectiveness is measured in terms of recall and precision.

Communications

-J/9 j[O009/0032...

word

for

and the

requires

con-

a probabilis-

provided

of phrases

such

commercial

taining

neither

sufficiently

questions

structure

Illinois.

the ACM

is by permission To copy

Information

over

been

user-identified

and

amount

be treated

derived

of the

involved

Minnesota.

been

significant

One

with

single

relationship

as a relationship to

have

in

significant

the

terms

have

rel-

1990).

a phrase

index

ob-

the

has not

should

quality

results

Das,

that

model

the specificity the

and

fig-

were

Improvements

algorithm

from

phrases.

as CACM²

might

that

with

syntactic

results.

found

relaresults

his improvement

experiments

different

Despite

sys-

feeling

we

collections

of

addi-

syntactic

some

such

Fagan’s

in

Fagan’s

that

co-

some

but

words. using

baselines

In

both

algorithm,

(Croft

a va-

SMART

the

using

significantly

Cran-

described

consequently,

The

do not,

in the

also

has always

correctly,

of the indexing of the text

Salton

indexing

tic

for

word

smaller.

phrases

early

Cleverdon,

single

and

by

phrase,

in

out

with

defined

in a document.

on the

increases

also be pointed

phrase

is

the proxim-

characterized

constraint

ures obtained

siderably

or

the

be

none

to quite

and/or

of components

but

the

of occurrences words

com-

phrases,

“syntactic” in

phrase

component

phrases,

best

and

as a statistical

tionships

the The

number

most

using

of factors

statistical

may

criteria

statistical

Introduction

the

phrase

It should

1

A

on

of the

indexing

“statistical”

occurrences

between syntactic

is one

a number

of its component

A

that

both

varied

constraints

ity

in

used

occurrences

phrases model.

(1987)

of automatic

process.

to build

retrieval

a natural

he and

where

phrases

as manually

that

phrases formation

effectiveness,

thesis

studies

are used

phrases

that

model,

on phrases,

retrieval

in

used

01003

in others¹, recent

prehensive

sys-

been

retrieval

a probabilistic

from

as well

have

an approach

and

history

in commercial queries

language for

a long

of research

we describe

queries

have

a statistical

majority

in natural

structured

form

Boolean

improvement

paper,

identified Our

the

little

queries

particularly

work,

of phrases

work,

sulted In

Boolean

D. Lewist

MA

effectiveness

phrases

Retrieval

Department

Amherst,

Fagan’s Both

in Information

A test collection consists of a set of documents, a set of queries, and lists of the relevant documents for each query.

The Communications of the ACM (CACM) collection is described in section 4.1.

proximity, level for

sentence-level The

proximity. example,

tried),

may

or by

formation query

is used

tify

model

can

phrases

operator

how

incorporating

term

phrases

phrase,

were

used

dl

in the

or other

d2

paper,

(Croft,

1986),

as specifying

identified struct

in a natural a structured

bilistic

model

1991).

goal,

rl

rz

based

representation

In the following

section,

start

by describing of our

phrases, been

the inference

emphasizing in retrieval

these

models

ables

the similarities

clearly

in

seen.

Boolean

In section building

structured used

presents

the

work.

network

reviews

work

content

of

for

an overview and

phrases

in

of our

that

The

uses

in sta-

results

Finally,

in

approach

describe this

and

section

the

paper.

the importance

to

are

and

other

the

use

are query

information

difference

of

document nodes,

and

need.

).

need,

paper,

future

document

and

collections.

values

true

and

of the

into

of

forms

emphasizes to

calculate

of the of the

such

under

model.

For this

structured

queries

in the inference model

can

of the

net model

be shown

model

docu-

informa-

as a thesaurus

this

are that

of the

features

net model

it

evidence

knowledge

account

advantages

These

that

representations

interpretation

different

is

representations

domain

a natural

diagram.

sources

different

and

the inference

models

Different

the major

have

possible

as representations

between

multiple

can all be taken

of

with

regarded

probabilistic

content,

tion 4

a discussion

of large

d~’s are

qi’s

need.

major

ment

specific

Section

5, we indicate

Network:

nodes,

user’s

Queries

P(I|Document

queries,

and discuss

the

information

to be

as proximity

Inference

of a document,

false.

en-

them

1: Basic

v+’s are concept

on have

each

among

Figure nodes,

1 represents

is the

phrases

of an inference

such

We

research

Instantiating

differences

experimental

results.

directions

through

models.

3, we give

techniques

ways

operators

renet-

which

different

subsection

and

retrieval

previous describe

form

last

need

net model

models.

and

The

queries

tistical

those

the

Croft,

overall

inference

We then the

P

interaction.

we review

experiments.

treated

user


to con-

and

our

information

and


ql

Phrases

(Turtle

a complex,

of an analysis


used in a proba-

towards

is to build

language

basis

nets

a step

rs

In

term

are used

is then

on inference

which

approach.

query

which

represents

search natural

language

query,

based

This

a different

d

in a probabilistic

interpreted

we take

d-l


to iden-

dependencies. In this


lintext.

queries

dependency

were

as (in-

in a document

Boolean

re-

A

such

Structure

the

used

that

retrieval,

(information

be detected

we have


information

3 words of retrieval.

work,

work,

and

by

a proximity

to describe

potential

that

using

construct,

In previous

concept

be expressed

within

guistic

proximity,

using

are discussed

a be-

low.
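The idea of treating a query phrase as a proximity constraint, for example requiring information to occur within 3 words of retrieval, can be sketched in a few lines of Python. This is an illustration only, not the implementation used in the experiments: the function name `within_window`, the whitespace tokenization, and the choice of an ordered window are all assumptions.

```python
def within_window(tokens, first, second, window=3):
    """Return True if `second` occurs at most `window` words after `first`.

    Hypothetical helper: the ordered window is an assumption made for
    illustration; an unordered variant would compare abs(j - i) instead.
    """
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [j for j, tok in enumerate(tokens) if tok == second]
    return any(0 < j - i <= window
               for i in first_positions
               for j in second_positions)


# Toy document, tokenized on whitespace (an assumption; no stemming or stopping).
doc = "modern information storage and retrieval systems".split()

# "retrieval" is three positions after "information", so a window of 3 matches
# and a window of 2 does not.
matches_3 = within_window(doc, "information", "retrieval", window=3)
matches_2 = within_window(doc, "information", "retrieval", window=2)
```

A document that satisfies the constraint would be treated as containing the phrase; how much weight that evidence receives is a separate question.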

2 Previous Work

2.1 The Inference Net Model

The

inference

used

as the

basis

of

phrases,

ments tion it

4.

It

follows

son,

1977).

net

that

it

particular

the

comparisons for

probability

Typically,

decides

model

information

(Turtle

of different

ranking

principle

a document query

is relevant (Fuhr,

a slightly

P(I|Document information

different

More

need

as a complex

that

trix

specifically,

node

inferin

probagiven

we consider about

on

the

of parents

its

potential

for

the

can

to

compute

be

used

associated

paper.

with

1 shows It

the

consists

of the the

all

Given

the

probability

or

de-

all nodes

a set these

multiof that

and

DAG,

remaining

basic

has

characterizes node

a

a ma-

all possible

a node

that

causes.

roots

q, we draw

the dependence

and

between

probabilities

this


set

If or im-

q contains

P(q|p) for

When

specifies

relationship

Figure

an the

matrix

propositions.

node

specifies

nodes

and edges

p “causes”

by node

The

variables.

representing

belief

a

that

two the

pendence

a par-

The approach

proposition

ple parents, y

represented p to q.

matrix)

of the

between

by a node

is a di-

in which

or constants

relations

from

1989)

(DAG)

variables

proposition

(a link

(Pearl,

graph

represented

edge

values

calculates

is satisfied

the

in

probability

is the

dependence

directed

given

1989).

), which need

document.

is the

represent

sec-

(Robert-

model

which

propositional

in

network

dependency

represent

plies

treat-

model

inference

acyclic

a proposition

1991) is

experiments

a probabilistic

takes

Croft,

retrieval

,Query),

and

a user’s

and

the

a probabilistic

computes

that

Model

Net

and

document

ence bility

for

Bayesian

rected,

model

|Document

a user

ticular

net

the

P(Relevant that

Inference

is

A

Work

of prior networks degree

of

nodes.

inference

of a document

network network

used and

in a

For

retrieval,

teraction network. that

a query

with

the

This and,

and

allows

the information

ument

network

user,

us

is

to to

through

to the

compute

need is met

consequently,

built

attached

the

for any

produce

in-

document probability

particular

doc-

a ranked

list

of

can

be

documents.
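As a minimal sketch of how beliefs might be combined for a structured query in an inference network, the Python fragment below evaluates AND and OR nodes in closed form, assuming the parents of each node are independent and the canonical Boolean link matrices are used. The function names, the example query (information AND retrieval) OR indexing, and the term beliefs are hypothetical and are not taken from the experiments reported here.

```python
from math import prod


def and_belief(parent_beliefs):
    """Belief that all parents are true, assuming independent parents."""
    return prod(parent_beliefs)


def or_belief(parent_beliefs):
    """Belief that at least one parent is true, assuming independent parents."""
    return 1.0 - prod(1.0 - b for b in parent_beliefs)


def score(term_beliefs):
    """Belief in the hypothetical query (information AND retrieval) OR indexing."""
    both = and_belief([term_beliefs["information"], term_beliefs["retrieval"]])
    return or_belief([both, term_beliefs["indexing"]])


# Made-up beliefs that each term node is true given each document.
docs = {
    "d1": {"information": 0.8, "retrieval": 0.7, "indexing": 0.1},
    "d2": {"information": 0.9, "retrieval": 0.1, "indexing": 0.2},
}

# Instantiate each document in turn and rank by the resulting query belief.
ranking = sorted(docs, key=lambda d: score(docs[d]), reverse=True)
```

Under these assumptions d1 is ranked above d2, because both conjuncts of the AND node receive substantial support in d1.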

2.2 Phrases

The

use of phrases

discussed 1. What

2. Are

phrases

(information

query

network

for

the

V Tfiies

∧ retrieval)

the

terms

to

determine

if

a phrase

concepts

or are they

relation-

concepts? weighting

use of phrases

are

systems

issues:

or query?

is an appropriate

Should

4.

query

used

separate

between

3. What Structured

is

IR

following

in a document

ships

2:

of the

evidence

exists

Figure

in experimental

in terms

used

for

for

affect

phrases?

which

indexing

single

word

and

docu-

queries

ments?

query

network.

a collection query gle

The and

its

processing.

node

and

or

more

each information tive

query

The

and

and

document been cific

to

(i.e.

the

content represent

a document

signed.

A representation

given

its

The gle leaf tion the

query that

need

query

networks

expressions.

plex

Figure

query Boolean

operators

matrix

form

(Turtle

showed

that

queries

is at least

version

of the vector

this

such

as those

2 shows has

event

that that

be used the

∧ retrieval)

(information

with

DAG

roots

may

inference as effective space

been

as-

the

need.

formed

with

as the model

set

1991). model

to

both

for

parse

phrase

indexing about

pairs (Van

(1990)

tend

to same

phrases.

Sparck

phrasal

synonymous

used

together

for

other

concept. or nearly

in documents

research

on

term

evidence

parser

techniques, for

together

to idenusing

information than

has been

the

If

may

part

of

hypothesis

words the

mea-

words

being

synonymous

other

of words

associated two

Tait

queries.

example,

For instance,

clustering.

and

on the

Of course, reasons

noun

as phrases

the co-occurrence

mutual

co-occur

grammar. (e.g.

to analyze

such as the expected 1979).

document

Jones

a syntactic

strongly

Rijsbergen,

used the PLNLP of the

identified

are

that

siderable

that

of

and grammars document and

to use semantic

It is possible,

of words

a library Parser-based

constructs

information

use information

sure the

used

cate-

and patterns

a simpler

are then

Dil-

of the

syntactic

against

used

extraction.

semantic

used.

the

(1987)

linguistic tree

example,

general

link


specific

been

is typical

parse

(1988)

phrase

the

measures

1983).

Smeaton

to refine

to identify

Wu,

Fagan

a complete

lin-

template-

noun>),

It is also possible

tify

Boolean”

where

use

Both

are identified

are matched

For example,

in the

canonical

Fox and

have (1983)

as
