Query Expansion using Lexical-Semantic Relations

Ellen M. Voorhees
Siemens Corporate Research, Inc.
755 College Road East
Princeton, NJ 08540
ellen@scr.siemens.com

Abstract

Applications such as office automation, news filtering, help facilities in complex systems, and the like require the ability to retrieve documents from full-text databases where vocabulary problems can be particularly severe. Experiments performed on small collections with single-domain thesauri suggest that expanding query vectors with words that are lexically related to the original query words can ameliorate some of the problems of mismatched vocabularies. This paper examines the utility of lexical query expansion in the large, diverse TREC collection. Concepts are represented by WordNet synonym sets and are expanded by following the typed links included in WordNet. Experimental results show this query expansion technique makes little difference in retrieval effectiveness if the original queries are relatively complete descriptions of the information being sought, even when the concepts to be expanded are selected by hand. Less well developed queries can be significantly improved by expansion of hand-chosen concepts. However, an automatic procedure that can approximate the set of hand-picked synonym sets has yet to be devised, and expanding by the synonym sets that are automatically generated can degrade retrieval performance.

1 Introduction

Users of retrieval systems that use word matching as a basis for retrieval are faced with the challenge of phrasing their queries in the vocabularies of the documents they wish to retrieve. This difficulty is especially severe in large, full-text databases since such databases contain many different expressions of the same concept [1]. Yet the ability to retrieve documents from such databases is crucial in a wide range of applications: retrieving documentation in support of a legal case, facilitating the organization and retrieval of correspondence and forms in an office, filtering news feeds for articles of interest, finding relevant passages within the complete manual set of a complex system for the particular problem at hand, etc. One method of easing the user's burden when selecting query words is for the retrieval system to automatically expand the query by adding terms that are related to the words supplied by the user. The new terms can either be statistically related to the original query words (that is, the terms tend to co-occur with one another in documents) or chosen from lexical aids such as thesauri, controlled vocabulary schedules, and the like. Using statistical relations to expand query vectors is attractive since the relations are easily generated from the documents at hand, obviating the need for lexical aids, which are expensive to build and maintain. Unfortunately, such methods have had little success in improving retrieval effectiveness when used apart from relevance data [2, 3]. Indeed, Peat and Willett show there are limitations to the effectiveness one can expect from such systems [4]. (Note, however, that methods that exploit statistical relations but do not expand the query, such as Latent Semantic Indexing [5], have been more successful.) Using lexical aids as a source of related terms has met with some success in small experiments.
Salton and Lesk found that expansion by synonyms improved performance but expansion by broader or narrower terms selected from a hierarchical thesaurus was too inconsistent to be generally useful [6]. Wang, Vandendorpe, and Evens found that a variety of lexical-semantic relations improved retrieval performance [7]. However, each of these conclusions was drawn from experiments on very small collections using single-domain thesauri. This paper examines the utility of query expansion by lexical-semantic relations in a large collection that spans several domains. Queries are expanded using the relations encoded in WordNet [8], a large, general-purpose lexical system built at Princeton University, and are run against the TREC collection [9].

To eliminate the confounding effects of expanding a poor selection of words, the terms that were expanded were chosen by hand. Thus, the results reported here represent an upper bound on the performance to be expected from a completely automatic procedure that uses this expansion strategy. Even in this best-case scenario, the expansion did not improve the effectiveness of queries that were relatively complete at the start. Less complete queries, queries consisting of a single sentence describing the topic of interest, were significantly improved by the expansion.

2 The Retrieval Environment

This section provides the background necessary to understand the context in which the experiments were carried out. The following section describes the experiments themselves, and the remaining section summarizes the conclusions the data support.

2.1 WordNet

The expansion procedure used in this work relies heavily on the information recorded in WordNet, a manually-constructed lexical system developed by George Miller and his colleagues at the Cognitive Science Laboratory at Princeton University [8]. WordNet's basic object is a set of strict synonyms, called a synset. Synsets are organized by the lexical relations defined on them, which differ depending on part of speech. For nouns (the only part of WordNet used in this study), the lexical relations include antonymy, the hypernymy/hyponymy relation (is-a relation), and three different meronym/holonym (part-of) relations. The is-a relation is the dominant relationship, and organizes the synsets into a set of approximately ten hierarchies.¹ Figure 1 shows a piece of WordNet. The figure contains all the ancestors and descendants, as defined by the is-a relation, for the six senses of the noun swing. Also shown is that one of the senses, a child's toy, is part-of a playground.

2.2 The TREC Collection

The TREC collection is a test collection being produced as a result of the TREC and Tipster workshops [9]. The part of the collection used in this work consists of the approximately 742,000 documents on TREC disks one and two, queries 101-150, and the set of relevance judgments produced after the TREC-2 and Tipster-3 evaluations. The TREC documents consist of English prose obtained from a variety of sources including newspapers, abstracts of technical papers, and the Federal Register. There are some SGML-like tags in the documents to delineate the bibliographic parts of the document (document number, title/headline, author, etc.). Other tags that mark special punctuation in the body of a document were ignored in this work. The documents were indexed completely automatically using the standard SMART indexing routines [10] (i.e., tokenization, stop word removal, and stemming) to produce an inverted index of document vectors.
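The indexing steps named above (tokenization, stop word removal, stemming, and construction of an inverted index) can be illustrated with a minimal sketch. This is not the SMART implementation: the stop list, the suffix-stripping `crude_stem` function, and the dictionary layout are all assumptions made purely for illustration.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop list (an assumption; SMART uses a much larger one).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize on letters, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each stem to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for stem, tf in Counter(index_terms(text)).items():
            index[stem][doc_id] = tf
    return index
```

Each entry of the resulting index records, for one stem, the documents that contain it and how often, which is the information the term weights described later are computed from.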

The text of a TREC query or, in TREC parlance, topic statement, is a complex natural-language description of need, as shown in Figure 2. Each topic statement has a set of fields flagged by special markers (the words enclosed in angle brackets). The topic statement is usually a particularly detailed search request: the Narrative field provides a description of what constitutes a relevant document, and the Concepts field lists words and phrases that the creator of the statement thinks are related to the topic. A shorter version of each topic statement is also available. This shorter version, the Summary Statement, is usually a single sentence describing the topic. (The Summary Statement is frequently, but not always, identical to the sentence given in the Description field.)

For this work, I added a new field to the topic statements that contains a list of hand-selected WordNet synsets containing nouns germane to the topic. My goal in selecting synsets for a particular topic was to pick synsets that emphasized the important concepts of the topic. I did not restrict myself to adding only synsets that contain some word of the original topic text; the choice of synsets was governed by my understanding of the full topic statement. Since one purpose of the experiments is to investigate the efficacy of different types of lexical-semantic relations assuming good starting concepts, selecting the correct sense of an ambiguous word, i.e., sense resolution, is not a problem. An average of 2.7 synonym sets per topic (minimum 0, maximum 6) were selected.

Topic 122, shown in Figure 2, provides an example of how synsets were selected. The topic asks about drugs for fighting cancer. Early experiments demonstrated that the synset {drug} worked poorly when used to expand a query, since {drug} has many children in the is-a hierarchy (stimulants, intoxicants, sedatives, etc.) that are not related to cancer-fighting. Instead, I added the synset {pharmaceutical}, a child of {drug}, even though the topic text never mentions the word 'pharmaceutical'; I chose the more specific synset to avoid over-generalizing the expanded query. The complete list of synsets added to topic 122 is {cancer}, {skin-cancer}, and {pharmaceutical}.

¹ The actual structure is not quite a hierarchy since a few synsets have more than one parent.

Figure 1. Relations defined for the six senses of the noun swing in WordNet.

Some topics contain important concepts that have no corresponding synset. Occasionally, the missing concept is a gap in WordNet; for example, toxic waste, genetic engineering, and sanctions meaning economic disciplinary measures are not defined in version 1.3 of WordNet. More often, the important concept was a proper noun or highly technical term that one wouldn't expect to be in WordNet. SDI or Star Wars, for example, is an important concept for topics 101 and 102 but does not occur in WordNet. Nothing was added to the topic texts for concepts that lacked corresponding synsets in these experiments, although making some provision for them would improve retrieval performance.

2.3 The Expansion Procedure

Domain: Medical & Biological
Topic: RDT&E of New Cancer Fighting Drugs
Description: Document will report on the research, development, testing, and evaluation (RDT&E) of a new anti-cancer drug developed anywhere in the world.
Narrative: A relevant document will report on any phase in the worldwide process of bringing new cancer fighting drugs to market, from conceptualization to government marketing approval. The laboratory or company responsible for the drug project, the specific type of cancer(s) which the drug is designed to counter, and the chemical/medical properties of the drug must be identified.
Concept(s): 1. cancer, leukemia 2. drug, chemotherapy

Figure 2. Topic 122.

Once the text of the topics is annotated with synsets, the remainder of the processing is automatic. Selected fields of the original topic statements are indexed using the standard SMART routines. The word stems selected from these sections are the "original query terms". The expansion procedure is then invoked using the synsets section of the topic statement.

Given a synset, there is a wide choice of words to add to a query vector: one can add only the synonyms within the synset, or all descendants in the is-a hierarchy, or all words in synsets one link away from the original synset regardless of link type, etc. The expansion procedure is parameterized to facilitate comparing the effectiveness of a variety of these expansion schemes. For each relation type included in WordNet, a parameter specifies the maximum length of a chain of that type of link that may be followed. A chain begins at each synset listed in the synsets section of a topic text and may contain only links of a single type. At each link traversed, all synonyms contained within the synset reached may be added to the query. Collocations in WordNet such as change_of_location are broken into their component words, stop words such as of are removed, and the remaining words are stemmed. The stems are appended to the query with a tag indicating the lexical relation through which they were added, so that stems added for different relations are kept separate.

As an example of the expansion process, consider the senses of swing shown in Figure 1. If the synset selected is the one containing the sense of swing meaning a plaything, and chains for any link type are limited to length one, then the stems of swing, mechanical, device, plaything, toy, trapeze, and playground would be added to the query. If the synset is the one containing golf_stroke, and hyponym (child) links may be followed through chains of any length, then the stems of swing, golf, stroke, shot, approach, chip, pitch, putt, drive, hook, and slice would be added.
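The chain-following expansion described in this section can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the procedure used in the experiments: the `SYNSETS` and `LINKS` tables are a hypothetical toy fragment of the noun graph around one sense of swing, the link-type names are invented labels, and stemming is omitted for brevity.

```python
from collections import defaultdict

STOP_WORDS = {"of"}

# Hypothetical toy data: synset id -> member words (collocations use "_").
SYNSETS = {
    "swing_toy": ["swing"],
    "mech": ["mechanical_device"],
    "toy": ["plaything", "toy"],
    "trapeze": ["trapeze"],
    "playground": ["playground"],
    "device": ["device"],
    "instr": ["instrumentality"],
}
# Hypothetical toy data: (synset id, link type) -> neighbouring synset ids.
LINKS = {
    ("swing_toy", "hypernym"): ["mech", "toy"],
    ("swing_toy", "hyponym"): ["trapeze"],
    ("swing_toy", "part_of"): ["playground"],
    ("mech", "hypernym"): ["device"],
    ("device", "hypernym"): ["instr"],
}


def words_of(synset_id):
    """Split collocations into component words and drop stop words."""
    words = []
    for member in SYNSETS[synset_id]:
        words.extend(w for w in member.split("_") if w not in STOP_WORDS)
    return words


def expand(start, max_chain):
    """Return {relation tag: set of words} for chains beginning at `start`.

    `max_chain` maps a link type to the maximum chain length for that type;
    each chain follows links of a single type only.
    """
    added = defaultdict(set)
    added["synonym"].update(words_of(start))
    for link_type, limit in max_chain.items():
        frontier = [start]
        for _ in range(limit):
            frontier = [n for s in frontier
                        for n in LINKS.get((s, link_type), [])]
            for synset_id in frontier:
                added[link_type].update(words_of(synset_id))
    return dict(added)
```

Keeping the added words grouped by relation tag is what later allows each relation to be weighted separately from the original query terms.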

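A query vector is comprised of subvectors of different concept types (called ctypes), where each ctype corresponds to a different lexical relation, as in the extended vector space model introduced by Fox [11]. A query potentially has eleven ctypes: one for original query terms, one for synonyms, and one each for the other relation types contained within the noun portion of WordNet (each half of an asymmetric relation has its own ctype). A word stem that is related to a synset through two different relations appears in both respective ctypes. Similarly, a stem that occurs in the original query text and is a member of a synset appears in both ctypes. The similarity between a document vector D and an extended query vector Q is computed as the weighted sum of the similarities between D and each of the query's subvectors:

$$\mathrm{sim}(D, Q) = \sum_{i} \alpha_i \, (D \cdot Q_i)$$

where $\cdot$ denotes the inner product of two vectors, $Q_i$ is the ith subvector of Q, and $\alpha_i$, a real number, reflects the importance of ctype i relative to the other ctypes.

Terms in document vectors are weighted using the lnc weights suggested by Buckley et al. [12]; that is, the weight of a term is set to 1.0 + ln(tf), where tf is the number of times the term occurs in the document, and is then normalized by the square root of the sum of the squares of the weights in the vector (cosine normalization). Query terms are weighted using ltc weights: the log term frequency factor above is multiplied by the term's inverse document frequency, and the weights in the ctype representing original query terms are normalized by the cosine factor. Weights in additional ctypes are normalized using the length computed for the original terms' ctype. This normalization strategy allows the original query term weights to be unaffected by the expansion process and keeps the weights in each ctype comparable with one another.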
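The weighting and similarity computation can be sketched as follows. This is a minimal illustration, not the SMART code: the function names, the dictionary-based vectors, the ctype label `"original"`, and the caller-supplied `idf` table are assumptions (the text does not give the exact inverse document frequency formula, so it is taken as an input here).

```python
import math


def lnc(term_counts):
    """Document weights: 1 + ln(tf), followed by cosine normalization."""
    raw = {t: 1.0 + math.log(tf) for t, tf in term_counts.items() if tf > 0}
    norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
    return {t: w / norm for t, w in raw.items()}


def ltc_query(subvector_counts, idf):
    """Query weights per ctype: (1 + ln(tf)) * idf.

    Every ctype is divided by the length of the "original" ctype, so the
    original term weights do not change when other ctypes are added.
    """
    raw = {
        ctype: {t: (1.0 + math.log(tf)) * idf.get(t, 0.0)
                for t, tf in counts.items() if tf > 0}
        for ctype, counts in subvector_counts.items()
    }
    original = raw.get("original", {})
    norm = math.sqrt(sum(w * w for w in original.values())) or 1.0
    return {c: {t: w / norm for t, w in vec.items()} for c, vec in raw.items()}


def sim(doc, query, alpha):
    """Weighted sum over ctypes of the inner product D . Q_i."""
    total = 0.0
    for ctype, qvec in query.items():
        inner = sum(w * doc.get(t, 0.0) for t, w in qvec.items())
        total += alpha.get(ctype, 0.0) * inner
    return total
```

Setting the weight of an added ctype to zero reduces the score to that of the unexpanded query, which is why the unexpanded run serves as a natural baseline for every weight combination.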

3 Experiments

3.1 Full Topic Statement

The purpose of this investigation is to determine the efficacy of expanding a query by lexical-semantic relations. Given a set of concepts to be expanded, the effectiveness of an expanded run is dependent on the link types followed during the expansion and the relative weight given to each link type (the α's in the similarity function above), so a variety of different schemes must be tested. Table 1 shows the 11-point average precision value and percent difference over the unexpanded run for the different combinations evaluated using the full topic statement (except the "Definitions" field) plus synsets. Four expansion strategies were tried: expansion by synonyms only, expansion by synonyms plus all descendants in the is-a hierarchy, expansion by synonyms plus parents and all descendants in the is-a hierarchy, and expansion by synonyms plus any synset directly related to the given synset (i.e., a chain of length 1 for all link types). The α for the original terms subvector was usually greater than the α for the other subvectors to reflect the assumption that user-supplied terms are generally superior to automatically added ones. The runs in which the original terms α was less than or equal to another α tested this assumption. Clearly, the expansion is ineffective: none of the expansion strategies significantly improves the performance of the unexpanded query. Indeed, the difference in performance between an expanded and unexpanded run for individual queries is very small for most expanded runs. Individual query performance differs more for more aggressive expansion strategies (i.e., expanding using longer chains of links and weighting added terms more heavily), but across the set of queries the aggregate performance is worse for aggressively expanded queries. In an earlier set of experiments, the most effective expanded run was the one that expanded a query synset by any synset directly related to it and had α = .5 for all added subvectors [13].
While this combination is not optimal for these queries, it has the advantage of being a straightforward choice of expansion parameters. Thus, this expansion strategy, which will be called the standard expansion strategy, is used for the experiments described in the next section.

3.2 Less Detailed Queries

Query expansion is a recall-enhancing technique designed to overcome the problems caused by differing vocabularies. The very complete topic statements provided in the TREC collection may already overcome some of the vocabulary problem, which would explain why expansion is unhelpful. To test this hypothesis, two new sets of queries were derived from the topic statements. One query set was derived from the Summary Statement plus the Concepts field (SmryCon); another query set was derived using the Summary Statement only (Summary). Both new query sets were expanded using exactly the same set of synsets as was used to expand the full topic version, and both used the standard expansion strategy.

Table 2 compares the lengths of the different versions of the queries. The table contains the mean number of terms in the unexpanded query and the mean ratio of additional terms to original terms for queries derived from the full topic statement (Full), the Summary Statement plus Concepts field (SmryCon), and the Summary Statement only (Summary). Since each version of a topic is expanded with the same set of synsets, the mean number of additional terms (17.56) is the same for the different versions.

Figures 3 and 4 contain the results of the retrieval runs for the two new query sets. The base case in each figure is the unexpanded run that uses the full topic statement. Expansion does not improve the SmryCon queries, but significantly improves the Summary Statement only version (35% improvement in 11-point average precision). Note, however, that the overall level of effectiveness obtained by the expanded Summary queries is less than that of the unexpanded full topic queries (39% degradation in 11-point average precision).

3.3 Automatic Selection of Synsets

Given that short queries have the potential to be significantly improved by expansion, it is necessary to see if the potential can be realized by a completely automatic procedure. While it is possible to present users with a list of candidate synsets and have them select the synsets to be expanded, this is a tedious process, and a poor choice can be worse than not expanding. Figure 5 provides a high-level description of the algorithm developed to select the synsets to expand [13]. The algorithm is based on the observation that the synsets selected need to represent the correct sense of important concepts [14]. Using the same reasoning as is used in choosing inverse document frequency weights [15], importance is approximated by the number of documents in which a query term occurs: a term occurring in more than N documents is not expanded. Sense resolution is approximated by requiring a new term to be related to at least two original query terms before it is included in the expanded query.

A series of retrieval runs on the Summary Statements tested the procedure's effectiveness. The runs tested different values of N: 70,000, approximately 10% of the collection, and 35,000, approximately 5% of the collection; limits on the lengths of chains to follow when expanding: 1 and 2 (all link types treated identically); and different α values: .3, .5, and .8. Table 3 shows the 11-point average precision obtained for these runs. As can be seen, none of the runs materially improves on the performance of the unexpanded queries. Inspection of the terms added to the queries suggests that the requirement that a term appear in two different lists is not a good approximation to sense disambiguation. Different senses of the words contained in a short query seldom have common relatives. Instead, the words that appear in more than one list are likely to be fairly general terms with more than one sense themselves. For example, since collocations are split into their components during the expansion process, general nouns such as system tend to appear in multiple lists.

Unexpanded: ave. prec. .3586

Expansion by synonyms only
orig terms α   synonyms α   ave. prec.   % change
1              .1           .3614        +0.8
1              .3           .3639        +1.5
1              .5           .3634        +1.3
1              .8           .3629        +1.2

Expansion by synonyms plus all descendants
orig terms α   synonyms α   descendants α   ave. prec.   % change
1              .1           .1              .3617        +0.9
1              .3           .1              .3639        +1.5
1              .3           .3              .3635        +1.4
1              .5           .1              .3635        +1.4
1              .5           .3              .3637        +1.4
1              .5           .5              .3622        +1.0
1              .8           .1              .3614        +0.8
1              .8           .3              .3612        +0.7
1              .8           .5              .3603        +0.5

Expansion by synonyms plus parents and all descendants
orig terms α   synonyms α   descendants α   parents α   ave. prec.   % change
1              .1           .1              .1          .3617        +0.9
1              .3           .1              .1          .3640        +1.5
1              .3           .3              .1          .3639        +1.5
1              .3           .3              .3          .3647        +1.7
1              .5           .1              .1          .3639        +1.5
1              .5           .3              .1          .3638        +1.5
1              .5           .3              .3          .3646        +1.7
1              .5           .5              .1          .3624        +1.0
1              .5           .5              .3          .3628        +1.2
1              .5           .5              .5          .3627        +1.1
1              .8           .1              .1          .3622        +1.0
1              .8           .3              .1          .3617        +0.9
1              .8           .3              .3          .3614        +0.8
1              .8           .5              .1          .3605        +0.5
1              .8           .5              .3          .3605        +0.5
1              .8           .5              .5          .3609        +0.6
1              1            1               1           .3511        -2.1
1              (one added α set to 2, the other two to 1)  .3350    -6.6

Expansion by synonyms plus any directly related synset
orig terms α   synonyms α   other α   ave. prec.   % change
1              .3           .1        .3629        +1.2
1              .3           .3        .3630        +1.2
1              .5           .1        .3624        +1.0
1              .5           .3        .3620        +0.9
1              .5           .5        .3608        +0.6
1              .3           .5        .3604        +0.5
1              1            1         .3491        -2.7

Table 1. Combinations of expansion strategies and relation weights tested.

[Figure 3: recall-precision plot; horizontal axis Recall; curves: full unexpanded, SmryCon unexpanded, SmryCon expanded.]

Figure 3. Effectiveness of queries derived from Summary Statement and Concept fields.

[Figure 4: recall-precision plot; horizontal axis Recall.]

Figure 4. Effectiveness of queries derived from Summary Statement.

                                                   Full    SmryCon   Summary
Mean number of terms in unexpanded query           52.54   29.22     11.02
Mean ratio of additional terms to original terms   .36     .77       1.71

Table 2. Length statistics for different versions of queries.

for (each query word w) {
    if (w not already expanded and document frequency of w < N) {
        expand all synsets containing w producing kin list of w
    }
}
for (each relative in the set of kin lists) {
    if (relative occurs in more than 1 list)
        add relative to query vector
}

Figure 5. Procedure to automatically select synonym sets to expand.
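The selection procedure of Figure 5 translates almost directly into code. The sketch below is one possible reading, with assumptions stated plainly: `doc_freq` is a plain dictionary of document frequencies, and `relatives_of` is a hypothetical callable that returns the kin list of a word (all words reachable from any synset containing it under whatever chain limits are in force).

```python
def select_expansion_terms(query_words, doc_freq, relatives_of, n_max):
    """Pick terms to add to a query, following the two loops of Figure 5.

    query_words  : iterable of original query words
    doc_freq     : word -> number of documents containing the word
    relatives_of : word -> iterable of related words (the word's kin list)
    n_max        : cutoff N; words in N or more documents are not expanded
    """
    # First loop: build one kin list per sufficiently rare query word.
    kin_lists = {}
    for w in query_words:
        if w not in kin_lists and doc_freq.get(w, 0) < n_max:
            kin_lists[w] = set(relatives_of(w))

    # Second loop: keep relatives that occur in more than one kin list.
    counts = {}
    for kin in kin_lists.values():
        for relative in kin:
            counts[relative] = counts.get(relative, 0) + 1
    return {relative for relative, c in counts.items() if c > 1}
```

The frequency cutoff stands in for concept importance and the two-list requirement stands in for sense resolution, so lowering `n_max` expands fewer query words and tends to add fewer terms.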

                                 ave. prec.   % change
Unexpanded                       .1634
N=70,000; max chain length=1
    α=.3                         .1627        -0.5
    α=.5                         .1603        -1.9
    α=.8                         .1543        -5.6
N=70,000; max chain length=2
    α=.3                         .1633        -0.1
    α=.5                         .1557        -4.7
    α=.8                         .1402        -14.2
N=35,000; max chain length=1
    α=.3                         .1636        +0.1
    α=.5                         .1635        +0.1
    α=.8                         .1639        +0.3
N=35,000; max chain length=2
    α=.3                         .1645        +0.7
    α=.5                         .1642        +0.5
    α=.8                         .1617        -1.0

Table 3. Effectiveness of expansion procedure strategies on Summary Statement queries when expanding automatically selected synsets.

4 Conclusion

The experiments discussed here demonstrate that query expansion by general lexical-semantic relations provides little benefit when a user supplies a detailed query. Since query expansion is a recall-enhancing technique, it is not surprising that longer queries benefit less than shorter queries. However, the longer queries are by no means doing a perfect job of retrieval, and they can be improved by other methods such as relevance feedback [16]. The success of these other techniques suggests that the most useful relations are idiosyncratic to the particular query in the context of the particular document collection.

Nonetheless, lexical-semantic relations have the potential to improve an initial query. Users frequently do not supply a detailed query. In this case, expansion can help, though the expanded query is unlikely to be as effective as a better formulated user-supplied query. The challenge now lies in finding an automatic procedure that is able to select appropriate concepts to expand.

References

1. David C. Blair and M. E. Maron. Full-text information retrieval: Further analysis and clarification. Information Processing and Management, 26(3):437-447, 1990.

2. A. F. Smeaton and C. J. van Rijsbergen. The retrieval effects of query expansion on a feedback document retrieval system. Computer Journal, 26:239-246, 1983.

3. C. T. Yu, C. Buckley, and G. Salton. A generalized term dependency model in information retrieval. Information Technology: Research and Development, 2:129-154, 1983.

4. Helen J. Peat and Peter Willett. The limitations of term co-occurrence data for query expansion in document retrieval systems. Journal of the American Society for Information Science, 42(5):378-383, 1991.

5. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

6. G. Salton and M. E. Lesk. Computer evaluation of indexing and text processing. In Gerard Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, pages 143-180. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1971.

7. Yih-Chen Wang, James Vandendorpe, and Martha Evens. Relational thesauri in information retrieval. Journal of the American Society for Information Science, 36(1):15-27, January 1985.

8. George Miller. Special Issue, WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 1990.

9. Donna K. Harman. The first Text REtrieval Conference (TREC-1), Rockville, MD, U.S.A, 4-6 November, 1992. Information Processing and Management, 29(4):411-414, 1993.

10. Chris Buckley. Implementation of the SMART information retrieval system. Technical Report 85-686, Computer Science Department, Cornell University, Ithaca, New York, May 1985.

11. Edward A. Fox. Extending the Boolean and Vector Space Models of Information Retrieval with P-norm Queries and Multiple Concept Types. PhD thesis, Cornell University, 1983. University Microfilms, Ann Arbor, MI.

12. Chris Buckley, Gerard Salton, and James Allan. Automatic retrieval with locality information using SMART. In D. K. Harman, editor, Proceedings of the First Text REtrieval Conference (TREC-1), pages 59-72. NIST Special Publication 500-207, March 1993.

13. Ellen M. Voorhees. On expanding query vectors with lexically related words. In D. K. Harman, editor, Proceedings of the Second Text REtrieval Conference (TREC-2), 1993. In press.

14. Ellen M. Voorhees and Yuan-Wang Hou. Vector expansion in a large collection. In D. K. Harman, editor, Proceedings of the First Text REtrieval Conference (TREC-1), pages 343-351. NIST Special Publication 500-207, March 1993.

15. Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21, March 1972.

16. Chris Buckley, James Allan, and Gerard Salton. Automatic routing and ad-hoc retrieval using SMART: TREC 2. In D. K. Harman, editor, Proceedings of the Second Text REtrieval Conference (TREC-2), 1993.