Inferring Probability of Relevance Using the Method of Logistic Regression

Fredric C. Gey
UC Data Archive and Technical Assistance
University of California, Berkeley, USA
e-mail: [email protected]

Abstract

This research evaluates a statistical model for text retrieval which utilizes the technique of logistic regression to obtain probabilistic equations which rank documents by probability of relevance as a function of document and query properties. Since the method infers probability of relevance from statistical clues present in the texts of documents and queries, we call it logistic inference. The coefficients of each equation are derived from a training collection. The model is applied to three well-known information retrieval test collections and compared directly to the well-known term-frequency/inverse-document-frequency (tfidf) weighting with the cosine similarity measure of the vector space model. In the comparison, the logistic inference method performs significantly better than (in two collections) or equally well as (in the third collection) the tfidf/cosine vector space model. The results were subjected to statistical tests to see if the differences in performances of the two models are statistically significant or could have occurred by chance. By transforming the statistical clues into standardized variables with mean μ = 0 and standard deviation σ = 1, the method allows one to apply a model fit on one training collection to other document collections with little loss of predictive power.

1. Introduction

We can consider the central task of Information Retrieval to be that of satisfying a user's information need by providing access to the intellectual contents of a collection of texts or documents. The heart of this task is the concise representation of the meanings of the texts contained in the collection, a representation from which a retrieval system can choose the appropriate documents to return in answer to a query.

The well-known vector space model, developed by Gerard Salton and associates at Cornell University [1] [2] [3], represents both queries and documents as vectors in an "m"-dimensional vector space, where "m" is the size of the vocabulary of nontrivial terms used to index the collection. In this model one can choose among several distance measures between the query vector and the document vector; the best known is the cosine measure, which computes the cosine of the angle between the two vectors. This measure has a clear geometric interpretation: it indicates the degree to which a document deviates from the query in both magnitude and direction, assuming queries and documents are represented as vectors in the same space.

Weights may be merely the presence or absence of a vocabulary term in either the query or the document, or they may be derived from frequency attributes of terms, which can be easily and mechanically computed from the texts. For example, the attribute 'absolute frequency' is the count of occurrences of a term in the query (QAF) or in the document (DAF). Another attribute, the 'inverse document frequency' (IDF) of term tj, may be computed as the ratio N / n_tj, where N is the total number of documents in the collection and n_tj is the number of documents in which term tj occurs. The IDF weight, which is usually logged, was first suggested by Karen Sparck-Jones [4]. These two measurements are combined in the well-known term frequency-inverse document frequency (tfidf) weighting used in the SMART retrieval system, which assigns to each term a weight formed from the product of its term frequency and its inverse document frequency. Salton and Buckley [5], in their 1988 paper, identified a number of term weighting schemes and performed experiments which show the best and worst retrieval performance of these schemes against several test collections. We will return to tfidf weighting in the experiments below.

Probabilistic retrieval models attempt to place the indexing and search processes of information retrieval on a theoretically sound footing, in which primary status is given to the probability that a document will satisfy the user's information need. Such an approach is both theoretically and practically satisfying: if documents are returned to the user in an order ranked by their probability of satisfying the user's information need, then, according to the probability ranking principle [6], optimal retrieval performance will be achieved. A number of probabilistic retrieval models have been proposed [7]. Unfortunately, experiments with the so-called 'Binary Independence' model [8], and with the 'Linked Dependence' model to which Cooper generalized it [9], have been only partially successful, in part because of problems of insufficient sample size and insufficient statistical evidence: even if half of a collection is used as a training set, or if only the documents returned to the user by relevance feedback are used, the statistical evidence encountered may be insufficient to predict the probability of relevance for the remainder of the collection. Improved estimation of probability of relevance, even in the face of such problems, leads to the approach of this paper.

2. The logistic inference method

If no assumptions are made about the statistical distribution of terms in documents and queries, we do not assume that there exists a total relevance function R = f(qi, dj) for each query-document pair; we merely associate, with each term tk which is present in both query qi and document dj, a finite set of attribute values (v1, ..., vn). These attributes may be derived from the query (such as the count of occurrences of the term in the query), from the document (such as the number of occurrences of the term in the document), and from the collection (such as the number of documents in the collection indexed by the term divided by the total number of documents in the collection). The logistic inference model says that we can use a random sample of query-document-term triples from a collection for which binary relevance judgments have been made, and statistically infer the logodds of relevance for a particular term by the formula:

    log O(R | qi, dj, tk) = c0 + c1 v1 + ... + cn vn

The logodds of relevance of document dj to the i'th query qi = <t1, ..., tq> is then the sum over all terms the query and document have in common:

    log O(R | qi, dj) = log O(R) + Σ_{k=1}^{q} [ log O(R | qi, dj, tk) − log O(R) ]

where O(R), known as the prior odds of relevance, is the odds that a document chosen at random from the collection will be relevant to query qi. The coefficients are derived by using the method of logistic regression [10], which fits an equation to predict a dichotomous dependent variable as a function of independent variables (possibly continuous) which show statistical variation. Once the log odds of relevance is computed from this simple linear formula, the inverse logistic transformation may be applied to directly obtain the probability of relevance of a document to a query:

    P(R | qi, dj) = 1 / (1 + e^(−log O(R | qi, dj)))

Once the coefficients of the equation for logodds of relevance have been derived from a random sample of query-document-term-relevance quadruples within a particular document collection, these coefficients may be used by the retrieval system to predict odds of relevance for other query-document pairs (past or future).
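The two steps above, combining per-term log-odds against the prior and applying the inverse logistic transformation, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the per-term log-odds values and the prior are hypothetical stand-ins.

```python
import math

def log_odds_of_relevance(term_logodds, log_prior):
    """Combine per-term log-odds with the prior log-odds:
    log O(R|q,d) = log O(R) + sum_k [log O(R|q,d,t_k) - log O(R)]."""
    return log_prior + sum(z - log_prior for z in term_logodds)

def probability_of_relevance(logodds):
    """Inverse logistic transformation: P = 1 / (1 + e^(-logodds))."""
    return 1.0 / (1.0 + math.exp(-logodds))

# Hypothetical per-term log-odds for two matching terms, and a prior
# log-odds of the magnitude typical for retrieval (few documents relevant).
log_prior = -5.138
logodds = log_odds_of_relevance([-4.0, -4.5], log_prior)
p = probability_of_relevance(logodds)
```

At log-odds zero the transformation yields probability 0.5; the strongly negative values typical in retrieval yield correspondingly small probabilities.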

Regression models which combine multiple clues to approximate probability of relevance were first used with some success by Ed Fox [11], who combined clues such as term frequency, authorship, and co-citation, and approximated relevance by ordinary linear regression. More recently Fuhr [12] [13] introduced polynomial regression for the same purpose. There are two major problems with the application of ordinary regression to this problem: the outcome variable, relevance, is dichotomous rather than continuous, and the error distribution is binomial rather than normal, as is usually assumed by ordinary regression; hence ordinary regression is, at best, an approximation [15]. An alternative approach is to use Bayesian inference networks, as developed by Turtle [14] and others [16].

3. Sampling for logistic regression

One of the computational problems in using logistic regression is the computation necessary to compute the regression coefficients. Even supposing that only half the document collection is used as a training set, the number of query-document-term triples becomes astronomical. Moreover, for each query the relevant documents are typically a very small proportion of the collection, and non-relevant documents outnumber relevant ones by a large factor. Since a sample with a significantly higher representation of relevant documents, and a significantly smaller proportion of non-relevant documents, is desirable for the regression computation, it is reasonable to construct the sample by taking all the relevant documents while sampling only some fraction of the non-relevant ones, say every thirtieth, and to compensate by assigning each sampled non-relevant triple a weighting factor of 30 in the regression computation. Such an adjustment compensates for the differential sampling process, but it is incumbent upon us to assure that the sample still encompasses an adequate total number of query-document-term triples to afford a reasonable fit.

We have chosen the Cranfield test collection [17] to fit our logistic model. The Cranfield collection is a long-standing test collection, in use since the beginning of the Cranfield experiments, and its deficiencies with respect to the characteristics of an ideal test collection have been noted, notably by Swanson. It does, however, seem to possess the characteristics needed to afford a reasonable fit of our model. First, its queries are in natural language form. Second, while it has only 225 queries and 1,400 documents, this seems to be a sufficient sample for the regression computation. The characteristics of an ideal test collection should be:

● query statistics can be collected as well as document statistics,
● the documents have both titles and abstracts present (abstracts are missing from some of the CACM collection),
● the queries are a random sample from a larger population of queries, or are a carefully crafted sample of all possible queries, and
● the relevance judgments are a random sample of all possible query-document pairs in the collection; if this is not possible, the relevance judgments should be a sufficiently large sample whose properties are known.

We will later apply the model fitted on Cranfield to the CACM collection (52 queries, 3204 documents) and the CISI collection (76 queries, 1460 documents) in order to test it against other, larger document collections.
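The sampling scheme described above, keeping every relevant triple, keeping every thirtieth non-relevant triple, and weighting the sampled non-relevant triples by the sampling factor, can be sketched as follows. This is an illustrative reading of the text; the triple representation is hypothetical.

```python
def build_training_sample(triples, factor=30):
    """Keep all relevant triples (weight 1.0); keep every `factor`-th
    non-relevant triple, weighted by `factor` to compensate for
    the differential sampling."""
    sample = []
    seen_nonrelevant = 0
    for triple in triples:
        if triple["relevant"]:
            sample.append((triple, 1.0))
        else:
            seen_nonrelevant += 1
            if seen_nonrelevant % factor == 0:
                sample.append((triple, float(factor)))
    return sample

# 5 relevant and 300 non-relevant hypothetical triples: the sample keeps
# all 5 relevant triples and 10 weighted non-relevant triples.
triples = [{"relevant": True}] * 5 + [{"relevant": False}] * 300
sample = build_training_sample(triples)
```

The weights make the sampled non-relevant triples stand in for the full non-relevant population in the weighted regression.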

4. The logistic inference model with six term clues

The complete logistic inference model for the logodds of relevance of a query-document pair, given term tj in common between query and document, has the following formula for six term clues:

    Z_tj = log O(R | tj) = c0 + c1 log(QAF) + c2 log(QRF) + c3 log(DAF) + c4 log(DRF) + c5 log(IDF) + c6 log(RFAD)

The clues are defined as follows. Query absolute frequency (QAF) is, as previously defined, the number of occurrences of the term in the query; query relative frequency (QRF) is QAF divided by the total number of term occurrences in the query. Similarly, document absolute frequency (DAF) is the number of occurrences of the term in the document, and document relative frequency (DRF) is DAF divided by the document's length. Inverse document frequency (IDF) is as previously defined. Finally, relative frequency in all documents (RFAD) is the total number of occurrences of the term in the collection divided by the total number of occurrences of all terms in the collection.

You will note that all variables in the formula are logged; there is more than one reason for using logarithms of the raw clues. One reason is to dampen the influence of raw frequency information: it is unrealistic to assume that, say, 50 occurrences of a term in a document make the document five times more important than 10 occurrences. Another reason is that term frequencies in general behave according to a skewed distribution, and taking logs smooths the fit; indeed, the maximum likelihood computation behaves better after taking the logs of these clues. Finally, sums of logged clues behave as products of the raw clues when the anti-logarithm is applied.

The sample fit for this extended logistic model is done against a sample of 20,246 query-document-term triples taken from the Cranfield collection. The following coefficients are derived from this fit:

    Z_tj = log O(R | tj) = -0.2085 - 0.2036 log(QAF) + 0.19143 log(QRF) + 0.16789 log(DAF) + 0.57544 log(DRF) + 0.75033 log(IDF) - 1.5967 log(RFAD)

What remains is to compute log(O(R)), the prior odds of relevance for this collection. The prior probability of relevance is estimated by counting the number of query-document pairs for which a positive judgment of relevance has been assigned (for the Cranfield collection this is 1838 pairs) and then dividing by the total number of query-document pairs (for 225 queries and 1400 documents this is 225 · 1400). Hence:

    prior = 1838 / (225 · 1400) = 0.005835

and

    logprior = log( prior / (1 − prior) ) = −5.138

and the final formula for ranking document d against a query is:

    log O(R | d) = −5.138 + Σ_j ( Z_tj + 5.138 )

where the sum is over the terms tj common to the query and the document.
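The pieces above can be put together numerically. In this sketch the prior is computed from the counts given in the text; the coefficient values and clue values, however, are hypothetical placeholders, not the paper's fitted equation.

```python
import math

# Cranfield prior odds of relevance, from the counts given in the text:
# 1838 relevant query-document pairs out of 225 * 1400 total pairs.
prior = 1838 / (225 * 1400)
logprior = math.log(prior / (1 - prior))

def term_logodds(clues, coef):
    """Z_tj = c0 + sum_i c_i * log(clue_i) over the six logged clues."""
    return coef["const"] + sum(
        coef[name] * math.log(clues[name])
        for name in ("QAF", "QRF", "DAF", "DRF", "IDF", "RFAD")
    )

def rank_score(per_term_z, logprior):
    """log O(R|d) = logprior + sum_j (Z_tj - logprior)."""
    return logprior + sum(z - logprior for z in per_term_z)

# Hypothetical coefficients and clue values for one matching term.
coef = {"const": -0.2, "QAF": -0.2, "QRF": 0.2, "DAF": 0.17,
        "DRF": 0.58, "IDF": 0.75, "RFAD": -1.6}
clues = {"QAF": 1, "QRF": 0.1, "DAF": 3, "DRF": 0.01,
         "IDF": 200.0, "RFAD": 1e-4}
score = rank_score([term_logodds(clues, coef)], logprior)
```

Note that a document matching no query terms simply receives the prior log-odds, and each additional matching term shifts the score by that term's deviation from the prior.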

5. Performance evaluation

A key part of this paper is the evaluation of retrieval performance, in which we compare two major retrieval methods: our probabilistic method, logistic inference, which ranks the documents of a collection according to their probability of relevance to the query, and the vector space model, which ranks documents according to the cosine of the angle between the query vector and the document vector. Performance is usually measured in terms of recall and precision. Recall is the fraction of the relevant documents in the collection which have been retrieved at a certain point in the retrieval process, while precision is the fraction of the retrieved documents which are relevant. These measures can be computed for a single query, but retrieval tests are usually averaged over all queries in a test collection. It is also our intention to subject the performance differences between the two methods to a statistical significance test, to determine whether the differences are significant or could have occurred by chance.

The reason for statistical significance testing is that one method might have obtained better results than another by chance: the performance of two methods on one particular set of queries might show one method to be better, and yet, if an entirely different set of queries were chosen at random, the results might be reversed. Thus, if we are attempting to compare the performance of two retrieval methods (such as logistic inference versus the tfidf/cosine vector space method) for the next randomly chosen query posed to the system, we need a test which would establish whether the performance differences obtained over a randomly chosen set of queries are statistically significant. Such a test, well-known in the statistical literature, would be a T-test for testing whether these differences in ranking, or in precision, are statistically significant. Our approach, therefore, will be:

● For each query, take the ranking of relevant documents produced by each method and compute an average precision value over all ranks (all levels of recall). This gives a unique number for each query for each method.
● Compute the difference in average precision between the two methods for each query; hence, in applying the T-test, the sample size will equal the number of queries in the test collection.
● Apply a T-test to these differences under the null hypothesis that the methods perform identically: the mean difference should be zero, and the sample mean and standard deviation of the differences should not be significantly different from zero.

[18] provides a summary of possible statistical tests which might be used to evaluate retrieval experiments.
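The three steps above amount to a paired T-test on per-query average precision. A minimal sketch follows; the per-query precision figures are invented, and only the final check uses the sample mean and standard error reported for the Cranfield test below.

```python
import math
from statistics import mean, stdev

def paired_t(avg_prec_a, avg_prec_b):
    """Paired T-test on per-query average precision differences.
    Returns (sample mean, standard error, t statistic, degrees of freedom)."""
    diffs = [b - a for a, b in zip(avg_prec_a, avg_prec_b)]
    n = len(diffs)
    m = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return m, se, m / se, n - 1

# Invented per-query average precision for two methods over 4 queries.
logistic = [0.5, 0.6, 0.7, 0.8]
vector = [0.4, 0.5, 0.65, 0.7]
m, se, t, df = paired_t(logistic, vector)

# The Cranfield test reported below has sample mean -0.059885 and
# SE 0.0072063, which reproduces the published t statistic of -8.31.
t_cranfield = -0.059885 / 0.0072063
```

The t statistic is simply the sample mean difference divided by its standard error, with n − 1 degrees of freedom.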

6. Logistic model performance for the Cranfield collection

For the logistic inference model we obtain the following recall-precision performance results on the Cranfield collection:

Logistic Inference versus Tfidf/cosine Vector Space Performance
Cranfield Collection: Averages over 225 Queries

  Recall      Logistic Precision   Vector Space Precision
   0.00            0.8330                 0.7787
   0.10            0.8116                 0.7440
   0.20            0.7129                 0.6434
   0.30            0.6021                 0.5301
   0.40            0.5161                 0.4380
   0.50            0.4503                 0.3814
   0.60            0.3698                 0.2994
   0.70            0.2859                 0.2267
   0.80            0.2280                 0.1882
   0.90            0.1640                 0.1379
   1.00            0.1464                 0.1251

  11-pt Avg:       0.4655                 0.4084
  % Change:                               -12.3

The table displays the recall-precision performance results for the two methods, averaged over the 225 queries. If we compute an average precision over all points of recall for each of the 225 queries for each method, and apply the hypothesis test described above to the differences, we find the difference to be significant at the one tenth of one percent level:

Cranfield Hypothesis Test: 2-tail T on Precision Differences between Logistic and Tfidf/cosine

   N     null mean   sample mean   sample SE     df    test stat   P value
  225     0.0000     -0.059885     0.0072063    224      -8.31      0.0000

Thus we can see that the logistic inference model performs significantly better than the tfidf/cosine vector space model on the Cranfield collection.

7. A criticism

Naturally, in testing the logistic regression approach on the Cranfield collection, from which the very coefficients of the regression equation were fitted, we have a chicken and egg process, whereby the relevance judgments used to train the model are included within the test from which the results are obtained. In past tests of probabilistic models which train on relevance judgments, half of the queries have been used as a training set, and the fitted model has then been applied to the other half of the queries [19]. We feel a fundamental difficulty remains with that approach as well, since the entire collection is still used for both training and testing. Our approach, instead, will be to apply the regression equation fitted for the Cranfield sample to entirely different test collections. Since a large number of queries and documents is necessary to obtain a decent sample, the next step will be to apply the Cranfield-fitted equation to rank documents in our other test collections, CACM and CISI.

8. Logistic inference model for the CACM collection

If we apply the six-clue logistic regression equation fitted on the Cranfield sample to the CACM collection, we obtain the following result:

Logistic versus Tfidf/cosine Vector Space Performance
CACM Collection: Averages over 52 Queries

  11-pt Avg Precision:    Logistic 0.3322    Vector Space 0.3148
  % Change: -5.2

The logistic model performs 5.2 percent better in average precision than the tfidf/cosine vector space model. However, if we apply the same hypothesis test:

CACM Hypothesis Test: 2-tail T on Precision Differences between Logistic and Tfidf/cosine

   N     null mean   sample mean   sample SE     df    test stat   P value
   52     0.0000     -0.020528     0.015866     51.0     -1.29      .2016

we find that this difference in performance is not statistically significant; we cannot reject the null hypothesis that, on this sample of queries, the two methods perform equally well.

9. Cranfield-fitted logistic equation applied to the CISI collection

Applying the Cranfield-fitted logistic regression equation to the CISI collection, we find:

Logistic versus Tfidf/cosine Vector Space Performance
CISI Collection: Averages over 76 Queries

  11-pt Avg Precision:    Logistic 0.2088    Vector Space 0.2137
  % Change: 4.4

While these measures seem to indicate better vector space performance on the average precision and recall measures, if we apply the hypothesis test:

CISI Hypothesis Test: 2-tail T on Precision Differences between Logistic and Tfidf/cosine

   N     null mean   sample mean   sample SE     df    test stat   P value
   76     0.0000     0.0056973     0.0087791    75.0     0.65       .5183

we find that there is no clearly significant difference between the vector space and logistic methods on the CISI collection; neither direction of difference is statistically significant.

While the direct application of the Cranfield-derived coefficients can produce comparable results on these collections, the blind application of coefficients derived from one collection to the statistical distribution of clues of another collection leaves something to be desired. In the next section, we apply some statistical thought in order to adapt the coefficients derived from one collection to the statistical clues of the other collections.

10. Standardized variables

The direct application to one collection of logistic regression coefficients derived from another collection makes the following assumption: that the statistical clues being used as indicators of probability of relevance come from the same underlying probability distribution in the one collection as in the other. It is, of course, unlikely that the actual distribution of any clue is identical from collection to collection, and the true underlying probability density function is not known. We can, however, take each clue for the original collection on which the regression was fit and transform it into a standardized distribution, one which has mean zero (μ = 0) and standard deviation one (σ = 1), by subtracting the mean of the clue and dividing by its standard deviation. If we derive standardized coefficients from the original collection, we can then adapt them to yet another collection by using that collection's own clue means and standard deviations. This makes the fitting sensitive to the statistical properties of the collection to which it is applied, and also gives us insight into the distributions of the clues themselves.

The underlying assumption of such a standardization is the following: the probability distribution of each clue remains identical from collection to collection, changing only in its mean and standard deviation; i.e., only the mean and standard deviation of the particular collection-dependent distribution will change, while the underlying standardized probability distribution remains constant over all collections. One might well question this assumption on the following grounds:

Suppose the mean μi and standard deviation σi for clue i varies meaningfully from collection to collection, and that these differences have predictive power with respect to relevance. Then the application of standardization of distributions will remove this predictive power.



In the face of such thinking, the best approach is to experimentally test the assumption. For purposes of testing, we can directly compute the standardized coefficients from the collection from which we derive our logistic regression formulae (in our case the Cranfield collection), and from them obtain the adjusted coefficients for any other collection to which we wish to apply the regression. If we accept, as a first assumption, that the standardized distribution of each clue remains constant from collection to collection, the adjusted coefficients can be derived as follows. Suppose we have standardized coefficients c0, c1, and c2 for the equation (written here for two clues X and Y):

    log O(R | x1, y1) = c0 + c1 ( (x1 − μx1) / σx1 ) + c2 ( (y1 − μy1) / σy1 )

fit for the source collection. Given that the standardized distribution remains constant from collection to collection, we would have the same equation for the second (target) collection:

    log O(R | x2, y2) = c0 + c1 ( (x2 − μx2) / σx2 ) + c2 ( (y2 − μy2) / σy2 )

If we multiply out the fractions and collect terms, we obtain:

    log O(R | x2, y2) = c0 − c1 (μx2 / σx2) − c2 (μy2 / σy2) + (c1 / σx2) x2 + (c2 / σy2) y2

that is, the same equation expressed in the raw clues of the target collection,

    log O(R | x2, y2) = c0′ + c1′ x2 + c2′ y2

and hence

    c0′ = c0 − c1 (μx2 / σx2) − c2 (μy2 / σy2),   c1′ = c1 / σx2,   and   c2′ = c2 / σy2.

Thus, to find the new coefficients for a target collection, we need only obtain a sample from that collection sufficient to determine the mean, μv, and standard deviation, σv, for each clue v, compute the new coefficients, and apply them directly to retrieval from the target collection.
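The adjustment above is mechanical once the target collection's clue means and standard deviations are known. A small sketch follows, with generic clue names and invented numbers chosen to keep the arithmetic easy to follow:

```python
def adjust_coefficients(c0, coefs, means, stds):
    """Turn standardized coefficients into raw-clue coefficients for a
    target collection: c_i' = c_i / sigma_i, and
    c_0' = c_0 - sum_i c_i * mu_i / sigma_i."""
    new_c0 = c0 - sum(c * m / s for c, m, s in zip(coefs, means, stds))
    new_coefs = [c / s for c, s in zip(coefs, stds)]
    return new_c0, new_coefs

# Invented standardized coefficients for two clues X and Y, with
# invented target-collection means and standard deviations.
c0_prime, (c1_prime, c2_prime) = adjust_coefficients(
    1.0, [2.0, 4.0], [3.0, 1.0], [2.0, 4.0])
```

In this example c0′ = 1 − (2·3/2 + 4·1/4) = −3, and both slope coefficients become 1.0 after division by their standard deviations.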

11. Applying standardized logistic regression to the CACM and CISI collections

This standardization methodology was first applied by Cooper, Gey and Chen [20] to the queries of the NIST text collections of the first Text REtrieval Conference (TREC 1) [21]; this paper provides a more complete analysis of the method. In order to apply the standardized logistic model to the CACM and CISI collections, we must obtain, for each new collection, the mean and standard deviation of each clue. If we obtain a small sample (of 1 in 125) of query-document-term
triples and all associated clue values, we can compute the following means and standard deviations for CACM and CISI.

  Coll.   Stat.      log qaf   log daf   log qrf   log drf   log idf   log rfad
  CISI    mean       0.3031    0.4093   -3.5450   -3.7922    1.9045   -5.5303
  CISI    std dev    0.5153    0.5795    0.9110    0.6944    0.8333    1.0675
  CACM    mean       0.1763    0.3900   -2.7644   -3.4976    2.0968   -5.2434
  CACM    std dev    0.3347    0.5855    0.5612    0.7999    1.1378    1.2146

We then use these means and standard deviations, together with the Cranfield coefficients, to obtain the following new logistic (logodds) coefficients for the CACM and CISI collections.

  Coll.   const.    log qaf     log daf   log qrf    log drf   log idf   log rfad
  CRAN    -4.125    -0.03229    0.1059    0.06910    0.4193    1.4515    0.7734
  CACM    -1.341    -0.08412    0.1520    0.11993    0.5391    1.0933    0.5507
  CISI    -0.934    -0.061660   0.1622    0.0663     0.5572    1.5325    0.6680

If we then apply the fit in order to obtain rankings for all queries for both the CACM and CISI collections, we obtain first for CACM the following recall-precision results:

Standardized Logistic vs. Tfidf/cosine Vector Space Performance
CACM Collection: Averages over 52 Queries

  Recall      Standardized Logistic Precision   Vector Space Precision
   0.00                 0.7452                        0.6985
   0.10                 0.6040                        0.5683
   0.20                 0.5338                        0.4830
   0.30                 0.4392                        0.3887
   0.40                 0.3683                        0.3210
   0.50                 0.3128                        0.2761
   0.60                 0.2682                        0.2306
   0.70                 0.1846                        0.1770
   0.80                 0.1388                        0.1441
   0.90                 0.0961                        0.1018
   1.00                 0.0694                        0.0735

  11-pt Avg:            0.3419                        0.3148
  % Change:                                           -7.9

Note that the average precision has increased approximately another 3 percent over the tfidf/cosine vector space method of ranking documents. The recall-precision table makes it quite clear that, except for the tail of recall-precision, the standardized extended logistic model performs substantially better than the tfidf/cosine vector space model for the CACM collection. Indeed, if we once again go through our process of computing average precision from all points of recall for each of the 52 queries for CACM, and then do a statistical test of the difference between standardized logistic and tfidf/cosine vector space average precision, we find this improvement is now statistically significant at the 2 percent level.

CACM Hypothesis Test: 2-tail T on Precision Differences between Standardized Logistic and Tfidf/cosine

   N     null mean   sample mean   sample SE     df    test stat   P value
   52     0.0000     -0.030180     0.012335     51.0     -2.45      .0179

That is to say, in less than two percent of all samples of queries applied to the CACM collection would we expect the vector space model to perform better than the standardized logistic model. In this way, we can reject the null hypothesis that there is no difference in performance on the CACM collection, and accept the clear indication that standardized logistic regression performs better than the vector space model for the CACM collection.

On the other hand, if we apply this correction factor and standardization to the CISI collection, we find that there is a slight worsening in the performance results.

Standardized Logistic vs. Tfidf/cosine Vector Space Performance
CISI Collection: Averages over 76 Queries

            Standardized Logistic Precision    Vector Space Precision
11-pt Avg:  0.2042                             0.2137
% Change:                                      4.7

When comparing this result with the non-standardized extended logistic model for CISI, we see that the precision changes occur in the third decimal place. We can conclude that the difference between the two is certainly not statistically significant, nor is that between CISI standardized and tfidf/cosine. Among the reasons for this failure to achieve performance improvement in the CISI collection is the different prior probability of relevance for the collections. [22] gives more detail on the sensitivity of the model to the proper estimation of the prior probability of relevance.

12. Conclusions

In this research we have investigated a new probabilistic text and document search method based upon logistic regression. This logistic inference method estimates the probability of relevance of documents with respect to a query which represents the user's information need. Documents are then ranked in descending order of their estimated probability of relevance to the query.

1. The logistic inference method has been subjected to detailed performance tests comparing it (in terms of recall and precision averages) to the traditional tfidf/cosine vector space method, using the same retrieval and evaluation software (the Cornell SMART system) for both methods. In this way the test results remain free of bias which might be introduced from different software implementations of the evaluation methods. In terms of recall and precision averages, the logistic inference method outperforms the tfidf/cosine vector space method on the Cranfield and CACM test collections. The methods seem to perform equally well on the CISI test collection (for an appropriately estimated prior probability).

2. Statistical tests have been applied to ascertain whether these performance differences between the two methods are statistically significant. The performance improvement of the logistic inference method over the tfidf/cosine vector space method for the Cranfield and CACM collections is statistically significant at the five percent level. Performance differences for the CISI collection are not statistically significant, for the most plausible estimate of prior probability.

3. The use of standardized variables (statistical clues standardized to mean μ = 0 and standard deviation σ = 1) seems to enable the training and fitting of logistic regression coefficients to take place on the queries and documents of one collection, and the coefficients to be applied directly to the queries and documents of other collections.
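The transfer described in point 3 can be illustrated with a small sketch. The coefficients b0 and b1 and the clue values below are hypothetical, not the paper's fitted equations; the point is only that standardized clues put any collection on the same (mean 0, sd 1) scale:

```python
import math

def standardize(values):
    """Standardize a statistical clue to mean 0, standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def relevance_probability(logit):
    """Logistic transform: log-odds of relevance -> probability."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients fitted on a training collection; because the
# clue is standardized, the same coefficients can be applied to the
# standardized clue values of a different collection.
b0, b1 = -2.0, 1.5
clue = standardize([3.0, 1.0, 4.0, 1.0, 5.0])
probabilities = [relevance_probability(b0 + b1 * z) for z in clue]
```

Documents would then be ranked in descending order of these estimated probabilities, exactly as the conclusions describe.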

13. Acknowledgments The work described was part of the author’s dissertation research at the School of Library and Information Studies at the University of California, Berkeley. Many of the ideas contained herein were jointly formulated with my dissertation advisor, Professor William S. Cooper, whose clarity of thinking and relentless persistence made this research both possible and doable. My outside committee member, Professor of Biostatistics Steve Selvin, provided much helpful advice on statistical techniques.

References

1. Salton G et al. The SMART retrieval system: Experiments in automatic document processing. Prentice-Hall, Englewood Cliffs, NJ, 1971
2. Salton G. Automatic text processing: the transformation, analysis and retrieval of information by computer. Addison-Wesley, Reading, MA, 1989
3. Salton G, McGill M. Introduction to modern information retrieval. McGraw-Hill, New York, 1983
4. Sparck-Jones K. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 1972; 28:11-21
5. Salton G, Buckley C. Term weighting approaches in automatic text retrieval. Information Processing and Management 1988; 24:513-523
6. Robertson S. The probability ranking principle in IR. Journal of Documentation 1977; 33:294-304
7. Robertson S, Sparck-Jones K. Relevance weighting of search terms. Journal of the ASIS 1976; 27:129-145
8. Cooper W. Inconsistencies and misnomers in probabilistic IR. In: Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Chicago, IL, Oct 13-16, 1991, pp 57-61
9. Fuhr N, Huther H. Optimum probability estimation from empirical distributions. Information Processing and Management 1989; 25:493-507
10. Hosmer D, Lemeshow S. Applied logistic regression. John Wiley & Sons, New York, 1989
11. Fox E. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. PhD dissertation, Computer Science, Cornell University, 1983
12. Fuhr N. Optimal polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems 1989; 7:183-204
13. Fuhr N, Buckley C. A probabilistic learning approach for document indexing. ACM Transactions on Information Systems 1991; 9:223-248
14. Haines D, Croft B. Relevance feedback and inference networks. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, PA, June 27-July 1, 1993, pp 2-12
15. Turtle H. Inference networks for document retrieval. PhD dissertation, University of Massachusetts, COINS Technical Report 90-92, February 1991
16. Fung R, Crawford S, Appelbaum L, Tong R. An architecture for probabilistic concept-based information retrieval. In: Proceedings of the 13th International Conference on Research and Development in Information Retrieval, Brussels, Belgium, September 5-7, 1990, pp 455-467
17. Swanson D. Information retrieval as a trial-and-error process. Library Quarterly 1977; 47:128-148
18. Hull D. Using statistical testing in the evaluation of retrieval experiments. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, PA, June 27-July 1, 1993, pp 329-338
19. Yu C, Buckley C, Lam H, Salton G. A generalized term dependence model in information retrieval. Information Technology: Research and Development 1983; 2:129-154
20. Cooper W, Gey F, Chen A. Information retrieval from the TIPSTER collection: an application of staged logistic regression. In: Proceedings of the First NIST Text Retrieval Conference, National Institute of Standards and Technology, Washington, DC, November 4-6, 1992. NIST Special Publication 500-207, March 1993, pp 73-88
21. Harman D. Overview of the first TREC conference. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, PA, June 27-July 1, 1993, pp 36-47
22. Gey F. Probabilistic dependence and logistic inference in information retrieval. PhD dissertation, University of California, Berkeley, 1993
