Inferring Probability of Relevance Using the Method of Logistic Regression

Fredric C. Gey
UC Data Archive and Technical Assistance
University of California, Berkeley, USA
e-mail: [email protected]

Abstract

This research evaluates a model of probabilistic information retrieval which uses the technique of logistic regression to infer the probability of relevance of a document from statistical clues present in the texts of documents and queries. The coefficients of the model are derived by logistic regression from a training collection of documents, queries, and relevance judgments. Since the model transforms the distribution of each statistical clue into a standardized distribution (one with mean μ = 0 and standard deviation σ = 1), it allows one to apply the probabilistic equations obtained from the training collection to other document collections with little loss of predictive power; we call this method logistic inference. The model is applied to three test collections and compared directly to the well-known term-frequency/inverse-document-frequency (tfidf) weighting with the cosine similarity measure of the vector space model. In the comparison, the logistic inference method performs significantly better (in two collections) or equally well (in the third collection) as the tfidf/cosine vector space model. The performances of the two models were subjected to statistical tests to see if the differences in performance were statistically significant or could have occurred by chance.
1. Introduction

We can consider the central task of Information Retrieval to be that of providing intellectual access to the contents of a collection of texts, satisfying a user's information need from the meanings of the texts contained in the collection. At the heart of this task is the indexing process, which associates a concise representation with each text so that a system can choose the appropriate documents to return; the indexing problem is both theoretical and practical. The primary focus of this paper is to test a probabilistic model of text retrieval against the well-known vector space model.

In the vector space model of information retrieval, developed by Gerard Salton and associates at Cornell University [1] [2] [3], both queries and documents are represented as vectors of nontrivial terms in an "m"-dimensional vector space, where the vocabulary used to index the documents in the collection is of size "m". In this model, term weights may be mechanically derived from frequency attributes of terms. For example, the "absolute frequency" of a term is the count of the number of occurrences of the attribute term in either the query (QAF) or the document (DAF). For terms used to index documents, Karen Sparck-Jones identified the "inverse document frequency" (IDF) weight [4], which is usually computed as the logged ratio log(N / n_j), where N is the total number of documents in the collection and n_j is the number of documents in which term t_j occurs. The vector space model can be interpreted geometrically: as its measure of similarity it computes the cosine of the angle between the query vector and the document vector, which indicates the degree to which a document deviates in direction from the query. Salton and Buckley compared the retrieval performance of a number of term weighting schemes used in the SMART retrieval system experiments, identifying the best and worst performing weightings [5]. Such weighting rules are derived from and accounted for by retrieval experiments which show that a system using one weighting performs better than a system using another; only a probabilistic approach attempts to account for them by formal rules.

Probabilistic models of information retrieval attempt to place the computation of a document's retrieval status value on a sound theoretical footing. If retrieval is interpreted as the probability of a document satisfying the user's information need, then documents can be returned to the user ranked in order of their probability of relevance, and the probability ranking principle [6] states that such a ranking yields the optimal retrieval performance. This approach was first developed theoretically in the so-called "Binary Independence" model [7], which Cooper generalized to the "Linked Dependence" model [8] to predict relevance even in the face of statistical dependence between terms. Unfortunately, probabilistic retrieval experiments have encountered problems of insufficient statistical evidence, even when half of the collection is used as a training set and relevance feedback from documents returned to the user is employed; methods for improved probability estimation have been only partially successful [9].
2. The logistic inference model

If no assumptions are made about the statistical distribution of terms within the documents and queries of a collection, then the relationship between relevance and the statistical properties of texts must be inferred from the collection itself. Suppose that with each term tk which is present in both query qi and document dj we can associate a set of attribute values (v1, ..., vn) (such as the count of the number of occurrences of the term in the query, the count of occurrences of the term in the document, or the total number of documents in the collection indexed by the term), and that there exists a statistical relationship between relevance and these attributes, whose values vary from term to term. We do not assume that relevance is a deterministic function R = f(qi, dj); we merely assume that a finite set of query, document, and term attributes carries information about the probability of relevance. The logistic inference model says that we can use a random sample of query-document-term triples from a collection for which binary relevance judgments have been made to obtain, for each term, the logarithm of the odds of relevance by the formula:

    log O(R | qi, dj, tk) = c0 + c1 v1 + ... + cn vn

and that the logarithm of the odds of relevance of document dj to the i'th query qi = <t1, ..., tq> is the sum over all terms:

    log O(R | qi, dj) = log O(R) + Σ_{k=1}^{q} [ log O(R | qi, dj, tk) - log O(R) ]

where O(R), known as the prior odds of relevance, is the odds that a document chosen at random from the collection will be relevant to query qi. The coefficients c0, ..., cn are derived by using the method of logistic regression [10], which fits an equation to predict a dichotomous dependent variable (here relevance) as a function of independent variables (possibly continuous) which show statistical variation.

Once the log odds of relevance is computed from this simple linear formula, the inverse logistic transformation may be applied to directly obtain the probability of relevance of a document to a query:

    P(R | qi, dj) = 1 / (1 + e^(-log O(R | qi, dj)))

Once the coefficients of the equation for a particular collection have been derived from a random sample of query-document-term-relevance quadruples, these coefficients may be used within a document retrieval system to predict the odds of relevance for other query-document pairs (past or future).

Regression models of multiple clues were first used to approximate probability of relevance by Ed Fox [11], who applied ordinary linear regression to combinations of clues such as term frequency, authorship, and co-citation. More recently Fuhr and Buckley [12] [13] have introduced polynomial regression, with some success, to approximate relevance. There are two major problems with the application of ordinary regression to this approximation:

● the outcome variable (relevance) is dichotomous rather than continuous, and
● the error distribution is binomial rather than normal, as is usually assumed by ordinary regression.

An alternative, well-known approach to probabilistic retrieval is to use Bayesian inference networks, as developed by Turtle [14] [15] and others [16].
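The per-term linear formula, the combination over terms, and the inverse logistic transformation can be sketched as follows (a minimal illustration; the coefficient and clue values used below are hypothetical placeholders, not the fitted values of this paper):

```python
import math

def term_log_odds(coeffs, clues):
    # log O(R | q,d,t) = c0 + c1*v1 + ... + cn*vn, linear in the clue values
    c0, rest = coeffs[0], coeffs[1:]
    return c0 + sum(c * v for c, v in zip(rest, clues))

def doc_log_odds(per_term_log_odds, log_prior_odds):
    # log O(R | q,d) = log O(R) + sum_k [ log O(R | q,d,t_k) - log O(R) ]
    return log_prior_odds + sum(z - log_prior_odds for z in per_term_log_odds)

def prob_of_relevance(log_odds):
    # inverse logistic transformation: P = 1 / (1 + e^(-log O))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Documents would then be ranked by prob_of_relevance; since the transformation is monotonic, ranking by doc_log_odds gives the same ordering.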
3. Sampling for logistic regression

One of the computational problems in using logistic regression is the size of the computation necessary to derive the regression coefficients. Even supposing that only half the document collection is used as a training set, the number of query-document-term triples which are to be used for the computation becomes astronomical, so some sampling process is necessary. Since the relevant documents for a particular query are typically a very small fraction of the collection, non-relevant documents are far more numerous than relevant ones, and a simple random sample would contain too small a proportion of relevant documents to afford a reasonable fit. It is therefore incumbent upon us to construct a sample with a significantly higher representation of relevant documents: we take into the sample all of the relevant documents and a significantly smaller proportion of the non-relevant documents, say every thirtieth one, adjusting for the differential by assigning each sampled non-relevant document a weighting factor of 30 in the regression computation. This weighting compensates for the non-relevant documents left out while keeping the total number of query-document-term triples within reason.

We have chosen the Cranfield test collection [17] to fit our logistic inference model. First, while the Cranfield collection has well-known deficiencies, it does seem to possess most, if not all, of the following desirable characteristics of an ideal test collection:

● the queries are in natural language form, so that query statistics can be collected as well as document statistics,
● the documents have both titles and abstracts present (abstracts are missing from some of the CACM collection),
● the queries are a random sample of possible information needs, and the number of queries is large enough to afford statistical significance,
● the documents are a sample from a larger population of documents, and
● the relevance judgments encompass all query-document pairs in the collection or, if this is not possible, are a carefully crafted sample of all possible query-document pairs whose size is known.

Second, while the Cranfield collection only encompasses 1,400 documents, it has 225 queries, more numerous than most of the other test collections. We will later apply the regression equation fitted to the Cranfield sample to the CACM collection (52 queries, 3204 documents) and the CISI collection (76 queries, 1460 documents).
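The weighted sampling scheme described above can be sketched as follows (the tuple layout of the judged pairs is a hypothetical simplification):

```python
def build_training_sample(judged_pairs, skip=30):
    """Down-sample non-relevant query-document pairs for regression training.

    judged_pairs: iterable of (query_id, doc_id, relevant) tuples.
    Returns (query_id, doc_id, relevant, weight) tuples: every relevant pair
    is kept with weight 1; only every `skip`-th non-relevant pair is kept,
    carrying weight `skip` to compensate for the ones left out.
    """
    sample = []
    nonrel_seen = 0
    for qid, did, relevant in judged_pairs:
        if relevant:
            sample.append((qid, did, 1, 1.0))
        else:
            nonrel_seen += 1
            if nonrel_seen % skip == 0:
                sample.append((qid, did, 0, float(skip)))
    return sample
```

The weights restore, in expectation, the original relevant/non-relevant proportions inside the regression computation.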
4. Logistic inference for six term frequency clues

The complete logistic inference model has the following form for six term frequency clues:

    Z_tj = log O(R | tj) = c0 + c1 log(QAF) + c2 log(QRF) + c3 log(DAF) + c4 log(DRF) + c5 log(IDF) + c6 log(RFAD)

The clues are defined as follows. Query absolute frequency (QAF) is the number of occurrences of term tj in the query; query relative frequency (QRF) is QAF divided by the total number of occurrences of all terms in the query. Document absolute frequency (DAF) and document relative frequency (DRF) are defined in the same way for the document. IDF is the inverse document frequency of the term, as previously defined. Relative frequency in all documents (RFAD) is the total number of occurrences of term tj in the collection divided by the total number of occurrences of all terms in the entire collection.

You will note that all variables in the formula are logged. One reason for using logarithms is to dampen the influence of raw frequency: it is unrealistic to assume that, say, 50 occurrences of a term (at least in a document) give five times more information about relevance than 10 occurrences. Another is that the distribution of term frequencies is skewed, and taking logs smooths a skewed distribution; indeed, a better maximum likelihood fit is attained with logged than with non-logged variables. Finally, after an anti-logarithm is applied, the elementary logged clues combine as products rather than sums, which has nice behavioral properties.

The sample fit of this extended logistic model, derived from a sample of 20,246 query-document-term triples taken from the Cranfield collection, is:

    Z_tj = log O(R | tj) = -0.2085 - 0.2036 log(QAF) + 0.75033 log(QRF) + 0.16789 log(DAF) + 0.57544 log(DRF) - 1.5967 log(IDF) + 0.19143 log(RFAD)

What remains is to compute the prior odds of relevance for this collection. The prior probability of relevance is estimated by counting the number of query-document pairs for which a judgment of relevance has been assigned (for the Cranfield collection this is 1838 pairs) and then dividing by the total number of query-document pairs (for 225 queries and 1400 documents). Hence:

    prior = 1838 / (225 x 1400) = .005835

and

    logprior = log( prior / (1 - prior) ) = -5.138

and the final formula for ranking is:

    log O(R | qi, dj) = -5.138 + Σ_{j=1}^{q} (Z_tj + 5.138)
5. Performance evaluation

A key part of this paper is our intention to compare the retrieval performance of our probabilistic method, which ranks the documents in the collection according to probability of relevance, against the tfidf/cosine vector space model, which ranks documents according to the cosine of the angle between the query vector and the document vector. Retrieval performance is usually described in terms of two major measures: recall and precision. Recall is the fraction of the relevant documents in the collection which have been retrieved at a certain point in the retrieval process, while precision is the fraction of the retrieved documents which are relevant. These measures can be computed for a single query, but are usually averaged over all queries in the test collection. It is also our intention, however, to subject the retrieval performances to a statistical significance test to determine whether the difference between them is significant or could have occurred by chance.
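Recall, precision, and the per-query average precision over the ranks of relevant documents can be made concrete with a short sketch:

```python
def recall_precision_at(ranking, relevant, cutoff):
    # ranking: doc ids in retrieval order; relevant: set of relevant doc ids
    retrieved = ranking[:cutoff]
    hits = sum(1 for d in retrieved if d in relevant)
    return hits / len(relevant), hits / cutoff   # (recall, precision)

def average_precision(ranking, relevant):
    # mean of the precision values observed at the rank of each relevant document
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)
```

The average_precision value is the single per-query number used in the significance tests described next.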
reason
than
retrieval, bility
of
that
relevance
have
mance
of two
statistical
randomly
(such
will
chosen
that
query.
inference
on a particular
for
the next
and our method It
profile.
the results
is that
be reversed
at random,
performance
as logistic
show
could
is chosen
different
methods which
significance
the performance
a query
to the
an entirely
test
to test statistical
and yet
we assume
would
cally
one wishes
another
might
versus
obtained
ranks
be that
Thus,
if
we
next
documents set of
vector
the randomly
one method
posed.
according
we
may
be
model
of
to proba-
chosen
to compare
space)
chosen
In our
randomly
are attempting
tfidf/cosine for
query
query
need
queries
the perforto establish
set of queries
are statisti-
significant. Such
between
a test,
are statistically ●
well-known
performance
For
for
significant.
each
precision
query value
the
query,
Our
take over
in
each
literature,
approach,
the ranking all ranks
would
and we determine therefore,
of relevant
(all
levels
be a T-test whether
for testing documents
of recall).
This
which
these for
statistical
by each will
would
differences
significance
method
give
compare
the
in ranking will
be:
and compute
a unique
number
differences
or in precision
an average
for
each
query
for each method. ●
Compute
for
number ●
Apply
and
the difference
the two
methods
of queries a T-test
hence
their
in average
being
precision
compared.
Thus,
between
the two
in applying
numbers
the T-test,
found
with
the sample
size
each will
method
equal
the
in the test collection. to these mean
differences difference
under should
the null
hypothesis
that the methods
be
and
standard
zero
their
deviation
perform not
identically significantly
different. [18]
provides
a summary
of possible
statistical
tests which
might
be used to evaluate
retrieval
experiments.
a
226
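The three steps above amount to a paired T-test on per-query average precision; a sketch of the statistic computation (the p-value would come from a t table or a statistics library):

```python
import math

def paired_t_statistic(prec_a, prec_b):
    """T statistic for per-query average-precision differences (method A - B).

    Under the null hypothesis that the two retrieval methods perform
    identically, the mean difference is zero; the sample size equals
    the number of queries. Returns (t, degrees of freedom).
    """
    diffs = [a - b for a, b in zip(prec_a, prec_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    se = math.sqrt(var / n)                              # standard error of the mean
    return mean / se, n - 1
```

With real data, |t| is compared against the critical value for the chosen significance level and n - 1 degrees of freedom.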
6. Logistic model performance for the Cranfield collection

For the logistic inference model we obtain the following recall-precision performance results on the Cranfield collection:

Logistic Inference versus Tfidf/cosine Vector Space Performance
Cranfield Collection: Averages over 225 Queries

Recall      Logistic Precision    Vector Space Precision
0.00        0.8330                0.7787
0.10        0.8116                0.7440
0.20        0.7129                0.6434
0.30        0.6021                0.5301
0.40        0.5161                0.4380
0.50        0.4503                0.3814
0.60        0.3698                0.2994
0.70        0.2859                0.2267
0.80        0.2280                0.1882
0.90        0.1640                0.1379
1.00        0.1464                0.1251
11-pt Avg:  0.4655                0.4084
% Change:                         -12.3

The table displays the recall-precision results for the two methods averaged over each of the 225 queries. Thus we can see that the logistic inference model, fitted on the set of six term frequency clues, performs better than the tfidf/cosine vector space model at every level of recall. If we compute an average precision for each of the 225 queries and apply our hypothesis test to the differences, we find that the difference in average precision between the two methods (cosine and logistic) is statistically significant at the one tenth of one percent level:

Cranfield Hypothesis Test: 2-tail T on Precision Differences between Logistic and Tfidf/cosine

N      null mean   sample mean   sample SE    df     test stat   P value
225    0.0000      -0.059885     0.0072063    224    -8.31       .0000
for
Naturally,
tests
of probabilistic half
obtained
from
model
have
regression
where
the number
We
feel
the Cranfield
CACM
regression
equation
approach
space
model.
collection
to rank used
approach,
which
[19],
use that
of queries
as a training
entire
based
was used
the next
training
is necessary instead,
set to other
set of set for
documents
to the sample
and then
Our
The
used as the training
has also been
number
retrieval.
difficulty.
are used
of queries
a large
to document
will
collections
step
to predict in order
to
be to apply in our
test
collections.
fitted
for
the
Cranfield
sample
to the
CACM
collection,
we
result: Logistic
versus
Tfidf/cosine
Collection:
Vector
Averages
over
Space
Performance
52 Queries
Logistic Recall 1 l-pt
Vector
Precision
Avg:
seem that the extended for the CACM
collection,
Space
Precision
0.3322
0.3148
% Change:
it may
on the set of six term vector
the sample
coefficients
this
that
a fundamental
the test is applied
on half
and CISI
CACM
space model
.0000
the tfidf/cosine
within
logistic
whereby
models
for
Cranfield
While
P
-8.31
of logistic over
set is included
the resulting
approach
using
the CACM
logistic
inference
of the queries.
regression
inference the
to train
test
results
the relevant when
model
improvement
and egg process,
has been
logistic
the following
probabilistic
significant
the logistic from
a chicken
in particular
8. Logistic we
224
of the other
the coefficients
If
0.0072063
models
a decent
repertoire,
obtain
-0.059885
and terms
In past such
the results train
0.0000
use of the full
precision
are statistically
methods
df
sample
stat
a statistically
in average
that the results
Differences
and Tfifdf/cosine
SE
obtained
we have
t on Precision
mean
regression.
on relevance,
two-tail
sample
making
the difference
we find
level:
inference
of the Cranfield
documents
the logistic
Test:
compute
and logistic)
mean
we obtain
7. A criticism
(cosine
significance
Logistic
null
If we further
methods
of one percent
between N
Space
Precision
Precision
Recall
-5.2
logistic
model
if we apply
performs
5.2 percent
the same hypothesis
better
test:
than
the tfidf/cosine
vector
227 CACM
Hypothesis
between N
52 we find null
again
that
hypothesis
as well
or better
Applying
null
sample
sample
mean
SE
0.0000
-0.020528
0.015866
difference
in fact,
of performance
given
than the extended
9. Cranfield logistic equation applied to the CISI collection

If we apply the Cranfield-fitted logistic regression equation to the CISI collection, we obtain:

Logistic versus Tfidf/cosine Vector Space Performance
CISI Collection: Averages over 76 Queries

            Logistic Precision    Vector Space Precision
11-pt Avg:  0.2088                0.2137
% Change:                         4.4

While this seems to indicate better vector space performance, when we apply the statistical test on the average precision and recall measures we find that the difference between the vector space model and the extended logistic model for the CISI collection is not clearly significant:

CISI Hypothesis Test: 2-tail T on Precision Differences between Logistic and Tfidf/cosine

N     null mean   sample mean   sample SE    df     test stat   P value
76    0.0000      0.0056973     0.0087791    75.0   0.65        .5183

Thus, while the direct application of the Cranfield-derived coefficients to both the CACM and the CISI collections can produce performance results on these collections comparable to the vector space model, the blind application of coefficients derived from one collection to another leaves something to be desired. In the next section, we apply some clever thought to the statistical distribution of the clues in order to adapt the coefficients of one collection to the statistical distribution of the clues of the other collections.
10. Standardized variables

The direct application of a logistic regression equation derived from one collection to retrieval from another collection rests upon the following assumption: for a query-document-term triple, each clue used as an indicator of probability of relevance comes from the same underlying statistical distribution in the original collection from which the coefficients derive and in the collection to which they are being applied. Even though the actual distribution of a clue is not known, we can take its mean and standard deviation in a collection and transform the distribution of the clue into a standardized distribution, one which has mean zero (μ = 0) and standard deviation one (σ = 1). This transformation can be computed even when the functional form of the underlying probability density is not known, and it gives us insight into the distribution of each clue. If the coefficients of the logistic equation are fitted to the standardized clues of one collection, and the standardized distribution of each clue remains the same, then we can also apply those coefficients to the standardized clues of yet another collection. The underlying assumption of such an adaptation is the following: not only do the clues derive from an identical underlying statistical distribution and remain equally sensitive indicators of probability of relevance for each collection, but also that distribution changes from collection to collection only in its mean and standard deviation, i.e., only the mean and standard deviation of the particular collection-dependent distribution will change while the underlying standardized probability distribution remains constant over all collections.

One might well question this assumption on the following grounds:
Suppose the mean μi and standard deviation σi for clue i vary meaningfully from collection to collection, and that these differences have predictive power with respect to relevance. Then the application of standardization of distributions will remove this predictive power.
In the face of such thinking, the best approach is to experimentally test the assumption. For our purposes of testing, we can directly compute the standardized coefficients for the collection from which we derive the formulae (in our case the Cranfield collection) and the adjusted coefficients for any other collection to which we wish to apply our logistic regression equation, using the following technique.

Suppose we have the coefficients c0, c1, and c2 for the equation (here assuming two clues X and Y) fitted to the standardized clues of the source collection:

    log(O(R | x1, y1)) = c0 + c1 ( (x1 - μx1) / σx1 ) + c2 ( (y1 - μy1) / σy1 )

If we accept, as a first assumption, that the standardized distribution remains constant from collection to collection, we would apply the same coefficients to the standardized clues of the second (target) collection:

    log(O(R | x2, y2)) = c0 + c1 ( (x2 - μx2) / σx2 ) + c2 ( (y2 - μy2) / σy2 )

We obtain the same equation, but if we multiply out the terms we can look at it in terms of the raw clue values x2 and y2:

    log(O(R | x2, y2)) = c0 - c1 μx2/σx2 - c2 μy2/σy2 + (c1/σx2) x2 + (c2/σy2) y2

from which we can then derive the coefficients c0', c1', and c2' for the equation

    log(O(R | x2, y2)) = c0' + c1' x2 + c2' y2

and hence

    c0' = c0 - c1 μx2/σx2 - c2 μy2/σy2
    c1' = c1 / σx2
    c2' = c2 / σy2

Thus to find the new coefficients for the target collection, we need only determine the mean, μv, and standard deviation, σv, of each clue v from a sample of the target collection, compute the new coefficients, and apply them directly to the raw clue values for retrieval from the target collection.
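The adaptation is mechanical once the target means and standard deviations are known; the following sketch (with made-up numbers in the equivalence check) shows that the adjusted coefficients reproduce, on raw clue values, exactly the log odds given by the original coefficients on standardized values:

```python
def adapt_coefficients(c0, coeffs, means, stds):
    """Convert coefficients fitted on standardized clues into coefficients
    applicable to a target collection's raw clue values.

    coeffs, means, stds are parallel lists over the clues:
        c_i' = c_i / sigma_i
        c_0' = c_0 - sum_i (c_i * mu_i / sigma_i)
    """
    new_coeffs = [c / s for c, s in zip(coeffs, stds)]
    new_c0 = c0 - sum(c * m / s for c, m, s in zip(coeffs, means, stds))
    return new_c0, new_coeffs

def log_odds(c0, coeffs, clues):
    # linear logodds formula over clue values
    return c0 + sum(c * v for c, v in zip(coeffs, clues))
```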
This standardization methodology was first applied by Cooper, Gey and Chen [20] to the queries and text collections of the first NIST Text Retrieval Conference (TREC 1) [21]. This paper provides a more complete analysis of the method.

11. Applying the methodology to the CACM and CISI collections

In order to obtain the means and standard deviations of all clues for a new collection, we must obtain a new sample of query-document-term triples from that collection. If we take a small sample (of 1 in 125) of the query-document-term triples and all associated clue values, we can compute the following means and standard deviations for CACM and CISI.
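Estimating those per-clue means and standard deviations from a 1-in-125 sample might be sketched as follows (the record layout and sampling procedure here are illustrative assumptions, not the exact procedure of the experiments):

```python
import math
import random

def clue_statistics(triples, rate=125, seed=0):
    """Estimate mean and standard deviation of each logged clue from roughly
    a 1-in-`rate` random sample of query-document-term triples.

    triples: iterable of dicts mapping clue name -> raw (positive) clue value.
    Returns {clue name: (mean, std dev)} of the logged values.
    """
    rng = random.Random(seed)
    sums, sqs, n = {}, {}, 0
    for rec in triples:
        if rng.randrange(rate) != 0:          # keep roughly 1 in `rate`
            continue
        n += 1
        for name, value in rec.items():
            lv = math.log(value)              # logged clue value
            sums[name] = sums.get(name, 0.0) + lv
            sqs[name] = sqs.get(name, 0.0) + lv * lv
    stats = {}
    for name in sums:
        mean = sums[name] / n
        var = sqs[name] / n - mean * mean
        stats[name] = (mean, math.sqrt(max(var, 0.0)))
    return stats
```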
Coll.          log q.af   log d.af   log q.rf   log d.rf   log idf   log rfad
CISI mean      0.3031     0.4093     -3.5450    -3.7922    1.9045    -5.5303
CISI std dev   0.5153     0.5795     0.9110     0.6944     0.8333    1.0675
CACM mean      0.1763     0.3900     -2.7644    -3.4976    2.0968    -5.2434
CACM std dev   0.3347     0.5855     0.5612     0.7999     1.1378    1.2146
We then use these means and standard deviations together with the Cranfield coefficients to obtain the following new logistic (or logodds) coefficients for the CACM and CISI collections.
Coll.   const.    log q.af    log d.af   log q.rf   log d.rf   log idf   log rfad
CRAN    -4.125    -0.03229    0.1059     0.06910    0.4193     1.4515    0.7734
CACM    -1.341    -0.08412    0.1520     0.11993    0.5391     1.0933    0.5507
CISI    -0.934    -0.061660   0.1622     0.0663     0.5572     1.5325    0.6680
If we then apply the fit in order to obtain rankings for all queries for both the CACM and CISI collections, we obtain first for CACM the following recall-precision results:

Standardized Logistic vs. Tfidf/cosine Vector Space Performance
CACM Collection: Averages over 52 Queries

Recall      Standardized Logistic Precision    Vector Space Precision
0.00        0.7452                             0.6985
0.10        0.6040                             0.5683
0.20        0.5338                             0.4830
0.30        0.4392                             0.3887
0.40        0.3683                             0.3210
0.50        0.3128                             0.2761
0.60        0.2682                             0.2306
0.70        0.1846                             0.1770
0.80        0.1388                             0.1441
0.90        0.0961                             0.1018
1.00        0.0694                             0.0735
11-pt Avg:  0.3419                             0.3148
% Change:                                      -7.9
Note that the average precision has increased approximately another 3 percent over the tfidf/cosine vector space method of ranking documents. The recall-precision table makes it quite clear that, except for the tail of recall-precision, the standardized extended logistic model performs substantially better than the tfidf/cosine vector space model for the CACM collection. Indeed, if we once again go through our process of computing average precision from all points of recall for each of the 52 queries for CACM, and then do a statistical test of the difference between the standardized logistic method and tfidf/cosine vector space average precision, we find this improvement is now statistically significant at the 2 percent level.
CACM Hypothesis Test: 2-tail T on Precision Differences between Standardized Logistic and Tfidf/cosine

N     null mean   sample mean   sample SE    df     test stat   P value
52    0.0000      -0.030180     0.012335     51.0   -2.45       .0179

That is to say, we would only expect in less than two percent of all samples of queries applied to the CACM collection for the vector space model to perform better than the standardized logistic model. In this way, we can reject the null hypothesis that there is no difference in the performance in the CACM collection and accept the clear indication that standardized logistic regression performs better than the vector space model for the CACM collection.
On the other hand, if we apply this correction factor and standardization to the CISI collection, we find that there is a slight worsening in the performance results.
Standardized Logistic vs. Tfidf/cosine Vector Space Performance
CISI Collection: Averages over 76 Queries

            Standardized Logistic Precision    Vector Space Precision
11-pt Avg:  0.2042                             0.2137
% Change:                                      4.7
When comparing this result with the non-standardized extended logistic model for CISI we see that the precision changes occur in the third decimal place. We can conclude that the difference between the two is certainly not statistically significant, nor is that between CISI standardized and tfidf/cosine. Among the reasons for this failure to achieve performance improvement in the CISI collection is the different prior probability of relevance for the collections. [22] gives more detail on the sensitivity of the model to the proper estimation of prior probability of relevance.
12. Conclusions

In this research we have investigated a new probabilistic text and document search method based upon logistic regression. This logistic inference method estimates probability of relevance for documents with respect to a query which represents the user's information need. Documents are then ranked in descending order of their estimated probability of relevance to the query.

1. The logistic inference method has been subjected to detailed performance tests comparing it (in terms of recall and precision averages) to the traditional tfidf/cosine vector space method, using the same retrieval and evaluation software (the Cornell SMART system) for both methods. In this way the test results remain free of bias which might be introduced from different software implementations of evaluation methods. In terms of recall and precision averages, the logistic inference method outperforms the tfidf/cosine vector space method on the Cranfield and CACM test collections. The methods seem to perform equally well on the CISI test collection (for an appropriately estimated prior probability).

2. Statistical tests have been applied to ascertain whether these performance differences between the two methods are statistically significant. The performance improvement of the logistic inference method over the tfidf/cosine vector space method for the Cranfield and CACM collections is statistically significant at the five percent level. Performance differences for the CISI collection are not statistically significant, for the most plausible estimate of prior probability.

3. The use of standardized variables (statistical clues standardized to mean μ = 0 and standard deviation σ = 1) seems to enable the training of and fitting for logistic regression coefficients to take place on the queries and documents of one collection and to be applied directly to the queries and documents of other collections.
13. Acknowledgments The work described was part of the author’s dissertation research at the School of Library and Information Studies at the University of California, Berkeley. Many of the ideas contained herein were jointly formulated with my dissertation advisor, Professor William S. Cooper, whose clarity of thinking and relentless persistence made this research both possible and doable. My outside committee member, Professor of Biostatistics Steve Selvin, provided much helpful advice on statistical techniques.
References

1. Salton G et al. The SMART retrieval system: experiments in automatic document processing. Prentice-Hall, Englewood Cliffs, NJ, 1971

2. Salton G. Text processing: the transformation, analysis and retrieval of information by computer. Addison-Wesley, Reading, MA-Menlo Park, CA, 1989

3. Salton G, McGill M. Introduction to modern information retrieval. McGraw-Hill, New York, 1983

4. Sparck-Jones K. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 1972; 28:11-21

5. Salton G, Buckley C. Term weighting approaches in automatic text retrieval. Information Processing and Management 1988; 24:513-523

6. Robertson S. The probability ranking principle in IR. Journal of Documentation 1977; 33:294-304

7. Robertson S, Sparck-Jones K. Relevance weighting of search terms. Journal of the ASIS 1976; 27:129-145

8. Cooper W. Inconsistencies and misnomers in probabilistic IR. In: Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Chicago, Ill, Oct 13-16, 1991, pp 57-61

9. Fuhr N, Huther H. Optimum probability estimation from empirical distributions. Information Processing and Management 1989; 25:493-507

10. Hosmer D, Lemeshow S. Applied logistic regression. John Wiley & Sons, New York, 1989

11. Fox E. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. PhD dissertation, Computer Science, Cornell University, 1983

12. Fuhr N. Optimal polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems 1989; 7:183-204

13. Fuhr N, Buckley C. A probabilistic learning approach for document indexing. ACM Transactions on Information Systems 1991; 9:223-248

14. Haines D, Croft B. Relevance feedback and inference networks. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, Pa, June 27-July 1, 1993, pp 2-12

15. Turtle H. Inference networks for document retrieval. PhD dissertation, University of Massachusetts, COINS Technical Report 90-92, February 1991

16. Fung R, Crawford S, Appelbaum L, Tong R. An architecture for probabilistic concept-based information retrieval. In: Proceedings of the 13th International Conference on Research and Development in Information Retrieval, Brussels, Belgium, September 5-7, 1990, pp 455-467

17. Swanson D. Information retrieval as a trial-and-error process. Library Quarterly 1977; 47:128-148

18. Hull D. Using statistical testing in the evaluation of retrieval experiments. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, Pa, June 27-July 1, 1993, pp 329-338

19. Yu C, Buckley C, Lam H, Salton G. A generalized term dependence model in information retrieval. Information Technology: Research and Development 1983; 2:129-154

20. Cooper W, Gey F, Chen A. Information retrieval from the TIPSTER collection: an application of staged logistic regression. In: Proceedings of the First NIST Text Retrieval Conference, National Institute for Standards and Technology, Washington, DC, November 4-6, 1992, NIST Special Publication 500-207, March 1993, pp 73-88

21. Harman D. Overview of the first TREC conference. In: Proceedings of the 1993 SIGIR International Conference on Information Retrieval, Pittsburgh, Pa, June 27-July 1, 1993, pp 36-47

22. Gey F. Probabilistic dependence and logistic inference in information retrieval. PhD dissertation, University of California, Berkeley, 1993