The Use of Phrases and Structured Queries
W. Bruce Croft, Howard R. Turtle,* and David
Computer and Information Science
University of Massachusetts,
Abstract and
in information tems.
retrieval,
In previous
as a source This
like
in
this
for
show
that
results
improve
performance,
matically
extracted
nearly
in
queries
using
re-
by
this
way are
language selected
can auto-
query
per-
phrases.
the
came
tion
must
obey
some
among
its
showed
significant
use of phrases
as part
of a text
representation
language
has been
investigated
since
of information
retrieval
research.
indexing
collections
ative
low
baseline
days
example,
included
phrase-based
field
studies
riety
of experiments
tem.
Certainly,
phrases,
(1966).
if used
tained
(1968)
using
there
phrases
should
language
and,
representation.
with
phrases
ition.
These
small
improvements
results
in
have
the
been improve
experimental
however,
been
very
in some
we feel
that
support mixed,
collections
this
ranging
intu-
treated
from
swers
to decreases
* Current
address:
West
Publishing
Company,
St.
Paul,
Language
Stud-
In
t Current ies,
University
Permission granted direct
address: of
to
commercial of the
that
copying
Machinery.
Chicago,
copy
provided
title
Center
that
for
Chicago,
without
fee
the copies
advantage,
publication
all or part
of this
are not made
and its date
copyright
appear,
material
or distributed notice
and notice
of the Association
otherwise,
and/or specific permission. © 1991 ACM 0-89791-448
and
or to republish,
is
results
representations
from
words,
between
index
for
of this
paper
in using
phrases
with
systems,
operators
such
are
terms? not
searchers
as AND
(∧),
The
anand
algorithms. the
issues
model.
express
Boolean
to it be
obvious
clarify
a retrieval
using
similar
or should
retrieval is to
phrases, retrieval
For example, term,
single as these
on
to the
examined.
implications
phrases)
of work
as an index
goals
(e.g.
1 we call
is given
and 2A
for Computing
queries,
a fee
assume
linguistic
expressions OR
con-
word-level
(∨),
section
$1.50
retrieval
effectiveness
is measured
in
terms
of
re-
precision. test and
Communications
-1/91/0009/0032...
word
for
and the
requires
con-
a probabilis-
provided
of phrases
such
commercial
taining
neither
sufficiently
questions
structure
Illinois.
the ACM
is by permission To copy
Information
over
been
user-identified
and
amount
be treated
derived
of the
involved
Minnesota.
been
significant
One
with
single
relationship
as a relationship to
have
in
significant
the
terms
have
rel-
1990).
a phrase
index
ob-
the
has not
should
quality
results
Das,
that
model
the specificity the
and
fig-
were
Improvements
algorithm
from
phrases.
as CACM2
might
that
with
syntactic
results.
found
relaresults
his improvement
experiments
different
Despite
sys-
feeling
we
collections
of
addi-
syntactic
some
such
Fagan’s
in
Fagan’s
that
co-
some
but
words. using
baselines
In
both
algorithm,
(Croft
a va-
SMART
the
using
significantly
Cran-
described
consequently,
The
do not,
in the
also
has always
correctly,
of the indexing of the text
Salton
indexing
tic
for
word
smaller.
phrases
early
Cleverdon,
single
and
by
phrase,
in
out
with
defined
in a document.
on the
increases
also be pointed
phrase
is
the proxim-
characterized
constraint
ures obtained
siderably
or
the
be
none
to quite
and/or
of components
but
the
of occurrences words
com-
phrases,
“syntactic” in
phrase
component
phrases,
best
and
as a statistical
tionships
the The
number
most
using
of factors
statistical
may
criteria
statistical
Introduction
the
phrase
It should
1
A
on
of the
indexing
“statistical”
occurrences
between syntactic
is one
a number
of its component
A
that
both
varied
constraints
ity
in
used
occurrences
phrases model.
(1987)
of automatic
process.
to build
retrieval
a natural
he and
where
phrases
as manually
that
phrases formation
effectiveness,
thesis
studies
are used
phrases
that
model,
on phrases,
retrieval
in
used
01003
in othersl, recent
prehensive
sys-
been
retrieval
a probabilistic
from
as well
have
an approach
and
history
in commercial queries
language for
a long
of research
we describe
queries
have
a statistical
majority
in natural
structured
form
Boolean
improvement
paper,
identified Our
the
little
queries
particularly
work,
of phrases
work,
sulted In
Boolean
D. Lewis†
MA
effectiveness
phrases
Retrieval
Department
Amherst,
Fagan’s Both
in Information
4.1.
collection lists
consists of
the
relevant
of
the
ACM
of
a
set
documents (CACM)
of
documents, for
collection
each
a query.
is described
set
of The in
proximity, level for
sentence-level The
proximity. example,
tried),
may
or by
formation query
is used
tify
model
can
phrases
operator
how
incorporating
term
phrases
phrase,
were
used
dl
in the
or other
d2
paper,
(Croft,
1986),
as specifying
identified struct
in a natural a structured
bilistic
model
1991).
goal,
rl
rz
based
representation
In the following
section,
start
by describing of our
phrases, been
the inference
emphasizing in retrieval
these
models
ables
the similarities
clearly
in
seen.
Boolean
In section building
structured used
presents
the
work.
network
reviews
work
content
of
for
an overview and
phrases
in
of our
that
The
uses
in sta-
results
Finally,
in
approach
describe this
and
section
the
paper.
the importance
to
are
and
other
the
use
are query
information
difference
of
document nodes,
and
need.
).
need,
paper,
future
document
and
collections.
values
true
and
of the
into
of
forms
emphasizes to
calculate
of the of the
such
under
model.
For this
structured
queries
in the inference model
can
of the
net model
be shown
model
docu-
informa-
as a thesaurus
this
are that
of the
features
net model
it
evidence
knowledge
account
advantages
These
that
representations
interpretation
different
is
representations
domain
a natural
diagram.
sources
different
and
the inference
models
Different
the major
have
possible
as representations
between
multiple
can all be taken
of
with
regarded
probabilistic
content,
tion 4
a discussion
of large
d~’s are
qi’s
need.
major
ment
specific
Section
5, we indicate
Network:
nodes,
user’s
Queries
P(I|Document
queries,
and discuss
the
information
to be
as proximity
Inference
of a document,
false.
en-
them
1: Basic
v+’s are concept
on have
each
among
Figure nodes,
1 represents
is the
phrases
of an inference
such
We
research
Instantiating
differences
experimental
results.
directions
through
models.
3, we give
techniques
ways
operators
renet-
which
different
subsection
and
retrieval
previous describe
form
last
need
net model
models.
and
The
queries
tistical
those
the
Croft,
overall
inference
We then the
P
interaction.
we review
experiments.
treated
user
to con-
and
our
information
and
ql
Phrases
(Turtle
a complex,
of an analysis
used in a proba-
towards
is to build
language
basis
nets
a step
rs
In
term
are used
is then
on inference
which
approach.
query
which
represents
search natural
language
query,
based
This
a different
d
in a probabilistic
interpreted
we take
to iden-
dependencies. In this
lintext.
queries
dependency
were
as (in-
in a document
Boolean
re-
A
such
Structure
the
used
that
retrieval,
(information
be detected
we have
information
3 words of retrieval.
work,
work,
and
by
a proximity
to describe
potential
that
using
construct,
In previous
concept
be expressed
within
guistic
proximity,
using
are discussed
a be-
low.
Previous
2
2.1
The
The
inference
used
as the
basis
of
phrases,
ments tion it
4.
It
follows
son,
1977).
net
that
it
particular
the
comparisons for
probability
Typically,
decides
model
information
(Turtle
of different
ranking
principle
a document query
is relevant (Fuhr,
a slightly
P(I|Document information
different
More
need
as a complex
that
trix
specifically,
node
inferin
probagiven
we consider about
on
the
of parents
its
potential
for
the
can
to
compute
be
used
associated
paper.
with
1 shows It
the
consists
of the the
all
Given
the
probability
or
de-
all nodes
a set these
multiof that
and
DAG,
remaining
basic
has
characterizes node
a
a ma-
all possible
a node
that
causes.
roots
q, we draw
the dependence
and
between
probabilities
this
set
If or im-
q contains
P(q|p) for
When
specifies
relationship
Figure
an the
matrix
propositions.
node
specifies
nodes
and edges
p “causes”
by node
The
variables.
representing
belief
a
that
two the
pendence
a par-
The approach
proposition
ple parents, y
represented p to q.
matrix)
of the
between
by a node
is a di-
in which
or constants
relations
from
1989)
(DAG)
variables
proposition
(a link
(Pearl,
graph
represented
edge
values
calculates
is satisfied
the
in
probability
is the
dependence
directed
given
1989).
), which need
document.
is the
represent
sec-
(Robert-
model
which
propositional
in
network
dependency
represent
plies
treat-
model
inference
acyclic
a proposition
1991) is
experiments
a probabilistic
takes
Croft,
retrieval
,Query),
and
a user’s
and
the
a probabilistic
computes
that
Model
Net
and
document
ence bility
for
Bayesian
rected,
model
|Document
a user
ticular
net
the
P(Relevant that
Inference
is
A
Work
of prior networks degree
of
nodes.
inference
of a document
network network
used and
in a
For
retrieval,
teraction network. that
a query
with
the
This and,
and
allows
the information
ument
network
user,
us
is
to to
through
to the
compute
need is met
consequently,
built
attached
the
for any
produce
in-
document probability
particular
doc-
a ranked
list
of
can
be
documents.
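The ranking step described in this section can be illustrated with a minimal sketch. It assumes that each document already supplies a belief value for every query term, that parent nodes are treated as independent, and that Boolean query nodes use the closed forms commonly given for canonical AND, OR and NOT link matrices in the inference network literature. The function names, the tuple encoding of queries, the zero default for missing terms, and the example belief values are all hypothetical and are not taken from this paper.

```python
from math import prod


def bel_and(beliefs):
    """AND node (assumed closed form): product of the parent beliefs."""
    return prod(beliefs)


def bel_or(beliefs):
    """OR node (assumed closed form): one minus the chance that all parents fail."""
    return 1.0 - prod(1.0 - b for b in beliefs)


def bel_not(belief):
    """NOT node (assumed closed form): complement of the single parent belief."""
    return 1.0 - belief


def evaluate(query, term_beliefs, default=0.0):
    """Compute the belief in a query node for one document.

    `query` is either a term (str) or a tuple (operator, child, child, ...).
    `term_beliefs` maps each term to the belief that the document supports it.
    `default` is a hypothetical belief used for terms the document lacks.
    """
    if isinstance(query, str):
        return term_beliefs.get(query, default)
    op, *children = query
    values = [evaluate(child, term_beliefs, default) for child in children]
    if op == "and":
        return bel_and(values)
    if op == "or":
        return bel_or(values)
    if op == "not" and len(values) == 1:
        return bel_not(values[0])
    raise ValueError(f"unsupported operator or arity: {op!r}")


def rank(query, docs):
    """Return (document id, belief) pairs sorted by decreasing belief."""
    scored = [(doc_id, evaluate(query, beliefs)) for doc_id, beliefs in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up term beliefs for two documents, for illustration only.
docs = {
    "d1": {"information": 0.8, "retrieval": 0.5},
    "d2": {"information": 0.4, "retrieval": 0.9},
}
query = ("and", "information", "retrieval")
ranking = rank(query, docs)
```

Because each operator returns a value between 0 and 1, nested structured queries such as ("or", ("and", "information", "retrieval"), "indexing") can be evaluated with the same recursive call.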
2.2 Phrases
The use of phrases
discussed 1. What
2. Are
phrases
(information
query
network
for
the
V Tfiies
∧ retrieval)
the
terms
to
determine
if
a phrase
concepts
or are they
relation-
concepts? weighting
use of phrases
are
systems
issues:
or query?
is an appropriate
Should
4.
query
used
separate
between
3. What Structured
is
IR
following
in a document
ships
2:
of the
evidence
exists
Figure
in experimental
in terms
used
for
for
affect
phrases?
which
indexing
single
word
and
docu-
queries
ments?
query
network.
a collection query gle
The and
its
processing.
node
and
or
more
each information tive
query
The
and
and
document been cific
to
(i.e.
the
content represent
a document
signed.
A representation
given
its
The gle leaf tion the
query that
need
query
networks
expressions.
plex
Figure
query Boolean
operators
matrix
form
(Turtle
showed
that
queries
is at least
version
of the vector
this
such
as those
2 shows has
event
that that
be used the
∧ retrieval)
(information
with
DAG
roots
may
inference as effective space
been
as-
the
need.
formed
with
as the model
set
1991). model
to
both
for
parse
phrase
indexing about
pairs (Van
(1990)
tend
to same
phrases.
Sparck
phrasal
synonymous
used
together
for
other
concept. or nearly
in documents
research
on
term
evidence
parser
techniques, for
together
to idenusing
information than
has been
the
If
may
part
of
hypothesis
words the
mea-
words
being
synonymous
other
of words
associated two
Tait
queries.
example,
For instance,
clustering.
and
on the
Of course, reasons
noun
as phrases
the co-occurrence
mutual
co-occur
grammar. (e.g.
to analyze
such as the expected 1979).
document
Jones
a syntactic
strongly
Rijsbergen,
used the PLNLP of the
identified
are
that
siderable
that
of
and grammars document and
to use semantic
It is possible,
of words
a library Parser-based
constructs
information
use information
sure the
used
cate-
and patterns
a simpler
are then
Dil-
of the
syntactic
against
used
extraction.
semantic
used.
the
(1987)
linguistic tree
example,
general
link
specific
been
is typical
parse
(1988)
phrase
the
measures
1983).
Smeaton
to refine
to identify
Wu,
Fagan
a complete
lin-
template-
noun>),
It is also possible
tify
Boolean”
where
use
Both
are identified
are matched
For example,
in the
canonical
Fox and
have (1983)
as