Information Storage and Retrieval (ISR) is a discipline involved with the organization, structuring, retrieval and display of bibliographic information. ISR systems ...
1984 TOC
VECTOR SPACE MODEL OF INFORMATION RETRIEVAL - A REEVALUATION*

S.K.M. Wong
Department of Computer Science, University of Regina, Regina, Sask., Canada, S4S 0A2

Vijay V. Raghavan+
Department of Computer Science, University of Regina, Regina, Sask., Canada, S4S 0A2

* This research was supported in part by a grant from the Natural Sciences and Engineering Research Council of Canada.
+ This author is currently on leave from the Univ. of Regina and is with Institut für Informatik, TU Berlin, Sekr. FR 5-8, Franklinstr. 28/29, 1000 Berlin 10, FRG.

Abstract. In this paper we, in essence, point out that the methods used in the current vector based systems are in conflict with the premises of the vector space model. The considerations, naturally, lead to how things might have been done differently. More importantly, it is felt that this investigation will lead to a clearer understanding of the issues and problems in using the vector space model in information retrieval.

1. INTRODUCTION

Information Storage and Retrieval (ISR) is a discipline involved with the organization, structuring, retrieval and display of bibliographic information. ISR systems are designed with the objective of providing, in response to a user query, references to documents which would contain the information desired by the user. A typical application for an ISR system is in a library environment where a computerized database consists of books, journals, etc. It is common in information retrieval to represent each document by means of keywords or index terms. When a user submits a request, which is also specified in terms of the keywords, the request is compared with the document
Wong & Raghavan: The vector space model
representations to determine which of the documents should be retrieved.

In essence, then, the system must retrieve references that are of value, or relevant, to the user's request. A document may or may not be relevant to a user query depending on many variables concerning the document (what it is about, how is it organized, is it clear, etc.) as well as on numerous user characteristics (the reason for search, previous knowledge, does the user know what he wants, etc.). Since relevance depends in a complex way on many factors, it is recognized that an ISR system cannot precisely select only and all relevant documents. It has, therefore, been suggested that a retrieval system should attempt to rank documents in the order of their potential relevance to a user query. One approach to provide such a ranking which has been widely
vectors
(Salton 1971;
1983).
The
documents
used
of the vectors.
consists
of
importance When
of
a query
vector method
in the
of
determining
and the documents
2.
to
the
could
vocabulary is
document
n-
the
concerned. the query
based on a chosen
between and
vectors.
a document
of the corresponding
be ranked
an
represents
formulates
the query
of
to the various
document
element
& McGill
in the decreasing
For may
be
vectors order
of
measure.
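
As a rough illustration of the vector method outlined above, the following Python sketch represents each document and the query as an n-element list of keyword weights, scores each document by its scalar product with the query, and ranks the documents in decreasing order of similarity. The vocabulary, the weights, the document names, and the function names `scalar_product` and `rank_documents` are hypothetical choices made for this example; they are not taken from the paper.

```python
def scalar_product(u, v):
    """Scalar (dot) product of two equal-length weight lists."""
    return sum(a * b for a, b in zip(u, v))


def rank_documents(query, documents):
    """Return (name, score) pairs sorted by decreasing similarity to the query."""
    scores = {name: scalar_product(query, vec) for name, vec in documents.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical indexing vocabulary of n = 3 distinct keywords.
vocabulary = ["retrieval", "vector", "library"]

# The i-th element is the importance of the i-th keyword to the document.
documents = {
    "d1": [1, 0, 1],
    "d2": [1, 1, 0],
    "d3": [0, 0, 1],
}
query = [1, 1, 0]

ranking = rank_documents(query, documents)
for name, score in ranking:
    print(name, score)
```

The scalar product is only one possible similarity measure; any other chosen measure could be substituted for `scalar_product` without changing the ranking step.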
MOTIVATION It
is
completely as
i th
as
contents
indexing
each
similarity
product
Salton the
the documents
between
as the scalar
if the
the system
against
similarity
defined
keyword
is presented,
and matches
example,
this
i th
the
and queries
to correspond
keywords,
which
1979;
describe
Thus,
n distinct
vector
to
are assumed
elements
element
documents
van Rijsbergen
keywords
or queries
models
used over the years
basis
clear
informal. vectors,
orthogonality whether obeyed
that
the
notions
presented
Formal notions from linear linear
independence
are carefully
avoided.
or Even
above
are
algebra such
dependence, the question
and of
we have a vector space, that is, are the axioms to be by
the
elements
of
a vector
space
appropriate
for
information that not
retrieval,
in the early make
vector
any
considered Thus,
the
notion
to
data
simply
a notational
other
defined quite
similar
been
of
in
vector
Salton
to
the
corresponding
to
documents
vectors in
various in
that
such
paper
vector
space.
Moreover,
vector
processing
model
is
choice,
instead
the
term
above,
all
of
1983).
things
are
not
a
used,
objects other
and
vector the
we
is
that
space
dandy
come
retrieval a brief
index
terms
space,
and
as the
subsequent
depend
on
the
as vectors papers,
in a
the
term
by conscious
model
(Salton
model and
or a
have
work
presume
"vector"
and
here
But
as
as random
we
really
in several
terms,
such
In this
a space.
retrieval
fine
objects
of viewing
do
Given
retrieval
point
of
or
an operation
information
dimensions
even
usually
reference in
or
to be a logical
al. (1975).
possibility
information
al.
main
spaces
et
of
et
The
merely
was
notions
earliest
is made
Salton
involving
is
product
in set theoretic
the
idea
modelling
modelled
above
model
information
a
size.
scalar
logical
not intended
to
developments
The
to
Instead,
is simply
was
across
as
function
made
a physical
the
a vector In fact,
mention
Similarly,
seems
languages,
considered of
it
was
of certain
sort
function.
tool.
was
some
terms
formal
the
array
example,
density
of
literature
in programming
structure.
For
have
and
the concept
spaces.
similarity
as in statistical
variables
to vector
vector,
aspect.
on the data
and processes
of
fact,
effort
what,
structure;
different.
well
or
In
a conscious
a one-dimensional
refers
some
considered.
connection
a tuple
as
not
literature
direct
meant
is
1980;
as outlined
one
felt
quite
content. However, curiosity one
take
(1983),
specific space.
van
Van
vectors
Rijsbergen
(p.41)
representatives
dimensional
Euclidean
environment
dimensions
correspond
model
in
a vector to
the
retrieval
and
that
vectors observes
Salton
Koll
(1979)
space,
in
different
Salton
which index
and make
vector considers
embedded that
our
context
a multidimensional
infers
Koll
aroused
seriously.
(1979),
as binary
space.
have
information
space
Rijsbergen of
developments
in the
the vector
mention
document
system
recent
to ask whether
should
McGill
certain
in
an n-
in the
SMART
the terms
various in
the
vocabulary
and where
assumed. further first
Salton and
is felt
of
is
independently
and
exist
and
each
issue,
clear
what
which
of
the
easier
of
vector
was
two
accepted
these
retrieval. the
two
used
consistent
vector
spaces,
it the
as
as
a
being
where
not
term
assigned warrant
as
we
was
under
earlier
or
not c o n s i s t e n t this
traditional vector
paper
things
importantly,
it
are
might
in
have
of
an
and
to
fairly
If this
be
had
been
that
the
are
is,
and
can
either
of
what
Unfortunately,
retrieval
taken
all along
processes
approximations
as
should
construct
demonstrate
be
that
such
not
that
of
well
though,
be
assert
the
array.
orthogonality
and
model.
in
appears
interpretations
essence,
The
is felt
to
adopt
to notions
spaces.
the vector in
model.
understanding
able
is
notion
or
it
it
accept
the
as a logical
objects
space
we,
and
of vector
with
can
mean,
made
we may
information
approaches space
been
to
and
It would
reasonable
in
since
traditional
intended
and
we
a tuple
flirtings
be
the vector work
wish
of
space,
retrieval
should
with
we
it is not
Rather
sense
hand,
practices
consistent
mean.
along
have
casual
that
all
field.
in the context
traditional
a clear
are
that
the
that
On the other
case,
how
terms
statements
vectors
answer
n-dimensional
information
understood
the
in our
the notion of vectors
to
term
reality
terms These
of
of
references
seriously.
the
the
statements
easier
most
disregarded
In
are only
capture
index
the
On the one hand,
in
the
practices
the
part,
each
ideas
First,
adequately
is not so much
answers:
only
with
any of
that
other.
notions
represents
correct
index
at the outset,
precisely
information
the
to
involved
is
scrutiny.
The
that
not
assuming
contrary
these
situation.
treating
deemed
of
true
does
orthogonal,
discuss
assumptions
to the
Secondly,
relationships
careful
the
view
coordinate
orthogonal,
be
that
the vector
scope.
separate
This
out
are mutually
(p.129-130)
approximations
that
notion
terms
and McGill
point
order
the
is
we
find
for
the
most
the
ways
space model. point
conflict
with
considerations,
out the
premises
naturally,
been
done
that
this
investigation
the
issues
differently.
and problems
the
will
of lead
More lead
in using
to the
vector
space model
In
addition
modelling
of
significant
and
the
of
More
that
by being
term vectors
or concepts
number
3.
used
in
current
the
both
is
also
for a model
WEIRD
terms
the
their
work
system
and
by
Koll are
location)
Similarly,
of documents
that
documents
(or "mean"
that they contain.
space
about
objects,
the groundwork
to investigate
of distinct
gains
of
terms
or concepts.
It
the problem of dimensionality of fewer
dimensions
than
the
index terms.
THE VECTOR SPACE MODEL
that as
the
a combination
a (vector)
one
retrieval
as a combination
is also possible and identify
insight
specifically,
be viewed
retrieval.
processes,
represented
may
new
in that it lays
reminiscent
(1979).
to
information
relationships
is
in information
The basic premise in adopting the vector space model is that the various objects of an information retrieval system (documents, queries, terms, concepts, and so on) are modelled as elements of a vector space. Specifically, the existence of a vector space implies that we have a system with linear properties: the ability to add together any two elements of the system to obtain a new element of the system, and the ability to multiply any element by a real number. Furthermore, the vectors obey a number of basic algebraic rules or axioms (e.g. $\underline{x} + \underline{y} = \underline{y} + \underline{x}$, for any vectors $\underline{x}$, $\underline{y}$). Note that a letter with underscore denotes a vector.

Let us first consider the issue of representation of documents in terms of the index terms. Let $t_1, t_2, \ldots, t_n$ be the terms used to represent documents. Corresponding to each term, $t_i$, there exists a vector $\underline{t}_i$ in the space. Without loss of generality, it is assumed that $\underline{t}_i$'s are vectors of unit length. Now, suppose that each document $D_r$, $1 \le r \le m$, is a vector expressed in terms of $\underline{t}_i$'s. Let the document vector $\underline{D}_r$ be

$$\underline{D}_r = (a_{1r}, a_{2r}, \ldots, a_{nr}).$$

Since it is sufficient to restrict our scope of discussion to
the subspace spanned by the term vectors, the $\underline{t}_i$'s can be thought to be the generating set. Every vector in this subspace, and in particular all document vectors, are linear combinations of the term vectors. Thus, $\underline{D}_r$ can be, equivalently, expressed as

$$\underline{D}_r = \sum_{i=1}^{n} a_{ir}\,\underline{t}_i \qquad (1)$$

The coefficients of $\underline{D}_r$
along We
vector
$a_{ir}$, for
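
To make the expansion of a document vector in terms of the term vectors concrete, here is a minimal Python sketch that forms the linear combination a_1r t_1 + ... + a_nr t_n from given coefficients. The two-dimensional term vectors (chosen to have unit length), the coefficients, and the helper name `linear_combination` are assumptions made only for illustration.

```python
def linear_combination(coefficients, term_vectors):
    """Return sum_i coefficients[i] * term_vectors[i], component by component."""
    dimension = len(term_vectors[0])
    return [
        sum(a * t[k] for a, t in zip(coefficients, term_vectors))
        for k in range(dimension)
    ]


# Hypothetical term vectors of unit length.
t1 = [1.0, 0.0]
t2 = [0.6, 0.8]

# Hypothetical coefficients a_1r and a_2r of one document D_r.
a_r = [2.0, 1.0]

d_r = linear_combination(a_r, [t1, t2])
print(d_r)
```

The tuple of coefficients and the resulting vector describe the same document; the coefficients only acquire a geometric meaning once the term vectors themselves are fixed.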
introduce
spaces,
one
several
implies this (ii)
space
(iii)
any
concepts
in
theorems
set
basis
in linear
the generating of
linearly
at most is
a
independent
at most
important
dependence.
being
contains
a
linearly
most
the components
algebra
(Goult
that
... tn} that
because
has
are
known
it can be seen
{t1, t2,
the most
(i)
of
that of linear
1978),
and l

A set of vectors $\underline{y}_1, \underline{y}_2, \ldots, \underline{y}_k$ are linearly dependent if we find some scalars $a_1, a_2, \ldots, a_k$, not all zero, such that

$$a_1\underline{y}_1 + a_2\underline{y}_2 + \cdots + a_k\underline{y}_k = 0.$$
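
Linear dependence can be checked mechanically: a finite set of vectors is linearly dependent exactly when its rank is smaller than the number of vectors. The sketch below is one way to test this in Python, using Gaussian elimination over exact fractions. The function names `rank_of` and `is_linearly_dependent` and the sample vectors are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction


def rank_of(vectors):
    """Rank of a list of equal-length vectors, by Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    n_cols = len(rows[0]) if rows else 0
    pivots = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot row with a nonzero entry.
        pivot = next(
            (r for r in range(pivots, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue
        rows[pivots], rows[pivot] = rows[pivot], rows[pivots]
        # Eliminate this column from every row below the pivot row.
        for r in range(pivots + 1, len(rows)):
            factor = rows[r][col] / rows[pivots][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[pivots])]
        pivots += 1
    return pivots


def is_linearly_dependent(vectors):
    """True if some scalars, not all zero, combine the vectors to zero."""
    return rank_of(vectors) < len(vectors)


print(is_linearly_dependent([[1, 0], [0, 1], [1, 1]]))
print(is_linearly_dependent([[1, 0], [0, 1]]))
```

In the first call the third vector is the sum of the first two, so the scalars 1, 1, -1 satisfy the defining equation; in the second call no such scalars exist.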