Vector Space Model of Information Retrieval - A

17 downloads 0 Views 794KB Size Report
Information Storage and Retrieval (ISR) is a discipline involved with the organization, structuring, retrieval and display of bibliographic information. ISR systems ...
1984 TOC

Search V E C T O R SPACE MODEL OF INFORMATION RETRIEVAL - A REEVALUATION* S.K.M. Wong Department Regina,

of Computer

Sask.

Science,

Canada,

University

of Regina,

University

of Regina,

S4S 0A2

Vijay V. R a g h a v a n + Department Regina,

Abstract.

used

conflict

in

with

issues

and

information

we,

S4S 0A2

in e s s e n c e ,

current

naturally,

differently.

investigation

1.

paper

the

vector

based

of the v e c t o r

lead to how

systems

things might

lead

to a c l e a r e r

understanding

in

using

the

are

is f e l t

vector

The

that

space

in

have been

it

problems

the

space model.

importantly,

this

of

the

model

in

retrieval.

INTRODUCTION Storage

and Retrieval

involved

w i t h the o r g a n i z a t i o n ,

display

of

designed user

query,

computerized database It

bibliographic

with

information

references desired

submits

a request,

keywords,

This

Natural

the

research Sciences

This

author

currently Franklinstr.

which

request

journals,

with 28/29,

are

in r e s p o n s e

to a

contain

application

compared

supported

on

leave

Institut Sekr.

retrieval

in p a r t

terms.

fur

FR 5-8,

with

where

a

the

Univ.

a user

in t e r m s

of the

the

Council of

Informatik, i000 Berlin

each

When

document

by a g r a n t

Research

from

for

to represent

is also s p e c i f i e d is

the

etc.

or i n d e x

and E n g i n e e r i n g is

systems

is in a library e n v i r o n m e n t

of k e y w o r d s

was

and

would

A typical

in i n f o r m a t i o n

by m e a n s

which

a discipline

retrieval

ISR

of p r o v i d i n g ,

by the user.

consists of books, is c o m m o n

is

structuring,

to d o c u m e n t s

ISR s y s t e m

(ISR)

information.

the o b j e c t i v e

document

+

p o i n t out t h a t

More

will

Information

*

Science,

Canada,

the p r e m i s e s

considerations, done

Sask.

In this

methods

of Computer

from

the

of Canada.

Regina TU

and

is

Berlin,

i0, FRG.

Wong & Raghaven: The vector space model

representations

to d e t e r m i n e

retrieved.

In

essence,

references

that

are

of

168

w h i c h of the d o c u m e n t s

then, value,

the or

system

should be

must

relevant,

to

retrieve the

user's

request. A document

m a y or m a y

depending

on m a n y

is about,

how

on

numerous

previous Since

variables

user

does

relevance depends

suggested

that

documents

(what

it

etc.) as w e l l

as

reason what

way

query

for

search,

he wants,

on m a n y

etc.).

factors,

it

cannot p r e c i s e l y select only

documents.

in the order

know

in a c o m p l e x

a retrieval

the d o c u m e n t

(the

the user

that an ISR s y s t e m

relevant

to a user

is it c l e a r ,

characteristics

knowledge,

all

concerning

is it o r g a n i z e d ,

is recognized and

not be r e l e v a n t

It

has,

system

therefore,

should

of their potential

attempt

relevance

been to

rank

to a user

query. One

approach

to p r o v i d e

which

has b e e n w i d e l y

such a ranking

vectors

( S a l t o n 1971;

1983).

The

documents

used

of the v e c t o r s .

consists

of

importance When

of

a query

vector method

in the

of

determining

and the d o c u m e n t s

2.

to

the

could

vocabulary is

document

n-

the

concerned. the q u e r y

b a s e d on a c h o s e n

between and

vectors.

a document

of the c o r r e s p o n d i n g

be ranked

an

represents

formulates

the query

of

to the various

document

element

& McGill

in the d e c r e a s i n g

For may

be

vectors order

of

measure.

MOTIVATION It

is

completely as

i th

as

contents

indexing

each

similarity

product

Salton the

the d o c u m e n t s

between

as the s c a l a r

if the

the s y s t e m

against

similarity

defined

keyword

is p r e s e n t e d ,

and m a t c h e s

example,

this

i th

the

and q u e r i e s

to c o r r e s p o n d

keywords,

which

1979;

describe

Thus,

n distinct

vector

to

are a s s u m e d

elements

element

documents

van Rijsbergen

keywords

or queries

models

u s e d o v e r the y e a r s

basis

clear

informal. vectors,

orthogonality whether obeyed

that

the

notions

presented

F o r m a l notions from linear linear

independence

are c a r e f u l l y

avoided.

or Even

above

are

algebra such

dependence, the q u e s t i o n

and of

w e h a v e a v e c t o r s p a c e , t h a t is, a r e the a x i o m s to be by

the

elements

of

a vector

space

appropriate

for

Wong & Raghaven: The vector space model information that not

retrieval,

in the e a r l y make

vector

any

considered Thus,

the

notion

to

data

simply

a notational

other

defined quite

similar

been

of

in

vector

Salton

to

the

corresponding

to

documents

vectors in

various in

that

such

paper

vector

space.

Moreover,

vector

processing

model

is

choice,

instead

the

term

above,

all

of

1983).

things

are

not

a

used,

objects other

and

vector the

we

is

that

space

dandy

come

retrieval a brief

index

terms

space,

and

as the

subsequent

depend

on

the

as v e c t o r s papers,

in a

the

term

by c o n s c i o u s

model

(Salton

model and

or a

have

work

presume

"vector"

and

here

But

as

as r a n d o m

we

really

in s e v e r a l

terms,

such

In this

a space.

retrieval

fine

objects

of v i e w i n g

do

Given

retrieval

point

of

or

an o p e r a t i o n

information

dimensions

even

usually

reference in

or

to be a l o g i c a l

al. (1975).

possibility

information

al.

main

spaces

et

of

et

The

merely

was

notions

earliest

is m a d e

Salton

involving

is

product

in set t h e o r e t i c

the

idea

modelling

modelled

above

model

information

a

size.

scalar

logical

not intended

to

developments

The

to

Instead,

is s i m p l y

was

across

as

function

made

a physical

the

a vector In f a c t ,

mention

Similarly,

seems

languages,

considered of

it

was

of c e r t a i n

sort

function.

tool.

was

some

terms

formal

the

array

example,

density

of

literature

in p r o g r a m m i n g

structure.

For

have

and

the concept

spaces.

similarity

as in s t a t i s t i c a l

variables

to v e c t o r

vector,

aspect.

on the d a t a

and p r o c e s s e s

of

fact,

effort

what,

structure;

different.

well

or

In

a conscious

a one-dimensional

refers

some

considered.

connection

a tuple

as

not

literature

direct

meant

is

169

1980;

as o u t l i n e d

one

felt

quite

content. However, curiosity one

take

(1983),

specific space.

van

Van

vectors

Rijsbergen

(p.41)

representatives

dimensional

Euclidean

environment

dimensions

correspond

model

in

a vector to

the

retrieval

and

that

vectors observes

Salton

Koll

(1979)

space,

in

different

Salton

which index

and make

vector considers

embedded that

our

context

a multidimensional

infers

Koll

aroused

seriously.

(1979),

as b i n a r y

space.

have

information

space

Rijsbergen of

developments

in the

the v e c t o r

mention

document

system

recent

to ask w h e t h e r

should

McGill

certain

in

an n-

in the

SMART

the terms

various in

the

Wong & Raghaven: The vector space model vocabulary

and w h e r e

assumed. further first

Salton and

is felt

of

is

independently

and

exist

and

each

issue,

clear

what

which

of

the

easier

of

vector

was

two

accepted

these

retrieval. the

two

used

consistent

vector

spaces,

it the

as

as

a

being

where

not

term

assigned warrant

as

we

was

under

earlier

or

not c o n s i s t e n t this

traditional vector

paper

things

importantly,

it

are

might

in

have

of

an

and

to

fairly

If this

be

had

been

that

the

are

is,

and

can

either

of

what

Unfortunately,

retrieval

taken

all a l o n g

processes

approximations

as

should

construct

demonstrate

be

that

such

not

that

of

well

though,

be

assert

the

array.

orthogonality

and

model.

in

appears

interpretations

essence,

The

is felt

to

adopt

to notions

spaces.

the vector in

model.

understanding

able

is

notion

or

it

it

accept

the

as a logical

objects

space

we,

and

of vector

with

can

mean,

made

we may

information

approaches space

been

to

and

It w o u l d

reasonable

in

since

traditional

intended

and

we

a tuple

flirtings

be

the vector work

wish

of

space,

retrieval

should

with

we

it is n o t

Rather

sense

hand,

practices

consistent

mean.

along

have

casual

that

all

field.

in the context

traditional

a clear

are

that

the

that

On the o t h e r

case,

how

terms

statements

vectors

answer

n-dimensional

information

understood

the

in our

the notion of vectors

to

term

reality

terms These

of

of

references

seriously.

the

the

statements

easier

most

disregarded

In

are only

capture

index

the

On the one hand,

in

the

practices

the

part,

each

ideas

First,

adequately

is not so m u c h

answers:

only

with

any of

that

other.

notions

represents

correct

index

at the o u t s e t ,

precisely

information

the

to

involved

is

scrutiny.

The

that

not

assuming

contrary

these

situation.

treating

deemed

of

true

does

orthogonal,

discuss

assumptions

to the

Secondly,

relationships

careful

the

view

coordinate

orthogonal,

be

that

the vector

scope.

separate

This

out

are m u t u a l l y

(p.129-130)

approximations

that

notion

terms

and M c G i l l

point

order

the

170

is

we

find

for

the

most

the

ways

space model. point

conflict

with

considerations,

out the

premises

naturally,

been

done

that

this

investigation

the

issues

differently.

and p r o b l e m s

the

will

of lead

More lead

in using

to the

Wong & Raghaven: The vector space model

vector

space model

In

addition

modelling

of

significant

and

the

of

More

that

by b e i n g

term vectors

or concepts

number

3.

used

in

current

the

both

is

also

for a m o d e l

WEIRD

terms

the

their

work

system

and

by

Koll are

location)

Similarly,

of d o c u m e n t s

that

documents

(or " m e a n "

that they contain.

space

about

objects,

the g r o u n d w o r k

to i n v e s t i g a t e

of distinct

gains

of

terms

or concepts.

It

the p r o b l e m of d i m e n s i o n a l i t y of f e w e r

dimensions

than

the

index terms.

THE V E C T O R SPACE M O D E L

that as

the

a combination

a (vector)

one

retrieval

as a c o m b i n a t i o n

is also possible and i d e n t i f y

insight

specifically,

be v i e w e d

retrieval.

processes,

represented

may

new

in t h a t it lays

reminiscent

(1979).

to

information

relationships

is

in i n f o r m a t i o n

171

The

basic

premise

the

various

elements

of

a

queries,

the v e c t o r

space.

we

have

ability

to

obtain

a

new

multiply

Note

or

with

together

any of

element

the v e c t o r s

two

obey

Let

consider

first

used to represent

term,

there

exists

of g e n e r a l i t y ,

unit length.

Now,

expressed

a vector

~i

it is a s s u m e d

suppose in

documents.

terms

real

to to

number. algebraic x, ~).

a vector.

of

representation

of

Let tl, t2, ... t n be Corresponding

in the

space.

t h a t ~i's

that each d o c u m e n t of ~i's.

the

ability

any v e c t o r s

denotes

issue

a

in

implies

the s y s t e m

of b a s i c

for

in t e r m s of the i n d e x t e r m s .

the terms

vector

the

terms,

space

the

by

a number

(e.g. x + ~ = y + x,

of

is

modelled

properties:

and

system

with underscore

ti,

linear

system

the

are

all v e c t o r s

a vector

elements

that a letter us

objects

model

Specifically

of

the

the

of

space

and so on are

existence

system

axioms

documents

loss

space.

concepts, The

the vector

retrieval

vector

element

any

Furthermore, rules

a

add

adopting

information

documents,

that

of

Let

to each Without

are v e c t o r s Dr,

l~r~m,

the d o c u m e n t

of

is a

vector

~r be

~r = Since

(alr,

a2r,

it is sufficient

"-" anr)" to restrict our

scope of d i s c u s s i o n

to

Wong & Raghaven: The vector space model

the

subspace

thought

spanned

to

subspace,

be

the

and

by

the

generating

in p a r t i c u l a r

combinations

of

equivalently,

term

the

vectors,

set.

the ~i's

Every

all d o c u m e n t

term

expressed

172

vector

vectors,

vectors.

Thus,

can in

are _Dr

be

this linear

can

be,

as

n Dr =

The c o e f f i c i e n t s of Dr

along We

vector

(i)

~ airti i=l

air , for

introduce

spaces,

one

Z2,

"'" [k

al,

a2,

... ak, not

are l i n e a r l y all

several

implies this (ii)

space

(iii)

any

such

concepts

in

A set of v e c t o r s

if we

find

some

scalars

that

theorems

set

basis

in l i n e a r

the g e n e r a t i n g of

linearly

at m o s t is

a

independent

at m o s t

important

dependence.

dependent

being

contains

a

linearly

most

the c o m p o n e n t s

algebra

(Goult

that

... tn} that

because

has

are

akY k = 0 .

known

it can be seen

{tl, t2,

the m o s t

zero,

al_Y1 + a2Y 2 + ....

(i)

of

t h a t of l i n e a r

ZI,

1978),

and l

Suggest Documents