ON EXTENDING THE VECTOR SPACE MODEL FOR BOOLEAN QUERY PROCESSING

S. K. M. Wong, W. Ziarko, V. V. Raghavan and P. C. N. Wong
Department of Computer Science, University of Regina, Regina, Canada S4S 0A2

Abstract. An information retrieval model, named the Generalized Vector Space Model (GVSM), is extended to handle situations where queries are specified as (extended) Boolean expressions. It is shown that this unified model, unlike currently available alternatives, has the advantage of incorporating term correlations into the retrieval process. The query language extension is attractive in the sense that most of the algebraic properties of the strict Boolean language are still preserved. Although the experimental results for extended Boolean retrieval are not always better than the vector processing method, the developments here are significant in facilitating commercially available retrieval systems to benefit from the vector based methods. The proposed scheme is compared to the p-norm model advanced by Salton and coworkers. An important conclusion is that it is desirable to investigate further extensions that can offer the benefits of both proposals.

1. INTRODUCTION

Information Retrieval (IR) is a discipline involved with the organization, structuring, analysis, storage, searching and dissemination of information. IR systems are designed to make available a given stored collection of information items with the objective of providing, in response to a user query, references that would contain the information desired by the user. In other words, the system is intended to identify which documents the user should read in order to satisfy his (her) information requirements. In this environment, we have a collection of documents (e.g. books, journal articles, technical reports, etc.). In order to identify which documents the user should read with respect to his information requirements, some method for the representation of what the documents are about (i.e. knowledge representation of documents) is needed. A document may or may not be relevant to a user query depending on many variables concerning the document (e.g. its scope, how it is written) as well as numerous user characteristics (e.g. why the search is initiated, user's previous knowledge). In any case, whatever the information retrieval system does, if a document is judged by the user to be of interest, it is relevant; otherwise, it is non-relevant. Since many factors may influence the judgement concerning relevance in a complex way, it is easy to see that designing an IR system within this frame of reference is very challenging. In designing the retrieval strategies, the IR researchers take the view that the systems should adopt methods that facilitate the ranking of documents in the order of their estimated usefulness to a user query.

It is common, in IR, to suppose that each document is indexed by a set of "content" identifiers that are variously known as keywords, index terms, subject indicators or concepts. This would require the application of some automatic or manual indexing technique to the full text or some surrogate (e.g. abstract) of the documents in order to identify the index terms to be used in their representation. In addition to the selection of index terms to represent documents, it is also common to associate weights that reflect the importance of each term as an indicator of the content of the documents to which it is assigned. The user request may, for example, be in the form of a natural language statement or a Boolean expression, and in either case the query may be represented within the IR system as a set of (index term, weight) pairs. The retrieval operation often consists of matching sets of index terms assigned to the stored documents with the keywords representing the user query. The matching is followed by the retrieval of those documents whose content identifiers exhibit a sufficiently high degree of similarity to the keywords of the user query. The system can obtain feedback from the user as to the appropriateness of the initial set of retrieved documents to construct improved query formulations. This kind of operation, known as relevance feedback, is particularly helpful in obtaining more effective retrieval output.

In the past, several mathematical models for document retrieval systems have been developed [C82, S83, S83a, l76, WO84]. These models are used to formally represent the basic characteristics, functional components, and the retrieval processes of document retrieval systems. Two basic categories of models that have been employed in information retrieval are the vector processing models and the Boolean retrieval models.

In the conventional vector space model (VSM), proposed by Salton [S71, S83], index terms are basic vectors in a vector space. Each document or query is represented as a linear combination of these basic term vectors. The retrieval operation consists of computing the cosine similarity function between a given query vector and the set of document vectors and then ranking documents accordingly. In this approach, the interpretation that the occurrence frequency of a term in a document represents the component of the document vector along the corresponding basic term vector is made. The advantages of this model are that it is simple and yet powerful. The vector operations can be performed efficiently enough to handle very large collections. Furthermore, it has been shown that the retrieval effectiveness is significantly higher compared to that of the Boolean retrieval models. However, this vector model has been incorporated into very few commercial systems.
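The cosine ranking used by the conventional vector space model can be sketched as follows. This is only a minimal illustration under stated assumptions: the three-term vocabulary, the document ids, the term-frequency weights and the function names `cosine` and `rank` are all hypothetical and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as matching nothing
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document ids sorted by decreasing cosine similarity to the query."""
    scores = {doc_id: cosine(query, vec) for doc_id, vec in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical vocabulary: ("vector", "boolean", "query").
# Each component is the occurrence frequency of the term in the document.
docs = {
    "d1": [3, 0, 1],
    "d2": [0, 2, 2],
    "d3": [1, 1, 1],
}
query = [1, 0, 1]
ranking = rank(query, docs)
```

Every document receives a graded score here, so the output is an ordering rather than a yes/no decision for each document.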

Permission to copy without fee all or part of this material is granted provided that the copyright notice of the "Organization of the 1986-ACM Conference on Research and Development in Information Retrieval" and the title of the publication and its date appear.

© 1986 Organization of the 1986-ACM Conference on Research and Development in Information Retrieval


In the strict Boolean retrieval systems [BU81, P84] the user query normally consists of index terms that are connected by Boolean operators AND, OR and NOT. The advantage of using Boolean connectives is to provide a better structure to formulate the user query. The major problem in such a system is that there is no provision for associating weights of importance to the terms which are assigned either to the documents or to the queries. In other words, the representation is binary, indicating either the presence or the absence of the various index terms. The output obtained in response to a query is not ranked in any order of presumed importance to the user. In most cases, the AND connectives tend to be too restrictive [BU81]. Most commercially available retrieval systems essentially conform to this model.
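A strict Boolean system of this kind can be sketched with plain set operations. The inverted index, the document ids and the helper names `and_`, `or_` and `not_` below are hypothetical and serve only to illustrate the binary, unranked behaviour described above.

```python
# Hypothetical inverted index: term -> set of ids of the documents containing it.
index = {
    "vector": {"d1", "d3"},
    "boolean": {"d2", "d3"},
    "query": {"d1", "d2", "d3"},
}
all_docs = {"d1", "d2", "d3", "d4"}

def and_(a, b):
    """Documents satisfying both operands."""
    return a & b

def or_(a, b):
    """Documents satisfying at least one operand."""
    return a | b

def not_(a):
    """Documents in the collection that do not satisfy the operand."""
    return all_docs - a

# vector AND NOT boolean
result = and_(index["vector"], not_(index["boolean"]))

# vector AND boolean AND query: every extra conjunct can only shrink the result
narrow = and_(and_(index["vector"], index["boolean"]), index["query"])
```

The result is a bare set: a document either satisfies the expression or it does not, and there is no weight from which an ordering could be derived.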

One of the challenges for researchers in information retrieval has been to achieve greater acceptance of the vector processing models in commercial systems. The main difficulty in this connection is due to the inability of the vector processing systems to handle Boolean queries. In recent years some progress has been made in expressing Boolean queries as vectors [S83a, S83b]. If attractive ways to achieve this are advanced, it would then be possible to modify existing systems to use vector processing techniques without a great deal of cost and effort.

Another problem in the conventional vector space model is that it assumes that term vectors are orthogonal. It is generally agreed that terms are correlated and it is necessary to generalize the model to incorporate term correlations. A vector processing model termed the GVSM [WO84a, WO85] was proposed in response to this need. In the GVSM, the queries are assumed to be presented as a list of terms and corresponding weights. Thus, no provision is made for processing Boolean queries. However, the premises of the model naturally lead to a scheme for handling Boolean queries. In this paper we present the details of this scheme. This result will help achieve the aim of integrating vector processing capabilities into existing systems which use Boolean retrieval models.

The paper is organized in the following manner. In Section 2 we review the main characteristics of the GVSM. In Section 3, its connection to the strict Boolean retrieval model is explained and the characterization of strict Boolean in vector model environment is presented. This vector model for strict Boolean is then generalized to handle weighted queries and documents. In Section 4, the various ideas and models presented in the earlier sections are summarized. Then, in Section 5, the proposed scheme is compared with the p-norm model [S83a, S83b]. This section also presents experimental results which show that the proposed scheme is effective. The final section offers some concluding remarks and areas for further research.

2. REVIEW OF THE GVSM

2.1. Vector Representation of Elements of a Boolean Algebra

Let $x_1, x_2, \ldots, x_n$ be $n$ literals used to generate the free Boolean algebra, denoted $B_n$. Any Boolean expression composed of these literals (using operators AND, OR or NOT) is an element of the algebra.

What we desire is to identify a vector space such that every Boolean expression in $B_n$ corresponds to a vector in the vector space. In a vector space it is necessary to specify a set of vectors that form a basis. Clearly, if a basis is known then any vector in the space can be expressed as a linear combination of the basic vectors. Since the intent is to obtain a way of expressing every possible Boolean expression, it is appropriate to have the set of basic vectors correspond to a set of fundamental expressions which can be combined to generate any element of the algebra. We, therefore, employ the notion of an atomic expression. An atomic expression, or a minterm, in the $n$ literals $x_1, x_2, \ldots, x_n$ is a conjunction of the literals where each $x_i$ appears exactly once and is either in complemented or uncomplemented form. Clearly, there are $2^n$ minterms in all. It is well known that the conjunction of any two distinct minterms is always zero (false) and that any Boolean expression in the literals $x_1, x_2, \ldots, x_n$ can be uniquely expressed as a disjunction of minterms. The representation obtained in this way is the well known disjunctive normal form [GE82].

Let $\{m_k\}_{2^n}$ denote the set of minterms in $B_n$. In order to characterize a vector space in which these correspond to the basic vectors, we define a set of $2^n$-dimensional vectors $\{\vec{m}_k\}$. These vectors constitute an orthonormal basis of the vector space $R^{2^n}$ as follows:

$$
\begin{aligned}
\vec{m}_1 &= (1, 0, 0, \ldots, 0) \\
\vec{m}_2 &= (0, 1, 0, \ldots, 0) \\
\vec{m}_3 &= (0, 0, 1, \ldots, 0) \\
&\;\;\vdots \\
\vec{m}_{2^n} &= (0, 0, 0, \ldots, 1)
\end{aligned}
\tag{2.1}
$$

Given these, it is easily seen that the vector representation of any Boolean expression is given by the vector sum of the basic vectors which correspond to the minterms in the disjunctive normal form of the expression. The assertion that for any two vectors $\vec{m}_i$, $\vec{m}_j$ with $i \neq j$, the scalar product $\vec{m}_i \cdot \vec{m}_j$ is zero corresponds to the fact that the conjunction of atomic expressions $m_i$ and $m_j$ is false. In general, if two vectors are not orthogonal, then the corresponding Boolean expressions have at least one minterm in common.

2.2. Vector Representation of Terms Assuming No Weights

The ideas developed in Section 2.1 can be applied to an information retrieval environment and each index term can be given an explicit vector representation. Let the indexing vocabulary consist of the literals $t_1, t_2, \ldots, t_n$. Any literal can appear in a Boolean expression either as $\bar{t}_i$ or $t_i$ depending on whether it needs to be complemented or not. In particular, conjunctive expressions where every literal appears in either uncomplemented or complemented form are the atomic expressions. Let $\{m_k\}_{2^n}$ denote the set of all atomic expressions. Then, since each $t_i$ is itself an element of the Boolean algebra generated, $t_i$ can be expressed in its disjunctive normal form:

$$
t_i = m_{i_1} \;\mathrm{OR}\; m_{i_2} \;\mathrm{OR}\; \cdots \;\mathrm{OR}\; m_{i_r},
\tag{2.2}
$$

where the $m_{i_j}$'s are minterms in which $t_i$ is uncomplemented. Let the set of minterms in eqn. (2.2) be denoted by $\{m\}^i$. We can now define the basic vectors analogous to eqn. (2.1) and the term $t_i$ can be written in the vector notation as

$$
\vec{t}_i = \sum_{m_k \in \{m\}^i} \vec{m}_k .
\tag{2.3}
$$

Alternatively,

$$
\vec{t}_i = \sum_{k=1}^{2^n} c_{ik}\, \vec{m}_k ,
\tag{2.4}
$$

where

$$
c_{ik} =
\begin{cases}
1 & \text{if } m_k \in \{m\}^i \\
0 & \text{otherwise.}
\end{cases}
$$

That is, the term vectors are a linear combination of the $\vec{m}_k$'s, the basic vectors, and the vector summation corresponds to the OR's of eqn. (2.2). Furthermore, the scalar product between any two basic vectors is zero, corresponding to the fact that the ANDing of two minterms is false.
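The construction of Sections 2.1 and 2.2 can be traced on a small case. The sketch below enumerates the minterms for a vocabulary of n terms and builds the binary expansion of each term over the minterm basis, as in eqn. (2.4). The function names, the zero-based term index and the particular ordering of the minterms are assumptions made for illustration; the text does not prescribe them.

```python
from itertools import product

def minterms(n):
    """All 2**n minterms as truth assignments.
    Entry i of an assignment is True when literal i is uncomplemented."""
    return list(product([True, False], repeat=n))

def term_vector(i, n):
    """Binary coefficients of term i over the minterm basis (eqn. 2.4):
    1 where term i is uncomplemented in the minterm, 0 otherwise.
    The index i is zero-based here."""
    return [1 if m[i] else 0 for m in minterms(n)]

def expression_vector(predicate, n):
    """Vector of a Boolean expression: the sum of the basic vectors of the
    minterms in its disjunctive normal form."""
    return [1 if predicate(m) else 0 for m in minterms(n)]

def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

n = 2
t1 = term_vector(0, n)
t2 = term_vector(1, n)

# Two distinct minterms: (t1 AND NOT t2) and (NOT t1 AND t2)
m_a = expression_vector(lambda m: m[0] and not m[1], n)
m_b = expression_vector(lambda m: (not m[0]) and m[1], n)
```

With two terms, `t1` and `t2` share exactly the minterm in which both are uncomplemented, so their scalar product is non-zero, whereas `m_a` and `m_b` are orthogonal because the conjunction of two distinct minterms is false.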


2.3. The Generalized Vector Space Model (GVSM)

In this section we will review the essential features of the GVSM [WO84, WO85]. This model is the result of incorporating the idea developed in Section 2.2 into the framework of the conventional vector space model. One of the main steps in this process involves the generalization of the term vector representation in such a way that the expansion coefficients in eqn. (2.4) are not binary. The determination of these coefficients is, however, closely tied in with the question of what is meant by two terms being not orthogonal (or, correlated). This is because once the coefficients are specified, the scalar product between any two non-binary vectors $\vec{t}_i$ and $\vec{t}_j$ is defined. Since scalar product being zero implies orthogonality, a non-zero value must represent a measure of non-orthogonality. In order to motivate the premises of the GVSM and to introduce some basic concepts, we first outline the main ideas of the conventional vector space model.

2.3.1. The conventional vector space model

The basic premise in the vector space model is that the documents and the queries are represented by a set of vectors, say, $\{\vec{d}_\alpha \mid \alpha = 1, 2, \ldots, p\}$ and $\vec{q}$, respectively, in a vector space spanned by the set of normalized term vectors, $\{\vec{t}_i \mid i = 1, 2, \ldots, n\}$. That is,

$$
\vec{d}_\alpha = \sum_{i=1}^{n} a_{\alpha i}\, \vec{t}_i
$$
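To make the role of non-binary coefficients concrete, the following sketch expands two term vectors over a four-dimensional minterm basis and forms a document and a query as linear combinations of them. All numerical values, and the names `dot` and `combine`, are hypothetical: the text above has not yet said how the coefficients are determined, so the numbers only illustrate that a non-zero scalar product between term vectors carries over into the document-query product.

```python
def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def combine(weights, term_vectors):
    """Linear combination sum_i weights[i] * term_vectors[i]."""
    dim = len(term_vectors[0])
    return [sum(w * t[k] for w, t in zip(weights, term_vectors))
            for k in range(dim)]

# Hypothetical non-binary expansion coefficients over 4 basic vectors (n = 2).
# Both term vectors have unit length but are not orthogonal to each other.
t1 = [0.8, 0.6, 0.0, 0.0]
t2 = [0.6, 0.0, 0.8, 0.0]

correlation = dot(t1, t2)      # non-zero, so the two terms are correlated

d = combine([2, 1], [t1, t2])  # document weights on t1 and t2
q = combine([1, 0], [t1, t2])  # query mentioning only t1
score = dot(d, q)
```

Because `t1` and `t2` are not orthogonal, the weight the document places on `t2` contributes to the score even though the query mentions only `t1`. With orthogonal term vectors that contribution would vanish.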
