
Document retrieval using a fuzzy knowledge-based system

Viswanath Subramanian*, Gautam Biswas, James C. Bezdek
University of South Carolina, Department of Computer Science, Columbia, South Carolina 29208

Abstract. This paper presents the design and development of a prototype document retrieval system using a knowledge-based systems approach. Both the domain-specific knowledge base and the inferencing schemes are based on a fuzzy set theoretic framework. A query in natural language represents a request to retrieve a relevant subset of documents from a document base. Such a query, which can include both fuzzy terms and fuzzy relational operators, is converted into an unambiguous intermediate form by a natural language interface. Concepts that describe domain topics and the relationships between concepts, such as the synonym relation and the implication relation between a general concept and more specific concepts, have been captured in a knowledge base. The knowledge base enables the system to emulate the reasoning process followed by an expert, such as a librarian, in understanding and reformulating user queries. The retrieval mechanism processes the query in two steps. First, it produces a pruned list of documents pertinent to the query. Second, it uses an evidence combination scheme to compute a degree of support between the query and individual documents produced in step one. The front-end component of the system then presents a set of document citations to the user in ranked order as an answer to the information request.

Subject terms: artificial intelligence; knowledge-based information retrieval; document retrieval; fuzzy relations; linguistic variables; inexact reasoning; evidence combination; query processing.

Optical Engineering 25(3), 445-455 (March 1986).

CONTENTS
1. Introduction
2. Basic concepts in information retrieval
3. Knowledge-based systems approach
4. System model and architecture
4.1. System model
4.1.1. Concept set C
4.1.2. Document base D
4.1.3. Query set Q
4.1.4. Retrieval function DOS(q_c, d)
4.1.4.1. Retrieval function R1
4.1.4.2. Retrieval function R2
4.1.5. Matching function ψ
4.2. System architecture
5. Inferencing mechanism
5.1. Preprocessor
5.2. Parenthesizing the query
5.3. Retrieval procedure
6. Summary and further research
7. Acknowledgments
8. References

1. INTRODUCTION
Rapid advances in science and technology in the past three decades have created an "information explosion," making it difficult for libraries and other information centers to provide users with up-to-date references and bibliographic material on topics of their interest.[1] Currently, there is great interest in on-line information systems, where a user interactively queries a database about information items via a user-friendly interface that accepts constrained natural language input.

This paper discusses the problem of on-line retrieval of bibliographic material stored in the form of citations in a document database, in response to user queries. A major problem with most current document retrieval systems is that they require an expert (usually a librarian or an experienced user) to aid the user in formulating and reformulating his queries in order to produce the desired retrieval. The emphasis of this research is on developing a prototypical system that provides a user-friendly interface to a wide variety of users. A knowledge-based systems approach is used to incorporate some of the expert's tasks into the framework of the retrieval system. The need for such systems is becoming acute since, even in the case of specialized and narrow subject domains such as fuzzy logic, inexact reasoning, or cellular architectures, the number of research articles is increasing very rapidly, causing a tremendous information overload. A system of this type should primarily aid research scientists seeking articles relating to specific topics within a broader domain of interest.

The emphasis of this paper is on the retrieval mechanism of the system, i.e., on developing better methods for relating user queries in natural language (but limited to a specific domain or topic, such as expert systems or database systems) to documents available in the database. An attempt is made to incorporate an overall fuzzy framework and adopt a knowledge-based systems approach in developing a prototype document retrieval system. This prototype has been tested on a set of documents in the area of knowledge representation in artificial intelligence.

*Present address: Wichita State University, Dept. of Computer Science, Wichita, KS 67208.
Invited Paper AI-113 received Aug. 12, 1985; revised manuscript received Oct. 7, 1985; accepted for publication Oct. 14, 1985; received by Managing Editor Dec. 9, 1985.
© 1986 Society of Photo-Optical Instrumentation Engineers.


[Figure 1: block diagram with three components: queries, retrieval mechanism, documents.]
Fig. 1. Information retrieval system model.

2. BASIC CONCEPTS IN INFORMATION RETRIEVAL
Conceptually, as shown in Fig. 1, an information retrieval (IR) system consists of a set of information items (e.g., documents), a set of queries, and a retrieval mechanism for matching queries and documents.[1] In document retrieval systems, a query in natural language represents an order to retrieve a relevant subset of documents from the available document base. Natural language queries may contain concepts that describe the topic or subject matter that the user is interested in, with additional terms such as "about," "nearly," or "important"[2,3] that qualify these concepts, and operators that combine or link individual concepts. A concept is represented as a single word or group of words that has a definite meaning in the domain of interest. Concept operators could be the simple Boolean connectives such as AND, OR, and NOT, and terms such as "related to" or "in the context of," which are transformed into an operator BASED_ON, discussed in a later section. Query processing can be regarded as the process of finding information items in the database that are closely related to the query. Because of the subjective notion of relevance in information retrieval, an information request may not be satisfied exactly.

Documents that make up the bibliographic database are usually characterized by a short description or profile, such as an abstract, keywords extracted from a text, or a set of descriptors.[4] A widely used representation scheme is the vector model,[5] where each document is represented by a set of concept-weight pairs. Some simplistic schemes derive concept weights from statistical measures, such as frequency of occurrence and term discrimination values,[1] or from probabilistic models.[6,7] However, other approaches can also be used to create the document vectors. In our study, it is assumed the document vectors are created by experts who are familiar with concepts used in the domain. The concept weight is interpreted as a subjective measure of relevance or importance based on expert opinion, rather than as an objective probabilistic measure. The set of document vectors is stored in a database called the document base.

The retrieval function can be computed in a number of ways. This computation represents the process of matching a query to a document characterization. The first document retrieval systems were limited in scope because they operated on the concept of exact matches of single keywords. Capabilities of such systems were enhanced by Boolean schemes, where complex concepts in a domain are represented as combinations of simpler concepts using operators AND, OR, and NOT, which imply intersection, union, and negation, respectively. These retrieval systems suffer from a lack of flexibility because all the terms that form a document description are termed equally important to the document; grades of importance or relevance cannot be incorporated. Similarly, terms not occurring in the document description are termed completely irrelevant, which may not be actually so. Weighted Boolean retrieval schemes that use index term weights,[1] similarity measures between query and document vectors,[5] and query term weights that express the importance of a term in the given query[8] have been introduced to overcome these problems.

Fuzzy set theoretic approaches have also been introduced as a generalization of Boolean retrieval, and several models have been proposed that deal with fuzzy requests.[5,8-12] Tahani[13] talks about fuzzy query processing with fuzzy predicates embedded in an artificial query language. Buell[10] and Buell and Kraft[8] have shown that in generalizing Boolean queries to include relevance weights and thresholds there are problems of consistency enforcement. The distinction between relevance weights and thresholds has been discussed by Buell and Kraft.[8]

In our view, the true advantage of the fuzzy set theoretic approach is that it can incorporate a linguistic framework that expresses relations among the concepts in the domain and provides a scheme for defining terms that do not have precise quantitative formulations. In a later section we present a conceptual framework for representing synonymous and ambiguous concepts. This should circumvent a major shortcoming of many current systems. In addition, it should be possible to provide a more user-friendly on-line interface to a wide variety of users. Systems, or even prototypes, that handle fuzzy characterizations in all aspects of information retrieval (description of queries or requests, document descriptions, and the retrieval mechanisms) have not been developed so far.

3. KNOWLEDGE-BASED SYSTEMS APPROACH
Numerous problem solving tasks in the real world require specialized knowledge for efficient solution. For these tasks, direct algorithmic solutions are not computationally feasible, and, therefore, human experts have to rely on heuristic and judgmental knowledge for problem solving. Computer programs, called expert systems or knowledge-based systems, have been developed to emulate the process of human reasoning.
Significant characteristics that differentiate these systems from conventional programs are their reliance on a large store of domain-specific knowledge (the knowledge base), which is manipulated by very general and simple reasoning mechanisms (called the inference engine); their emphasis on symbolic rather than number manipulation; and their ability to explain their reasoning processes in a natural and easily understandable manner. Expert systems have been successfully applied in such fields as medical diagnosis, computer configuration, equipment failure diagnosis, chemical data interpretation, and speech processing.[14] The document retrieval system presented in this paper is structured as an expert system.

Our system has two main components: (1) a front-end processor with a natural language interface, which is discussed in another paper,[15] and (2) the retrieval mechanism discussed in this paper. Natural language queries are interpreted and converted to an unambiguous intermediate form that consists of concepts, operators that combine concepts, and fuzzy terms that qualify them. Relationships between concepts, such as the synonym relation, and the implication relation between a general concept and more specific concepts that are linked to it, are represented in a knowledge base. Operators between concepts that appear in the intermediate form of a query are Boolean connectives such as AND, OR, and NOT. An additional operator, BASED_ON, has also been identified. The mathematical definitions of these operators appear in Sec. 4.
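To make the contrast between crisp Boolean connectives and their fuzzy generalization concrete, the following Python sketch may help. It assumes the standard Zadeh connectives (min for AND, max for OR, complement for NOT); these are a textbook choice used for illustration and are not claimed to be the operator definitions adopted in this paper. The index, the weights, and all function names are hypothetical.

```python
# Minimal sketch: crisp Boolean retrieval versus a fuzzy generalization.
# ASSUMPTION: min/max/complement (Zadeh connectives) stand in for the
# paper's own operator definitions, which are not reproduced here.

def boolean_and(docs_a: set, docs_b: set) -> set:
    """Crisp AND: documents indexed under both concepts (intersection)."""
    return docs_a & docs_b


def boolean_or(docs_a: set, docs_b: set) -> set:
    """Crisp OR: documents indexed under either concept (union)."""
    return docs_a | docs_b


def fuzzy_and(x: float, y: float) -> float:
    """Fuzzy AND on membership grades in [0, 1]."""
    return min(x, y)


def fuzzy_or(x: float, y: float) -> float:
    """Fuzzy OR on membership grades in [0, 1]."""
    return max(x, y)


def fuzzy_not(x: float) -> float:
    """Fuzzy NOT on a membership grade in [0, 1]."""
    return 1.0 - x


# Hypothetical crisp index: concept -> set of document identifiers.
index = {"logic": {"d1", "d2"}, "frames": {"d2", "d3"}}
crisp_hits = boolean_and(index["logic"], index["frames"])

# Hypothetical weights of the two concepts in a single document.
w_logic, w_frames = 0.8, 0.3
graded = fuzzy_and(w_logic, fuzzy_not(w_frames))  # "logic AND NOT frames"
```

With crisp sets a document either matches or does not; with graded weights the same query yields a degree in [0, 1], which is what allows ranked output.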


After the query is converted into its intermediate form, it is passed to a two-step retrieval mechanism. The first step produces a pruned list of documents pertinent to the query. The second step uses an evidence combination scheme to compute a relevance measure between the query and individual documents produced in step one. The list of documents with their corresponding relevance measures is then returned to the front end for further processing, before a set of document citations is presented to the user in ranked order as the answer to his information request.

A knowledge-based approach provides great power to a document retrieval system by explicitly using synonymous and implication relationships between concepts. This facilitates the understanding of user queries and retrieval of appropriate documents in a manner similar to that of an expert (librarian). For example, if the librarian is told, for an AI-related query, to retrieve papers that discuss "production systems," he would also retrieve papers that discuss "rule-based systems" since the two concepts are synonymous in that domain. Similarly, if the user query requested papers on "natural language semantics," the expert would unhesitatingly consider papers on "conceptual dependency" and "semantic grammars," knowing that they are some of the different schemes used to describe "natural language semantics." This process mimics the organization and usage of a person's vocabulary in that it enables him to identify and relate similar concepts in matching a document to a given query. Also, the entire process by which a user decides whether or not a document satisfies his needs is based on pooling together clues incrementally as the document is read. Such reasoning is reflected in the inferencing mechanism described in Sec. 5.

4. SYSTEM MODEL AND ARCHITECTURE
The mathematical model of our knowledge-based approach to document retrieval, based on the theory of fuzzy sets for document characterization, is presented in this section. Fuzzy relations between the concepts in the knowledge base are defined. The evidence combination scheme for the inference mechanism is also explained.

4.1. System model
A document retrieval system can be defined as a quadruple $(C, D, Q, R)$, where $C$, $D$, $Q$, and $R$ are defined as follows:

$C$ is the set of concepts (named or described by index terms, keywords, or descriptors, used commonly in the domain being considered). A concept $c_i \in C$ is represented by one or more keywords, such as "logic" or "natural language processing," that represent meaningful entities in the domain of interest.

$D$ is the collection of document descriptions, collectively referred to as the document base.

$Q$ is a set of queries in natural language. Conceptually, $Q$ can contain three components $Q_c$, $Q_y$, and $Q_n$. $Q_c$ deals with the concept part of the query, $Q_y$ indicates the publication period the user is interested in, and $Q_n$ pertains to the number of documents to be retrieved. Correspondingly, a query $q \in Q$ is a triple $q = (q_c, q_y, q_n)$. A meaningful query must have the $q_c$ component, but $q_y$ and $q_n$ are optional.

$R$ is the retrieval function:

$$R : Q \times D \to [0,1], \tag{1}$$

which assigns to each pair $(q,d)$ a number $R(q,d)$ in the interval $[0,1]$, where $q \in Q$ and $d \in D$. This number is a measure of relevance of the document $d$ to the query $q$ and has been described in earlier discussions as the degree of support (DOS) for document $d$ with respect to query $q$. Therefore, $R(q,d) = \mathrm{DOS}(q,d)$ produces a degree of support for document $d$ with respect to the entire query $q$ using a ranking function, $F : [0,1]^3 \to [0,1]$, as follows:

$$\mathrm{DOS}(q,d) = F[\mathrm{DOS}(q_c,d),\ \mathrm{DOS}(q_y,d),\ \mathrm{DOS}(q_n,d)], \tag{2}$$

where

$$\begin{aligned}
\mathrm{DOS}(q_c,d) &= \mathrm{DOS}(q,d)\big|_{Q_c \times D}, \\
\mathrm{DOS}(q_y,d) &= \mathrm{DOS}(q,d)\big|_{Q_y \times D}, \\
\mathrm{DOS}(q_n,d) &= \mathrm{DOS}(q,d)\big|_{Q_n \times D}.
\end{aligned} \tag{3}$$
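Equation (2) does not fix a particular ranking function F; the candidate forms are given in Ref. 15 and are not reproduced here. As a rough illustration only, the Python sketch below assumes that F is the minimum of the component supports and that an omitted optional component (q_y or q_n) imposes no constraint. Both choices, and the name `combine_support`, are hypothetical.

```python
from typing import Optional


def combine_support(dos_c: float,
                    dos_y: Optional[float] = None,
                    dos_n: Optional[float] = None) -> float:
    """Hypothetical ranking function F: [0,1]^3 -> [0,1].

    ASSUMPTION: F is taken to be min(); the forms actually studied for F
    are described in Ref. 15, not in this section.
    dos_c is required because a meaningful query must have a concept part;
    dos_y and dos_n are skipped when the query omits those components.
    """
    parts = [dos_c]
    for optional_part in (dos_y, dos_n):
        if optional_part is not None:
            parts.append(optional_part)
    for value in parts:
        if not 0.0 <= value <= 1.0:
            raise ValueError("degrees of support must lie in [0, 1]")
    return min(parts)


# Full query: concept, publication period, and document-count components.
full_query_support = combine_support(0.8, 0.6, 0.9)
# Query with only the mandatory concept component.
concept_only_support = combine_support(0.8)
```

Taking the minimum means a document is only as well supported as its weakest component; other combinations (for example, a product) would satisfy the same signature.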

This paper discusses the function $\mathrm{DOS}(q_c, d)$. The method for computing $\mathrm{DOS}(q_y, d)$ and $\mathrm{DOS}(q_n, d)$ is based on fuzzy set theoretic functions and is described in another paper[15] that discusses the details of the natural language interface. A number of approaches have also been defined in Ref. 15 for combining the individual degrees of support ($F$), and their properties have been studied. They will not be repeated here.

For a given information request or query $q \in Q$, we call the set

$$A = \{d \mid R(q,d) > 0\}, \quad \text{where } A \subset D, \tag{4}$$

an answer of the document retrieval system to the information request. The elements of the quadruple $(C,D,Q,R)$ are defined in detail in the following sections.
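The answer set of Eq. (4), together with the ranked presentation described earlier, can be sketched in a few lines of Python. The scores below are invented placeholders for R(q, d); how they are actually computed is the subject of later sections, and the function name `answer_set` is hypothetical.

```python
from typing import Dict, List, Tuple


def answer_set(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return A = {d | R(q, d) > 0}, ordered by decreasing R(q, d)."""
    retrieved = [(doc, r) for doc, r in scores.items() if r > 0.0]
    return sorted(retrieved, key=lambda pair: pair[1], reverse=True)


# Hypothetical values of R(q, d) for one query q over three documents.
scores = {"d1": 0.35, "d2": 0.0, "d3": 0.72}
ranked = answer_set(scores)
```

Document d2 is excluded because its degree of support is zero; the remaining documents are returned with the best supported one first.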

4.1.1. Concept set C
$C$ is defined to be a finite set of $n$ concepts $\{c_1, c_2, c_3, \ldots, c_n\}$ that collectively represent meaningful entities in the domain under consideration. It is assumed that the set of concepts is created by experts who are well versed in the topics comprising the domain. Two important binary fuzzy relations between pairs of concepts, the synonym relation $S$ and the implication relation $I$, were introduced earlier. A general discussion of fuzzy relations appears in Ref. 16. The mathematical definitions of $S$ and $I$ are given below.

$S$, a fuzzy binary relation called the synonym relation on the concept set $C$, is defined as

$$S : C \times C \to [0,1]. \tag{5}$$

Alternatively, $S$ can be defined as a fuzzy subset in $C^2$. $S(c_i, c_j)$ is a measure of the strength of the synonymity between $c_i$ and $c_j$, where $c_i, c_j \in C$. The notation $c_i\, S\, c_j$ indicates that $c_i$ and $c_j$ are synonymous to a degree given by $S(c_i, c_j)$. Trivially, $c_i\, S\, c_i$, with $S(c_i, c_i) = 1.0\ \forall\, c_i \in C$; therefore, $S$ is reflexive. $S$ is also defined to be symmetric, so that $S(c_i, c_j) = S(c_j, c_i)\ \forall\, c_i, c_j \in C$.

$I$, a fuzzy binary relation called the implication relation on the set $C$, is defined as

$$I : C \times C \to [0,1]. \tag{6}$$

$I(c_i, c_j)$ is a measure of the strength of the one-way implication from $c_i$ to $c_j$. Furthermore, $c_i\, I\, c_j$, denoted as $c_i \Rightarrow c_j$, can be read as $c_i$ "implies" $c_j$, whenever $I(c_i, c_j) > 0$. $I$ is reflexive in that $I(c_i, c_i) = 1\ \forall\, c_i \in C$, and $I$ is fuzzy transitive in the following sense:

$$\text{If } c_i \Rightarrow c_j \text{ and } c_j \Rightarrow c_k, \text{ then } c_i \Rightarrow c_k \text{ and } I(c_i, c_k) = I(c_i, c_j)\, I(c_j, c_k). \tag{7}$$

This definition has been traditionally used in a number of expert system applications, such as MYCIN[17]; however, other definitions are possible. Bezdek and Harris[18] have listed a number of them and studied their properties. The notation $c_i\, I\, c_j$ indicates that $c_j$ is a more general concept than $c_i$, and $c_i$ is more specific than $c_j$. The notation $c_i\, I\, c_j$ is often represented as $c_i \Rightarrow c_j$, to the extent $I(c_i, c_j)$.

The set $C$, together with the implication relation $I$, can be represented by a weighted digraph $G$, defined as the pair $(C, I)$, where (a) the vertices of $G$ are the elements $c_i \in C$, linked together by directed arcs; (b) a directed arc from $c_i$ to $c_j$ ($c_i, c_j \in C$) corresponds to the implication $c_i \Rightarrow c_j$; and (c) since $I$ is transitive, $c_i \Rightarrow c_j$ and $c_j \Rightarrow c_k$ together imply $c_i \Rightarrow c_k$, but an arc corresponding to $c_i \Rightarrow c_k$ is not shown explicitly. The graph $G$, formed by considering the entire set of concepts $C$, forms a hierarchical structure in our domain called the concept hierarchy. A portion of a concept hierarchy appears in Fig. 2.

[Figure 2: concept hierarchy diagram. Legible node labels include KNOWLEDGE REPRESENTATION, STRUCTURE, NL, UNCERTAINTY, LOGIC, PROCEDURAL REPRESENT., DECLARATIVE REPRESENT., MONOTONIC, NONMONOTONIC, PROPOSITIONAL, PRODUCTION SYSTEMS, NETWORK, FRAME, SCRIPT, and CONCEPTUAL DEPENDENCY.]
Fig. 2. A portion of a concept hierarchy.
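The properties stated for S (reflexive, symmetric) and for I (reflexive, with strengths multiplied along a chain as in Eq. (7)) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each concept is given at most one direct, more general parent, which is a simplification of the digraph G, and every concept pairing, every weight, and every function name below is invented for the example.

```python
from typing import Dict, Tuple


def make_synonym_relation(pairs: Dict[Tuple[str, str], float]) -> Dict[Tuple[str, str], float]:
    """Store each synonym strength in both directions so S is symmetric."""
    relation = {}
    for (a, b), strength in pairs.items():
        relation[(a, b)] = strength
        relation[(b, a)] = strength
    return relation


def synonymity(relation: Dict[Tuple[str, str], float], a: str, b: str) -> float:
    """S(a, b): 1.0 on the diagonal (reflexive), 0.0 for unrelated pairs."""
    return 1.0 if a == b else relation.get((a, b), 0.0)


def implication(arcs: Dict[str, Tuple[str, float]], specific: str, general: str) -> float:
    """I(specific, general) as the product of arc weights along the chain.

    ASSUMPTION: `arcs` maps a concept to one (parent, weight) pair, i.e. the
    hierarchy is treated as a tree. Returns 0.0 when no chain exists.
    """
    if specific == general:
        return 1.0  # reflexivity
    strength, current, visited = 1.0, specific, {specific}
    while current in arcs:
        parent, weight = arcs[current]
        if parent in visited:
            break  # defensive stop; a hierarchy should contain no cycle
        strength *= weight
        if parent == general:
            return strength
        visited.add(parent)
        current = parent
    return 0.0


# Hypothetical knowledge-base fragment (weights are illustrative only).
S = make_synonym_relation({("production systems", "rule-based systems"): 0.9})
ARCS = {
    "belief revision": ("nonmonotonic logic", 0.9),
    "nonmonotonic logic": ("logic", 0.8),
}
chained = implication(ARCS, "belief revision", "logic")  # 0.9 * 0.8
```

Only direct arcs are stored, mirroring point (c) above: the implied arc from the most specific concept to the most general one is derived on demand by multiplying along the path rather than being stored explicitly.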

4.1.2. Document base D 4.1.2.

the document retrieval system. Each query $q \in Q$ is a triple:

$$q = (q_c, q_y, q_n) ,$$

D is a finite set of document descriptions $\{d_1, d_2, \ldots, d_m\}$. Each document $d \in D$ is represented by a fuzzy set in the universe defined by the concept set C, characterized by a membership function

$$\mu_d : C \rightarrow [0, 1] , \qquad (8)$$

where $\mu_d(c_i)$ is the degree of membership of the concept $c_i$ in the description of document d and is also called the weight of $c_i$ in d. Thus, each document d is represented by a set of concept-weight pairs, called a document vector, of the form $\{[c_1, \mu_d(c_1)], \ldots, [c_n, \mu_d(c_n)]\}$. As an example, the following characterizations could represent two documents:

$$d_1 = \{(\text{production system}, 0.7), (\text{semantic net}, 1.0), (\text{logic}, 0.2)\} , \qquad (9)$$
$$d_2 = \{(\text{monotonic logic}, 0.8), (\text{belief revision}, 0.4)\} .$$
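A document vector of this kind maps naturally onto a dictionary. The following is a minimal sketch, not the system's actual data structure: the names `membership` and `as_point` are hypothetical, the weight of any unlisted concept is taken to be 0, and the concept set C is restricted here to the five concepts that occur in the two example documents.

```python
from typing import Dict, List

DocumentVector = Dict[str, float]

# The two example documents of Eq. (9), stored as concept -> weight.
d1: DocumentVector = {"production system": 0.7, "semantic net": 1.0, "logic": 0.2}
d2: DocumentVector = {"monotonic logic": 0.8, "belief revision": 0.4}


def membership(doc: DocumentVector, concept: str) -> float:
    """mu_d(c): weight of concept c in document d; 0 if c is not listed."""
    return doc.get(concept, 0.0)


def as_point(doc: DocumentVector, concepts: List[str]) -> List[float]:
    """Dense list of weights, one coordinate per concept in a fixed order."""
    return [membership(doc, c) for c in concepts]


# Hypothetical concept set C: only the concepts used by d1 and d2, sorted.
C = sorted(set(d1) | set(d2))
```

Storing only the non-zero weights keeps each description short, while `as_point` gives the fixed-length form needed whenever documents are compared coordinate by coordinate.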

Though not explicitly shown, all concepts that do not appear in the document description have a degree of membership of 0 for that document. Therefore, all m document vectors can be viewed as points in $\mathbb{R}^n$.

The document space D is divided into k hard (crisp) subsets, $D_H = \{D_{c_1}, D_{c_2}, \ldots, D_{c_k}\}$, by a domain expert. Each document subset $D_{c_i}$ corresponds to a node $c_i$ in the concept hierarchy, and we define H as the set of these concepts $\{c_1, c_2, \ldots, c_k\}$, $|H| = k$; this enables a more ...

I(belief revision, logic) = 0.9

                 logic(0.81)     Θ(0.19)
logic(0.113)     logic(0.092)    logic(0.021)
NLP(0.771)       ∅(0.625)        NLP(0.146)
Θ(0.116)         logic(0.094)    Θ(0.022)

m(NLP) = 0.389
m(logic) = 0.552
m(Θ) = 0.059 .
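The combination step behind these numbers can be reproduced with a short sketch of Dempster's rule. It rests on assumptions: the frame of discernment is reduced to the two hypotheses of this example, the prior masses (0.113, 0.771, 0.116) and the new evidence (0.81, 0.19) are read off the values listed above, and the function name `combine` is hypothetical rather than taken from the system.

```python
from itertools import product
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]

# Frame of discernment, reduced to the two hypotheses in this example.
THETA = frozenset({"logic", "NLP"})
LOGIC = frozenset({"logic"})
NLP = frozenset({"NLP"})


def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule: multiply masses, intersect focal sets, renormalize."""
    raw: Dict[FrozenSet[str], float] = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            raw[common] = raw.get(common, 0.0) + x * y
        else:
            conflict += x * y  # mass that falls on the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence cannot be combined")
    return {s: v / (1.0 - conflict) for s, v in raw.items()}


prior: Mass = {LOGIC: 0.113, NLP: 0.771, THETA: 0.116}
evidence: Mass = {LOGIC: 0.81, THETA: 0.19}  # contribution of "belief revision"
combined = combine(prior, evidence)
```

Run on unrounded products, the sketch yields masses of roughly 0.551 for logic, 0.390 for NLP, and 0.059 for Θ. The small differences from the tabulated 0.552 and 0.389 come from rounding each product to three decimals before normalizing by 1 - 0.625.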

"Belief revision" distinctly supports hypothesis {"logic"}, and this causes a strong increase in its measure of belief from 0.113 to 0.552. At this stage, the degrees of support for both query concepts, "logic" and "NLP," are available, and a final degree of support $DOS_c$ for the document d with respect to $q_c$ is computed according to the definitions in Sec. 4:

$$R_2(q_c, d) = [(0.552)(0.389)]^{1/2} = 0.463 .$$

6. SUMMARY AND FURTHER RESEARCH

This paper has presented a knowledge-based systems approach to the design and implementation of a document retrieval system. The key to the improved performance of such a system is the processing of natural language queries and the accumulation of semantic knowledge about domain concepts. Some of the important properties of the system and areas in which further research is being pursued are described below.

Our fuzzy document retrieval system overcomes some of the drawbacks suffered by Boolean retrieval systems, mainly the inability of the latter to handle partial matches. We have been able to enhance the system's capabilities by defining a new operator called BASED_ON, in addition to using the conventional Boolean operators AND, OR, and NOT. The definition of BASED_ON is justified by the results of our preliminary survey, conducted on real queries (about 75) posed to the DIALOG retrieval system, available at the Thomas Cooper Library at the University of South Carolina. The study revealed that a number of user queries in natural language containing connectives such as "in the context of" and "related to" best translate to the BASED_ON operator.

Operator definitions of AND, OR, NOT, and BASED_ON as given in Sec. 4 can be changed or replaced by other definitions in order to adjust retrieval quality. For example, Eq. (16), which computes the geometric mean of individual degrees of support, could be changed so that it computes the arithmetic mean. In fact, an important extension of this study would be to test the various definitions that have been used for a number of operators and compare their performance in terms of the documents retrieved.
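The effect of swapping the geometric mean for the arithmetic mean can be seen on the degrees of support from the worked example (0.552 for "logic" and 0.389 for "NLP"). The sketch below is illustrative only: Eq. (16) is not reproduced here, so its form is assumed from that example, and the function names are hypothetical.

```python
import math
from typing import Sequence


def geometric_mean(supports: Sequence[float]) -> float:
    """n-th root of the product of the individual degrees of support."""
    return math.prod(supports) ** (1.0 / len(supports))


def arithmetic_mean(supports: Sequence[float]) -> float:
    """Plain average of the individual degrees of support."""
    return sum(supports) / len(supports)


supports = [0.552, 0.389]  # degrees of support for "logic" and "NLP"
geo = geometric_mean(supports)   # about 0.463, as in the worked example
ari = arithmetic_mean(supports)  # about 0.4705
```

The two definitions differ most when one concept has little or no support: the geometric mean drops to 0 as soon as any single degree of support is 0, whereas the arithmetic mean still credits the document for the concepts it does match. Which behavior is preferable is exactly the kind of question the proposed comparison would have to settle.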

The evidence combination scheme used causes the set of hypotheses to build up monotonically, which is one of the factors that contribute to a poor response time for complex queries. Barnett [21] and Gordon and Shortliffe [22] have developed more efficient computation schemes that can be applied under restricted conditions. However, there is a conceptual problem in applying the Dempster-Shafer evidence combination scheme, since it requires that the set of hypotheses in the frame of discernment be exclusive and exhaustive. As the scope of the domain expands, it is hard to keep concepts mutually exclusive, and therefore the evidence combination scheme will no longer be applicable in the present framework. A more general framework for evidence combination needs to be developed.

The idea of evidence contributing negatively to the total support for a document has not been taken up in our implementation. The framework for our system needs to be expanded to incorporate this idea. In particular, this can be connected to an alternative definition for the NOT operator, different from the one given in Sec. 4.

Alternative definitions of fuzzy transitivity, with respect to the implication relation (I) given in Sec. 4, may be used to study the change in retrieval performance. A theoretical study of this nature was performed by Bezdek and Harris [18], and it would be interesting to test these results in a real environment.

The semantic relationships [synonym (S) and implication (I) relations] existing between concepts in the knowledge base represent a rich source of information and have been exploited to a great extent. The synonym relation broadens the vocabulary of concepts that the system can recognize. The implication relation structures the knowledge base by imposing a hierarchical organization on the set of concepts. Other relations may be discovered as the system is subjected to more natural language queries.

The front-end interface can make it possible for users to define their own interpretation of linguistic terms by adjusting default fuzzy functions encoded in the system. Thus, the interactive nature of the system can help a user improve the response to a query by adjusting parameters to membership functions or to the retrieval function. In fact, it is possible to build user profiles in our system so that it responds differently to the same request initiated by vastly different user characteristics.

7. ACKNOWLEDGMENTS

The work of one of the authors (G. B.) was supported by NCR (Columbia) Grant No. 13030 J104. The work of J. C. B. was supported by NSF Grant No. IST-8407860.

8. REFERENCES

1. G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York (1983).
2. L. A. Zadeh, "A fuzzy set theoretic interpretation of linguistic hedges," J. Cybernetics 2, 4 (1972).
3. L. A. Zadeh, "The concept of a linguistic variable and its application to approximate reasoning, parts I and II," Inf. Sci. 8, 159 (I), 301 (II) (1975).
4. M. Bartschi, "An overview of information retrieval subjects," IEEE Computer 18, 67 (May 1985).
5. A. Bookstein, "A comparison of two systems of weighted Boolean retrieval," J. Am. Soc. Inf. Sci. 32, 275 (1981).
6. W. B. Croft, "Document representation in probabilistic models of information retrieval," J. Am. Soc. Inf. Sci. 32, 451 (1981).
7. S. P. Harter, "A probabilistic approach to automatic keyword indexing, Part I: on the distribution of specialty words in a technical literature, Part II: an algorithm for probabilistic inferencing," J. Am. Soc. Inf. Sci. 26, 197 (I), 280 (II) (1975).

454 / OPTICAL ENGINEERING / March 1986 / Vol. 25 No. 3



8. D. A. Buell and D. H. Kraft, "A model for a weighted retrieval system," J. Am. Soc. Inf. Sci. 32, 211 (1981).
9. A. Bookstein, "Fuzzy requests: an approach to weighted Boolean searches," J. Am. Soc. Inf. Sci. 31, 240 (1980).
10. D. A. Buell, "A general model of query processing in information retrieval systems," Inf. Proc. Manage. 5, 249 (1981).
11. T. Radecki, "Mathematical model of information retrieval system based on the concept of fuzzy thesaurus," Inf. Proc. Manage. 12, 313 (1976).
12. V. Tahani, "A fuzzy model of document retrieval systems," Inf. Proc. Manage. 12, 177 (1976).
13. V. Tahani, "A conceptual framework for fuzzy query processing - a step towards very intelligent database systems," Inf. Proc. Manage. 13, 289 (1977).
14. F. Hayes-Roth, D. Waterman, and D. B. Lenat, Building Expert Systems, Addison-Wesley, Reading, Mass. (1983).
15. G. Biswas, J. C. Bezdek, M. M. Marques, and V. Subramanian, "Knowledge-assisted document retrieval: the natural language interface," J. Am. Soc. Inf. Sci., in review.
16. L. A. Zadeh, "Commonsense knowledge representation based on fuzzy logic," IEEE Computer, 61 (Oct. 1983).
17. B. G. Buchanan and E. H. Shortliffe, Rule-based Expert Systems: The MYCIN Experiment of the Stanford HPP, pp. 272-292, Addison-Wesley, Reading, Mass. (1984).
18. J. C. Bezdek and J. D. Harris, "Fuzzy partitions and relations; an axiomatic basis for clustering," Fuzzy Sets and Systems 1, 111 (1978).
19. G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, Princeton, N.J. (1976).
20. V. Subramanian, "A knowledge based systems approach to document retrieval," M.S. Thesis, Univ. of South Carolina (1985).
21. J. A. Barnett, "Computational methods for a mathematical theory of evidence," in Proc. Seventh Int. Joint Conf. on Artificial Intelligence (Vancouver, B. C.), 868 (1981).
22. J. Gordon and E. H. Shortliffe, "Evidential reasoning in a hierarchy," Artificial Intelligence 26, 323 (1985).

Viswanath Subramanian is an instructor in the Computer Science Department at Wichita State University. His research interests include artificial intelligence, knowledge-based systems, and information retrieval. Currently, he is working in the area of scientific discovery. Mr. Subramanian received the B.Tech degree in electrical engineering from the Indian Institute of Technology, New Delhi, and the M.S. degree in computer science from the University of South Carolina, Columbia.

Gautam Biswas is an assistant professor of computer science at the University of South Carolina. His primary research interests are in the fields of artificial intelligence, expert systems, computer vision, and pattern recognition. In 1977-78 Dr. Biswas was a teaching assistant in the Computer Science Department at the University of Rhode Island. From 1978 to 1982 he was a graduate research assistant in the Pattern Recognition and Image Processing Laboratory of the Department of Computer Science at Michigan State University. He spent the summer of 1984 at AT&T Bell Labs in Holmdel working on an expert systems project with the CMS-1 Systems Engineering group. Dr. Biswas received his B.Tech degree in electrical engineering at the Indian Institute of Technology, Bombay, India, in 1977 and his M.S. and Ph.D. degrees in computer science from Michigan State University in 1979 and 1982, respectively. He is currently a member of the IEEE Computer Society, ACM, AAAI, the Pattern Recognition Society, and the Sigma Xi Research Society.

James C. Bezdek received the B.S. in civil engineering from the University of Nevada (Reno) in 1969 and the Ph.D. in applied mathematics from Cornell University in 1973. He is currently professor and chairman of the Computer Science Department at the University of South Carolina. His interests include research in pattern recognition, computer vision, information retrieval, natural language processing, and numerical optimization. Dr. Bezdek is currently President of NAFIPS (the North American Fuzzy Information Processing Society), President-Elect of IFSA (the International Fuzzy Systems Association), and a member of the IEEE Computer Society, Classification Society, Pattern Recognition Society, and Association for Computing Machinery.

