Rough Set Based Decision Model in Information Retrieval and Filtering Yuefeng Li, Chengqi Zhang and Jason R. Swan

School of Computing and Mathematics, Deakin University, Geelong VIC 3217, Australia, {yuefeng, chengqi, jrswan} [email protected]

ABSTRACT In this paper, a model for information retrieval and filtering applications is proposed. The model uses rough set based decision approaches to deal with the relevance between users' queries and documents. In this model, users' queries are described at two levels: the interesting categories, and the relevant terms for those categories. By using rough set based decision theory, the dynamic document stream is divided into three states - the positive region, the boundary region, and the negative region - instead of the two states of the traditional research, which are relevant documents and irrelevant documents.

Keywords: information retrieval, information filtering, rough set.

1 Introduction

Two fundamental issues in the research arena of gathering information from the Internet are information retrieval (IR) and information filtering (IF). Usually, an IR system is considered to have the function of "leading the user to those documents that will best enable him/her to satisfy his/her need for information" [21]. IF is a name used to describe a variety of processes involving the delivery of information to people who need it [1]. Most IR systems support the short-term needs of a diverse population of users. IF systems, however, are commonly personalized to support the long-term information needs of a particular user. The main distinction between IR and IF is that IR systems use a "query", while IF systems use a "profile" [1]. IR research has paid little attention to the users' role, specifically the identification of users' interests, the representation of those interests, and the application of such representations in interactions: users can only describe their queries as vectors of terms or vectors of classes. At first, IF systems used AI-based techniques such as rules [13] [6] to generate the profiles. Currently, most IF systems

use machine-learning techniques to generate user interest profiles [17]. These techniques try to obtain the correct weight for each element in the query vectors. The goal of both IR and IF systems is the estimation of the "relevance" between users' queries and documents. In this article, a model for information retrieval and filtering applications is proposed. In this model, users' queries are described as an approximate concept by using Pawlak rough set theory. Based on this understanding, the agent's belief about a user's query can be described at two levels: the interesting categories, and the relevant terms for those categories. We use rough set based decision approaches to deal with the relevance between users' queries and documents. With this kind of approach, the dynamic document stream can be divided into three parts: the positive region (relevant documents), the boundary region (possibly relevant documents), and the negative region (irrelevant documents), instead of the two parts of the traditional research, which are relevant documents and irrelevant documents. The paper is organized as follows. In Section 2, we first discuss the format of storing information on Web sites, then we present an approach to representing users' queries. Section 3 discusses document representation. In Section 4, we first describe the rough set based decision theory, then we propose the method of document classification. Section 5 reviews some of the related work in the IR and IF fields, and Section 6 is the summary.

2 The User Query Model

In this section, we first analyse the general format of information on Web sites, then we introduce the method of representing users' queries. In order to manage vast amounts of information, people often classify the information into several categories. For example, jobs can be classified into different occupations (or industries) such as "computing", "electrical and electronic trade" and so on. Based on this natural way of structuring categories, most Web sites use a hierarchy like the example in Figure 1 to store their own information.

[Figure 1: The structure of storing information on Web sites - a tree with the home page of the Center for Intelligent Information Retrieval at the root; the categories "Information Retrieval", "Multi-Media I. R." and "Database Systems" at the second level; and the documents at the third level.]

In Figure 1, the first level is the home page address (e.g., http://ciir.cs.umass.edu/cdi-bin/w3msql/publication database/publications-edit.html) of the Web site, the second level includes all of the categories, and the third level is the documents (all research papers). It is easy to gather all the documents associated with "information retrieval" at this center: you simply locate the Web site, select the "Information Retrieval" category, and retrieve all of the documents in that category. However, if you want to gather research papers in which knowledge-based uncertainty processing approaches have been used in the intelligent information retrieval field, the above hierarchy (Figure 1) does not make this an easy task: the user must spend a vast amount of time scanning papers. For this kind of uncertain query, the next logical question arises: how can a system gather these papers on your behalf? In order to act like you, the system should first provide a method to describe your query; then, based on your query, it could decide which documents will best enable you to satisfy your need. To describe users' queries, we have to guess what the user wants, i.e., which documents in D (the document space) are relevant to the user's query (in this paper, we use X_R ⊆ D to represent the set of relevant documents). This is a difficult problem; in fact, the set of relevant documents is an approximate concept, and no information system can determine it exactly unless the system can read the documents like the user.

From this viewpoint, we cannot describe the set of relevant documents precisely by using some terms or keywords; instead, we can characterize the set of relevant documents by an approximate concept, such as a pair of lower and upper approximations (Pawlak rough set model [19] [20]): the lower approximation includes all positively relevant documents, and the upper approximation includes all possibly relevant documents (see Section 4.1). Based on the above statements, the user can first describe the interesting categories:

{P1(X_R; C1), P2(X_R; C2), ..., Pm(X_R; Cm)},     (1)

where P_i(X_R; C_i) (1 ≤ i ≤ m) is the probability that category C_i (in the second level in Figure 1) belongs to X_R. Then the user can provide term sets:

( t_{C_{u1},1}   t_{C_{u1},2}   ...   t_{C_{u1},m_{u1}} )
( t_{C_{u2},1}   t_{C_{u2},2}   ...   t_{C_{u2},m_{u2}} )
(    ...            ...         ...        ...          )
( t_{C_{um},1}   t_{C_{um},2}   ...   t_{C_{um},m_{um}} )     (2)

for the different interesting categories, where t_{C_{ui},1}, t_{C_{ui},2}, ..., t_{C_{ui},m_{ui}} are the terms for category C_i (1 ≤ i ≤ m). For example, consider the case discussed above, where we are looking for research papers in which knowledge-based uncertainty processing approaches have been used in the intelligent information retrieval field. If you are interested in the field "Information Retrieval" (interest weight 0.9) and the field "Database Systems" (interest weight 0.6), your uncertain query can be described as follows:

The interesting categories: (Information Retrieval, 0.9), (Database Systems, 0.6)
The term sets: {retrieval, filtering, intelligent}, {query, uncertainty}

For the category "Information Retrieval", you want the papers associated with "retrieval", "filtering" and "intelligent", and for the category "Database Systems" you want the papers associated with "query" and "uncertainty".
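The two-level query above can be held in a small data structure. A minimal sketch in Python, using the example values from the text (the dictionary layout itself is our illustration, not a structure defined in the paper):

```python
# A user query at two levels: interesting categories with interest
# weights P_i(X_R; C_i), and a term set for each category.
query = {
    "Information Retrieval": {
        "weight": 0.9,
        "terms": ["retrieval", "filtering", "intelligent"],
    },
    "Database Systems": {
        "weight": 0.6,
        "terms": ["query", "uncertainty"],
    },
}

# Iterate over the categories in descending order of interest weight.
for category, spec in sorted(query.items(),
                             key=lambda kv: kv[1]["weight"],
                             reverse=True):
    print(category, spec["weight"], spec["terms"])
```

This makes explicit that a category can carry both a scalar interest weight and its own term vocabulary, which is exactly the information the decision model in Section 4 consumes.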

3 Document representation

The effective representation of documents is a very difficult problem, because of the size of the document stream and the computational demands associated with parsing voluminous texts [17]. At any time, new topics may be introduced in the document stream. At present, most IR or IF systems assume a poor representation of documents, based on the use

of index terms automatically extracted from the text of documents. In this paper, we apply the popular tf × idf (term frequency times inverse document frequency) technique to establish the particular degree of importance of each concept in a document. To apply this technique, a table is generated off-line containing the total frequencies of all terms in the thesaurus, using a sufficiently representative collection of documents as a base. In the on-line document stream, another table is generated containing the frequencies of all unique terms found in newly arrived documents. Based on the values in the two tables, the following equation is used to derive appropriate weights for terms in each document:

w_ik = t_ik × log(N / n_k),

where t_ik is the number of occurrences of term t_k in document V_i; log(N / n_k) is the inverse document frequency of term t_k in the document base; N is the total number of documents in the document base; and n_k is the number of documents in the base that contain the given term t_k.
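The weighting formula above is straightforward to compute. A minimal sketch (the paper does not fix the logarithm base; the natural log is assumed here, and the example counts are hypothetical):

```python
import math

def tfidf_weight(tf_ik: int, N: int, n_k: int) -> float:
    """w_ik = t_ik * log(N / n_k): occurrences of term t_k in
    document V_i times the inverse document frequency of t_k
    over a base of N documents, n_k of which contain t_k."""
    return tf_ik * math.log(N / n_k)

# Hypothetical numbers: t_k occurs 3 times in V_i, the base
# holds N = 1000 documents, and 10 of them contain t_k.
w = tfidf_weight(3, 1000, 10)
print(w)
```

A rare term (small n_k) gets a large idf factor, so the weight rewards terms that are frequent in this document but rare in the base.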

4 Document Classi cation

From the discussion in Section 2, the aim is to obtain the pairs of lower and upper approximations. In this section, we first describe the rough set based decision model, then give the algorithm to classify the incoming document stream.

4.1 Rough-set based decision model

In IR applications, the document space D can be divided into equivalence classes based on certain document representation methods (e.g., the tf × idf method). Here, the classes are different from the categories introduced in Section 2, because categories are usually obtained by human catalogers, while the classes can be generated by automatic keyword extraction algorithms. Using these classes, the lower and upper approximations can be described as:

apr_(X_R) = ∪ {CL_i : CL_i ⊆ X_R}   (the lower approximation),

apr^(X_R) = ∪ {CL_i : CL_i ∩ X_R ≠ ∅}   (the upper approximation),

where CL_i ⊆ D is a class. From the above description, the document space can be divided into three disjoint regions, the positive region POS(X_R), the boundary region BND(X_R), and the negative region NEG(X_R):

POS(X_R) = apr_(X_R),
BND(X_R) = apr^(X_R) − apr_(X_R),
NEG(X_R) = D − apr^(X_R).
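The three regions follow mechanically from the partition. A minimal Pawlak-style sketch over a hypothetical six-document space (the partition and relevant set are invented for illustration):

```python
def approximations(classes, X_R):
    """Lower and upper approximations of X_R over a partition of
    the document space into equivalence classes (Pawlak rough sets)."""
    lower, upper = set(), set()
    for cl in classes:
        if cl <= X_R:        # class fully contained in X_R
            lower |= cl
        if cl & X_R:         # class intersects X_R
            upper |= cl
    return lower, upper

# Hypothetical partition of a six-document space.
D = {"d1", "d2", "d3", "d4", "d5", "d6"}
classes = [{"d1", "d2"}, {"d3", "d4"}, {"d5", "d6"}]
X_R = {"d1", "d2", "d3"}            # the (unknown) relevant set

lower, upper = approximations(classes, X_R)
POS = lower          # {'d1', 'd2'}: certainly relevant
BND = upper - lower  # {'d3', 'd4'}: possibly relevant
NEG = D - upper      # {'d5', 'd6'}: irrelevant
```

Note that d4 lands in the boundary even though it is not in X_R: its equivalence class is indistinguishable from one of the relevant documents, which is exactly what the boundary region captures.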

Figure 2 illustrates the set of relevant documents and the positive, boundary and negative regions.

[Figure 2: The set of relevant documents and its POS, BND and NEG regions - the set X_R drawn inside the document space D, with the positive region inside X_R, the boundary region straddling its border, and the negative region outside.]

In decision-theoretic language, we have a set of actions A_d = {a1, a2, a3}, representing the three actions of deciding d ∈ POS(X_R), deciding d ∈ BND(X_R), and deciding d ∈ NEG(X_R), respectively, where we treat d as a class in which every document has the same features. Let λ(a_i | d ∈ X_R) denote the loss incurred for taking action a_i when d in fact belongs to X_R, and let λ(a_i | d ∈ ¬X_R) denote the loss incurred for taking action a_i when d in fact belongs to ¬X_R. Based on the above descriptions, the expected loss

L(a_i | d) associated with taking the individual actions can be expressed as:

L(a1 | d) = λ11 P(X_R | d) + λ12 P(¬X_R | d),
L(a2 | d) = λ21 P(X_R | d) + λ22 P(¬X_R | d),
L(a3 | d) = λ31 P(X_R | d) + λ32 P(¬X_R | d),

where P(X_R | d) is the probability that d belongs to X_R, P(¬X_R | d) is the probability that d does not belong to X_R, λ_i1 = λ(a_i | d ∈ X_R), λ_i2 = λ(a_i | d ∈ ¬X_R), and i = 1, 2, 3. The above Bayesian decision procedure leads to the following minimum-risk decision rules:

(P) Decide d ∈ POS(X_R) if L(a1 | d) ≤ L(a2 | d) and L(a1 | d) ≤ L(a3 | d);
(B) Decide d ∈ BND(X_R) if L(a2 | d) ≤ L(a1 | d) and L(a2 | d) ≤ L(a3 | d);
(N) Decide d ∈ NEG(X_R) if L(a3 | d) ≤ L(a1 | d) and L(a3 | d) ≤ L(a2 | d).

Since P(X_R | d) + P(¬X_R | d) = 1, the above decision rules can be simplified so that only the probability P(X_R | d) is involved. Such decision rules can be found in [28] or [29]; [28] also considers some special kinds of loss functions.
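The minimum-risk rule amounts to computing the three expected losses and picking the smallest. A minimal sketch, where the loss matrix values are hypothetical (chosen to have the special form discussed in Section 4.2):

```python
def decide(p, losses):
    """Pick the action with minimum expected loss L(a_i | d).
    p = P(X_R | d); losses[i] = (lambda_i1, lambda_i2) for the
    actions deciding POS, BND and NEG, in that order."""
    expected = [l1 * p + l2 * (1.0 - p) for l1, l2 in losses]
    regions = ["POS", "BND", "NEG"]
    return regions[expected.index(min(expected))]

# Hypothetical loss matrix: lambda_11 = 0, lambda_12 = 1 (action a1);
# lambda_21 = lambda_22 = 0.3 (action a2); lambda_31 = 1,
# lambda_32 = 0 (action a3).
losses = [(0.0, 1.0), (0.3, 0.3), (1.0, 0.0)]
print(decide(0.9, losses))   # high P(X_R | d)  -> POS
print(decide(0.5, losses))   # intermediate     -> BND
print(decide(0.1, losses))   # low P(X_R | d)   -> NEG
```

With these losses a confident estimate goes straight to POS or NEG, while an ambiguous one is deferred to the boundary region rather than forced into a binary relevant/irrelevant call.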

4.2 The Process of Retrieval

To answer the user, we have to decide the relationship between a document and the set of relevant documents X_R. For this purpose, the agent first estimates the relevance between the categories and the set of relevant documents X_R; it then matches the documents to certain categories and estimates the relevance between the documents and the associated terms; finally it decides the relationship between each document and the set of relevant documents X_R. Figure 3 indicates the process of classifying the incoming documents. In the first column, the incoming documents are listed. Based on the information structure (Figure 1), each document is categorized in the second column. The relevance between the documents and the associated terms is entered in the next column of the table, which can be obtained by using one of the probabilistic IR models, such as "joint probability" [25], "conditional probability" [27], or "logical imaging" [3]. Finally, the relevance of each document is assessed by using the above decision rules.

document   C_id     Probability   Relevant
d1         C_id1    0.90          POS
d2         C_id2    0.50          BND
d3         C_id3    0.70          POS
d4         C_id4    0.45          NEG
...        ...      ...           ...

Figure 3: The process of classifying documents

In order to use the rough set based decision rules, we must estimate the value of P(X_R | d) and the losses L(a_i | d). The probability that d belongs to X_R, P(X_R | d), can be estimated by

P_id(X_R; C_id) × P(R | q, d),

where P_id(X_R; C_id) is the probability of category C_id belonging to X_R (the user's interest), and P(R | q, d) is the probability of relevance given a document d and a query q = (t_{C_id,1}, t_{C_id,2}, ..., t_{C_id,m_id}).

Consider a special kind of loss function with λ11 = 0, 0 < λ21 < 1, λ31 = 1 and λ12 = 1, 0 < λ22 < 1, λ32 = 0. The loss of classifying a document d belonging to X_R into the positive region POS(X_R) is zero; the loss of classifying it into the negative region NEG(X_R) is 1, the biggest loss incurred; and the loss of classifying it into the boundary region BND(X_R) is strictly between zero and 1. We obtain the reverse order of losses for classifying a d that does not belong to X_R. For this type of loss function, the minimum-risk decision rules (P), (B) and (N) can be written as:

(P) Decide d ∈ POS(X_R) if P(X_R | d) ≥ 1/2 and P(X_R | d) ≥ (1 − λ22) / ((1 − λ22) + λ21);
(B) Decide d ∈ BND(X_R) if λ22 / ((1 − λ21) + λ22) ≤ P(X_R | d) ≤ (1 − λ22) / ((1 − λ22) + λ21);
(N) Decide d ∈ NEG(X_R) if P(X_R | d) ≤ 1/2 and P(X_R | d) ≤ λ22 / ((1 − λ21) + λ22).

If we use 1 to represent the biggest loss, then we can consider λ21 + λ22 ≤ 1, which implies

λ22 / ((1 − λ21) + λ22) ≤ 1/2 ≤ (1 − λ22) / ((1 − λ22) + λ21).

Under these assumptions, (P), (B) and (N) can be simplified to:

(P) Decide d ∈ POS(X_R) if P(X_R | d) ≥ (1 − λ22) / ((1 − λ22) + λ21);
(B) Decide d ∈ BND(X_R) if λ22 / ((1 − λ21) + λ22) ≤ P(X_R | d) ≤ (1 − λ22) / ((1 − λ22) + λ21);
(N) Decide d ∈ NEG(X_R) if P(X_R | d) ≤ λ22 / ((1 − λ21) + λ22).
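The simplified rules reduce to a pair of probability thresholds. A minimal sketch of this thresholding, assuming the special loss function just described (the numeric λ values below are hypothetical):

```python
def thresholds(l21: float, l22: float):
    """Upper and lower thresholds of the simplified rules (P), (B),
    (N), assuming lambda_11 = lambda_32 = 0, lambda_31 = lambda_12 = 1,
    and lambda_21 + lambda_22 <= 1."""
    alpha = (1 - l22) / ((1 - l22) + l21)   # POS threshold
    beta = l22 / ((1 - l21) + l22)          # NEG threshold
    return alpha, beta

def classify(p: float, l21: float, l22: float) -> str:
    """Three-way classification of a document with P(X_R | d) = p."""
    alpha, beta = thresholds(l21, l22)
    if p >= alpha:
        return "POS"
    if p <= beta:
        return "NEG"
    return "BND"

# With lambda_21 = lambda_22 = 0.25 (hypothetical boundary losses),
# alpha = 0.75 and beta = 0.25.
print(classify(0.8, 0.25, 0.25))   # above alpha -> POS
print(classify(0.5, 0.25, 0.25))   # between     -> BND
print(classify(0.2, 0.25, 0.25))   # below beta  -> NEG
```

Shrinking the boundary losses λ21 and λ22 pushes α up and β down, widening the boundary region: the cheaper it is to defer a decision, the more documents are deferred.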

5 Related work

One of the best known IR models is the vector space model (VSM) [22]. In the VSM, a query is represented by a vector whose elements represent the presence or absence of certain features in the document representation, for example, the presence or absence of index terms. In the binary case, the queries can be represented as 0/1 vectors, as in Figure 4. The vector space model is able to rank documents by adopting an inexact matching strategy, but this model needs to choose similarity measures and to interpret term weights.

      t1   t2   t3   ...   tn
Q1    1    0    1    ...   1
Q2    1    1    0    ...   1
Q3    1    0    1    ...   0
...
Qu    1    0    0    ...   1

Figure 4: Queries on term vector spaces

So far, the leading normative interpretation of relevance on term spaces has been the probabilistic IR models [23]. The decision rule the probabilistic IR models use is in fact the well-known Bayes decision rule: if P(R | q, d) > P(¬R | q, d) then d is relevant, otherwise d is non-relevant. By using the probabilistic IR models, the dynamic document stream can only be divided into two states: relevant and irrelevant. In the SIFTER model (a filtering system [17]), a user profile learning module based on class vector spaces has been presented. This module provides a method to represent users' queries on class vector spaces. In this module, users' preferences for the different classes can be determined by the learning algorithm based on relevance feedback.

More precisely, denoting the space of documents as D, the document space is partitioned into t equivalence classes, {CL1, CL2, ..., CLt}, over which user relevance is estimated. Let d_i denote the underlying (unknown) expected user preference (relevance) for the class CL_i. The learning algorithm estimates the vector d = {d1, d2, ..., dt} by an approximate vector d^ = {d^1, d^2, ..., d^t}. To classify the documents, the SIFTER model uses the centroid-similarity measure. First, the learning stage is used to choose a centroid Z_i = {z1, z2, ..., zn} to represent the class (cluster) C_i. Then the model decides which class an incoming document vector V_i = {w_i1, w_i2, ..., w_in} belongs to by using a similarity measure, e.g., the cosine similarity measure [22]:

(Σ_{j=1..n} z_j w_ij) / sqrt((Σ_{j=1..n} z_j^2)(Σ_{j=1..n} w_ij^2)).

The SIFTER model implies that the classes are independent, that is, "if the user is interested in a class, then all the documents in the class are relevant". If the classes are obtained by automatic keyword extraction algorithms, then the SIFTER model is the same as the vector space models, because users can only describe their interests in term (keyword) spaces. On the other hand, if the classes are obtained by human catalogers (like categories), it does not necessarily follow that all the documents in a class are relevant. At this stage, our decision model seems more reasonable.

6 Summary

The contributions of this paper include: (a) a new approach for intelligent information retrieval applications is proposed to deal with the relevance between users' queries and the documents; this approach uses rough set based decision theory rather than the traditional probability approaches; (b) Pawlak rough set theory is used to understand users' queries; and (c) a new method to describe users' queries is presented, which describes the queries at two levels: the interesting categories, and the relevant terms for those categories.

References

[1] N. J. Belkin and W. B. Croft, Information filtering and information retrieval: two sides of the same coin, Commun. ACM, 1992, 35(12): 29-38.
[2] G. Biswas, J. C. Bezdek, M. Marques, and V. Subramanian, Knowledge assisted document retrieval, Journal of the American Society for Information Science, 1987, 38: 83-110.
[3] F. Crestani and C. J. van Rijsbergen, A study of probability kinematics in information retrieval, ACM Transactions on Information Systems, 1998, 16(3): 225-255.
[4] J. J. Daniels and E. L. Risslan, A case-based approach to intelligent information retrieval, in: Proceedings of the Eighteenth Annual ACM-SIGIR Conference on Research and Development in Information Retrieval, 1995, 238-245.
[5] P. Edwards, D. Bayer, C. L. Green, and T. R. Payne, Experience with learning agents which manage Internet-based information, in: Proceedings of the AAAI Stanford Spring Symposium on Machine Learning in Information Access, AAAI Press, 1996.
[6] G. Fischer and C. Stevens, Information access in complex, poorly structured information spaces, in: Proceedings of the ACM Special Interest Group on Human Computer Interaction Annual Conference, ACM, New York, 1991, 63-70.
[7] M. Friedman and D. S. Weld, Efficiently executing information-gathering plans, in: Proceedings of IJCAI, 1997, 785-791.
[8] N. Fuhr and U. Pfeifer, Probabilistic information retrieval as a combination of abstraction, inductive learning, and probabilistic assumptions, ACM Transactions on Information Systems, 1994, 12(1): 95-115.
[9] R. A. Hummel and M. S. Landy, A statistical viewpoint on the theory of evidence, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1988, 10: 235-247.
[10] R. Kruse, E. Schwecke and J. Heinsoln, Uncertainty and Vagueness in Knowledge Based Systems (Numerical Methods), Springer-Verlag, New York, 1991.
[11] Y. Li and D. Liu, The combination of interval structures and the disjunctive mapping, Chinese J. of Computers, 1997, 20(2): 151-157.
[12] Y. Li and C. Zhang, A method for combining interval structures, in: Proceedings of the 7th International Conference on Intelligence Systems, Paris, France, 1998, 9-13.
[13] T. W. Malone, K. R. Grant, F. A. Turbak, S. A. Brobst, and M. D. Cohen, Intelligent information sharing systems, Commun. ACM, 1987, 30: 390-402.
[14] M. E. Maron and J. L. Kuhns, On relevance, probabilistic indexing and information retrieval, J. of the American Society for Information Science, 1960, 7: 216-244.
[15] K. J. Mock, Hybrid hill-climbing and knowledge-based methods for intelligent news filtering, in: Proceedings of AAAI, 1996, 48-53.
[16] A. Moffat and J. Zobel, Self-indexing inverted files for fast text retrieval, ACM Transactions on Information Systems, 1996, 14(4): 349-379.
[17] J. Mostafa, W. Lam and M. Palakal, A multilevel approach to intelligent information filtering: model, system, and evaluation, ACM Transactions on Information Systems, 1997, 15(4): 368-399.
[18] D. Oard, Information filtering resources, University of Maryland, College Park, Md., available as http://www.ee.umd.edu/medlab/filter.
[19] Z. Pawlak, Rough sets, International J. of Computer and Information Sciences, 1982, 11: 341-356.
[20] Z. Pawlak, Rough classification, International J. of Man-Machine Studies, 1984, 20: 469-483.
[21] S. E. Robertson, The methodology of information retrieval experiment, in: Information Retrieval Experiment, K. Sparck Jones, ed., Chapt. 1, Butterworths, 1981, 9-31.
[22] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, Mass., 1989.
[23] S. Schocken and R. A. Hummel, On the use of the Dempster-Shafer model in information indexing and retrieval applications, Int. J. Man-Machine Studies, 1993, 39: 843-879.
[24] H. Turtle and W. B. Croft, Evaluation of an inference network-based retrieval model, ACM Transactions on Information Systems, 1995, 9: 187-222.
[25] C. J. van Rijsbergen, Information Retrieval, Butterworths, London, 1979.
[26] S. K. M. Wong, L. S. Wang and Y. Y. Yao, Interval structure: a framework for representing uncertain information, in: Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence, California, 1992, 336-343.
[27] S. K. M. Wong and Y. Y. Yao, On modeling information retrieval with probabilistic inference, ACM Transactions on Information Systems, 1995, 13(1): 38-68.
[28] Y. Y. Yao and S. K. M. Wong, A decision theoretic framework for approximating concepts, International Journal of Man-Machine Studies, 1992, 37: 793-809.
[29] Y. Y. Yao, S. K. M. Wong and T. Y. Lin, A review of rough set models, in: Rough Sets and Data Mining, T. Y. Lin and N. Cercone, eds., Kluwer Academic Publishers, Boston, 1997, 47-75.