
Neurocomputing 51 (2003) 481–486 www.elsevier.com/locate/neucom

Letters

Building topic hierarchy based on fuzzy relations

Han-joon Kim a,∗, Sang-goo Lee b

a Faculty of Electrical and Computer Engineering, University of Seoul, 90 Cheonnong-dong Dongdaemun-ku, Seoul 130-743, South Korea
b School of Computer Science and Engineering, Seoul National University, San 56-1 Shillim-dong Kwanak-ku, Seoul 151-742, South Korea

Abstract

In this paper, we present a novel method for automatically building hierarchical topic structures of large text databases without any complicated linguistic analysis. Hierarchical relationships among categories in textual data can be discovered on the basis of term co-occurrence, which is described by fuzzy relations. Despite its simplicity, results of experiments on well-known document collections such as the Yahoo directory data demonstrate the high quality of the resulting hierarchies. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Information organization; Topic hierarchy; Term subsumption; Fuzzy relations

1. Introduction

Hierarchically organizing data according to topic has been accepted as a very successful method for organizing and browsing a large volume of textual documents in information systems. Such a topic hierarchy (or taxonomy) is used as a way of systematically organizing a large document collection: incoming documents (e.g., extracted from the Web) are indexed on topic hierarchies, which are presented as browsable directories to users with information needs. Currently, topic hierarchies are manually constructed and maintained by human editors in most information systems, including LookSmart [4], the Open Directory Project [7], and Yahoo [9]. Unfortunately, manual taxonomy construction remains a time-consuming and cumbersome task. Furthermore, a human editor's decisions on taxonomy construction are not only highly subjective but also inconsistent over time [2,1].

∗ Corresponding author. Fax: +822-887-1858. E-mail address: [email protected] (H. Kim).

0925-2312/03/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. doi:10.1016/S0925-2312(02)00726-9


This drawback calls for information systems with more intelligent capabilities for organizing documents into topic hierarchies. To overcome the above problems, we propose a novel approach to automatically discovering taxonomic relationships among categories from textual data. Taxonomic relationships can be determined simply on the basis of the co-occurrence of terms in text, without depending on any complicated linguistic theories or machine learning methods. In this paper, the taxonomic relationships between categories are described through fuzzy relations.

2. Building a hierarchical topic structure

A topic hierarchy is assumed to be of the same form as that used by Yahoo [9]; that is, documents indexed according to a topic hierarchy are kept in internal categories as well as in leaf categories, in the sense that documents at a lower category have increasing specificity. In addition, a child category can have more than one parent, and therefore the hierarchical structure is a directed acyclic graph (DAG). For taxonomy construction, we note that a category is represented by topical terms reflecting its concept. This suggests that the relations between categories can be determined by describing the relations between their terms. In this regard, we find that it is difficult to dichotomize the relations between categories into groups representing the presence or absence of association, because term associations are generally represented not as crisp relations but as probabilistic values [8]. Thus, we address this problem by means of fuzzy relations: the degree of association between two categories can be represented by a membership grade in a fuzzy (binary) relation. Based on the above discussion, the generality and specificity of categories is expressed by aggregating the relations among their terms. Thus, we must first define term relations for selected topical terms within each category.

2.1. Term subsumption relations

First of all, to find topical terms for each category, we use the chi-square (χ²) statistic, which is commonly used to select an appropriate number of best features in the text categorization literature [10]. A two-way contingency table for term t and category c is constructed, and the χ² statistic is calculated as

$$\chi^2(c, t) = \frac{n \times (p \times s - q \times r)^2}{(p + r) \times (p + q) \times (q + s) \times (r + s)}, \qquad (1)$$

where p is the number of documents in which t and c co-occur, q and r are the numbers of documents in which only one of the two occurs (t without c, and c without t, respectively), s is the number of documents in which neither c nor t occurs, and n is the total number of documents. If c and t are independent of each other, χ²(c, t) has a value of zero; the more informative term t is for reflecting the concept of category c, the larger the value of χ²(c, t).
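To make the contingency-table computation concrete, the following Python sketch scores every (category, term) pair by Eq. (1) and keeps the terms whose χ² value exceeds a threshold. It is only an illustration of the selection step; the names (docs, labels, threshold) and the single-label, set-of-tokens document representation are our assumptions, not part of the paper.

from collections import defaultdict

def chi_square(p, q, r, s):
    # Eq. (1): chi-square statistic of a 2x2 term/category contingency table.
    # p: documents containing both t and c; q: t without c; r: c without t; s: neither.
    n = p + q + r + s
    denom = (p + r) * (p + q) * (q + s) * (r + s)
    return 0.0 if denom == 0 else n * (p * s - q * r) ** 2 / denom

def topical_terms(docs, labels, threshold):
    # docs: list of token sets; labels[i]: category of docs[i].
    # Returns V_c = {t | chi_square(c, t) > threshold} for every category c.
    df_ct = defaultdict(int)   # (category, term) -> number of co-occurring documents
    df_t = defaultdict(int)    # term -> document frequency
    n_c = defaultdict(int)     # category -> number of documents
    for tokens, c in zip(docs, labels):
        n_c[c] += 1
        for t in tokens:
            df_t[t] += 1
            df_ct[(c, t)] += 1
    n = len(docs)
    V = {}
    for c in n_c:
        V[c] = set()
        for t in df_t:
            p = df_ct[(c, t)]
            q = df_t[t] - p      # t occurs without c
            r = n_c[c] - p       # c occurs without t
            s = n - p - q - r    # neither occurs
            if chi_square(p, q, r, s) > threshold:
                V[c].add(t)
    return V

Equivalently, one can rank the terms of each category by their χ² values and keep a fixed number of top-ranked terms, which is the quantity varied in the evaluation of Section 3.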


For each category c, we build a set of top-ranked salient terms V_c = {t | χ²(c, t) > δ, for c ∈ C and t ∈ V}, so as to favor terms with a higher χ² value, where V denotes the set of all terms, C denotes the set of all categories in the system, and δ is a threshold value that determines the number of topical terms in each category. With such topical terms, term relations can be determined by the so-called 'Document Frequency (DF) hypothesis': "The more documents a term occurs in, the more general the term is assumed to be", which can be used for generating a hierarchy of the terms themselves from text [8]. This hypothesis means that the generality and specificity of terms is determined by the number of documents that contain the terms. We call the relation between terms the 'term subsumption relation'. It is defined as follows, based on experimental results in [8]: for two topical terms t_i, t_j, if Pr(t_i | t_j) > 0.8¹ and Pr(t_i | t_j) > Pr(t_j | t_i), then t_i is said to subsume t_j. Here, Pr(t_i | t_j) is the probability that t_i occurs in the document set in which t_j occurs.

2.2. Category subsumption relations

Now, we define the fuzzy relation between two categories that represents the relational concept "c_i subsumes c_j" (meaning that c_i is the parent of c_j in a topic hierarchy). We call this fuzzy relation the 'category subsumption relation' (CSR). Let C be a set of categories within a topic hierarchy. As mentioned before, the CSR is described as an aggregate of term subsumption relations. For c_i, c_j ∈ C, the fuzzy relation CSR can be characterized by the following membership function:

$$\mu_{CSR}(c_i, c_j) = \frac{\displaystyle\sum_{\substack{t_i \in V_{c_i},\, t_j \in V_{c_j} \\ \Pr(t_i \mid t_j) > \Pr(t_j \mid t_i)}} \mu_{c_i}(t_i) \times \mu_{c_j}(t_j) \times \Pr(t_i \mid t_j)}{\displaystyle\sum_{t_i \in V_{c_i},\, t_j \in V_{c_j}} \mu_{c_i}(t_i) \times \mu_{c_j}(t_j)}, \qquad (2)$$

where μ_c(t) denotes the degree to which the term t represents the concept corresponding to the category c; it can be estimated by the χ² statistic of term t in category c, since the χ² value represents the degree of term importance. Pr(t_i | t_j) should be weighted by the degree of significance of the terms t_i and t_j in their categories, and thus the membership function μ_CSR for categories is calculated as the weighted average of the values of Pr(t_i | t_j) over term pairs. The membership value μ_CSR is a real number in the closed interval [0, 1] and indicates the strength of the relationship between two categories. As seen in Eq. (2), the fuzzy relation CSR(c_i, c_j) does not accommodate all combinations of topical terms of c_i and topical terms of c_j. Some pairs of terms may subsume each other, since the term subsumption relations depend on the DF hypothesis, and such term pairs can spoil the estimated category subsumption relations. Thus, the fuzzy relation CSR(c_i, c_j) considers only the combinations of topical terms t_i ∈ V_{c_i} and t_j ∈ V_{c_j} that satisfy Pr(t_i | t_j) > Pr(t_j | t_i).

¹ The cut-off value 0.8 was experimentally chosen in [8].
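As a rough sketch of how Eq. (2) can be computed, the Python fragment below aggregates term subsumption evidence into a category subsumption grade. The data structures (doc_sets mapping each term to the set of documents containing it, and mu_ci/mu_cj holding the χ² weights of the topical terms) are our assumptions about a concrete representation, not prescribed by the paper.

from itertools import product

def pr(t_i, t_j, doc_sets):
    # Pr(t_i | t_j): fraction of the documents containing t_j that also contain t_i.
    docs_j = doc_sets[t_j]
    return len(docs_j & doc_sets[t_i]) / len(docs_j) if docs_j else 0.0

def csr_membership(V_ci, V_cj, mu_ci, mu_cj, doc_sets):
    # Eq. (2): membership grade of the relation "c_i subsumes c_j".
    # V_ci, V_cj: topical term sets; mu_ci, mu_cj: chi-square weights of those terms;
    # doc_sets: term -> set of ids of documents containing the term.
    num = den = 0.0
    for t_i, t_j in product(V_ci, V_cj):
        w = mu_ci[t_i] * mu_cj[t_j]
        den += w
        p_ij = pr(t_i, t_j, doc_sets)
        p_ji = pr(t_j, t_i, doc_sets)
        # only term pairs oriented in the subsuming direction contribute to the numerator
        if p_ij > p_ji:
            num += w * p_ij
    return num / den if den else 0.0

The full CSR matrix of Step (a) in the procedure below is then obtained by evaluating this function for every ordered pair of categories.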


As for building a topic hierarchy, the problem of how to determine subsumption is related to partial ordering. As we utilize fuzzy relations to represent the subsumption relation among categories, a hierarchy is (re-)constructed so that its categories exhibit a 'fuzzy partial ordering'. A fuzzy (binary) relation R on a set X is called a fuzzy partial ordering iff it is reflexive, antisymmetric, and (max–min) transitive. Finally, to generate a 'crisp' hierarchy, we apply the following property of fuzzy relations: "Let R be a fuzzy partial order relation on a set X; then the α level-set R_α is a (crisp) partial order relation on X" [6], where the α level-set R_α is the crisp relation that contains the elements whose membership grades in the fuzzy relation R are greater than the specified value of α. Based on this property, we can construct a (crisp) hierarchy by setting the value of α. That is, users obtain a desired hierarchy among different candidates by adjusting the threshold value. For slowly changing document collections, the value of α can be experimentally determined.

A procedure for building a topic hierarchy is the following:

Step (a): Calculate the CSR matrix, whose entries represent the degrees of membership in the fuzzy relation CSR for a given set of categories (see Eq. (2)).
Step (b): Generate the α-cut matrix of the CSR matrix (denoted by CSR_α) by determining an appropriate value of α.
Step (c): Create a hierarchy of the partitioned categories from the CSR_α matrix representing a partial ordering.

3. Evaluation

Our experiments involve the construction of three kinds of taxonomies: sub-hierarchies of the Open Directory Project (ODP) and the Yahoo directory, and sub-hierarchies containing documents from the Reuters-21578 news collection.² We then attempted to evaluate how well the generated hierarchies emulated the properties of the manually constructed hierarchies, even though their quality is not easy to measure by an objectively derived value. To this end, we use precision and recall, which are commonly used in information retrieval. Denoting the set of discovered relations by Ĥ ⊂ C × C (where C is a set of categories) and the set of true taxonomic relations by H ⊂ C × C, the two measures and their combination are defined as follows:

$$\text{precision} = \frac{|\hat{H} \cap H|}{|\hat{H}|}, \qquad \text{recall} = \frac{|\hat{H} \cap H|}{|H|}, \qquad F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}. \qquad (3)$$
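The following Python sketch illustrates Steps (b) and (c) together with the measures of Eq. (3). It keeps every category pair whose membership grade survives the α-cut as a parent–child edge, which is a simplification of Step (c) (the paper does not spell out how the crisp partial order is turned into the final hierarchy), and csr and true_relations are assumed inputs.

def alpha_cut(csr, alpha):
    # Step (b): crisp relation CSR_alpha = {(c_i, c_j) | mu_CSR(c_i, c_j) > alpha}.
    # csr: dict mapping ordered category pairs to membership grades.
    return {pair for pair, grade in csr.items() if grade > alpha}

def hierarchy_edges(csr, alpha):
    # Step (c), simplified: treat every pair surviving the alpha-cut
    # as a parent -> child edge of the DAG-shaped topic hierarchy.
    return alpha_cut(csr, alpha)

def evaluate(discovered, true_relations):
    # Eq. (3): precision, recall and F1 of discovered vs. true taxonomic relations.
    hit = len(discovered & true_relations)
    precision = hit / len(discovered) if discovered else 0.0
    recall = hit / len(true_relations) if true_relations else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

With the threshold values reported below (α = 0.6 for Yahoo, 0.7 for ODP, 0.8 for Reuters-21578), the discovered edge set can be scored directly against each manually built sub-hierarchy.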

Finally, we compute their combined measure, the F1-measure, which gives equal weight to recall and precision. The F1-measure varies from zero to one and is proportionally related to the effectiveness of the constructed taxonomy. Fig. 1 shows the changes in the quality of the automatically generated hierarchies as the number of selected topical terms varies, with the threshold value α set to 0.6–0.8: the threshold values for the Yahoo, ODP, and Reuters-21578 taxonomies are 0.6, 0.7, and 0.8, respectively.

² Even though the Reuters collection originally does not have an explicit pre-determined topic hierarchy distributed with the data, we used the hierarchies presented in [3] for hierarchical text classification.

[Figure 1: line plot of F1-measure (y-axis, 0.0–1.0) against the number of selected topical terms (x-axis, 80–740) for the ODP (α = 0.7), Reuters-21578 (α = 0.8), and Yahoo (α = 0.6) taxonomies.]

Fig. 1. Changes in the quality of discovered taxonomies from varying the number of selected topical terms.

From this figure, we can see that the proposed method can recover the original hierarchical structure of manually constructed hierarchies with reasonably high quality, although it is not perfect. For the given taxonomies, the degree of recovery of the pre-determined hierarchical structure approaches 90% in F1-measure. Note that a manually constructed hierarchy does not necessarily have higher quality than its automatically constructed counterpart. We have also observed that an appropriate number of topical terms should be used for effective taxonomy construction. For example, in the case of the ODP taxonomy, the method shows the best performance when the total number of topical terms is about 340. Too few topical terms cannot represent the semantics of categories well enough to recognize subsumption relations. In contrast, if too many terms are selected, non-salient terms with relatively small χ² values can weaken the term subsumption relations estimated among the salient terms.

4. Summary

Construction of topic hierarchies is of great importance in information systems that must organize huge numbers of online text documents. We have proposed a simple yet effective method for building topic hierarchies, which characterizes subsumption relations among categories through fuzzy relations. We argue that the proposed method can overcome the subjectivity of manual construction of topic hierarchies and allows hierarchies to be maintained in an efficient and consistent manner. This method can be used for generating a lightweight ontology for the Semantic Web (the next generation of the current World Wide Web) as well as for building hierarchical structures over large text databases. Furthermore, we plan to investigate techniques for building topic hierarchies that employ external knowledge such as WordNet [5], which specifies various kinds of relationships between words.


References

[1] R. Aggrawal, R. Bayardo, R. Srikant, Athena: mining-based interactive management of text databases, in: Proceedings of the Seventh International Conference on Extending Database Technology (EDBT 2000), Konstanz, Germany, March 2000, pp. 365–379.
[2] C.C. Aggarwal, S.C. Gates, P.S. Yu, On the merits of building categorization systems by supervised clustering, in: Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining (KDD'99), San Diego, USA, August 1999, pp. 352–356.
[3] D. Koller, M. Sahami, Hierarchically classifying documents using very few words, in: Proceedings of the 14th International Conference on Machine Learning (ICML'97), Nashville, USA, July 1997, pp. 170–178.
[4] LookSmart, http://www.looksmart.com/.
[5] U. Miller, Thesaurus construction: problems and their roots, Inform. Process. Manage. 33 (4) (1997) 481–493.
[6] S. Miyamoto, Fuzzy Sets in Information Retrieval and Cluster Analysis, Kluwer Academic Publishers, Dordrecht, 1990.
[7] Open Directory Project, http://dmoz.org/.
[8] M. Sanderson, B. Croft, Deriving concept hierarchies from text, in: Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, USA, August 1999, pp. 206–213.
[9] Yahoo, http://www.yahoo.com/.
[10] Y. Yang, J. Pedersen, A comparative study on feature selection in text categorization, in: Proceedings of the 14th International Conference on Machine Learning (ICML'97), Nashville, USA, July 1997, pp. 412–420.

Han-joon Kim received the B.S. and M.S. degrees in Computer Science and Statistics from Seoul National University, Seoul, Korea, in 1996 and 1998, and the Ph.D. degree in Computer Science and Engineering from Seoul National University, Seoul, Korea, in 2002. He is currently an instructor at the Faculty of Electrical and Computer Engineering, University of Seoul, Korea. His current research interests include databases, intelligent information systems, business intelligence, and data/text mining technologies.

Sang-goo Lee received the B.S. degree in Computer Science from Seoul National University, Seoul, Korea, in 1985, and the M.S. and Ph.D. degrees in Computer Science from Northwestern University, Evanston, IL, USA, in 1987 and 1990, respectively. He taught at the University of Minnesota from 1989 to 1990 and worked as a research engineer at Electronic Data Systems, Troy, MI, USA, from 1990 through 1992. He is currently an associate professor at the School of Computer Science and Engineering, Seoul National University, Korea. His current research interests include databases, e-business solutions, information retrieval, and digital libraries. He is a member of the ACM, the IEEE Computer Society, and the W3C Advisory Committee.
