Using Ontology to Map Categories in Blog

Pei-Ling Hsu
Department of Information System Application
National Tsing Hua University
Hsinchu, Taiwan, 300
[email protected]

Abstract

This paper proposes a framework to automatically map user-defined categories in blogs. The proposed framework is composed of a series of procedures, including information extraction from the blog, building personal ontologies, and comparing semantic similarities between user-defined categories. Our novel semantic similarity techniques can determine how similar two sets of information concepts are, based on a given ontology. The experimental results demonstrate that our framework and our proposed semantic similarity techniques are effective.

1. Introduction

Blog is short for Weblog, a personal online diary. A blog provides a user with an easy way to publish content on the Internet. Individual users maintain and organize their own online diaries, which present their experiences and interests. To organize articles efficiently, many blog authors classify their diaries into categories of their own definition. By accessing these user-defined categories, readers can read the articles related to a selected category. Since the number of articles in a blog is huge and people have limited interests, readers usually intend to access only the categories that interest them. The challenge, however, is that different people might have different perceptions: user categories with the same keywords might represent diverse meanings. Therefore, how to organize (or classify) articles based on user-defined categories from various users becomes an interesting but essential problem in blogs. This issue is essentially a classification problem with some new flavors. In traditional classification techniques, determining the similarities between articles and a set of well-defined,

∗ Author to whom correspondence should be addressed.

Proceedings of the International Workshop on Integrating AI and Data Mining (AIDM'06) 0-7695-2730-2/06 $20.00 © 2006

Po-Ching Liu and Yi-Shin Chen∗
Department of Computer Science
National Tsing Hua University
Hsinchu, Taiwan, 300
{g946395@oz, yishin@cs}.nthu.edu.tw

fixed categories is the major concern. Hence, how to extract information from the articles and the categories becomes the focus of research. When classification techniques are applied to the WWW, more characteristics, such as the layout of web pages and the structure of multimedia content, are considered. In addition, some approaches combine classification and ontology to improve system performance by employing background knowledge. These approaches work well for classification with well-defined classifying rules. However, classification with uncertain rules, such as user-defined categories in blogs, is an unexplored research topic. Ontology, which represents the shared conceptualization of objects, could be used to resolve the issue of uncertain rules. Ontology comparison arises because more and more ontologies are developed in different domains and represented in different frameworks. To save time and cost, people try to reuse these existing ontologies to fit different demands. Among all ontology-related research issues, comparing ontologies [14, 8, 15] and ontology mapping [6, 4] are probably the most studied. For the former, various approaches, such as the formal structures of ontologies and the semantic similarity measurement of lexical concepts [8, 14], have been studied. Likewise, various techniques have been proposed for the latter problem. For example, Quick Ontology Mapping was proposed to trade off effectiveness against efficiency in mapping generation algorithms [4]. Diego Calvanese et al. [1] used Description Logics as an ideal candidate to formalize ontologies. Schmidtke et al. [12] and Kalfoglou et al. [6] discussed the state of the art in ontology mapping and ontology integration. Although many techniques have been proposed, none of them addresses the issue of comparing two sets of nodes at one time. This paper provides a framework to automatically map user-defined categories in blogs.
The proposed framework is composed of a series of procedures, including information extraction from blogs, building personal ontologies, and comparing semantic similarities between user-defined categories. We propose two semantic similarity techniques that can determine how similar two sets of information concepts are, based on a given ontology. Our experimental results demonstrate that our framework and our proposed semantic similarity techniques are effective.

The remainder of the paper is organized as follows. Section 2 describes the preliminary concepts used in this paper, including information retrieval, ontology, and semantic similarity. Section 3 introduces the framework step by step. Section 4 focuses on semantic similarity and the two approaches we developed. Our experiments and results are presented in Section 5. The final section concludes the paper and discusses future work.

2. Preliminaries

2.1 Information Retrieval

An Information Retrieval (IR) system searches for information in documents and extracts meaningful content from them. It successfully extracts discriminative content by combining term frequency (TF) with inverse document frequency (IDF). Traditional TF and IDF ignore the structures of documents, because these are commonly difficult to acquire. Nowadays, more and more documents are extracted from the WWW, where document structure is easily available through HTML tags. Many approaches have been proposed to combine traditional TF and IDF with the structures or the additional data of HTML documents, including the use of HTML tags [16, 9], plain text [3], anchor tags of hyperlinks [7], and the metadata of hyperlinks. The corresponding experiments showed that these approaches work better than the traditional methods. Hence, in this paper, we combine TF and IDF with the structure of HTML documents to extract meaningful keywords from the users' classifications. The TF × IDF formulas used in this paper are as follows:

tf = n_i / Σ_k n_k    (1)

tfidf = tf × log( |D| / |d_j ⊃ t_j| )    (2)

where n_i is the number of occurrences of the considered term, and the denominator Σ_k n_k is the number of occurrences of all terms; |D| is the total number of titles and articles of the classification, and |d_j ⊃ t_j| is the number of titles and articles in which term t_j appears (i.e., where n_j is not zero).

2.2 Ontology Comparison and Semantic Similarity

An ontology [10], which is a formal, explicit specification of a shared conceptualization, establishes a shared and common understanding across people and application systems. The concepts represented in an ontology are described by objects and their corresponding relations. Objects with identical properties can be grouped into a class, and classes are organized into a taxonomy or hierarchy. Figure 1 shows an example of an ontology describing the objects related to an apple.

Figure 1. Representation in Ontology

Ontologies can be represented in many frameworks, including a simple concept hierarchy, a semantic net, a frame system, a logical model, a graph, or a thesaurus. Hence, as the number of ontologies constructed on different frameworks and employed in different application domains increases, ontology-comparison research issues become more and more important. Among these issues, semantic similarity is the key one. It can be applied in IR to find the similarity between queries and documents, or to solve the lexical ambiguity problem. Currently, most studies of semantic similarity use only the objects of an ontology and focus on the hierarchical structure within one single ontology (e.g., WordNet). Two major approaches, the information content-based approach and the conceptual distance approach¹, are introduced below. Our proposed techniques are inspired by these two approaches.
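The keyword scoring of Equations (1) and (2) can be sketched as follows. This is a minimal Python illustration in which the function and variable names are ours, not the paper's; each category is assumed to be given as a list of tokenized titles/articles.

```python
import math
from collections import Counter

def top_k_keywords(docs, k=10):
    """Rank the terms of a category by tf-idf (Eqs. 1-2).

    docs: list of token lists, one per title/article in the category.
    """
    all_terms = [t for doc in docs for t in doc]
    tf_counts = Counter(all_terms)          # n_i for each term
    total = len(all_terms)                  # sum_k n_k
    num_docs = len(docs)                    # |D|
    # |d_j ⊃ t_j|: number of documents containing each term
    doc_freq = Counter(t for doc in docs for t in set(doc))

    def tfidf(term):
        tf = tf_counts[term] / total                 # Eq. (1)
        idf = math.log(num_docs / doc_freq[term])
        return tf * idf                              # Eq. (2)

    return sorted(tf_counts, key=tfidf, reverse=True)[:k]

docs = [["apple", "pie", "recipe"],
        ["apple", "juice"],
        ["pie", "crust", "recipe"]]
print(top_k_keywords(docs, k=3))
```

Note that terms appearing in fewer documents receive a higher IDF weight, so rarer terms can outrank more frequent ones.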

¹ These two approaches are described in detail in Section 4.

3. Framework

In many blogs, authors usually organize their articles into several user-defined categories (such as the blog in Figure 2). User-defined categories constructed by different users can be named differently even when they share the same underlying concept. Although humans can easily capture and compare the concepts of these categories, it is difficult for computers to recognize them. Hence, we designed a new system that finds the underlying concepts of these categories and automatically ranks their corresponding semantic similarity values. When a user appends another user's blog to his/her friend list, the system first evaluates the similarities between the user's and the friend's categories. Next, the system shows the top-k categories with the highest similarity values. As shown in Figure 3, the more stars, the greater the similarity between the two user-defined categories.

Figure 2. Example of a blog

Figure 3. Motivation

The framework of this new system is shown in Figure 4. The system employs one shared ontology as the standard for comparing semantic similarity between user-defined categories. The method is introduced step by step below.

Figure 4. Framework of This System

Step 1: Extract Keywords from a Category
To extract representative keywords from a user-defined category Uc, our framework employs TF × IDF to rank the top-k keywords into a keyword list Luc, where Luc = {w | w is a keyword extracted from category Uc}.

Step 2: Filter Keywords
The system then filters out the keywords in Luc that have no corresponding concepts in the shared ontology. The matched keywords, which have the same concepts in the shared ontology, become concept nodes.

Step 3: Build Personal Ontologies
A personal ontology is built as follows: we find the shortest path that follows the edges of the shared ontology and passes through all concept nodes. On the way to connecting the concept nodes distributed in the shared ontology, this shortest path will involve context nodes. Context nodes represent the contextual relations between concept nodes and connect them, making the personal ontology more coherent. Therefore, an entire personal ontology includes concept nodes, context nodes, and edges/relations. After a personal ontology is built, it is viewed as a unit that represents the concept of Uc in a normalized way.

Figure 5. Building a Personal Ontology

There are two reasons to build personal ontologies in this system. First, personal ontologies make the structure of concept nodes more concrete. Second, by grouping concept nodes, personal ontologies reduce the area of the shared ontology involved in similarity comparison and thus save cost. Steps 1 to 3 are executed iteratively until every user-defined category has a keyword list and a personal ontology (as shown in Figure 5). After the personal ontologies are built, we compare the similarities between pairs of personal ontologies.
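Steps 2 and 3 can be sketched as follows. This is a minimal illustration assuming the shared ontology is an undirected graph of "similar" relations; the paper's exact shortest-path construction (a Steiner-tree-like problem) is not fully specified, so pairwise BFS shortest paths are used here as an approximation, and all names are ours.

```python
from collections import deque
from itertools import combinations

def filter_keywords(shared, keywords):
    """Step 2: keep only keywords that appear as concepts in the shared ontology."""
    return [w for w in keywords if w in shared]

def bfs_path(graph, src, dst):
    """Shortest path between two nodes of an unweighted graph (BFS)."""
    prev, seen, queue = {src: None}, {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:                     # reconstruct the path backwards
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                prev[nb] = node
                queue.append(nb)
    return None                             # dst unreachable from src

def build_personal_ontology(shared, concept_nodes):
    """Step 3: connect the concept nodes through the shared ontology.

    Nodes picked up along the connecting paths become context nodes.
    """
    nodes = set(concept_nodes)
    for a, b in combinations(concept_nodes, 2):
        path = bfs_path(shared, a, b)
        if path:
            nodes.update(path)
    concept = set(concept_nodes)
    context = nodes - concept
    return concept, context

# Toy shared ontology as adjacency lists of "similar" relations
shared = {"rose": ["flower"], "lily": ["flower"],
          "flower": ["rose", "lily", "plant"], "plant": ["flower"]}
keywords = filter_keywords(shared, ["rose", "lily", "pizza"])  # drops "pizza"
concept, context = build_personal_ontology(shared, keywords)
print(concept, context)  # context node picked up: "flower"
```

Here "flower" enters the personal ontology as a context node because it lies on the path connecting the two concept nodes.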

4. Semantic Similarity Measurement

4.1 Similarity of Two Nodes

In previous research, two approaches to measuring the semantic similarity between two nodes in an ontology have been investigated [11]: the Information Content (IC) based approach [2] and the conceptual distance approach [5]. Both use the hierarchical structure of the ontology as the main property for computing similarities. In the first approach, the more information two nodes share, the more similar they are. Several methods have been developed to determine how much information two nodes share. The most popular among them, based on corpus statistics, finds the IC of the least common subsumer (LCS) of the two nodes in the ontology. The LCS is the lowest node that is an ancestor of both nodes, and it captures the maximal information shared by the two nodes. For example, the LCS of n10 and n11 in Figure 6 is n9 rather than n6 or n1, and the LCS of n7 and n3 is n1.

Figure 6. Portion of Ontology

The value of the IC of the LCS node is obtained by estimating the probability of occurrence of this node in a large text corpus [5, 11]:

IC(C) = −log P(C)    (3)

where P(C) is the probability of encountering an instance of node C. The frequency freq(C) of a concept node C is computed by counting the occurrences of the concepts that the LCS contains and the occurrences of its sub-concepts. The concept node probability P(C) is computed as [5, 11]:

P(C) = freq(C) / N    (4)

where N is the total number of concept nodes in the ontology.

The conceptual distance approach estimates the distance between two nodes in the ontology. The edges connecting the two nodes are weighted by the structural characteristics of a hierarchical ontology, including local density, depth of a node in the hierarchy, link type [13], and the strength of an edge link. These features assign different weights to the edges of the ontology. In summary, the sum of the weights of the edges connecting two nodes is the distance between them: the shorter the distance between two nodes, the more similar they are.
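The IC-based similarity of Equations (3) and (4) can be sketched on a toy taxonomy. The parent-pointer representation, the corpus counts, and the function names below are illustrative assumptions, not the authors' implementation; the similarity of two nodes is taken as the IC of their LCS, in the style of Resnik.

```python
import math

def ancestors(parent, node):
    """Chain of ancestors from the node itself up to the root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def lcs(parent, a, b):
    """Least common subsumer: lowest shared ancestor of a and b."""
    b_anc = set(ancestors(parent, b))
    for node in ancestors(parent, a):   # ordered from a upward
        if node in b_anc:
            return node
    return None

def ic_similarity(parent, freq, total, a, b):
    """sim(a, b) = IC(LCS(a, b)) = -log P(LCS)  (Eqs. 3-4)."""
    c = lcs(parent, a, b)
    return -math.log(freq[c] / total)

# Toy taxonomy mirroring Figure 6: n9 subsumes n10 and n11
parent = {"n10": "n9", "n11": "n9", "n9": "n6", "n7": "n6",
          "n6": "n1", "n3": "n1"}
freq = {"n1": 100, "n6": 40, "n9": 10, "n3": 30}  # corpus counts incl. descendants
print(lcs(parent, "n10", "n11"))  # n9
print(lcs(parent, "n7", "n3"))    # n1
print(ic_similarity(parent, freq, 100, "n10", "n11"))
```

A deeper LCS is rarer in the corpus, so its IC, and hence the similarity of the pair, is higher.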

4.2 Similarity of Two Sets of Nodes

In our system, we compare the similarities between two sets of nodes in personal ontologies, rather than between two individual nodes. We developed two new approaches by adapting features of the IC-based and conceptual distance approaches.

Approach 1: Area Measurement
Area measurement attempts to find how much information is shared between two personal ontologies. A personal ontology consists of concept nodes, context nodes, and edges; however, in this approach a personal ontology is viewed as an indivisible unit representing the underlying concept of the user-defined category. Assume there are two personal ontologies α and β from users Uα and Uβ. To estimate how much information α and β share, we find the overlapping area of α and β (see the first graph in Figure 7); the bigger the overlapping area, the more information they share. The formula of area measurement is

sim(α, β) = |α ∩ β| / |α ∪ β|    (5)

The numerator is the number of nodes in the overlap between α and β, which estimates the area of shared information. The denominator is the number of nodes in the union of α and β, because an ontology containing a huge group of nodes can overlap other personal ontologies more easily, making its overlap bigger than normal.

Approach 2: Weighted by Hierarchical Structure
This approach estimates the similarity between two personal ontologies by weighting the overlapping area according to the hierarchical structure of the ontology. In this approach, we make some assumptions.

First, we focus on personal ontology α to find which personal ontology is most similar to α; the scale of similarity is relative to α rather than to the ontology compared with it. Second, concept nodes, which are transformed from the keywords of a user-defined category, represent the underlying category more precisely than context nodes do. Third, users who organize user-defined categories are not taxonomy professionals; they organize their categories by instinct, regardless of the accurate relation types between articles. Therefore, we ignore the link type and the strength of edge links among the structural characteristics of the ontology, and assume that when two nodes are close in the ontology, their semantics are similar.

This approach is designed to satisfy the following conditions:

1. If a large number of concept nodes appear in the overlap, the two personal ontologies are similar. In Figure 8, personal ontology γ is more similar to α than personal ontology β is, although the area of overlap O1 equals that of overlap O2.

2. Lower-level pairs of nodes are more similar than higher-level pairs of nodes. For example, assume nodes n2, n6, n10, and n11 in Figure 6 represent animal, plant, rose, and lily, respectively, and let S(N1, N2) denote the similarity of a pair of nodes N1 and N2. Then S(n10, n11) is greater than S(n2, n6). Accordingly, two personal ontologies with a lower-level overlap should be more similar than two with a higher-level overlap (see the second graph in Figure 7): O2 is lower than O1, and therefore γ is more similar to α than β is.

Figure 7. Overlap

Figure 8. Concept Nodes in Overlap

According to the assumptions and conditions above, the formula is developed as follows:

sim(α, β) = Σ_{i=1}^{n} log_k[l(x_i) + 1] / ( Σ_{j=1}^{a} log_k[l(y_j) + 1] + Σ_{j=1}^{b} log_k[l(z_j) + 1] )    (6)

Figure 9. Diagram of Equation 6

In Equation 6, l denotes level and x_i denotes a concept node in the overlap O1 of personal ontologies α and β. Figure 9 illustrates the concept of Equation 6. The x_i, i = 1, ..., n, are the concept nodes in O1. Each concept node is labeled with the level at which it is located in the shared ontology, l(x_i), and the weights are the logarithms of these levels. Therefore, the numerator is:

Σ_{i=1}^{n} log_k l(x_i)    (7)

The y_j, j = 1, ..., a, are the concept nodes of α, and the z_j, j = 1, ..., b, are the context nodes of α, so the total number of nodes in α is a + b. Each y_j and z_j is weighted by the logarithm of the level at which the node is located in the shared ontology. Therefore, the denominator is:

Σ_{j=1}^{a} log_k l(y_j) + Σ_{j=1}^{b} log_k l(z_j)    (8)

Finally, 1 is added to the level of each node, as in Equation 6, to avoid log 1 = 0 when a node is at level one of the ontology.
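Both set-level measures can be sketched as follows, assuming each personal ontology is given as a node set (Approach 1) or as lists of node levels in the shared ontology (Approach 2); all names here are illustrative, not from the paper.

```python
import math

def area_similarity(alpha_nodes, beta_nodes):
    """Approach 1, Eq. (5): sim = |α ∩ β| / |α ∪ β| over node sets."""
    a, b = set(alpha_nodes), set(beta_nodes)
    return len(a & b) / len(a | b) if a | b else 0.0

def weighted_similarity(overlap_levels, concept_levels, context_levels, k=2):
    """Approach 2, Eq. (6): overlap weighted by node levels l(.) in the
    shared ontology; 1 is added to each level to avoid log_k(1) = 0."""
    w = lambda level: math.log(level + 1, k)
    num = sum(w(l) for l in overlap_levels)            # concept nodes in overlap
    den = sum(w(l) for l in concept_levels) + sum(w(l) for l in context_levels)
    return num / den if den else 0.0

print(area_similarity({"rose", "lily", "flower"}, {"rose", "flower", "tree"}))  # 0.5
# A deeper (more specific) overlap yields a higher similarity score:
print(weighted_similarity([5, 5], [5, 5, 3], [2]))
print(weighted_similarity([2, 2], [5, 5, 3], [2]))
```

The last two calls illustrate condition 2: the same number of overlapping nodes scores higher when those nodes sit at deeper levels of the shared ontology.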

5. Performance Evaluation

5.1 Data Set

User Data
The user data were obtained from Yam Blog (http://blog.yam.com/), which hosts many Chinese personal blogs. We selected Yam Blog for the following reasons:

1. It supports user-defined categories and also provides public common categories that serve as our standard solutions later.

2. The homepage of Yam Blog contains 28 kinds of public common categories F.

3. Each user can post an article a in any one of these public categories fr if he/she feels that article a is related to that category.

4. In the meantime, the user can also post the same article a in one of his/her user-defined categories Uc. U denotes all user-defined categories of one user, U = {U1, U2, ..., Uc}.

We extracted 17,844 user-defined categories and 94,558 user articles from Yam Blog. After extracting the user data, we used TF × IDF to extract the top-10 keywords from the users' articles to form the keyword list Luc of Uc. Two types of keyword lists are used in the experiments to evaluate how the system performs: (1) using only article titles as the corpus for keyword extraction, and (2) using both titles and article content as the corpus.

We assume that a user considers public category fr equivalent to user-defined category Uc if he/she posts article a in both fr and Uc at the same time. According to this assumption, we divide the 17,844 user-defined categories into the 28 public categories (see Figure 10). These public categories are given as F = {f1, f2, ..., fr}, where fr is a public category; fr = {U1, U2, ..., Ufr}, where Ufr is a user-defined category such that fr and Uc contain the same article a and Ufr ∈ U.

Figure 10. Clustering in 28 Public Categories

Shared Ontology
Because the collected user data are written in Chinese, we apply the Chinese Synonym Ontology, the only Chinese ontology available to us. This ontology was originally built for classical Chinese literature; hence, some of its nodes are rarely used in the current diction of blogs. The ontology initially contains 63,213 nodes, connected only by "similar" relations. We removed stop words from the shared ontology to adapt it to our domain. The terms in the ontology range from one to four Chinese characters, and we use only two-character terms. After this processing, 30,911 nodes remain in the shared ontology. A segment of the shared ontology is shown in Figure 11; the nodes crossed out in red represent the deleted words.


Figure 11. A Segment of the Shared Ontology

Processing Data
We execute the following steps to collect the processing data for an experiment. First, we take a user-defined category Ufr from fr. Second, we randomly select #s user-defined categories from the public category fr to form a set of correct answers. Third, we randomly choose #ran user-defined categories from the public categories F other than fr to form a set of incorrect answers. The combination of correct and incorrect answers becomes a set of processing data. The cardinality of the processing data set is #t, the sum of #s and #ran. The variations of the processing data are listed in Table 1; in total there are 20 combinations of processing data.
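The sampling procedure above can be sketched as follows, with hypothetical category data; the function name and the fixed seed are our own additions for reproducibility.

```python
import random

def make_processing_data(public_categories, target_cat, n_correct, n_random, seed=0):
    """Sample #s correct answers from f_r and #ran incorrect ones from F \\ {f_r}."""
    rng = random.Random(seed)
    correct = rng.sample(public_categories[target_cat], n_correct)
    others = [u for cat, members in public_categories.items()
              if cat != target_cat for u in members]
    incorrect = rng.sample(others, n_random)
    return correct + incorrect          # cardinality #t = #s + #ran

# Hypothetical public categories F mapping to user-defined categories
F = {"travel": ["u1", "u2", "u3", "u4"],
     "food": ["u5", "u6"],
     "music": ["u7", "u8"]}
data = make_processing_data(F, "travel", n_correct=2, n_random=3)
print(len(data))  # 5
```

The returned list mixes #s categories known to share the target's public category with #ran categories drawn from elsewhere, which the similarity ranking is then expected to separate.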

Table 1. The Parameters of Processing Data

5.2 Implementation

Building Personal Ontologies
We map the keywords in Luc onto the shared ontology to find the locations of the corresponding nodes. Each user-defined category Ufr in public category fr builds a personal ontology that includes the concept nodes from Luc, context nodes, and edges. This transformation is executed iteratively until every user category in fr has built a personal ontology.

Comparing Semantic Similarity
We employ the processing data to locate the similar categories of a user-defined category Ufr, i.e., the categories that should be in the same public category fr as Ufr. The similarity values between the compared user-defined categories are computed from their personal ontologies using the two approaches described in Section 4.2: Approach 1 adopts Equation (5) and Approach 2 adopts Equation (6).

5.3 Experimental Results

After comparing semantic similarities, we list the top-10 user-defined categories most similar to Ufr. Some public categories contain too few user-defined categories to form a combination of processing data, so we filter these out. Table 2 lists the number of combinations of processing data (all data), the average precision, the average amount of data whose precision exceeds the expected value, and the average percentage of such data. Table 2 shows that both Approach 1 and Approach 2 are effective: all average precisions are higher than 0.5, and at least 78% of the data exceed the expected value.

According to Table 2, the results employing keywords extracted from both titles and content (denoted "Title + Content") are better than those extracted only from titles (denoted "Title"). This matches our expectations, since keywords extracted from both title and content carry much more information. Interestingly, however, the improvement of "Title + Content" over "Title" does not amount to much.

Compared to Approach 1, the results of Approach 2 are a great improvement. We attribute this improvement to the hierarchical arrangement of the ontology and the weighting of the concept nodes: the more concept nodes two user categories share, the more similar they are, and the more specific the shared nodes are in the ontology, the more similar the categories are.

Table 2. Average Result of Experiment

In Figures 12 and 13, "Title (1)" denotes Approach 1 using keywords from titles only; likewise, "Title + Content (2)" denotes Approach 2 with keywords from titles and content. Figure 12 shows the average improvement of Approach 2 over Approach 1. The improvement is substantial, with improvement rates higher than 0.7, clearly showing that Approach 2 is much better than Approach 1. Moreover, Title + Content is more stable than Title as the number of random answers increases. However, extracting keywords from Title + Content involves much more preparatory work because its corpus is huge. In other words, using the Title keywords alone is effective enough.


Figure 12. Average Improvement When #s = 10

Figure 13 demonstrates the average improvements in different public categories. Although the improvement rates vary across public categories, they are positive in all of them. Once again, Figure 13 validates that Approach 2 is much better than Approach 1.
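The top-k precision used in this evaluation can be sketched as follows; this is an illustrative reconstruction with our own names, where precision is the fraction of correct answers among the top-ranked categories.

```python
def precision_at_k(scored, correct_set, k=10):
    """Fraction of the top-k most similar categories that are correct answers.

    scored: list of (category_id, similarity) pairs for the processing data.
    """
    ranked = sorted(scored, key=lambda p: p[1], reverse=True)[:k]
    hits = sum(1 for cat, _ in ranked if cat in correct_set)
    return hits / min(k, len(ranked))

# Hypothetical similarity scores for five processing-data categories
scored = [("u1", 0.9), ("u2", 0.8), ("u5", 0.7), ("u3", 0.4), ("u6", 0.2)]
print(precision_at_k(scored, correct_set={"u1", "u2", "u3"}, k=3))  # 2 of top-3 correct
```

A higher value means more of the categories ranked most similar to Ufr indeed belong to the same public category fr.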

6. Conclusion and Future Work

We have proposed a new system that automatically measures semantic similarity between user-defined categories in blogs. There are two main findings in this study. First, the system makes the blog environment more user-friendly by recommending a list of categories of interest from friends. Second, the proposed system contains two approaches for measuring semantic similarity between two sets of nodes in an ontology: the first measures how much information two sets of nodes share, while the second additionally considers the hierarchical structure of the ontology. The results show that the average precision of both approaches is higher than 60%, with the second approach outperforming the first. Two types of keyword lists were used in the experiments: one extracted from article titles only, the other from both titles and content. The results of the two types do not differ much; considering the time cost, we suggest extracting the keyword list from titles only. Future work includes combining the two approaches and evaluating the system with another shared ontology.

Figure 13. Average Improvement Across Different Public Categories

References

[1] D. Calvanese, G. D. Giacomo, and M. Lenzerini. Ontology of integration and integration of ontologies. In Description Logics, 2001.
[2] V. Cross. Fuzzy semantic distance measures between ontological concepts. In Fuzzy Information Processing, NAFIPS '04, IEEE, 2004.
[3] M. Cutler, Y. Shih, and W. Meng. Using the structure of HTML documents to improve retrieval. In USENIX Symposium on Internet Technologies and Systems, California, 1997.
[4] M. Ehrig and S. Staab. Efficiency of ontology mapping approaches. In International Workshop on Semantic Intelligent Middleware for the Web and the Grid at ECAI 04, 2004.
[5] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. CoRR, 1997.
[6] Y. Kalfoglou and M. Schorlemmer. Ontology mapping: The state of the art. The Knowledge Engineering Review, 18(1):1–31, 2003.
[7] Y. Liu, C. Wang, M. Zhang, and S. Ma. Finding "abstract fields" of web pages and query specific retrieval: THUIR at TREC 2004 web track. In Proceedings of the Thirteenth Text Retrieval Conference (TREC 2004), 2004.
[8] A. Maedche and S. Staab. Measuring similarity between ontologies. In Proceedings of the European Conference on Knowledge Acquisition and Management (EKAW), 2002.
[9] S. Mukherjee, G. Yang, and I. V. Ramakrishnan. Automatic annotation of content-rich HTML documents: Structural and semantic analysis. In Proceedings of the International Semantic Web Conference, pages 533–549, 2003.
[10] C. Brewster and K. O'Hara. Knowledge representation with ontologies: The present and future. IEEE Intelligent Systems, 19(1):72–81, January/February 2004.
[11] R. Richardson and A. F. Smeaton. Using WordNet in a knowledge-based approach to information retrieval. Technical Report CA-0395, Dublin City University, Ireland, 1995.
[12] H. R. Schmidtke, P. H. Sofia, A. Gomez-Perez, and J. P. Martins. Some issues on ontology integration. In Proceedings of the Workshop on Ontologies and Problem Solving Methods during IJCAI-99, 1999.
[13] M. Sussna. Word sense disambiguation for free-text indexing using a massive semantic network. In Proceedings of the Second International Conference on Information and Knowledge Management (CIKM 93), pages 67–74, 1993.
[14] J. Z. Wang and F. Ali. An efficient ontology comparison tool for semantic web applications. In The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), pages 372–378, 2005.
[15] J. Z. Wang, F. Ali, and R. Appaneravanda. A web service for efficient ontology comparison. In Proceedings of the IEEE International Conference on Web Services (ICWS 05), pages 843–844, 2005.
[16] M. Zhang, R. Song, and S. Ma. DF or IDF? On the use of HTML primary feature fields for web IR. In International World Wide Web Conference (Posters), 2003.
