
Web Intelligence and Agent Systems: An International Journal 8:3 (2010) 1–X IOS Press

A Knowledge-based Model Using Ontologies for Personalized Web Information Gathering

Xiaohui Tao a,∗, Yuefeng Li a and Ning Zhong b

a School of Information Technology, Queensland University of Technology, Australia. E-mail: {x.tao, y2.li}@qut.edu.au
b Department of Systems and Information Engineering, Maebashi Institute of Technology, Japan. E-mail: [email protected]

Abstract. Nowadays, gathering useful and meaningful information from the Web has become challenging for all users because of the explosion in the amount of Web information. Mainstream Web information gathering techniques, however, have many drawbacks, as they are mostly keyword-based. It is argued that the performance of Web information gathering systems can be significantly improved if user background knowledge is discovered and a knowledge-based methodology is used. In this paper, a knowledge-based model is proposed for Web information gathering. The model uses a world knowledge base and user local instance repositories for user profile acquisition and the capture of user information needs. The knowledge-based model was successfully evaluated by comparison with a manually constructed user concept model. The proposed knowledge-based model contributes to better designs of knowledge-based and personalized Web information gathering systems.

Keywords: Knowledge-based Information Gathering, Ontology, World Knowledge Base, User Background Knowledge, Local Instance Repository, User Information Needs

Definition 1 Let S be a set of subjects; an element s ∈ S is formalized as a 5-tuple s := ⟨label, type, neighbor, ancestor, descendant⟩, where

– label is the subject heading of s in the LCSH authorities, and label(s) = {t1, t2, . . . , tn};
– type is the topic type of s in the LCSH authorities, and type(s) ∈ {topical, geographic, corporate};
– neighbor is a function returning the subjects that have direct links to s in the LCSH thesaurus, and neighbor(s) = {si | si ≠ s, si ∈ S} and neighbor(s) ⊆ S;
– ancestor is a function returning the subjects that have a higher level of abstraction than s and link to s directly or indirectly in the LCSH thesaurus, and ancestor(s) = {sj | sj ≠ s, sj ∈ S} and ancestor(s) ⊆ S;
– descendant is a function returning the subjects that are more specific than s and link to s directly or indirectly in the LCSH thesaurus, and descendant(s) = {sk | sk ≠ s, sk ∈ S} and descendant(s) ⊆ S.

∗ Corresponding author. E-mail: [email protected].

1. Introduction

1570-1263/10/$17.00 © 2010 – IOS Press and the authors. All rights reserved.

In recent decades, the amount of Web information has exploded rapidly, and how to gather useful information from the Web has become a challenging issue for all Web users. Many information retrieval (IR) systems have been developed in an attempt to solve this problem, resulting in great achievements; however, there is still no complete solution to the challenge [11]. Current Web information gathering systems cannot completely satisfy Web search users, because they are mostly based on keyword-matching mechanisms and suffer from the problems of information mismatching and overloading [42]. Information mismatching means that valuable information is missed during gathering. This usually occurs when one search topic has different syntactic representations. For example, 'data mining' and 'knowledge discovery' refer to the same topic of discovering knowledge from raw data; with keyword-matching mechanisms, documents containing 'knowledge discovery' may be missed when the query 'data mining' is used in the search. The other problem, information overloading, usually occurs when one query has different semantic meanings. A common example is the query 'apple', which may mean apples (fruit) or iMac (computers). When the query 'apple' is used to describe the information need 'apple (fruit)', the search results may be mixed with useless information about 'iMac (computer)' [41,42]. From these examples, a hypothesis arises that if user information needs can be captured and interpreted, more useful and meaningful information can be gathered for users.

Capturing user information needs via a given query is difficult. In most Web information gathering cases, users provide only short phrases in their queries to express their information needs [61]. Also, Web users formulate queries differently because of their different personal perspectives, expertise, terminological habits, and vocabularies. These differences cause difficulties in capturing user information needs. Thus, to capture user information needs effectively, understanding user background knowledge is necessary. For this purpose, user profiles are widely used in personalized Web information gathering systems [34]. These systems apply user background knowledge to information gathering, a mechanism suggested by Yao [81] as knowledge retrieval.

In this paper, we introduce a knowledge-based personalized information gathering model, aiming at improving the performance of information gathering systems by utilizing user background knowledge. This knowledge-based model learns personalized ontologies for user profiles and applies the user profiles to information gathering.
Given a query, the user's background knowledge is discovered from a world knowledge base and the user's local instance repository. Based on these, a personalized ontology is constructed that simulates the user's concept model and captures the user's information need. The semantic relations of is-a, part-of, and related-to are specified for the concepts in the constructed ontological user profile. The acquired user profile is then used by Web information gathering systems to gather useful and meaningful information for the user. The knowledge-based model was evaluated against a model in which user background knowledge was specified manually, and the evaluation results were promising and encouraging.

The proposed knowledge-based model contributes to a better understanding of user information needs and user profile acquisition, as well as to better designs for personalized Web information gathering systems. The paper is organized as follows. Section 2 presents related work. Section 3 describes the framework of the knowledge-based information gathering model. The implementation of the knowledge-based model is introduced in Section 4. Evaluation-related issues are described in Section 5, and the experimental results are discussed in Section 6. Finally, Section 7 concludes the paper.

2. Related Work

2.1. Knowledge-based Information Gathering

Knowledge-based information gathering is based on the semantic concepts extracted from documents and queries: the similarity of documents to queries is determined by the matching level of their semantic concepts. Thus, concept representation and knowledge discovery are two typical issues and are discussed in this section.

Semantic concepts have various representations. In some models, concepts are represented by controlled lexicons defined in terminological ontologies, thesauruses, or dictionaries. A typical example is the synsets in WordNet, a terminological ontology [15]. Models using WordNet for semantic concept representation include [6,17,22] and [33]. The lexicon-based representation defines semantic concepts in terms and lexicons that are easily understood by users and easily utilized by computational systems. However, although the lexicon-based concept representation was reported to improve information gathering performance in some works [28,33,47], it was reported to degrade performance in others [72,70].

Another concept representation in Web information gathering systems is the pattern-based representation, including [42,38,77,14,57]. In such a representation, concepts can be discriminated from one another only when the patterns representing the concepts are sufficiently long. However, if the length is too long, the patterns extracted from Web documents are of low frequency and, as a result, cannot substantially support concept-based information gathering systems [77].

Many Web systems rely upon a subject-based representation of semantic concepts for information gathering. Semantic concepts are represented by subjects defined in knowledge bases or taxonomies, including domain ontologies, digital library systems, and online categorization systems. Typical information gathering systems utilizing domain ontologies for concept representation include those developed by Lim et al. [44], by Navigli [51], and by Velardi et al. [71]. Library systems are also used for subject-based concept representation, such as the Dewey Decimal Classification used by [31,75] and the Library of Congress Classification and Library of Congress Subject Headings used by [16]. Online categorizations are also widely used by many information gathering systems for concept representation, including the Yahoo! categorization used by [18] and the Open Directory Project¹ used by [8,52]. However, the semantic relations associated with the concepts in these existing systems are specified only as super-class and sub-class; they provide inadequate detail and a poor level of specificity. Thus, the specification of semantic relations for subject-based concept representation demands further development.

Techniques used by Web information gathering systems to discover knowledge from text include text classification and Web mining. Text classification is the process of classifying an incoming stream of documents into categories by using classifiers learned from training samples [46]. The performance of text classification relies upon the accuracy of these classifiers [37,80]. Existing techniques for learning classifiers include Rocchio [56], Naive Bayes (NB) [54], Dempster-Shafer [58], Support Vector Machines (SVMs) [27], and probabilistic approaches [50]. Treating the classifiers as semantic concepts, the process of learning classifiers is then a process of extracting semantic concepts to represent the categories. Text classification techniques are widely used in concept-based Web information gathering systems, such as [7,18,48].
However, with text classification techniques, Web information gathering performance relies largely on the accuracy of the predefined categories [20]. Also, the 'cold start' problem occurs when an insufficient number of training samples is available to learn the classifiers.

Web mining discovers knowledge from the content of Web documents and attempts to understand the semantic meaning of Web data [34,42,45,62]. Li and Zhong [42] represented semantic concepts by maximal patterns, sequential patterns, and closed sequential patterns, and extracted semantic concepts from Web documents. Association rule mining was also used by many systems for knowledge discovery from Web documents, including [39,79,78]. Text clustering techniques were used by [21,76,85,35] to discover user interests for personalized Web information gathering. Some works, such as Dou et al. [14], used hybrid Web content mining techniques for concept extraction. However, as pointed out by Li and Zhong [40,41], these existing Web content mining techniques have some limitations. One of them is the inability to specify concrete semantic relations (e.g. is-a and part-of) between concepts. Therefore, the current concept extraction techniques need to be improved for better semantic relation specification, especially given that the current Web is becoming the semantic Web [3].

¹ http://www.dmoz.org

2.2. Ontology Learning

Ontologies are an important technology for the semantic Web and Web information gathering systems. They provide a common understanding of topics for communication between systems and users, and they enable Web-based knowledge processing, sharing, and reuse between applications [10,83]. Ontologies have been widely used by many groups to specify user background knowledge. Li and Zhong [41] used ontologies to describe the user conceptual level model: the so-called 'intelligent' part of the world knowledge model possessed by human beings. They [42] also used pattern recognition and association rule mining techniques to discover knowledge from Web content and learned ontologies for user profiles. Tran et al. [66] introduced an approach to translate keyword queries into Description Logics conjunctive queries and to specify user background knowledge in ontologies. Gauch et al. [18] learned personalized ontologies for individual users in order to specify their preferences and interests.
Cho and Richards [9] proposed constructing ontologies from user-visited Web pages to improve Web document retrieval performance. In all of these works, ontologies were used to specify user background knowledge for personalized Web information gathering.

Ontology learning is the process of constructing ontologies. Zhong and Hayazaki [84] introduced a two-phase ontology learning approach: conceptual relationship analysis and ontology prototype generation. Alternatively, Maedche [49] proposed an ontology learning framework with four phases: concept import and reuse, concept extraction, concept pruning, and concept refinement. The framework extends typical ontology engineering environments by using semi-automatic ontology learning tools with human intervention, and it constructs ontologies adopting the paradigm of balanced cooperative modelling. Typical ontologies learned by manual mechanisms are WordNet [15] and its extension models, such as Sensus [32] and HowNet [82]. The manual ontology learning mechanism is effective in terms of knowledge specification but expensive in terms of cost and computation. Automated ontology learning is instead accomplished using hierarchical collections of documents or thesauruses. One example is the so-called reference ontology used by [19], constructed from the subject hierarchies and their associated Web pages in Yahoo!, Lycos, and the Open Directory Project. King et al. [31] proposed IntelliOnto, an ontology describing world knowledge using a three-level taxonomy of subjects constructed on the basis of the Dewey Decimal Classification. These learning methods increase the efficiency of ontology learning; however, their effectiveness is limited by the quality of the knowledge bases they use.

Many works have tried to learn ontologies automatically without using knowledge bases. Web content mining techniques were used by Jiang and Tan [25] to discover knowledge from domain-specific text documents for ontology learning. Abulaish and Dey [1] proposed a framework to extract concepts from Web documents and construct ontologies with fuzzy descriptors. Jin et al. [26] attempted to integrate data mining and information retrieval techniques to further enhance ontology learning techniques. Doan et al. [13] proposed a model called GLUE and used machine learning techniques to extract similar concepts from different taxonomies. Dou et al.
[14] proposed a framework to learn domain ontologies using pattern decomposition, clustering and classification, and association rule mining techniques. An ontology learning tool called OntoLearn was developed by Navigli et al. [51] in an attempt to discover semantic relations among the concepts from Web documents. These works have explored a new route to specifying knowledge efficiently.

The semantic association between concepts in ontologies can be discovered by computing the conceptual similarity (or distance) between them in the space of ontologies [24]. Node-based conceptual similarity methods measure the extent of information shared by the measured concept classes, including [53]. Edge-based methods measure the distance (e.g. edge length) between the measured concept classes [30]. Edges here refer to the links connecting any two nodes in the ontology structure: the more edges travelled from one concept node to another, the less similar the two concepts. Jiang and Conrath [24] pointed out that the node-based approaches have the limitation of ignoring the structural information of ontologies. The edge-based (conceptual distance) methods do consider structural information but, to the best of the authors' knowledge, none of the existing methods takes into account the influence of different semantic relations, such as is-a, part-of, and related-to.

3. Framework

The framework of knowledge-based information gathering consists of four models: a user concept model, a user querying model, a computer model, and an ontology model, as illustrated in Fig. 1.

Fig. 1. Knowledge-based Web Information Gathering Framework

Given a user, the user concept model is the user's background knowledge system. The querying model is the user's formulation of an information need in Web information gathering. The computer model is designed to capture the information need expressed in the querying model. Finally, the ontology model is the ontological user profile produced by the computer model: the explicit representation of the implicit user concept model regarding the information need. The following sections describe these models in detail.

3.1. Concept Model

Web information gathering tasks are triggered by user information needs. From observations, when users were in need of some information and conducted an information gathering task, they usually fell into one of the following cases:

1. they knew nothing about that information;
2. they had tried but failed to infer that information from what they already knew;
3. they might know something about that information but were not sure, so they needed confirmation.

In the first case, an assumption can be made that users hold a repository storing their background knowledge. Given this assumption, users can check in the repository whether or not they possess some information (or knowledge). From the second case, another assumption arises: that the concepts stored in the knowledge repository are linked to each other. Only with this assumption can users perform inference from what is known to what is unknown. From the last case, a further assumption can be made: that users hold an implicit confidence rate for the concepts stored in their knowledge repository, although the confidence rate cannot be specified clearly. With this assumption, users know what information (or knowledge) they are certain of and what they are uncertain of. Based on these assumptions, though the mechanism of a human user's thought processes in Web information gathering is not yet clearly understood, the following assumption can be made:

Assumption 1 In Web information gathering, users have a knowledge repository, in which:

– the stored concepts are embedded in a taxonomic structure;
– the stored concepts are associated with implicit confidence rates.

Information gathering can then be treated as a process of filling the user knowledge repository with information and knowledge. On the basis of Assumption 1, we formalize a user's knowledge repository as a user concept model:

Definition 2 A user concept model is a 3-tuple U :≈ ⟨K, B̂, G⟩, where

– K is a non-empty set of pairs {⟨k, wk⟩}, where k is a concept possessed by the user and wk is the user's confidence in k;
– B̂ is a taxonomic structure containing the concepts and their relationships;
– G is a set of gaps {g1, g2, . . . , gi} existing on B̂, in which each gap g is one or more concepts that the user does not possess.

Note that :≈ is used in Definition 2 instead of := because this definition is given under Assumption 1, which is based on observations and cannot currently be proven in laboratories.
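As an illustrative aid only, the 3-tuple of Definition 2 might be sketched as a simple data structure; the class and field names below are our own invention, not part of the formal model, and the sample concepts are made up.

```python
from dataclasses import dataclass, field

@dataclass
class UserConceptModel:
    """Sketch of U :≈ ⟨K, B̂, G⟩ from Definition 2 (illustrative only)."""
    # K: concepts possessed by the user, each with an implicit confidence rate
    concepts: dict = field(default_factory=dict)
    # B̂: taxonomic structure, here as a parent -> children adjacency map
    taxonomy: dict = field(default_factory=dict)
    # G: gaps, i.e. concepts the user does not possess
    gaps: set = field(default_factory=set)

    def fill_gap(self, concept: str, confidence: float) -> None:
        """Information gathering fills a gap with a newly acquired concept."""
        self.gaps.discard(concept)
        self.concepts[concept] = confidence

u = UserConceptModel(
    concepts={"data mining": 0.9, "databases": 0.8},
    taxonomy={"computer science": {"data mining", "databases"}},
    gaps={"ontology learning"},
)
u.fill_gap("ontology learning", 0.5)
print("ontology learning" in u.concepts)  # a filled gap becomes a possessed concept
```

The gathering process of the framework then amounts to repeated `fill_gap` calls driven by the retrieved information.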

3.2. Querying Model

An information gathering task takes place when a user intends to find the relevant concepts to fill the gaps g on the B̂ in U. These concepts then become the user's information needs. When attempting to find these concepts, users usually express their information needs using short phrases in their own languages. The phrases consist of a set of terms and are formulated in a data structure form [66]. In information gathering, these user-formulated data structures for information needs are called queries. Thus, the following assumption arises:

Assumption 2 A query is a user information need expressed in the user's own language.

Based on Assumption 2, a user query can be formalized as a querying model in knowledge-based information gathering:

Definition 3 A user querying model Q is a set of terms {t | t ∈ LU}, in which each element is a primitive unit in the user's language LU.

Capturing user information needs means discovering the concepts referred to by the gaps g ∈ G in the user concept model U. Because users do not possess these concepts, to describe their information needs users have to use the possessed concepts that are associated with the gaps on the B̂ of U. Information need capture can thus be understood as an inverse process: exposing the unknown concepts of g ∈ G from the user expression Q. However, tracing from a Q back to the concepts of g ∈ G is difficult. Queries are often small sets of terms and contain only limited information [23], and users have different backgrounds, perspectives, terminological habits, and vocabularies. These facts make user information need capturing a challenging task.


3.3. Computer Model and Ontology Model

A hypothesis thus arises that if a user's background knowledge can be discovered and her (his) concept model can be represented, the concepts referred to by the gaps in the B̂ can be exposed, and thus user information needs can be captured effectively. In this paper, we argue that user concept models can be represented by ontologies, and we implement a computer model to learn ontologies for user concept model representation. Ontologies are an adequate representation of user concept models, as ontologies by their nature are a formal specification of knowledge. Individual users come from different backgrounds, and thus their background knowledge needs to be specified individually.

In an attempt to fill a gap g in U, a user generates a querying model Q for information gathering. The computer model in the knowledge-based framework, denoted by C and displayed in Fig. 1, learns an ontology based on the user background knowledge discovered according to Q. As a result, the structure of the ontology represents the taxonomic structure B̂ in U, and the concepts in the ontology represent the user background knowledge K in U. Discovering the concepts linked with the gaps in the personalized ontology helps to define the concepts referred to by the gaps, that is, to capture the information need expressed by the querying model Q. The personalized ontology constructed against Q is called the ontology model in this knowledge-based framework and is denoted by O(Q).

The computer model is the core of the framework of knowledge-based Web information gathering. The next section presents in detail the implementation of the computer model C and how a user ontology model O(Q) is produced.

4. Implementation of the Computer Model

Given a querying model Q from a user, the computer model C uses a world knowledge base to extract relevant and non-relevant concepts. The interesting concepts, referring to the user background knowledge, are discovered from the user's local instance repository. The ontology model O(Q) is then constructed on the basis of these interesting concepts.

4.1. World Knowledge Base

The world knowledge base is a knowledge taxonomy describing and specifying the background knowledge possessed by humans. For the sake of explanation, we first assume the existence of a world knowledge base WKB, and define the primitive knowledge units in the WKB as subjects. These subjects are formally defined:

Definition 4 Let S be the set of subjects; a subject s ∈ S is formalized as a 2-tuple s := ⟨label, σ⟩, where

– label is the label of s, denoted by label(s);
– σ(s) is a signature mapping defining the cross references of s that directly link to s, and σ(s) ⊆ S.

Subjects in the world knowledge base are linked to each other by the semantic relations of is-a, part-of, and related-to. Formally, is-a relations describe the situation in which the semantic extent referred to by a hyponym is within that of its hypernym: for example, a 'car' is an 'automobile', and 'car' and 'automobile' are on different levels of abstraction (or specificity). Is-a relations are transitive and asymmetric. Transitivity means that if subject A is a subject B and B is a subject C, then A is also a C. Asymmetry means that if A is a B, B cannot be an A: for example, the statement 'an automobile is a car' is false because not all automobiles, motorcycles for instance, are cars.

Alternatively, part-of relations define the relationship between a holonym subject denoting the whole and a meronym subject denoting a part of, or a member of, the whole: for example, a 'wheel' is a part of a 'car'. Part-of relations also hold the transitivity and asymmetry properties: if A is a part of B and B is a part of C, A is also a part of C; if A is a part of B and A ≠ B, B is not a part of A.

Related-to relations hold between two topics related in some manner other than by hierarchy, such as 'ships' and 'boats'; the semantic meanings of the two topics may overlap to some extent. Related-to relations are symmetric: if A is related to B, B is also related to A.
Related-to relations are not transitive: if A is related to B and B is related to C, A is not necessarily related to C, since the semantic extents referred to by A and C may not overlap at all. The semantic relations in the WKB are formally defined:

Definition 5 Let R be the set of relations; a relation r ∈ R is a 2-tuple r := ⟨edge, type⟩, where

– edge connects two subjects that hold a type of relation;
– type is an element of {is-a, part-of, related-to}.
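To make Definitions 4 and 5 concrete, the following sketch represents a tiny WKB fragment and checks the transitivity and asymmetry of is-a; the sample subjects, the symmetric treatment of σ, and the helper names are our own illustrative assumptions, not real LCSH data.

```python
from collections import defaultdict

# Relations per Definition 5: an edge (a pair of subjects) plus a type.
# The triples below are an invented fragment for illustration only.
relations = [
    (("car", "automobile"), "is-a"),
    (("automobile", "vehicle"), "is-a"),
    (("wheel", "car"), "part-of"),
    (("ships", "boats"), "related-to"),
]

# A rough stand-in for the signature mapping sigma of Definition 4:
# cross references directly linked to each subject (treated symmetrically here).
sigma = defaultdict(set)
for (a, b), _type in relations:
    sigma[a].add(b)
    sigma[b].add(a)

def is_a_ancestors(subject: str) -> set:
    """Transitive closure of is-a: if A is-a B and B is-a C, then A is-a C."""
    frontier = {b for (a, b), t in relations if t == "is-a" and a == subject}
    closure = set()
    while frontier:
        p = frontier.pop()
        if p not in closure:
            closure.add(p)
            frontier |= {b for (a, b), t in relations if t == "is-a" and a == p}
    return closure

print(sorted(is_a_ancestors("car")))          # ['automobile', 'vehicle'] (transitivity)
print("car" in is_a_ancestors("automobile"))  # False (asymmetry)
print(sorted(sigma["car"]))                   # ['automobile', 'wheel']
```

The closure walk mirrors how the directed acyclic WKB taxonomy of Definition 6 below can be traversed upward along is-a edges.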

X. Tao et al. / A Knowledge-based Model Using Ontologies for Personalized Web Information Gathering

Fig. 3. Taxonomy Construction for Ontology Learning

Finally, we formally define the world knowledge base WKB:

Definition 6 Let WKB be a world knowledge base, a taxonomy constructed as a directed acyclic graph. The WKB consists of a set of subjects linked by their semantic relations, and can be formalized as a 2-tuple WKB := ⟨S, R⟩, where

– S is a set of subjects S := {s1, s2, · · · , sm};
– R is a set of semantic relations R := {r1, r2, · · · , rn} linking the subjects in S.

Figure 2 illustrates a portion of the WKB dealing with the subject 'Business intelligence'. In the experiments conducted in this work, the WKB was implemented on the basis of the Library of Congress Subject Headings system (refer to Section 5 for a detailed description).

4.2. Taxonomy Construction

User background knowledge is represented by the positive and negative subjects identified according to the querying model Q. Q is a set of terms Q := {t1, t2, . . . , tn}, as defined in Definition 3. Using these terms, an automatic syntax-matching mechanism extracts the related subjects, along with their associated semantic relations, from the world knowledge base. The mechanism is presented in Fig. 3. We use sim(s, Q) to specify the relevance level of a subject s ∈ S to Q:

sim(s, Q) = |label(s) ∩ Q|    (1)

A subject with sim(s, Q) > 0 is a positive subject relevant to Q. Because S contains a large volume of subjects, we cannot treat all subjects in the complement set of positive subjects as negative. Instead, negative subjects are extracted from the neighborhood of the extracted positive subjects. As shown in the algorithm presented in Fig. 3, s ↦ s′ and dis(s, s′) are used for negative subject extraction. s ↦ s′ denotes a path existing between a positive subject s and another subject s′ in the WKB. dis(s, s′) is the conceptual distance between s and s′, measured by the number of subjects crossed on the path s ↦ s′ [30]. As argued by Khan et al. [30], a long conceptual distance means little similarity between two concepts. Thus, subjects with a greater dis(s, s′) are more different in semantics and carry less value for semantic analysis. Based on this argument, only the subjects with dis(s, s′) ≤ 3 are extracted as negative subjects. To be discreet in user background knowledge specification, these subjects are categorized into the negative set temporarily, and their sim(s, Q) values are set to zero. These specified subjects will be refined in the following sections by an ontology mining method based on their semantic relationships and the user's local document collection.

We then have two sets S+ and S− extracted from the WKB according to Q:

S+ = {s | sim(s, Q) > 0, s ∈ S};  S− = {s | sim(s, Q) = 0, s ∈ S}    (2)

Their semantic relations are also specified as:

R = {r | r ∈ R, edge(r) ∈ S × S}    (3)

where S = S + + S − The subjects in S + and S + specify the implicit knowledge k in K in the user’s concept model U related with the information need g ∈ G, which is expressed by the user as a querying model Q. S + specifies the knowledge relevant to g, and S − specifies the subjects paradoxical or non-relevant to g. The relations in R link the subjects in S, and thus provide the backbone of Bb for U. As a result, we have the taxonomy constructed for the user’s ontology model O(Q). However, this taxonomic structure so far is only at the terminological level because of the use of the syntaxmatching mechanism. In the following sections, the taxonomy will be refined and personalized for the user by using the discovered user background knowledge. 4.3. User Local Instance Repositories The taxonomy constructed for the user’s O(Q) needs to be populated with instances and personal-

8

X. Tao et al. / A Knowledge-based Model Using Ontologies for Personalized Web Information Gathering

Fig. 2. A Portion of the World Knowledge Base Dealing with the Subject ‘Business intelligence’.

ized with user background knowledge. These tasks are completed by using the user's local instance repository (LIR). A user's LIR is a collection of information items recently visited by the user, such as stored documents, browsed Web pages, and composed and received email messages [48]. These documents have content-related descriptors referring to the concepts specified in an external knowledge base [12]; for example, the meta-data tags in XML, RDF, OWL, DAML, and XHTML documents cite the concepts specified in knowledge bases. These kinds of documents, with semantic meta-data, are becoming popular on the Web today, and are argued to be the mainstream of semantic Web documents [2,68]. In this paper, such a personal information collection is called a user's local instance repository, and each element in an LIR is called an instance. On the basis of the specified content-related descriptors in LIRs, a user's background knowledge can be discovered from his (her) LIR and used to populate the constructed ontology taxonomy. The content-related descriptors in instances become the ties connecting an LIR to the WKB when these descriptors are treated as subjects. Denoting an instance as i, let I = {i1, i2, ..., ip} be an LIR, and S ⊆ S be the set of subjects cited by the instances in I. The relationship between S and I can be described by the following mappings:

    η : I → 2^S, η(i) = {s ∈ S | s is cited by i};
    η⁻¹ : S → 2^I, η⁻¹(s) = {i ∈ I | s ∈ η(i)};    (4)

where η⁻¹(s) is the reverse mapping of η(i). These mappings aim to explore the semantic matrix existing between the subjects and the instances. The belief of an instance in its referring-to subjects changes depending on the number of subjects. An instance may cite multiple subjects, and these subjects are usually indexed by their importance to the instance. Thus, the belief of an instance in a subject can be defined by

    bel(i, s) = priority(s, i) / n(i)    (5)

where n(i) is the number of subjects on the citing list of instance i, and priority(s, i) = 1 / index(s, i), where index(s, i) is the index (starting from one) of s on the citing list of i. The bel(i, s) value increases when fewer subjects occur in the citing list and when s is ranked higher (has a smaller index) on the list.

4.4. User Background Knowledge Discovery

In this section, how to discover user background knowledge from LIRs is discussed. A user's LIR contains user background knowledge in the semantic matrix of subjects and instances. Based on the mappings in Eq. (4), we first extract the coverset of each subject. A subject's coverset, referring to the extent of instances in the LIR citing the subject, is defined by:

    coverset(s) = {i | i ∈ η⁻¹(s)}    (6)
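As an illustration, the mappings in Eq. (4), the belief of Eq. (5), and the coverset of Eq. (6) can be sketched as follows; the toy LIR, instance names, and subject names are hypothetical, not from the experiments:

```python
# Hypothetical toy LIR: each instance lists its cited subjects in order of
# importance (first = most important), mirroring the citing lists above.
lir = {
    "i1": ["Business intelligence", "Data mining"],
    "i2": ["Data mining"],
    "i3": ["Business intelligence", "Competition", "Data mining"],
}

def eta(i):
    """eta(i): the subjects cited by instance i (Eq. 4)."""
    return set(lir[i])

def eta_inv(s):
    """eta^-1(s): the instances citing subject s (Eq. 4)."""
    return {i for i, subjects in lir.items() if s in subjects}

def bel(i, s):
    """bel(i, s) = priority(s, i) / n(i), with priority(s, i) = 1/index(s, i)
    and indices starting from one (Eq. 5)."""
    subjects = lir[i]
    index = subjects.index(s) + 1      # 1-based position on the citing list
    return (1.0 / index) / len(subjects)

def coverset(s):
    """coverset(s): the extent of instances citing s (Eq. 6)."""
    return eta_inv(s)

print(sorted(coverset("Data mining")))  # all three instances cite it
print(bel("i1", "Data mining"))         # 2nd of 2 subjects: (1/2)/2 = 0.25
```

A subject cited first by a short citing list thus receives the largest belief from that instance.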

Assume subject s1 ∈ S⁺ and s2 ∈ S⁻. s1 and s2 have something in common if coverset(s1) ∩ coverset(s2) ≠ ∅, because they are cited by some common instances. Thus, we may say that s2 is relevant to s1, and may further deduce that s2 is also more or less interesting to the user because s1 is positive. On this basis, let Ŝ(s) = {s′ | s′ ∈ S⁺, coverset(s′) ∩ coverset(s) ≠ ∅} be the synonyms of s ∈ S⁻ in S⁺. We discover these interesting subjects from S⁻ and use them to personalize S⁺ and S⁻ by:

    S⁺ = S⁺ ∪ {s | s ∈ S⁻, Ŝ(s) ≠ ∅};
    S⁻ = S⁻ − {s | s ∈ S⁻, Ŝ(s) ≠ ∅}.    (7)
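The personalization step of Eq. (7) can be sketched as follows; the coversets and subject names below are toy assumptions:

```python
# Toy illustration of Eq. (7): a negative subject whose coverset overlaps a
# positive subject's coverset is "discovered" as interesting and moved to S+.
coversets = {
    "s1": {"i1", "i3"},   # s1 in S+
    "s2": {"i3", "i4"},   # s2 in S-, overlaps s1 via instance i3
    "s3": {"i9"},         # s3 in S-, no overlap with any positive subject
}
s_pos, s_neg = {"s1"}, {"s2", "s3"}

def synonyms(s):
    """S^(s): positive subjects whose coversets overlap coverset(s)."""
    return {p for p in s_pos if coversets[p] & coversets[s]}

discovered = {s for s in s_neg if synonyms(s)}
s_pos, s_neg = s_pos | discovered, s_neg - discovered   # Eq. (7)
print(sorted(s_pos), sorted(s_neg))   # s2 is promoted; s3 stays negative
```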

For the purpose of clarity in this paper, we call the positive subjects extracted by using the syntax-matching mechanism the initial positive subjects, and the subjects {s | s ∈ S⁻, Ŝ(s) ≠ ∅} the discovered interesting subjects. Recall from Assumption 1 and Definition 2 that a knowledge unit k possessed by a user in the concept model U is associated with a confidence rate wk. In this section, we use a support value sup(s, Q) to scale a subject's wk regarding Q. Subjects with large sim values have great sup(s, Q). The sim(s, Q) is used to select the initial positive subjects; thus, it also has an impact on the support value sup(s, Q). For an initial positive subject, its sim value is measured by Eq. (1). Because the interesting subjects are discovered based on their synonyms in the initial S⁺, these synonyms also have authority to determine a discovered interesting subject's sim value. Thus, we measure the sim of a discovered interesting subject by:

    sim(s, Q) = ( Σ_{s′∈Ŝ(s)} conf(s′ → s) × sim(s′) ) / |Ŝ(s)|    (8)

where s′ is an initial positive subject and sim(s′) is determined by Eq. (1). conf(s′ → s) is the confidence that s receives from its synonyms, calculated by:

    conf(s′ → s) = |coverset(s′) ∩ coverset(s)| / |coverset(s′)|    (9)
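Eqs. (8) and (9) can be sketched as follows, with made-up coversets and initial sim values:

```python
# Sketch of Eqs. (8)-(9): the sim value of a discovered interesting subject s
# is averaged from its initial-positive synonyms s', each weighted by the
# confidence conf(s' -> s). All data below are hypothetical toy values.
coversets = {
    "s": {"i1", "i2"},
    "p1": {"i1", "i2", "i3", "i4"},   # initial positive subject, sim = 0.8
    "p2": {"i2"},                     # initial positive subject, sim = 0.4
}
initial_sim = {"p1": 0.8, "p2": 0.4}

def conf(sp, s):
    """Confidence s receives from synonym sp (Eq. 9)."""
    return len(coversets[sp] & coversets[s]) / len(coversets[sp])

def sim_discovered(s, synonyms):
    """sim(s, Q) of a discovered interesting subject (Eq. 8)."""
    return sum(conf(sp, s) * initial_sim[sp] for sp in synonyms) / len(synonyms)

# conf(p1, s) = 2/4 = 0.5, conf(p2, s) = 1/1 = 1.0
print(sim_discovered("s", ["p1", "p2"]))   # (0.5*0.8 + 1.0*0.4) / 2 = 0.4
```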

As a result, a discovered interesting subject has a high sim value if it overlaps a large number of initial positive subjects, and/or the overlapped subjects have large initial sim values.
The locality of a subject in the backbone taxonomy of the O(Q) also influences the subject's sup(s, Q). The subjects located toward the lower-bound levels of the backbone are more specific and thus more strongly focused on their referring-to knowledge. Thus, these subjects should have higher sup(s, Q) than the subjects located toward the upper-bound (more abstract) levels. For the study of a subject's locality, we use an ontology mining method, Specificity, introduced in [63,64] for specifying a subject's semantic focus on its referring-to concepts. The algorithm displayed in Fig. 4 presents how the specificity values are scaled for the subjects in an ontology. The functions hyponym(s) and meronym(s) return the set of direct hyponyms or meronyms of s, satisfying hyponym(s) ⊆ σ(s) or meronym(s) ⊆ σ(s). θ is a coefficient defining the reducing rate of specificity for the focus lost at each step up from the lower-bound toward the upper-bound levels (θ = 0.9 in our experiments). A subject's specificity is determined by its more specific child subjects, as described in the algorithm in Fig. 4. As the WKB is a directed acyclic graph (see Definition 6), the algorithm has a complexity of only O(n), where n = |S|.

Fig. 4. Analyzing Semantic Relations for Specificity

The instances in the LIR also count in the sup(s, Q) of their citing subjects. Considering the bel(i, s) calculated by Eq. (5), the support value sup(i, Q) held by an instance i to Q is calculated by:

    sup(i, Q) = Σ_{s∈η(i)} bel(i, s) × sim(s, Q)    (10)
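The exact specificity algorithm of Fig. 4 is not reproduced in this text; the sketch below is one plausible reading of the description above, assuming leaf subjects receive specificity 1 and a parent's specificity is θ = 0.9 times the minimum specificity of its direct children (the taxonomy and the min-aggregation are illustrative assumptions, not the paper's exact algorithm):

```python
THETA = 0.9   # reducing rate per step up, as in the experiments

taxonomy = {                      # hypothetical subject -> children DAG
    "Economics": ["Business", "Finance"],
    "Business": ["Business intelligence"],
    "Finance": [],
    "Business intelligence": [],
}

def specificity(s, cache=None):
    """Bottom-up specificity over the DAG; O(n) with memoization."""
    cache = {} if cache is None else cache
    if s not in cache:
        children = taxonomy[s]
        if not children:
            cache[s] = 1.0        # leaves are the most specific: full focus
        else:
            # assumed aggregation: decay from the most specific child path
            cache[s] = THETA * min(specificity(c, cache) for c in children)
    return cache[s]

print(specificity("Business intelligence"))   # a leaf: 1.0
print(specificity("Economics"))               # 0.9 * min(0.9, 1.0)
```

Under this reading, subjects closer to the leaves keep higher specificity, matching the intuition that lower-bound subjects focus more strongly on their referring-to knowledge.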

Because the negative subjects have sim(s, Q) = 0, only the positive subjects cited by i count in sup(i, Q). Finally, sup(s, Q) is determined, based on the sim(s, Q) from Eq. (1) for the initial positive subjects (or Eq. (8) for the discovered interesting subjects), the


spe(s) from the algorithm in Fig. 4, and the related sup(i, Q) from Eq. (10):

    sup(s, Q) = spe(s) × sim(s, Q) × Σ_{i∈η⁻¹(s)} sup(i, Q)    (11)

For the subjects in the final S⁻, we set sup(s, Q) = 0. The support value sup(s, Q) specifies the wk of ⟨k, wk⟩ ∈ K in U regarding an information need, where s is for a k and Q is generated from a g ∈ G.

4.5. Formalization of the Computer and Ontology Models

The computer model aims to simulate a user's concept model. The computer model discovers a user's background knowledge regarding an information need. Here we formalize the computer model:

Definition 7 A computer model C is a 3-tuple C := ⟨WKB, LIR, F⟩, where

– WKB is a world knowledge base from which a user's background knowledge is extracted;
– LIR is a user's local instance repository, in which the elements cite the knowledge in WKB;
– F is a set of functions, inferences, and algorithms that build an ontology for a user using WKB and LIR.

To simulate a user's concept model U, the computer model C uses the WKB to construct the taxonomy of an ontology, and uses the user's LIR to populate and personalize the ontology, according to a querying model Q generated from a gap g ∈ G. Equations (1) to (11) and the algorithm in Fig. 4 are used for user background knowledge discovery. They form the F in C, as described in Definition 7. An ontology model is the output of the computer model, aiming to simulate a user's implicit concept model dealing with an information need. An ontology model can be formalized:

Definition 8 An ontology model associated with a Q is a 4-tuple O(Q) := ⟨S, R, taxS, rel⟩, where

– S is a set of subjects (S ⊆ S) consisting of a positive subset S⁺ relevant and a negative subset S⁻ non-relevant to Q;
– R is a set of relations and R ⊆ R;
– taxS : taxS ⊆ S × S is a function defining the taxonomic structure of the ontology, containing the two directed relations is-a and part-of;
– rel is a function defining the non-taxonomic relation of synonyms, e.g., overlapping.

An ontology model simulates a user's concept model U. The knowledge K in a U is specified by S, in which S⁺ is relevant and S⁻ is non-relevant to a Q representing an information need and generated from a g ∈ G. The wk for k ∈ K is specified by sup(s, Q) for the subjects in S. The Bb in U is restructured by R, taxS, and rel in O(Q). The user's concept model U can then be simulated.

5. Experimental Setup

In information gathering research, a common batch-style experiment is to select a collection of documents (a testing set) and a set of topics associated with relevance judgements (a training set), and then to compare the performance of experimental models [59]. Our experiments followed this style, and used the standard testbed and topics originally designed in the TREC-11 Filtering Track2 for evaluating information retrieval methods using relevance feedback. In the evaluation, the implementation of the knowledge-based information gathering model (called the Ontology model) was compared with a manual model representing the user concept model (called the Manual model). If the performance of the Ontology model achieved the same as, or close to, that of the Manual model, we could prove the hypothesis of the evaluation: the knowledge-based information gathering model is capable of simulating the user concept model for information gathering.
The experiment was designed as illustrated in Fig. 5. Against an incoming topic, the Manual model generated a training set manually, whereas the Ontology model learned a personalized ontology for the user profile, and generated a training set using the profile. In both models, a training set consisted of a set of positive samples D⁺ and a set of negative samples D⁻. Each sample was a document d holding a support value support(d) to the given topic. The different training sets were used by the common information gathering system to gather information from the testing set. The performance of the information gathering system was

2 Text REtrieval Conference, http://trec.nist.gov/.


then affected by the input training sets. Based upon this, we compared the performances of the experimental models to evaluate the proposed knowledge-based model.
The Reuters Corpus Volume 1 (RCV1) [36] used in the TREC-11 was also used as the testbed in our experiments. The RCV1 was a large XML document set (806,791 documents) with great topic coverage. A set of 50 search topics was also provided by the TREC-11. These topics were designed manually by linguists and associated with relevant documents judged by the designers [55]. These 50 topics were used in our experiments to ensure high stability of the evaluation, as suggested by [5]. In the experiments, we used only the titles of topics for the querying models because, in the real world, users usually use only short phrases to express their information needs [23].

Fig. 5. Experiment Design

5.1. Information Gathering System

An information gathering system (IGS) was implemented for common use with all experimental models. The IGS was an implementation of a model developed by [42] that used user profiles for Web information gathering. The input support values associated with the documents in user profiles affected the sensitivity of the IGS's performance. The [42] model was chosen because not only was it verified to be better than the Rocchio and Dempster-Shafer models, but it was also extensible in using support values of training documents for Web information gathering.
The IGS first used the training set to evaluate weights for a set of selected terms T. After text pre-processing of stopword removal and word stemming, a positive document d became a pattern consisting of a set of term-frequency pairs d = {(t1, f1), (t2, f2), ..., (tk, fk)}, where fi was ti's term frequency in d. The semantic space referred to by d was represented by its normal form β(d), which satisfied

    β(d) = {(t1, w1), (t2, w2), ..., (tk, wk)}    (12)

where wi (i = 1, ..., k) were the weight distribution of terms and wi = fi / Σ_{j=1}^{k} fj.
A probability function on T could be derived based on the normal forms of positive documents and their supports, for all t ∈ T:

    prβ(t) = Σ_{d∈D⁺, (t,w)∈β(d)} support(d) × w    (13)

The testing documents could be indexed by weight(d), which was calculated using the probability function prβ:

    weight(d) = Σ_{t∈T} prβ(t) × τ(t, d)    (14)

where τ(t, d) = 1 if t ∈ d; otherwise τ(t, d) = 0.
In an attempt to clarify the semantic ambiguity from D⁻, a set of negative documents ND was first selected from D⁻, which satisfied

    ND = {d′ ∈ D⁻ | weight(d′) ≥ min_{d∈D⁺} weight(d)}    (15)

The supports or normal forms of positive documents d were also updated in the following situations: (i) if ∃d′ ∈ ND and d ⊆ d′, the support was adjusted by support(d) = (1/µ) × support(d), where µ = 8 in our experiments; otherwise, (ii) if d ∩ d′ ≠ ∅, instead of updating support(d), its normal form β(d) was adjusted for all (t, w) ∈ β(d) with t ∈ d′ by w = w/µ, and for the rest (t, w) ∈ β(d) with t ∉ d′ by:

    w = w + w × ((µ − 1)/µ) × (s_offering/base)    (16)


where

    s_offering = Σ_{(t,w)∈β(d), t∈d′} w    (17)

and

    base = Σ_{(t,w)∈β(d), t∉d′} w    (18)
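The IGS computations of Eqs. (12) through (18) can be sketched as follows; the documents and the negative-document overlap are toy data, with µ = 8 as in the experiments:

```python
from collections import Counter, defaultdict

MU = 8  # mu = 8, as in the experiments

def normal_form(terms):
    """beta(d): normalized term-frequency weights (Eq. 12)."""
    freq = Counter(terms)
    total = sum(freq.values())
    return {t: f / total for t, f in freq.items()}

def pr_beta(positives):
    """pr_beta(t) = sum over positive d of support(d) * w (Eq. 13)."""
    pr = defaultdict(float)
    for beta_d, support in positives:
        for t, w in beta_d.items():
            pr[t] += support * w
    return pr

def weight(doc_terms, pr):
    """weight(d): sum of pr_beta(t) over terms present in d, i.e. with the
    0/1 indicator tau(t, d) (Eq. 14)."""
    return sum(pr[t] for t in set(doc_terms))

def adjust(beta_d, d_neg):
    """Case (ii) adjustment of a normal form against a negative document
    d_neg in ND (Eqs. 16-18)."""
    s_offering = sum(w for t, w in beta_d.items() if t in d_neg)   # Eq. (17)
    base = sum(w for t, w in beta_d.items() if t not in d_neg)     # Eq. (18)
    return {t: (w / MU if t in d_neg
                else w + w * (MU - 1) / MU * s_offering / base)    # Eq. (16)
            for t, w in beta_d.items()}

beta = normal_form(["user", "user", "ontology", "profile"])
print(beta["user"])                  # 2/4 = 0.5
print(adjust(beta, {"ontology"}))    # the shared term 'ontology' shrinks
```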

The probability function in Eq. (13) and then the weight function in Eq. (14) could be updated based on the changes to the supports and normal forms.
In summary, the input to the Web information gathering system was the user profiles consisting of a set of training documents D = {d | d ∈ D⁺ ∪ D⁻}, in which each document was associated with a support value support(d) to the given topic. The experimental models, the Ontology model and the Manual model, would match this requirement in their user profiles.

5.2. Ontology Model

This model was the implementation of the proposed model. As illustrated in Fig. 5 and required by the IGS, the input to this model was a topic and the output was a training set consisting of positive documents (D⁺) and negative documents (D⁻). Each document was associated with a support(d) value indicating its support level to the topic. In the experiments, we attempted to evaluate the proposed model in an environment covering a great range of topics. However, it was unrealistic to expect a single participant to hold such a large range of topics in her/his personal interest. Thus, for the 50 experimental topics, we assumed that each topic came from an individual user.

5.2.1. World Knowledge Base

The WKB described in Section 4.1 was implemented based on the Library of Congress Subject Headings (LCSH) system. The LCSH authority records distributed by the Library of Congress were stored in a single 130MB file in MARC (MAchine-Readable Cataloging) 21 format, which was sequential raw data compiled in a machine-readable form. After data pre-processing using regular expression techniques, the MARC 21 authority records were translated to human-readable text and organized in an SQL database of approximately 750MB. The LCSH authority records consisted of subjects for personal names, corporate names, meeting names, uniform titles, bibliographic titles, topical terms, and geographic names. In order to make the Ontology model run efficiently, only the topical, corporate, and geographic subjects were kept in the world knowledge base, as they covered most topics in daily life. Eventually, the constructed WKB contained 491,250 subjects covering a wide range of topics.
The semantic relations in the world knowledge base were transformed from the references specified in the LCSH. The Broader Term (BT)/Narrower Term (NT), Used-for (UF), and Related-to (RT) references (represented by '450 |w | a', '450', and '550' in the MARC 21 authority records, respectively) cross-referencing the subjects were extracted to define the semantic relations of is-a, part-of, and related-to in the WKB, respectively. The BT and NT references were for two subjects describing the same topic but at different levels of abstraction (or specificity) [43]. These references defined the is-a relations in the world knowledge base. The UF references were usually used in two situations: to help describe an action, for example, 'a turner is used for cooking'; or to help describe an object, for example, 'a wheel is used for a car'. It was assumed in the experiments that in these cases they were part-of relations: when object A was used for an action, A actually became a part of that action, like 'using a turner in cooking'; when A was used for object B, A became a part of B, like 'a wheel is a part of a car'. Hence, the UF references in the LCSH defined the part-of relations in the world knowledge base. The RT references were for two subjects related in some manner other than by hierarchy. They defined the related-to relations in the world knowledge base. The subjects in the implemented world knowledge base were linked by these three types of semantic relations.

5.2.2. User Local Instance Repositories

In the implementation, a user's local instance repository was collected by searching the subject catalogue of the Queensland University of Technology (QUT) library using the given topic. The QUT library catalogue stored a large volume of information, summarizing over four hundred thousand information items. The catalogue was distributed by the QUT library as a 138MB text file containing information for 448,590 items3, and was used in the experiments as the corpus for user LIR extraction. All of this informa-

3 This figure was for the collection in the QUT library prior to 2007.


tion can be accessed through the QUT library's Web site (http://www.library.qut.edu.au/) and is available to the public. Before use in the experiments, the catalogue information was also pre-processed by using text processing techniques such as stopword removal, word stemming, and term grouping.
Librarians and authors had assigned a title, table of contents, summary, and a list of subjects to each information item in the catalogue. In order to simplify the experiments, only the abstracted information (title, table of contents, summary) was used to represent an instance in the LIRs. Each information item cited a list of subjects defined in the LCSH system for its semantic content. Therefore, treating each information item in the catalogue as an instance, each instance cited a set of subjects in the constructed world knowledge base. On average, about 2.06 subjects were cited by each instance. For each of the 50 experimental topics, and thus each of the 50 corresponding users, the user's LIR was extracted from this catalogue data set. As a result, each LIR contained about 1111 instances on average.

5.2.3. User Profile Acquisition

Once the WKB and an LIR were ready, an ontology was learned as described in Section 4.1 and personalized as in Section 4.3. The user confidence rates on the subjects were specified as in Section 4.4. A document di in the training set was then generated by an instance i, and its support value was determined by:

    support(di) = Σ_{s∈η(i), s∈S} sup(s, Q)    (19)

where s ∈ S in O(Q) was as defined in Definition 8. As sup(s, Q) = 0 for s ∈ S⁻ (according to Eq. (11)), the documents with support(d) = 0 were categorized into D⁻, whereas those with support(d) > 0 were categorized into D⁺.

5.3. Manual Model

The Manual model represented the user concept model. As previously discussed, the RCV1 data set used in the TREC-11 Filtering Track aimed to evaluate methods using persistent user profiles for separating relevant and non-relevant documents in an incoming stream: the TREC linguists at NIST4 separated the

4 The National Institute of Standards and Technology, http://www.nist.gov/.

RCV1 set into training sets and testing sets for the topics designed by the TREC linguists [55]. These training sets were used as the user profiles generated by the Manual model in the experiments, as they were manually acquired by the TREC linguists who created the topics, and thus reflected the users' best interests in these topics. The concepts contained in the Manual user profiles therefore perfectly represented the user interests in the experimental topics.
In the topic design phase, each TREC linguist came to NIST with a set of candidate topics based on his or her own interest. For each candidate topic, the TREC linguist estimated the approximate number of relevant documents by searching the RCV1 data set using NIST's statistic-based search system. The NIST TREC team selected the final set of topics from among these candidate topics, based on the estimated number of relevant documents and balancing the load across the TREC linguists. The training sets associated with the topics were acquired through two phases, the retrieval phase and the fusion phase, aiming at providing more accurate relevance judgements for the training documents. In the retrieval phase, extensive searches using multiple retrieval and classification systems were conducted at NIST for each topic. This process included two to seven rounds. After each round, relevant information was used as feedback to improve the search queries for the next round. The process continued until no more relevant documents were found or five rounds had passed (some topics had more than five rounds due to glitches in the feedback system) [55]. Based on the relevant documents found in the retrieval phase, the TREC author of each topic was given five document sets to judge for the topic in the fusion phase. Each document set consisted of about 100 documents, chosen from the relevant documents found in the retrieval phase.
The author read each one of them and marked each document as positive or negative for relevance or non-relevance to the topic. The combined set of judged and marked documents was then used as the training samples for that topic [73]. Thus, the Manual training sets perfectly reflected the users' interests in these experimental topics, as the topics were created by the same authors who performed the relevance assessments for them.
The Manual training documents associated with the topics were used as the user profiles in the Manual model in this paper's experiments. Against an incoming topic, each document in the training set was associated with 'positive' or 'negative' for relevance or non-


relevance to the topic. The documents marked 'positive' were categorized into the positive set in the user profiles with support(d) = 1/|D⁺|; otherwise, they were categorized into the negative set with support(d) = 0.

6. Results and Discussions

6.1. Performance Measures

The performance of the experimental models was measured using three methods: the precision averages at eleven standard recall levels (11SPR), the mean average precision (MAP), and the F1-Measure. They are all based on precision and recall, the modern information retrieval evaluation methods [5]. Precision and recall are two standard quantitative measures of the performance achieved by information retrieval models [74]. Precision indicates the capacity of a system to retrieve only the relevant information items, whereas recall indicates the capacity of a system to retrieve all the relevant information items. They are calculated by [29,30,60]:

    Precision = (# of relevant docs retrieved) / (Total # of docs retrieved)    (20)

    Recall = (# of relevant docs retrieved) / (Total # of relevant docs)    (21)

where # stands for the number of referring-to documents.
The 11SPR is an appropriate method for information gathering, and was used in TREC evaluations as the performance measuring standard [55]. An 11SPR value is computed by summing the interpolated precisions at the specified recall cutoffs and then dividing by the number of topics:

    ( Σ_{i=1}^{N} precision_λ ) / N    (22)

where λ = {0.0, 0.1, 0.2, . . . , 1.0}, N is the number of topics, and λ are the cutoff points where the precisions are interpolated. At each λ point, an average precision value over N topics is calculated. These average precisions then plot a curve describing the recall-precision performance. The MAP over all relevant documents is a stable measure and a discriminating choice in information

gathering evaluations. The average precision value is a single-value measure that reflects the experimental model's performance over all relevant documents. For each topic, rather than being an average of the precision at standard recall levels, the MAP measure is the mean of the precision values obtained after each relevant document is retrieved. The MAP value for a set of experimental topics is then the mean of the average precision values of the individual topics in the experiments. Different from the 11SPR measure, the MAP reflects the performance in a non-interpolated recall-precision curve [67]. As reported by Buckley and Voorhees [5], the MAP measure is a stable information gathering measuring method, and is recommended for general-purpose information gathering evaluations.
The Fβ-Measure, also widely used in information retrieval and Web information gathering [69,36], is calculated by:

    Fβ = ((β² + 1) × Precision × Recall) / (β² × Precision + Recall)    (23)

where β is a parameter balancing precision and recall, depending on whether precision or recall is preferred in the system. When β = 1, recall and precision are evenly weighted; the Fβ-Measure then corresponds to the harmonic mean and becomes the commonly used F1-Measure [36]:

    F1 = (2 × Precision × Recall) / (Precision + Recall)    (24)

Because precision and recall are equally important for Web information gathering in this paper, the F1-Measure was used in the experiments for effectiveness measurement. Furthermore, the macroaverage and microaverage F1-Measures were used for a detailed investigation of the effectiveness across the experimental topics: the macro-F1-Measure (shortened as macro-FM) measured the un-weighted mean of effectiveness, and the micro-F1-Measure (shortened as micro-FM) measured the effectiveness computed from the sum of results. The macro-FM averaged the precision and recall over each experimental topic and then calculated the F1 value; the micro-FM calculated the F1 value for each returned result and then averaged the F1 values. Greater F1 values indicated better effectiveness.
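The measures above can be sketched as follows, with toy retrieval results (set-valued results for precision/recall, ranked lists for average precision); none of the data comes from the experiments:

```python
def precision_recall(retrieved, relevant):
    """Eqs. (20)-(21) over sets of document ids."""
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    return p, r

def f_measure(p, r, beta=1.0):
    """Eq. (23); beta = 1 gives the F1-Measure of Eq. (24)."""
    if p + r == 0:
        return 0.0
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

def average_precision(ranking, relevant):
    """Non-interpolated AP: mean precision at each relevant doc retrieved."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(topics):
    """MAP: mean of the per-topic average precisions."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)

p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r, round(f_measure(p, r), 3))   # 0.5, 2/3, F1 ≈ 0.571
print(mean_average_precision([(["d1", "d2", "d3"], {"d1", "d3"})]))
```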

X. Tao et al. / A Knowledge-based Model Using Ontologies for Personalized Web Information Gathering

Fig. 6. Experimental 11SPR Results

Table 1
Experimental Results on Average and Statistical Results

              Manual    Ontology    p-value
    macro-FM  0.388     0.386       0.862
    micro-FM  0.356     0.355       0.896
    MAP       0.290     0.284       0.484

6.2. Experimental Results

The experiments were designed to evaluate the proposed knowledge-based information gathering model by comparing the Ontology model to the Manual model. The experimental hypothesis was that the Ontology model could achieve similar performance to that of the Manual model.
The experimental 11SPR results are illustrated in Fig. 6. At recall point 0.3, the Manual model slightly outperformed the Ontology model, but at 0.5 and 0.6, the Ontology model achieved subtly better results than the Manual model. At all other points, their 11SPR results were the same. For the MAP results shown in Table 1, the Ontology model achieved 0.284, which was only 0.006 below that of the Manual model (a 2% downgrade). For the average macro-FM and micro-FM, also shown in Table 1, the Manual model outperformed the Ontology model by only 0.002 (0.5%) in macro-FM and 0.001 (0.2%) in micro-FM. The two models achieved almost the same performance. The evaluation result was promising.
A statistical test, Student's Paired T-Test, was also performed on the experimental results, in an attempt


to analyze the evaluation's reliability. The Student's Paired T-Test is a common statistical method used to compare two sets of results for significance [80,4]. A typical null hypothesis in the Student's Paired T-Test is that no practical difference exists between the two compared models. When the test produces a substantially low p-value (usually below 0.1), the null hypothesis is rejected and a practical difference exists between the two compared models; when the p-value is high, the null hypothesis holds and there is little or no practical difference between them. Although the Student's Paired T-Test assumes a normal distribution in its null hypothesis, Smucker et al. [59] argue that the Student's Paired T-Test largely agrees with the bootstrap and randomization tests in information retrieval evaluations; they are likely to draw the same conclusions regarding the statistical significance of results. Thus, in this paper, the Student's Paired T-Test was chosen for the statistical test.
The null hypothesis in our statistical test was that no difference existed between the compared models. The T-Test results are also presented in Table 1. The p-values show no evidence of a significant difference between the two experimental models, as the produced p-values are substantially high (p-value = 0.484 (MAP), 0.862 (macro-FM), and 0.896 (micro-FM), far greater than 0.1). Based on these results, we can conclude that, in terms of statistics, the Ontology model achieved similar performance to that of the Manual model, and the evaluation result was reliable.

6.3. Discussions

The experiments were designed to evaluate the proposed knowledge-based information gathering model. This evaluation was conducted by comparing the user profiles acquired by the Ontology model, in which user background knowledge was discovered computationally, to those acquired by the Manual model, in which the concepts were specified and proven by users manually.
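The Student's Paired T-Test applied in Section 6.2 can be sketched by computing the t statistic from per-topic score differences; the per-topic scores below are illustrative toy values, not the paper's data, and the p-value would then be read from a t-distribution with n − 1 degrees of freedom:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t = mean(diffs) / (stdev(diffs) / sqrt(n)) over paired per-topic
    scores; stdev is the sample standard deviation."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

ontology = [0.30, 0.25, 0.28, 0.31]   # hypothetical per-topic MAP scores
manual = [0.29, 0.24, 0.26, 0.30]
t = paired_t_statistic(ontology, manual)
print(round(t, 3))   # t ≈ 5.0 for this toy data
```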
According to the experimental results presented in Section 6.2, the Ontology model achieved almost the same performance as the Manual model in the experiments. The experimental results indicated that the MAP, macro-FM, and micro-FM results largely agreed with each other. Table 2 presents the comparisons between the Ontology and Manual models, based on the number of topics that the Ontology model won, lost, and tied in the experiments.

Table 2
Performance Comparison in Topic Numbers (Ontology vs. Manual)

              Won    Lost    Tied
    MAP       14     19      17
    macro-FM  14     19      17
    micro-FM  14     18      18

For each pair of comparisons, whether the Ontology model was better than, worse than, or equal to the Manual model was decided with a predefined fuzziness value introduced by Buckley and Voorhees [5]. This comparison was based on the percentage change calculated by:

    %Change = ((r_Ontology − r_Manual) / r_Manual) × 100%    (25)

where r denoted an experimental result of either MAP, macro-FM, or micro-FM. If the percentage change between two scores was smaller than the fuzziness value, the two scores were recognized as equivalent. In contrast, if the percentage change was greater than the fuzziness value, the out-performance achieved by one over the other was recognized as substantial. In this discussion, the fuzziness value was set at 5%, as suggested by [5] for information retrieval experiments.
As shown in Table 2, in terms of MAP performance the Ontology model was better than the Manual model in 14 topics, worse in 19 topics, and equal to it in 17 topics. Based on Table 2, one can see that the numbers of topics in which the Ontology model was better than, worse than, and equal to the Manual model were very similar across the MAP, macro-FM, and micro-FM results. These results largely agreed with each other, again indicating the similar performance achieved by the Ontology and Manual models.
The user profiles produced by the Ontology model had better user background knowledge coverage than those produced by the Manual model; however, the Manual user profiles had better specification. In the investigation of the experimental results, it was found that the proportional difference of the training set sizes influenced the performance of the experimental models. This is reported by the figures in Table 3, which presents the size comparison of the Ontology and Manual user profiles in terms of MAP performance. The proportional difference was calculated as the average number of documents in the Ontology user

Table 3
Comparison of the size of Ontology and Manual user profiles

                         Avg. # of docs in user profiles    Proportional difference
  Topics                 Ontology          Manual           (Ontology/Manual)
  Ontology won (14)      7848              44               178
  Ontology lost (19)     6423              60               107
  All topics (50)        7610              54               141

profiles divided by that of the Manual user profiles. Because the MAP, macro-FM, and micro-FM results largely agreed with each other, the following discussion uses only the MAP performance, for the sake of simplicity.

In the topics where the Ontology model outperformed the Manual model, the training sets of the Ontology and Manual user profiles differed greatly in size. This is confirmed by Table 3. For the 14 topics in which the Ontology model won, the average size of the Ontology training sets was 7848 documents, 178 times the Manual average of 44. For the 19 topics in which the Ontology model lost, the average size of the Ontology profiles was 6423, only 107 times the Manual average of 60. Over all 50 topics, the average Ontology profile size was 7610 and the average Manual profile size was 54, a proportional difference of 141 times, which again lies between the winning and losing values. From these comparisons, we can conclude that the number of training documents in the Ontology user profiles influenced the performance of the Ontology model.

This influence was caused by the extracted and specified user background knowledge. In the Ontology model, the user background knowledge was extracted from the world knowledge base implemented from the LCSH. The world knowledge base had excellent topic coverage, containing 439,329 topical subjects, 46,136 geographic subjects, and 5785 corporate subjects. Using it, the Ontology model had less chance of missing relevant subjects when extracting user background knowledge. The Ontology model also used data mining techniques to filter out non-interesting concepts and to discover interesting knowledge.
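The proportional differences reported in Table 3 follow directly from the average profile sizes; the short check below (values taken from the table, group labels ours) reproduces the 178, 107, and 141 figures.

```python
# Average numbers of training documents per user profile (Table 3).
# Values are (Ontology average, Manual average) for each topic group.
profile_sizes = {
    "Ontology won (14 topics)":  (7848, 44),
    "Ontology lost (19 topics)": (6423, 60),
    "All topics (50)":           (7610, 54),
}

for group, (ontology, manual) in profile_sizes.items():
    # Proportional difference = Ontology average / Manual average.
    print(f"{group}: {ontology / manual:.0f} times")
```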
As a result, the Ontology model had a large number of training documents in its user profiles (on average 7610 over all topics and


7848 for the topics that performed well). This ensured that more complete background knowledge was extracted and specified for the users.

In the Manual model, the user background knowledge was specified manually. The training documents for each topic were obtained in two steps: a TREC linguist first proposed a topic and searched the RCV1 data set using NIST's search system to retrieve a set of potentially relevant documents; the author of the topic then read the retrieved documents and judged each as positive or negative for relevance to the topic. This procedure ensured that the training documents were judged accurately, but it sacrificed the coverage of the specified user background knowledge. Firstly, the number of documents retrieved from the RCV1 and given to the linguists to read was limited (54 on average). Secondly, a linguist reading a document could only choose between 'positive' and 'negative', restricting the judgements to binary values. If only part of a document's content was relevant, some user background knowledge was missed when the document was judged 'negative', and some noisy concepts were obtained when it was judged 'positive'. Consequently, the Manual model had limited coverage of user background knowledge and unsatisfactory knowledge presentation, which weakened its performance. On the other hand, the user background knowledge contained in the Manual user profiles was manually verified by the users through this acquisition procedure. This is why the Manual model still performed well in comparison with the Ontology model, especially at the lower recall cutoff points plotted in Fig. 6. As the recall cutoff level increased, the performance of the Ontology model dropped slowly compared with that of the Manual model.
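The per-topic win, loss, and tie counts discussed throughout this section come from the fuzziness rule of Eq. (25); a minimal sketch of that rule follows (the function name, and treating a change of exactly the fuzziness value as a tie, are our assumptions).

```python
def compare_with_fuzziness(r_ontology: float, r_manual: float,
                           fuzziness: float = 0.05) -> str:
    """Classify a paired result as 'win', 'loss', or 'tie' for the
    Ontology model, comparing the percentage change of Eq. (25) with
    a fuzziness value (5% by default, following Buckley and Voorhees [5])."""
    pct_change = (r_ontology - r_manual) / r_manual
    if abs(pct_change) <= fuzziness:
        return "tie"   # scores recognized as equivalent
    return "win" if pct_change > 0 else "loss"

# A 4% improvement falls within the fuzziness value, hence a tie;
# a 10% improvement (or drop) counts as substantial.
print(compare_with_fuzziness(0.52, 0.50))  # tie
print(compare_with_fuzziness(0.55, 0.50))  # win
print(compare_with_fuzziness(0.45, 0.50))  # loss
```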
The Manual model also outperformed the Ontology model in measures that prefer precision to recall, such as the average MAP shown in Table 1. Another downside of the Manual user profiles was that their user background knowledge was well formatted for human beings to understand, but not for computational models. As previously discussed, the Manual user profiles were acquired by the TREC linguists reading and judging each training document manually against the topics. The linguists, as the authors of the topics, perfectly understood their information needs and what they were looking for in the training sets. As ordinary Web users, however, they could not


formally specify their background knowledge while acquiring the user profiles. The concepts contained in the Manual user profiles were therefore implicit and difficult for computational models to interpret. The Ontology model, on the other hand, had the extracted user background knowledge formalized. The interesting concepts were explicitly extracted from the world knowledge base and discovered from the user LIRs. In the experiments, on average 2315 subjects were automatically extracted by the Ontology model for each user profile. These subjects were constructed in ontology form and linked by the semantic relations of is-a, part-of, and related-to. Benefiting from this mathematical formalization, the ontology mining method (Specificity) could be performed and more interesting concepts could be discovered. The user background knowledge contained in the Ontology user profiles was thus formally specified and ideal for use by computational models. This partially contributed to the strong performance achieved by the Ontology model in comparison with the Manual model.
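The formalized background knowledge described above, namely subjects linked by is-a, part-of, and related-to relations, might be represented along the following lines. The class name, field names, and subject labels are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Subject:
    """A subject in the personalized ontology, linked to other
    subjects by the typed semantic relations used in the paper."""
    label: str
    is_a: list["Subject"] = field(default_factory=list)        # more general subjects
    part_of: list["Subject"] = field(default_factory=list)
    related_to: list["Subject"] = field(default_factory=list)

def ancestors(s: Subject) -> set[str]:
    """All more-general subject labels reachable through is-a links."""
    out: set[str] = set()
    for parent in s.is_a:
        out.add(parent.label)
        out |= ancestors(parent)
    return out

# A tiny illustrative fragment: one subject is-a another.
espionage = Subject("Espionage")
economic = Subject("Economic espionage", is_a=[espionage])
print(ancestors(economic))  # {'Espionage'}
```

Explicit typed links like these are what make mining operations such as the Specificity method computable, in contrast to the implicit concepts of the Manual profiles.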

7. Conclusions and Future Work

In this paper, a knowledge-based model is proposed for discovering user background knowledge for personalized Web information gathering. The framework of knowledge-based information gathering consists of four models: the user concept model, the user querying model, the computer model, and the ontology model. Given a topic, the computer model uses a world knowledge base to learn an ontology that simulates the user concept model. The ontology is then personalized using the user's local instance repository. To describe user background knowledge more precisely, the semantic relations of is-a, part-of, and related-to are specified in the ontology model. The knowledge-based model was successfully evaluated in comparison with a manually implemented user concept model. The proposed model is a novel contribution to better understanding Web personalization using ontologies and user profiles, and to better designs of personalized Web information gathering systems.

Several avenues of research arise from this paper. Implementing a real-world Web information gathering system using the proposed knowledge-based model will provide a more comprehensive platform on which to test and further enhance our proposal. The user



profile acquisition is also extendable from the routing user profiles of this paper to adaptive user profiles. These are new challenges that will be pursued in our future work.

Acknowledgements

This paper presents work that extends, but goes significantly beyond, an earlier paper [65] published in the proceedings of WI'08. We would like to thank the Library of Congress and the QUT Library for the use of the LCSH and the library catalogs. We would also like to thank the anonymous reviewers for their valuable comments, and M. Carey-Smith for proofreading the English in this paper. The work presented in this paper was partially supported by Grant DP0988007 from the Australian Research Council.

References

[1] M. Abulaish and L. Dey. Information extraction and imprecise query answering from web documents. Web Intelligence and Agent Systems, 4(4):407–429, 2006.
[2] G. Antoniou and F. van Harmelen. A Semantic Web Primer. The MIT Press, 2004.
[3] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic Web. Scientific American, 5:29–37, 2001.
[4] G. E. P. Box, J. S. Hunter, and W. G. Hunter. Statistics for Experimenters. John Wiley & Sons, 2005.
[5] C. Buckley and E. M. Voorhees. Evaluating evaluation measure stability. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 33–40, 2000.
[6] A. Budanitsky and G. Hirst. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47, 2006.
[7] J. Chaffee and S. Gauch. Personal ontologies for Web navigation. In Proceedings of the Ninth International Conference on Information and Knowledge Management, pages 227–234, 2000.
[8] P. A. Chirita, W. Nejdl, R. Paiu, and C. Kohlschütter. Using ODP metadata to personalize search. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 178–185. ACM Press, 2005.
[9] W. C. Cho and D. Richards. Ontology construction and concept reuse with formal concept analysis for improved web document retrieval. Web Intelligence and Agent Systems, 5(1):109–126, 2007.
[10] K.-S. Choi, C.-H. Lee, and P.-K. Rhee. Document ontology based personalized filtering system. In MULTIMEDIA '00: Proceedings of the Eighth ACM International Conference on Multimedia, pages 362–364, New York, NY, USA, 2000. ACM Press.
[11] R. M. Colomb. Information Spaces: The Architecture of Cyberspace. Springer, 2002.
[12] K. Curran, C. Murphy, and S. Annesley. Web intelligence in information retrieval. In Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence, pages 409–412, 2003.
[13] A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, and A. Halevy. Learning to match ontologies on the semantic web. The International Journal on Very Large Data Bases, 12(4):303–319, 2003.
[14] D. Dou, G. Frishkoff, J. Rong, R. Frank, A. Malony, and D. Tucker. Development of neuroelectromagnetic ontologies (NEMO): a framework for mining brainwave ontologies. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 270–279, 2007.
[15] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.
[16] E. Frank and G. W. Paynter. Predicting Library of Congress classifications from Library of Congress subject headings. Journal of the American Society for Information Science and Technology, 55(3):214–227, 2004.
[17] A. Gangemi, N. Guarino, and A. Oltramari. Conceptual analysis of lexical taxonomies: the case of WordNet top-level. In FOIS '01: Proceedings of the International Conference on Formal Ontology in Information Systems, pages 285–296, New York, NY, USA, 2001. ACM Press.
[18] S. Gauch, J. Chaffee, and A. Pretschner. Ontology-based personalized search and browsing. Web Intelligence and Agent Systems, 1(3-4):219–234, 2003.
[19] S. Gauch, J. M. Madrid, and S. Induri. KeyConcept: A conceptual search engine. Technical report, EECS Department, University of Kansas, 2004.
[20] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. W. Flake. Using Web structure for classifying and describing Web pages. In WWW '02: Proceedings of the 11th International Conference on World Wide Web, pages 562–569, New York, NY, USA, 2002. ACM Press.
[21] D. Godoy and A. Amandi. Modeling user interests by conceptual clustering. Information Systems, 31(4):247–265, 2006.
[22] C. Hung, S. Wermter, and P. Smith. Hybrid neural document clustering using guided self-organization and WordNet. IEEE Intelligent Systems, 19(2):68–77, 2004.
[23] B. J. Jansen, A. Spink, J. Bateman, and T. Saracevic. Real life information retrieval: a study of user queries on the web. SIGIR Forum, 32(1):5–17, 1998.
[24] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the 10th International Conference on Research in Computational Linguistics (ROCLING X), pages 19–33, Taiwan, 1997.
[25] X. Jiang and A.-H. Tan. Mining ontological knowledge from domain-specific text documents. In Proceedings of the Fifth IEEE International Conference on Data Mining, pages 665–668, 2005.
[26] W. Jin, R. K. Srihari, H. H. Ho, and X. Wu. Improving knowledge discovery in document collections through combining text retrieval and link analysis techniques. In Proceedings of the Seventh IEEE International Conference on Data Mining, pages 193–202, 2007.
[27] T. Joachims. Transductive inference for text classification using support vector machines. In I. Bratko and S. Dzeroski, editors, Proceedings of ICML-99, 16th International


Conference on Machine Learning, pages 200–209, Bled, SL, 1999. Morgan Kaufmann Publishers, San Francisco, US.
[28] I. Kaur and A. J. Hornof. A comparison of LSA, WordNet and PMI-IR for predicting user click behavior. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 51–60, New York, USA, 2005. ACM Press.
[29] L. Khan and F. Luo. Ontology construction for information selection. In Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2002), pages 122–127, 2002.
[30] L. Khan, D. McLeod, and E. Hovy. Retrieval effectiveness of an ontology-based model for information selection. The International Journal on Very Large Data Bases, 13(1):71–85, 2004.
[31] J. D. King, Y. Li, X. Tao, and R. Nayak. Mining world knowledge for analysis of search engine content. Web Intelligence and Agent Systems, 5(3):233–253, 2007.
[32] K. Knight and S. K. Luk. Building a large-scale knowledge base for machine translation. In AAAI '94: Proceedings of the Twelfth National Conference on Artificial Intelligence (vol. 1), pages 773–778, Menlo Park, CA, USA, 1994. American Association for Artificial Intelligence.
[33] H. Kornilakis, M. Grigoriadou, K. A. Papanikolaou, and E. Gouli. Using WordNet to support interactive concept map construction. In Proceedings of the IEEE International Conference on Advanced Learning Technologies, pages 600–604, 2004.
[34] R. Kosala and H. Blockeel. Web mining research: A survey. ACM SIGKDD Explorations Newsletter, 2(1):1–15, 2000.
[35] K. S. Lee, W. B. Croft, and J. Allan. A cluster-based resampling method for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 235–242, 2008.
[36] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[37] X. Li and B. Liu. Learning to classify texts using positive and unlabeled data. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 587–594, 2003.
[38] Y. Li, S.-T. Wu, and X. Tao. Effective pattern taxonomy mining in text documents. In CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 1509–1510, New York, NY, USA, 2008. ACM.
[39] Y. Li, W. Yang, and Y. Xu. Multi-tier granule mining for representations of multidimensional association rules. In Proceedings of the Sixth IEEE International Conference on Data Mining, pages 953–958, 2006.
[40] Y. Li and N. Zhong. Capturing evolving patterns for ontology-based web mining. In Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, pages 256–263, Washington, DC, USA, 2004. IEEE Computer Society.
[41] Y. Li and N. Zhong. Web mining model and its applications for information gathering. Knowledge-Based Systems, 17:207–217, 2004.
[42] Y. Li and N. Zhong. Mining ontology for automatically acquiring Web user information needs. IEEE Transactions on Knowledge and Data Engineering, 18(4):554–568, 2006.
[43] Library of Congress. MARC 21 concise format for bibliographic data, 1999 Edition, Update No. 1 (October 2001) through Update No. 8 (October 2007). Washington, D.C.: Library of Congress, 2007.
[44] S.-Y. Lim, M.-H. Song, K.-J. Son, and S.-J. Lee. Domain ontology construction based on semantic relation information of terminology. In Proceedings of the 30th Annual Conference of the IEEE Industrial Electronics Society, volume 3, pages 2213–2217, 2004.
[45] B. Liu. Web content mining. Tutorial given at WWW-2005 and WISE-2005, 2005.
[46] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003), pages 179–186, 2003.
[47] S. Liu, F. Liu, C. Yu, and W. Meng. An effective approach to document retrieval via utilizing WordNet and recognizing phrases. In SIGIR '04: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 266–272, New York, NY, USA, 2004. ACM Press.
[48] Z. Ma, G. Pant, and O. R. L. Sheng. Interest-based personalized search. ACM Transactions on Information Systems (TOIS), 25(1):5, 2007.
[49] A. Maedche and S. Staab. Ontology learning for the Semantic Web. IEEE Intelligent Systems, 16(2):72–79, 2001.
[50] M. E. Maron. Probabilistic approaches to the document retrieval problem. In Proceedings of the 5th Annual ACM Conference on Research and Development in Information Retrieval, pages 98–107. Springer-Verlag, West Berlin, Germany, 1982.
[51] R. Navigli, P. Velardi, and A. Gangemi. Ontology learning and its application to automated terminology translation. IEEE Intelligent Systems, 18:22–31, 2003.
[52] G. Qiu, K. Liu, J. Bu, C. Chen, and Z. Kang. Quantify query ambiguity using ODP metadata. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 697–698, New York, NY, USA, 2007. ACM Press.
[53] P. Resnik. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity and natural language. Journal of Artificial Intelligence Research, 11:95–130, 1999.
[54] I. Rish. An empirical study of the naïve Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 2001.
[55] S. E. Robertson and I. Soboroff. The TREC 2002 filtering track report. In Text REtrieval Conference, 2002.
[56] J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ, 1971.
[57] M. Ruiz-Casado, E. Alfonseca, and P. Castells. Automatising the learning of lexical patterns: An application to the enrichment of WordNet by extracting semantic relationships from Wikipedia. Data & Knowledge Engineering, 61(3):484–499, 2007.
[58] S. Schocken and R. A. Hummel. On the use of the Dempster-Shafer model in information indexing and retrieval applications. International Journal of Man-Machine Studies, 39:843–879, 1993.
[59] M. D. Smucker, J. Allan, and B. Carterette. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages



623–632, 2007.
[60] J. Song, W. Zhang, W. Xiao, G. Li, and Z. Xu. Ontology-based information retrieval model for the Semantic Web. In Proceedings of the 2005 IEEE International Conference on e-Technology, e-Commerce and e-Service (EEE '05), pages 152–155, 2005.
[61] A. Spink, D. Wolfram, M. B. J. Jansen, and T. Saracevic. Searching the Web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3):226–234, 2001.
[62] J. Srivastava, P. Desikan, and V. Kumar. Web mining: Accomplishments and future directions. In Proceedings of the US National Science Foundation Workshop on Next-Generation Data Mining (NGDM), 2002.
[63] X. Tao, Y. Li, N. Zhong, and R. Nayak. Ontology mining for personalized web information gathering. In Proceedings of WI '07, pages 351–358, 2007.
[64] X. Tao, Y. Li, N. Zhong, and R. Nayak. A knowledge retrieval model using ontology mining and user profiling. Integrated Computer-Aided Engineering, 15(4):313–329, 2008.
[65] X. Tao, Y. Li, N. Zhong, and R. Nayak. An ontology-based framework for knowledge retrieval. In Proceedings of WI '08, 2008.
[66] T. Tran, P. Cimiano, S. Rudolph, and R. Studer. Ontology-based interpretation of keywords for semantic search. In Proceedings of the 6th International Semantic Web Conference, pages 523–536, 2007.
[67] TREC. Common evaluation measures. In The Eleventh Text REtrieval Conference (TREC 2002), 2002.
[68] F. van Harmelen. The semantic web: what, why, how, and when. IEEE Distributed Systems Online, 5(3), 2004.
[69] C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979.
[70] G. Varelas, E. Voutsakis, P. Raftopoulou, E. G. M. Petrakis, and E. E. Milios. Semantic similarity methods in WordNet and their application to information retrieval on the Web. In WIDM '05: Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management, pages 10–16, New York, NY, USA, 2005. ACM Press.
[71] P. Velardi, P. Fabriani, and M. Missikoff. Using text processing techniques to automatically enrich a domain ontology. In FOIS '01: Proceedings of the International Conference on Formal Ontology in Information Systems, pages 270–284, New York, NY, USA, 2001. ACM Press.
[72] E. M. Voorhees. Using WordNet to disambiguate word senses for text retrieval. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 171–180. ACM Press, 1993.
[73] E. M. Voorhees. Overview of TREC 2002. In The Text REtrieval Conference (TREC), 2002. Retrieved from: http://trec.nist.gov/pubs/trec11/papers/OVERVIEW.11.pdf and http://trec.nist.gov/pubs/trec11/appendices/MEASURES.pdf.
[74] S. Vrettos and A. Stafylopatis. A fuzzy rule-based agent for Web retrieval-filtering. In WI '01: Proceedings of the First Asia-Pacific Conference on Web Intelligence: Research and Development, pages 448–453, London, UK, 2001. Springer-Verlag.
[75] J. Wang and M. C. Lee. Reconstructing DDC for interactive classification. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 137–146, New York, NY, USA, 2007. ACM.
[76] C.-P. Wei, R. H. L. Chiang, and C.-C. Wu. Accommodating individual preferences in the categorization of documents: A personalized clustering approach. Journal of Management Information Systems, 23(2):173–201, 2006.
[77] S.-T. Wu. Knowledge Discovery Using Pattern Taxonomy Model in Text Mining. PhD thesis, Faculty of Information Technology, Queensland University of Technology, 2007.
[78] Y. Xu and Y. Li. Generating concise association rules. In CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 781–790, New York, NY, USA, 2007. ACM.
[79] W. Yang, Y. Li, J. Wu, and Y. Xu. Granule mining oriented data warehousing model for representations of multidimensional association rules. International Journal of Intelligent Information and Database Systems, 2(1):125–145, 2008.
[80] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 42–49. ACM Press, 1999.
[81] Y. Y. Yao, Y. Zeng, N. Zhong, and X. Huang. Knowledge retrieval (KR). In Proceedings of WI '07, pages 729–735, 2007.
[82] Z. Yu, Z. Zheng, S. Gao, and J. Guo. Personalized information recommendation in digital library domain based on ontology. In IEEE International Symposium on Communications and Information Technology (ISCIT 2005), volume 2, pages 1249–1252, 2005.
[83] N. Zhong. Toward Web Intelligence. In Proceedings of the 1st International Atlantic Web Intelligence Conference, pages 1–14, 2003.
[84] N. Zhong and N. Hayazaki. Roles of ontologies for web intelligence. In Foundations of Intelligent Systems: 13th International Symposium, ISMIS 2002, volume 2366, page 55, Lyon, France, 2002.
[85] C. Zhou, D. Frankowski, P. Ludford, S. Shekhar, and L. Terveen. Discovering personally meaningful places: An interactive clustering approach. ACM Transactions on Information Systems (TOIS), 25(3):12, 2007.