Ontology-Based Automatic Classification for the Web Pages: Design, Implementation and Evaluation Rudy Prabowo, Mike Jackson, Peter Burden School of Computing and IT University of Wolverhampton 35-49 Lichfied Street, Wolverhampton WV1 1EL, United Kingdom {Rudy.Prabowo,mj,jphb}@wlv.ac.uk
Heinz-Dieter Knoell University of Applied Sciences Fachhochschule Nordostniedersachsen Volgershall 1 21339 Lueneburg, Germany
[email protected]
Abstract

In recent years, we have witnessed the continual growth in the use of ontologies in order to provide a mechanism to enable machine reasoning. This paper describes an automatic classifier which focuses on the use of ontologies for classifying Web pages with respect to the Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) schemes. Firstly, we explain how these ontologies can be built in a modular fashion and mapped into DDC and LCC. Secondly, we propose formal definitions of a DDC-LCC mapping and an ontology-classification-scheme mapping. Thirdly, we explain the way the classifier uses these ontologies to assist classification. Finally, an experiment in which the accuracy of the classifier was evaluated is presented. The experiment shows that our approach results in improved classification accuracy. This improvement, however, comes at the cost of a low coverage ratio due to the incompleteness of the ontologies used.

1. Introduction

In order to organise Web pages and to assist Web users to retrieve only information relevant to their query, manual classification is carried out by some search engines, e.g. Yahoo! [26]. Due to the increasing number of Web pages, it is impossible to manually classify the entire Web without some form of automated aid [14]. For this reason, automatic document classification has become an important research area. This paper describes a strategy used to enhance the accuracy of an automatic classifier, called the Automatic Classification Engine (ACE) [2]. The enhancement focuses on the use of ontologies. Related work on the use of ontologies is discussed in section 2. We then propose a method for building a set of ontologies with respect to DDC [20] and LCC [11], and propose formal definitions for a DDC-LCC mapping and an ontology-classification-scheme mapping, in section 3. The way the modified ACE works and makes use of these ontologies is described in section 4. Section 5 presents an experiment which was conducted to evaluate the classification accuracy. Finally, conclusions are drawn and future work is outlined in section 6.

This paper uses terms which are defined as follows. The term "ontology" is defined as a single entity which holds conceptual instances of a domain and differentiates itself from other ontologies; in other words, the conceptual instances represent the existence of their associated ontology. A conceptual instance can be a concept, a terminology or a gestalt instance. The term "concept" is defined as something that represents the idea of a thing, and can be differentiated from another by defining its properties [17]. The term "terminology" (in the plural form: "terminological resources") is defined as a structured, organised set of concepts in a particular area of specialist knowledge [10]. The term "gestalt instance" is defined as a term which is used to represent a whole rather than its parts. In addition, the term "class representative" is used to denote a class in a classification scheme hierarchy [2].
2. Related Works Our work is related to two different research areas, i.e. ontology-based applications and automatic document classification.
2.1 Ontology-Based Applications This section describes two related projects which are based on the use of ontologies to extract and access information within the Web. The first is a research project which has developed an information extraction system, called WEB->KB, in order to construct knowledge bases from the World Wide Web (WWW) [12]. The second project is one which sets out to provide and access information at a Web portal, called SEAL (SEmantic portAL) [1]. There is a symbiotic relationship between WEB->KB and SEAL: WEB->KB provides a means to construct knowledge bases which SEAL can use to satisfy a user query, while SEAL bridges the gap between the knowledge bases built by WEB->KB and a Web user, and provides the Web user with information access.
2.2 Automatic Document Classification This section describes work which has been carried out by researchers in the area of automatic document classification, and the key differences between our work and other work. In their experiment, [15] proposed an approach to automatic classification. The aim of their research was to automatically classify research project descriptions into a manually pre-defined set of subject headings. They applied three techniques sequentially, i.e. natural language processing, multinomial discriminant analysis [13] and an expert system technique, in order to achieve a high degree of classification accuracy. To carry out an automatic classification of 173,255 Wall Street Journal documents, [6] conducted an experiment which used natural language techniques to carry out morphological, syntactic and semantic analysis of the texts. They analysed and
Proceedings of the 3rd International Conference on Web Information Systems Engineering (WISE’02) 0-7695-1766-8/02 $17.00 © 2002 IEEE
compared the texts found within the document samples against a machine-readable dictionary, called Longman’s Dictionary of Contemporary English.
Our work is different from other work in that we focus on (1) the use of domain ontologies, rather than a dictionary or thesaurus, to assist classification; (2) building ontologies based on the text information within the classification schemes and Web pages; (3) the mapping between a set of ontologies and the classification schemes. We also restrict our attention to domain ontologies, especially those which can be mapped into existing classification schemes. The reason behind this restriction is that domain ontologies use a set of terminological resources which can be used to capture the knowledge that is valid for a particular type of domain [5]. Hence, one can use them to reduce ambiguity and maintain the modularity of the domain ontologies. The advantages of an ontology-based classification approach over existing ones, such as the hierarchical [24] and probabilistic [18] approaches, are that (1) the relational structure of an ontology provides a mechanism to enable machine reasoning; (2) the conceptual instances within an ontology are not merely a bag of keywords, but have inherent semantics and a close relationship with the class representatives of the classification schemes, and hence can be mapped to each other (see section 3); (3) this kind of mapping provides a new way to measure the similarity between a Web page and a class representative. It also enables us to gain insight into the way the classifier assigns a class representative to a Web page, by tracking the links between the conceptual instances involved and the associated class representative (see section 4).

3. Building Domain Ontologies

This section describes our experience in building a set of domain ontologies from scratch with respect to DDC and LCC class representatives, and the way these domain ontologies are mapped into the associated DDC and LCC class representatives.

3.1 Building the ACE Database

There are four steps necessary for building the ACE database. The first step is to define the mapping between DDC and LCC class representatives, and to manually build a classification scheme ontology using a web-based representation language, called OIL [8]. There is a difference in the way these two classification schemes are organised: LCC is organised by discipline, whereas DDC is organised by both subject and discipline. A thorough discussion of these two classification schemes can be found in [25]. Due to the different strategies and naming conventions used to organise and name the class representatives, mapping these two schemes poses a special challenge. The first and second columns of Table 1 depict a part of the class representatives in the area of social sciences and economics with respect to DDC and the associated LCC. Section 3.2 discusses a proposed solution to this mapping.

DDC | LCC | Shared Classes (SC) | Ontology
300 Social Sciences | H 1-99 Social Sciences in general | sc-social-sciences | social sciences
330 Economics | H-HB 1-3840 Economic theory. Demography | sc-economics, sc-economic-systems, sc-economic-theory, sc-demography | economics
330 Economics | H-HC 10-1085 Economic history and conditions | sc-economic-history, sc-economic-conditions | economics
331 Labor Economics | H-HD 4801-8943 Labor | sc-labor-economics | economics
335 Socialism & related systems | H-HX 1-970.7 Socialism. Communism. Anarchism | sc-socialism, sc-communism, sc-anarchism, sc-facism | political beliefs

Table 1 A part of the class representatives in the area of social sciences and economics with respect to DDC and LCC.

The second step is to define an ontology which contains a set of classes, called "shared classes". The key idea in using a shared class is to define a common class representative which can represent a part of a class representative given a classification scheme, and give an "anchor" for the associated conceptual instances. Table 1 (third column) depicts the shared classes. The prefix "sc" is used to differentiate a shared class from the associated conceptual instance when it has the same marker. For example, "sc-socialism" represents a part of these two class representatives: "DDC 335 Socialism and related systems" and "LCC H-HX Socialism Communism and Anarchism", and links the class representatives with the associated conceptual instance, "socialism".

The third step is to manually build domain ontologies which are related to the class representatives. As a starting point, DDC and LCC vocabularies are used for defining the markers of conceptual instances within an ontology. The way these conceptual instances are organised, however, is not dependent on the way a classification scheme organises its class representatives. The fact that "DDC 330 Economics" is a subclass of "DDC 300 Social Sciences" does not imply that an ontology about "social sciences" should contain a conceptual instance called "economics". Instead, two ontologies are created: "social sciences" and "economics". Hence, we can maintain the modularity of these ontologies.
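As an illustration only (not the authors' data structures), the linkage described above between conceptual instances, shared classes and DDC/LCC class representatives could be held as a pair of lookup tables. The entries below are taken from Table 1; the function name and dict layout are our own:

```python
# Illustrative sketch of the Table 1 mapping: shared classes anchor
# conceptual instances (ontologies) to DDC/LCC class representatives.

# shared class -> (DDC class representative, LCC class representative)
SHARED_CLASSES = {
    "sc-social-sciences": ("300 Social Sciences",
                           "H 1-99 Social Sciences in general"),
    "sc-economics":       ("330 Economics",
                           "H-HB 1-3840 Economic theory. Demography"),
    "sc-labor-economics": ("331 Labor Economics",
                           "H-HD 4801-8943 Labor"),
    "sc-socialism":       ("335 Socialism & related systems",
                           "H-HX 1-970.7 Socialism. Communism. Anarchism"),
}

# conceptual instance (ontology) -> the shared classes it anchors to
ONTOLOGY_ANCHORS = {
    "social sciences":   ["sc-social-sciences"],
    "economics":         ["sc-economics", "sc-labor-economics"],
    "political beliefs": ["sc-socialism"],
}

def class_representatives(instance):
    """Resolve a conceptual instance to its DDC/LCC class representatives
    by following its shared-class anchors."""
    return [SHARED_CLASSES[sc] for sc in ONTOLOGY_ANCHORS.get(instance, [])]
```

For example, resolving "political beliefs" follows "sc-socialism" to the DDC 335 / LCC H-HX pair.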
This is particularly important if we intend to refine these ontologies in more detail and map them into other classification schemes. In contrast, making an ontology too specific might degrade its completeness in representing the conceptualisation of a domain. Another aspect of building an ontology is to define a gestalt instance that can cover specific ones. For example, one might want to know to which concept "socialism" belongs. In this context, a new gestalt instance is defined, i.e. "political beliefs". In other words, the abstraction
in this third step is initiated by the way a classification scheme is organised, and is further completed and refined by human judgement. Table 1 (fourth column) depicts the related ontologies. The last step is to use the Fast Classification of Terminologies (FaCT) reasoner [8] in order to detect redundancy in naming a concept marker, and to validate the "is-a", transitive and other relationships between the conceptual instances within the ACE database. In addition to OIL, RDF is adopted to express the ontologies [19]. Although OIL does not need RDF to define ontologies, the merging of OIL and RDF establishes a better foundation for expressing ontologies. From the RDF point of view, OIL enriches the expressive power of RDF in representing ontologies. From the OIL point of view, the RDF schema [3] provides a mechanism for expressing ontologies that can be understood by many Web participants. A thorough discussion of the merging of OIL and RDF/RDFS can be found in [21]. To maintain the scalability of the domain ontologies with respect to DDC and LCC, the number of conceptual instances within a domain ontology is restricted to between 100 and 200. The number of domain ontologies related to a set of second-level DDC class representatives ranges from 20 to 25. Since DDC has 10 main first-level class representatives, the number of conceptual instances within the ACE database with respect to DDC is expected to be between 20,000 and 50,000. The maximum size of a domain ontology is chosen so as to enable easy maintenance, storage and dynamic access.
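The size estimate above follows directly from the stated bounds and can be checked with a few lines of arithmetic:

```python
# Sanity check of the scalability estimate in the text: 10 DDC first-level
# classes, 20-25 domain ontologies per class, and 100-200 conceptual
# instances per ontology give 20,000-50,000 instances in total.
first_level_classes = 10
ontologies_per_class = (20, 25)
instances_per_ontology = (100, 200)

lo = first_level_classes * ontologies_per_class[0] * instances_per_ontology[0]
hi = first_level_classes * ontologies_per_class[1] * instances_per_ontology[1]
print(lo, hi)  # 20000 50000
```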
3.2 Formal Definitions and Issues of Mapping This section proposes formal definitions of mapping between two different classification schemes, and between a classification scheme and an ontology.
3.2.1 Formal Definition of a Classification Scheme Let CRΧ,Α be a class representative, Χ, which belongs to the classification scheme, Α, and is a direct superclass of a set of subclass representatives, SCRΧ,Α.

SCRΧ,Α = ∪ SCRΧ,Α,i ⊂ CRΧ,Α, where i = 1..n (1)

Let also SCRΧ,Α,i be a class representative which covers a set of subclass representatives, ΩΧ,Α,i, and let each element of ΩΧ,Α,i be a direct (or indirect) subclass of SCRΧ,Α,i.

ΩΧ,Α,i = ∪ ΩΧ,Α,i,k ⊂ SCRΧ,Α,i, where k = 1..u (2)

To guarantee the consistency of the subsume condition of a set of class representatives, a transitive condition is defined as follows:

ΩΧ,Α ⊂ CRΧ,Α ⇔ (ΩΧ,Α ⊂ SCRΧ,Α) ∧ (SCRΧ,Α ⊂ CRΧ,Α) (3)

Analogous to (1), (2) and (3), let CRΥ,Β be a class representative, Υ, which belongs to a classification scheme, Β, and is a direct superclass of a set of subclass representatives, SCRΥ,Β, which cover a set of subclass representatives, ΩΥ,Β; let each element of ΩΥ,Β,j be a direct (or indirect) subclass of SCRΥ,Β,j, where j ∈ {1,...,m}.

SCRΥ,Β = ∪ SCRΥ,Β,j ⊂ CRΥ,Β, where j = 1..m (4)

ΩΥ,Β,j = ∪ ΩΥ,Β,j,l ⊂ SCRΥ,Β,j, where l = 1..v (5)

ΩΥ,Β ⊂ CRΥ,Β ⇔ (ΩΥ,Β ⊂ SCRΥ,Β) ∧ (SCRΥ,Β ⊂ CRΥ,Β) (6)

Let us suppose CRΧ,Α represents a set of class
representatives which should be mapped into CRΥ,Β. The next two sections describe the formal definition of a full and partial mapping of CRΧ,Α into CRΥ,Β. "≡" is used as a notation for "full mapping", and "≅" for "partial mapping". "A → B" means "A can be mapped into B".
3.2.2 Formal Definition of Full Mapping CRΧ,Α can be fully mapped into CRΥ,Β (CRΧ,Α ≡ CRΥ,Β) if the following three mapping conditions are met.
(I) CRΧ,Α can be mapped into CRΥ,Β.
[CRΧ,Α → CRΥ,Β]
(II) For all elements of SCRΧ,Α ⊂ CRΧ,Α and SCRΥ,Β ⊂ CRΥ,Β, there is a one-to-one mapping between the elements of SCRΧ,Α and SCRΥ,Β.
∀(SCRΧ,Α ⊂ CRΧ,Α ∧ SCRΥ,Β ⊂ CRΥ,Β) [SCRΧ,Α,i → SCRΥ,Β,j], where i ∈ {1,...,n} and j ∈ {1,...,m}.
(III) For all elements of ΩΧ,Α,i ⊂ SCRΧ,Α,i and ΩΥ,Β,j ⊂ SCRΥ,Β,j, there is a one-to-one mapping between the elements of ΩΧ,Α,i and ΩΥ,Β,j.
∀(ΩΧ,Α,i ⊂ SCRΧ,Α,i ∧ ΩΥ,Β,j ⊂ SCRΥ,Β,j) [ΩΧ,Α,i,k → ΩΥ,Β,j,l], where k ∈ {1,...,u} and l ∈ {1,...,v}.
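The three full-mapping conditions can be sketched as a set-based check. This is our own illustration, not the authors' implementation; `mapping` is a hypothetical dict from elements of scheme Α to elements of scheme Β:

```python
# Sketch of the full-mapping test of section 3.2.2: CR_X,A can be fully
# mapped into CR_Y,B only if the top-level class representatives map to
# each other (I) and the subclass representative sets are in one-to-one
# correspondence (II). Condition (III) would repeat the same check one
# level further down, so it is omitted here for brevity.
def fully_mapped(cr_a, scr_a, cr_b, scr_b, mapping):
    # (I) the top-level class representatives map to each other
    if mapping.get(cr_a) != cr_b:
        return False
    # (II) one-to-one mapping between the subclass representative sets
    images = {mapping.get(s) for s in scr_a}
    return (None not in images
            and images == set(scr_b)
            and len(images) == len(scr_a))
```

For example, with `mapping = {"300": "H", "330": "H-HB", "331": "H-HD"}`, `fully_mapped("300", ["330", "331"], "H", ["H-HB", "H-HD"], mapping)` holds, while dropping one target subclass breaks the one-to-one requirement.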
3.2.3 Formal Definition of Partial Mapping CRΧ,Α can be partially mapped into CRΥ,Β (CRΧ,Α ≅ CRΥ,Β) if the following condition is met: there is at least one element of SCRΧ,Α which can be mapped into CRΥ,Β, and at least one element of ΩΧ,Α which can be mapped into either SCRΥ,Β or ΩΥ,Β.

∀(SCRΧ,Α ⊂ CRΧ,Α ∧ ΩΧ,Α ⊂ SCRΧ,Α) [∃SCRΧ,Α,i → CRΥ,Β] ∧ [(∃ΩΧ,Α,i,k → SCRΥ,Β,j) ∨ (∃ΩΧ,Α,i,k → ΩΥ,Β,j,l)], where i ∈ {1,...,n}, j ∈ {1,...,m}, k ∈ {1,...,u}, l ∈ {1,...,v}.

This formal definition, however, does not explicitly exclude the elements of SCRΧ,Α and ΩΧ,Α which are irrelevant to CRΥ,Β. For this reason, we refine the partial mapping definition based on the use of shared classes (discussed in section 3.1). Based on the equivalences (1)-(2) and (4)-(5), ЅCΧ,Α and ЅCΥ,Β are defined as follows:

ЅCΧ,Α = CRΧ,Α ∪ [∪ (SCRΧ,Α,i, ΩΧ,Α,i,k)], where i = 1..n and k = 1..u (7)

ЅCΥ,Β = CRΥ,Β ∪ [∪ (SCRΥ,Β,j, ΩΥ,Β,j,l)], where j = 1..m and l = 1..v (8)

Based on (7) and (8), ЅCΧ,Υ is defined as follows:

ЅCΧ,Υ = ЅCΧ,Α ∪ ЅCΥ,Β (9)

Equivalence (9) states that a set of shared classes is composed of a number of class representatives of the two classification schemes, Α and Β. In order to have a reasonable partial mapping, the irrelevant elements of the shared classes have to be excluded. Let ЅC′Χ,Υ be the set of shared classes which are relevant to assist partial mapping. Based on (7) and (8), ЅC′Χ,Υ is defined as follows:

ЅC′Χ,Υ = (ЅCΧ,Α ∩ ЅCΥ,Β) = {SC′1,..., SC′w} (10)

Hence, the partial mapping definition can be refined as follows: there is at least one element of ЅC′Χ,Υ which represents an element of SCRΧ,Α and can be mapped into CRΥ,Β, and at least one which represents an element of ΩΧ,Α and can be mapped into either SCRΥ,Β or ΩΥ,Β.

∀ЅC′Χ,Υ [(∃SC′p = ЅCRΧ,Α,i) → (CRΥ,Β ∈ ЅC′Χ,Υ)] ∧ [(∃SC′p = ΩΧ,Α,i,k) → ((ЅCRΥ,Β,j ∨ ΩΥ,Β,j,l) ∈ ЅC′Χ,Υ)]
where p ∈ {1,...,w}, i ∈ {1,...,n}, j ∈ {1,...,m}, k ∈ {1,...,u}, l ∈ {1,...,v}.
Based on equivalence (10) and the mapping definition, one can map a number of classification schemes via a set of shared classes without depending on the peculiarities of the structure of the classification schemes involved. To guarantee the consistency of abstraction, i.e. the hierarchical structure of the class representatives involved, two partial mapping conditions are described as follows:
(I) [(∃SC′p = ЅCRΧ,Α,i) → (CRΥ,Β ∈ ЅC′Χ,Υ)] ⇔ [SC′p ⊆ CRΥ,Β ∈ ЅC′Χ,Υ]
(II) [(∃SC′p = ΩΧ,Α,i,k) → ((ЅCRΥ,Β,j ∨ ΩΥ,Β,j,l) ∈ ЅC′Χ,Υ)] ⇔ [SC′p ⊆ (ЅCRΥ,Β,j ∨ ΩΥ,Β,j,l) ∈ ЅC′Χ,Υ]
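Computationally, equivalence (10) is a plain set intersection. The sketch below (our own illustration) uses the shared-class sets from the DDC 306 example in section 3.2.4:

```python
# Equivalence (10): the shared classes relevant to partial mapping are the
# intersection of the two schemes' shared-class sets. The data is taken
# from the "DDC 306 Culture and Institutions" example in section 3.2.4.
sc_ddc_306 = {"political institutions", "economic institutions",
              "religious institutions", "institutions pertaining to death",
              "institutions pertaining to relations of the sexes"}
sc_lcc = {"public administration", "political institutions"}

sc_relevant = sc_ddc_306 & sc_lcc          # the set SC' in the text
partially_mappable = len(sc_relevant) > 0  # at least one relevant shared class
print(sc_relevant)  # {'political institutions'}
```

The intersection excludes the irrelevant shared classes automatically, which is exactly the refinement the text motivates.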
3.2.4 Two Mapping Issues

To illustrate the first issue, an example of mapping with respect to DDC and LCC is described. Let us use "DDC 306 Culture and Institutions" as an example, and concentrate only on the associated LCC class representatives which are semantically identical to one of the two topics of DDC 306, i.e. "institutions". Institutions in DDC 306 cover political, economic and religious institutions, and all institutions which are pertinent to death and relations of the sexes. The associated LCC class representative for "political institutions" is "J-JF(20)-2112 Political institutions and public administration". Associated LCC class representatives for the other types of institutions do not exist. Recall that LCC is a classification scheme which is organised by discipline. The word "institutions", however, refers to a broad subject, rather than a specific discipline. An attempt to fully map "institutions" to any related LCC class representatives will result in loss of precision. Therefore, the formal definition of partial mapping is applied. By applying equivalences (7), (8) and (9), we obtain the shared classes for "DDC 306 Culture and Institutions" and the associated LCC class representatives: ЅC306-pertaining-to-institutions,DDC = {political institutions, economic institutions, religious institutions, institutions pertaining to death, institutions pertaining to relations of the sexes}; ЅCassociated-classes,LCC = {public administration, political institutions}; ЅC = ЅC306-pertaining-to-institutions,DDC ∪ ЅCassociated-classes,LCC. By applying equivalence (10), we exclude all shared classes which are irrelevant to the mapping: ЅC′ = (ЅC306-pertaining-to-institutions,DDC ∩ ЅCassociated-classes,LCC) = {political institutions}. This means that the LCC class representative can be partially mapped into DDC 306, based on the fact that there is one element of ЅC′, i.e. "political institutions", which is semantically identical to one element of both ЅC306-pertaining-to-institutions,DDC and ЅCassociated-classes,LCC.

The second issue is concerned with the abstraction of a class representative. For example, "DDC 306.1 Religious institutions" and "LCC B-BL-630-(632.5) Religious organizations" refer to the same meaning, i.e. a group of people who share the same interest or purpose. An attempt to map these two class representatives via a shared class would, however, result in loss of precision due to the different meanings of "an institution" and "an organization": "an institution" is not only an organization, but also has influence in the community. For this reason, a partial mapping is not allowed, due to the fact that "an institution" is semantically different to "an organization", unless it is explicitly stated in the classification schemes involved that "institution" is a subclass of "organization".

3.2.5 Ontology-Classification-Scheme Mapping

An ontology is defined as a sign system Ο = (ℒ, ℱ, Ǥ, ℂ, ℋ, ℛ, Å). A complete description of the formal definition of an ontology can be found in [1]. In this paper, we present the part of this definition which is used throughout this paper, i.e. a set of concepts, ℂ, and a taxonomy, ℋ. For each C ∈ ℂ, there is at least one statement in the ontology, i.e. its embedding in the taxonomy. Concepts are taxonomically related by the irreflexive, acyclic, transitive relation ℋ ⊂ ℂ × ℂ. ℋ(C1, C2) means C1 is a subconcept (or subclass) of C2;
a set of binary relations, ℛ. The formal definition of an ontology-class-representative mapping via a set of shared classes, ЅC′, is described as follows: an ontology, Ο, can be mapped into a class representative, CRi, if and only if
(I) there is at least one element, Ci ∈ ℂ, which refers to one element of ЅC′ ∈ CRi
[(∃Ci ∈ ℂ) → (ЅC′ ∈ CRi)]
(II) there is at least one element, Cj ∈ ℂ, where Ci is a subclass of Cj, and ℋ(Ci, Cj) refers to the hierarchical relationship of CRi and its direct superclass, CRj.
[(∃Cj ∈ ℂ) ∧ ℋ(Ci, Cj) → ℋ(CRi, CRj)]
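Conditions (I) and (II) can be sketched as follows. This is an illustrative reconstruction with hypothetical data structures, not the authors' implementation: `refers_to` maps a concept to the shared class it refers to, and `shared_classes_of` maps a class representative to its shared classes:

```python
# Sketch of the ontology -> class-representative mapping test of section
# 3.2.5: (I) some concept Ci refers to a shared class of CR_i, and (II) Ci
# has a superconcept Cj whose shared-class reference mirrors the
# CR_i -> CR_j hierarchy. All names here are illustrative.
def ontology_maps_to(concepts, superconcept, refers_to, shared_classes_of,
                     cr_i, cr_parent):
    for ci in concepts:
        # (I) Ci refers to one of CR_i's shared classes
        if refers_to.get(ci) in shared_classes_of.get(cr_i, set()):
            # (II) Ci's superconcept Cj mirrors CR_i's direct superclass
            cj = superconcept.get(ci)
            if cj and refers_to.get(cj) in shared_classes_of.get(cr_parent, set()):
                return True
    return False
```

For instance, with "socialism" anchored to "sc-socialism" (a shared class of DDC 335) and its superconcept "political beliefs" anchored to a shared class of DDC 300, the check succeeds because the concept hierarchy mirrors the class hierarchy.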
4. The Way the ACE Works This section discusses the implementation of ACE. Section 4.1 describes the dynamic term table used to store the terms found within a Web page, and how the domain ontologies and the DDC and LCC schemes are dynamically represented. The way ACE carries out classification is discussed in section 4.2.
4.1 Initialisation At start-up time, ACE creates a dynamic table, called the "term table", which consists of four columns: name, weight, tag, and position. It is used to store the terms found within a Web page. The first column, "name", stores the name of a term; the second and third columns, "weight" and "tag", store the maximum weight of a term and the tag within which the term occurs. How a weight is obtained is discussed in section 4.2. The last column stores the position of a term, which is important for determining a phrase. ACE then validates the syntax of the domain ontologies and generates a set of triples [19] by using RDF-API [23], which contains an RDF parser. Based on these triples, ACE builds a semantic network [22] to represent the domain ontologies. To represent the relationships between the conceptual instances, shared classes and class representatives, and to carry out the classification process, a feed-forward neural network [22] is built. A feed-forward neural network is a neural network in which each unit is linked only to units in the next layer.
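The four-column term table can be sketched as a simple record type. The field names follow the text; the update rule (keep the maximum weight and its tag) is our own reading of the description, and the class layout is illustrative:

```python
# Minimal sketch of the ACE "term table": name, weight, tag and position
# columns. The update keeps the maximum weight (and the tag in which it
# occurred) seen so far for each term.
from dataclasses import dataclass

@dataclass
class TermEntry:
    name: str
    weight: float   # maximum weight seen so far
    tag: str        # tag in which that maximum weight occurred
    position: int   # position of first occurrence, used for phrase detection

term_table = {}

def record(name, weight, tag, position):
    entry = term_table.get(name)
    if entry is None:
        term_table[name] = TermEntry(name, weight, tag, position)
    elif weight > entry.weight:
        entry.weight, entry.tag = weight, tag
```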
There are three layers in this feed-forward network model: (1) input layer: represents a set of input units, i.e. high level conceptual instances (terminologies or gestalt instances); (2) hidden layer: represents a set of hidden units, i.e. shared classes. The hidden units are functional units that do not receive direct inputs from the environment; rather, they serve as an intermediate stage in the analysis of the input; (3) output layer: represents a set of output units, i.e. DDC and LCC class representatives. Activation spreads only from the input layer towards the output layer. The Sigmoid function [22], f(x) = 1 / (1 + e^(-x)), where x > 0 and 0 < f(x) < 1, is chosen as the activation function. The reason why the Sigmoid function is chosen is given in section 5.2. The notation used in the feed-forward network is as follows: wI,S = the weight on the link between an input unit and a hidden unit; wS,C = the weight on the link between a hidden unit and an output unit; aI = the activation value of an input unit; aS = the activation value of a hidden unit; EI = the environment contribution of an input unit. The weights on the links between a parent unit and its child units are contingent on the number of children given a parent unit, and are scaled so that the maximum activation value of the parent unit is ≈ 1 (= 0.99). Based on the following three assumptions: (1) the strength of a signal x in the input level is the product of the activation value of an input unit, aI, and the weight on the link, w, plus a normalised environment contribution, E; (2) the activation value of an input unit is 1, which means that the input unit is activated; (3) the maximum sum of the normalised environment contributions (as described in section 4.2,
stage 3) is 1, we can determine the maximum sum of the weights on the links between the input units and a hidden unit:

f(max(∑ wI,S · aI + ∑ EI)) ≤ 0.99
f(max(∑ wI,S · 1 + ∑ EI)) ≤ 0.99
∑ wI,S + ∑ EI ≤ ln 99
∑ wI,S ≤ ln 99 − 1 ≈ 3.5951

The weight on each link involved can then be computed as follows: wI,Si = (ln 99 − 1) / n ≈ 3.5951 / n, where n is the number of input units given a hidden unit. We can determine the maximum sum of the weights on the links between the hidden units and an output unit in the same way. Since the environment contribution has already been taken into account, the weight on each link involved can then be computed as follows: wS,Ci = ln 99 / n ≈ 4.5951 / n, where n is the number of hidden units given an output unit.
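The derivation above can be checked numerically. The sketch below is our own illustration, not the original implementation; the layer sizes follow Figure 1 (the hidden unit SC1 has two input children, and the output unit CR1 has two hidden children), and the example reproduces the Example 1 activation value from section 4.2:

```python
import math

# Sketch of the link-weight scaling derived above: weights are scaled so
# that a parent unit's activation cannot exceed f(ln 99) = 0.99.
LN99 = math.log(99)  # ≈ 4.5951

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def w_input_to_hidden(n_children):
    # ln 99 minus 1, since the environment contributions can sum to 1
    return (LN99 - 1.0) / n_children

def w_hidden_to_output(n_children):
    # environment contribution already accounted for at the lower level
    return LN99 / n_children

# Example 1 of section 4.2: only CI1 is active (a = 1), with a normalised
# weight (environment contribution) of 1. SC1 has 2 input children and
# CR1 has 2 hidden children, as in Figure 1.
a_sc1 = sigmoid(w_input_to_hidden(2) * 1 + 1.0)
a_cr1 = sigmoid(w_hidden_to_output(2) * a_sc1)
print(round(a_cr1, 4))  # 0.8971, matching the value in the text
```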
4.2 Automatic Classification Process The five sequential stages in the automatic classification process are described below.
Stage 1: Analysing and weighting terms. The terms found within a Web page are analysed, weighted and stored in the term table based on the tag within which a term is found. A term which occurs in a title, heading or meta-keyword tag is assigned a large weight, ω1. Quite often, a term which occurs in one of these three significant tags occurs again in other tags. In this case, ACE increments the weight of this term with the large weight even though it appears in other tags, because it was considered significant at its first occurrence.
Otherwise, a small weight, ω2, is assigned to the term. This means that if two tags differ in their significance, ACE chooses the most significant one, and computes the weight of the term as follows:

wt = ωt × tf

where
wt = the weight of a term, t;
ωt = the degree of weight of a term, t: {ω1, ω2};
tf = the number of occurrences of a term, t.
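As a minimal sketch (our own illustration, not the authors' code), the Stage 1 rule can be written as follows. The values of ω1 and ω2 are placeholders, since the actual tuning values are only determined experimentally in section 5:

```python
# Sketch of the Stage 1 term-weighting rule w_t = ω_t * tf. The tuning
# parameters OMEGA_1 and OMEGA_2 are placeholder values for illustration.
OMEGA_1 = 10.0   # large weight for significant tags
OMEGA_2 = 1.0    # small weight for all other tags

SIGNIFICANT_TAGS = {"title", "h1", "h2", "meta-keyword"}

def term_weight(tag_counts):
    """tag_counts: {tag: occurrences of the term inside that tag}.
    Once a term appears in a significant tag, every occurrence counts
    with the large weight, as described for Stage 1."""
    tf = sum(tag_counts.values())
    omega = OMEGA_1 if SIGNIFICANT_TAGS & set(tag_counts) else OMEGA_2
    return omega * tf
```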
This strategy facilitates the differentiation of significant terms from insignificant ones. Note that ω1 and ω2 are tuning parameters which play a key role in determining the primary and secondary topics of a Web page. How these two weights are obtained is discussed in section 5.
Stage 2: Weighting conceptual instances. Based on the domain ontologies within the ACE database, ACE compares the conceptual instances within the Web page with the conceptual instances within the ACE ontologies. Here, "a conceptual instance within a Web page" means a keyword or a phrase descriptor (from within the term table), which may consist of more than one term and refers to a concept. To identify phrase descriptors as a set of concept markers, ACE adopts a phrase recognition method, called the non-syntactic phrase indexing method [9]. When a conceptual instance within the Web page matches a conceptual instance within the ACE database and the weight of the conceptual instance within the Web page is greater than ω1, ACE determines whether the conceptual instance has a parent reference other than a shared class reference. If so, ACE increments the weight of its parent; otherwise, ACE increments the weight of the conceptual instance itself. In other words, ACE converges (or accumulates) all weights on the weight of the high level conceptual instance. This strategy is applied in order to identify the "gestalt" (or broader concept) rather than the "parts" (or specific concepts) within a Web page. This convergence is only allowed for "is-a", "cover" and "part-of" relationships. For other types of relationships, ACE converges (or accumulates) the weight on a high level conceptual instance (or primary concept) based on the primary concept properties and facet values. Based on the concept properties attached to the concept, and the facets which define the allowed values on the relation between the primary concept and its concept properties, the primary concept can be identified.
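The facet-based identification of a primary concept can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the facet table mirrors the "communications" example discussed in the text:

```python
# Sketch of facet-based primary-concept identification in Stage 2: a
# primary concept is recognised when at least one of its concept
# properties is instantiated by a facet value found in the page.
FACETS = {
    "communications": {
        "types of communications": {"wireless communications",
                                    "computer communications",
                                    "postal communications"},
        "communications services": {"electronic mail", "news", "mail"},
    },
}

def primary_concepts(page_terms):
    """Return the primary concepts for which at least one concept
    property is matched by a facet value present in page_terms."""
    found = []
    for concept, properties in FACETS.items():
        if any(values & page_terms for values in properties.values()):
            found.append(concept)
    return found
```

A page containing "wireless communications" and "electronic mail" matches both concept properties and is therefore identified with the primary concept "communications".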
For example, the primary concept "communications" has two concept properties: "types of communications" and "communications services". The facet values for the concept property "types of communications" are "wireless, computer and postal communications"; the facet values for "communications services" are "electronic mail, news and mail". Let us suppose a Web page contains the words "wireless communications" and "electronic mail". Based on the facets, ACE can infer that the Web page implicitly contains the two concept properties "communications services" and "types of communications". This leads ACE to the conclusion that the primary concept of the Web page is "communications". To avoid over-fitting, the concept weighting strategy regards prepositions as part of a concept marker, but does not take the weight of the prepositions into account, focusing only on the weight of the nouns involved. The following describes the algorithm for the weighting strategy. To simplify the description of the algorithm, let Φ be a set of weighted
keywords and phrase descriptors: {µ1,µ2,...,µn}, which are constituted by a term(s) . Let also Ο be a domain ontology which consists of a set of concepts, ℂ: {c1, c2,...., cn};
a set of relations, Ř : {ř1, ř2,...., ř n} which express "is-a", "part-of", and "cover" relations; a set of relations, Ŕ : {ŕ1, ŕ2,....,ŕn} which express other types of relations. procedure converge_weights(Φ) { for each µi ∈ Φ { for each cj ∈ ℂ { if (µi=cj ∧ cj.relation=ř ∈ Ř ∧ cj.shared_class=nill) { converge_weight_on_the_associated_superclass(cj, µi); } elseif (µi=cj ∧ cj.relation=ŕ ∈ Ŕ ∧ cj.shared_clas=nill) {
Using the feed-forward network model, ACE searches the associated class representative(s). Based on the assumption that the normalised weights of the significant conceptual instances are evenly distributed, the activation values would be continuous with an upper bound. For this reason, the Sigmoid function is chosen as the activation function. The activation of an upper level unit (a shared class or a class representative) is possible if the strength of the signal, x is greater than the predefined threshold value, t = 0.5. CR1
wSC1,CR1 SC1
wCI1,SC1
wSC2,CR1 SC2
wCI2,SC1
wCI3,SC2
CI1
CI2
CI3
EI1
EI2
EI3
converge_weight_on_the_associated_primary_concept(cj, µi);
search_significant_associated_facet_values(primary_concept);
} elseif (µi=cj ∧ cj.shared_class=ЅCk ∈ ЅC) { store(cj, µi);
} } } } The procedure, "converge_weights" consists of three parts which are described as follows: (1) the procedure, " converge_weight_on_the_associated_superclass" is called recursively until a high level conceptual instance which is related to a shared class is found; (2) the procedure, " converge_weight_on_the_associated_primary_concept" searches and converges weights on a primary conceptual instance, and passes it to the procedure "search_significant_associated_facet_values", in order to identify whether the significant associated facet values of the primary conceptual instances are found; (3) in the case where a high level conceptual instance is found and linked to a shared class(es), the procedure "converge_weights" only needs to call the procedure, "store" in order to weight and store the conceptual instances. Stage 3: Weight normalisation. Before ACE searches for the associated class representative(s), ACE normalises the weight of the significant conceptual instances only. To capture the intuitive semantic value of the significant conceptual instances, the normalised weights are scaled so that the Euclidean norm of the sum of the weights is 1 [16]. The Euclidean norm is defined as follows:
||w|| = sqrt(w1^2 + w2^2 + ... + wn^2)

where n is the number of significant conceptual instances. A conceptual instance which has a normalised weight greater than the predefined threshold value, x = 0.5, is considered to be significant, and is taken into the next stage.

Stage 4: Assigning class representative(s).
Figure 1 A feed-forward network model

The feed-forward network is also used to assist the classifier in measuring the similarity between a Web page and a class representative. The classifier regards the activation value of an output unit (a class representative) as the similarity coefficient, which depends on the environment contributions, i.e. the normalised weights of the conceptual instances (described in stage 3), and the number of children, i.e. high level conceptual instances or shared classes, of a given parent unit, i.e. a shared class or a class representative. This is the main motivation for employing the feed-forward network. The following three examples illustrate the way the classifier measures the similarity coefficient of a Web page; figure 1 serves as the feed-forward network model for all three.

Example 1: a Web page only contains a concept marker which refers to the conceptual instance CI1. This means that the normalised weight of the conceptual instance is 1. The activation value of the class representative CR1 is computed as follows:

f(aCR1) = f(wSC1,CR1 * aSC1) = f(wSC1,CR1 * f(wCI1,SC1 * aI1 + EI1)) = f(2.2976 * f(1.7976 * 1 + 1)) = 0.8971

Example 2: a Web page contains two concept markers which refer to the conceptual instances CI1 and CI3, each of which is assigned a normalised weight of 0.7071. The activation value of the class representative CR1 is computed as follows:

f(aCR1) = f(wSC1,CR1 * aSC1 + wSC2,CR1 * aSC2) = f(wSC1,CR1 * f(wCI1,SC1 * aI1 + EI1) + wSC2,CR1 * f(wCI3,SC2 * aI3 + EI3)) = f(2.2976 * f(1.7976 * 1 + 0.7071) + 2.2976 * f(3.5951 * 1 + 0.7071)) = 0.9878

Example 3: a Web page contains three concept markers which refer to the conceptual instances CI1, CI2 and CI3, each of which is assigned a normalised weight of 0.5774. The activation value of the class representative CR1 is computed as
Proceedings of the 3rd International Conference on Web Information Systems Engineering (WISE’02) 0-7695-1766-8/02 $17.00 © 2002 IEEE
follows:

f(aCR1) = f(wSC1,CR1 * aSC1 + wSC2,CR1 * aSC2) = f(wSC1,CR1 * f((wCI1,SC1 * aI1 + EI1) + (wCI2,SC1 * aI2 + EI2)) + wSC2,CR1 * f(wCI3,SC2 * aI3 + EI3)) = f(2.2976 * f((1.7976 * 1 + 0.5774) + (1.7976 * 1 + 0.5774)) + 2.2976 * f(3.5951 * 1 + 0.5774)) = 0.9894

Example 3 shows that a Web page which contains all conceptual instances yields the largest activation value, 0.9894. In contrast, a Web page which contains only one conceptual instance yields the smallest activation value, 0.8971. In addition, example 2 shows that although the Web page does not contain all conceptual instances, it still yields a significant activation value, 0.9878. This is justified by the fact that the Web page conceptually represents the two shared classes which refer to the class representative CR1. In other words, not only the weights of the conceptual instances play a role in the activation value, but also the shared classes, which represent all the topics covered by the class representative.

Stage 5: Generating and storing metadata. The last step in the classification process is to generate metadata for the Web page. The metadata contains the significant, expanded and collateral concept marker(s), and the assigned DDC and LCC class representatives. Significant concept markers refer to the conceptual instances which are considered to be significant in stage 3. Based on the semantic links between the conceptual instances, ACE can look for and store the superclasses of the significant conceptual instances in the metadata as expanded information. The main purpose of this expanded information is to attach new relevant concepts to a Web page. Collateral concept markers refer to the conceptual instances which are considered to be insignificant in stage 3. They are also stored in the metadata on the assumption that they may be useful for a Web user seeking specific information which is closely related to the significant concept markers.
The metadata is stored using an XML database management system, called Xindice [4].
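The stage 3 normalisation and the activation values of examples 1-3 can be reproduced with a short script. This is a sketch, not ACE's implementation: the weight values are those given in figure 1, the sigmoid f(x) = 1/(1 + e^-x) is assumed as the activation function, and the dictionary names are illustrative.

```python
import math

def sigmoid(x):
    # Sigmoid activation function assumed in the worked examples.
    return 1.0 / (1.0 + math.exp(-x))

def normalise(weights):
    # Stage 3: scale the weights so that their Euclidean norm is 1.
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

# Connection weights taken from figure 1: each conceptual instance feeds
# one shared class, and both shared classes feed CR1.
W_CI = {"CI1": ("SC1", 1.7976), "CI2": ("SC1", 1.7976), "CI3": ("SC2", 3.5951)}
W_SC = {"SC1": 2.2976, "SC2": 2.2976}

def similarity(present):
    """Activation of CR1 given the set of conceptual instances found on a
    page. Each present instance has unit activation a_I = 1 and an external
    input E_I equal to its normalised weight (equal raw weights assumed)."""
    e = normalise([1.0] * len(present))[0]
    sc_input = {}
    for ci in sorted(present):
        sc, w = W_CI[ci]
        sc_input[sc] = sc_input.get(sc, 0.0) + (w * 1.0 + e)
    cr_input = sum(W_SC[sc] * sigmoid(x) for sc, x in sc_input.items())
    return sigmoid(cr_input)

print(round(similarity({"CI1"}), 4))                # 0.8971 (example 1)
print(round(similarity({"CI1", "CI3"}), 4))         # 0.9878 (example 2)
print(round(similarity({"CI1", "CI2", "CI3"}), 4))  # 0.9894 (example 3)
```

Note that only shared classes with at least one active child contribute to the output unit, which is why example 1 involves SC1 alone.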
5. Experimental Evaluation This section describes an experiment to evaluate the effectiveness of the classifier in terms of its coverage and accuracy. Coverage ratio is measured as the ratio of the number of classified Web pages to the total number of Web pages in a collection. Accuracy ratio is measured as the ratio of the number of correctly classified Web pages to the total number of classified Web pages.
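The two ratios are straightforward to compute. The sketch below uses the figures reported for the "T" mode with ω1 ≥ 16 in section 5.2 (508 of the 609 samples classified, 460 of those correctly); the function names are illustrative.

```python
def coverage_ratio(classified, total):
    # Fraction of the collection that the classifier could classify at all.
    return classified / total

def accuracy_ratio(correct, classified):
    # Fraction of the classified pages that were classified correctly.
    return correct / classified

# "T" mode, omega_1 >= 16: 508 of 609 samples classified, 460 correctly.
print(f"coverage: {coverage_ratio(508, 609):.2%}")  # coverage: 83.42%
print(f"accuracy: {accuracy_ratio(460, 508):.2%}")  # accuracy: 90.55%
```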
5.1 Test Samples and Domain Ontologies Used

For the test samples, we manually collected Web pages from within the Google Web directories [7]: 202 pre-classified Web pages which semantically refer to DDC 300 at the third level (the first test set), and 200 pre-classified Web pages which semantically refer to DDC 500 at the third level (the second test set). In order to select the samples manually, we traversed down through the Google directory paths until we found the directories which semantically refer to the DDC class representatives. We then took pre-classified Web pages from each of those directories as our samples. The samples were chosen so that each of them represented a DDC class representative at the third or fourth level. In order to measure the accuracy of classification results against a random collection, ACE automatically classified 303,000 Web pages
and only achieved a 10% coverage ratio due to the incompleteness of the domain ontologies used; of these, 207 Web pages classified to DDC 300 were randomly selected as the third test set. As these 207 Web pages were not pre-classified by Google, we classified them manually. To avoid bias, we submitted queries to Google based on the title texts of these 207 Web pages; Google was used to generate the categories for each of these Web pages when they were known to Google. The domain ontologies which contain 402 conceptual instances related to DDC "300 Social Sciences" were chosen, because DDC 300 covers many different subject areas. DDC 300 contains 92 class representatives at the third level; of these, only 59 class representatives were mapped to the related domain ontologies. In other words, the domain ontologies used covered 64% of DDC 300. In addition, the domain ontologies that contain 398 conceptual instances related to DDC "500 Natural Sciences" were chosen as a counter example. DDC 500 contains 95 class representatives at the third level; of these, only 61 class representatives were mapped to the related domain ontologies. In other words, the domain ontologies used covered 64% of DDC 500. The domain ontologies for DDC 500 were chosen as a counter example because they contain almost as many conceptual instances as those for DDC 300. Hence, the domain ontologies for DDC 300 and 500 are suitable for determining a correlation between the completeness of domain ontologies and classification accuracy. The term "completeness" refers to the extent to which an ontology can be said to be exhaustive in representing a domain, in terms of the representativeness of its conceptual instances and the relationships among them. To precisely quantify the completeness of an ontology for a given domain, however, is almost impossible.
Nevertheless, we argue that there was a difference between the two sets of domain ontologies in terms of their completeness. From this point of view, the experiment was conducted to see whether this difference could affect classification accuracy (section 5.3).
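As a quick check, the 64% coverage figures quoted above follow directly from the class-representative counts (variable names are illustrative):

```python
# Share of third-level DDC class representatives mapped to the domain
# ontologies (counts from section 5.1).
ddc300 = 59 / 92   # DDC 300: 59 of 92 third-level classes mapped
ddc500 = 61 / 95   # DDC 500: 61 of 95 third-level classes mapped
print(f"DDC 300: {ddc300:.0%}, DDC 500: {ddc500:.0%}")  # DDC 300: 64%, DDC 500: 64%
```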
5.2 Classification Coverage and Accuracy with respect to the Weight, ω1

To determine the ratio of the small weight, ω2, to the large weight, ω1, we used and labelled the 609 test samples. ω2 was set to 1 throughout this experiment. As the starting point, we set ω1 to 1, and assumed that terms found within three HTML tags, i.e. the title, heading and meta-keyword tags, were the candidates which might be entitled to the large weight, ω1. For each loop, ACE classified the samples, compared the classification results with the predefined ones and measured the coverage and accuracy ratios. In order to establish which terms should be entitled to the large weight, ω1, ACE classified the test samples in three different modes: in the "T" mode, ACE only assigned the large weight, ω1, to terms found within a title tag; in the "T+H" mode, within a title and heading tag; and in the "T+H+K" mode, within a title, heading and meta-keyword tag. The loop was repeated until ω1 = 100. Figures 2a and 2b show that the number of correctly classified pages increased from 441 to 471 for ω1 = {1...6}, raising the accuracy ratio by 13.77 percentage points. This gain, however, came at the cost of a degradation in classification coverage from 582 to 526 classified pages (9.20 percentage points). For ω1 = {6...16}, the
classifier excluded all the significant terms found within a heading or meta-keyword tag. This degraded both classification coverage, from 526 to 508 pages (2.95 percentage points), and the number of correctly classified pages, from 471 to 460, although the accuracy ratio increased by 1.01 percentage points. For ω1 = {16...100}, the coverage and accuracy remained constant.
In contrast, figure 2b shows that there is a difference between the coverage and accuracy percentages: the coverage percentage remained steady at 83.42%, whilst the accuracy percentage remained steady at 90.55%. This suggests that taking terms found within a heading tag as significant concept markers into
Figure 2a Coverage-Accuracy curve with respect to the terms – found within the title tags – which were assigned the large weight, ω1.
Figure 3a Coverage-Accuracy curve with respect to the terms – found within the title and heading tags – which were assigned the large weight, ω1.
Figure 2b Coverage-Accuracy Percentage curve with respect to the terms – found within the title tags – which were assigned with the large weight, ω1.
Figure 3b Coverage-Accuracy Percentage curve with respect to the terms – found within the title and heading tags – which were assigned the large weight, ω1.
Analogously, figures 3a and 3b show that the number of correctly classified pages increased from 441 to 464 for ω1 = {1...7}, raising the accuracy ratio by 10.96 percentage points. This gain, however, came at the cost of a degradation in classification coverage from 582 to 535 pages (7.72 percentage points). For ω1 = {7...16}, the classifier excluded all the significant terms found within a meta-keyword tag. This degraded the classification coverage from 535 to 527 pages (1.31 percentage points) and the number of correctly classified pages from 464 to 458, although the accuracy ratio increased by 0.18 percentage points. For ω1 = {16...100}, the coverage and accuracy remained constant. Figure 3b also shows that, by assigning the large weight ω1 = {16...100} to terms found within the title and heading tags, the classification coverage and accuracy percentages remained steady at ≈86%.
account can deteriorate classification accuracy. On the other hand, the coverage degradation is not as large as that shown in figures 2a/2b. Assigning the large weight, ω1, to terms found within a meta-keyword tag does not result in a better level of accuracy. In fact, it causes both the classification coverage and the accuracy to deteriorate. This suggests that terms found within a meta-keyword tag should not be assigned the large weight, ω1. ω1 can be categorised into three intervals. The first weight interval consists of a set of weights which result in a better level of accuracy. The second one consists of a set of weights which degrade classification accuracy. The last one consists of a set of
weights which do not affect classification coverage and accuracy. To analyse the characteristics of these three intervals in more detail, ACE classified the 609 samples separately, i.e. 202 Web pages related to DDC 300, 200 Web pages related to DDC 500, and 207 Web pages from the random collection. Despite some fluctuations, the coverage-accuracy curves for these three test sets yielded the same type of curves as shown in figures 2a and 3a. The length of each interval, however, varied and depended on the test set and the mode used.
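The weight sweep described above can be sketched as follows. This is a simplified sketch: classify() and the labelled sample set are hypothetical stand-ins for ACE and the 609 pre-classified samples; only the experimental bookkeeping is shown.

```python
def sweep(samples, classify, modes=("T", "T+H", "T+H+K"), w2=1):
    """Sweep omega_1 from 1 to 100 in each tag mode, recording the
    coverage and accuracy ratios for every (mode, omega_1) pair.
    `samples` is a list of (page, expected_class) pairs; `classify`
    returns a predicted class or None when the page is not classified."""
    results = {}
    for mode in modes:
        for w1 in range(1, 101):  # omega_1 = 1 .. 100
            classified = correct = 0
            for page, expected in samples:
                predicted = classify(page, mode=mode, w1=w1, w2=w2)
                if predicted is not None:
                    classified += 1
                    if predicted == expected:
                        correct += 1
            coverage = classified / len(samples)
            accuracy = correct / classified if classified else 0.0
            results[(mode, w1)] = (coverage, accuracy)
    return results
```

A stub classifier suffices to exercise the loop, e.g. `sweep([("p1", "A"), ("p2", "B")], lambda page, mode, w1, w2: "A")` yields a coverage of 1.0 and an accuracy of 0.5 for every setting.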
5.3 Correlation between the Completeness of Domain Ontologies and Classification Accuracy

This section presents an experiment which was conducted in order to see whether the completeness of the ontologies used could make a difference in terms of classification accuracy. For each test set, we used the ratio of ω2 to ω1 which yielded the best performance in terms of accuracy, and only assigned ω1 to the terms found within a title tag in order to avoid bias, i.e. other factors that can degrade classification accuracy, such as terms within a meta-keyword tag. The experiment tested H0: there is no difference between the classification results for DDC 300 and 500 in terms of classification accuracy (null hypothesis) against H1: there is a difference between the classification results for DDC 300 and 500 in terms of classification accuracy (alternative hypothesis). ACE automatically classified the pre-classified Web pages from within the Google Web directories. If ACE was able to classify a Web page as Google did, then the count of correctly classified pages was incremented. Table 2 below depicts the observed frequencies (O).

Observed Frequencies (O)                            Correctly    Wrongly     Not
                                                    Classified   Classified  Classified
202 pre-classified Web pages related to DDC 300     166          9           27
200 pre-classified Web pages related to DDC 500     173          11          16
Table 2 Observed Frequencies (O).

A chi-square (χ2) analysis was carried out. The test statistic, χ2, was 3.1486, which is smaller than the critical value χ2(0.05) = 5.99. The conclusion is that there is insufficient evidence to reject H0 (the null hypothesis). This indicates that the difference in the completeness of the domain ontologies used did not affect the accuracy of the classification. In order to detect whether there is a difference between the classification accuracy on pre-classified and random pages with respect to DDC 300, we conducted another experiment. The experiment tested H0: there is no difference between the classification accuracy on pre-classified and random pages with respect to DDC 300 against H1: there is a difference between the classification accuracy on pre-classified and random pages with respect to DDC 300. Table 3 below depicts the observed frequencies (O).

Observed Frequencies (O)                                  Correctly    Wrongly     Not
                                                          Classified   Classified  Classified
202 pre-classified Web pages related to DDC 300           166          9           27
207 Web pages classified to DDC 300 (random collection)   137          9           61
Table 3 Observed Frequencies (O).

A chi-square (χ2) analysis was carried out. The test statistic, χ2, was 15.8532, which is greater than the critical value χ2(0.01) = 9.21. The conclusion is that there is strong evidence to reject H0 (the null hypothesis). This means that the level of classification accuracy on the random pages deteriorated significantly. The reason for this deterioration is that the pre-classified, manually selected Web pages from within the Google Web directories contain conceptual instances that fully match the domain ontologies used, whereas the random pages contain conceptual instances that do not. In addition to the statistical analysis described above, we found two factors that degrade classification accuracy: (1) incompleteness of the conceptual instances within an ontology, which accounted for 99.25% of the failures (132 of the 133 wrongly classified and unclassified samples). This had been anticipated from the beginning of the project. We decided to start with small domain ontologies. These ontologies mostly contain high level concepts and some low level (or specific) concepts. The high level concepts used are closely related to (or the same as) the DDC and LCC class representatives. It is therefore not surprising that ACE often fails to identify a specific concept due to the incompleteness of the ontology used; (2) the inability of ACE to reason, which accounted for 0.75% of the failures (1 of the 133 wrongly classified and unclassified samples). This is because ACE does not think and reason as a human classifier does. For example, when ACE analysed and classified a Web page about slavery, it found the title text "Beyond Face Value"; the concept of slavery itself was contained in an image file. A human classifier can easily classify this Web page, since they can see the information encoded in the image file and can arrive at a conclusion based on this image information and the title text.
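Both chi-square statistics can be reproduced from the observed frequencies in tables 2 and 3 using the standard Pearson contingency-table formula (a sketch; this makes no claim about how the original analysis was computed):

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table, with
    expected counts derived from the row and column marginal totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Table 2: DDC 300 vs DDC 500 (correct, wrong, not classified).
print(round(chi_square([[166, 9, 27], [173, 11, 16]]), 4))  # 3.1486
# Table 3: pre-classified vs random pages for DDC 300.
print(round(chi_square([[166, 9, 27], [137, 9, 61]]), 4))   # 15.8532
```

Both tables are 2 x 3, so each test has (2-1)(3-1) = 2 degrees of freedom, consistent with the critical values 5.99 (α = 0.05) and 9.21 (α = 0.01).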
6. Conclusions and Future Work

There are three issues in ontology-based automatic classification with respect to DDC and LCC. The first issue is how to map the LCC class representatives into the DDC class representatives. Due to the different strategies and naming conventions used to organise and name the class representatives, mapping these two schemes poses a special challenge. This paper proposes formal definitions for full and partial mapping, and the use of shared classes to partially map these two classification schemes without loss of precision. The main idea behind the use of shared classes is to define a set of common class representatives which conceptually refer to the related class representatives, and to link the class representatives with their associated domain ontologies. The second issue is how to correctly identify the main topic of a Web page based on the terms within the Web page. To tackle this issue, a term weighting strategy is applied in order to clearly differentiate the significant terms from the insignificant ones. Subsequently, a concept weighting strategy is applied to identify the significant conceptual instances, which represent the primary concept of the Web page, based on the weighted terms and the structure of the domain ontologies. This strategy is based on the assumption that the significant conceptual instances are dependent on each other, represent the main topic of the Web page, and conceptually refer to the associated class representatives. The third issue is how to map the significant conceptual instances into their associated class representative(s), and to
represent this relationship. To tackle this issue, we propose formal definitions for an ontology-classification-scheme mapping. Then, we use a feed-forward network model to dynamically represent the semantic links between a set of conceptual instances and their associated class representative(s). The model allows us to observe the way the classifier assigns a class representative to a Web page by tracking the links between the conceptual instances involved and the associated class representative. The experimental results show an improvement in terms of accuracy when compared to the previous classifier described in [2]. This means that the classification strategy based on the use of the domain ontologies does result in a better level of accuracy. This improvement, however, comes at the cost of a low coverage ratio due to the incompleteness of the ontologies used (table 3, 3rd row). For this reason, we propose the adoption of natural language processing techniques, which can be used to automatically identify new concept markers, and machine learning techniques, which can be used to automatically integrate these new concept markers into the existing ones, in order to complete the existing domain ontologies. Hence, a better level of accuracy can be achieved without degrading the coverage ratio.
Acknowledgement The authors thank Peter Musgrove for his invaluable comments.
7. References

[1] A. MAEDCHE, S. STAAB, N. STOJANOVIC, R. STUDER, and Y. SURE, 2001, "SEAL – a framework for developing semantic Web portals.", Proc. of the 18th British National Conference on Databases (BNCOD), July 2001, Chilton, U.K.
[2] C. JENKINS, M. JACKSON, P. BURDEN, and J. WALLIS, 1999, "Automatic RDF metadata generation for resource discovery.", Proc. of the 8th International WWW Conference, May 11-14, 1999, Toronto.
[3] D. BRICKLEY, and R. GUHA, 1999, "Resource description framework (RDF) schema specification - W3C Recommendation 23 March 2000.", U.S.: W3C Consortium (updated 27 March 2000, accessed 26 February 2001).
[4] DBXML, 2001, "The dbXML Project.", (accessed 7 January 2002).
[5] D. FENSEL, 2001, Ontologies: a silver bullet for knowledge management and electronic commerce. 1st ed., Heidelberg: Springer.
[6] E.D. LIDDY, W. PAIK, and E.S. YU, 1994, "Text categorization for multiple users based on semantic features from a machine-readable dictionary.", Journal of the ACM, 12(3), pp.278-295.
[7] GOOGLE, 2001, "Google Web Directory.", (accessed September 2001).
[8] I. HORROCKS, D. FENSEL, J. BROEKSTRA, S. DECKER, M. ERDMANN, C. GOBLE, F. van HARMELEN, M. KLEIN, S. STAAB, R. STUDER, and E. MOTTA, 2000, The ontology inference layer, OIL. Information Society Technologies. (accessed 14 December 2000).
[9] J.L. FAGAN, 1987, "Automatic phrase indexing for document retrieval - an examination of syntactic and non-syntactic methods.", Proc. of the 10th annual international ACM SIGIR, June 3-5, 1987, pp.91-101, New Orleans, LA, U.S.
[10] K. AHMAD, R. BONTHRONE, G. ENGEL, A. FOTOPOULOU, D. FRY, C. GALINSKI, J. HUMBLEY, N. KALFON, M. ROGERS,
C. ROULIN, K. SCHMALENBACH, and E. TANKE, 1995, "The importance of terminology.", Final Report of the POINTER Project.
[11] LIBRARY OF CONGRESS, 2002, "Library of Congress Bibliographies.", (updated 25 May 2001, accessed 13 September 2001).
[12] M. CRAVEN, D. DIPASQUO, D. FREITAG, A. MCCALLUM, T. MITCHELL, K. NIGAM, and S. SLATTERY, 2000, "Learning to construct knowledge bases from the World Wide Web.", Artificial Intelligence, 118(1-2), pp.69-113.
[13] M. GOLDSTEIN, and W.R. DILLON, 1978, Discrete discriminant analysis, New York: John Wiley & Sons, Inc.
[14] M. JACKSON, and P. BURDEN, 1999, "WWLib-TNG - new directions in search engine technology.", IEE Informatics Colloquium: Lost in the Web - navigation on the Internet, pp.10/1-10/8, November 1999.
[15] M.J. BLOSSEVILLE, G. HEBRAIL, M.G. MONTEIL, and M. PENOT, 1992, "Automatic document classification: natural language processing, statistical analysis, and expert system techniques used together.", Proc. of the 15th annual international ACM SIGIR, June 1992, Copenhagen, Denmark.
[16] M.W. BERRY, and M. BROWNE, 1999, Understanding search engines - mathematical modelling and text retrieval. 1st ed., Philadelphia: SIAM (Society for Industrial and Applied Mathematics).
[17] N.F. NOY, and D.L. MCGUINNESS, 2001, Ontology development 101: a guide to creating your first ontology. Knowledge Systems Laboratory (KSL), Department of Computer Science, Stanford University, U.S.: Technical report KSL-01-05.
[18] N. GOEVERT, M. LALMAS, and N. FUHR, 1999, "A probabilistic description-oriented approach for categorising Web documents.", Proc. of the 8th ACM International Conference on Information and Knowledge Management, November 2-4, 1999, pp.475-482, Kansas City, U.S.
[19] O. LASSILA, and R.R. SWICK, 1999, "Resource description framework (RDF) model and syntax specification - W3C Recommendation 22 February 1999", World Wide Web (W3C) Consortium (updated 24 February 2000, accessed 26 February 2001).
[20] ONLINE COMPUTER LIBRARY CENTER, Inc., 2002, "Dewey Decimal Classification.", (accessed 13 September 2001).
[21] S. DECKER, F. van HARMELEN, J. BROEKSTRA, M. ERDMANN, D. FENSEL, I. HORROCKS, M. KLEIN, and S. MELNIK, 2000, "The Semantic Web - on the respective roles of XML and RDF.", IEEE Internet Computing, 4(5), pp.63-74.
[22] S. RUSSELL, and P. NORVIG, 1995, Artificial intelligence - a modern approach. 1st ed., New Jersey: Prentice-Hall.
[23] STANFORD UNIVERSITY, Database Group, 2001, "RDF API Draft.", (accessed 15 January 2001).
[24] S.T. DUMAIS, and H. CHEN, 2000, "Hierarchical classification of Web content.", Proc. of the 23rd Annual International ACM SIGIR, July 24-28, 2000, Athens, Greece.
[25] T. KOCH, A. BRUEMMER, D. HIOM, M. PEEREBOORN, A. POULTER, and E. WORSFOLD, 1998, "Specification for resource description methods Part 3: the role of classification schemes in Internet resource description and discovery.", European Commission: a DESIRE project.
[26] YAHOO! Inc., 2002, "Yahoo! search engine.", (accessed 7 January 2002).