Automatic Derivation of On-line Document Ontologies

Dave Elliman, J. Rafael G. Pulido∗
Image Processing and Interpretation Research Group, Computer Science and Information Technology School, University of Nottingham, United Kingdom
[dge|jrp]@cs.nott.ac.uk

Jun 11, 2001

∗ Corresponding author
Abstract

This paper describes a method for constructing an ontology to represent the set of web pages on a specified site. We are developing a technique that extracts knowledge from digital sources, creates ontologies containing reusable knowledge to be shared with software agents, and presents a view of this knowledge to users. The method addresses the problem of classifying information and supporting mechanisms that explore its structure, while also allowing the extracted knowledge to be shared with other software agents.
1 Introduction
Many web sites contain a large number of pages which have grown, often in a rather uncontrolled manner, over time. Some sites are neat and tidy, like a well-kept garden, whilst others are an overgrown wilderness in which it is difficult to find anything of interest, even if it is known to be hidden there somewhere. It would be useful to construct a representation of the information in the site that supports intelligent and interactive searching and can also be used as an index. In this paper we describe an approach that is being used to construct an ontology for such web sites. A novel aspect of this work is that the form of the ontology is tailored to each individual user. The presentation of information to the user is an ongoing area of work, as is the use of the structure to facilitate the work of distributed agents. A task that could help exploration is one that librarians have carried out for centuries: the organization of documents into hierarchies, here performed in a dynamic, data-driven way.
Figure 1: An Ontology Example

A document is something created by author(s) that may be viewed, listened to, etc., by some audience. It persists in material form (e.g., a concert or dramatic performance is not a document). Documents typically reside in libraries.
Subclass-Of: Individual-Thing, Individual, Thing
Superclass-Of: Book, Cartographic-Map, Computer-Program, Doctoral-Thesis, Edited-Book, Journal, Miscellaneous-Publication, Multimedia-Document, Periodical-Publication, Proceedings, Technical-Manual, Technical-Report, Thesis
Class hierarchy (20 classes defined): Author, Document, Book, Edited-Book, Miscellaneous-Publication, Artwork, Cartographic-Map, Computer-Program, Multimedia-Document, Technical-Manual, Periodical-Publication, Journal, Magazine, Newspaper, Proceedings, Technical-Report, Thesis, Doctoral-Thesis, Masters-Thesis, Title
4 relations defined: Has-Author, Has-Editor, Has-Series-Editor, Has-Translator
7 functions defined: Conference-Of, Number-Of-Pages-Of, Organization-Of, Publication-Date-Of, Publisher-Of, Series-Title-Of, Title-Of
The remainder of this paper is organized as follows. In section 2 some key concepts on ontologies are introduced. Some related work is presented in section 3. Our approach is described in section 4. The paper is concluded in section 5.
2 The Purpose of an Ontology
A representation that brings order and structure to a web site can be referred to as an ontology. It is important to identify an appropriate representation for this meta-level description. Figure 1 shows a description of the document ontology defined at Stanford KSL Network Services (http://www-ksl-svc.stanford.edu:5915/), together with its subclasses, superclasses, relations, and functions. From this description a taxonomy can be derived, shown in Figure 2.

[Figure 2: Taxonomy derived from the ontology of Figure 1. The root class thing subsumes author, title, and document; document subsumes periodical-publication, book, thesis, proceedings, miscellaneous-publication, and technical-report.]

Representing knowledge about a domain as an ontology is a challenging process which is difficult to do in a consistent and rigorous way. It is easy to lose consistency and to introduce ambiguity and confusion. The most famous example of this is the ambiguity of the meaning of the is-a relationship, as discussed by Brachman in [1]. Ontologies can be expressed with varying degrees of formality; however, the following four categories are the most common ways to express them [17]:
1. Highly informal: written using natural language.
2. Semi-informal: restricted and structured natural language.
3. Semi-formal: using a formally defined language.
4. Rigorously formal: semi-formal, with the addition of theorems and proofs.

It is tempting to believe that a well-designed universal ontology expressed in a formal way should be the goal of this research. In practice, however, it rapidly became evident that the quality of an ontology is a subjective matter: it depends on how useful it is for a given purpose. An analogy with hand tools illustrates this point. It is silly to argue that a saw is a better tool than a chisel; each is best for its own particular task. A highly formal ontology may have many desirable qualities, but it is of little use to most people if only graduate mathematicians can understand it. A formal ontology may still be a useful internal representation, but it may need an interpretation wrapper in order to be viewed by a user. It is nevertheless a useful form of knowledge representation which may be used to support the design and development of intelligent software applications and expert systems.

One of the most common uses of ontologies is to support the development of agent-based systems for web searching, for example [3] and [9]. For this interaction to be possible, agents must share a common ontology, or at least a common wrapper to existing information structures. Reusing ontologies that are less generic and less carefully designed is likely to be much more difficult. The most important observation in this context is that there is significant manual effort involved in translating ontologies, as reported in [17].
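To make the kind of structure involved concrete, the short sketch below holds the taxonomy of Figure 2 as a parent-to-subclasses map and prints it as an indented outline. It is written in Java, the implementation language mentioned in section 4, although in modern Java for brevity; the class name and map contents are illustrative only, not part of the KSL ontology or of our implementation.

import java.util.List;
import java.util.Map;

// Illustrative sketch only: the taxonomy of Figure 2 held as a parent -> subclasses map.
// A taxonomy in this sense is simply a partial order induced by class inclusion.
public class DocumentTaxonomy {

    static final Map<String, List<String>> SUBCLASSES = Map.of(
            "thing",    List.of("author", "title", "document"),
            "document", List.of("periodical-publication", "book", "thesis", "proceedings",
                                "miscellaneous-publication", "technical-report"));

    // print the hierarchy as an indented outline, starting from the given root concept
    static void print(String concept, int depth) {
        System.out.println("  ".repeat(depth) + concept);
        for (String child : SUBCLASSES.getOrDefault(concept, List.of()))
            print(child, depth + 1);
    }

    public static void main(String[] args) {
        print("thing", 0);
    }
}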
3 Related Work
Research from a number of areas has been drawn on to inform this work. Useful results from natural language understanding systems have been adopted for the semantics, and distance measures between documents have been developed using principal component analysis. Previous work on web searching, on building digital libraries, on self-organizing networks, and of course on constructing ontologies has also been used.
3.1 Constructing Ontologies
Web ontologies can take rather different forms from traditional ones. In [2] the use of the so-called Simple HTML Ontology Extension (SHOE) in a real-world internet application is described. This approach allows authors to add semantic content to web pages, relating the content to common ontologies that provide contextual information about the domain. Most web pages with SHOE annotations tend to have tags that categorize concepts, so there is no need for complex inference rules to perform automatic classification. XML is also a rapidly growing standard for describing semantic content in web applications. These tags (see http://www.xml.org/xmlorg_registry/index.shtml) are likely to become widely adopted in future, and will form an excellent basis for building ontologies for specific domains from information on web sites. At the time of writing, however, we were unable to find sufficient examples to use this information in our prototype.

Another interesting project is presented in [7], where the results of applying WEBSOM, a document organization, searching, and browsing system, to a set of about 7 million electronic patent abstracts are described. In this case, a document map is presented as a series of HTML pages facilitating exploration. A specified number of best-matching points are marked with a symbol and can be used as starting points for browsing.

Ontologies can also be classified according to the level of formalism in which they are written [2]:

1. Catalogue: a list of terms, no axioms, no glosses.
2. Glossed catalogue: a catalogue with natural language glosses.
3. Taxonomy: a collection of concepts with a partial order induced by inclusion.
4. Axiomatized taxonomy: a taxonomy with axioms.
5. Context library: a set of axiomatized taxonomies with relations among them.

It should be emphasised that in this context our ontology can be regarded as a taxonomy. Two ubiquitous and inter-related concepts in meta-level descriptions of information are hierarchy and proximity. Documents can be described as
being close to one another if they are similar in some sense. Two documents might be close in one respect, say writing style, but distant in another respect, for example content. On one hand, a distance measure applied to a set of documents results in a partial order relation which can form the basis for an ontology.

    c_{ij} = \frac{\sum_k \min(d_{ik}, d_{jk})}{\sum_k d_{ik}}    (1)

    c_{ij} = \begin{cases} 1 & : d_i = d_j \\ 0 & : d_i \neq d_j \end{cases}    (2)
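As a worked illustration of equation (1) as reconstructed above, the following Java sketch computes the overlap measure for two documents held as term-to-frequency maps. The class, method, and example terms are illustrative only; note that the measure is asymmetric, since it normalizes by the total frequency mass of the first document.

import java.util.Map;

// Sketch of the similarity measure in equation (1):
//   c_ij = sum_k min(d_ik, d_jk) / sum_k d_ik
// Documents are held as term -> frequency maps; all names here are illustrative only.
public class FuzzySimilarity {

    static double similarity(Map<String, Integer> di, Map<String, Integer> dj) {
        double numerator = 0.0;   // shared mass: sum of minimum frequencies over the terms of d_i
        double denominator = 0.0; // total frequency mass of d_i
        for (Map.Entry<String, Integer> term : di.entrySet()) {
            int dik = term.getValue();
            int djk = dj.getOrDefault(term.getKey(), 0);
            numerator += Math.min(dik, djk);
            denominator += dik;
        }
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = Map.of("engine", 7, "petrol", 5, "materials", 1);
        Map<String, Integer> d2 = Map.of("engine", 1, "petrol", 3, "combustion", 4);
        System.out.println(similarity(d1, d2)); // (1 + 3 + 0) / (7 + 5 + 1) ~= 0.31
    }
}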
On the other hand, it is desirable to have an objective measure of the quality of a given ontology so that a decision can be made as to whether one putative representation is better or worse than another. This is a vexed question, as it is necessary to ask, "In what sense do you mean the best?". It is tempting to give an answer in terms of conformance to certain design criteria for ontologies that are considered desirable, for example those listed by Gruber [4]. A useful set of criteria might be as follows:

1. Clarity: the ontology should convey the intended meaning of the defined terms.
2. Coherence: axioms should be logically consistent.
3. Extendibility: when defining new terms, it should not be necessary to revise existing definitions.
4. Commitment: the ontology should be sufficient to support the intended purpose.
5. Distinction principle: classes should be disjoint.
6. Diversification: this increases the power provided by multiple inheritance mechanisms.
7. Minimization of semantic distances: similar classes are grouped together.
8. Encoding: the ontology should be specified at the knowledge level without depending on a particular symbol-level encoding.
3.2 Constructing Browsing Systems
Support for browsing using classification hierarchies is an important tool for users of online archives. Users would like the data to be structured in a way that makes sense from their point of view. The purpose of a browsing environment is to present the data in a structured way that facilitates the discovery of information for a given purpose.

In [15] a distributed architecture for the extraction of meta-data from WWW documents is proposed which is particularly suited to repositories of historical publications. This information extraction system is based on semi-structured data analysis. The system output is a meta-data object containing a concise representation of the corresponding publication and its components. Gatherers have been designed as a combination of a parser, based on a context-free grammar, and a web robot, which navigates the links contained in the basic document type to infer the document structure of the entire site. These meta-data objects can be interchanged with other web agents, then classified and organized.

In [6] an intelligent agent for libraries is described. This inhabits a rich virtual environment enhanced with various information tools to support searching. It offers stacks of books containing standard meta-data and adds, whenever possible, extra meta-data such as tables of contents, full text, reviews, and frequency of citation to the data stored about each book. Similarly, a system that provides a more intuitive interface to document repositories is presented in [12]. In this case, documents are grouped using a SOM (Self-Organizing Map), and then a graphical real-world metaphor is used to present the documents to users. That system was used as a front-end to the AltaVista search engine. SOMLib and libViewer [13], and parSOM [16], were also drawn on in this proposal. In SOMLib, maps can be integrated to form a high-level library, which allows users to choose sections of a library to create personal libraries.

Hierarchical feature maps consist of a number of individual self-organizing maps and are able to represent the contents of a document archive in the form of a taxonomy [10]. The distinguishing feature of this model is that it provides a hierarchical view of the underlying data collection in the form of an atlas: starting from a map representing the complete data collection, the data are shown at progressively finer levels of granularity (Figure 3). The complete archive is represented by means of a small overview map.

[Figure 3: A two-level hierarchical SOM — a first level of categories and a second level of subcategories.]
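To make the SOM-based approaches above more concrete, the following is a minimal, self-contained sketch of a rectangular self-organizing map in Java. It is not the WEBSOM, SOMLib, or parSOM implementation; the grid size, learning rate, and Gaussian neighbourhood are assumptions chosen only for illustration.

import java.util.Random;

// A deliberately small self-organizing map sketch: a rows x cols grid of weight vectors,
// trained by pulling the best-matching unit and its grid neighbours towards each input.
public class TinySom {
    final int rows, cols, dim;
    final double[][][] weights; // weights[r][c] is the vector attached to one map node
    final Random random = new Random(42);

    TinySom(int rows, int cols, int dim) {
        this.rows = rows; this.cols = cols; this.dim = dim;
        weights = new double[rows][cols][dim];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                for (int i = 0; i < dim; i++)
                    weights[r][c][i] = random.nextDouble();
    }

    // One training step: locate the best-matching unit for x, then move every node within
    // the given radius towards x, weighted by a Gaussian of its distance on the grid.
    void train(double[] x, double learningRate, double radius) {
        int bmuR = 0, bmuC = 0;
        double best = Double.MAX_VALUE;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++) {
                double dist = 0.0;
                for (int i = 0; i < dim; i++) {
                    double diff = x[i] - weights[r][c][i];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; bmuR = r; bmuC = c; }
            }
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++) {
                double gridDist = Math.hypot(r - bmuR, c - bmuC);
                if (gridDist > radius) continue;
                double h = Math.exp(-(gridDist * gridDist) / (2 * radius * radius));
                for (int i = 0; i < dim; i++)
                    weights[r][c][i] += learningRate * h * (x[i] - weights[r][c][i]);
            }
    }

    public static void main(String[] args) {
        TinySom som = new TinySom(4, 4, 3);
        som.train(new double[]{0.9, 0.1, 0.0}, 0.5, 2.0); // one step with a toy 3-d document vector
    }
}

Training repeatedly presents the document vectors while shrinking the radius and learning rate; documents whose vectors map to the same or neighbouring nodes end up in the same region of the map, which is what the hierarchical variants then refine level by level.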
4 The Approach Used for Constructing our Ontology
The best ontology for someone looking for references by a given author will be different from that for someone looking for papers in a particular field, which
will be different again from that required by someone interested in the history of language and how word usage has changed over time. Even within a pure subject-based ontology, the optimum shape of the representation will depend on the particular slant of the user. Take the case of an engineer interested in the design of petrol engines. The same set of papers may be of interest to one person concerned with combustion processes and to another interested in materials, as both factors may be discussed. However, the proximity of individual papers to one another, and to the researcher's key interests, would ideally be quite different in each case.

Our system can be outlined as a number of sets [14]:

1. A set of objects (entities, concepts)
2. A set of functions (for example is-a)
3. A set of relations (between objects)
4. A set of semantic rules

It can be inferred that an ontology should be produced in a bespoke manner to suit its purpose (Figure 4). This of course raises the crucial question of how such a purpose may be identified and specified.

[Figure 4: Basic Approach — a digital source is analyzed to produce ontologies, which are shared with a community of agents.]

Our approach is to take the keywords supplied by the user and to see if they represent an abstraction of a set of less general objects, that is, whether they link to a set of hypernyms in WordNet as explained in [11]. Animal, for example, is a hypernym of pig, sheep, and cow. Words that are hypernyms of search keys are good candidates for branches in an ontology.

The Java language and its various APIs are powerful for constructing software systems that build ontologies. The collections framework provides efficient hash table structures for use as word lists, and support for sockets and for parsing HTML (and indeed XML) has been available since version 1.2 of the SDK (available from http://java.sun.com).

A set of documents, in the form of a directory of text files, was processed to build an ontology in terms of a user-specified query. A stop list of common words that carry little information was used to prune these words from the files. A first-level ontology was constructed by counting the occurrences per thousand words of the hypernyms of the search keywords or their synonyms; a fuzzy membership function over this first-level hierarchy resulted. A word vector was then formed from the remaining words, and a principal component analysis was performed to identify the key words that are most
significant in separating the individual documents. The first fifty terms were taken to form eigenvectors, and these were combined with the user information to train a SOM, which produced an alternative structured view of the information suitable for display to the user (Figure 5). Documents can be regarded as vectors in a high-dimensional space:

         w1   w2   ···   wn
    d1    7    5   ···    1
    d2    1    3   ···    0
    ...  ...  ...        ...
    dn    2   11   ···   10
{w_1, w_2, ..., w_n} are terms and {d_1, d_2, ..., d_n} are documents, so each document can be written as

    n_j = \sum_k d_{jk} e_k    (3)

where e_k is the unit vector for term w_k and d_{jk} is the frequency of occurrence of w_k in d_j. The ontology was thus derived using a combination of intrinsic differences in word frequencies between the documents and the semantic concepts inherent in the user's query. This produced results that are often surprisingly close to the user's intuitive expectation, but more research is needed before this can be established for certain. We are investigating the use of hierarchical SOMs, and the identification of further keywords from HTML tags and from the hyperlink structure of the web site itself, as described in the next section.
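The sketch below illustrates the document-vector step in Java under stated assumptions: plain-text files in a local directory, a tiny illustrative stop list, and simple whitespace tokenisation. The directory name, stop list, and class name are hypothetical; stemming, synonym and hypernym expansion, and the per-thousand-word normalisation used for the first-level ontology are omitted for brevity.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch only: each text file becomes a sparse term -> frequency map, i.e. one row d_j of
// the matrix above (d_jk is the frequency of term w_k in document d_j).
public class DocumentVectors {

    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "and", "to", "in", "is");

    static Map<String, Integer> termFrequencies(Path file) throws IOException {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String token : Files.readString(file).toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue; // prune stop words
            frequencies.merge(token, 1, Integer::sum);                   // d_jk := d_jk + 1
        }
        return frequencies;
    }

    public static void main(String[] args) throws IOException {
        try (var files = Files.list(Path.of("docs"))) { // hypothetical directory of text files
            files.forEach(f -> {
                try {
                    System.out.println(f.getFileName() + " -> " + termFrequencies(f));
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }
    }
}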
4.1 Sources of Information for Building the Ontology
The obvious source of information for constructing an ontology is the words contained in the documents themselves, together with those supplied by the user wishing to view and search the information. Synonyms and hypernyms can make this more powerful and can provide a strong hint of the structure of an ontology.

Web pages also contain a great deal of information concerning the structure and importance of documents. The graph of hyperlink references is an ontology in itself, as suggested in [5]. This may be a close approximation to the one required by the application, or a structural mismatch which is of little assistance. Experience suggests that it will rarely be a clean and consistent ontology; it is an important information source nonetheless. Manocha et al. describe a method of inexact graph matching which is used to match user queries to graphs derived from actual web sites [8].

HTML tags such as heading levels and emphasis (for example text size and boldface) also provide heuristics as to the importance of words. Sometimes the words in the first sentence of a paragraph may be considered more important than other words. These heuristics can be used to weight the selection of words after principal component analysis, or simply to select sufficiently important words that may then earn a place in the feature vector by appearing at important points, rather than for their discriminatory power (a sketch of such weighting is given at the end of this section). A great deal more
[Figure 5: The iterative process — a document-term matrix of n documents over a vocabulary of size m (terms w_0 ... w_m, documents d_0 ... d_n) is iteratively reduced to a matrix over a reduced vocabulary of size l (and a reduced number of documents p), from which the SOM categories (classes) are obtained.]
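Returning to the markup heuristics described in section 4.1, the following is a minimal sketch of how word counts might be weighted by HTML tags. The tag weights, the regex-based extraction (which ignores attributes and nesting), and the class name are all assumptions for illustration; this is not the authors' implementation.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: words occurring inside heading or emphasis tags receive an extra weight
// on top of their plain count in the body text.
public class MarkupWeighting {

    static final Map<String, Double> TAG_BOOST =
            Map.of("h1", 4.0, "h2", 3.0, "h3", 2.0, "b", 1.5, "em", 1.5);

    static final Pattern TAGGED =
            Pattern.compile("<(h1|h2|h3|b|em)>(.*?)</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static Map<String, Double> weightedCounts(String html) {
        Map<String, Double> counts = new HashMap<>();
        // base pass: every word in the page, markup stripped, counts with weight 1.0
        for (String word : html.replaceAll("<[^>]+>", " ").toLowerCase().split("\\W+"))
            if (!word.isEmpty()) counts.merge(word, 1.0, Double::sum);
        // boost pass: words inside heading or emphasis tags accumulate an additional weight
        Matcher m = TAGGED.matcher(html);
        while (m.find()) {
            double boost = TAG_BOOST.getOrDefault(m.group(1).toLowerCase(), 0.0);
            for (String word : m.group(2).toLowerCase().split("\\W+"))
                if (!word.isEmpty()) counts.merge(word, boost, Double::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(weightedCounts(
                "<h1>Petrol Engines</h1><p>Combustion processes and <b>materials</b>.</p>"));
    }
}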