Integrating Domain Knowledge into a Generic Ontology

Sofia STAMOU
Computer Engineering & Informatics Dpt., Patras University and Research Academic Computer Technology Institute, 61 Riga Feraiou, Patras, Greece, 26221
[email protected]

Dimitris CHRISTODOULAKIS
Computer Engineering & Informatics Dpt., Patras University and Research Academic Computer Technology Institute, 61 Riga Feraiou, Patras, Greece, 26221
[email protected]

Abstract

In this paper we report on a combined ontology of various subject domains that evolved after merging ontological content that we harvested from the Suggested Upper Merged Ontology (SUMO), MultiWordNet Domains (MWND) and BalkaNet. First, we present the process we followed for integrating SUMO and MWND into a common resource. Then we describe how we enriched the resulting joint ontology with an upper layer of domain concepts, which were borrowed from the topics used in popular Web directories to categorize Web pages. Finally, we address the integration of the enriched ontology with BalkaNet’s Inter-Lingual-Index (BILI). BILI comprises a subset of WordNet 2.0 synsets that have been linked to their semantic equivalents across six Balkan wordnets. Currently, BILI forms a combined ontology of multiple topics and can serve, among other uses, as a structured reference resource for categorizing documents into topics.

1 Introduction

Ontologies define sets of representational terms, referred to as concepts, which are interrelated in order to describe a target world (Bunge, 1977). There are two predominant approaches to building an ontology: representing generic knowledge and representing domain knowledge. Examples of generic knowledge ontologies are CYC (Lenat, 1995), WordNet (Fellbaum, 1998), Sensus (Swartout et al., 1996) and SUMO [1]. In this paper we report on the integration of a generic ontology with domain information and the subsequent organization of the resulting ontology’s hierarchies into domains that correspond to topical categories used by popular Web directories. The ontology that we used in our study is BalkaNet [2], a multilingual wordnet for six Balkan languages (Tufis et al., 2004), which, like EuroWordNet (EWN) (Vossen, 1998), incorporates an Interlingual Index in order to enable the efficient mapping of equivalent concepts across languages. However, unlike the EWN ILI, BalkaNet’s ILI, named BILI, is a structured repository of senses that we leveraged from WordNet [3]. As of this writing, almost 20K WordNet synsets have been linked to their semantic equivalents across each Balkan wordnet.

The main challenge while building BalkaNet, aside from providing a high-quality lexical network for the Balkan languages, was to deliver a shared lexical ontology of various subject domains that could be readily exploited by applications that handle domain-specific data sources. To that end, we put considerable effort into making BalkaNet’s backbone resource, i.e. BILI, an ontology of various domains. The objectives and process of BILI restructuring are given in detail in (Stamou et al., 2004) and (Bilgin et al., 2004), but the main idea behind improving the set of WordNet 2.0 hierarchies that are exploited in BILI was to ensure consistent mappings across the monolingual wordnets and sufficient representation of language-specific properties.

Having improved BILI, our next step was to enrich it with domain knowledge and thus make it a rich ontology of different topics that are of interest to Web users. The reason for enriching BILI with domain knowledge is that generic ontologies provide a large number of fine-grained concepts, which may degrade the ontologies’ efficiency in large-scale applications that deal with domain knowledge resources.

[1] http://ontology.teknowledge.com/
Most importantly, we chose to restructure the ontology for the sake of a specific application we are experimenting with: organizing Web documents meaningfully on the basis of their content. To acquire domain knowledge, we explored two available resources which, despite their differences, both have their concepts mapped to their equivalent WordNet synsets. The resources that we used are the SUMO ontology, a generic ontology of domain labels that have been mapped to their corresponding WordNet synsets, and MultiWordNet Domains [4] (MWND), a resource that assigns every WordNet 1.6 synset a domain label from a total set of 165 domains. Domains are arranged in a hierarchy, and their annotation to WordNet 1.6 synsets has been carried out semi-automatically by members of the Istituto Trentino di Cultura (ITC-IRST) group, from whom we licensed MWND.

In the remainder of the paper, we briefly introduce the resources that we used for building the ontology and we describe the selection criteria for the ontology’s top level domain concepts (Section 2). Next, we report on the integration of the existing resources with the selected top level domains and we give examples that illustrate the integration process (Section 3). We conclude the paper with a discussion of some practical issues that pertain to ontology integration (Sections 4 and 5).

[2] BalkaNet (IST-2000-29388) was funded by the European Commission from Sept. 2001 until Aug. 2004.
[3] http://www.cogsci.princeton.edu/~wn/

2 Integration Setup

In this section, we introduce the resources that we explored for building our ontology and we describe how we selected the ontology’s top level domains.

2.1 The Resources

SUMO is an upper level ontology of about 1,000 general purpose terms that was created by merging publicly available ontological content and which can act as a foundation for more specific domain ontologies (Niles and Pease, 2001). Each SUMO concept has been manually mapped via a synonymy, hypernymy or instantiation relation to every synset in WordNet 2.0 that is related to it (Niles and Pease, 2003). SUMO assigns domain labels to 65,636 WordNet nouns, 11,793 verbs, 17,453 adjectives and 3,101 adverbs.

MWND is a lexical resource that annotates every WordNet 1.6 synset with one of 165 Domain Labels, which are hierarchically organized and have been partially derived from the Dewey Decimal Classification [5] (Magnini and Cavaglia, 2000; Castillo et al., 2004). Each MWND domain is subsumed by one of the following six upper level concepts: doctrines, applied science, pure science, social science, free time and factotum. Despite the rich annotation of WordNet 1.6 with MWND labels, 36,820 synsets have been assigned the factotum label, indicating either that no specific domain could be found for those synsets (known as generic synsets), or that these synsets can appear in a multitude of domains (known as stop senses).

BILI is a structured lexical ontology of roughly 20K synsets, most of which were borrowed from WordNet. BILI initially comprised WordNet 1.5 synsets, but it was soon upgraded with synsets from WordNet 1.7.1, and eventually with synsets from WordNet 2.0. BILI updates were carried out automatically, using available mappings [6] between WordNet versions. The resulting mappings were then manually verified and inconsistencies were corrected. Mapping inconsistencies were due either to certain language-specific concepts for which a correct mapping equivalent could not be found, or to synsets being merged or split between different WordNet versions. Currently, a common set of almost 20K WordNet 2.0 synsets is incorporated in BILI and linked to their semantic equivalents across the Balkan wordnets.

[4] http://wndomains.itc.it/
[5] http://www.oclc.org/dewey

2.2 Challenges to Integration

Integrating distinct ontologies into a single structure is a laborious process that involves several difficulties, such as ensuring data consistency, the validity of interrelations, domain coverage and so forth. Our work of integrating SUMO with MWND did not encounter serious difficulties, essentially because both resources were already mapped to their equivalent WordNet synsets. Nevertheless, despite the resources’ compatibility, merging SUMO and MWND into a common ontology entailed a dual challenge.

The first issue that we had to tackle concerned the pruning of overly abstract domain concepts from the resulting ontology. To this end, we manually inspected the upper level domain concepts in both SUMO and MWND, and selected those that were too abstract to be of practical use. Examples of such concepts are abstract, agent, asleep, etc. We decided to eliminate from the ontology the abstract SUMO domain concepts along with the MWND domain factotum, as we anticipated that these concepts would be prone to yielding inaccurate document classifications. The removal of abstract concepts, though, should not result in information loss. To account for that, we retrieved all BILI (i.e., 20K WordNet 2.0) lexical hierarchies that had originally been mapped to those abstract domains. Then, relying on the hierarchies’ lexical elements, we attempted to find a more suitable domain concept to which to append them. This new domain could be either another SUMO concept or an MWND domain. To minimize the likelihood of subjective judgments while looking for a new domain, we relied on the terms used in the sense descriptions of these concepts, their semantic relations in BILI, as well as their domain relations encoded in WordNet 2.0. When no suitable domain concept could be found for a hierarchy, that hierarchy was excluded from our ontology.

[6] http://www.lsi.upc.es/~nlp/tools/mapping.html

Having eliminated abstract domain concepts, the second issue that we had to deal with was to ensure that the mappings between the concepts of the merged SUMO and MWND hierarchies and the BILI hierarchies would not entail inconsistencies. In particular, we had to ensure that every BILI synset would be assigned an appropriate domain label and that the domain assignment would be in accordance with the BILI semantic relations. Although mapping BILI concepts to SUMO and MWND might seem straightforward at first glance, given the availability of the mappings [7] between the above resources and the WordNet synsets, it could not be conducted entirely automatically. This restriction arises because BILI does not comprise the full set of WordNet 2.0 synsets but only a small portion of it. Moreover, BILI comprises some additional concepts that were incorporated in it within the framework of the BalkaNet project.

To ensure the accuracy of mappings among the concepts in SUMO, MWND and BILI, we proceeded as follows. First, we manually selected the BILI hierarchies that match every SUMO domain concept and marked their starting and ending nodes. Then, each BILI synset found within a marked hierarchy’s path was automatically assigned the corresponding SUMO domain label. Automatic mapping was possible because of the transitivity of the hierarchical links. To assign every BILI synset an MWND domain, on the other hand, we used the mappings from WordNet 1.6 to 2.0 provided by the Natural Language Research Group [8]. The mapping between the two WordNet versions was specified in terms of pairs of synsets and was deterministic (one-to-one) in most cases.
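The two automatic steps described above, labelling every synset on a marked hierarchy path and carrying MWND labels across WordNet versions through synset pairs, can be sketched roughly as follows. This is a minimal illustration and not the authors’ implementation: the function names, the dictionary-based data layout, the synset identifiers and the single-hypernym assumption are all hypothetical.

```python
from collections import defaultdict


def label_path(hypernym_of, start, end, label):
    """Label every synset on the hypernym chain from `start` up to `end`.

    Assumes (for illustration only) that each synset has one hypernym,
    stored in the dict `hypernym_of`, and that the chain has no cycles.
    """
    labels = {}
    node = start
    while node is not None:
        labels[node] = label
        if node == end:
            return labels
        node = hypernym_of.get(node)
    raise ValueError(f"{end!r} is not reachable from {start!r}")


def transfer_labels(mapping_pairs, old_labels):
    """Carry labels from old-version synset ids to new-version ids.

    One-to-one pairs are transferred automatically; one-to-many cases
    are returned separately so that they can be reviewed by hand.
    """
    targets = defaultdict(set)
    for old_id, new_id in mapping_pairs:
        targets[old_id].add(new_id)

    new_labels, conflicts = {}, {}
    for old_id, new_ids in targets.items():
        if old_id not in old_labels:
            continue
        if len(new_ids) == 1:
            new_labels[next(iter(new_ids))] = old_labels[old_id]
        else:
            conflicts[old_id] = sorted(new_ids)
    return new_labels, conflicts


# Toy data; the synset names and ids below are made up.
hypernym_of = {"freestyle": "swimming", "swimming": "water_sport", "water_sport": "sport"}
sumo_labels = label_path(hypernym_of, "freestyle", "water_sport", "Swimming")

pairs = [("old-1", "new-1"), ("old-2", "new-2"), ("old-2", "new-3")]
mwnd_labels, conflicts = transfer_labels(pairs, {"old-1": "sport", "old-2": "medicine"})
```

In this sketch the one-to-many entry `old-2` is not labelled automatically; it is only reported in `conflicts`, in line with the manual handling of such cases.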
The few conflicting cases (one-to-many) were manually corrected, either by removing or by improving in the monolingual wordnets the structural elements that were ill-formed and introduced inconsistencies. Finally, using the corrected mappings between WordNet 1.6 and 2.0, we assigned the 165 MWND Domain Labels to their corresponding BILI synsets. Assigned annotations were manually verified for most of BILI.

The next step was to merge the above resources into a shared ontology of topical domains. To enable merging, we manually selected a set of domain concepts around which the ontology would be organized.

[7] http://cvs.sourceforge.net/viewcvs.py/sigmakee/KBs/WordNetMappings/
[8] http://www.lsi.upc.es/~nlp/

2.3 The Top Level Domains

The top level concepts of the ontology were chosen manually and correspond to popular topics covered in Web data. In selecting them, we examined the topics and sub-topics used in two popular Web directories, i.e. Google Directory (http://dir.google.com) and Yahoo! Directory (http://yahoo.com). First, we compiled a list of the terms used to label each of the topics and sub-topics in both directories. We then removed duplicate domain names from the list and manually searched for the remaining ones in BILI. Domain names that were found among BILI synsets and had deep and dense subordinate hierarchies were retained; the rest were ignored. At the end of this process we arrived at a total of 8 Web topical domains and sub-domains from both Yahoo! and Google Directory for which there was sufficient data in BILI. These top level domain concepts are: Animals, Recreation/Free Time, Arts, Health/Medicine, Science, Society, Sports/Athletics and Measurements, and they were employed as the top concepts in our domain ontology. Thereafter, we integrated MWND, the selected SUMO hierarchies and the BILI hierarchies into these 8 top level domains.
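The selection described above was carried out by hand, but its logic (pool the directory terms, drop duplicates, keep only names whose BILI hierarchy is deep and dense) can be illustrated with a short sketch. The thresholds `min_depth` and `min_size`, the `children` dictionary and all sample terms are assumptions made for the example; the paper does not state numeric criteria.

```python
def depth(children, node):
    """Number of levels in the hyponym subtree rooted at `node` (a leaf has depth 1)."""
    return 1 + max((depth(children, c) for c in children.get(node, ())), default=0)


def size(children, node):
    """Number of concepts in the hyponym subtree rooted at `node`, including `node`."""
    return 1 + sum(size(children, c) for c in children.get(node, ()))


def select_domains(directory_a, directory_b, children, min_depth, min_size):
    """Keep directory terms that occur in the hierarchy and have a deep, dense subtree."""
    # Pooling into a set removes duplicate names across the two directories.
    candidates = sorted({t.lower() for t in directory_a} | {t.lower() for t in directory_b})
    known = set(children) | {c for kids in children.values() for c in kids}
    return [
        t for t in candidates
        if t in known
        and depth(children, t) >= min_depth
        and size(children, t) >= min_size
    ]


# Toy hyponym hierarchy (assumed to be a tree) and toy directory terms.
children = {"sport": ["swimming", "tennis"], "swimming": ["freestyle"], "news": []}
selected = select_domains(
    ["Sport", "News", "Shopping"], ["sport", "Arts"], children, min_depth=3, min_size=4
)
```

Here `sport` survives because its subtree has three levels and four concepts, `news` is found but too shallow, and `shopping` and `arts` are absent from the toy hierarchy.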

3 Integrating Ontologies

Merging SUMO domain hierarchies and MWND into a common ontology was carried out iteratively, by consulting the BILI lexical hierarchies. The first, straightforward step was to detect common domain names across the two resources. Identical or quasi-identical (e.g., transport, transportation) domain names were traced and all their corresponding BILI hierarchies were retrieved. We then compared the lexical elements of those hierarchies and, if they displayed significant overlap, merged them following their BILI semantic relations. Merged hierarchies were grouped under a common parent concept, lexicalized with the name of the hierarchies’ common domain label. As an example, consider the BILI hierarchies that are mapped to the SUMO domain swimming and the hierarchies that have been assigned the MWND domain label sport. Due to the BILI relation between the concepts sport and swimming, their corresponding hierarchies exhibit a large number of overlapping hypernyms. Hence, they are integrated under a common parent concept. This root concept is the MWND domain sport.

In a second step, this parent concept is searched against the 8 top level domains chosen previously. If a match is found between a top level domain name and the parent concept of a combined hierarchy, the latter is integrated with this top level domain. In our example, the parent concept sport is also a top level domain concept (Sports/Athletics), hence all BILI hierarchies assigned to the domain sport are integrated with the ontology’s domain Sports/Athletics. Conversely, if there is no match between the combined hierarchies’ root concept and a top level domain, the direct hypernym of the parent concept is retrieved from BILI and searched against the 8 top domain concepts. If a match is found between the hypernym of a merged hierarchy’s parent concept and a top level domain, the hierarchy is integrated with the top level domain via the IS-A relation. This way the hierarchy’s parent concept becomes a sub-domain of one of the 8 top domains, and denotes a middle level concept in the ontology.

As an example, consider the SUMO domain concept computer program, whose hierarchies have been integrated with those that are assigned the MWND label computer science. Note that hierarchies corresponding to the above concepts are merged because they share common elements in their BILI hierarchies. Following merging, the combined hierarchies are organized under the common parent concept applied science, which in MWND subsumes the computer science domain concept. However, this parent concept is not among the ontology’s 8 top level domains. To find the most suitable domain for integrating the hierarchies that correspond to the parent concept applied science, we retrieve from BILI the direct hypernyms of applied science. Among them is the concept science, which is also a top level domain concept. As such, all hierarchies linked to applied science are appended to the Science domain. In this way, applied science becomes a middle level concept in the ontology.
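The two steps above, comparing hierarchies by overlap and then attaching a merged hierarchy’s parent concept to a top level domain, can be sketched as below. The Jaccard measure is only one possible way to quantify "significant overlap" (the paper does not name a measure), and the alias table that maps a domain name such as sport to Sports/Athletics, like the function names, is a hypothetical stand-in for the manual matching.

```python
def overlap(a, b):
    """Jaccard overlap between two collections of lexical elements (an assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def attach(parent, top_domains, hypernyms_of):
    """Place a merged hierarchy's parent concept in the ontology.

    Returns (top_level_domain, role), where role is "top" when the parent
    itself matches a top level domain and "middle" when one of its direct
    hypernyms does. Returns None when neither matches.
    """
    if parent in top_domains:
        return top_domains[parent], "top"
    for hypernym in hypernyms_of.get(parent, ()):
        if hypernym in top_domains:
            return top_domains[hypernym], "middle"
    return None


# Hypothetical alias table and direct-hypernym table, following the paper's examples.
top_domains = {"sport": "Sports/Athletics", "science": "Science"}
hypernyms_of = {"applied science": ["science"]}

shared = overlap({"sport", "athletics", "swimming"}, {"sport", "swimming", "water"})
sport_place = attach("sport", top_domains, hypernyms_of)
applied_place = attach("applied science", top_domains, hypernyms_of)
unplaced = attach("factotum", top_domains, hypernyms_of)
```

With these toy inputs, sport maps directly to Sports/Athletics, applied science becomes a middle level concept under Science, and a concept with no match is left unplaced for later handling.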
At the end of this integration step, a significant portion of the combined hierarchies had been grouped into the top level domains, but some hierarchies were still left disjoint. Disjoint hierarchies occurred for two reasons: (i) failure to merge, due to their small overlap in BILI, and (ii) failure to match a top level domain in which to be grouped. As a final integration step, we manually partitioned disjoint hierarchies into two classes. The first class comprised the hierarchies whose domain concepts are semantically related in BILI and the second class contained everything else. The hierarchies of the second class were discarded because there was not sufficient evidence in BILI to support our judgments for their potential merging.

Hierarchies of the first class were merged as follows. Every domain concept (from either SUMO or MWND) was searched sequentially in BILI and its hypernyms up to two levels up in the hierarchy were retrieved. Retrieved hypernyms were searched against the ontology’s top and middle level concepts, defined previously. If a match was found between the ontology’s top or middle level concepts and a hypernym or co-hypernym of a disjoint hierarchy’s domain, this (co-)hypernym was incorporated in the ontology as a middle level concept. All hierarchies in the first class that are linked to that middle level concept were merged following their BILI interrelations. Upon another failure to find a matching top or middle level domain concept, the process was terminated and the hierarchies that still remained disjoint were discarded. We imposed this limit on the number of hypernyms considered, instead of continuing up the BILI hierarchies, in order to exclude overly abstract concepts from the ontology. Hence, by limiting the number of merging iterations, we expect that coarse-grained concepts are pruned from the ontology’s middle level concepts.

At the end of the process we arrived at a total of 293 middle level concepts, which are linked to the 8 topic domains in the ontology. The resulting upper level ontology (i.e., top and middle level concepts) is a directed acyclic graph with a maximum depth of four levels and a maximum branching factor (i.e., number of child concepts of a node) of 28. Figure 1 illustrates a small portion of the ontology for the Society domain.

Finally, we integrated BILI with the resulting ontology, simply by anchoring to each middle level concept its corresponding BILI lexical hierarchies. Merging BILI and the ontology was determined by the BILI interrelations that hold between the ontology’s upper level concepts and BILI synsets.
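The bounded hypernym search used for the first class of disjoint hierarchies can be illustrated with the following sketch. It assumes a hypothetical `hypernyms_of` dictionary, handles hypernyms only (co-hypernyms are left out for brevity), and uses made-up relations between concept names; it is not a reproduction of the authors’ procedure.

```python
def hypernyms_within(hypernyms_of, concept, levels=2):
    """Collect the hypernyms of `concept` reachable in at most `levels` steps, nearest first."""
    found, frontier, seen = [], [concept], {concept}
    for _ in range(levels):
        next_frontier = []
        for node in frontier:
            for hypernym in hypernyms_of.get(node, ()):
                if hypernym not in seen:
                    seen.add(hypernym)
                    found.append(hypernym)
                    next_frontier.append(hypernym)
        frontier = next_frontier
    return found


def place_disjoint(concept, known_concepts, hypernyms_of, levels=2):
    """Return the nearest hypernym that is already a top or middle level concept, else None."""
    for hypernym in hypernyms_within(hypernyms_of, concept, levels):
        if hypernym in known_concepts:
            return hypernym
    return None


# Toy hypernym chain; the relations are illustrative, not taken from BILI.
hypernyms_of = {
    "banking": ["financial_transaction"],
    "financial_transaction": ["economy"],
    "economy": ["society"],
}
near = place_disjoint("banking", {"economy", "society"}, hypernyms_of)
too_far = place_disjoint("banking", {"society"}, hypernyms_of)
```

Capping the search at two levels means that a match three levels up, as in the second call, is deliberately not found, which mirrors the stated aim of keeping overly abstract concepts out of the middle level.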
Here, we should point out that the underlying goal of our work was to deliver a high-quality ontology of domains, rather than an exhaustive ontology covering every WordNet synset. Consequently, we omitted from our ontology those WordNet hierarchies for which there was not adequate evidence in BILI to support their inclusion. Quality is supported by the fact that we manually checked the validity of every derived merge by consulting BILI, a resource delivered through the joint collaboration of dozens of experts over almost three years.

[Figure 1 diagram: a tree rooted at Society. Node labels as printed: Economy, Financial Transaction, Exchange, Law, Politics, Legal Action, Diplomacy, Religion, Political Process, Religious Organization, Contracts, Banking, Buying, Currency Measure, Industry, Selling, International Process, Service Contracts, Purchase Contracts, Borrowing, Believes, Roman Catholic, Corporation.]

Figure 1: A small portion of the ontology’s upper layer for the Society domain. Links denote an IS-A relation. Nodes linked directly to the root are sub-domain concepts of Society and, together with their sub-(sub)ordinates, form the ontology’s middle level concepts.

4 Practical Considerations

Having developed the ontology, one question we need to answer is how much our ontology covers. To answer that, we have to consider the core objective for which the ontology was built. In essence, we perceive our ontology to be complementary to existing knowledge bases (e.g. Meaning (Atserias et al., 2004)), and we see its contribution in facilitating the categorization of Web documents into topics. Given this objective, we believe that our ontology covers a broad spectrum of popular topics in Web content reasonably well, since the top level domains were borrowed from both the Yahoo! and Google Directory. Moreover, we believe that our ontology is of good quality, since the middle level concepts were manually checked and the lower level hierarchies were adopted from BILI. Relying on popular Web directories for the selection of the ontology’s top level domains gives practical evidence that the topical domains covered in the ontology have been found useful by people, especially by Web cataloguers and catalog users. Furthermore, the manual selection of the middle level concepts, on the basis of their sufficient representation in BILI, indicates that the knowledge represented in the ontology is meaningfully organized.

Another issue is the soundness of the merging approach. In this respect, merging did not pose any serious practical or theoretical problems, since the lexical hierarchies of both resources that we used (i.e., MWND and SUMO) contain a significant amount of overlapping content and both have been mapped to a common lexical network, i.e., WordNet. Our approach for merging the above resources relies on evidence from BILI, a fraction of WordNet 2.0, which gives adequate support to our judgments about grouping lexical hierarchies into common domains. Of course, judgments of this sort are still somewhat subjective; however, we should not understate the amount of effort and time that has been spent on developing these resources.

Finally, with respect to the overhead associated with building the ontology, it is worth pointing out that although we relied on existing structured resources, our work turned out to be quite costly in terms of both time and effort. Evidently, this stems from the lack of scalability that is inherent in the manual construction of ontologies.

5 Conclusion and Future Plans

We have presented a combined ontology of various domains that emerged after enriching BalkaNet’s Interlingual Index (BILI) with existing ontological content. To improve BILI, we explored the domain hierarchies of both SUMO and MWND, which we combined into a common structured resource. The reason for relying on existing resources rather than developing the ontology from scratch is twofold. On the one hand, the resources we explored contain a large volume of high-quality data, which was already mapped to WordNet (the backbone of BILI) by field experts. On the other hand, we believe that no ontology is intrinsically useful for a practical application unless it is used in a complementary manner to other resources.

In detail, we have described the methodology we adopted for integrating MWND domains and the SUMO generic ontology into a set of 8 manually selected domain concepts. The rationale for the selection of these domain concepts is their established usefulness in organizing Web data thematically, an argument that is supported by their wide usage in popular Web directories. The objective of this article is neither to compare the resources we started off with, nor to assess the final impact of the delivered ontology, but rather to present the steps we took toward ontology integration, as well as to communicate the insights we gained from this experience. It is our hope that this will motivate others to experiment with our resource and employ it in their systems and applications. Essentially, we perceive our ontology to be complementary both in nature and scope to existing resources, and by no means do we consider it useful in isolation. Most importantly, we argue that the added value of our ontology lies in its alignment with the respective hierarchies across six monolingual Balkan wordnets. We hope that this will open up avenues for the Balkan languages to become widely studied and richer in lexical resources and language processing applications.

Concerning the latter, and with emphasis on the practical side of our work, we are currently testing the efficiency of our ontology in classifying Web documents into topical categories, motivated by the findings of (Magnini et al., 2001) and (Gliozzo et al., 2004), who showed that domain information can give good estimates of documents’ domain relevance. To that end we have designed a prototype classification framework that makes explicit use of the information encoded in BILI and attempts to group documents retrieved from the Web into the ontology’s top level concepts. Although our work is still underway, early experimental results indicate that our ontology has potential for serving document categorization tasks.

6 Acknowledgements

Our sincere thanks go to each member of the BalkaNet consortium for their priceless contribution throughout the three years of the project. We also acknowledge the valuable support and consultation of Dr. Christiane Fellbaum and Dr. Piek Vossen.

References

J. Atserias, S. Climent and G. Rigau. 2004. Towards the MEANING Top Ontology: Sources of Ontological Meaning. In Proceedings of the 4th Language Resources and Evaluation Conference (LREC), Lisbon, Portugal.

O. Bilgin, O. Cetinoglu and K. Oflazer. 2004. Morphosemantic Relations In and Across Wordnets: a Study Based on Turkish. In Proceedings of the 2nd International Global WordNet Conference, Brno, Czech Republic, pages 60-66.

M. Bunge. 1977. Treatise on Basic Philosophy. Ontology I. The Furniture of the World. Vol. 3. Reidel, Boston.

M. Castillo, F. Real and G. Rigau. 2004. Automatic Assignment of Domain Labels to WordNet. In Proceedings of the 2nd International Global WordNet Conference, Brno, Czech Republic, pages 75-82.

Ch. Fellbaum. 1998. WordNet: an Electronic Lexical Database. MIT Press.

A. Gliozzo, C. Strapparava and I. Dagan. 2004. Unsupervised and Supervised Exploitation of Semantic Domains in Lexical Disambiguation. Computer Speech and Language, 18(3), pages 275-299.

D. Lenat. 1995. Cyc: a Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM, 38(11), pages 33-38.

B. Magnini and G. Cavaglia. 2000. Integrating Subject Field Codes into WordNet. In Proceedings of the 2nd Language Resources and Evaluation Conference (LREC), Athens, Greece, pages 1412-1418.

B. Magnini, C. Strapparava, G. Pezzulo and A. Gliozzo. 2001. Using Domain Information for Word Sense Disambiguation. In Proceedings of SENSEVAL-2 International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, France, pages 111-114.

I. Niles and A. Pease. 2001. Origins of the Standard Upper Merged Ontology: a Proposal for the IEEE Standard Upper Ontology. In Working Notes of the IJCAI-2001 Workshop on the IEEE Standard Upper Ontology, Seattle, Washington.

I. Niles and A. Pease. 2003. Linking Lexicons and Ontologies: Mapping WordNet to the SUMO Ontology. In Proceedings of the IEEE Conference on Information and Knowledge Engineering, pages 412-416.

S. Stamou, G. Nenadic and D. Christodoulakis. 2004. Exploring BalkaNet Shared Ontology for Multilingual Conceptual Indexing. In Proceedings of the 4th Language Resources and Evaluation Conference (LREC), Lisbon, Portugal.

B. Swartout, R. Patil, K. Knight and T. Ross. 1996. Toward Distributed Use of Large Scale Ontologies. In Proceedings of the 10th Workshop on Knowledge Acquisition for Knowledge-Based Systems, Banff, Canada.

D. Tufis, D. Cristea and S. Stamou. 2004. BalkaNet: Aims, Methods, Results and Perspectives. A General Overview. Romanian Journal of Information Science & Technology, 7(1-2), pages 1-35.

P. Vossen. 1998. EuroWordNet: a Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers.

SUMO Ontology: http://ontology.teknowledge.com/

WordNet Domains: http://wndomains.itc.it/