An Ontology-based Information Retrieval System

Péter Varga, Tamás Mészáros, Csaba Dezsényi, Tadeusz P. Dobrowiecki
Department of Measurement and Information Systems, Budapest University of Technology and Economics (BUTE), Magyar tudósok körútja 2., H-1117 Budapest, Hungary
{pvarga, meszaros, dezsenyi, dobrowiecki}@mit.bme.hu

Abstract. The authors describe a general architecture and a prototype application for the concise storage and presentation of information retrieved from a wide spectrum of information sources. The proposed architecture was shaped by the particular challenges of a knowledge-intensive domain: mining the knowledge content of primarily unstructured textual information, demands for context-driven, multi-faceted, up-to-date querying and presentation of the required information, and the intricacies of the Hungarian language, which call for special solutions to a number of linguistic problems.

1 Introduction

The Information and Knowledge Fusion (IKF) international EUREKA project aims at the design and implementation of a new Intelligent Knowledge Warehousing environment that allows advanced knowledge management in various application domains (e.g. banking, legal information, education, health care) [1]. The IKF Framework implements this environment using information retrieval and extraction, various knowledge representation techniques, and information access methods. The IKF Framework is a generic, domain-independent architecture that can be used to build IKF Applications providing services in specific domains. The Hungarian IKF project (IKF-H) concentrates on developing a financial advisory prototype application. In an earlier paper [3] we presented the general model of the problem along with a high-level architecture and the key technologies needed to implement such a system. In order to build a successful information retrieval and integration application, we proposed the extensive use of background knowledge about the application domain. This paper presents the first results in developing a general methodology for how this background knowledge should be organized and used in the IKF Framework. We also present the architecture and operation of the IKF-H prototype application.

1.1 Overview of the IKF Architecture

At a properly abstracted level, all of the application areas mentioned above can be conceptualised as an interaction of three information environments [3]. By the Target Environment we denote that fragment of the real world where the targeted (monitored) objects (corporations, bank clients, business processes, etc.) exist. The Information Cumulating Environment comprises all forms and media that cumulate information about the targets; in our case it is the Internet, various Intranet resources, corporate databases, published resources, financial experts' personal expertise and the like. Finally, the Information Utilizing Environment represents the users of the information (e.g. the staff of a bank) at various levels of management. Figure 1 shows the high-level IKF architecture.

Figure 1. The IKF high-level architecture

2 The role of ontology in the IKF System

In order to surpass the performance of a typical information retrieval system (both in precision and recall), the process of human information retrieval must be studied and, at least partially, followed. Even the shallowest analysis of human performance shows that its advantage consists of two main factors: (1) the use of linguistic competence and (2) the benefits of background knowledge. Since linguistic techniques are rapidly being added to implemented information retrieval systems, the construction, mapping and incorporation of background knowledge becomes the biggest challenge. This involves abandoning solely index-based searching methods, and requires making use of some logical apparatus. This is one of the central design goals of the IKF-H application.
The designer of any such information retrieval system has to face a complex problem: how to transform the background knowledge (which resides in humans) into terms of computer science that can lead to an implemented system. Although human knowledge, the nature of which is itself a huge philosophical problem, might be modelled with the use of intensional logic, this would clearly lead to an almost unimplementable system burdened with theoretical problems. If a retrieval system aims at making use of background conceptual knowledge, its implementers must confine themselves to a less powerful logical apparatus. As is well known, the use of ontologies can provide a solution to this paradox. A suitable definition of ontology can be found in [9][12][13]: an ontology is a theory formulated in the less powerful language of the working system, which tries to cover those models of this language that, in some manner, correspond to the conceptualisation (an intensional system trying to account for the human background knowledge).
The crucial role of a well-defined ontology has already been recognised: the IEEE Computer Society has initiated a Standard Upper Ontology Study Group [8], a research group related to the IKF project is tackling the question of the meta-organisation of ontological hierarchies [9], and fairly recently a number of ontology-based enterprise models have been developed [10][11]. All these developments serve as a basis for the development of a suitable Hungarian enterprise ontology.

3 Building the IKF-H ontology

Consideration of the specific requirements of an information retrieval system yields several insights. First, a strict distinction must be drawn between conceptual knowledge and factual knowledge. Efforts towards a design-time incorporation of the latter would be a petitio principii, since this is the very information we want to retrieve. The conceptual knowledge, on the other hand, must be included before the system starts working. As described above, this separate incorporation is to be done via ontologies. Further insights emerge from a comparison with existing initiatives in ontology research. Whilst one central function of ontologies nowadays is to ensure semantic agreement between communicating computer systems, IKF, in its primary form, does not involve communication with other systems, and therefore has less need of a rigorously defined ontology.
After identifying these design requirements, the task of constructing the ontology could start. Unfortunately, the very first step of this process proves to be the most difficult, since it involves deciding what to include at the top level. On the one hand, it raises extremely difficult philosophical questions; on the other hand, it leads into a jungle of currently proposed top-level ontology standards. Adjudicating between them and creating the ideal top-level ontology would be a great challenge, but it is definitely not the most important task in building an information retrieval system. Fortunately, such systems do not need an ideal and exact ontology of the world, since neither their input nor their output is to be considered ideal or exact; but they may utilise an ontology to improve accuracy (just as humans do). Even building a simple top-level ontology requires considering several possible world descriptions. In the following, we explain how the Aristotelian ontology¹ was adopted in the IKF-H system.
The backbone of the IKF-H ontology consists of several concept hierarchies based upon the generic-specific distinction (a graph theorist would call it a forest).

¹ What we call Aristotelian ontology is the mainstream reception of his logical works. Apart from Aristotle's own works, it also rests upon authors like Porphyrios.


The genus-concept is partitioned into several species-concepts, which stand in mutual exclusion (this gives a kind of semantics to the graph structure). It is important to note that the bottom parts (the leaves) of a hierarchy are also concepts, not instances (every species-concept stands in an is-a relation to its genus-concept). These conceptual hierarchies provide an easily computable yet expressive basis for the ontology.
Of course, up to this point the proposed concept of ontology is not far from trivial. The main trick is the introduction of the system of categories. There is only a previously known, fixed number of concept hierarchies (trees), each with a fixed meaning. (For example, there are substances (roughly speaking, entities) with their own hierarchy, there are qualities (again with their own hierarchy), there are quantities, and so on.) For the purpose of a prototype system the original categories of Aristotle were adopted², but this can later be changed depending upon specific needs. The notion of an intercategorial relation is also introduced, i.e. a concept can involve constraints on a concept of another category (e.g. natural substance involves colour). These relations are of a logical nature (implication or exclusion, possibly between complex concepts). The fact that the assignment of categories is done before designing the ontology makes it possible to introduce implicit relations joining together the root concepts of the categories. The intercategorial relations, then, are only refinements of these primordial relations. (This can help the ontology designer to sketch the possible intercategorial relations.)
This concept of ontology results in a system that is capable of incorporating taxonomies as well as handling constraints that join different taxonomies together. It is also capable of coping with partially known information, since categories related to the unknown piece of information default to the category's ancestor concept. In the following we describe the prototype IKF application and how the ontology was used in this system.
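Before turning to the prototype, the structure just described can be made concrete with a minimal sketch. All class, field and enum names below are illustrative assumptions; the paper does not specify the actual IKF-H data model.

// Minimal sketch (illustrative names only) of category-partitioned concept hierarchies
// with intercategorial relations, as described in Section 3.
import java.util.ArrayList;
import java.util.List;

enum Category { SUBSTANCE, QUANTITY, QUALITY, RELATION, SPACE, TIME, POSITION, STATE, ACTION, PASSION }

final class Concept {
    final String name;
    final Category category;                          // each concept belongs to exactly one category (tree)
    final Concept genus;                              // parent in the is-a hierarchy; null for a category root
    final List<Concept> species = new ArrayList<>();  // mutually exclusive specialisations (concepts, not instances)

    Concept(String name, Category category, Concept genus) {
        this.name = name;
        this.category = category;
        this.genus = genus;
        if (genus != null) genus.species.add(this);
    }

    // True if this concept is subsumed by (stands in an is-a chain to) the given concept.
    boolean isA(Concept other) {
        for (Concept c = this; c != null; c = c.genus)
            if (c == other) return true;
        return false;
    }
}

// An intercategorial relation: a concept of one category implies or excludes a concept
// of another category (e.g. "natural substance involves colour").
final class IntercategorialRelation {
    final Concept from;
    final Concept to;
    final boolean implies;                            // true = implication, false = exclusion

    IntercategorialRelation(Concept from, Concept to, boolean implies) {
        this.from = from; this.to = to; this.implies = implies;
    }
}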

4 The prototype IKF-H application

In the framework of the Hungarian IKF project a prototype system is under development in order to implement and demonstrate the theoretical ideas in a real-world application. This prototype system collects available information about Hungarian companies from the World Wide Web, and provides it in a concise and integrated way to end users in a bank to support their decision processes (e.g. loan management). The prototype application contains the following main components:
• Information Retrieval Subsystem,
• Document Analysis Subsystem,
• Knowledge Repository,
• Document Repository, and
• Search and Report Interface.
Documents retrieved by the Retrieval Subsystem are analysed and stored in the repositories. End users access the repositories via a search and report generation interface.

² These are: substances (entities), quantities, qualities, relations (relational qualities), space (spatial qualities), time (temporal qualities), positions, states, actions and passions.


In the following we take a closer look at document retrieval and analysis.

4.2 Information Retrieval in the IKF system

The Retrieval Subsystem automatically traverses the information sources, retrieves documents, and prepares them for information extraction. It works as an autonomous agent [5] that receives its goals (in the form of document source URLs and search patterns) from the rest of the IKF system, and achieves them by collecting documents from the sources [2] and analysing them with information extraction methods [16]. Its internal structure and working mechanism are shown in Figure 2.

Figure 2. Architecture of the document retrieval system (components: source environment (Web), URL Register, Downloader, IKF Source Document, Source Content Analyser, content object database, Distributor, Document Builder and Register, supported by a Search Knowledge Base and a Domain Knowledge Base; "Links"-type content objects are fed back to the URL Register)

The Retrieval Subsystem works in the following way. The URL Register builds an internal model of the structure of the source environment [4]. With this, the agent has a general view of all the web places it has visited or needs to visit. The task of the Downloader module is to select the next URL, retrieve the document from the selected address, and build the so-called IKF Source Document, which is the internal representation of the original source document in XML form. The next step is an initial analysis of the IKF Source Document. This is done by the Source Content Analyser, which is responsible for recognizing the structure and main characteristics of the retrieved document. It extracts so-called content objects that hold all the relevant information found in the original documents. These content objects are XML documents containing the extracted information in a structured form according to domain-specific type definitions. These definitions indicate the meaning of the content. For example, a common and simple content type is the list of links present in a document. Typically, these content objects hold unstructured text between the tags, but these text fragments contain the information we want to extract. The Source Content Analyser also performs textual analysis on the content objects in order to describe them in more detail. This is based on indexing and information retrieval methods.
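A rough sketch of one such retrieval cycle is given below. The interface and method names mirror the modules of Figure 2 but are assumptions made for illustration (the paper does not specify the Java interfaces, and here the URL selection is factored into the register for brevity).

// Illustrative sketch of one retrieval cycle; all types and method names are assumptions, not the IKF API.
import java.util.List;
import java.util.Optional;

interface SourceDocument { }                                         // the IKF Source Document (XML form)
interface ContentObject {
    boolean isLinkList();                                            // is this a "links"-type content object?
    List<String> links();
}
interface UrlRegister {                                              // internal model of the source environment
    Optional<String> selectNextUrl();
    void register(List<String> newUrls);
}
interface Downloader { SourceDocument fetch(String url); }
interface SourceContentAnalyser { List<ContentObject> extract(SourceDocument doc); }
interface Distributor { void store(ContentObject obj); }             // hands objects over for deeper analysis

final class RetrievalCycle {
    private final UrlRegister urlRegister;
    private final Downloader downloader;
    private final SourceContentAnalyser analyser;
    private final Distributor distributor;

    RetrievalCycle(UrlRegister r, Downloader d, SourceContentAnalyser a, Distributor dist) {
        this.urlRegister = r; this.downloader = d; this.analyser = a; this.distributor = dist;
    }

    void runOnce() {
        Optional<String> next = urlRegister.selectNextUrl();         // pick the next place to visit
        if (next.isEmpty()) return;
        SourceDocument source = downloader.fetch(next.get());        // build the IKF Source Document
        for (ContentObject obj : analyser.extract(source)) {         // shallow, structure-level analysis
            if (obj.isLinkList()) {
                urlRegister.register(obj.links());                   // feedback: newly discovered URLs
            }
            distributor.store(obj);                                  // persist/forward for deeper analysis
        }
    }
}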


The result is a set of content descriptions attached to the content objects that help the deeper analysis of the retrieved texts.
Figure 3 shows an example of extracting information from articles of a news portal. The left side contains a sample picture of one article. It is embedded in a page that contains advertisements, menus, and other non-relevant information. On the right-hand side the resulting content object is shown. In this simple example the title, date, author, article text and the citations (links) inside the article text were extracted.

Figure 3. Example of source document parsing
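For the article of Figure 3, the resulting content object might look roughly like the following. The element names are invented for illustration only; the paper does not give the actual content-type definitions.

<contentObject type="news-article">
  <title>...</title>
  <date>...</date>
  <author>...</author>
  <articleText>
    ... unstructured article text, kept for later linguistic analysis ...
  </articleText>
  <links>
    <link href="...">citation inside the article text</link>
  </links>
</contentObject>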

4.3 Document analysis

The IKF Retrieval system performs only a shallow analysis of the source documents. Its main task is to transform them into XML structures (content objects) that support further, deeper analysis. The final goal is to extract knowledge pieces from the IKF Documents that can be integrated into an IKF Application's knowledge repository. In order to achieve this goal, the IKF Document Analysis module utilizes several analysis techniques.
Based on XML technology we created a general framework for document analysis. This general framework does not have any document processing capabilities by itself. It employs so-called document analysis plugins (i.e. modules) in a dynamically configurable way to perform the analysis. These plugins share the same interface. Their inputs and outputs are IKF Documents, and they uniformly perform some kind of transformation upon them. Typically, an analysis plugin creates or modifies an XML structure found in the IKF Document. There are several types of plugins, from simple address, e-mail, or phone number extractors to more complex grammatical analysis.
The most complex analysis plugin is the linguistic analyser [15]. It analyses the documents at several levels: words, sentences and paragraphs. Its main goal is to build a grammatical structure that helps the knowledge extraction process. This is done in three steps: morphological analysis, nominal phrase recognition, and verbal phrase recognition. The morphological analysis identifies the words in the sentences and creates their morphological representation. Recognition of nominal and verbal phrases is based on a rule-matching mechanism: the nominal phrase rules refer to the morphological information in the words, and verbal phrase recognition is based on both morphological and semantic information (the latter drawn from the Hungarian Explanatory Dictionary).
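Stepping back to the plugin framework itself, a minimal sketch of the shared interface could look as follows. The names are assumptions (the paper only states that the implementation is component-based Java with XML interfaces); the linguistic analyser would be just one such plugin among others.

// Sketch of the plugin framework described above; interface and class names are illustrative, not the IKF API.
import java.util.List;

interface IkfDocument {
    org.w3c.dom.Document xml();                          // underlying XML structure (content objects, annotations)
}

// Every analysis plugin consumes an IKF Document and returns a (transformed or enriched) IKF Document.
interface DocumentAnalysisPlugin {
    IkfDocument analyse(IkfDocument input);
}

// The framework itself has no processing capability: it only chains dynamically configured plugins,
// e.g. an address extractor, an e-mail extractor, and finally the linguistic analyser [15].
final class AnalysisPipeline {
    private final List<DocumentAnalysisPlugin> plugins;

    AnalysisPipeline(List<DocumentAnalysisPlugin> plugins) { this.plugins = plugins; }

    IkfDocument analyse(IkfDocument doc) {
        for (DocumentAnalysisPlugin plugin : plugins) {
            doc = plugin.analyse(doc);                   // each plugin creates or modifies XML structures
        }
        return doc;
    }
}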


The main aim of creating the grammatical structure is to enhance the effectiveness of information extraction. This can be done in several ways. Noun constructions help in identifying objects and attaching attributes to them; this makes it possible to establish links between the analysed documents and the ontology of the application area. Verbs and their complements help in recognizing relations between objects. They also allow the transformation of the information found in the documents into a standardized form suitable for knowledge representation and reasoning.

4.4 Application of the ontology

There are several ways to utilise the potential of the ontology. Let us suppose that a powerful index-based search engine is available, along with a vast collection of documents (as is the case in the IKF application). In a typical case the user wants to retrieve a relevant set of documents. In practice, the user defines relevance by supplying a list of words in a query. Typically, the user does not specify an extensive and precise list of words; there can be several reasons for this. Some of the missing words might seem trivial, some subtle; still, they would significantly enhance the quality of the result. The simplest evidence for this hypothesis is observing how a failed search query is typically continued: by supplying a refined query, i.e. more search terms. This led to the idea that the ontology should assist in expanding the query, i.e. in adding search terms that are implied by the background knowledge. When proposing new query terms based on background knowledge, a computed weight factor can also be assigned which estimates the relevance of the term. Disjunction constraints (and implications resulting in negations) can also be utilized via negative weights.
As implemented in the IKF prototype application, the query supplied by the user is also processed by the linguistic analyser. In order to achieve the query-expanding functionality, the word stems are converted to concept names found in the ontology. It must be noted that this relation is not a bijection, since a concept might have several names in natural language. (Although the current implementation treats the relation as a function, it is also theoretically possible that one name signifies several concepts.) After the query words are converted into their conceptual counterparts, the essential step of the process can follow: the expansion of the query terms (see Figure 4). Given our structure of ontology, this is done in the following way: subsuming and subsumed concepts (with decreased weight factors) are also added to the query list, and this process can be iterated up to some level (with correspondingly decreased weight factors). The intercategorial constraints can also be utilized by adding those concepts from other categories that are implied by these constraints.
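The following sketch illustrates this weighted expansion step, reusing the Concept/genus/species structure sketched after Section 3. The decay factor, the maximum iteration depth and the method names are illustrative assumptions; the weights actually used in the prototype are not reported in the paper.

// Illustrative sketch of ontology-based query expansion with weight factors.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class QueryExpander {

    // Maps each query concept to a weight; expands along the is-a hierarchy in both directions.
    static Map<Concept, Double> expand(List<Concept> queryConcepts, int maxLevels) {
        Map<Concept, Double> weighted = new HashMap<>();
        for (Concept c : queryConcepts) {
            weighted.merge(c, 1.0, Math::max);            // original query terms keep full weight
            addNeighbours(c, 1.0, maxLevels, weighted);
        }
        return weighted;                                  // concept names are then mapped to index words
    }

    private static void addNeighbours(Concept c, double weight, int levels, Map<Concept, Double> out) {
        if (levels == 0) return;
        double next = weight * 0.5;                       // decreased weight per level (assumed decay factor)
        if (c.genus != null) {                            // subsuming (more general) concept
            out.merge(c.genus, next, Math::max);
            addNeighbours(c.genus, next, levels - 1, out);
        }
        for (Concept s : c.species) {                     // subsumed (more specific) concepts
            out.merge(s, next, Math::max);
            addNeighbours(s, next, levels - 1, out);
        }
        // Intercategorial constraints would add concepts from other categories here:
        // positive weights for implications, negative weights for exclusions.
    }
}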


Figure 4. The process of expanding a query (the user's query is turned into word stems by the linguistic analysis plugin, the stems are mapped to concept names, the concepts are expanded by an inference engine over the ontology, and the expanded concepts are mapped to index words that form the output)

A query containing several words is mapped onto a list of concept names. This can be seen as a complex concept formed by the conjunction of its members, and therefore an equivalent simple concept can also be retrieved from the ontology. If such a search fails, the list should be considered as a disjunctive complex concept in order for the process to continue. Another possibility is to apply the same process to the negated concept list (with negative weight factors). This step results in another list of concepts, which again must be converted. The output of this conversion, however, will be utilized by an index-based search method, so another mapping is also required, which relates concept names to index words. Again, this relation need not be a strict bijection, and every index word can carry an extra weight multiplication factor. The resulting index word list serves as the input to the index-based search.
The advantages that can be gained from using ontologies depend on the content and structure of the whole ontology, but they can be illustrated by the following simple example. Suppose that a computer user is searching for information on a so-called "exploit". Using the simple ontology depicted on the left side of Figure 5, the query can be supplemented with additional weighted terms (right side).

Figure 5. An ontology fragment (only taxonomy shown) and the resulting query (the taxonomy relates exploits to operating systems and web servers — Microsoft and Unix families, Windows NT/2000, Linux, Solaris — and the supplemented query contains weighted terms such as EXPLOIT, MS_OS, UNIX, WEBSERVER, WINDOWSNT, WINDOWS2000, LINUX, SOLARIS, IIS, APACHE, THTTPD, SUSE, DEBIAN, REDHAT)

For the sake of the example, a collection of 101 documents (approx. 1.2 MB) containing the word "exploit" was fetched from the Internet and loaded into the Document Repository. Each of these documents was manually classified into one of three categories: relevant, semi-relevant and irrelevant. Using this example it can be seen that, because of polysemy and different contexts of usage, the simple index search using the vector model delivered inadequate results, whilst the same search engine with the supplemented query performed much better (see Figure 6).


It should be noted that this technique improves both the pinpointing of specific results and the elimination of irrelevant ones, both of which are beneficial for further processing of the results. This example only illustrates the original idea of how index-based searching can profit from using ontologies. As mentioned above, in the IKF Application ontologies fulfil a more complex role: they assign weight factors to each term and are also responsible for the translation of linguistically analysed natural language terms into terms of the index language.

Figure 6. Relevance examination of the two result sets (relevance values plotted against result position for the supplemented search and the simple search)

The process of query expansion described above is implemented in a prototype system, which demonstrates its utility (and was also used to produce the example above). For this implementation a prototype ontology was also developed (primarily covering concepts related to business processes). In order to execute queries against the ontology, an implementation of a suitable logical apparatus had to be selected; for the purpose of prototyping, I. Horrocks' FaCT (Fast Classification of Terminologies) system [14] was chosen. The whole implementation was done in component-based Java with XML and CORBA interfaces. Further improvements of this ontology-based functionality might make use of other levels of linguistic analysis, or might include reasoning with instances.

5 Summary

In this paper we have shown how the generic information and knowledge fusion architecture (envisioned within an international EUREKA cooperation) takes shape as a Hungarian-language, financial-domain-specific application. The key issue in progressing toward a well-functioning application is the development of a suitable domain ontology as the basis for the interpretation and grounding of the knowledge extracted from short financial news. The identified practical and theoretical requirements led to the construction of an ontology with a structure that conforms to the human way of describing the world, causes no extra implementation problems, and is still theoretically well founded.
The developed prototype system autonomously retrieves documents from the designated information sources (e.g. web resources, here Hungarian electronic financial publications). Based upon the proposed structure of the ontology, functionality was designed and implemented which expands queries by adding terms implied by the background (ontological) knowledge.


The system then selects an appropriate source content parser based on previously defined document and source models, and transforms the retrieved source content into XML structures (called content objects). A document analysis module then performs various text analysis tasks on the content objects in order to extract information from the source documents. These tasks also include linguistic analysis with tools suited to the application language. Although there is still much to be done with the prototype, the lessons learned so far are the crucial role of the ontology in every one of the system's services, and the sobering fact that in the interpretation of Hungarian-language news the widely used linguistic tools are not enough; the language calls for an essentially heuristic approach.

6 References

[1] EUREKA Project "IKF - Information and Knowledge Fusion", March 2000.
[2] "The IKF Architecture", IKF project report, August 2002.
[3] T. Mészáros, Zs. Barczikay, F. Bodon, T. Dobrowiecki, Gy. Strausz, "Building an Information and Knowledge Fusion System", IEA/AIE-2001, The Fourteenth International IEA/AIE Conference, June 4-7, 2001, Budapest, Hungary.
[4] S. Chakrabarti et al., "Mining the Web's Link Structure", IEEE Computer 32: 60-67, 1999.
[5] J. M. Bradshaw (ed.), "Software Agents", The MIT Press, 1997.
[6] John Sowa's web site devoted to knowledge representation and related topics of logic, ontology, and computer systems, http://www.bestweb.net/~sowa/direct/index.htm
[7] J. Sowa, "Knowledge Representation: Logical, Philosophical, and Computational Foundations", Brooks Cole Publishing Co., Pacific Grove, CA, 2000.
[8] Standard Upper Ontology, IEEE Study Group, IEEE Computer Society, Standards Activity Board, June 2000, http://ltsc.ieee.org/suo/index.html
[9] N. Guarino and Ch. Welty, "A Formal Ontology of Properties", LADSEB/CNR Technical Report 01/2000, http://www.ladseb.pd.cnr.it/infor/ontology/Papers/OntologyPapers.html
[10] M. S. Fox, J. F. Chionglo, and F. G. Fadel, "A Common-Sense Model of the Enterprise", Proceedings of the 2nd Industrial Engineering Research Conference, 1993, pp. 425-429, Norcross, GA: Institute for Industrial Engineers.
[11] US Taxonomies, US GAAP C&I Taxonomy 00-04-04.
[12] N. Guarino, "Formal Ontology in Information Systems", in N. Guarino (ed.), Formal Ontology in Information Systems, Proceedings of FOIS'98, Trento, Italy, 6-8 June 1998, IOS Press, Amsterdam, pp. 3-15.
[13] N. Guarino and P. Giaretta, "Ontologies and Knowledge Bases: Towards a Terminological Clarification", in N. Mars (ed.), Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing, IOS Press, Amsterdam, 1995, pp. 25-32.
[14] I. Horrocks, "The FaCT System", in H. de Swart (ed.), Automated Reasoning with Analytic Tableaux and Related Methods: International Conference Tableaux'98, number 1397 in Lecture Notes in Artificial Intelligence, pp. 307-312, Springer-Verlag, Berlin, May 1998.
[15] B. Benkő, T. Katona, and P. Varga, "Understanding Hungarian Language Texts for Information Extraction", Internal Report, Dept. of Measurement and Information Systems, Budapest University of Technology and Economics, 2002 (in Hungarian).
[16] L. Eikvil, "Information Extraction from World Wide Web - A Survey", Report No. 945, Norwegian Computing Center, July 1999.

