P1: Dhirendra Samal (GJE) Applied Intelligence

KL1703-07

February 7, 2003

19:27

Applied Intelligence 18, 323–340, 2003 c 2003 Kluwer Academic Publishers. Manufactured in The Netherlands. 

Text Mining Techniques to Automatically Enrich a Domain Ontology∗ MICHELE MISSIKOFF IASI-CNR, Viale Manzoni 30, Rome [email protected]

PAOLA VELARDI AND PAOLO FABRIANI DSI, University La Sapienza, Rome [email protected]


Abstract. Though the utility of domain ontologies is now widely acknowledged in the IT (Information Technology) community, several barriers must be overcome before ontologies become practical and useful tools. A critical issue is ontology construction, i.e., the task of identifying, defining, and entering the concept definitions. For large and complex application domains this task can be lengthy, costly, and controversial (since different persons may have different points of view about the same concept). To reduce time, cost (and, sometimes, harsh discussions) it is highly advisable to refer, in constructing or updating an ontology, to the documents available in the field. Text mining tools may be of great help in this task. The work presented in this paper illustrates the guidelines of SymOntos, an ontology management system, and the text mining approach adopted to support ontology building. The latter operates by extracting, from the related literature, the prominent domain concepts and the semantic relations among them.


Keywords: ontology, text-mining, terminology, ontology management system, natural language processing

1. Introduction and Motivations

∗ This work has been partially supported by the European Project ITS-13015 (FETISH).

With the spread of globalization, and the enhanced opportunity for enterprises to cooperate, even in an unplanned manner, there is a growing need for a common, shared vision of entities and activities in a given application domain. On more technical ground, XML is gaining popularity in information exchange among enterprises. It is customary to say that XML, with respect to HTML, introduces a semantic level in web-based information exchange. Indeed, XML allows domain-oriented tags to be introduced, but it does not offer more than bare terminology, lacking a real semantic specification of the terms used. These two issues, one more on the business side and the other on the technological side, show the need for an infrastructure able to provide precise definitions, and possibly more, for the concepts characterizing a given application domain. Such an infrastructure is represented by a domain ontology, which can be constructed and made available to the interested community by means of specific software systems. In this paper we present the experience carried out within the European project FETISH, aimed at developing an interoperability infrastructure for small and medium European enterprises that operate in the tourism sector. A key element of the FETISH architecture is OntoTour [1], a shared ontology for the tourism domain. Constructing an ontology is a challenging task that involves several issues. One is the symbolic ontology management system, which allows users to manage (i.e., define, update, retrieve) domain concepts. To this end, in FETISH, the SymOntos system has been developed. Another key issue is the task of identifying, defining, and entering the concept definitions. In case


of a large and complex application domain this task can be lengthy, costly, and controversial, since different persons may have different points of view about the same concept. To reduce time, cost (and, sometimes, harsh discussions) it is highly advisable to refer, in constructing or updating an ontology, to the documents available in the field. Text mining tools may be of great help in this task. The work presented in this paper illustrates the guidelines of SymOntos and a text mining approach aimed at the extraction of prominent domain concepts from the related literature and the detection of semantic relations among them. In the rest of this section we briefly introduce the main issues concerning ontologies. Section 2 briefly describes the ontology management system SymOntos, while Section 3 presents the proposal concerning text mining for ontology extraction. Related work is reported in Section 4, followed by the conclusions in Section 5.

1.1. What is an Ontology?

As anticipated, the goal of an ontology is to reduce (or eliminate) conceptual and terminological confusion. This is achieved by identifying and properly defining a set of relevant concepts that characterize a given application domain. The construction of a shared understanding, i.e., a unifying conceptual framework, fosters:

• Communication and cooperation among people;
• Better enterprise organization;
• Interoperability among systems;
• System engineering benefits (reusability, reliability, and specification).

An ontology is a shared understanding of some domain of interest [2]. It contains a set of concepts (e.g. entities, attributes, and processes), together with their definitions and their inter-relationships; this is also referred to as a conceptualisation. In other words, an ontology is an explicit, agreed specification of a shared conceptualisation. Ontologies may have different degrees of formality but, necessarily, they include a vocabulary of terms with their meaning (definitions) and their relationships. According to [3], an ontology is a domain vocabulary containing a set of precise definitions, or axioms, that: (i) provide the meaning of the terms, (ii) enable a consistent interpretation of the terms defined in the vocabulary. The construction of an ontology requires a thorough domain analysis that is accomplished by [3]:

– Examining the vocabulary that is used to describe the relevant objects and processes of the domain,
– Developing rigorous definitions of the terms (concepts) in that vocabulary,
– Characterizing the conceptual relations among those terms.

Sometimes an ontology is confused with a thesaurus. With respect to the latter, an ontology aims at describing concepts, whereas a thesaurus aims at describing terms. An ontology can be seen as an enriched thesaurus where, besides the definitions and relationships among terms of a given domain, more conceptual knowledge is represented by means of richer semantic relationships. With respect to a Knowledge Base (KB), an ontology can be seen as a preliminary stage of a KB, whose goal is the description of the concepts necessary for talking about a given domain. A KB, in addition, includes the knowledge needed to model and elaborate a problem, derive new knowledge, prove theorems, or answer intentional queries about the domain.

Figure 1. An intuitive account of conceptualization levels. (Labels in the figure: Vocabulary, Thesaurus, Ontology, Knowledge Base.)

1.2. Reducing the Cost of Ontology Construction

Though the utility of domain ontologies is now widely acknowledged in the IT (Information Technology) community, several barriers must be overcome before ontologies become practical and useful tools. We envisage three main areas where innovative computational solutions could significantly reduce the cost and effort of ontology construction:

– To provide effective support for collaborative development of consensus in the conceptualization of a given domain, since consensus is the first condition to be met in order to obtain the desired benefits from an ontology
– To enable distributed development of and access to ontologies, since widespread usage of a resource outweighs the cost of development
– To develop tools that automatically identify the prominent concepts and enrich the terms of the ontology with semantic information, thus reducing the cost and complexity of manually defining a large number of concepts belonging to a complex domain.

In this paper, we describe SymOntos, a symbolic ontology management system developed at LEKS (Lab for Enterprise Knowledge and Systems), IASI-CNR [4]. In designing SymOntos, we have been working to define innovative solutions concerning the three critical issues listed above. These solutions are currently being tested in the context of the European project FETISH, aimed at the definition of an interoperability platform for Small and Medium Enterprises (SME) in the tourism sector. Though we will briefly describe all the main features of SymOntos, this paper is concerned with the third issue, that is, the description of text mining methods and tools for automatic ontology construction. In the FETISH Project, we decided to explore the possibility of supporting the extraction of shareable knowledge from on-line textual documentation available on the Web. In almost any domain and aspect of social life, knowledge is primarily transmitted through documents [5]: contracts, agreements, announcements, journals, and newspapers. Though this knowledge is not readily available for processing by computers, in the past few years the emerging technique of Information Extraction has proved mature enough to provide benefits in several practical applications. Information Extraction (IE) is a relatively new technology [6] aiming at extracting (structured) fact descriptions from textual information in electronic form.
Information Extraction significantly differs from the more “mature” field of Information Retrieval, in that IE aims to extract specific knowledge from documents. This task is rather more complex, since there are many ways to express the same fact in natural language, and information may be spread across different sentences. For the purpose of FETISH, some of the IE technologies jointly developed at the Universities of Ancona, of Roma “La Sapienza”, and of Roma “Tor Vergata”, have been made available. This allowed us to set up an


initial activity, consisting of the automatic analysis of Web sites on tourism, parsing the texts found therein. In our project, we used available and newly developed Natural Language Processing (NLP) techniques to:

– identify linguistic patterns (such as terminology and proper names) representing the lexicalized appearance of the relevant domain concepts
– identify semantic relations among such concepts, thus helping to automate the process of concept definition


In the following, we will first describe the SymOntos concepts, metamodel, and formal foundations. We will briefly mention the interoperability and consensus solutions adopted in the SymOntos project, providing references to more detailed descriptions. In the second part of the paper, we will describe a text mining technique aimed at supporting domain ontology development. Experiments in the tourism domain produced encouraging results, though experimental results on the impact of such techniques on ontology development and use are not yet available. As remarked in [7], “ontology development and use technology will succeed when it becomes commonplace for people in a broad spectrum of communities to build and use ontologies routinely”, but on the other hand only significant progress in automatic ontology construction will allow such success indicators to emerge. We hope that the results of our project will be a first step towards widespread usage and cost-effective ontology development.


2. SymOntos: A Symbolic Ontology Management System

The purpose of this Section is to summarize the features of the SymOntos ontology management system. Though this is not the main focus of the paper, a few details are necessary to highlight the impact that text mining techniques may have on ontology construction. SymOntos supports the construction of an ontology following the OPAL (Object, Process, Actor modeling Language) methodology. OPAL [8] is a methodology for the modeling and management of an Enterprise Knowledge Base and, in particular, it allows one to start with a preliminary form of KB, represented by an ontology. Below, the main features of a SymOntos ontology are presented, starting from the notion of a concept.

2.1. SymOntos Concepts

As already mentioned, an ontology gathers a set of concepts that are considered relevant to a given domain. Therefore, in SymOntos the construction of an ontology is performed by defining a set of inter-related concepts. In essence, in SymOntos a concept is characterized by:

– a term, that labels the concept,
– a description, explaining the meaning of the concept, generally in natural language,
– a set of relationships with other concepts.

Concept relationships play a key role since they allow concepts to be inter-linked according to their semantics. The set of concepts, together with their links, forms a semantic network [9]. In a semantically rich ontology, both concepts and semantic relationships are categorized. In SymOntos, concepts are categorized according to the OPAL methodology, by associating with each concept a kind (also referred to as meta-concept). Below, the six primary kinds of OPAL are considered:

Actor: an active entity of the domain that is able to activate or perform processes (e.g., Customer or Travel Agency);
Object: a passive entity on which a process operates (e.g., Hotel);
Process: an activity aimed at the satisfaction of an actor’s goal (e.g., Hotel Room Purchasing or Flight Booking);
Information Component: a cluster of information representing relevant aggregated properties of an Actor or an Object (e.g., Customer Contact Information);
Information Element: an atomic information element that is part of an Information Component (e.g., Customer email);
Elementary Action: an activity that represents a process component that is not further decomposable (e.g., Printing customer bill).

Semantic relationships are distinguished according to three categories, namely Broader Terms, Similar Words, and Related Terms, which are described below. Please note that we introduce relations between terms, and these terms are labels of concepts. There is a tight connection between the linguistic and conceptual dimensions, but this elaboration falls outside the scope of this paper. Here we will use “term” and “word”, where the former denotes a concept label, while this is not necessarily true for the latter (which is therefore a more general notion). Therefore, we will use “term” and “concept” interchangeably, when no confusion may arise.

The Broader Terms relationship allows a set of concepts to be organized according to a generalization hierarchy (corresponding in the literature to the well known ISA hierarchy). In such a hierarchy, a broader concept is a generalization of the concept being defined. For instance, within the tourism domain, the Accommodation concept is a generalization of the Hotel concept, and a person is a generalization of a tourist, or a travel agent. This relationship is defined between concepts of the ontology that, furthermore, must be of the same kind.

With the Similar Words relationship, a set of concepts that are similar to the concept being defined are given, each of which is annotated with a similarity degree. Such a degree is a real number not less than 0.4 (a threshold under which the similarity is considered meaningless), and less than or equal to 1.0 (in the case of a similarity degree equal to 1.0, the concepts are equivalent, and the denoting words are synonyms). For instance, the term Hotel can have as similar terms Agro-Tourism and Farm Houses, with similarity degree 0.5, or Guest Farm, with similarity degree 0.6. Analogously to the previous relationship, similar terms must identify concepts of the same kind. However, the Similar Words relationship can also be established with concepts that are not defined in the ontology (therefore words, rather than terms, denote these undefined concepts).

Finally, the Related Terms relationship allows the definition of a set of concepts that are semantically related to the concept being defined. Related concepts may be of different kinds, but they must be defined in the ontology. For instance, TravelAgency, Customer, or CreditCard are concepts that are semantically related to the Hotel concept.

2.2. The SymOntos Metamodel

In SymOntos a concept is defined according to the metamodel of the OPAL methodology. As anticipated, the construction of a concept is primarily based on the identification of its name, definition, and the set of relationships with other concepts. In more formal terms, a concept c is defined by a 7-tuple:

c = (n, i, d, k, B, S, R),

where:

n: the name of the concept, i.e. the term that represents the label of the concept defined in the expression;
i: the identifier of the concept (also referred to as cid: concept identifier), i.e., a short string that uniquely identifies the concept;
d: the description of the meaning of the concept in natural language;
k: the kind of the concept (i.e., Actor, Object, Process, Information Component, Information Element, or Elementary Action);
B: the set of Broader terms, denoting generalizations of the concept holder of the Form;
S: the set of Similar words, with the related similarity degree (a real number greater than or equal to 0.4, and less than or equal to 1.0), denoting similar concepts or words;
R: the set of Related terms, denoting related concepts.

Figure 2. Example of semantic relationships. (For the concept Hotel. Broader Terms: Accommodation. Similar Words: Agro-Tourism (0.5), Farm Houses (0.5), Guest Farm (0.6). Related Terms: Travel Agency, Customer, Credit Card.)

2.3. SymOntos: Formal Foundations

An ontology is a set of related concepts. Besides the concept manipulation functions (see Subsection 2.3.2), SymOntos guarantees the high quality of the stored ontology. The primary issue for quality is correctness. In particular, a SymOntos ontology is correct iff the relations induced from Broader Terms, Similar Words, and Related Terms (that, in formal terms, are the Broader, Similar, and Related relations, respectively) fulfill some properties. In this section such properties are briefly illustrated.

2.3.1. Correct Ontology. Below we start by formally defining the Broader, Similar, and Related relations. In the following, given an ontology, let T be the set of terms of the ontology, P the set of words used in the natural language, and [0..1] the interval of real numbers including the extremes 0 and 1. Furthermore, let a Similarity be a triple ⟨A, B, D⟩ of the set P × P × [0..1]. In particular, we will say that a similarity is legal if and only if A ≠ B, and 1.0 ≥ D ≥ 0.4. Then, the Broader, Similar and Related relations are defined as follows:

• Broader is a binary relation defined on T × P and, in particular, a pair ⟨A, B⟩ ∈ Broader if and only if in the ontology there exists a concept where: A is the concept name (term) and B is one of its Broader Terms;
• Similar is defined on T × P × [0..1] and, in particular, a Similarity ⟨A, B, D⟩ ∈ Similar if and only if in the ontology there exists a concept where: A is the concept name (term), B is one of its Similar Words and D is the similarity degree associated with B in the definition;
• Related is defined on T × P and, in particular, a pair ⟨A, B⟩ ∈ Related if and only if in the ontology there exists a concept where: A is the concept name (term) and B is one of its Related Terms.
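The 7-tuple and the three induced relations can be pictured with a small data structure. The following is only a minimal sketch and not the SymOntos implementation: the class and function names (`Kind`, `Concept`, `is_legal`, `induced_relations`) are hypothetical, legality is read here as requiring two distinct words and a degree between 0.4 and 1.0, and the sample values are taken from the Hotel example of Figures 2 and 3.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    """The six primary OPAL kinds (meta-concepts)."""
    ACTOR = "Actor"
    OBJECT = "Object"
    PROCESS = "Process"
    INFORMATION_COMPONENT = "Information Component"
    INFORMATION_ELEMENT = "Information Element"
    ELEMENTARY_ACTION = "Elementary Action"


MIN_DEGREE = 0.4  # threshold under which a similarity is considered meaningless


@dataclass
class Concept:
    """Hypothetical container for the 7-tuple c = (n, i, d, k, B, S, R)."""
    name: str                                     # n: term labelling the concept
    cid: str                                      # i: short unique identifier
    description: str                              # d: natural-language meaning
    kind: Kind                                    # k: OPAL kind
    broader: set = field(default_factory=set)     # B: Broader terms
    similar: dict = field(default_factory=dict)   # S: similar word -> degree
    related: set = field(default_factory=set)     # R: Related terms


def is_legal(a, b, degree):
    """Assumed reading of legality: distinct words and 0.4 <= D <= 1.0."""
    return a != b and MIN_DEGREE <= degree <= 1.0


def induced_relations(concepts):
    """Build the Broader, Similar and Related relations from concept forms."""
    broader = {(c.name, b) for c in concepts for b in c.broader}
    similar = {(c.name, w, d) for c in concepts for w, d in c.similar.items()}
    related = {(c.name, r) for c in concepts for r in c.related}
    return broader, similar, related


hotel = Concept(
    name="Hotel",
    cid="htl",
    description="A place where a tourist can stay",
    kind=Kind.OBJECT,
    broader={"Accommodation"},
    similar={"Agro-Tourism": 0.5, "Farm Houses": 0.5, "Guest Farm": 0.6},
    related={"Travel Agency", "Customer", "Credit Card"},
)

broader, similar, related = induced_relations([hotel])
```

Storing the similar words as a mapping from word to degree rules out, by construction, two similarities that differ only in the degree, which is one of the correctness conditions stated for the Similar relation.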

Furthermore, let TBroader and SimilarTerm be two further relations defined as follows:

• TBroader (TransitiveBroader) is the transitive closure of the Broader relation;
• SimilarTerm is defined on T × T × [0..1] and, in particular, a Similarity ⟨A, B, D⟩ ∈ SimilarTerm if and only if in the ontology there exists a concept where: A is the name (term), B is one of its Similar Words that is also a term, i.e., the name of a concept in the ontology, and D is the similarity degree associated with B in the Form.

Of course, Broader is contained in TBroader, and SimilarTerm is contained in Similar. Now we are ready to introduce the notion of a correct ontology:

Definition (Correct ontology). An OPAL ontology is correct iff the following conditions hold:

– for each concept one and only one Name, Code, and Kind must be specified;
– two different concepts must have different Names and Codes;
– in each concept the Kind must be one of the OPAL meta-concepts;

Furthermore, regarding Broader Terms:

– Broader is defined on T × T (i.e., Broader is defined on terms);
– if ⟨A, B⟩ ∈ Broader, then A and B are of the same Kind;
– TBroader is anti-reflexive;

Regarding Similar Words:

– if ⟨A, B, D⟩ ∈ SimilarTerm, then A and B are of the same Kind;
– if ⟨A, B, D⟩ ∈ SimilarTerm, then D ≤ 1.0;
– each Similarity in Similar is legal;
– two Similarities that differ only for the degree do not belong to Similar;
– SimilarTerm is symmetric;

Regarding Related Terms:

– Related is defined on T × T (i.e., Related is defined on terms);
– Related is symmetric.

2.3.2. SymOntos Functions. After the definition of a correct ontology, below we briefly illustrate the main functions performed by SymOntos in building an ontology.

1. Concept Management. SymOntos presents an interface for the definition of concepts based on form filling. A similar form-based interface allows the knowledge engineers to update and retrieve stored concepts.

2. Verification. SymOntos checks the correctness of the ontology; if it is not correct, it reports the problem and suggests possible solutions for achieving correctness (for instance, if SimilarTerm is not symmetric, the system suggests and computes the Symmetric Similarity Closure of the SimilarTerm relation).

3. Ontology Closure. SymOntos derives two further relations, namely DRelated and DSimilar. The former is obtained by applying the inheritance mechanism (well known in the Object-Oriented field [10] and, more generally, in knowledge representation [11]) so that a given concept will increment its (declared) related terms with the (inherited) related terms of its Broader concepts. More precisely:

DRelated (DerivedRelated) is defined on T × T. In particular, ⟨A, B⟩ ∈ DRelated if and only if:

– ⟨A, B⟩ is not in the Related relation, and there exists a term C such that:
– ⟨A, C⟩ ∈ TBroader
– ⟨C, B⟩ ∈ Related

i.e., in other words, the terms related with the generalizations of a given concept are related to the latter concept.

DSimilar (DerivedSimilar) is the Transitive Similarity Closure of SimilarTerm. It is defined on T × T × [0..1]. In particular, ⟨A, B, D⟩ ∈ DSimilar if and only if the following conditions are fulfilled:

– ⟨A, B, D⟩ belongs to SimilarTerm, or
– there exist ⟨A, C, D1⟩ and ⟨C, B, D2⟩ in SimilarTerm and D = D1 ∗ D2, or
– ⟨A, B, D⟩ is obtained generalising the above mechanism to n steps.

Furthermore:

– there are no triples ⟨A, B, D′⟩ ∈ SimilarTerm, and
– ⟨A, B, D⟩ is legal.
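The derived relations can be sketched as plain set computations. The code below is an illustration under stated assumptions rather than the SymOntos algorithm: all function names are hypothetical, a declared similarity is assumed to take precedence over any derived one for the same pair, the maximum degree is kept when several chains lead to the same pair, and the sample similarity degrees are invented for the example.

```python
MIN_DEGREE = 0.4  # legality threshold on similarity degrees


def transitive_closure(pairs):
    """TBroader: transitive closure of a set of (a, b) pairs."""
    closure = set(pairs)
    while True:
        new_pairs = {(a, d) for a, b in closure for c, d in closure if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs


def is_anti_reflexive(pairs):
    """Correctness condition on TBroader: no term generalizes itself."""
    return all(a != b for a, b in pairs)


def derived_related(broader, related):
    """DRelated: related terms inherited from the (transitive) Broader terms."""
    tbroader = transitive_closure(broader)
    return {
        (a, b)
        for a, c in tbroader
        for c2, b in related
        if c == c2 and (a, b) not in related
    }


def derived_similar(similar_term):
    """DSimilar: chain similarities, multiplying degrees along each chain."""
    best = {(a, b): d for a, b, d in similar_term}
    declared = set(best)  # assumption: declared degrees are never overridden
    changed = True
    while changed:
        changed = False
        snapshot = list(best.items())
        for (a, c), d1 in snapshot:
            for (c2, b), d2 in snapshot:
                if c != c2 or a == b or (a, b) in declared:
                    continue
                degree = d1 * d2
                if degree > best.get((a, b), 0.0):  # keep the maximum degree
                    best[(a, b)] = degree
                    changed = True
    # Only legal similarities (degree >= 0.4) are kept.
    return {(a, b, d) for (a, b), d in best.items() if d >= MIN_DEGREE}


# Hypothetical sample data for illustration only.
broader = {("Motel", "Hotel"), ("Hotel", "Accommodation")}
related = {("Accommodation", "Reservation"), ("Reservation", "Accommodation"),
           ("Hotel", "Customer"), ("Customer", "Hotel")}
declared_similar = [("Hotel", "Motel", 0.8), ("Motel", "Guest House", 0.7),
                    ("Guest House", "Camping", 0.5)]
# SimilarTerm must be symmetric, so each declared triple is mirrored.
similar_term = ({(a, b, d) for a, b, d in declared_similar}
                | {(b, a, d) for a, b, d in declared_similar})

d_related = derived_related(broader, related)
d_similar = derived_similar(similar_term)
```

With these sample values, Hotel inherits Reservation from Accommodation, and Hotel becomes similar to Guest House through Motel with degree 0.8 ∗ 0.7, while the chain from Hotel to Camping falls below 0.4 and is discarded.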

In other words, the terms that are transitively similar to a given concept are similar to the concept too, subject to their legality (essentially, the similarity degree must be ≥ 0.4). Finally, in the case of multiple similarity degrees found by transitivity, the maximum one is chosen.

4. Interface. There is a web-based user interface allowing a geographically distributed community of users to access an ontology, and extract concepts and definitions. Furthermore, since an ontology tightly mirrors an application domain that constantly evolves, the ontology must evolve accordingly. To support a group of heterogeneous people working together in constructing and maintaining an ontology, the Consys system has been conceived [12]. It is a group decision-making system associated with SymOntos, aimed at supporting ontology construction and management.

2.4. Constructing an Ontology in the Tourism Domain

An objective of the FETISH European project was the construction of an ontology, named OntoTour, that stores knowledge regarding the actors, the processes, and the objects (both abstract and physical) that can be found in the Tourism domain. Figure 3 provides an example of a concept definition for “Hotel”, constructed by using SymOntos. Please note that the example exhibits a richer knowledge model than the simplified OPAL version illustrated in Section 2.2 (e.g., it includes the Part-Of conceptual relation).

Hotel
XML tag:
cid: htl
Def: A place where a tourist can stay
Gen: Accommodation
Spec: Country_Guest_house, motel
Part-of: receptivity_system
Has-part: fitness_facilities, restaurant, garage
Related-objects: Reservation, payment, deposit
Related-actors: Htl-manager, cashier, room_service
Related-processes: reserving, paying, billing, airport_transfer
Similar-concepts: B&B[0.6], camping[0.4], holiday_apartment[0.7]
(all reported terms, except in the Def, correspond to concepts in the ontology)

Figure 3. The Hotel concept in OntoTour.

A comprehensive methodology for developing ontologies includes the following topics:

– Ontology capture
– Ontology coding
– Integrating existing ontologies

that will be illustrated hereafter:

Ontology Capture. By ontology capture we mean the process that a group of domain experts accomplish in order to find an agreement on the:

– identification of the key concepts and relationships in the domain of interest;
– production of precise unambiguous textual definitions for such concepts and relationships;
– identification of terms to refer to such concepts and relationships;
– identification of further terms expressing the same concepts (synonyms and similar terms).

Ontology Coding. By coding, we mean an explicit representation of the conceptualization captured in the previous stage, by using a formal language. In particular, coding involves:

– committing to the categories that will be used to specify the ontology (e.g., actor, object, . . .); as anticipated, these categories are often referred to as meta-concepts;
– choosing a representation language to encode the ontology;
– writing the specification of the ontology according to the selected language.

Integrating Existing Ontologies. During the capture and coding phases, there is the question of whether and how to use part of other ontologies that already exist. In general, this is a very difficult problem that has been investigated in the literature [13]. In order to allow different ontologies to be shared among multiple user communities, much work has to be done. As far as the correctness of the semantic model is concerned, [14] presents an interesting solution to allow a formal inclusion of an ontology A into an ontology B, based on concept renaming and axiom inclusion. However, there are other issues involved with ontology extension and integration that are more difficult to model formally. The identification of synonyms concerning existing concepts, and ontology extensions with concepts that are not present, are easy to do, whereas similarities among different concepts are quite difficult to deal with.

The three tasks of ontology capture, coding, and integration are rather complex, controversial, and time consuming. Documents related to a given domain may provide an objective reference for the ontology engineers, provided that automatic tools are available to extract the prominent information in some structured form. This is discussed in the next Sections.

3. Text Mining Techniques to Reduce the Cost of Ontology Construction

In Section 2 we illustrated the main features of the SymOntos system, and provided an example of concept definition in the Tourism domain. A detailed analysis of the conceptual model and the methodology to build the ontology was necessary in order to provide a better understanding of the impact that text mining and natural language processing techniques may have on the process of ontology construction. The techniques described in this Section are intended to significantly improve human productivity in the three phases of ontology capture, ontology coding, and ontology integration. As clarified throughout this Section, the contribution of our work has been to integrate into an Ontology Management System, SymOntos, several Natural Language Processing techniques, in part already available, in part developed under the pressure, and stimuli, of the FETISH project. In the final Section we also remark that this research is not concluded, since we plan to extend the number and type of ontological information that may be automatically extracted from texts to support ontology building.

3.1. NLP Tools Used for Text Mining

In order to help the construction of the ontology, we used an NLP processor oriented to corpus processing and extraction of linguistic knowledge named ARIOSTO [15], whose performance has been improved with the addition of new features developed in part during a recently terminated European project on Information Extraction called ECRAN, in part for the purpose of the FETISH project. In the following, we will refer to this enhanced release of the system as ARIOSTO+. In its English version (used in this work) ARIOSTO+ has the following modules:

– a morphologic analyzer, that recognizes the lemma of single words and its morphologic features (e.g. noun, verb, number, tense, . . .),
– a post-morphology analyzer, recognizing word strings like time, numerical and monetary expressions, comparative expressions, compound verbs, etc. (e.g. December 23rd, most amazing, has been enhanced),
– a part-of-speech (POS) tagger that, on the basis of machine learned contextual rules,1 eliminates morphological ambiguity (e.g. verb or noun, as in cut and paste and a sharp cut),
– a gazetteer look-up (e.g. a look-up to a large Proper Names dictionary),
– a Named Entity2 (NE) recognizer based on contextual rules learned using decision lists and probability calculus [17, 18],
– a Chunk3 Parser called CHAOS [19] guided by a dictionary of verb argument structures.

ARIOSTO was initially designed and evaluated on economic and financial domains in English and Italian (Wall Street Journal, Sole24 Ore). Adaptation to the Tourism domain required learning more fine-grained rules (Note 4) to detect names of locations, obviously pervasive in a Tourism domain, and an extension of the post-morphology grammar to capture certain frequently found structures such as phone numbers and addresses. Figure 4 provides an example of the final output (simplified for the sake of readability) of ARIOSTO+ on a tourism text.

The CHAOS parser first identifies simple constituents, like noun phrases and prepositional phrases. This output is bracketed in Fig. 4. It reads as follows:

[n1,t1,[string]]

where n1 indicates the position of [string] in the text, and t1 is the syntactic type of [string] (noun phrase, prepositional phrase, verb, ...). Secondly, more complex syntactic constituents are generated (the link() predicates in Fig. 4). The output of this phase reads as follows:

link(n1,n2,t,plaus(value))

where n1 and n2 identify the strings to be related, t is the type of syntactic relation detected by the parser, and plaus(value) is a probabilistic estimate of syntactic correctness for the generated link [19], called plausibility. A lexicon of expected verb syntactic relations is used to guide the detection of complex constituents, like subject-verb (V_Sog), verb-object (V_Obj), and prepositional phrase attachments (NP_PP). The verb lexicon provides the expected syntactic relations for the 5000 most frequent English verbs. For example, the lexical entry:


pattern(travel, [[ ],[(to,‘Post’)],[(over,‘Post’), (during, ‘Post’. . . )

says that the verb to travel has the following expected post-modifiers (Post): travel to NounPhrase, travel over NP, travel during NP, etc. Whenever the lexicon does not provide the necessary information, a plausibility measure is computed for the generated links. The plausibility is simply computed as 1/n, where n is the number of colliding interpretations in a syntactically ambiguous structure (for example, in Fig. 4 the prepositional phrase 5, "of beautiful Texas Country", has three possible alternative attachments, so each receives a plausibility of 1/3). Since CHAOS is a partial parser, it overgenerates syntactic links but, as discussed in the next sections, subsequent methods based on probability calculus may help to reliably identify several systematic, and therefore interesting, linguistic patterns.

3.2. Ontology Capture: Text Mining Tools to Identify Key Concepts

The first phase of the ontology building process consists in the identification of the key concepts of the application domain, categorized according to the OPAL metamodel as Actor, Object, and Process. In this Section we describe the tools that we adapted and/or developed to help Ontology capture. Though concepts do not always have a corresponding word in natural language, such a correspondence can often be drawn between the less general concept nodes and the domain-specific words and complex nominals, like:

• Domain Named Entities (e.g., gulf of Mexico, Texas Country, Texas Wildlife Association)
• Domain-specific complex nominals (e.g., travel agent, reservation list, historic site, preservation area)
• Domain-specific singleton words (e.g., hotel, reservation, trail, campground)

We denote these singleton and multiword strings as Terminology. Terminology is the set of words or word strings that convey a single, possibly complex, meaning within a given community. In a sense, Terminology is the surface appearance, in texts, of the domain knowledge in a given domain. Because of their low ambiguity and high specificity, these words are also particularly useful to conceptualize a knowledge domain.

The Colorado River Trail follows the Colorado River across 600 miles of beautiful Texas Country - from the pecan orchards of San Saba to the Gulf of Mexico .

[ 1 , Nom , [The,Colorado_River_Trail] ]
[ 2 , VerFin , [follows] ]
[ 3 , Nom , [the,Colorado_River] ]
[ 4 , Prep , [across,600_miles] ]
[ 5 , Prep , [of,beautiful,Texas_Country] ]
[ 6 , ? , [-] ]
[ 7 , Prep , [from,the,pecan,orchards] ]
[ 8 , ? , [of] ]
[ 9 , ? , [San] ]
[ 10 , ? , [Saba] ]
[ 11 , Prep , [to,the,Gulf_of_Mexico] ]
[ 12 , punt , [.]]
link(0,2,'Sentence').
link(2,1,'V_Sog', plaus(1.0)).
link(2,3,'V_Obj', plaus(1.0)).
link(3,4,'NP_PP',plaus(0.5)).
link(2,4,'V_PP',plaus(0.5)).
link(4,5,'PP_PP',plaus(0.3333333333333333)).
link(3,5,'NP_PP',plaus(0.3333333333333333)).
link(2,5,'V_PP',plaus(0.3333333333333333)).
link(5,7,'PP_PP',plaus(0.25)).
link(4,7,'PP_PP',plaus(0.25)).
link(3,7,'NP_PP',plaus(0.25)).
link(2,7,'V_PP',plaus(0.25)).
link(7,11,'PP_PP',plaus(0.2)).
link(5,11,'PP_PP',plaus(0.2)).
link(4,11,'PP_PP',plaus(0.2)).
link(3,11,'NP_PP',plaus(0.2)).
link(2,11,'V_PP',plaus(0.2)).
(…more follows…)

Figure 4. An example of parsed tourism text.
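The plausibility values attached to the link() predicates in Fig. 4 follow the 1/n rule described for the CHAOS parser. The following Python fragment is only an illustrative sketch, not the CHAOS implementation: it assumes that the "colliding interpretations" are the competing heads proposed for the same dependent chunk, which is consistent with the values in Fig. 4. The function name and the tuple layout are hypothetical.

```python
from collections import defaultdict

def assign_plausibility(candidate_links):
    """Score each candidate attachment as 1/n, where n is the number of
    competing heads proposed for the same dependent chunk (assumption)."""
    heads_by_dependent = defaultdict(list)
    for head, dependent, relation in candidate_links:
        heads_by_dependent[dependent].append((head, relation))

    scored = []
    for dependent, heads in heads_by_dependent.items():
        plausibility = 1.0 / len(heads)
        for head, relation in heads:
            scored.append((head, dependent, relation, plausibility))
    return scored

# Candidate attachments for chunks 4 and 5 of Fig. 4: (head, dependent, type).
candidates = [
    (3, 4, "NP_PP"), (2, 4, "V_PP"),
    (4, 5, "PP_PP"), (3, 5, "NP_PP"), (2, 5, "V_PP"),
]

for head, dependent, relation, plausibility in assign_plausibility(candidates):
    print(f"link({head},{dependent},'{relation}',plaus({plausibility}))")
```

Links licensed by the verb lexicon (such as V_Sog and V_Obj in Fig. 4) are not covered by this sketch; in the figure they carry a plausibility of 1.0.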

We now describe how the different types of Terminology are captured using NLP techniques.

3.2.1. Detection of Named Entities. Proper names are the instances of domain concepts, therefore they populate the leaves of the ontology. Proper names are pervasive in texts. In the Tourism domain, as in most domains, Named Entities (NE) represent more than 20% of the total occurring words. To detect NE, we used a module already available in ARIOSTO+, which we extended to capture the variety of location names found in a Tourism domain (historical and scenic sites, hotel names, etc.). A detailed description of the method summarized hereafter may be found in [17] and [18]. In ARIOSTO+, NE are detected and semantically tagged according to three main conceptual categories: locations (objects in OPAL), organizations and persons (actors in OPAL).
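The two-phase scheme described in this subsection (a gazetteer look-up, followed by contextual rules keyed on trigger words) can be illustrated with a toy Python sketch. It is not the ARIOSTO+ recognizer: the miniature gazetteer, the rule table and the function names are hypothetical, and the only rule modelled is "a proper name followed by a trigger word takes the semantic class of that trigger".

```python
# Hypothetical miniature gazetteer: word -> tag (tag names echo the paper's example).
GAZETTEER = {
    "Colorado": "proper noun province",
    "River": "loc key",
    "Trail": "loc key",
    "Authority": "org base key",
}

# Hypothetical contextual rules: tag of the closing trigger word -> semantic class.
TRIGGER_RULES = {
    "org base key": "organization",
    "loc key": "location",
}

UNKNOWN = "#u#"  # same "unknown" marker used in the paper's output format

def gazetteer_lookup(words):
    """Phase one: tag each word, leaving unlisted words as unknown."""
    return [(word, GAZETTEER.get(word, UNKNOWN)) for word in words]

def classify(words):
    """Phase two: aggregate the string and classify it by its last word.
    Returns (joined_name, semantic_class); the class is None when no rule
    fires, i.e. the decision is left to a later stage. Expects a non-empty list."""
    tagged = gazetteer_lookup(words)
    _, last_tag = tagged[-1]
    return "_".join(words), TRIGGER_RULES.get(last_tag)

print(classify(["Colorado", "River", "Authority"]))
print(classify(["San", "Saba"]))
```

In this toy setting "San Saba" receives no class because neither word is listed, which mirrors the unrecognized "San" and "Saba" chunks of Fig. 4.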


NE recognition implies, first, assigning the morphologic tag "Proper Name" to single and multiple strings of (partly) capitalized words and, second, assigning a semantic class (or "tag") to the entire string. In phase one, proper names are morphologically tagged as such by Brill's Part Of Speech (POS) tagger [16]. Then, a large dictionary of common Proper Names is used to recognize (some of the) constituents of a complex Named Entity. This phase is called "gazetteer look up". The dictionary includes common person names and family names, geographic locations, organizations, and a list of so-called "trigger words", i.e., semantic indicators such as Inc., Mr., gulf, association, lake, trail, etc. For example, after POS tagging, the words in "Colorado River Authority" are individually tagged as follows:

am (8,[142,'Colorado'],'#u#','#u#','#u#').
am (9,[143,'River'],river,'common noun',singular).
am (10,[144,'Authority'],authority,'common noun',singular).

where '#u#' means "unknown" and numbers identify the words within a sentence and within a document. After the gazetteer look-up, the output is:

am (8,[142,'Colorado'],'Colorado','proper noun province',invariable). (∗)
am (9,[143,'River'],river,'common noun',singular).
am (9,[143,'River'],'River','loc key',invariable). (∗)
am (10,[144,'Authority'],authority,'common noun',singular).
am (10,[144,'Authority'],'Authority','org base key',invariable). (∗)

Phase two is called NE identification or recognition. The purpose of this phase is to aggregate and semantically classify complex constituents composed of known and/or unknown proper names. In the example above, we wish to aggregate the full string Colorado River Authority and assign to this string the semantic class organization. Named Entity recognition is based on a set of contextual rules (e.g. "a complex or simple proper name followed by the trigger word authority is an organization named entity"). In the example above, the items marked with "∗" trigger the appropriate rules within the NE recognizer, and the final output is:

am (8,[[142,143,144],'Colorado River Authority'],'Colorado River Authority','proper noun organization',invariable).

Rules are manually entered or machine learned using decision lists [18]. If a complex nominal does not match any contextual rule in the NE rule base, the decision is delayed until syntactic parsing. A classification based on syntactically augmented word associations is later attempted [17]. In Fig. 4 of the previous Subsection, multiword Named Entities detected by the NE recognizer are linked with a "_" sign. The NE recognizer also provides a semantic tag for each NE, e.g. location, person, organization, as shown in the "Colorado" example, but not in Fig. 4 for the sake of space. When contextual cues are sufficiently strong (e.g. "lake Tahoe is located ..."), names of locations are further sub-categorized (city, bank, hotel, geographic location, ...), so that the ontology Engineer is provided with semantic cues to correctly place the instance under the appropriate concept node of the ontology. Notice that in Fig. 4, for example, "San Saba" is not recognized as a unique multiword proper name, since "San" was not included in the list of trigger words. However, the method described in [18] is also used to automatically enrich the proper names dictionary (Note 5), thus leading to increasingly better coverage as new texts are analyzed. As reported in the mentioned papers, the F-measure (combined recall and precision with a weight factor α = 0.5) of this method is consistently (i.e. with different experimental settings) around 89%, a performance that compares very well with other NE recognizers described in the literature (Note 6).

3.2.2. Detection of Domain-Specific Words and Complex Nominals. NE are word strings that are partly or totally capitalized, and they often appear in well characterized contexts. Therefore, the task of NE recognition is relatively well assessed in the literature. Other, not-named terminological patterns (which we will hereafter again refer to as "terminology", though in principle terminology also includes NEs) are more difficult to capture. In the context of the FETISH project, we developed a new method to identify lists of domain-relevant terms and structure these lists in partial hierarchical order. This method is described in this Section and the following.


Current approaches to the detection of terminological candidates can be classified into knowledge-intensive and statistical methods. The first group of contributions relies mostly on a detailed definition of the expected surface appearance, in texts, of terminological patterns [20], or on external resources like existing terminological databases [21]. The second group, irrespective of the relations and properties of word patterns, mainly uses their frequency distribution (e.g. [22]) to select the actual domain terminology. The method proposed in [22] first applies "shallow" linguistic filters, based on part-of-speech (POS) tags, in order to detect "eligible" terminological patterns. Then, statistical association criteria are used to extract "true" terminology from these patterns. Mere frequency counts emerge as the best statistical indicators, on the basis of experimental data. However, since simple POS information is used to verify linguistic specifications, rather than a syntactic parser, the extracted "eligible" terminological patterns include many errors.

In [23], singleton words are first identified that, according to their distributional properties, appear to be specific to a given domain (represented by a subset of documents within a generic archive of documents). A terminology grammar (as in [22]) is then used to detect syntactically valid terminological patterns including these domain words. Finally, a statistical measure of relatedness, the Mutual Information, is applied to extract complex terminological expressions. The Mutual Information measure [24] is estimated (for a two-word list $w_i, w_j$) as:

$$E(M(w_i, w_j)) = \log_2 \frac{W \cdot \mathrm{freq}(w_i \wedge w_j)}{\mathrm{freq}(w_i)\,\mathrm{freq}(w_j)}$$

where E(x) is the estimate of x. An interface is provided to validate the acquired terminological database. One of the major sources of noise in this process is the identification of domain-relevant singleton terms. One commonly used indicator of domain-specific terms is the inverse document frequency $idf_i$ of a word i (also used in [23] with some modification):

$$idf_i = \log_2 \frac{N}{df_i}$$

where $df_i$ is the number of documents that include the word i, and N is the total number of documents in the collection. The idea underlying this measure is to capture words that are frequent in a subset of documents representing a given domain, but are relatively rare in a collection N of generic documents. This measure also captures words that appear just once in a domain, which is in principle correct, but is also a major source of noise. Another problem is that the Mutual Information often does not provide a reliable measure of relatedness, as discussed later.

In the following we describe the terminology extraction method adopted in FETISH. Let L be a learning corpus in the domain of interest $D_i$ and let N be a larger balanced corpus including several corpora in domains $D_1, D_2, \ldots, D_n$, N ⊃ L. The proposed approach requires the following steps:

1. Run ARIOSTO+ over the learning corpus L.
2. Collect all the complex nominals extracted by the CHAOS parser (the first type of output in Fig. 4, e.g. [from, the, pecan, orchards]) whose syntactic structure matches the syntactic specifications for terminology.
3. Remove articles and initial prepositions (if any), order the multiword lists, and compute the frequency of all the sublists of two or more words (Note 7). For example, the chunk [of, Louisiana, 's, country, music] generates the following strings:

   [Louisiana, country, music]
   [Louisiana, country]
   [country, music]

4. For each string of two or more words, compute over the balanced corpus N a statistical measure of domain relevance DR and domain consensus DC, described later. Select those strings for which DR and DC are above given thresholds α and β, and build an initial terminology list T of complex nominals $t_i$.
5. Let V be the list of all singleton words appearing in T as syntactic heads of a string $t_i$. Compute over the balanced corpus N the inverse document frequency for all words in V, and select those words whose idf is above a given threshold δ. Add these singleton terms to the list T.
6. Structure the final list T in subtrees, based on the inclusion relations of syntactic links.

In order to find a good statistical filter for the patterns detected in step 3, we performed several experiments. At first we ordered the detected strings $t_i$ according to plain frequency count, to the Mutual Information measure, and to the Dice factor, another commonly used measure of relatedness defined as:

$$Dice(w_1, w_2) = \frac{2\,\mathrm{freq}(w_1, w_2)}{\mathrm{freq}(w_1) + \mathrm{freq}(w_2)}$$

As in [22], we found that mere frequency counts carry the most relevant information. Both the Mutual Information and the Dice factor have an undesired property, of which we give only an intuitive description. In both formulas, the denominator includes the isolated frequency of each word in the string $t_i$. If one of the words is particularly frequent, both measures tend to be low. This causes a problem whenever certain very prominent domain words appear in many terminological patterns. For example, in the tourism domain the word "reservation" participates in many terminological patterns (like reservation list, airline reservation, on-line reservation, etc.). Because of the high frequency of the word "reservation" in isolation, MI and Dice assign a low association strength to the above patterns. On the other hand, plain frequency counts include at least two types of undesired patterns: frequent non-terminological strings (like "new year", "first time", "long line", etc.) and non-domain-specific terminology (like "world wide web", "laser disc", etc.).
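The effect described above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: all frequency counts below are invented for illustration, and the scaling factor of the Mutual Information estimate is taken to be the corpus size in words, which is a common convention but is assumed here rather than stated in the text.

```python
import math

def mutual_information(f_pair, f_w1, f_w2, corpus_size):
    """Estimated MI of a two-word string; corpus_size plays the role of the
    scaling factor in the estimate (an assumption of this sketch)."""
    return math.log2(corpus_size * f_pair / (f_w1 * f_w2))

def dice(f_pair, f_w1, f_w2):
    """Dice factor of a two-word string."""
    return 2 * f_pair / (f_w1 + f_w2)

CORPUS_SIZE = 200_000  # invented figure, for illustration only

# (pair frequency, freq of first word, freq of second word): invented counts.
frequent_term = (40, 2000, 100)  # e.g. a pattern containing a very common head word
rare_string = (3, 4, 5)          # two words that almost always occur together

mi_frequent = mutual_information(*frequent_term, CORPUS_SIZE)
mi_rare = mutual_information(*rare_string, CORPUS_SIZE)

print(f"frequent pattern: freq=40  MI={mi_frequent:.2f}  Dice={dice(*frequent_term):.3f}")
print(f"rare string:      freq=3   MI={mi_rare:.2f}  Dice={dice(*rare_string):.3f}")
```

With these counts both MI and Dice rank the rare string above the frequent pattern, even though the latter occurs more than ten times as often; a plain frequency ranking gives the opposite order.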

3.2.2.1. Two Probabilistic Measures of Term Relevance. As observed above, high frequency in a corpus is a property observable for terminological as well as non-terminological expressions (e.g. "last week" or "real time"). We measure the specificity of a terminological candidate with respect to the target domain via a comparative analysis across different domains. Given a set of n conceptual domains and related corpora $(D_1, \ldots, D_n)$, the domain relevance of a term t is computed as:

$$DR(t, D_i) = \frac{P(t \mid D_i)}{\sum_{i=1 \ldots n} P(t \mid D_i)} \qquad (1)$$

where the conditional probabilities $P(t \mid D_i)$ are estimated as:

$$E(P(t \mid D_i)) = \frac{\mathrm{freq}(t \in D_i)}{\sum_{i=1 \ldots n} \mathrm{freq}(t \in D_i)}$$

This measure is used to detect terms that are frequent in the domain of interest but are rare, or absent, in other domains.

Domain terminology should reflect concepts whose meaning is agreed upon by large user communities in a given domain. We defined a second probabilistic measure that provides an estimate of this "agreement". The underlying idea is that "true" domain terms (e.g. travel agent) are referred to frequently throughout the documents of a domain, while there are certain specific terms with a high frequency within single documents but completely absent in others (e.g. petrol station, foreign income). Domain consensus measures the distributed use of a term in a domain $D_i$. The distribution of a term t in documents $d_j$ can be taken as a stochastic variable estimated throughout all $d_j \in D_i$. The entropy H of this distribution expresses the degree of consensus of t in $D_i$. More precisely, the domain consensus is expressed as follows:

$$DC(t, D_i) = H(P(t, d_j)) = \sum_{d_j \in D_i} P(t, d_j) \log_2 \frac{1}{P(t, d_j)} \qquad (2)$$

where:

$$E(P(t, d_j)) = \frac{\mathrm{freq}(t \in d_j)}{\sum_{d_j \in D_i} \mathrm{freq}(t \in d_j)}$$
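Measures (1) and (2) can be sketched in Python as follows. This is an illustrative reading of the formulas rather than the FETISH implementation: the function names, the dictionary-based inputs and all frequency counts are hypothetical, and the threshold values are used only as example settings.

```python
import math

def domain_relevance(freq_by_domain, target):
    """Measure (1): share of the term's estimated probability mass that
    falls in the target domain. freq_by_domain maps domain -> frequency."""
    total = sum(freq_by_domain.values())
    if total == 0:
        return 0.0
    probabilities = {d: f / total for d, f in freq_by_domain.items()}
    return probabilities[target] / sum(probabilities.values())

def domain_consensus(freq_by_document):
    """Measure (2): entropy (base 2) of the term's distribution over the
    documents of one domain. freq_by_document maps document -> frequency."""
    total = sum(freq_by_document.values())
    if total == 0:
        return 0.0
    probabilities = [f / total for f in freq_by_document.values() if f > 0]
    return sum(p * math.log2(1.0 / p) for p in probabilities)

def keep_term(freq_by_domain, freq_by_document, target, alpha, beta):
    """Keep a candidate only if both measures reach their thresholds."""
    return (domain_relevance(freq_by_domain, target) >= alpha
            and domain_consensus(freq_by_document) >= beta)

# Invented counts: one term spread over four documents, one confined to a single document.
spread_term = ({"tourism": 8, "economy": 0}, {"d1": 2, "d2": 2, "d3": 2, "d4": 2})
single_doc_term = ({"tourism": 6, "economy": 0}, {"d1": 6})

for name, (by_domain, by_document) in [("spread", spread_term), ("single-doc", single_doc_term)]:
    print(name,
          domain_relevance(by_domain, "tourism"),
          domain_consensus(by_document),
          keep_term(by_domain, by_document, "tourism", alpha=0.35, beta=0.23))
```

A term that occurs in a single document has zero entropy, so it is rejected by the consensus filter even when its domain relevance is 1.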

Pruning non-terminological (or non-domain-specific) candidate terms is performed using a combination of the measures (1) and (2). We experimented with several combinations of these two measures, with similar results. The results discussed in the next Section have been obtained by applying a threshold α to the set of terms ranked according to (1) and then eliminating the candidates with a rank (2) lower than β.

3.2.2.2. Detecting Taxonomic Relations Among Terms. The final result of the process outlined above is a flat list of terms. However, terms may be further structured in sub-trees, thus facilitating the subsequent linking of the sub-trees to the appropriate node of the Domain Ontology. Following [25] and [26], we extract taxonomic (vertical) relations starting from the syntactic head of multiword terms (Fig. 5). In [26] an algorithm is presented to attach sub-trees to WordNet [27] nodes. In our project, the top-level nodes are not related to WordNet (at least at the current stage of the project no such decision has been made), therefore placing a sub-tree under the appropriate node is performed manually by the ontology Engineer using the SymOntos interface. However, structuring terms in sub-trees significantly reduces manual work, because only term heads must be linked to the ontology. The amount of reduction is evaluated in the next Section, dedicated to performance evaluation.

Figure 5. Sub-tree for the head card.

3.3. Experimental Analysis

As remarked at the beginning of this section, terminology and complex proper names are not found in Dictionaries. Therefore an obvious problem of any automatic method for concept extraction is to provide an objective performance evaluation. There are three possible ways of formally evaluating a terminology:

1. The first is to use the extracted terms within an NL application (for example, document classification) and measure the performance of the application with and without the component. However, such an evaluation strategy may not produce clear-cut results, especially when the influence of the component on the overall system performance is not predominant.
2. The second method is to use some existing thesaurus as a "gold standard", and to measure the precision and recall of the method at automatically extracting the terms included in the available thesaurus. This approach is sufficiently assessed for Named Entities, since large gazetteers of proper names do exist. For example, our method for NE extraction is carefully evaluated in [+3] using a relatively large reference gazetteer for Persons, Organisations and Locations. Evaluation of not-named terminology is far more problematic, since no method would detect terms that are absent or appear rarely in the corpus used for term extraction. Moreover, the notion of "term" is too vague to consider available terminological databases as "closed" sets, unless the domain is extremely specific.
3. The third method is manual inspection by a team of experts. The notion of Named Entity is more precise, therefore manual judgement of extracted names is a relatively reliable approach, but as far as not-named terms are concerned, reaching consensus about the introduction of a new concept is more problematic.

In a recent paper [28] we adopted the second approach to compare the precision and recall of our term extraction formula against other measures, such as the Dice factor, the Mutual Information and the frequency count. We used the Wall Street Journal corpus to extract terms, and the Washington Post (Note 8) (WP) dictionary of economic and financial terms to measure the accuracy of the results. In the paper we show that our model outperforms the other methods though, due to the problems outlined in point 2 above, we reach a (balanced Recall and Precision) F-measure of only 30% in the best experiment. Manual evaluation resulted in a precision value of 87.5%.

In the FETISH project we could not rely on an assessed terminology, since the production of a Tourism ontology is one of the objectives of the project. Therefore we used the third approach. Manual evaluation has been performed by the participants in the project, but in the near future we plan to use the ConSys system to ensure consensual decisions [12]. ConSys is a group decision-making system oriented to domain ontology construction and management, associated with SymOntos.

To manually evaluate our method, we first collected corpora in several domains: a collection of Tourism texts (descriptions of tourist sites extracted from the WWW), economic prose (Wall Street Journal), medical news (Reuters), sport news (Reuters), a balanced corpus (Brown Corpus) and four novels by G. Wells. Overall, the corpora amount to about 3.2 million words. The domains are rather different, so that contrastive analysis empowers the filtering capability of the method. The Tourism corpus was manually built using the WWW and currently has only about 200,000 words, but it is rapidly growing.

Table 1 summarises our results. Table 1 shows that only 2% of the terms are extracted from the initial list of candidates. This extremely high filtering rate is due to the small corpus: many candidates are found just once in the corpus.
However, candidates are extracted with high precision (over 85%). This result is in line with the experiments on the Wall Street Journal described in [28]. We may conclude that the performance of our technique does not depend upon the more or less specific sublanguage, though it is sensitive, as any statistical method, to the amount of available evidence, i.e. the corpus size. Table 2 shows the 15 most highly rated multiword terms, ordered by consensus (relevance is 1 for all the terms in the list). Table 3 illustrates the effectiveness of Domain Consensus at pruning irrelevant terms: all the candidate terms in the list have DR > α, but DC < β.

Table 1. Summary results for the term extraction task.

No. of candidate multiword terms (after parsing)        14,383
No. of extracted terms (with α = 0.35 and β = 0.23)     288
% correct (3 human judges)                              85.42%
Number of subtrees (of which with depth > 0)            177 (54)

Table 2. The 15 most highly ranked multiword terms.

Term                   Domain consensus
credit card            0.846913
tourist information    0.696701
travel agent           0.686668
swimming pool          0.664041
service charge         0.640951
car rental             0.635580
credit card number     0.616671
card number            0.616671
room rate              0.596764
information centre     0.579662
beach hotel            0.571898
tourist area           0.565462
tour operator          0.543419
standard room          0.539450
video camera           0.523142

Table 3. Terms with high domain relevance and low domain consensus.

Term                    Domain relevance    Domain consensus
english cyclist         1.000000            0.000000
manual work             1.000000            0.000000
petrol station          1.000000            0.000000
school diploma          1.000000            0.000000
western movie           1.000000            0.000000
white cloud             1.000000            0.000000
false statement         0.621369            0.000000
best price              0.612948            0.224244
council decision        0.612948            0.000000
foreign income          0.441907            0.000000
gay community           0.441907            0.224244
mortgage interest       0.441907            0.000000
substantial discount    0.441907            0.224244
typical day             0.441907            0.224244

Table 4. Most highly populated sub-trees in the tourism domain.

Subtree root    No. of different multiword terms
hotel           34
service         21
travel          17
passport        14
tour            14
visa            14
rate            13
office          12
certificate     11
card            10
fee             10
booklet         10

Table 1 also shows that sub-tree induction reduces the task of term classification by about 40% (177 heads over 288 terms). Table 4 provides the list of sub-tree roots occurring in at least 10 different multiword terms.
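The grouping of terms into sub-trees, and the resulting reduction of manual work, can be illustrated with a short Python sketch. It makes a simplifying assumption that is not part of the method described here: the syntactic head of an English complex nominal is approximated by its last word, and a term is attached to its longest proper suffix that is itself a term (or, failing that, to its head). The function names are hypothetical; the sample terms are taken from Table 2.

```python
from collections import defaultdict

def parent_of(term, vocabulary):
    """Return the longest proper suffix of `term` that is a known term, or the
    bare head word if no such suffix exists. Singleton words have no parent."""
    words = term.split()
    for start in range(1, len(words)):
        suffix = " ".join(words[start:])
        if suffix in vocabulary or start == len(words) - 1:
            return suffix
    return None

def build_subtrees(terms):
    """Map each parent node to the list of terms directly below it."""
    vocabulary = set(terms)
    children = defaultdict(list)
    for term in terms:
        parent = parent_of(term, vocabulary)
        if parent is not None:
            children[parent].append(term)
    return dict(children)

terms = ["credit card", "credit card number", "card number",
         "room rate", "beach hotel", "standard room"]
subtrees = build_subtrees(terms)
heads = {term.split()[-1] for term in terms}

for parent, kids in subtrees.items():
    print(parent, "->", kids)

# Reduction reported in Table 1: only 177 heads must be placed instead of 288 terms.
print(f"reduction: {1 - 177 / 288:.1%}")
```

Only the heads (here card, number, rate, hotel and room) would need to be linked manually to the ontology; the remaining terms follow their head.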

3.4. Ontology Coding: Text Mining Tools to Identify Relatedness and Similarity

The second step in ontology construction is Ontology coding. According to the SymOntos conceptual schema, a definition has a structural section (the left-hand side of Fig. 3), describing taxonomic and part-of relations, and a relational section (the right-hand side of Fig. 3), describing its relations with other domain concepts. Formal relations such as hyponymy and hyperonymy and constitutive relations such as part-of can hardly be extracted from corpora (on-line Dictionaries are more useful for this task, see [29]). On the contrary, relatedness and similarity can be detected using text-mining techniques. The automatic acquisition of relatedness and similarity relations from text is a very recent objective of the FETISH project. In this section we briefly present some preliminary results and ideas.

According to the definition of Objects, Actors and Processes provided in Section 2.1, conceptual triples of the Actor-Process-Object kind have a lexical realization in texts captured by syntactic triples of the Subject-Verb-Object (SVO) form, where either the subject, the verb or the object has a conceptual correspondent in the ontology. Other syntactic structures are also considered, for example N_PP, as in providers of reservation systems, and compounds, like hotel front-desk. At the present state of the FETISH project, a preliminary visualization built within the SymOntos system provides the ontology Engineer with all the detected syntactic patterns including (at least one) element of the list T described in the previous subsection. For example, for the term reservation system, the following syntactic patterns are found:

V_Obj_PP     Fix(P)  prices(O)  through  reservation_system

Sog_V_Obj    reservation_system   allow(P)      changes(P)
             you                  use(P)        reservation_system
             carrier(A)           has           reservation_system
             reservation_system   contains(P)   information(O)

N_PP         providers(A)  of  reservation_system

V_Obj        install(P)  reservation_system

V_clause     reservation_system  allow(P)  you  to_check(P)

Figure 6. Related processes (P), actors (A) and objects (O) for reservation system.

In Fig. 6, the detected syntactic patterns are grouped by type (e.g. V_Obj = Verb Object). Suggested related Actors (A), Objects (O) and Processes (P) are shown in bold. Semantic tagging is performed using the on-line lexical taxonomy WordNet [27]. We use a "naive" heuristic to automatically tag actors, objects and processes: Actors are nouns with the first WordNet (Note 9) sense in the class person or social group. Every noun with the first sense under the category act, event or process is a Process. Every other noun is an Object. Every verb (except generic verbs like be, make, etc.) is a Process. A very simple method is also used to prune some of the extracted syntactic patterns: we use the plausibility value mentioned in Section 2 to delete patterns with plaus < 0.5. The general idea here is that recall is more important than precision, i.e., it is preferable to provide the ontology Engineer with all the detected information and let him prune/adjust erroneous cues. This seems reasonable, especially because the CHAOS system is relatively high performing [19].
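The tagging heuristic and the plausibility filter can be summarized in a small Python sketch. It is illustrative only: instead of querying WordNet, it uses a hypothetical hand-made table of first-sense classes, and the list of generic verbs (including "have", which is left untagged in Fig. 6) as well as the function names are assumptions of this sketch.

```python
# Hypothetical stand-in for a WordNet first-sense lookup: lemma -> top-level class.
FIRST_SENSE_CLASS = {
    "provider": "person",
    "carrier": "social group",
    "change": "event",
    "price": "attribute",
    "information": "communication",
}

ACTOR_CLASSES = {"person", "social group"}
PROCESS_CLASSES = {"act", "event", "process"}
GENERIC_VERBS = {"be", "make", "have"}  # assumed list; the text names "be, make, etc."

def tag(lemma, pos):
    """Return 'A' (Actor), 'P' (Process), 'O' (Object), or None for generic verbs."""
    if pos == "verb":
        return None if lemma in GENERIC_VERBS else "P"
    sense_class = FIRST_SENSE_CLASS.get(lemma)
    if sense_class in ACTOR_CLASSES:
        return "A"
    if sense_class in PROCESS_CLASSES:
        return "P"
    return "O"  # every other noun is an Object

def prune(links, threshold=0.5):
    """Drop syntactic links whose plausibility is below the threshold.
    Each link is a (head, dependent, relation, plausibility) tuple."""
    return [link for link in links if link[3] >= threshold]

links = [(2, 3, "V_Obj", 1.0), (3, 4, "NP_PP", 0.5), (4, 5, "PP_PP", 1 / 3)]
print([tag("provider", "noun"), tag("install", "verb"), tag("price", "noun")])
print(prune(links))
```

Keeping links with a plausibility of exactly 0.5 follows the "plaus < 0.5" deletion rule stated above.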

4. Related Research

Text mining techniques to enrich an ontology have been explored only rather recently in the literature. A workshop on Ontology Learning, held in conjunction with ECAI 2000, collects some papers in this area (the hyperlink to the workshop is reported in the bibliographic references [30] and [31]). In [31] a method is presented to extract from the WWW conceptual information useful to solve two important problems of (Euro)-Wordnet: the absence of topical links, and the large ambiguity. These two issues are specific to a highly lexicalized ontology like WordNet. Domain ontologies are topical by definition, and far less ambiguous. As we remarked, the main problem is to identify topical concepts and to provide, for these concepts, definitions agreed upon within the corresponding community. In [30] a modification of the Minimum Description Length approach is proposed to learn word associations from texts. Associations are used to enrich a lexical semantic net with selectional preferences (e.g., verb arguments like subject, object and modifiers). The method that we presented in this paper to detect relatedness links is stronger than simple word associations, since it is guided by a parser that, in turn, is based on a general-purpose lexicon of verb argument structures. In [32] a bibliography is provided of papers that propose techniques and methods relevant to ontology learning, though not specifically conceived for this task. Among the relevant techniques, the following are mentioned:

• Acquisition of selectional restrictions from texts
• Word Sense disambiguation
• Computation of concept lattices from texts

The third class of methods seems the most useful for enhancing SymOntos with text-based algorithms to detect similarity relations among terms. As mentioned in the next Section, we are currently exploring a methodology based on context similarity, which also makes it possible to automatically compute the similarity degree. Other literature specifically concerned with Named Entity and Terminology extraction has already been referenced in the appropriate Sections.

5. Conclusion and Future Work

The text mining techniques proposed in this paper are meant to increase the productivity of an ontology Engineer during the time-consuming task of populating a Domain ontology. The work presented in this paper is in part well assessed, in part still under development. We are designing new algorithms and techniques to widen the spectrum of information that can be extracted from texts and from other on-line resources, such as dictionaries and lexical taxonomies (like EuroWordnet, a multilingual version of Wordnet). A wider availability of EuroWordnet (Note 10), for example, would allow the implementation of multilinguality features in the ontology. Another on-going extension of this research is to detect similarity relations among concepts on the basis of contextual similarity. Similarity is one of the fields (see Fig. 3) in a concept definition form that are currently filled by humans.

While it seems feasible and useful to automatically enrich concept descriptions with relatedness and similarity relations, automating the process of taxonomic and constitutive definitions is rather more complex. As we already remarked, this type of definitory information may be found in on-line Dictionaries or Taxonomies. The problem is that, since domain terminology is poorly represented in on-line dictionaries and taxonomies, and since the most interesting concepts in the ontology correspond to terminology in texts, in practice no textual resources are available from which to automatically (or even manually) extract this definitory information. Furthermore, since the success of an ontology depends strongly on consensual definitions, a tool to ensure (human) validation of concept definitions, like ConSys [12], seems in this case more useful than pursuing automatic techniques.

One admittedly weak part of the research presented in this paper is evaluation: we could produce a numerical evaluation of certain specific subtasks (extraction of Named Entities and extraction of related concepts), but we did not evaluate the overall effect that our text mining tools produce on the ontology.
However, we are not aware of any assessed ontology evaluation methodology in the literature, besides [14], where an analysis of ontology Server users distribution and requests is presented. A better performance indicator would have been the number of users that access ontology Server on a regular basis, but the authors mention that regular users are only a small percentage.11 As remarked in Section 3.3, an objective evaluation of an ontology as a stand-alone artifact is not

feasible: the only possible success indicator is the (subjective) acceptance/rejection rate of the ontology Engineer when inspecting the automatically extracted information. An ontology can only be evaluated in a context in which many users of a community (e.g. Tourism operators in our application) access the ontology on a regular basis and use this shared knowledge to increase their ability to communicate, access prominent information and documents, improve collaboration. Though a field evaluation of OntoTour is foreseen during the last months of the project, we believe that wide accessibility and a long-lasting monitoring of user behaviors would provide the basis for a sound evaluation of the OntoTour system.
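
To make the contextual-similarity extension mentioned above more concrete, the following is a minimal sketch and not the method implemented in SymOntos or ARIOSTO. It assumes that each term is represented by the bag of words co-occurring with it within a fixed window, and that two terms are compared by the cosine of their context vectors. The function names (`context_vector`, `cosine`), the window size, and the toy sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def context_vector(term, sentences, window=2):
    """Count the words found within `window` tokens of each occurrence of `term`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == term:
                lo, hi = max(0, i - window), i + window + 1
                # Skip position i itself so the term is not its own context.
                counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy corpus, invented for illustration only.
corpus = [
    "the hotel offers a room with breakfast",
    "the hostel offers a room with breakfast",
    "the flight departs from the airport",
]

sim_close = cosine(context_vector("hotel", corpus), context_vector("hostel", corpus))
sim_far = cosine(context_vector("hotel", corpus), context_vector("flight", corpus))
```

In this toy setting, terms that occur in the same contexts ("hotel" and "hostel") score higher than terms that share only a function word ("hotel" and "flight"). A realistic version would work on lemmas and syntactic contexts rather than raw token windows, and the resulting candidates would still be submitted to the ontology Engineer for validation.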


Acknowledgments


We thank the AI-NLP group of the University of Roma “Tor Vergata” for allowing the integration of their CHAOS parser in the ARIOSTO system. In particular, Roberto Basili contributed to the definition of formal criteria (Velardi et al., 2001) to extract terminology.


Notes

1. The English version uses the freely available Brill [16] transformation-based POS tagger.
2. Named Entities in the Information Extraction jargon are complex proper names, like “Colorado River Trail”. NE recognition is the task of identifying and semantically tagging these complex constituents.
3. A “chunk” parser reduces parsing errors by first splitting a sentence into segments, called chunks. Note that CHAOS has been developed by the University of Tor Vergata and kindly made available for this research.
4. As described in [18], it is possible to reliably learn contextual rules for Named Entity (NE) recognition using machine learning methods such as decision lists. In practice, we use a combination of manual and machine learning, which turns out to allow for a very rapid updating of NE rules.
5. Newly detected complex nominals are added to the PN dictionary, or gazetteer, and the context in which the PN occurred is used to enrich the contextual model of that PN's semantic category (person, location, etc.). This is better described in [18].
6. Though experimental conditions are not fully comparable, evaluation of several NE recognizers is available on the web site of the 7th Message Understanding Conference ftp.muc.saic.com/proceedings/score reports index.html.
7. Note that comparison is performed on the basis of the lemmas, though morphologic information attached to words is not shown in Fig. 4.
8. http://www.washingtonpost.com/wp-srv/business/longterm/glossary/indexag.htm


9. In WordNet, word senses are ordered by probability, though this ordering is often questionable.
10. EuroWordNet is still not fully completed; in particular, it lacks domain terminology. A further problem is that this resource is rather expensive.
11. The system described by Farquhar and his colleagues, however, is not a specific Ontology, but a tool, Ontology Server, to help publishing, editing and browsing an Ontology.

References

1. M. Missikoff et al., “A tourism ontology for small and medium enterprises in European market,” LEKS, FETISH Project, Deliverable D1.1, IASI-CNR, Rome, 2000.
2. M. Uschold and M. Gruninger, “Ontologies: Principles, methods and applications,” The Knowledge Engineering Review, vol. 11, no. 2, 1996.
3. “IDEF5 ontology description capture method overview,” available at http://www.idef.com/overviews/idef5.htm.
4. “SymOntos, a symbolic ontology management system,” available at http://www.symontos.org.
5. P.-K. Halvorsen, “Document processing,” ch. 7, in Survey of the State of the Art in Human Language Technology, edited by R. Cole, 1995.
6. “Information extraction: A multidisciplinary approach to an emerging technology,” Lecture Notes in Artificial Intelligence 1299, edited by M.T. Pazienza, Springer: Heidelberg, 1997.
7. A. Farquhar, R. Fikes, W. Pratt, and J. Rice, “Collaborative ontology construction for information integration,” available at http://www-ksl-svc.stanford.edu:5915/doc/projectpapers.html.
8. M. Missikoff, “OPAL—A knowledge-based approach for the analysis of complex business system,” LEKS, IASI-CNR, Rome, 2000.
9. R.J. Brachman, “On the epistemological status of semantic networks,” in Associative Networks—Representation and Use of Knowledge by Computers, edited by N.V. Findler, Academic Press: New York, NY, 1979.
10. S. Khoshafian and R. Abnous, Object Orientation: Concepts, Languages, Databases, User Interfaces, John Wiley: New York, NY, 1990.
11. J.F. Sowa, Knowledge Representation—Logical, Philosophical, and Computational Foundations, Brooks/Cole, Thomson Learning, 2000.
12. M. Missikoff and X.F. Wang, “Consys—A group decision-making support system for collaborative ontology building,” in Proc. of Group Decision & Negotiation 2001 Conference, La Rochelle, France, 2001, pp.
13. D. Skuce, “Conventions for reaching agreement on shared ontologies,” in Proc. of the 9th Knowledge Acquisition for Knowledge Based Systems Workshop, 1995.
14. A. Farquhar, R. Fikes, W. Pratt, and J. Rice, “Collaborative ontology construction for information integration,” available at http://www-ksl-svc.stanford.edu:5915/doc/projectpapers.html.
15. R. Basili, M.T. Pazienza, and P. Velardi, “An empirical symbolic approach to natural language processing,” Artificial Intelligence, no. 85, pp. 59–99, 1996.
16. E. Brill, “A simple rule-based part-of-speech tagger,” in Proc. of Third Conf. on Applied Natural Language Processing-ANLP92, Trento, Italy, 1992.
17. A. Cucchiarelli, D. Luzi, and P. Velardi, “Semantic tagging of unknown proper nouns,” in Natural Language Engineering, December 1998.
18. A. Cucchiarelli, V. Karkaletsis, G. Paliouras, C. Spyropoulos, and P. Velardi, “Automatic adaptation of proper noun dictionaries through cooperation of machine learning and probabilistic methods,” in Proc. of 23rd Annual SIGIR, Athens, Greece, 2000.
19. R. Basili, M.T. Pazienza, and F. Zanzotto, “Customizable modular lexicalized parsing extraction,” in Proc. of Int. Workshop on Parsing Technology, Povo (Trento), Italy, February 2000.
20. C. Jacquemin, “Variation terminologique,” Memoire d’Habilitation Directeur des Recherches and Informatique Fondamentale, Université de Nantes, Nantes, France, 1997.
21. J. Klavans, “Text mining techniques for fully automatic glossary construction,” in Proc. of the HLT2001 Conference, San Diego, CA, 2001.
22. B. Daille, “Study and implementation of combined techniques for automatic extraction of terminology,” in Proc. of ACL94 Workshop—The Balancing Act: Combining Symbolic and Statistical Approaches to Language, New Mexico State University, Las Cruces, New Mexico, 1994, pp.
23. R. Basili, G. De Rossi, and M.T. Pazienza, “Inducing terminology for lexical acquisition,” in Proc. of the Second Conference on Empirical Methods in Natural Language Processing, Providence, USA, 1997, pp.
24. R. Fano, Transmission of Information, MIT Press: Cambridge, MA, 1961.
25. E. Morin and C. Jacquemin, “Projecting corpus-based semantic links on a Thesaurus,” in Proc. of 37th ACL Conference, 1999, pp.
26. P. Vossen, “Extending, trimming and fusing WordNet for technical documents,” in Proc. NAACL-2001 Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, 2001, pp.
27. A. Miller, “WordNet: An on-line lexical resource,” Special Issue of the Journal of Lexicography, vol. 3, no. 4, 1990.
28. P. Velardi, M. Missikoff, and R. Basili, “Identification of relevant terms to support the construction of domain ontologies,” in Proc. of ACL-01 Workshop on Human Language Technologies, Toulouse, France, 2001, pp.
29. Y. Wilks, B. Slator, and L. Guthrie, Electric Words: Dictionaries, Computers, and Meaning, MIT Press: Cambridge, MA, 1996.
30. A. Wagner, “Enriching a lexical semantic net with selectional preferences by means of statistical corpus analysis,” in Proc. of ECAI-2000 Workshop on Ontology Learning, available at http://ol2000.aifb.uni-karlsruhe.de/, Berlin, Germany, 2000.
31. E. Agirre, O. Ansa, E. Hovy, and D. Martinez, “Enriching very large ontologies using the WWW,” in Proc. of ECAI-2000 Workshop on Ontology Learning, available at http://ol2000.aifb.uni-karlsruhe.de/, Berlin, Germany, 2000.
32. A. Maedche and S. Staab, “Learning ontologies for the semantic web,” available at http://www.aifb.uni-karlsruhe.de/WBS/ama/publications.html.



Michele Missikoff, coordinator of LEKS (Laboratory for Enterprise Knowledge and Systems) at IASI-CNR (Rome, Italy), has long experience in knowledge representation and databases. In recent years he has focused on enterprise ontologies and, in particular, on their impact on system interoperability, especially semantic interoperability, for Small and Medium-sized Enterprises in the tourism sector. He has served on the Program Committees of primary international conferences in the field and on the editorial boards of important international journals. He is co-founder and past president of the EDBT Foundation, the international organisation that promotes the EDBT Conferences. He has participated in and led several international and national research projects. He has produced more than one hundred papers, half of them at the international level.

Paola Velardi received her Degree in Electrical Engineering from the University of Roma “La Sapienza” in 1978. She was a researcher with the Fondazione Ugo Bordoni and the IBM Scientific Centre in Roma. From 1986 to 1996 she was Associate Professor with the Istituto di Informatica at the University of Ancona. Since 1996 she has been with the Dipartimento di Scienze dell’Informazione at the University of Roma “La Sapienza”. Her main interests are in the areas of natural language processing, machine learning, lexical semantics and statistical language processing. She has participated in many international projects and is the author of over 80 international publications in these research areas.

Paolo Fabriani received his Degree in Computer Science from the University of Roma “La Sapienza” in 2000. Since then, he has cooperated with Prof. Paola Velardi and Dr. Michele Missikoff within the FETISH EU project.
