International Journal of Innovative Computing, Information and Control Volume 7, Number 11, November 2011

© ICIC International 2011, ISSN 1349-4198, pp. 6115–6133

ONTOLOGY POPULATION: AN APPLICATION FOR THE E-TOURISM DOMAIN

Juana María Ruiz-Martínez¹, José Antonio Miñarro-Giménez¹, Dagoberto Castellanos-Nieves², Francisco García-Sánchez³ and Rafael Valencia-García¹

¹Facultad de Informática
Universidad de Murcia
ES-30100, Murcia, Spain
{ jmruymar; jose.minyarro; valencia }@um.es

²Depto. de Ciencias de la Computación e Inteligencia Artificial
Universidad de Granada
E-18071, Granada, Spain
[email protected]

³Escola Tècnica Superior d'Enginyeria
Universitat de València
Avda. Dr Moliner 50, Burjassot 46100, Valencia, Spain
[email protected]

Received May 2010; revised December 2010

Abstract. The Semantic Web aims to extend the current Web standards and technologies so that the semantics of Web contents is machine processable. For the Semantic Web vision to become real, methods and mechanisms that assist in the creation of an initial pool of semantically described Web resources have been developed. However, these methods suffer from some problems such as lack of scalability and automation. Besides, most of them focus on English language resources and the cost of initial requirements is too high. In order to solve these drawbacks, this paper proposes a methodology for extracting semantic content from textual web documents to automatically instantiate a domain ontology (i.e., ontology population). In a first stage, the system obtains, through the GATE framework, a set of semantic annotations which are considered as ontology instance candidates. In a second stage, the semantic ambiguities are solved, and the annotations are related with their corresponding ontological entities. The methodology has been tested in a tourism domain corpus and the results of the validation process seem promising in terms of precision and recall.

Keywords: Ontology population, Natural language processing, Named entity recognition

1. Introduction. The information contained on Web pages was originally designed to be human-readable, and so, most of the knowledge currently available on the Web is kept in large collections of textual documents. As the Web grows in both size and complexity, there is an increasing need for automating some of the time-consuming tasks related to Web content processing and management. In 2001, T. Berners-Lee and his colleagues defined the Semantic Web as an extension of the current Web, in which information is given well-defined meaning, enabling computers and people to work better in cooperation [1]. The Semantic Web vision is based on the idea of explicitly providing the knowledge behind each Web resource in a manner that is machine processable. Ontologies [2] constitute the standard knowledge representation mechanism for the Semantic Web. The formal semantics underlying ontology languages enables the automatic processing of the


information in ontologies and allows the use of semantic reasoners to infer new knowledge. In this work, an ontology is seen as "a formal and explicit specification of a shared conceptualization" [2]. Ontologies provide a formal, structured knowledge representation, and have the advantage of being reusable and shareable. They also provide a common vocabulary for a domain and define, with different levels of formality, the meaning of the terms and the relations between them. Knowledge in ontologies is mainly formalized by using five kinds of components: classes, relations, functions, axioms and instances [3]. The Web Ontology Language (OWL) is the W3C standard for representing ontologies in the Semantic Web and, in this work, it has been used to represent the knowledge extracted from texts.

Ontologies are thus the key for the success of the Semantic Web vision. The use of ontologies can overcome the limitations of traditional natural language processing methods such as text classification [4]. They are also relevant to mechanisms related, for instance, to Information Retrieval [5, 6], Service Discovery [7], Question Answering [8], searching for contents and information [9], as well as crawling [10]. However, creating and populating ontologies manually is a very time-consuming and labor-intensive task. Several methodologies for ontology learning and ontology population have been created in order to assist in building ontologies. Yet, none of the current proposals is scalable enough to deal with the ontologization of the bulk of the Web content. This paper proposes a simple and scalable methodology for ontology population from textual resources based on lightweight NLP techniques and ontological engineering. The methodology has been implemented in the form of a software prototype and tested in the tourism domain. It is worth pointing out that, although the prototype has been customized to deal with tourism-related texts, the methodology remains domain independent.

The structure of the paper can be described as follows: Section 2 shows an overview of different approaches for Ontology Population; in Section 3, the methodology developed in this work is described; Section 4 explains the experiments conducted to evaluate our methodology; finally, some conclusions are put forward in Section 5.

2. Approaches for Ontology Population. Ontology Learning deals with the acquisition of new concepts and relations. As a result, the inner structure of the ontology is modified. The goal of Ontology Population is markedly different. Ontology Population has to do with the extraction and classification of instances of the concepts and relations that have been defined in the ontology. Instantiating ontologies with new knowledge is a relevant step towards the provision of valuable ontology-based knowledge services. In the last few years, a variety of approaches have been applied to Ontology Population from unstructured text. Many of them combine natural language processing techniques (such as linguistic pattern recognition and extraction, POS tagging and syntactic analysis) with other machine learning techniques. In a deliverable of the BOEMIE project [11], an analysis of the most prominent ontology population systems is provided, and different dimensions for comparing these ontology population approaches are established. Next, these dimensions are explained and our approach is compared against them.

2.1. Elements learned.
During the ontology population process, the elements learned can be instances of concepts, instances of relations [12] or both. Here, the focus is on those systems which, like the one proposed, extract instances of both concepts and relations [14, 15, 16, 17, 18, 19, 20, 21].

2.2. Initial requirements. Concerning the initial requirements, namely resources or background knowledge, most of the approaches make use of Named Entity Recognition


and Classification (NERC) modules. NERC is a subtask of information extraction that seeks to locate and classify the atomic elements within a text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values or percentages [13]. For example, in [14], the starting point is a training set of named entity (NE) instances for each class under consideration. SPRAT [15] considers the NEs identified by GATE as candidates for instances of the ontology. In [16], the system identifies the NE mentions and attaches them to concepts and relations already defined in the ontology. The methodology described in [17] is also based on NE substitution. In our proposal, the NEs identified in the text are considered as instance candidates of a predetermined ontology. Consequently, NERC has a major importance in the proposed framework, since the quality of this process leads to a more accurate and complete final ontology. Other systems also need an annotated training corpus. For example, the starting point of Ontosophie [18] is an XML annotated corpus where each entity is associated with the corresponding class in the ontology. In [14], a syntactically parsed corpus containing training entities is needed. [19] makes use of parsed glossary definitions and a set of manually defined linguistic patterns, and OntoPop [20] utilizes semantic annotations of texts. Finally, [21] requires an ontology with a root class and a few simple linguistic patterns in order to extract concept instances and taxonomic relationships from the web.

2.3. Learning approach. Some approaches use machine learning to populate the ontology. For example, Ontosophie [18] relies on a conceptual dictionary that generates extraction rules. These rules are then used to train the system. In [14], an unsupervised algorithm based on vector-feature similarity is employed. The algorithm is applied to a syntactically parsed corpus containing each training entity at least twice. For each occurrence in the corpus, syntactic features are obtained that are used to construct the feature vector. The new feature vector is then compared with the existing ones, and the new instance is inserted in the class with the most similar feature vector. [17], on the other hand, uses a supervised machine learning approach to instantiate a semi-populated ontology from the Web. Other approaches make use of manually constructed patterns as input, as in [15, 20, 21]. This is also the case of [19], where a set of rules defines regular expressions in order to annotate certain gloss fragments with the ontology properties (conceptual relations). The result is an annotated fragment where a pair of terms is associated by means of an ontology relation. These terms are considered the domain and range of the relationship. After a disambiguation process, the eligible terms are inserted in the ontology as individuals of the concept defining the annotated gloss. Finally, [16] describes a manually created benchmark for ontology population where NE mentions are assigned to concepts and relations already defined in the ontology. The methodology presented here is based on predefined patterns for detecting NEs and a heuristic algorithm for populating ontologies, so no machine learning approach is applied in this work.

2.4. Degree of automation. Another parameter to classify the various ontology population approaches is the degree of automation.
Some systems, such as [15, 16, 17, 19, 21], are unsupervised or weakly supervised [14], while others, such as [18, 20], need to be guided by an expert. The ontology population process proposed here is fully automatic.

2.5. Domain portability. Some systems have been tested on a specific-domain document collection. For example, the framework described in [19] has been tested in the domain


of cultural heritage and its portability requires new linguistic patterns to be developed in accordance with the domain and language. This is also the case of [15, 18]. Other systems extract generic NE types such as, for instance, persons or locations [14, 16, 17, 20]. Finally, OntoSyphon [21] is a domain-independent methodology. The system described here has been tested in the tourism domain, although it is quite portable to other domains.

2.6. Consistency maintenance. Few frameworks provide information about whether the consistency of the ontology is checked during or at the end of the process [14, 16, 17, 18, 19, 21]. In [20], a manual maintenance of the knowledge acquisition rules is required, and they do not use any reasoner to check the consistency of the ontology. Finally, in [15], the GATE plugin used to insert the instances into the ontology checks their consistency before insertion. The methodology proposed here includes a step to verify the consistency of the populated ontology by using OWL-DL reasoners such as Pellet or HermiT, as described in Section 3.4.

2.7. Entity disambiguation. Some systems perform disambiguation tasks during the ontology population process. For example, in [18], confidence values are assigned to the extracted entities and, in the case of ambiguity, they select the value with the highest confidence. Other systems, such as [17, 20], use context features to disambiguate. In [15], even though the system can detect and warn about possible ambiguities, the disambiguation process largely depends on the end user. In [14], ambiguous NEs are allowed within the training corpus. However, if ambiguous NEs are found during the system execution, they are not included in the ontology. Some disambiguation strategies based on the inclusion of more information during the search of instances on the Web are proposed in [21]. Finally, other systems apply disambiguation methods during the process, but not necessarily concerning NEs. For example, [19] applies a semantic disambiguation algorithm based on structural patterns to the annotated glosses, and [16] uses different co-reference measures to address the problem of mention disambiguation. The system described here performs an entity disambiguation process as described in Section 3.

2.8. Language dependency. Almost all the methods examined provide support only for English resources [15, 17, 18, 19, 21]. Nonetheless, the degree of language dependency varies according to the portability of their linguistic components. It is possible to distinguish between strongly language dependent systems (such as [15, 18, 19]) and weakly language dependent ones (like [16, 21]). Only [16, 20] take into consideration other languages, such as Italian and French, respectively. The system presented here has been tested with Spanish documents, but it would be easily portable to other languages merely by changing the initial language resource requirements.

3. Ontology Population Process. The ontology population process proposed here is based on previous work [22]. It comprises four sequential stages (see Figure 1): (i) the NLP and Corpus processing stage, (ii) the Named Entity recognition (NER) stage, (iii) the Ontology population stage and (iv) the Consistency checking stage.
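Before describing each stage in detail, the following minimal Python sketch illustrates how the four stages could be chained; the function names and toy data structures are illustrative assumptions and not part of the actual GATE-based prototype described below.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Annotation:
    """A named-entity annotation produced by the NER stage (illustrative)."""
    text: str
    label: str      # e.g., "Hotel", "Location", "Phone"
    position: int   # word offset in the document


@dataclass
class PopulatedOntology:
    """Toy stand-in for the OWL model being populated."""
    triples: List[Tuple[str, str, str]] = field(default_factory=list)


def nlp_processing(corpus: List[str]) -> List[List[str]]:
    """Stage (i): sentence detection, tokenization, POS tagging, lemmatization."""
    return [document.split() for document in corpus]       # placeholder tokenization


def named_entity_recognition(tokens: List[List[str]]) -> List[Annotation]:
    """Stage (ii): gazetteer lookup and JAPE-like pattern matching."""
    return []                                               # placeholder


def populate_ontology(annotations: List[Annotation]) -> PopulatedOntology:
    """Stage (iii): disambiguate the annotations and map them to ontology entities."""
    return PopulatedOntology()                              # placeholder


def check_consistency(onto: PopulatedOntology) -> bool:
    """Stage (iv): hand the populated model to an OWL-DL reasoner."""
    return True                                             # placeholder


def run_pipeline(corpus: List[str]) -> PopulatedOntology:
    onto = populate_ontology(named_entity_recognition(nlp_processing(corpus)))
    if not check_consistency(onto):
        raise ValueError("The populated ontology is inconsistent")
    return onto
```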
A detailed view of the ontology population system is depicted in Figure 1. In a nutshell, the system works as follows. First, the NLP and corpus processing module parses the corpus in order to extract the linguistic information. The aim of the NER phase is to gather NEs from the text: the more NEs are available, the more information can be gathered in the ontology population phase. During this second phase, the occurrences of the NEs identified in the text are disambiguated and the ontology is then populated. In this ontology population stage, each NE is identified as one or various individuals of


one or more concepts in the ontology, and the values of their attributes and relationships are identified. If the entity has not been previously recognized, the system populates the ontology with the information extracted, creating new individuals. Otherwise, the pre-existing entity is enriched by adding the new attributes/relationships found that belong to it. Finally, the consistency of the populated ontology is checked by using an OWL-DL reasoner.

Figure 1. Overview of the system

3.1. NLP and corpus processing phase. The main objective of this phase is to obtain the morphologic and syntactic structure of each sentence in the corpus. A set of NLP tools including a sentence detection component, a tokenizer, POS taggers, lemmatizers and syntactic parsers has been developed using the GATE framework [23]. GATE is an infrastructure for developing and deploying software components that process human language. GATE helps scientists and developers in three ways: (i) by specifying an architecture, or organizational structure, for language processing software; (ii) by providing a framework, or class library, that implements the architecture and can be used to embed language processing capabilities in diverse applications; (iii) by providing a development environment built on top of the framework, made up of convenient graphical tools for developing components. In particular, a Freeling POS-tagger [24] plug-in has been developed and integrated into GATE. Freeling is an open-source language analysis tool suite that provides language analysis services such as morphological analysis, PoS tagging and syntactic analysis. In this phase, the grammar category of each word in the sentence is identified, tokens are lemmatized and, at the same time, a syntactic analysis is performed.
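As a rough illustration of what this phase produces, an analogous token/lemma/POS analysis for Spanish can be obtained with an off-the-shelf pipeline such as spaCy; this is only a hedged stand-in for the GATE + Freeling components actually used in the prototype, and the model name es_core_news_sm is an assumption about the environment.

```python
import spacy

# Illustrative substitute for the GATE + Freeling pipeline: spaCy's Spanish model
# also provides tokenization, lemmatization and POS tagging (assumed installed).
nlp = spacy.load("es_core_news_sm")

doc = nlp("El hotel se encuentra a dos minutos de la estación y cerca del parque.")
for token in doc:
    # surface form, lemma and coarse-grained part-of-speech tag
    print(f"{token.text:12} {token.lemma_:12} {token.pos_}")
```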


3.2. Named entity recognition phase. During this second stage, the NE candidates are identified by making use of GATE. NER is a subtask of information extraction that seeks to locate and classify atomic elements within the text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values or percentages. Given that the corpora used for testing purposes are written in Spanish, it has been necessary to build specific resources to deal with them.

A particular domain or subject matter is characterized by a specialized vocabulary, semantic relations and syntax [25]. Thus, an exhaustive knowledge of the texts from which the ontology instances are to be inferred, and of their context, is necessary to facilitate the development of language resources that are tailored to the needs of that sublanguage. As shown in Figure 1, two main components are needed in the NER phase: a Gazetteer and a JAPE Transducer. The output produced by each component of GATE is a set of annotations, that is, metadata associated with a particular section of the document content.

The role of the gazetteer is to identify named entities in the text based on candidate lists. Thus, the system obtains annotations for every word that appears in the gazetteer lists of entities that are relevant in the domain in question. Several lists with a variable degree of generality have been created. Examples of general lists created are the following: locations, zip codes, careers, first names, surnames and address identifiers (e.g., "street", "avenue" and "square"). Restaurant facilities, meals or occupational categories are examples of more specific lists.

The JAPE transducer is a module for executing JAPE grammars. JAPE is a rich and flexible regular-expression-based rule mechanism offered by the GATE framework [25]. Hence, a set of JAPE rules to obtain occurrences of zip codes, telephone, fax or mobile numbers, URLs, emails, addresses, restaurants, person names or money references has been implemented. In Table 1, an example of a simple JAPE rule for identifying money references is shown.

Table 1. An example of a simple JAPE rule

    Rule: money
    (
      ({Token.string == "$"}
       ({Token.kind == number}
        (({Token.string == ","} | {Token.string == "."})
         {Token.kind == number})?
       )
      )
    ):number
    -->
    :number.Money = {kind = "money", rule = "Money"}

Each annotation obtained from the JAPE transducer is considered as a NE. All the occurrences of the identified NEs in the text are candidates to be instances or values of the attributes of an instance in the ontology. For example, a NE representing a hotel will be considered as a candidate instance of the class Hotel, and the NEs that represent emails or phone numbers will be considered as candidate values of instance attributes of the ontology (e.g., the phone number of a hotel).
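For readers unfamiliar with JAPE, the money rule of Table 1 is roughly equivalent to an ordinary regular expression applied to the raw text instead of to Token annotations; the following Python sketch is an approximation of that rule, not the JAPE implementation itself.

```python
import re

# Rough counterpart of the JAPE "money" rule of Table 1: a dollar sign, a number
# and, optionally, a "," or "." followed by another number.
MONEY_PATTERN = re.compile(r"\$\d+(?:[.,]\d+)?")

text = "Rooms from $120, suites from $1,500 per night."
for match in MONEY_PATTERN.finditer(text):
    # each match plays the role of a Money annotation (kind = "money", rule = "Money")
    print(match.group(), match.span())
```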


In Figure 2, an example in which the NE annotations and their classification are highlighted is shown. The system classifies the annotations into groups (i.e., Activity, Hotel, Location). Each group can have one or more NE annotations, and some of these annotations can be ambiguous, that is, the same fragment of text can be annotated in two different groups. For example, in Figure 2, it is possible to observe that there are two kinds of annotations (Hotel and Location) attached to the first sentence. This kind of ambiguity is resolved in the next phase, where the system figures out whether they match some instance, relationship or property in the ontology.

3.3. Ontology population phase. Given the groups of NE annotations, during this phase the system determines whether they are instances, attributes or relationships in the ontology model. Thus, the system must associate each annotation with a particular ontology entity. In OWL, the main types of resources are Classes, "Subclass of" relationships, Datatype Properties, Object Properties and Individuals. The ontology model is defined by classes, relationships that connect those classes, and datatype properties that are the attributes belonging to each class. The methodology for carrying out the ontology population consists of four main stages: (1) gathering the NEs identified in the previous phases, (2) creating a tree of combinations of ambiguous NEs, (3) calculating the score of all combinations and (4) inserting individuals into the ontology model.

The first step takes as input the list of annotated NEs that the Corpus Processing and NER phases identify in the text. At this point, it is necessary to take care of the NEs that present conflicts during the ontology population phase, related not to recognition mistakes but to language ambiguity. Different kinds of ambiguities might appear, as follows:
a. An annotation can be related to more than one NE. For example, Guggenheim can be a surname and a museum.
b. Several NEs overlap in the text. For example, Chelsea football club is a NE which has an overlapping NE, Chelsea.
c. An annotated NE can be related to several resources in the ontology. For example, the number 22358897 may be a phone number or a fax number.

The input of the second step is the list of annotated NEs described above. During this step, the methodology deals with the occurrences of ambiguities of type a. This kind of ambiguity can be avoided by defining all possible groups of non-ambiguous annotations. These non-ambiguous groups are represented in the form of a tree, where each level of the tree, from root to leaf nodes, represents a NE, and siblings within a level are the mutually incompatible ambiguous annotations related to that NE.

In the third step of the methodology, a score is assigned to each branch of the tree that represents a group of non-ambiguous annotations. Dealing with ambiguities of types b and c is compulsory during this step. This ambiguity is resolved by relating each NE to its closest one. The assessment is based on how many NEs can be interrelated and how far they are from each other. Hence, the more NEs that fit the ontology model and the closer they are to each other, the better the score that option can achieve. Table 2 describes the algorithm developed to evaluate all possible groups of NEs. The input parameter of the algorithm is NE_list. This list contains all the annotations of the NEs identified. The tree that represents all the allowed NE groups is generated by the function combinatorial_tree_of_NE(NE_list). This function creates a tree where ambiguity is represented as siblings, one for each incompatible annotation, and the depth of the tree is the number of NEs. The total number of groups of non-ambiguous annotations can be calculated with the formula ∏_i NE_i, where NE_i is the number of allowed annotations of an ambiguous entity i in the text.
The algorithm visits all nodes from the root to the leaves, in depth-first order. When the algorithm reaches a leaf node, it calculates the score of the group of annotations using the function calculate_score(NE_node). At the end, when all groups have been generated and scored, the one with the highest score is returned.


Table 2. Ontology population algorithm

    PROCEDURE get_the_best_combination(NE_list)
    BEGIN
      root_of_NE_tree = combinatorial_tree_of_NE(NE_list);
      NE_to_be_visited = stack_of_NE();
      NE_to_be_visited.push(root_of_NE_tree);
      score_best = 0;
      solution_best = list_of_NE();
      WHILE NE_to_be_visited.has_elements() DO
        NE_node = NE_to_be_visited.pop();
        IF NE_node.has_children() THEN
          NE_to_be_visited.pushAll(NE_node.getChildren());
        ELSE
          solution_current = NE_node.list_path_nodes();
          score_current = calculate_score(NE_node);
          IF score_current > score_best THEN
            solution_best = solution_current;
            score_best = score_current;
          END IF
        END IF
      END WHILE
      return solution_best;
    END PROCEDURE

The assessment of each group is based on the number of annotations that can be mapped to the ontology model and the number of relationships that may be created among them. Annotations are represented in the ontology as classes or properties. The annotations are usually surrounded by other annotations within the same scope that can be linked. When an annotation can be linked to other annotations, only the closest one is chosen. Table 3 presents the pseudo-code that defines how the score is calculated.

In Table 3, the variable score_total represents the total value obtained by the list of annotations, and score_A contains the value that is assigned to each annotation in the list. W_class refers to the weight that is assigned to an annotation that is an individual of an ontology class, W_property is the weight assigned to an annotation when it is mapped as a datatype property in the ontology, and W_relationship is the weight given to an individual each time it is linked to another individual or datatype property in the ontology. Besides, W_relationship is further adjusted by considering the distance between annotations. The distance is measured using the number of words that separate such annotations in the text. Thus, the closer the annotations are in the text, the higher the score they get. The values of the constants W_class, W_property and W_relationship are established by the end user. Thus, the final result of the population process depends on the values given to those weights.

Finally, once the ambiguity problems have been removed and the group of annotations with the highest score has been identified, it is possible to initiate the population of the ontology. The entities are inserted into the ontology model either as individuals or properties, as appropriate, and they are linked in the same manner as identified in the best group of annotations. An example that explains in detail how the algorithm works is presented below.


Table 3. The calculate_score function

    FUNCTION calculate_score(list_of_Annotations)
    BEGIN
      score_total = 0;
      FOR EACH annotation_A IN list_of_Annotations DO
        score_A = 0;
        IF annotation_A.is_a_class() THEN
          score_A += W_class;
          FOR EACH annotation_B IN list_of_Annotations DO
            IF annotation_A.can_be_related_to(annotation_B)
               AND annotation_A.is_the_closest_to(annotation_B) THEN
              distance_AB = annotation_A.position - annotation_B.position;
              score_A += W_relationship / distance_AB;
            END IF
          END FOR
        ELSIF annotation_A.is_a_property() THEN
          score_A += W_property;
        END IF
        score_total += score_A;
      END FOR
      return score_total;
    END FUNCTION
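The following Python sketch condenses the procedure of Tables 2 and 3. The Annotation structure, the can_be_related stand-in and the handling of zero distances are simplifying assumptions (a real implementation would query the ontology model for the relatedness checks), and the depth-first walk over the combination tree is expressed here as a product over the per-NE alternatives, which enumerates the same leaf paths.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

# Illustrative weights; Section 3.3.3 uses W_class = 0.1, W_property = 0.1 and
# W_relationship = 1, but the values are left to the end user.
W_CLASS, W_PROPERTY, W_RELATIONSHIP = 0.1, 0.1, 1.0


@dataclass
class Annotation:
    label: str       # ontology entity the annotation maps to (hypothetical)
    position: int    # word offset in the text
    is_class: bool   # True -> individual of a class, False -> datatype property


def can_be_related(a: Annotation, b: Annotation) -> bool:
    """Stand-in for the ontology-model check; a real system would verify that
    an object or datatype property can actually connect the two entities."""
    return a is not b


def is_closest_to(a: Annotation, b: Annotation, group: List[Annotation]) -> bool:
    """True if, among the annotations that could be related to b, a is the
    nearest one in the text ('only the closest one is chosen')."""
    partners = [c for c in group if c is not b and can_be_related(c, b)]
    if not partners:
        return False
    return min(partners, key=lambda c: abs(c.position - b.position)) is a


def calculate_score(group: List[Annotation]) -> float:
    """Sketch of the calculate_score function of Table 3."""
    total = 0.0
    for a in group:
        if a.is_class:
            score = W_CLASS
            for b in group:
                if can_be_related(a, b) and is_closest_to(a, b, group):
                    distance = max(abs(a.position - b.position), 1)
                    score += W_RELATIONSHIP / distance
        else:
            score = W_PROPERTY
        total += score
    return total


def best_combination(ne_levels: List[List[Annotation]]) -> Tuple[float, List[Annotation]]:
    """Sketch of get_the_best_combination of Table 2: every root-to-leaf path of
    the combination tree is one non-ambiguous group of annotations."""
    best_score, best_group = 0.0, []
    for group in product(*ne_levels):        # same groups as the depth-first walk
        score = calculate_score(list(group))
        if score > best_score:
            best_score, best_group = score, list(group)
    return best_score, best_group
```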

3.3.1. List of NEs. Figure 2 plots the annotated text that is obtained from the Corpus Processing and NER phases. Each colour represents a different annotation that was recognized. In this example, it is possible to observe the ambiguity among NEs, given that several entities have more than one annotation or that their annotations overlap.

Figure 2. Example of annotated text

3.3.2. Creating the groups of annotations. If ambiguity is detected, several groups of annotations are created. For example, in Figure 3, a set of three ambiguous annotations is shown. Specifically, the linguistic expression The Ritz-Carlton New York Central Park is annotated as a Hotel, New York is annotated as a Location and Central Park is annotated as a Location as well. Ambiguity exists because the hotel annotation overlaps with the location annotations, so in order to avoid this ambiguity it is necessary to create two separate groups of annotations, one with the hotel annotation and the other with the location annotations.

Figure 3. Example of a set of ambiguous annotations

Once the groups have been created, a tree is built in which the combinations of groups are represented. The function combinatorial_tree_of_NE(NE_list) is responsible for this step of the algorithm.


Figure 4 shows the tree of groups corresponding to the example shown in Figure 2. The algorithm uses the tree to visit all the disambiguated annotations from root to leaf nodes. In particular, the total number of paths that the algorithm needs to visit is 2 * 2 * 1 * 2 * 1 = 8, where each factor is the number of siblings at the corresponding tree level. So, the algorithm generates eight different combinations of non-ambiguous annotations.
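The same count can be checked directly by enumerating the per-level alternatives; the labels A–H below follow the sequences of annotations later listed in Table 4.

```python
from itertools import product

# Alternatives at each level of the tree in Figure 4, labelled as in the
# sequences of annotations of Table 4.
levels = [["A", "B"], ["C", "D"], ["E"], ["F", "G"], ["H"]]

groups = list(product(*levels))
print(len(groups))              # 2 * 2 * 1 * 2 * 1 = 8
for group in groups:
    print(", ".join(group))     # "A, C, E, F, H", ..., "B, D, E, G, H"
```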

Figure 4. Tree of disambiguated NEs

3.3.3. Calculating the score for each group of annotations. This step corresponds to the function calculate_score(NE_node) in the algorithm. The function takes the non-ambiguous annotations and maps them into the corresponding ontology classes and properties. Once all the annotations have been mapped to ontology resources, the function tries to combine each resource with the closest resources surrounding it in the text. Then, when two resources can be combined, a new relationship is created. Besides, the score of each annotation depends on the distance between the entities that are linked in the text. To calculate the score of each group, the weights that are related to the creation of a class, a property and a relationship must be determined. In the sample scenario, the parameters have been assigned the following values: W_class = 0.1, W_property = 0.1 and W_relationship = 1. W_relationship must be several times higher than the other weights because its value is reduced by the distance between annotations in the text. Finally, Table 4 indicates the score for each group of disambiguated annotations that appeared in the example above.


Table 4. Score of each group

    Sequence of Annotations    Total Score
    A, C, E, F, H              2.329912585
    A, C, E, G, H              2.322069448
    A, D, E, F, H              2.328100991
    A, D, E, G, H              2.320257854
    B, C, E, F, H              2.308196876
    B, C, E, G, H              2.300353738
    B, D, E, F, H              2.341530209
    B, D, E, G, H              2.333687072

In order to clarify how scores are calculated, the way the score has been assigned to the group with the highest score among those analyzed in Table 4 is described below. The annotations with the highest score are B, D, E, F and H (see Figure 4). In order to obtain these scores, the calculate_score function shown in Table 3 is invoked. The first annotation is the Hotel Ritz-Carlton New York Central Park, which is located at position zero in the text and is related to the class Hotel in the ontology. The score of this annotation is 0.354761905. In detail, 0.1 corresponds to W_class, and the rest of the score comes from the relationship weights: the relationship with the address 50 Central Park South, Upper West Side, which is separated by five words, so its value is 1/5; the relationship with the Zip Code NY 10019, which is separated by twelve words, so its value is 1/12; and the relationship with the Location Manhattan, which is separated by fourteen words, so its value is 1/14. The scores of the Address 50 Central Park South, Upper West Side, the Zip Code NY 10019, the Location Manhattan, the Location New York, the Phone (212) 308-9100, the Fax (212) 207-8831 and the Web Page www.ritzcarlton.com are the corresponding W_property and W_class, which are 0.1 each.

The next annotation is again a Hotel annotation, Central Park Ritz, which is located at position forty-four in the text, and its score is 0.686768304. In detail, 0.1 corresponds to W_class, 1/14 to the relationship with the Location New York, which is 14 words distant, 1/11 to the relationship with the Phone number (212) 308-9100, which is 11 words distant, 1/8 to the relationship with the Fax number (212) 207-8831, which is 8 words distant, 1/5 to the relationship with the Web Page www.ritzcarlton.com, which is 5 words distant, 1/17 to the relationship with the Activity theatres of Broadway, which is 17 words distant, 1/15 to the relationship with Broadway, which is 15 words distant, 1/22 to the relationship with the Location Rockefeller Center, which is 22 words distant, 1/24 to the relationship with the Activity museums, which is 24 words distant, and 1/25 to the relationship with the Activity shopping centres, which is 25 words distant. The scores of the Activity theatres of Broadway, the Location Rockefeller Center, the Activity museums and the Activity shopping centres are the corresponding W_property and W_class, which are 0.1 each. Finally, the sum of the scores of each annotation is the total score for this group.

3.3.4. The best combination. At this point, the system has evaluated which group of annotations is best mapped into the ontology and which provides the highest amount of related information extracted from the text. The objective is to maximize the number of properties and relationships associated with each entity. Thus, the process is reduced to an optimization problem with the aim of maximizing the amount of information, limited by the ontology model. Figure 5 plots the best combination for the example shown in Figure 4, and shows how the most suitable annotations in the text are mapped into the ontology. For example, The Ritz-Carlton New York Central Park is created as an individual of the class Hotel, and the address 50 Central Park South, Upper West Side and the location Manhattan are related to it.

3.3.5. Inserting individuals into the tourism ontology. Once the best group of annotations has been selected, the corresponding elements are inserted into the ontology. An example of the results of the population process is shown in Figure 6 through a screenshot of the Protégé ontology editor.
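The insertion step amounts to creating the selected individuals and linking them through the ontology's object and datatype properties. The following is a hedged sketch using rdflib rather than the toolkit of the prototype; the namespace and the property names (hasAddress, hasZipCode, hasWebPage, isLocatedIn) are illustrative assumptions, not the actual vocabulary of the e-tourism ontology described in Section 4.1.

```python
from rdflib import Graph, Literal, Namespace, RDF

# Hypothetical namespace and property names; the real vocabulary comes from the
# e-tourism ontology described in Section 4.1.
TOUR = Namespace("http://example.org/etourism#")

g = Graph()
g.bind("tour", TOUR)

hotel = TOUR["Ritz-Carlton_New_York_Central_Park"]
manhattan = TOUR["Manhattan"]

# Individuals corresponding to the best group of annotations (cf. Figure 5)
g.add((hotel, RDF.type, TOUR.Hotel))
g.add((manhattan, RDF.type, TOUR.Location))

# Datatype properties extracted as NEs
g.add((hotel, TOUR.hasAddress, Literal("50 Central Park South, Upper West Side")))
g.add((hotel, TOUR.hasZipCode, Literal("NY 10019")))
g.add((hotel, TOUR.hasWebPage, Literal("www.ritzcarlton.com")))

# Object property linking the two individuals
g.add((hotel, TOUR.isLocatedIn, manhattan))

g.serialize(destination="populated_etourism.owl", format="xml")
```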


Figure 5. Example of the best combination of NEs

Figure 6. Result of the ontology population process

3.4. Consistency checking. The proposed methodology populates OWL ontology models. More concretely, this method has been designed and implemented by using the second version of the Web Ontology Language, OWL 2 [26], which is an extension and revision of OWL and has become the W3C recommendation for representing ontologies in the Semantic Web. OWL 2 (http://www.w3.org/TR/owl2-profiles/) addresses several problems and drawbacks that have been identified over the years of extensive application of OWL in different contexts. In particular, this improved version of OWL adds several new features, including increased expressive power for properties, extended support for datatypes, simple metamodeling capabilities and extended annotation capabilities. Among the different OWL 2 flavors, OWL 2 DL, based on Description Logics, has been used. Its formal model allows a set of Description Logic inference services to be performed automatically, which can be supported by DL reasoners such as HermiT, Pellet 2, FaCT++ or Racer [27]. Examples of such inference services are the following:
• Consistency checking, which ensures that an ontology does not contain any contradictory facts.


• Concept satisfiability, which checks whether it is possible for a class to have any instances. If a class is unsatisfiable, then defining an instance of the class will cause the whole ontology to be inconsistent.
• Classification, which computes the subclass relations between every pair of named classes to create the complete class hierarchy. The class hierarchy can be used to answer queries such as getting all, or only the direct, subclasses of a class.
• Realization, which finds the most specific classes that an individual belongs to or, in other words, computes the direct types of each individual.

From a logical point of view, an OWL ontology can be viewed as a collection of axioms that must be satisfied. This includes not only classes and properties, but also constraints such as disjoint classes. The existence of such constraints is useful not only for populating the ontology, but also for guaranteeing the consistency of the inserted individuals, which must satisfy the restrictions defined for their corresponding class. Moreover, the collection of conditions defined for the classes can be used by the reasoner for the automatic classification of individuals. The consistency of the populated ontologies is validated by using the HermiT reasoner (http://hermit-reasoner.com/). This process of consistency checking ensures that the knowledge that can be inferred from the ontology by applying the corresponding axioms is correct.
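As a hedged illustration of how this check can be scripted, the owlready2 library (not the tooling reported in the paper) bundles the HermiT reasoner and raises an error when the populated model is inconsistent; the file path below is a placeholder.

```python
from owlready2 import OwlReadyInconsistentOntologyError, get_ontology, sync_reasoner

# Placeholder path to the populated OWL 2 DL model produced in Section 3.3.
onto = get_ontology("file:///tmp/populated_etourism.owl").load()

try:
    # sync_reasoner() runs the bundled HermiT reasoner: it classifies the
    # ontology and fails if the asserted facts contradict the axioms.
    with onto:
        sync_reasoner()
    print("The populated ontology is consistent.")
except OwlReadyInconsistentOntologyError:
    print("Inconsistent ontology: some inserted individuals violate its axioms.")
```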
4. Use Case Scenario: The Tourism Domain. Motivated by the new advances and trends in Information Technologies (IT), an increasing number of tourism operators offer their products and services to their customers through online Web services. Similarly, regional and local administrations publish tourism-related information (e.g., places of interest, hotels and restaurants, festivals) on publicly accessible websites. Hence, the tourism industry is becoming information intensive, and both the market and the heterogeneity of information sources are causing several problems for users, because finding the right information is becoming rather difficult. The Semantic Web enables better machine processing of information by structuring Web documents, such as tourism-related information, thus making them understandable for machines. In this sense, ontologies provide a formal and structured knowledge representation schema that is reusable and shareable. Thus, the successful application of these technologies depends heavily on the availability of tourism ontologies, which would provide a standardized vocabulary and a semantic context.

In order to customize the methodology presented here for a particular application domain, the domain ontology and some NER resources have to be defined. First, a discourse analysis has been performed on the training corpora. Some basic features of the tourism sublanguage are:
• Specialized vocabulary: check-in, overbooking, charter.
• Vocabulary related to other disciplines such as art, architecture or activities.
• Evaluative vocabulary: excellent situation, magnificent views, friendly helpful service.
• Use of nominal style more frequently than verbal style: "Only 2 minutes from Paddington Station with the Heathrow Express service, and close to Hyde Park."
• Use of simple sentences and lists of characteristics and services.

These lexical, syntactic and semantic regularities have been useful for the creation of the rules and lists identifying NEs. Some of the most relevant NEs identified in the context of the tourism domain are: Hotel, Services, Address, Zip Code, Telephone, Fax, Country, City, Municipality, Beach, E-mail, Web Page, Monuments, Architectural Style, Airport, Restaurant, Menu, Meals, etc.


4.1. Tourism ontology. A large number of ontologies for e-tourism have been developed to date. Hi-Touch, for example, is a tourism-related European project that makes use of the tourism ontology created by Mondeca's working group [28]. In the context of this project, the ontology was improved by adding several concepts from the Thesaurus on Tourism and Leisure Activities developed by the WTO (World Tourism Organization) [29]. The e-Tourism Working Group at DERI (Digital Enterprise Research Institute), for its part, has created a tourism ontology named OnTour [30]. Also, under the scope of Harmonise, a further tourism-related European project, an ontology called IMHO (Interoperable Minimum Harmonisation Ontology) [31] was developed. The SEED (SEmantic E-tourism Dynamic packaging) research laboratory has developed the Ontology for Tourist Information Systems (OTIS) [32]. Another e-tourism ontology is the Australian Sustainable Tourism Ontology (AuSTO) [33]. LA DMS is a comprehensive ontology for tourism destinations that was deployed in order for the Destination Management System (DMS) to become adaptive to users' needs regarding tourist destination information requests [34]. Finally, in the SATINE project [35], several ontologies have been created that can be used for service annotation in order to develop a semantic-based interoperability environment for integrating Web Services Platforms in the travel industry.

Given that most of the above-mentioned ontologies are yet to be completed, and taking into account the shortcomings of developing a new ontology from scratch, for the purposes of our research we have reused the e-tourism ontology travel.owl [36] developed within the Protégé project, adding new restaurant-related classes and other properties from the OnTour ontology [30]. We also incorporated into the ontology new specific classes about the Spanish hotel industry that have been considered relevant after analyzing the available resources. The resulting ontology contains all the touristic information that the use case scenario requires. The ontology has been implemented in OWL [37]. An excerpt of the ontology is shown in Figure 7.

Figure 7. An excerpt of the eTourism ontology

4.2. Corpora selection. Ontology population from text implies the existence of certain linguistic resources from which to obtain the instances, i.e., a corpus. A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria, in order to be used as a representative language sample [38]. In this work, two corpora, the Hotels Corpus and the Restaurants Corpus, have been compiled. Both were obtained from an official tourism Web page. The restaurants corpus consists of a description in Spanish of a total of 848 restaurants (67,500 words approximately). The hotels corpus comprises the description in Spanish of 112 hotels (14,000 words approximately). For the development of the linguistic resources, 200 out of the 846 restaurant descriptions and 40 out of the 112 hotel descriptions have been considered. The remaining descriptions have been used to test the system.

4.3. Evaluation. In this section, the experimental results obtained by our methodology are presented. Two experiments have been conducted on the corpora explained above. Each corpus has been divided into two parts. The first part has been used to create and train the NLP resources, and the other part has been used to measure the precision and recall of the system.
The success of the ontology population task is directly related to both the number of NEs extracted during the first stage of the methodology by using GATE and the accuracy of the system in resolving ambiguities, that is, its ability to classify the NE candidates. The results obtained during the NER phase appear in Table 5. This table includes the ambiguous NE annotations, the correct ones and the ambiguous NE annotations that could not be resolved after the Ontology Population process.


For example, in the Restaurants corpus, a total of 13,072 NE annotations were correctly extracted, 123 annotations were ambiguous and only 59 annotations could not be disambiguated after the ontology population process.

Table 5. NE annotations identified

                                                  Restaurants    Hotels
    Ambiguous NE annotations                      123            73
    Ambiguous NE annotations after OP process     59             8
    Correct NE annotations                        13072          2958

On the one hand, the number of NE annotations correctly identified by GATE represents 99% of all the NE annotations. This figure refers to the groups of annotations that contain a single mention of a given NE and, therefore, have no ambiguities, which represent the majority of cases. In contrast, the annotations showing some ambiguity (i.e., those groups of annotations that can be matched to more than one NE) were also gathered. In proportional terms, the number of ambiguities is greater in the Hotels Corpus. This is mainly because, in most cases, the names of the hotels include elements that are classifiable into different types of NE, such as location and activity. The majority of these ambiguities are solved by the methodology proposed in this work. However, there are some NE annotations that cannot be resolved during the Ontology Population process. The number of ambiguities that have not been solved is greater in the Restaurants Corpus. The reason for this is that the descriptions are shorter and the number of possible relationships between the entities is smaller.


Consequently, there may be no difference in knowledge gain if the system chooses one entity or another, and so the system cannot correctly determine to which class they belong.

In order to evaluate the results obtained after the Ontology Population phase, the relevant ontological entities (individuals, object properties and datatype properties) that appear in the corpora were gathered manually. Information about the number of knowledge entities retrieved and the number of knowledge entities correctly retrieved by our methodology was also obtained. Furthermore, it was necessary to check whether the individuals and properties were properly created and the instances correctly instantiated within the ontology. The results of this evaluation process are shown in Figure 8.

Figure 8. Values and measures obtained during the evaluation

In order to calculate the standard accuracy metrics for the proposed methodology, that is, recall (1) and precision (2), the total number of knowledge entities retrieved was compared against those that were relevant. Furthermore, precision has been calculated using the relevant entities retrieved and those that were irrelevant. Finally, the F-measure (3), a weighted harmonic mean, has also been calculated. The results for these accuracy measures are presented in Table 6.

    recall = Knowledge Entities correctly retrieved / Relevant Knowledge Entities in corpus    (1)

    precision = Knowledge Entities correctly retrieved / Total Knowledge Entities retrieved    (2)

    F-Measure = (2 * recall * precision) / (recall + precision)    (3)
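These measures are straightforward to compute from the evaluation counts; as a sanity check, the snippet below reproduces the F-measure of the Individual row for the Restaurants Corpus in Table 6 from its published recall and precision values.

```python
def recall(correctly_retrieved: int, relevant_in_corpus: int) -> float:
    return correctly_retrieved / relevant_in_corpus          # Equation (1)


def precision(correctly_retrieved: int, total_retrieved: int) -> float:
    return correctly_retrieved / total_retrieved             # Equation (2)


def f_measure(r: float, p: float) -> float:
    return 2 * r * p / (r + p)                               # Equation (3)


# Individual row, Restaurants Corpus (Table 6): recall 89.32%, precision 98.66%
r, p = 0.8932, 0.9866
print(f"F-measure = {100 * f_measure(r, p):.2f}%")           # prints 93.76%
```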


Table 6. Average recall, precision and F-measure for the two corpora

                         Restaurants Corpus                     Hotels Corpus
                         Recall    Precision  F-Measure         Recall    Precision  F-Measure
    Individual           89.32%    98.66%     93.76%            94.93%    93.21%     94.06%
    Object Property      82.19%    92%        87.26%            80.14%    90.56%     85.03%
    Datatype Property    81.49%    93.33%     87.01%            79.57%    96.54%     87.24%
    Total                84.10%    95.66%     89.51%            81.19%    92.35%     86.85%

Significant values for precision and recall were achieved in both corpora. The main reasons for this are that (1) both domains are quite specific, and (2) the linguistic analysis performed has allowed the creation of well-fitted linguistic resources. Similarly, the F-measure values for individuals and relationships show that the ontology was correctly instantiated at a rate of over 89% in the Restaurants Corpus and 86% in the Hotels Corpus. With an overall F-measure value of 89.51%, the system performance in the Restaurants domain is slightly better than in the Hotels domain. This slight variation in performance between domains could be due to the different specificity of their textual contexts.

5. Conclusion and Discussion. Most of the information currently available on the Web is not properly structured for computers to read and understand, because it has been designed for human consumption. The Semantic Web aims to provide a formal mechanism to organize data on the Web in such a way that machines can easily exploit them. Disappointingly, the chicken-or-egg dilemma has accompanied the Semantic Web from its very conception: without substantial Semantic Web content, few tools will be written to consume it; without many such tools, there is little appeal to publish Semantic Web content [39]. In a fundamental step towards alleviating this problem, in this paper we propose a methodology for populating ontologies from unstructured web documents. This methodology allows for the automatic population of ontologies on the basis of Semantic Web technologies and Natural Language Processing techniques. In particular, with our approach, a given ontology can be enriched by adding instances gathered from natural language texts. It is a pattern-based approach, but with the advantages that (1) there is no need for a previously annotated corpus, and (2) once the linguistic resources have been created, the system is completely automatic.

The linguistic framework is integrated in GATE, which is widely used by the computational linguistics community. GATE allows the easy integration of linguistic resources depending on the needs of the domain and the nature of the texts under consideration. Consequently, the proposed approach is flexible and fairly portable to other domains. On a related note, the ontology population process, which is based on the semantic distance of the knowledge entities detected, has been designed to be domain and language independent. Moreover, the consistency of the ontology is checked at the end of the process, which is essential in a fully automatic method.

The methodology presented here has been validated in the tourism domain in the Spanish language with promising results. A more in-depth validation of the system is planned, comprising the application of the system to texts from different touristic domains and the use of statistical methods for analyzing the results obtained. The validation of the proposed methodology in the scope of other application domains, such as the financial domain, is also left for future work. We also plan to include other NLP techniques, such as syntactic analysis and semantic role labelling, to enrich the ontology population phase and improve its performance and domain independence. Furthermore, the conducted discourse analysis shows that NEs usually appear in texts in a particular order and grouped by categories. For this reason, a mechanism to calculate the weight of NEs in different parts of texts, as proposed in [40], could be used as an additional parameter to facilitate the NE disambiguation process.


Finally, due to the percentage of misprints and orthographic errors that have been detected in the texts extracted from the Web, a multilingual spell checker will be implemented and integrated within the application. This is particularly important since the loss of information caused by these kinds of errors is highly significant on the Web.

Acknowledgment. This work has been supported by the Spanish Government through project SeCloud (TIN2010-18650). Juana María Ruiz-Martínez is supported by the Fundación Séneca through grant 06857/FPI/07.

REFERENCES

[1] T. Berners-Lee, J. Hendler and O. Lassila, The semantic web, Scientific American Magazine, vol.284, pp.34-43, 2001.
[2] R. Studer, V. R. Benjamins and D. Fensel, Knowledge engineering: Principles and methods, Data Knowl. Eng., vol.25, pp.161-197, 1998.
[3] T. R. Gruber, A translation approach to portable ontology specifications, Knowledge Acquisition, vol.5, pp.199-220, 1993.
[4] X.-Q. Yang, N. Sun, T.-L. Sun, X.-Y. Cao and X.-J. Zheng, The application of latent semantic indexing and ontology in text classification, International Journal of Innovative Computing, Information and Control, vol.5, no.12(A), pp.4491-4499, 2009.
[5] H. M. Park, Y. L. Lee, B. N. Noh and H. H. Lee, Ontology-based generic event model for ubiquitous environment, International Journal of Innovative Computing, Information and Control, vol.5, no.11(B), pp.4317-4326, 2009.
[6] R.-C. Chen, C.-T. Bau, M.-Y. Tsai and C.-Y. Huang, Web pages cluster based on the relations of mapping keywords to ontology concept hierarchy, International Journal of Innovative Computing, Information and Control, vol.6, no.6(B), pp.2749-2760, 2010.
[7] Y. Zhang, H. Huang, D. Yang and H. Zhang, A hierarchical and chord-based semantic service discovery system in the universal network, International Journal of Innovative Computing, Information and Control, vol.5, no.11(A), pp.3745-3753, 2009.
[8] Y. Yang, P. Jiang, S. Tsuchiya and F. Ren, Effect of using pragmatics information on question answering system of analects of confucius, International Journal of Innovative Computing, Information and Control, vol.5, no.5, pp.1201-1212, 2009.
[9] X. Jiang and A. Tan, Learning and inferencing in user ontology for personalized semantic web search, Inf. Sci., vol.179, pp.2794-2808, 2009.
[10] H. Yang, Automatic generation of semantically enriched web pages by a text mining approach, Expert Syst. Appl., vol.36, no.8, pp.9709-9718, 2009.
[11] G. Petasis, V. Karkaletsis and G. Paliouras, Ontology population and enrichment: State of the art, Deliverable D4.3, BOEMIE: Bootstrapping Ontology Evolution with Multimedia Information Extraction, 2007.
[12] V. de Boer, M. van Someren and B. J. Wielinga, Relation instantiation for ontology population using the web, Proc. of the 29th Annual German Conference on AI, KI 2006, vol.4314, pp.202-213, 2007.
[13] M. Murata, T. Shirado, K. Torisawa, M. Iwatate, K. Ichii, Q. Ma and T. Kanamaru, Extraction and visualization of numerical and named entity information from a very large number of documents using natural language processing, International Journal of Innovative Computing, Information and Control, vol.6, no.3(B), pp.1549-1568, 2010.
[14] H. Tanev and B. Magnini, Weakly supervised approaches for ontology population, Proc. of EACL 2006, Trento, pp.3-7, 2006.
[15] D. Maynard, A. Funk and W. Peters, SPRAT: A tool for automatic semantic pattern-based ontology population, International Conference for Digital Libraries and the Semantic Web, Trento, Italy, 2009.
[16] B. Magnini, E. Pianta, O. Popescu and M. Speranza, Ontology population from textual mentions: Task definition and benchmark, Proc. of the 2nd Workshop on Ontology Learning and Population: Bridging the Gap between Text and Knowledge, pp.26-32, 2006.
[17] C. Giuliano and A. Gliozzo, Instance-based ontology population exploiting named-entity substitution, Proc. of the 22nd International Conference on Computational Linguistics, Manchester, Association for Computational Linguistics, vol.1, pp.265-272, 2008.
[18] D. Celjuska and M. Vargas-Vera, Ontosophie: A semi-automatic system for ontology population from text, Proc. of the International Conference on Natural Language Processing (ICON), vol.4, 2004.


[19] R. Navigli and P. Velardi, Enriching a formal ontology with a thesaurus: An application in the cultural heritage domain, Proc. of the 2nd Workshop on Ontology Learning and Population: Bridging the Gap between Text and Knowledge, Sydney, Australia, pp.1-9, 2006.
[20] F. Amardeilh, P. Laublet and J. L. Minel, Document annotation and ontology population from linguistic extractions, Proc. of the 3rd International Conference on Knowledge Capture, pp.161-168, 2005.
[21] L. K. McDowell and M. Cafarella, Ontology-driven, unsupervised instance population, Web Semantics: Science, Services and Agents on the World Wide Web, vol.6, no.3, pp.218-236, 2008.
[22] J. M. Ruiz-Martínez, J. A. Miñarro-Giménez, L. Guillén-Cárceles, D. Castellanos-Nieves, R. Valencia-García, F. García-Sánchez, J. T. Fernández-Breis and R. Martínez-Béjar, Populating ontologies in the e-Tourism domain, WI-IAT '08, IEEE/WIC/ACM, pp.316-319, 2008.
[23] H. Cunningham, GATE, a general architecture for text engineering, Computers and the Humanities, vol.36, pp.223-254, 2002.
[24] J. Atserias, B. Casas, E. Comelles, M. González, L. Padró and M. Padró, FreeLing 1.3: Syntactic and semantic services in an open-source NLP library, Proc. of the 5th International Conference on Language Resources and Evaluation, 2006.
[25] M. Sabou, C. Wroe, C. Goble and H. Stuckenschmidt, Learning domain ontologies for semantic web service descriptions, Web Semantics: Science, Services and Agents on the World Wide Web, vol.3, no.12, pp.340-365, 2005.
[26] B. Grau, I. Horrocks, B. Motik, B. Parsia, P. Patel-Schneider and U. Sattler, OWL 2: The next step for OWL, Journal of Web Semantics, vol.6, no.4, pp.309-322, 2008.
[27] E. Sirin, B. Parsia, B. C. Grau, A. Kalyanpur and Y. Katz, Pellet: A practical OWL-DL reasoner, Journal of Web Semantics, vol.5, no.2, pp.51-53, 2007.
[28] J. Delahousse, Semantic Web Use Case: An Application for Sustainable Tourism Development, Mondeca, http://www.mondeca.com/sw-tourism-ontoweb-sig4-V2.pdf, 2003.
[29] WTO, Thesaurus on Tourism and Leisure Activities of the World Tourism Organization, 2001.
[30] K. Prantner, OnTour: The Ontology, DERI Innsbruck, http://ontour.deri.org/ontology/ontour-02.owl, 2004.
[31] M. Missikof, H. Werthner, W. Höpken, M. Dell'Erba, O. Fodor, A. Formica and F. Taglino, Harmonise towards interoperability in the tourism domain, The 10th International Conference on Information Technologies in Tourism, pp.29-31, 2003.
[32] J. Cardoso, E-tourism: Creating dynamic packages using semantic web processes, W3C Workshop on Frameworks for Semantics in Web Services, 2005.
[33] R. Jakkilinki, M. Georgievski and N. Sharda, Connecting destinations with ontology-based e-tourism planner, The 14th Annual Conference of the International Federation for IT & Travel and Tourism, 2007.
[34] D. N. Kanellopoulos and A. A. Panagopoulos, Exploiting tourism destinations' knowledge in an RDF-based P2P network, Journal of Network and Computer Applications, vol.31, pp.179-200, 2008.
[35] M. Flugge and D. Tourtchaninova, Ontology-derived activity components for composing travel web services, International Workshop on Semantic Web Technologies in Electronic Business, Berlin, Germany, 2004.
[36] H. Knublauch, Travel.owl, http://protege.stanford.edu, 2004.
[37] Jena, Semantic Web Framework, Version 2.6, 2010.
[38] J. Sinclair, EAGLES: Preliminary recommendations on text typology, Document EAG-TCWG-CTYP/P, 1996.
[39] D. Huynh, S. Mazzocchi and D. Karger, Piggy bank: Experience the semantic web inside your web browser, Journal of Web Semantics, vol.5, pp.16-27, 2007.
[40] Z. Zhu, P. Liu, L. Zhao and T. Lv, Research of feature weights adjustment based on semantic paragraphs matching, ICIC Express Letters, vol.4, no.2, pp.559-564, 2010.