A Parallel Corpus Labeled Using Open and Restricted Domain Ontologies

E. Boldrini, S. Ferrández, R. Izquierdo, D. Tomás, and J.L. Vicedo

Natural Language Processing and Information Systems Group, Department of Software and Computing Systems, University of Alicante, Spain
{ebolrini,sferrandez,ruben,dtomas,vicedo}@dlsi.ua.es
Abstract. The analysis and creation of annotated corpora are fundamental for implementing natural language processing solutions based on machine learning. In this paper we present a parallel corpus of 4500 questions in Spanish and English on the touristic domain, obtained from real users. With the aim of training a question answering system, the questions were labeled with the expected answer type according to two different ontologies. The first is an open domain ontology based on Sekine's Extended Named Entity Hierarchy, while the second is a restricted domain ontology specific to the touristic field. Because the two ontologies have different characteristics, we had to solve many problematic cases and adjust our annotation to the features of each one. We present an analysis of the domain coverage of these ontologies and the results of the inter-annotator agreement. Finally, we use a question classification system to evaluate the labeling of the corpus.
1 Introduction
A corpus is a collection of written or transcribed texts created or selected using clearly defined criteria. It is a selection of natural language texts that are representative of the state of the language or of a special variety of it. Corpus annotation is a difficult task due to the ambiguities of natural language. As a consequence, annotation is time-consuming, but it is extremely useful for natural language processing tasks based on machine learning, such as word sense disambiguation, named entity recognition, or parsing. Question answering (QA) is the task of retrieving, from a collection of documents (either a local collection or the World Wide Web), the answers to queries posed in natural language. The purpose of this work is the development of a corpus for training a question classification system. Question classification is one of the tasks carried out in a QA system: it assigns a class or category to the answer being sought. The answer extraction process depends on this classification,
This research has been partially funded by the Spanish Government under project CICyT number TIC2003-07158-C04-01 and by the European Commission under FP6 project QALL-ME number 033860.
A. Gelbukh (Ed.): CICLing 2009, LNCS 5449, pp. 346–356, 2009. © Springer-Verlag Berlin Heidelberg 2009
as different strategies may be used depending on the question type detected. Consequently, the overall performance of the system depends directly on question classification. We have developed a parallel corpus of 4500 questions in English and Spanish that has been employed in the QALL-ME European project. Every question in this corpus has been labeled with its expected answer type (EAT), defined in the literature as "the class of object sought by the question". We labeled these questions using two different ontologies. The first is Sekine's [15], an ontology suitable for open domain question answering systems, like those traditionally presented at conferences such as TREC and CLEF. The second is a restricted domain ontology for the touristic field, created ad hoc for the QALL-ME project [12]. The aim of this double annotation is to allow training question classification systems in both open and restricted domains. In this paper, we present the different features of the two ontologies, and we also describe the solutions we adopted for the most problematic annotation cases. Moreover, in order to test the coherence of the corpus, we examine the inter-annotator agreement and the performance of a question classification system trained on the corpus. The rest of the paper is organized as follows: Section 2 presents related work. Sections 3 and 4 give a detailed description of the corpus and the ontologies we employed. Section 5 explains the annotation process and examines the most problematic cases. Section 6 presents the evaluation of the EAT annotation of the corpus. Finally, Section 7 draws conclusions and outlines future work.
2 Related Work
There is a wide range of QA systems that apply machine learning techniques based on corpora, covering different stages of the question answering task. In [13], the authors developed a corpus of question-answer pairs called the KM database. Each pair in the KM represents a trivia question and its answer, as in the trivia card game. The question-answer pairs were filtered to keep only questions and answers similar to those presented in the TREC task. Using this corpus, they automatically collected a set of text patterns employed for answer extraction. In [16], the authors present a QA system built around a noisy-channel architecture which exploits both a language model for answers and a transformation model for answer/question terms. In order to apply the learning mechanisms, they first created a large training corpus of question-answer pairs with broad lexical coverage. They collected FAQ pages and extracted a total of one million question-answer pairs. After that, they employed this training corpus in the query analysis and answer extraction modules. In the work presented in [1], the authors developed a system that used a collection of approximately 30,000 question-answer pairs for training. The corpus was obtained from more than 270 FAQ files on different subjects in the FAQFinder project [4]. They used this corpus to automatically learn phrase features for classifying questions into different types, and to generate candidate query transformations. Finally, [3] proposed another approach based on machine learning: starting from a large collection of answered questions, the algorithms described learned lexical correlations between questions and answers.

Nowadays, there is also a wide range of research projects focused on the touristic domain, and more specifically on the creation of restricted domain ontologies; the main objective is to investigate complex language technologies and web technologies to improve information searching and access in this data-rich area. In the following paragraphs we present some of the most relevant. The first is the Harmonise ontology, developed during the Harmonise project and then extended to new subdomains in the Harmonise Trans-European Network for tourism (Harmo-TEN) project. The aim of the two related projects was to provide an open mediation service for the exchange of travel and tourism information among tourism industry members. It is also important to mention the Hi-Touch ontology, developed in an IST/CRAFT European program which aimed to develop Semantic Web methodologies and tools for intra-European sustainable tourism.

Footnotes:
1. QALL-ME is an EU-funded project which aims to establish a shared infrastructure for multilingual and multimodal question answering in the tourism domain. The QALL-ME system (http://qallme.fbk.eu/) allows users to pose natural language questions in several languages (in both textual and speech modalities) using a variety of input devices (e.g. mobile phones), and returns a list of specific answers formatted in the most appropriate modality, ranging from short texts to maps, videos, and pictures.
2. http://trec.nist.gov
3. http://www.clef-campaign.org
The Hi-Touch ontology was developed mainly by Mondeca, using the "Thesaurus on Tourism and Leisure Activities" (World Tourism Organization, 2001) as the official source for its terminology. The ontology focuses on tourism products and customers' expectations. Its usage can ensure consistent categorization of the tourism resources managed in different databases, and it enhances searches among numerous tourism products by providing semantic query functionalities. The eTourism Semantic Web portal was developed by the Digital Enterprise Research Institute; it consisted of a search interface where information retrieval was based on semantic data to allow better queries. An ontology was used to provide vocabulary for annotations and to obtain agreement on a common specification language for sharing semantics. It mainly covers accommodation and activities, including the infrastructure necessary for the activities. TAGA is an agent framework for simulating the global travel market on the Web. In TAGA, all travel service providers can sell their services on the Web, forming a travel market; travel agents can help customers to buy travel packages from the Web travel market according to the customers' preferences. TAGA defines the domain ontologies to be used in simulations. The BMBF-funded German Text Exploitation and Search System (GETESS) project aimed at developing an intelligent Web tool for information retrieval in the tourism domain. GETESS enables natural language description of search queries through navigation in a domain-specific ontology, and presents the results in an understandable form. The GETESS ontology [8] contains 1043 concepts and 201 relations and provides bilingual terms (English and German) for each concept. It is the central service for text mining, storage, and querying of semantic content, determining which facts may be extracted from texts, which database schema must be used to store these facts, and what information is made available at the semantic level [17]. None of the corpora presented here offers the double annotation that we describe in this work, which allows us to train a question answering system in both open and restricted domains.

Footnotes:
4. http://www.cepis-harmonise.org/harmonise/php/
5. http://e-tourism.deri.at/
6. http://taga.sourceforge.net/
3 Description of the Corpus
The object of our study is the corpus created for the QALL-ME project, composed of 4500 questions in Spanish, with a parallel English version, about the touristic domain. It is the result of collecting sentences recorded by a large number of speakers. Each speaker produced 30 questions based on 15 realistic scenarios, generating two questions per scenario. This collection of questions was created to obtain a sample of real natural language, and for this reason speakers had to think about real needs and real situations. Each speaker is given a list of scenarios in order to formulate spoken queries to the system over the telephone, and then reads a written question for the same scenario. A scenario is composed of the following items:

1. Sub Domain identifies the context in which the query has to be posed; it could be "cinema", "restaurants", "events", etc.
2. Desired Output identifies the kind of information to be obtained from the system (e.g. how to get to the cinema, the telephone number of a restaurant, the cost of a ticket, etc.).
3. Mandatory Items is a list of items; the speaker has to include all of them.
4. Optional Items is a list of items; the speaker can include none, some, or all of them in the question.

We could define a realistic scenario as a set of instructions that allows a speaker to formulate useful questions without providing too many suggestions for each query. In our case, the questions were recorded and then transcribed using the free tool Transcriber. After the transcription process, the corpus was translated into English, taking into account the main aim of the translation: the simulation of situations in which a person is visiting a city and asks for touristic information in a natural way. The only elements that we did not translate are named entities. After the translation, the corpus was annotated to detect the speech acts that express the different communicative purposes. According to [2], when we speak we are doing something with our words. In fact, the term "speech act" is a synonym of illocutionary act: when a minister joins two people in marriage he says "I now pronounce you husband and wife", and the words pronounced by the minister have an effect on reality. Thus, the queries of our corpus have been created with a real need in mind, and each of them should have the purpose of generating an effect; in our case, the effect is the answer. For this reason, we group the queries into request and non-request. The first group can be direct or indirect, and the second can be greetings, thanks, asserts, or other. The corpus contains queries such as "At what time can I watch the movie El Ilusionista at the Ábaco 3D cinema?" or "What is the price of a double room at the Vista Blanca hotel?". After completing the aforementioned steps, the EAT was annotated using Sekine's ontology, which is open domain; as a consequence, it provides a description of the whole world, although we focus on the part of the world we are working with. Moreover, we annotated our corpus according to the definitions of the labels provided by Sekine, adapting them to the needs of the project. This corpus could thus be used in open domain QA systems such as those that participate in competitions like TREC [19] or CLEF [6]. Finally, the last step consisted in annotating the corpus using the restricted domain QALL-ME ontology, which is specific to and complete for the touristic area and was created ad hoc for the needs of the project.

Footnotes:
7. http://www.getess.de/goal.html
8. http://trans.sourceforge.net/en/presentation.php
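The scenario items and the speech-act grouping described above can be sketched as a single corpus record. The following is a minimal illustration in Python; every field name and the exact tag strings are assumptions for the sake of the example, not the project's actual schema:

```python
# Illustrative sketch of one corpus item: the elicitation scenario handed to
# the speaker, plus the resulting transcribed question and its speech-act tag.
record = {
    "scenario": {
        "subdomain": "cinema",                        # context of the query
        "desired_output": "start time of a showing",  # information sought
        "mandatory_items": ["Abaco 3D", "El Ilusionista"],  # must all appear
        "optional_items": ["date", "ticket type"],    # zero or more may appear
    },
    "question_en": "At what time can I watch El Ilusionista at the Abaco 3D cinema?",
    # request: direct/indirect; non-request: greetings/thanks/asserts/other
    "speech_act": ("request", "direct"),
}

def covers_mandatory(rec):
    """A valid question must mention every mandatory item of its scenario."""
    text = rec["question_en"].lower()
    return all(item.lower() in text for item in rec["scenario"]["mandatory_items"])
```

A simple check like `covers_mandatory` mirrors the constraint that speakers must include all mandatory items, while optional items are left to the speaker's choice.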
4 Description of the Two Ontologies
Sekine's Extended Named Entity Hierarchy originates from the first Named Entity set defined by MUC [7], the Named Entity set developed by IREX [14], and the Extended Named Entity hierarchy, which contains approximately 150 NE types [15]. This ontology is divided into three top-level classes: name, time, and numerical expressions. Starting from these three classes at the top of the Extended Named Entity Hierarchy, the remaining classes are derived. By contrast, the QALL-ME ontology [12] was created after in-depth research into previous ontologies (described in Section 2) and, as a consequence, it borrows some concepts and structures from them. Regarding its coverage, it is similar to the Harmonise and eTourism ontologies, because they all focus on static rather than dynamic tourism information. Figure 1 shows the part of the QALL-ME ontology concerning cinema and movies. The ontology provides a conceptualized description of the touristic domain. Moreover, it covers the most important aspects of the tourism industry, including tourism destinations, cities, and events. It consists of 122 classes, 55 datatype properties, and 52 object properties indicating the relationships among the 122 classes, divided into 15 top-level classes. The structure of the QALL-ME ontology is similar to that of the eTourism ontology; in fact, both are written in the Web Ontology Language (OWL), they can involve more complex classes and relationships, and they support complex inferences.

Fig. 1. Part of the QALL-ME ontology (cinema/movies)

4.1 Annotation
Tagging questions with their EAT requires an exhaustive definition of a hierarchy of possible answer types against which question EATs can be matched. Several general answer type taxonomies exist for open-domain question answering, but they cannot be employed in specialized domains due to their high abstraction level. The EAT of a question can be defined as the class of object sought by the question. In the QALL-ME project, we have to perform EAT tagging over a restricted domain modeled by an ontology. In fact, the QALL-ME system is considered a restricted domain question answering system, since it exhibits the main characteristics of such systems [11]: the size of the corpus is limited, the redundancy level is low, and the domain of application is described and modeled with precision. Using the EAT classification, which is an essential process in QA, the annotator assigns a predefined class or category to the answer, and the subsequent extraction process depends on this previous classification. In order to create a coherent annotation, it is also fundamental to fix guidelines for properly annotating the corpus. One of the most important rules in our annotation is that we use ontology concepts (classes) as EATs by default. Concepts in ontologies are organized hierarchically and, as a consequence, the most specific ones are subsumed by the general ones. For example, the concepts fax, telephone, and email are part of the more general class Contact. This structure raises the problem of deciding which level is best for EAT tagging, especially since there is a wide range of ambiguous questions. In general, we want the annotation to be as informative as possible, and therefore we always assign the most specific concept of the ontology when possible. However, we must be careful, because using very specific concepts may introduce errors when annotating more general questions. Moreover, there are cases in which a speaker asks for more than one thing, formulating a complex question. In this case we allow multiple EATs in the same question, adding one tag for each EAT. Finally, when a query requires information not explicitly defined as a class but as a datatype property, the question EAT is expressed as the datatype concept from which this datatype property takes its values.

Footnote:
9. OWL is a family of knowledge representation languages for authoring ontologies, endorsed by the World Wide Web Consortium: http://www.w3.org/TR/owl-features/
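The default rule (most specific class) and the fall-back for ambiguous questions can be sketched over a toy fragment of the hierarchy. The class names follow the Contact example above; the selection logic is an illustrative reconstruction, not the project's actual annotation tool:

```python
# Toy hierarchy fragment, keyed child -> parent (None marks the root).
PARENT = {
    "Contact": None,
    "Telephone": "Contact",
    "Fax": "Contact",
    "Email": "Contact",
}

def ancestors(cls):
    """Chain from cls up to the root, cls included, most specific first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain

def choose_eat(candidates):
    """Pick the most specific candidate class when the candidates lie on one
    branch; for an ambiguous question, back off to their lowest common
    ancestor (e.g. Contact for telephone/fax/email)."""
    for c in candidates:
        if all(other in ancestors(c) for other in candidates):
            return c  # c lies below (or equals) every other candidate
    common = set(ancestors(candidates[0]))
    for c in candidates[1:]:
        common &= set(ancestors(c))
    for a in ancestors(candidates[0]):  # scan from most specific upward
        if a in common:
            return a
    return None
```

With this sketch, a question compatible with both Contact and Telephone is tagged Telephone (the most specific), while a question compatible with Telephone, Fax, and Email backs off to their common ancestor Contact, mirroring the "How can I get in touch" case discussed later.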
5 Problems of Annotation
The aim of this section is to present examples of the annotation difficulties we encountered during corpus annotation, and to present our solutions. As we mentioned in Section 3, the corpus under study represents a sample of real language and, as a consequence, ambiguous questions are frequent. Moreover, an annotated corpus should fulfil a fundamental requirement: it has to be coherent. As a consequence, general criteria need to be fixed before annotating. These decisions are essential, since they are the pillars of the global annotation. The main problem of our annotation is the coexistence of the two ontologies. Sekine's ontology is open domain, and this feature represents the first obstacle for annotators; labeling a restricted domain corpus with a general ontology can be a very complex process, because Sekine describes the world in general and, as a consequence, we cannot find specific classes for the touristic domain, such as the cuisine of a restaurant. We can only find touristic-domain classes that are very general and not specific to the domain we are analyzing. If we look at Sekine's ontology, we can find the class MONEY, but nothing related to the price of a guest room in a hotel. After performing the annotation with Sekine's taxonomy, we had to start annotating with the QALL-ME ontology, which is extremely specific, and this characteristic represents another problem: the tendency of the annotator is to be as specific as possible, which carries the risk of being overly specific, a negative attitude given the needs of the QALL-ME project. In other words, the final system should provide the user with as much information as possible, so that he does not have to call several times to obtain the information he needs. In the following paragraphs, we provide an exhaustive description and explanation of the criteria we adopted in order to propose a viable solution for the needs of the QALL-ME project.

– Tell me the address of the Meliá hotel in Alicante. Here we use the class ADDRESS for Sekine, and the class PostalAddress for the QALL-ME ontology. This is not a problematic query, but compared with the following one it could be ambiguous.
– What is the street of the Heperia hotel in Alicante? This question is different, because it is more specific than the previous example. If the user asks for the address, we use the ontology class PostalAddress, but here we cannot use the same class: the attribute .street has to be added in order to supply the information required by the user.
– How can I get in touch with the Amerigo hotel? This is a very general query, because the user could be asking for the telephone number, the fax number, or the email address of the hotel. As a consequence, we have no problems with Sekine, because general is better for this ontology, and we assign the label ADDRESS. The problem to solve is finding the QALL-ME class that includes address, telephone, fax, mail, and website; this class is Contact.
– When does the pharmacy at calle Alfonso el Sabio open? Here the speaker may want to know the day, the opening hours, or both; for this reason we select DateTimePeriod in the QALL-ME ontology and TIMEX in Sekine's, the two classes best suited to providing the user with all the information he needs.
– Tell me the timetable of La Tagliatella restaurant. This question is also ambiguous: the timetable could be the time, the day, or both. The solution is the same as for the previous example: we select TIMEX for Sekine's ontology, and DateTimePeriod for the QALL-ME ontology.
– Tell me the name of a cinema, restaurant, etc. We cannot find a specific class for these queries in Sekine, so we assign the general class GOE-OTHER; with the QALL-ME ontology we have no problems.
– Tell me the ticket price of the Panoramis cinema. This question can be interpreted in different ways: the speaker may be asking for the value, or for the type of the price (to check whether a discount is available), or both. In this case we select the class MONEY in Sekine and, in the QALL-ME ontology, TicketPrice, which includes the attributes .priceType and .priceValue.
10. GOE-OTHER is a class that indicates public facilities; School, Institution, Market, Museum, etc. are also included in this group. When the annotator cannot use one of these classes, GOE-OTHER should be selected.
– Does the price of the Bahía hotel include breakfast? We could see breakfast as a facility or as a kind of price; we decided to treat it as a kind of price, which can also be an amount of money. For this reason we select MONEY in Sekine, and GuestRoomPrice for the annotation with the ontology.
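The decisions above amount to a dual-label lookup pairing each sample question with its Sekine tag and its QALL-ME class. A sketch follows; the labels are the ones given in the text, while the table format itself is illustrative:

```python
# Dual annotation at a glance: question -> (Sekine label, QALL-ME label).
DUAL_LABELS = {
    "Tell me the address of the Melia hotel in Alicante.":
        ("ADDRESS", "PostalAddress"),
    "How can I get in touch with the Amerigo hotel?":
        ("ADDRESS", "Contact"),
    "Tell me the timetable of La Tagliatella restaurant":
        ("TIMEX", "DateTimePeriod"),
    "Tell me the ticket price of the Panoramis cinema":
        ("MONEY", "TicketPrice"),
}

sekine, qallme = DUAL_LABELS["How can I get in touch with the Amerigo hotel?"]
```

This view makes the asymmetry of the two ontologies concrete: a single coarse Sekine tag (e.g. ADDRESS) can correspond to several distinct QALL-ME classes.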
6 Evaluation
In order to test the consistency of the labeled corpus, we performed two different evaluations. First, we employed the corpus to train and test a question classification system. Second, we calculated the kappa agreement between the assessors that labeled the corpus.

6.1 Black-Box Evaluation
In this first experiment, we performed a black-box evaluation: we employed our corpus to train and test an SVM-based question classification system. Support Vector Machines (SVM) [18] have been shown to achieve state-of-the-art performance in the task of question classification [10]. In these experiments, we employed a linear kernel and a bag-of-words representation of the feature space. In order to evaluate the two sets of labels in English and Spanish, we carried out four different tests, using 10-fold cross-validation. Table 1 shows the results obtained.

Table 1. Question classification performance for English and Spanish

Language | Sekine | QALL-ME
English  | 95.18% | 94.36%
Spanish  | 95.51% | 95.04%
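The bag-of-words representation fed to the linear-kernel SVM can be sketched in a few lines of pure Python. An off-the-shelf SVM implementation (for instance scikit-learn's LinearSVC; the paper does not name its toolkit, so this is an assumption) would then be trained on such vectors under 10-fold cross-validation:

```python
# Bag-of-words feature extraction: one count per vocabulary word.
def build_vocab(questions):
    """Map each word seen in the training questions to a column index."""
    vocab = sorted({w for q in questions for w in q.lower().split()})
    return {w: i for i, w in enumerate(vocab)}

def vectorize(question, vocab):
    """Turn a question into a dense count vector over the vocabulary;
    out-of-vocabulary words are simply dropped."""
    vec = [0] * len(vocab)
    for w in question.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1
    return vec

train = ["what is the price of a double room",
         "at what time does the movie start"]
vocab = build_vocab(train)
x = vectorize("what is the start time", vocab)
```

With a linear kernel, the SVM's decision function is just a weighted sum over these word counts, which is why this simple representation already performs well for question classification.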
These results are considerably high for all the experiments. We can compare our results with those obtained with one of the most widely used corpora in the task of question classification, described in [9]. This English corpus consists of almost 5,500 questions for training and 500 for testing, labeled with a two-level hierarchy of 6 coarse-grained and 50 fine-grained classes. In [20], the authors employed this corpus to train the same classifier we used in our experiments, obtaining 85.8% precision for coarse-grained classes and 80.2% for fine-grained ones. Compared with these results, our corpus proves to be a coherent and robust resource for the task of question classification.

6.2 Inter-annotator Agreement
The corpus developed in this work was labeled by two annotators. The kappa agreement obtained by these annotators on Sekine's ontology was 0.87, while the agreement on the QALL-ME ontology was 0.89. These values were computed according to [5], taking the distribution of proportions over the categories to be equal for both coders. In both cases, we obtained substantial agreement, which is higher for the QALL-ME ontology. This reflects the fact that the corpus was gathered with the QALL-ME restricted domain ontology in mind, and thus its labels can be naturally assigned to the questions in the corpus.
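Agreement figures of this kind can be reproduced with a short kappa routine in the style of [5], where the expected agreement uses a single category distribution pooled over both coders. The label data below is illustrative, not taken from the corpus:

```python
from collections import Counter

def kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders, with the expected
    agreement computed from the category proportions pooled over both
    coders (as in the Fleiss-style computation cited in the text)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    pooled = Counter(labels_a) + Counter(labels_b)  # both coders together
    expected = sum((c / (2 * n)) ** 2 for c in pooled.values())
    return (observed - expected) / (1 - expected)

a = ["Contact", "Contact", "TicketPrice", "DateTimePeriod"]
b = ["Contact", "PostalAddress", "TicketPrice", "DateTimePeriod"]
```

Perfect agreement yields 1.0, while agreement no better than chance yields 0; values around 0.87-0.89, as reported above, indicate substantial agreement.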
7 Conclusion and Future Work
In this paper, we have presented a corpus created in the framework of the QALL-ME project with the aim of training QA systems. The corpus consists of 4500 Spanish and English touristic domain questions, annotated according to two different ontologies: an open domain and a closed domain one. Another contribution of our research is the set of solutions we present for harmonizing the differences between the two ontologies in order to obtain a valid annotation. This corpus thus allows training a question answering system for both open and restricted domain purposes. In order to evaluate the coherence of this resource, we performed a double test: on the one hand, we evaluated the inter-annotator agreement by calculating the kappa measure; on the other hand, we evaluated the annotation using a question classification system. We obtained considerably positive results in both tests, demonstrating the coherence of the annotation process. Finally, as future work, our intention is to extend our work to other languages in order to train cross-lingual QA systems.
References

1. Agichtein, E., Lawrence, S., Gravano, L.: Learning search engine specific query transformations for question answering. In: Proceedings of the 10th World Wide Web Conference (WWW 10) (2001)
2. Austin, J.: How to Do Things with Words, 2nd edn. Harvard University Press (2005)
3. Berger, A., Caruana, R., Cohn, D., Freitag, D., Mittal, V.: Bridging the lexical chasm: statistical approaches to answer-finding. Research and Development in Information Retrieval, 192–199 (2000)
4. Burke, R., Hammond, K., Kulyukin, V., Lytinen, S., Tomuro, N., Schoenberg, S.: Question answering from frequently-asked question files: Experiences with the FAQ Finder system. AI Magazine 18(2), 57–66 (1997)
5. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5), 378–382 (1971)
6. Giampiccolo, D., Forner, P., Herrera, J., Peñas, A., Ayache, C., Forascu, C., Jijkoun, V., Osenova, P., Rocha, P., Sacaleanu, B., Sutcliffe, R.F.E.: Overview of the CLEF 2007 multilingual question answering track. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 200–236. Springer, Heidelberg (2008)
7. Grishman, R., Sundheim, B.: Message Understanding Conference-6: A brief history. In: COLING, pp. 466–471 (1996)
8. Klettke, M., Bietz, M., Bruder, I., Heuer, A., Priebe, D., Neumann, G., Becker, M., Bedersdorfer, J., Uszkoreit, H., Maedche, A., Staab, S., Studer, R.: GETESS - Ontologien, objektrelationale Datenbanken und Textanalyse als Bausteine einer semantischen Suchmaschine. Datenbank-Spektrum 1, 14–24 (2001)
9. Li, X., Roth, D.: Learning question classifiers. In: Proceedings of the 19th International Conference on Computational Linguistics, Morristown, NJ, USA, pp. 1–7. Association for Computational Linguistics (2002)
10. Metzler, D., Croft, W.B.: Analysis of statistical question classification for fact-based questions. Information Retrieval 8(3), 481–504 (2005)
11. Mollá, D., Vicedo, J.L.: Question answering in restricted domains: An overview. Computational Linguistics 33(1), 41–61 (2008)
12. Ou, S., Pekar, V., Orasan, C., Spurk, C., Negri, M.: Development and alignment of a domain-specific ontology for question answering. In: European Language Resources Association (ELRA) (ed.) Proceedings of the Sixth International Language Resources and Evaluation (LREC 2008), Marrakech, Morocco (May 2008)
13. Ravichandran, D., Ittycheriah, A., Roukos, S.: Automatic derivation of surface text patterns for a maximum entropy based question answering system. In: Proceedings of the HLT-NAACL Conference (2003)
14. Sekine, S., Isahara, H.: IREX: IR and IE evaluation project in Japanese. In: European Language Resources Association (ELRA) (ed.) Proceedings of the Second International Language Resources and Evaluation (LREC 2000), Athens, Greece (May-June 2000)
15. Sekine, S., Sudo, K., Nobata, C.: Extended named entity hierarchy. In: European Language Resources Association (ELRA) (ed.) Proceedings of the Third International Language Resources and Evaluation (LREC 2002), Las Palmas, Spain (March 2002)
16. Soricut, R., Brill, E.: Automatic question answering: Beyond the factoid. In: Proceedings of the HLT-NAACL Conference (2004)
17. Staab, S., Braun, C., Bruder, I., Düsterhöft, A., Heuer, A., Klettke, M., Neumann, G., Prager, B., Pretzel, J., Schnurr, H.-P., Studer, R., Uszkoreit, H., Wrenger, B.: GETESS - Searching the web exploiting German texts. In: Klusch, M., Shehory, O., Weiss, G. (eds.) CIA 1999. LNCS, vol. 1652, pp. 113–124. Springer, Heidelberg (1999)
18. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
19. Voorhees, E.M.: Overview of TREC 2007. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152. Springer, Heidelberg (2008)
20. Zhang, D., Lee, W.S.: Question classification using support vector machines. In: SIGIR 2003: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 26–32. ACM, New York (2003)
Staab, S., Braun, C., Bruder, I., D¨ usterh¨ oft, A., Heuer, A., Klettke, M., Neumann, G., Prager, B., Pretzel, J., Schnurr, H.-P., Studer, R., Uszkoreit, H., Wrenger, B.: Getess - searching the web exploiting german texts. In: Klusch, M., Shehory, O., Weiss, G. (eds.) CIA 1999. LNCS, vol. 1652, pp. 113–124. Springer, Heidelberg (1999) 18. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995) 19. Voorhees, E.M.: Overview of trec 2007. In: Peters, C., Jijkoun, V., Mandl, T., M¨ uller, H., Oard, D.W., Pe˜ nas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152. Springer, Heidelberg (2008) 20. Zhang, D., Lee, W.S.: Question classification using support vector machines. In: SIGIR 2003: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pp. 26–32. ACM, New York (2003)