Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant

Enrichment of Language Resources by Exploiting New Text and the Resources Themselves
A case study on the acquisition of a NE lexicon

Antonio Toral Ruiz

PhD Dissertation

This work was carried out under the supervision of Rafael Muñoz Guillena, Universitat d'Alacant, and Monica Monachini, Istituto di Linguistica Computazionale.

Alacant, December 2009
Preface
The writing of this manuscript culminates five years of work (Fall 2004 to Fall 2009) at the two research institutions that have hosted me during this period: the Universitat d'Alacant and the Istituto di Linguistica Computazionale.

This PhD thesis starts from the conclusions of the RECENT research project, carried out at the Universitat d'Alacant. Within this project I developed a knowledge-based Named Entity Recognition tool called DRAMNERI. In its design, particular attention was paid to modularity, with the aim of making it easy to adapt to different domains and/or languages. The tool proved successful for specific tasks; we exploited it to build a larger Named Entity Recognition (NER) system for Portuguese and to build a commercial Information Extraction system in the field of conveyancing. However, a problem was encountered: the knowledge-acquisition bottleneck. While the effort needed to write the required rules and dictionaries for very specific domains (e.g. conveyancing) was affordable, tackling larger domains required a considerable amount of human labour.

This led to the research carried out in this PhD thesis. We began by exploring ways in which we could automatically acquire lists of Named Entities (NEs), i.e. plain dictionaries. Initially we looked at the high potential of Web 2.0 resources because of some of their intrinsic characteristics; the first experiment consisted of applying Natural Language Processing (NLP) techniques to Wikipedia definitions, considering Wikipedia as a Machine Readable Dictionary, in order to derive lists of person, organisation and location names. With the experience gained from this initial phase, which was presented as the "Diploma of Advanced Studies", we moved on to a larger picture: we studied the knowledge bottleneck problem from a more abstract point of view and tried to identify efficient and sensible strategies to overcome it.
The expertise of the Istituto di Linguistica Computazionale in the areas of Language Resources (LRs), interoperability and standards, and my participation in several international projects related to these matters (KYOTO, http://www.kyoto-project.eu/, and FLaReNet, http://www.flarenet.eu/) helped me consider other elements of the problem. Apart from the potential of Web 2.0 resources, we identified two additional key elements: (i) the use of existing LRs to support extraction procedures and (ii) the important role of standards as a means to represent the acquired
information in order to guarantee interoperability. As a case study related to NEs, we worked on an automatic and language independent method to create a NE Lexicon from Wikipedia and noun taxonomies of LRs. Apart from the extraction of NEs, the thesis considers the design of the lexicon according to the emerging standards of the sector, the representation of the acquired NEs in the lexicon and the linking of the NEs to external resources such as ontologies. This method is evaluated by applying it to different languages (English, Italian and Spanish) and two families of LRs (WordNet and PAROLE-SIMPLE). Finally, in order to check the usefulness of the repository, we include it in a Question Answering process and evaluate its impact on the results by using its knowledge to validate the answers produced by the system.
Acknowledgements
I should begin by thanking Fernando, the supervisor of the final project I chose for my MSc degree. He was my first contact with this research field, and his expertise and vocation had a great impact on me. I also thank the classmates with whom I carried out that project; their work was essential to its success and to obtaining a publication at an international conference: the two Hèctors, Mariano and Ismael.

To Rafael, my supervisor and the person who gave me the opportunity to take part in the PhD program at the Departament de Llenguatges i Sistemes Informàtics (Universitat d'Alacant). To all the teachers of this PhD program, from whom I learnt a lot about Computational Linguistics, and especially to German, a visiting teacher with whom I have since had the opportunity to work. To all the students who took part in this program with me, for the interesting collaborative homework and the dinners shared. Special mention goes to the two big brothers, who have become not only top-class researchers with whom I have had the pleasure to work, but also real friends. To all the staff of the department.

To Monica, Claudia and Nicoletta, for giving me the opportunity to work with them at a prestigious and historic institution within this research area: the Istituto di Linguistica Computazionale. To all the people with whom I have shared a laboratory at the Institute during these years: Valeria, Riccardo, Eva, Francesca, Luca and Roberto. To all the staff of the Institute.

To the projects I have worked on during these years, as it is thanks to their funding that I have been able to carry out my PhD thesis: RECENT, CESS-ECE, KYOTO and FLaReNet.

To Heather; thanks to her, the English of the final version of this document is much less maccheronico than it was before she proof-read it. Thanks!

Finally, I would like to say a very big thank you to my family. They have always been there, helping me whatever the problem, supporting my decisions, and giving me hope and love in the difficult times.
Contents

1 Introduction
  1.1 Research Motivation
  1.2 Thesis Roadmap
2 Related Work
  2.1 Named Entity Recognition
    2.1.1 Knowledge-based
    2.1.2 Machine learning approach
    2.1.3 Hybrid systems
  2.2 Acquisition and enrichment of Language Resources
    2.2.1 General
    2.2.2 Onomastica
  2.3 Text Similarity
    2.3.1 Similarity between words
    2.3.2 Similarity between texts
  2.4 Standards and Language Resources
3 Language Resources and New Text
  3.1 WordNet
  3.2 EuroWordNet
    3.2.1 Italian WordNet
    3.2.2 Spanish WordNet
  3.3 Parole-Simple-Clips
    3.3.1 Ontology
    3.3.2 Lexicon
  3.4 Mapping between ItalWordNet and Parole-Simple-Clips
  3.5 The Lexical Markup Framework
    3.5.1 Core package
    3.5.2 Semantics package
    3.5.3 Multilingual package
  3.6 SUMO
  3.7 New Text and Wikipedia
    3.7.1 Structure
    3.7.2 Comparison to other encyclopaedias
    3.7.3 Wikipedia and Computational Linguistics
4 Acquisition of NE plain dictionaries
  4.1 Approach
  4.2 Experiments and results
5 Key elements to overcome the knowledge acquisition bottleneck
  5.1 Language Resources Content to Support Their Enrichment
  5.2 Exploiting New Text for Broad Coverage and High Quality Extraction
  5.3 Standards for Building Generic Information Repositories
6 Acquisition of a multilingual NE lexicon
  6.1 Mapping
  6.2 Disambiguation
    6.2.1 Instance Intersection
    6.2.2 Text similarity
  6.3 Extraction
  6.4 NE identification
    6.4.1 Web Search
    6.4.2 Wikipedia search
    6.4.3 Combining Wikipedia and the Web
  6.5 Postprocessing
  6.6 The Named Entity Lexicon
7 Evaluation
  7.1 Data
  7.2 Mapping
    7.2.1 Mapping Analysis
  7.3 Disambiguation
    7.3.1 Data
    7.3.2 Instance Intersection
    7.3.3 Semantic similarity
  7.4 Extraction
  7.5 NE identification
    7.5.1 Web
    7.5.2 Wikipedia
    7.5.3 Wikipedia + Web
  7.6 Postprocessing
8 Application to Question Answering
  8.1 Motivation
  8.2 The BRILIW Question Answering System
    8.2.1 Language Identification
    8.2.2 Named Entity Translation
    8.2.3 Question Analysis
    8.2.4 Inter Lingual Reference
    8.2.5 Selection of Relevant Passages
    8.2.6 Answer Extraction
  8.3 Methodology
  8.4 Results
9 Conclusions and Future Work
  9.1 Main Contributions
    9.1.1 Acquisition of NE plain dictionaries
    9.1.2 Identification of key elements in Knowledge Acquisition
    9.1.3 Acquisition of multilingual NE lexicons
  9.2 Future work
References
A Software developed
B Publications
C LMF output
D Questions of the English-Spanish QA@CLEF task (2006)
E Resumen (summary in Spanish)
  E.1 Introducción
    E.1.1 Motivación
  E.2 Trabajo de investigación y resultados obtenidos
    E.2.1 Elementos clave para tratar la adquisición de conocimiento
    E.2.2 Método
    E.2.3 Evaluación
    E.2.4 Aplicación a Búsqueda de Respuestas
  E.3 Conclusiones
F Abbreviations
List of Tables

4.1 Distribution by entity classes
4.2 Experiment 1. Results applying is instance heuristic
4.3 Experiment 2. Results applying is instance and is in wordnet heuristics
4.4 Error analysis (experiment 1, k=2)
4.5 Error analysis (experiment 2, k=0)
5.1 Comparison of corpora, MRDs and Wikipedia
7.1 Number of categories and articles in Wikipedia per language
7.2 Mapping for English, Spanish and Italian
7.3 WordNet class nouns to Wikipedia categories mapping percentages
7.4 Text Similarity System Results
7.5 Text Similarity Combination Results
7.6 Extracted NEs
7.7 Number of English NEs per lexicographic file
7.8 Evaluation results of the web-search method
7.9 Evaluation results using Wikipedia
7.10 Evaluation results of the combination method
7.11 Number of NE sets linked to the SIMPLE ontology
8.1 Syntactic patterns to determine the "expected answer type"
8.2 Question Analysis Module
8.3 Example of Validation module process
C.1 NE Repository. LexicalEntry table
C.2 NE Repository. FormRepresentation table
C.3 NE Repository. Sense table
C.4 NE Repository. SenseRelation table
C.5 NE Repository. SenseAxis table
C.6 NE Repository. SenseAxisElements table
C.7 NE Repository. SenseAxisExternalRef table
E.1 ENs extraídas (extracted NEs)
List of Figures

3.1 Top nodes of the SIMPLE ontology
3.2 LMF core package
3.3 LMF semantics package
3.4 LMF multilingual package
3.5 Database Entity Relation diagram of the Wikipedia database
6.1 Method diagram
6.2 Database Entity Relationship diagram of the NE lexicon
8.1 BRILIW diagram
8.2 Named Entity Translation Process
8.3 Inter Lingual Reference module
A.1 Snapshot of the Named Entity Repository web interface
Chapter 1 Introduction
Language is an essential part of our everyday life. Through it, different individuals can establish natural communication in an oral or written manner. Moreover, language is the mechanism used for the representation of our thoughts, and it therefore determines to some extent our behaviour, the way we are and the way we act.

Computational Linguistics is a subfield of Artificial Intelligence that deals with the understanding and generation of natural languages. On one hand, it deals with the computational exploitation of the huge amounts of data that are represented as natural language. On the other, it pursues making communication between human beings and computers less rigid and more comfortable.

Knowledge of the structure of natural language is a requirement for carrying out NLP; the applications developed in this field need to understand language at its different levels (depending on the type of application, knowledge of different levels is required): phonological, lexicographic, morphological, syntactic, semantic and pragmatic. Therefore, the research and development of automatic analysers for the aforementioned linguistic levels is fundamental to the evolution of this research field. The results obtained by these analysers determine to a great extent the success or failure of the more advanced NLP applications that are built on top of them.

The importance of NLP nowadays is mainly due to the great growth experienced by computer networks in general, and by the Internet in particular, which has led to a spectacular increase in the amount of digital information available. This huge amount of data requires automatic mechanisms in order to be processed. Hence, systems are being developed to search for information in which the user might be interested, to translate text so that it can be understood, and so on.

According to Grishman (Grishman, 1986), three kinds of applications have had a key role in the evolution of NLP. These are Information Retrieval, Machine Translation and Human-Machine Interfaces. Other authors add a fourth application: Information Extraction. A brief description of each of them follows:

• Information Retrieval (IR) (Baeza-Yates & Ribeiro-Neto, 1999). These systems return, given a query and a collection of documents, a list of documents from that collection, ordered according to their relevance to the input query. A further improvement of
this model is Passage Retrieval (Kaszkiel & Zobel, 1997), which regards documents as sets of passages and returns a list of the latter, thus solving several problems of the classic IR paradigm. Another extension of IR, even if today it can be considered a different application in its own right, is Question Answering (QA) (Burger et al., 2001). In this case, the system returns the answer itself instead of the relevant passages or documents.

• Information Extraction (IE). Cowie and Wilks define this application as "the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts" (Cowie & Lehnert, 1996). The aim of IE systems is to find and relate relevant pieces of information while discarding those parts of the input texts that are not relevant. Finally, the elements extracted are used to fill a predefined template.

• Machine Translation (Hutchins & Somers, 1992). The European Association for Machine Translation (http://www.eamt.org/) defines this task as "the application of computers to the task of translating texts from one natural language to another". Nowadays there exist systems that are able to produce output that, even if not perfect, is of sufficient quality to be very useful in several scenarios.

• Natural Language Interfaces (Perrault & Grosz, 1986). These allow the user to interact with a computer using natural language, thus facilitating human-machine communication and hiding the underlying system from the user. An example of a specific application is natural language interfaces for querying databases (Androutsopoulos et al., 1995).

World knowledge is a requirement for dealing with the semantic level of natural languages. Conceptualisations of reality have occupied human beings since the Ancient Greeks, when the term Ontology (from the Greek ὄν, genitive ὄντος: "being", participle of εἶναι, "to be", and -λογία: "science, study, theory") was introduced by Aristotle (Aristotle, 1908). A long time later, at the end of the 20th century, the first attempts to give common sense to computers by building Knowledge Bases (KBs) were initiated in the field of Artificial Intelligence. Examples of this are the Cyc project (Lenat, 1998), MindNet (Richardson et al., 1998) and, more related
to natural language, WordNet (Miller, 1995). In fact, world knowledge is necessary for attaining truly intelligent computer systems. In the case of language, this knowledge is contained in LRs, which play a central role in the field of Computational Linguistics, as they are practically indispensable for carrying out any automatic understanding of language.

LRs are extensively applied in NLP tasks and applications. We present some examples of the exploitation of LRs in the applications of IR, IE and QA:

• In IR, LRs can be used to expand the query by adding to the user's query words other related terms, so that the query will match snippets that are formulated in alternative phrasings (Voorhees, 1998). The inclusion of a lexicon in IR significantly increases recall but, as a drawback, decreases precision due to the ambiguity of the query words.

• IE is often guided by a predefined template that has to be filled. The semantics of the template can thus be linked to a LR, and then the properties to be extracted can be based on the knowledge of the LR. The Large Scale Information Extraction system (LaSIE) (Gaizauskas & Humphreys, 1997) is an IE system that carries out deep text analysis. The system uses general world knowledge represented as a semantic network, whose coverage is extended using WordNet.

• Concerning QA, LRs can be used in different phases of the process. (Pasca & Harabagiu, 2001) use WordNet to extract concepts from the question, to identify passages relevant to the query and, finally, to extract the answer from the selected passages. A qualitative analysis confirms the usefulness of lexical knowledge: the precision of the system is increased by 147%. Multilingual LRs have also proved very useful for Cross-Lingual QA; (Ferrández et al., 2007d) show how the knowledge of Wikipedia and EuroWordNet improves the results of such a system.
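The query-expansion idea in the first bullet can be sketched in a few lines. The synonym table below is invented for illustration; a real system would draw related terms from a LR such as WordNet:

```python
# Toy illustration of lexicon-based query expansion for IR.
# The synonym table is invented; a real system would obtain
# related terms from a lexical resource such as WordNet.

SYNONYMS = {
    "car": ["automobile", "motorcar"],
    "buy": ["purchase"],
}

def expand_query(query):
    """Return the query terms plus any related terms from the lexicon."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("buy car"))
# A document mentioning "purchase" or "automobile" now matches the query,
# which raises recall; ambiguous expansions may, however, lower precision.
```

This makes the recall/precision trade-off mentioned above concrete: every expansion term widens the match set, including for unintended senses of the query word.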
Given the importance of LRs, the research community has dedicated a lot of effort to their manual construction over the last two decades. In spite of the amount of work devoted to LRs, which has led to the availability of robust and high-coverage LRs, some types of linguistic information are not exhaustively covered in these resources. Two paradigmatic examples
are those of NEs (by which we refer in this thesis to entities belonging to several semantic types, e.g. person, location or organisation, which take the form of proper nouns) and domain-specific terms. It is clear that the manual population and maintenance of these two kinds of terms in LRs would be unfeasible, as the amount of terms involved is huge and their nature, especially that of NEs, is much more volatile than that of the terms that make up the core of traditional LRs (common nouns, adjectives, verbs and adverbs). This is related to the following assertion: "building a proper noun ontology is more difficult than building a common noun ontology as the set of proper nouns grows more rapidly" (Mann, 2002). The problem is then that a proper noun resource should be constantly updated. In keeping with this, (Philpot et al., 2005) states that "the need for machine-assisted ontology construction is stronger than ever" because "humans cannot manually structure the available knowledge at the same pace as it becomes available". Hence, in order to fill this gap, automatic procedures are needed. The so-called knowledge acquisition bottleneck is a recognised issue within the NLP community, and a lot of research effort is nowadays devoted to solving it.

In order to clarify the NE maintenance issue, let us take a look at the state of NEs in WordNet, the most widely used English LR nowadays. From version 2.1, this LR explicitly distinguishes between common nouns (called classes) and proper nouns (called instances) (Miller & Hristea, 2006). While WordNet's coverage of open-domain common nouns is quite high, it contains very few proper nouns (only 7,669 synsets are tagged as instances in WordNet 2.1). It is worth pointing out that this is the situation for English, which is by far the most privileged language in terms of the amount and coverage of available LRs; the lack of NEs in LRs for other languages is even more acute.

NEs are important in NLP tasks because they provide salient clues within texts and can be used to sum up what a text is about. Therefore they are useful for different applications, including summarisation, the building of inverted indexes and so on. NEs are especially popular in IE (indeed, NER was born as an intermediate phase of IE), as the terms to be extracted to fill templates are usually NEs (an example of such a template involving three NEs is a company acquired by another company at a given time). Furthermore, they have a special role in translation (Ferrández et al., 2007c).

Regarding NEs, most of the research done up to now relates directly to their recognition and classification; this is a subtask of IE known as NER.
NER is performed according to small predefined sets of categories, such as the four-category set (person, organisation, location, miscellaneous) of the Conference on Natural Language Learning (CoNLL) (Tjong Kim Sang, 2002). With regard to NE resources, even if mature repositories of geographical NEs (also called gazetteers) do exist (e.g. Geonames, www.geonames.org), there is a lack of more general NE dictionaries, also called onomastica (http://www.worldwidewords.org/weirdwords/ww-ono2.htm). However, the availability of general LRs with fine-grained types of NEs could be very useful for NLP tasks; (Mann, 2002) shows how the use of a proper noun ontology, even one with low coverage, improves the precision of a Question Answering system. Moreover, this kind of resource could play a crucial role in NER systems that consider an extended hierarchy of entity types, like that proposed in (Sekine et al., 2002).

Let us clarify the role that a NE-rich LR could play in NLP by presenting a QA example. Consider question 161 from the QA track at the 2006 edition of the Cross Language Evaluation Forum (CLEF, www.clef-campaign.org): "Who is Fernando Henrique Cardoso?". This question would be easily answered if this person NE were present in a LR with semantic links to other entries, such as being an instance of "Brazilian", "politician", "president" or "minister".
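The kind of lookup involved can be sketched as follows. The lexicon entry and its instance-of links below are invented for illustration, mirroring the example above:

```python
# Toy sketch of answering a definition question from a NE-rich lexicon.
# The entry and its instance-of links are invented for illustration.

NE_LEXICON = {
    "Fernando Henrique Cardoso": {
        "type": "person",
        "instance_of": ["Brazilian", "politician", "president"],
    },
}

def answer_who_is(name):
    """Answer 'Who is X?' from the instance-of links of a lexicon entry."""
    entry = NE_LEXICON.get(name)
    if entry is None:
        return None  # the NE is not covered by the lexicon
    return f"{name} is a {', '.join(entry['instance_of'])}."

print(answer_who_is("Fernando Henrique Cardoso"))
```

The point of the sketch is that, once the NE and its semantic links are in the lexicon, answering the definition question reduces to a single lookup rather than to open-text search.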
1.1 Research Motivation
This thesis tackles the knowledge acquisition bottleneck problem and, specifically, given the importance of NEs for the automatic processing of language and the problems their acquisition poses, considers the acquisition of this kind of linguistic phenomenon.

This PhD thesis starts from the conclusions of the work in the RECENT research project (http://dlsi.ua.es/eines/projecte.cgi?id=val&projecte=35). Within this project I developed a knowledge-based Named Entity Recognition tool called DRAMNERI (Toral, 2005). In its design, particular attention was paid to modularity, with the aim of making it easy to adapt to different domains and/or languages. The tool proved successful for specific tasks; we exploited it to build a larger NER system for Portuguese (Ferrández et al., 2007b), to improve the results of a QA system (Toral et al., 2005a) (Toral et al., 2005b) and to build "RECNOT@RIAL", a commercial Information Extraction system in the field of conveyancing (this software is registered in the Registro General de la Propiedad Intelectual de la Comunidad Valenciana). However, a problem was encountered: the knowledge-acquisition bottleneck. While the effort needed to write the required rules and dictionaries for very specific domains (e.g. conveyancing) was affordable, tackling larger domains required a considerable amount of human labour. This led to exploring ways in which we could automatically acquire lists of NEs (plain dictionaries) (Toral & Muñoz, 2006).

The thesis generalises the acquisition bottleneck of NEs to the more general acquisition bottleneck problem in the field of Computational Linguistics, in order to explore the problem and find feasible ways to overcome it. Once this problem has been analysed from a more abstract perspective and guidelines have been identified, it applies these guidelines to the concrete problem of NE acquisition, which results in the design and evaluation of a methodology devoted to the automatic construction of a Named Entity Lexicon exploiting Web 2.0 resources and existing LRs (Toral et al., 2008) (Toral & Muñoz, 2007) (Toral et al., 2009). Our motivation for combining these two kinds of resources is that while LRs have rich semantic information, their coverage is limited, whereas Web 2.0 resources hold massive amounts of information, but in a semi-structured manner.

Apart from the acquisition, important emphasis is placed on the representation of the extracted information in a structured manner. With this aim, we study different theories of the lexico-semantic representation of linguistic information in order to find the most adequate way in which to represent the extracted information. The design of this structured resource takes into consideration emerging standards of the sector in order to guarantee interoperability and to facilitate its application to different NLP applications.
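As a rough illustration of why combining the two kinds of resources is attractive, the following sketch links a semi-structured Wikipedia category label to a noun of a lexical taxonomy by naive head-word matching. The category labels and noun list are invented for illustration; the actual, far more elaborate method is described in Chapter 6:

```python
# Toy sketch: link a Wikipedia category label to a lexicon noun by
# matching the label's head word. Labels and the noun list are invented
# for illustration; the actual mapping method is described in Chapter 6.

LEXICON_NOUNS = {"writer", "city", "company"}

def map_category(category_label):
    """Map a plural category label to a lexicon noun via its head word."""
    head = category_label.lower().split()[-1]             # naive head extraction
    singular = head[:-1] if head.endswith("s") else head  # naive singularisation
    return singular if singular in LEXICON_NOUNS else None

print(map_category("Spanish writers"))  # -> writer
print(map_category("Cities in Italy"))  # -> None (last word is not the head)
```

The failure on "Cities in Italy" shows why purely string-based linking is insufficient: the semantics of the LR taxonomy is needed to resolve such labels, which is precisely the motivation for combining the two resource types.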
1.2 Thesis Roadmap
A very brief description of the content of the rest of this document follows.

Chapter 2 presents an overview of the state of the art regarding the four main research topics covered in this thesis, i.e. (i) knowledge-based NER, (ii) the acquisition and enrichment of LRs, with particular interest paid to onomastica, (iii) text similarity and (iv) standards and LRs.

Chapter 3 introduces the resources used within this thesis. This includes the LRs WordNet, the Spanish and Italian lexicons of EuroWordNet (EWN) and PAROLE-SIMPLE-CLIPS (PSC). It also introduces the concept of a New Text resource and subsequently the most well-known such resource: the Wikipedia encyclopaedia, which is used in our methodology. A comparison with other resources and a survey of its role in the field of Computational Linguistics are included.

Chapter 4 presents an initial experiment to automatically extract NE gazetteers from Wikipedia. In this approach, Wikipedia definitions are processed using NLP techniques.

Chapter 5 studies the knowledge bottleneck problem from an abstract point of view and proposes three elements to overcome it: (i) the exploitation of New Text, (ii) the use of LRs to support extraction procedures and (iii) the role of standards in creating interoperable resources.

Chapter 6 introduces an automatic and language-independent method for the creation of a NE Lexicon from Wikipedia and noun taxonomies. The different phases it consists of are thoroughly described.

Chapter 7 evaluates the different parts of the method introduced in the previous chapter.

Chapter 8 discusses the application of the presented methodology to a QA system.

Chapter 9 closes the thesis by drawing conclusions and proposing future work.
Chapter 2 Related Work
The research carried out in this PhD thesis touches on four main topics: NER, the acquisition and enrichment of LRs, text similarity, and standards and interoperability of LRs. This chapter presents an overview of the state of the art in these areas.
2.1 Named Entity Recognition
Given a text considered as a set T whose elements are the words that make up the text, and a set of categories C whose elements are the different NE categories considered plus the category "no-NE", NER can be defined as the task of assigning to each element of T a unique element of C. The application of this definition to a sample text follows.

T = {Condoleezza Rice , secretary of the State , will visit LIMO in Liverpool .}

C = {NO-ENT, LOCATION, ORGANISATION, PERSON}

NE = {Condoleezza_PERSON Rice_PERSON ,_NO-ENT secretary_NO-ENT of_NO-ENT the_NO-ENT State_NO-ENT ,_NO-ENT will_NO-ENT visit_NO-ENT LIMO_ORGANISATION in_NO-ENT Liverpool_LOCATION ._NO-ENT}

NEs are identified and classified from evidence. Two kinds of evidence have been defined for NER (McDonald, 1996): internal and external. Internal evidence is that provided by the sequence of tokens that constitute a NE. External evidence refers to the context in which the entity is located. The NER concept was born within the framework of the Message Understanding Conference (MUC) series during the 1990s (Grishman & Sundheim, 1996) (Chinchor, 1998). These conferences focused on evaluating IE systems. During the development of their IE systems, the participants noticed that recognising information units such as proper nouns and numerical expressions was important for this task; from this it was concluded that NER is one of the main subtasks of IE systems (Sekine, 2004). After the MUC conferences, other events tackling NER have taken place. The most important of these in recent years are mentioned next.
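The task definition above can be illustrated with a toy dictionary-based tagger; the gazetteers below are invented for this example and stand in for whatever evidence a real system uses:

```python
# A minimal sketch of the NER task as defined above: assign each token of T
# a unique category from C. The gazetteers are illustrative toy data only.
GAZETTEERS = {
    "PERSON": {"Condoleezza", "Rice"},
    "ORGANISATION": {"LIMO"},
    "LOCATION": {"Liverpool"},
}

def tag(tokens):
    """Map every token to a category; tokens in no gazetteer get NO-ENT."""
    result = []
    for token in tokens:
        category = "NO-ENT"
        for cat, names in GAZETTEERS.items():
            if token in names:
                category = cat
                break
        result.append(f"{token}_{category}")
    return result

T = "Condoleezza Rice , secretary of the State , will visit LIMO in Liverpool .".split()
print(tag(T))
```

Note that such a tagger uses only internal evidence (the tokens themselves); it cannot, for instance, distinguish an ambiguous name by its context.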
The Information Retrieval and Extraction Exercise (IREX) took place during 1998 and 1999. It dealt with IR and NER in Japanese, and included a NER task similar to those of the MUC and Multilingual Entity Task Conference (MET), in which texts were divided into those belonging to the general domain and those belonging to specific domains. The objective of the Automatic Content Extraction (ACE) program is to develop technology for the automatic extraction of information in support of NLP. It distinguishes several levels of entity mention: name (the entity is mentioned by its name), nominal (the entity is mentioned indirectly by a common noun) and pronoun. The program began in 1999 and is still active. Several languages have been treated, among others Arabic, Chinese, English and Spanish. The 2002 and 2003 editions of CoNLL included shared tasks whose aim was to tackle NER exclusively with Machine Learning techniques, for Spanish and Dutch (Tjong Kim Sang, 2002) and for English and German (Tjong Kim Sang & De Meulder, 2003) respectively. The NE categories considered were the ENAMEX categories (person, location, organisation) from the MUC conferences, to which a new category, miscellaneous, was added for proper nouns that do not belong to any of the ENAMEX categories. Since 2004, two editions of HAREM, a competition for the evaluation of NER systems for Portuguese, have taken place. The MUC conferences not only constituted the birth of NER but also established a set of entity categories that has served as a basis for subsequent NE-related events. The taxonomy of NE categories proposed at the MUC conferences was as follows:

• Named Entities (ENAMEX). Contains three subcategories:
– Organisation. Names of companies, public administrations and other entities of an organisational nature.
– Person. Personal and family names.
– Location. Names of locations of a political or geographical nature (e.g.
cities, provinces, regions, water bodies, mountains).
IREX: http://nlp.cs.nyu.edu/irex/index-e.html
MET: http://www-nlpir.nist.gov/related_projects/tipster/met.htm
ACE: http://www.itl.nist.gov/iaui/894.01/tests/ace/
CoNLL-2002: http://www.cnts.ua.ac.be/conll2002/ner/
CoNLL-2003: http://www.cnts.ua.ac.be/conll2003/ner/
HAREM: http://www.linguateca.pt/aval_conjunta/HAREM/harem_ing.html
• Temporal Expressions (TIMEX):
– Date
– Hour
• Numerical Expressions (NUMEX):
– Monetary expression
– Percentage

Subsequently, other more complex hierarchies have been proposed. A notable example is the proposal by Sekine (Sekine et al., 2002), made up of 150 NE types grouped in a multilevel hierarchy. In order to define the key NE types, the types defined in NER-related events such as MUC, the Information Retrieval and Extraction Exercise (IREX) and ACE were taken into account. The HAREM competition proposes a two-level hierarchy of NE types, whose first level contains 10 elements and whose second contains 40. Moreover, there are proposals of arbitrary categories, that is, categories not defined a priori. These are mainly used in unsupervised approaches to the acquisition of NEs; an example of this type of system is (Pasca, 2004). In an analogous manner to other NLP tasks, two main approaches have been developed for NER, to which we will refer as knowledge-based and learning-based. Knowledge-based systems use dictionaries and rules as resources, which are normally hand-crafted by linguists. The learning-based paradigm, on the other hand, relies on texts annotated with the NEs to be considered, which are exploited according to a predefined set of features. Hereafter, the state-of-the-art for the two aforementioned NER approaches is described.
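The MUC taxonomy above is shallow enough to represent as a simple two-level mapping; a sketch:

```python
# The MUC category taxonomy as a nested mapping (illustrative sketch;
# category names follow the list above).
MUC_TAXONOMY = {
    "ENAMEX": ["Organisation", "Person", "Location"],
    "TIMEX": ["Date", "Hour"],
    "NUMEX": ["Monetary expression", "Percentage"],
}

def leaf_categories(taxonomy):
    """Flatten the two-level hierarchy into the list of leaf categories."""
    return [leaf for leaves in taxonomy.values() for leaf in leaves]

print(leaf_categories(MUC_TAXONOMY))  # the 7 leaf categories
```

Richer proposals such as Sekine's 150-type hierarchy require a deeper structure, but the principle is the same.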
2.1.1 Knowledge-based
Most of the systems developed for the MUC conferences belonged to this approach (e.g. Facile (Black et al., 1998), Fastus (Hobbs et al., 1996), NYU, LaSIE (Gaizauskas et al., 1995) (Humphreys et al., 1998)). These systems were simple and efficient. They used different types of algorithms and exploited diverse levels of linguistic analysis. However, all of them used the same kind of resources in order to perform NER: word lists and rules, both created by hand. The main problem of these systems was therefore the effort
needed to create and maintain these resources. In fact, the manual creation of dictionaries and rules for a NER system in a general domain requires expert linguistic knowledge and is a time-consuming task if satisfactory results are to be obtained. Regarding the construction and maintenance of dictionaries, several problems have been identified, the main ones being the effort required to build and maintain them and the possible overlap between dictionaries of different NE categories. Due to these problems, most of the research in NER shifted to the machine learning approach. Nevertheless, in recent years research to overcome these problems from within the knowledge-based paradigm has been taking place. The creation and maintenance of resources was an important problem because it was done manually; current research tackles their creation and maintenance automatically, attempting to obtain these resources from sources of linguistic knowledge or even from plain text. Among these proposals we highlight a NER system (Magnini et al., 2002) (Negri & Magnini, 2004) that uses trigger dictionaries automatically extracted from WordNet. We also consider a research work on the automatic creation of geographic dictionaries from the Internet by using text mining techniques (Ourioupina, 2002) (Uryupina, 2003); the author emphasises that the resulting dictionaries could be satisfactorily used in a NER system. This topic is dealt with in more detail in section 2.2.2. The next paragraphs describe several knowledge-based NER systems in more detail.

LaSIE

LaSIE is a modular system made up of nine modules that process the input sequentially in the following order: tokeniser, dictionary lookup, sentence splitter, morpho-syntactic analysis, noun matching, discourse interpretation and template writer. Regarding the resources used for NER, LaSIE uses manually generated dictionaries and rules.
Dictionaries are of two types: NEs and triggers. Concerning rules, it uses 10 entity grammars which contain 400 rules. These use morpho-syntactic and semantic tags as well as lexical items. The rules were built from the analysis of the Penn Treebank corpus. This system's performance reached 90.4% F-measure, 94% precision and
87% recall in MUC-7. The low recall compared to other systems' scores is due to the inability to detect some of the NEs present in the evaluation corpus, as they were not covered by the dictionaries used.

FASTUS

FASTUS is a permuted acronym standing for "Finite State Automata-based Text Understanding System". The system architecture is based on a cascade of finite-state transducers, the transducer of each phase providing an additional level of analysis of the input. The transducers used by the system, as it participated in MUC-6, are the following: tokeniser, multiword analyser, preprocessor, NER, parser, phrase combiner and domain pattern recogniser. FASTUS also uses manually created rules, which are divided into coarse-grained domain-independent rules (macros) and domain-dependent rules that use those macros to build final rules. The rules use morphological, syntactic and semantic elements. FASTUS obtained a high score in the MUC-6 NER task (94% F-measure, 96% precision and 92% recall), even if the authors claim that the adaptation of the system to the specific domain of MUC-6 took only one person-day, thanks to the representation system used for the rules.

A WordNet-based approach to NER

The authors propose to use a semantic network in order to tackle the problem of constructing and maintaining large dictionaries (Magnini et al., 2002) (Negri & Magnini, 2004). Since the LR used is available for several languages, the use of language-independent techniques would allow the approach to be applied to those languages. This research does not use plain dictionaries of NEs; instead it uses the WordNet semantic network. Triggers therefore acquire a notable importance, because WordNet contains a huge amount of common nouns but very few proper nouns. The NER system presented is based on rules. The architecture of the system is made up of three phases.
In the first, the text is preprocessed by tokenising it, applying a morpho-syntactic analysis and recognising multiwords. In the second, 200 basic rules are used to tag the possible entities; these rules may contain WordNet synsets, lemmas and morpho-syntactic tags. Most of these rules refer both to external
and internal evidence captured from WordNet predicates. Finally, in the third phase a higher-level set of rules is used in order to resolve possible ambiguities and coreference cases. The system has been evaluated on the DARPA/NIST HUB-4 dataset, obtaining 85% F-measure. For the most common NE categories the results were 73.4% for person entities, 88.7% for locations and 81.8% for organisations.
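The F-measure figures reported for these systems are the balanced harmonic mean of precision and recall; for instance, LaSIE's 94% precision and 87% recall yield its reported 90.4%:

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# LaSIE at MUC-7: 94% precision, 87% recall
print(round(100 * f_measure(0.94, 0.87), 1))  # 90.4
# FASTUS at MUC-6: 96% precision, 92% recall
print(round(100 * f_measure(0.96, 0.92), 1))  # 94.0
```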
2.1.2 Machine learning approach
After the first years dominated by knowledge-based systems, the majority of the research devoted to NER shifted to machine learning approaches. Many different learning techniques have been used in this field, for example Hidden Markov Models (Bikel et al., 1997), Maximum Entropy (Borthwick et al., 1998), Decision Trees (Sekine et al., 1998), Support Vector Machines (Takeuchi & Collier, 2002) and Boosting (Carreras et al., 2002). In fact, NER has constituted a field where diverse learning algorithms have been experimentally tested. Whatever the machine learning algorithm used, these systems require a tagged corpus. From this corpus and a selection of features (e.g. capitalisation of words), the algorithm carries out the training process and is then ready to be applied to untagged texts. Because of this, the set of features selected and the quality of the tagged corpus used for training are crucial. Moreover, systems that combine different machine learning algorithms have been developed. The idea is to combine the advantages of each algorithm while trying to minimise the effect of their disadvantages on the final result. Combination techniques such as voting (Kozareva et al., 2005) and stacking (Florian et al., 2003) have been tested. The main inconvenience of these systems is the requirement of a tagged corpus which, in order to obtain satisfactory results, needs to be considerably large. Because of this, research to avoid the need to manually tag large corpora has been carried out. Proposals in this direction can be divided into two kinds: semi-supervised methods (Collins & Singer, 1999) and unsupervised methods (Pasca, 2004). A more detailed description of two systems that belong to this paradigm follows.
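The voting combination strategy mentioned above can be sketched as follows; the classifier outputs and weights are invented for illustration:

```python
from collections import Counter

# Sketch of combining NER classifiers by (weighted) voting: for each token
# position, pick the tag with the highest total weight across classifiers.
def vote(predictions, weights=None):
    weights = weights or [1.0] * len(predictions)
    combined = []
    for position in zip(*predictions):  # tags proposed for one token
        scores = Counter()
        for tag, weight in zip(position, weights):
            scores[tag] += weight
        combined.append(scores.most_common(1)[0][0])
    return combined

# Three toy classifiers disagreeing on a three-token input.
clf_a = ["PERSON", "NO-ENT", "LOCATION"]
clf_b = ["PERSON", "ORGANISATION", "LOCATION"]
clf_c = ["NO-ENT", "ORGANISATION", "LOCATION"]
print(vote([clf_a, clf_b, clf_c]))  # majority tag per token
```

Stacking differs in that a second-level classifier, rather than a fixed voting rule, learns how to combine the base classifiers' outputs.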
Named Entity Extraction using AdaBoost

This system uses the AdaBoost algorithm and was presented at the CoNLL-2002 and CoNLL-2003 shared task competitions (Carreras et al., 2002; Carreras et al., 2003). It approaches the task by treating its two main sub-tasks, recognition and classification, sequentially and independently with separate modules. Both modules are machine learning based and make use of binary and multiclass AdaBoost classifiers. The features used by the classifiers may be grouped into the following main types:

• Lexical: word forms and word lemmas.
• Syntactic: PoS and chunk tags.
• Orthographic: word properties such as capitalisation, presence of punctuation marks, etc.
• Affixes: prefixes and suffixes of the word.
• Word Type Patterns: type pattern of consecutive words in the context.
• Trigger Words: external list of words that trigger NEs.
• Gazetteer features: external list of NEs.

NERUA

NERUA (Kozareva et al., 2005) is an open-domain NE recogniser developed by the NLP group at the University of Alicante. It combines three machine learning classifiers by means of a voting strategy, and carries out the recognition of entities in two phases: entity detection and classification of the detected entities. The three classifiers integrated in NERUA use the following algorithms: Hidden Markov Models, Maximum Entropy and Memory Based Learning. The outputs of the classifiers are combined using a weighted voting strategy, which consists of assigning different weights to the models depending on the class they predict. The features used by the machine learning classifiers can be divided into several groups:
• Orthographic: related to the orthography of the word to be classified, for instance capitalisation, digits, punctuation marks, hyphenation, suffixes and prefixes, etc.
• Contextual: a three-word window used to create and analyse the context of the target words.
• Morphological: morphological characteristics such as lemma, stem, part-of-speech tag, etc.
• Handcrafted-list: features testing whether or not the word is contained in some handcrafted list of general entities obtained from several web pages.
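Features of the orthographic group can be sketched as simple predicates over a token; the feature names below are illustrative, not those actually used by NERUA or the AdaBoost system:

```python
import string

# Sketch of orthographic feature extraction for a single token.
# Feature names are invented for illustration.
def orthographic_features(token):
    return {
        "is_capitalised": token[:1].isupper(),
        "all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_punct": any(ch in string.punctuation for ch in token),
        "is_hyphenated": "-" in token,
        "prefix3": token[:3],
        "suffix3": token[-3:],
    }

print(orthographic_features("Liverpool"))
```

In a real system each token's feature dictionary, together with contextual and morphological features, forms the input vector for the classifiers.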
NERUA classifies entities into the four classes of the aforementioned CoNLL tagset. The classifiers were trained using the corpora provided by the CoNLL-2002 conference. Initially, NERUA was designed to recognise NEs within Spanish texts; however, since the 2003 edition of this conference also supplied annotated corpora for English, the system was adjusted to recognise English NEs.
2.1.3 Hybrid systems
By hybrid NER systems we refer to those that combine the two aforementioned paradigms (see 2.1.1 and 2.1.2). LTG (Mikheev et al., 1998) is a NER system that belongs to this category. It recognises NEs in five sequential phases: in the first and third it applies rules to its input, while in the second, fourth and fifth it applies a Maximum Entropy model to partially match NEs. Another recent hybrid proposal (Ferrández et al., 2006a) uses Machine Learning methods to perform NER and subsequently applies simple rules that improve the results obtained. These rules are derived from an error analysis in which the most salient error patterns produced by the learning-based NER system are identified.
2.2 Acquisition and enrichment of Language Resources

2.2.1 General
Research on automatic lexical acquisition began in the 1980s and initially focused on acquiring lexical information from Machine Readable Dictionaries (MRDs). During the next decade, due both to the availability of large corpora and of the NLP tools needed for their accurate processing (Part-of-Speech (PoS) taggers, chunkers, etc.) and to the drawbacks of MRDs, the emphasis shifted to corpus-based approaches. Recent years have seen what could be called "a quantitative evolution": the increasing processing power of computers, together with the availability of robust statistical NLP tools, has led to research proposals in which the reference corpus is the World Wide Web (WWW). The Acquisition of Lexical Knowledge for Natural Language Processing Systems (ACQUILEX) project (1989-1992) pioneered the derivation of lexica from early samples of MRDs. Relevant publications from this period include (Calzolari, 1992), (Nakamura & Nagao, 1988) and (Alshawi, 1987). Later on, a PhD thesis (Rigau, 1998) presented a detailed proposal regarding the massive acquisition of lexical knowledge from monolingual and bilingual MRDs; apart from designing a productive methodology to build and validate a multilingual KB, a software system (called SEISD) was implemented. The utilisation of MRDs in knowledge acquisition has been criticised because of their fixed size (Hearst, 1992). This paper proposes the extraction of semantic knowledge from corpora by using lexical patterns; six patterns are proposed, together with a methodology to find new ones. The follow-up of ACQUILEX, Acquisition of Lexical Knowledge (ACQUILEX-II) (1993-1995), made considerable use of corpora as a further source of data for the semi-automatic construction of lexical resources.
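The best known of the lexical patterns proposed by Hearst, "NP such as NP1, NP2 ... and NPn", can be sketched at the string level (a simplification: the original patterns operate over noun phrases identified by a parser, not raw strings):

```python
import re

# String-level sketch of Hearst's "such as" pattern for hyponym extraction.
PATTERN = re.compile(r"(\w+) such as ((?:\w+(?:, )?)+(?: and \w+)?)")

def hyponyms(text):
    """Return (hypernym, [hyponyms]) pairs found by the pattern."""
    pairs = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        hypos = re.split(r", | and ", m.group(2))
        pairs.append((hypernym, [h for h in hypos if h]))
    return pairs

print(hyponyms("He studied languages such as Spanish, Italian and English."))
```

The other patterns ("including", "especially", "or other", etc.) follow the same shape, which is what makes the pattern family easy to extend with the bootstrapping methodology Hearst describes.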
Shallow Parsing and Knowledge Extraction for Language Engineering (SPARKLE) (1995-1996) demonstrated the important role of shallow parsing in acquiring several types of linguistic information, such as subcategorisation, argument structure and selectional preferences. Developing Multilingual Web-scale Language Technologies (MEANING) (2002-2005) (Atserias et al., 2004) acquired EWN-based information from corpora to support automatic procedures of Word
Sense Disambiguation (WSD). The scalability problem has been addressed by proposing an efficient method for dealing with large corpora (Agichtein & Gravano, 2000). Open Information Extraction (Etzioni et al., 2008) is an extraction paradigm designed for large corpora, in which the system makes a single pass and extracts tuples without any human input. The authors also present TextRunner, an implementation of this paradigm.
2.2.2 Onomastica
This section presents an overview of research work regarding the creation and acquisition of onomastica. The structure of a multilingual onomasticon made up of a set of monolingual onomastica cross-referenced by translation links is presented in (Sheremetyeva et al., 1998). The entries are organised in a hierarchy made up of 45 semantic categories. A semi-automatic population procedure is proposed, supported by an acquisition and administration interface. Prolexbase, a multilingual database of proper nouns, was created within the Prolex project (Tran et al., 2004; Krstev et al., 2005; Maurel, 2008). It is based on an ontology with four layers (instances, linguistic, conceptual and meta-conceptual) and several relations (synonymy, meronymy, antonomasia, etc.). Entries are linked to EuroWordNet's Inter-Lingual Index. The population of Prolexbase is done manually. It contains mainly French proper nouns (75,368 lemmas); there are also translations for Serbian and German (13,000 entries). A proper noun ontology can be automatically derived from newswire text (Mann, 2002). The proposal consists of extracting phrases from a 1-gigabyte corpus by applying a Part-of-Speech pattern (a common noun followed by a proper noun). This allows the author to gather 113,000 different proper nouns and to reach a precision of 60% (84% for proper nouns referring to people and 47% for the rest). The author also points out that the employed methodology is problematic with polysemous words, and that it is not straightforward to integrate the resulting proper noun ontology with the WordNet taxonomy of nouns. A subsequent proposal that exploits newswire (Fleischman et al., 2003) extracts concept-instance relations from 15 gigabytes of newspaper text by using two Part-of-Speech patterns (a common noun followed by a proper noun, and appositions). Machine Learning techniques are applied to increase the
precision of the extracted information. 500,000 unique instances are extracted (Bill Clinton and William Clinton are considered as two different instances). An evaluation over 100 concept-instance items is carried out, achieving a precision of 93%. Other proposals try to extend general LRs by linking them to NE resources. The linkage of a gazetteer to WordNet is studied in (Sundheim et al., 2006). The paper proposes to incorporate the instances of a geographic nature from WordNet into the Integrated Gazetteer Database (IGDB); this is justified by the fact that both resources contain complementary information. A similar proposal is found in (de Loupy et al., 2004), which uses WordNet as a proper noun thesaurus for a QA system by enriching it with 130,675 proper nouns. These nouns are extracted from several knowledge bases (the authors do not specify which) and from the Internet. 55 types of entries are enriched with proper nouns; however, not all of them seem to contain proper nouns (e.g. "professions" contains "Academic teacher", "political titles" contains "1st secretary"). The methodology followed to build this thesaurus is not mentioned, which leads us to think that both the acquisition of proper nouns and their insertion in the corresponding synsets are carried out manually. REPENTINO (Sarmento et al., 2006) is a repository of monolingual (Portuguese) NEs. This resource contains 450,129 entities, organised according to a taxonomy made up of several top categories (abstract, art and media, nature, event, legal, localisation, organisation, product, being and substance), which in turn are subdivided into subcategories. The NEs are extracted from several corpora and web sources by using semi-automated methods. Details about the number of NEs extracted from each source per category can be found at http://poloclup.linguateca.pt/cgi-bin/repentino/fontes.pl.
There is also a web interface (http://www.linguateca.pt/REPENTINO/) that allows users both to browse the repository and to suggest new NEs to be added.
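The Part-of-Speech pattern used in the newswire approaches above (a common noun immediately followed by a proper noun) can be sketched over pre-tagged text; the tags follow the Penn Treebank convention (NN = common noun, NNP = proper noun) and the example sentence is toy data:

```python
# Sketch of the concept-instance Part-of-Speech pattern: a common noun
# immediately followed by a proper noun (e.g. "president Lincoln").
def concept_instances(tagged_tokens):
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 == "NN" and t2 == "NNP":
            pairs.append((w1, w2))  # (concept, instance)
    return pairs

sentence = [("president", "NN"), ("Lincoln", "NNP"), ("visited", "VBD"),
            ("the", "DT"), ("city", "NN"), ("of", "IN"), ("Washington", "NNP")]
print(concept_instances(sentence))  # [('president', 'Lincoln')]
```

As the second pair in the example shows ("city of Washington" is missed), such a single pattern trades recall for simplicity, which is why Fleischman et al. add an apposition pattern and a machine-learned filter on top.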
2.3 Text Similarity
Detecting and categorising semantic relations are challenging tasks that encompass issues of lexical modification, syntactic alternation, reference and discourse structure, and world-knowledge comprehension. There is strong interest in modelling and designing techniques aimed at measuring semantic
equivalence and capable of addressing the uncertainty that underlies language ambiguity and variability. As background in this area, this section describes some of the most relevant works on obtaining similarities between short texts. Existing research on text similarity has focused mainly on whole documents or individual words, while short text units (paragraphs, sentences, contexts) have mostly been neglected. We distinguish between two types of approaches: (i) those intended for computing similarities between words; and (ii) those which deal with entire sentences or short text fragments (also called contexts). The reason for presenting approaches that compute similarities between words is that some of the approaches dealing with similarity between short texts extend this kind of method. The following paragraphs delve into the techniques and approaches that were relevant for the development of this research work.
2.3.1 Similarity between words
There is a large number of word-to-word similarity measures (Terra & Clarke, 2003), which can be classified as belonging to two main approaches: corpus-based and knowledge-based. Word similarity has been widely used to support and improve the global performance of many NLP tasks, such as WSD, Textual Entailment (TE) and IE, to name but a few (Mihalcea et al., 2006). Corpus-based measures try to determine the similarity between two words using information derived from large corpora. Examples of widely used measures of this type include Pointwise Mutual Information (PMI) (Turney, 2001) and Latent Semantic Analysis (Landauer et al., 1998). Knowledge-based measures calculate the similarity between two words by using information derived from semantic LRs. Metrics belonging to this paradigm include Leacock & Chodorow (Leacock & Chodorow, 1998), Lesk (Lesk, 1986), Wu & Palmer (Wu & Palmer, 1994), Resnik (Resnik, 1995), Lin (Lin & Hovy, 2003) and Jiang & Conrath (Jiang & Conrath, 1997). WordNet::Similarity (Pedersen et al., 2004) (http://www.d.umn.edu/~tpederse/similarity.html) is a freely available software package that implements several knowledge-based similarity and relatedness measures between a pair of concepts (or word senses). It provides six measures of similarity and three measures of relatedness, all of which are based on the lexical database WordNet.
Regarding the gold standard datasets against which measures are evaluated, let us point out RG (Rubenstein & Goodenough, 1965) and WordSim353 (Finkelstein et al., 2002). RG consists of 65 pairs of words judged by 51 human subjects on a scale from 0.0 (minimum similarity) to 4.0 (maximum similarity). WordSim353 contains 353 word pairs, each associated with an average of 13 to 16 human judgements, considering similarity and relatedness indistinctly.
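Of the corpus-based measures above, PMI is simple enough to sketch directly; the co-occurrence counts below are invented for illustration:

```python
import math

# Sketch of corpus-based word similarity via Pointwise Mutual Information:
#   PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
# where probabilities are estimated from (co-)occurrence counts.
def pmi(count_xy, count_x, count_y, total):
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Illustrative counts: "car" in 1,000 contexts, "automobile" in 800,
# both together in 400, out of 100,000 contexts overall.
print(round(pmi(400, 1000, 800, 100_000), 2))
```

A PMI well above zero, as here, indicates that the two words co-occur far more often than chance, the signal that PMI-based similarity measures exploit.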
2.3.2 Similarity between texts
SimFinder (Hatzivassiloglou et al., 2001) is a supervised system based on 43 features extracted from text. It uses a log-linear regression model to determine the semantic similarity of two short text units from the evidence obtained from the different features. Another approach combines corpus-based (PMI-IR and Latent Semantic Analysis) and knowledge-based measures of word similarity (Mihalcea et al., 2006); the results are then combined and used to derive a similarity metric between short texts. Semantic Text Similarity (Islam & Inkpen, 2008) combines string similarity and semantic similarity between words (a modified version of the longest common subsequence and Second-order Co-occurrence PMI, respectively) with common-word order similarity. SenseClusters (http://senseclusters.sourceforge.net/) is a language-independent and unsupervised tool that clusters short contexts. It represents contexts using first- or second-order feature vectors and applies Singular Value Decomposition to reduce dimensionality. Apart from the aforementioned systems, it is worth noting two datasets that have been used to evaluate approaches to short text similarity. The first is the Microsoft paraphrase corpus (Dolan & Brockett, 2005), extracted from news sources. It is made up of 5,801 pairs of sentences, each with a human judgement indicating whether the two sentences can be considered paraphrases or not. The second is the Pilot Short Text Semantic Similarity Benchmark Data Set (Li et al., 2006), which contains 30 sentence pairs from the Collins Cobuild dictionary. In this case the judgements are not binary but on a scale (from 0.0 for minimum similarity to 4.0 for maximum similarity).
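As a point of reference for the systems above, the simplest purely lexical baseline for short-text similarity is a cosine over bag-of-words vectors; the approaches described in this section improve on exactly this kind of surface overlap by adding word-level semantic similarity. A minimal sketch:

```python
from collections import Counter
import math

# Bag-of-words cosine similarity between two short texts: a lexical
# baseline with no semantic knowledge whatsoever.
def cosine(text_a, text_b):
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

print(round(cosine("the cat sat on the mat", "the cat lay on the mat"), 2))
```

Such a baseline scores zero for paraphrases that share no words, which is precisely the case the semantic approaches above are designed to handle.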
2.4 Standards and Language Resources
LRs can be difficult to access for both NLP tasks and human users because of differences in their formats, structures and the semantics of the categories used. The major barriers to sharing and interchanging such resources are hence of a technical and methodological nature: the presence of competing practices in lexicon building, the absence of interfaces among formats and models, and the lack of broadly accepted standards. Concrete initiatives have been or are being implemented to address the issue of standardisation of NLP lexica. Some of these efforts are limited to the laboratories concerned and to the lifetime of the projects in which the standardisation initiatives have been pursued, or are in a premature phase of standardisation, with no official and stable standard framework available. State-of-the-art initiatives are:

• Standards for monolingual lexicons and corpora in the Expert Advisory Group for Language Engineering Standards (EAGLES).
• Standards for multilingual lexicons in International Standards for Language Engineering (ISLE)-CLWG (Calzolari et al., 2002). This includes the previous EAGLES results for many layers of linguistic description and reflects both the Preparatory Action for Linguistic Resources Organisation for Language Engineering (PAROLE)-Semantic Information for Multifunctional Plurilingual Lexica (SIMPLE) lexica and the EWN lexical database family. All in all, it can be considered the result of almost ten years of standardisation activity.
• Standards for the description of information from print dictionaries in the Text Encoding Initiative (TEI).
• A standard for exchanging terminological and lexical data for machine translation applications in the Open Lexicon Interchange Format (OLIF).
• A standard method to describe translation memory data exchanged among tools and/or translation vendors in LISA.
• Provision of ISO-ratified standards for Language Technology to enable the exchange and reuse of multilingual LRs in LIRICS (http://lirics.loria.fr/).
• The establishment in 2002 of a technical subcommittee of the International Organization for Standardization (ISO), called ISO/TC37/SC4 (http://www.tc37sc4.org), devoted to the creation of standards for LRs in order to maximise their applicability. The Lexical Markup Framework (LMF) (Francopoulo et al., 2008) (ISO 24613, 2008), a standard for the representation of LRs, has been produced by this working group.
• The establishment of registries and catalogues for linguistic categories (e.g. the ISO TC37 SC4 Data Category Registry) and annotation schemas (e.g. the UIMA Component Registry).
• Research efforts to create linked resources; examples are the Global WordNet Association (http://www.globalwordnet.org/), constituted in 2000, and the MEANING project (http://www.lsi.upc.es/~nlp/meaning/) (2002-2005).
• The creation of an international conference devoted to LR interoperability, ICGL (http://icgl.ctl.cityu.edu.hk; 1st edition 2008, 2nd edition 2010).

The diverse efforts described above all aim towards attaining LR interoperability, but they have tended to be developed in isolation. There is as yet minimal ability to integrate data and components, leading not only to a duplication of effort but also to a restricted understanding of better ways to address specific technical issues. At the same time, there is enough convergence of opinion and practice among these community activities that the crucial steps required to bring all of the pieces together are beginning to emerge. Perhaps more importantly, advances in technology over the past decade, such as the technologies surrounding the Semantic Web and the emergence of distributed computing and data capabilities, have opened up possibilities for the interlinkage of data, annotations, lexicons, ontologies, etc., that provide new motivation to pursue resource interoperability. The time and circumstances are therefore ripe to take a broad and forward-looking view in order to establish and implement the standards and technologies necessary to ensure LR interoperability in the future.
This can only be achieved through a coordinated, community-wide effort that will ensure both comprehensive coverage and widespread acceptance.
http://www.tc37sc4.org http://www.globalwordnet.org/ 13 http://www.lsi.upc.es/~nlp/meaning/ 14 http://icgl.ctl.cityu.edu.hk (1st edition 2008; 2nd edition 2010) 12
Chapter 3 Language Resources and New Text
This chapter introduces the resources used in the present research: structured LRs for the different languages covered (English, Italian and Spanish), and a semi-structured resource from the Web 2.0 paradigm. The LRs are WordNet (for English), EuroWordNet (for Italian and Spanish) and PSC (for Italian). The following sections briefly describe each of these LRs. Afterwards we introduce SUMO, a formal ontology linked to WordNet, and LMF, an ISO standard for LRs used in our research proposal. Finally, we introduce the concept of New Text, related to Web 2.0 resources, and describe Wikipedia, a multilingual encyclopaedia used in our methodology.
3.1 WordNet
WordNet is an on-line lexical database for English developed at Princeton University. It contains nouns, verbs, adjectives and adverbs organised into sets of synonyms, called synsets, and encodes several types of semantic relations among its nodes (Miller, 1995). It is manually developed by a team of linguists and its design is inspired by psycholinguistic theories of human lexical memory. This resource is widely used within the NLP community; in fact, it has become the de facto standard for several NLP tasks such as WSD. The version of WordNet used in this research, 2.1, is made up of 117,597 synsets (81,426 nouns, 13,650 verbs, 18,877 adjectives and 3,644 adverbs) and 155,327 variants (117,097 nouns, 11,488 verbs, 22,141 adjectives and 4,601 adverbs).
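The distinction between synsets and variants can be illustrated with a small sketch (the data below is a toy stand-in, not the actual WordNet 2.1 content): each synset is counted once, while a word form is counted once per synset in which it appears, i.e. per variant.

```python
# Toy illustration of the synset/variant distinction.
from dataclasses import dataclass, field

@dataclass
class Synset:
    pos: str                                       # 'n', 'v', 'a' or 'r'
    variants: list = field(default_factory=list)   # member word forms

synsets = [
    Synset("n", ["person", "individual", "someone"]),
    Synset("n", ["location", "place"]),
    Synset("v", ["run", "go"]),
]

# WordNet 2.1 reports both figures: 117,597 synsets vs 155,327 variants.
n_synsets = len(synsets)
n_variants = sum(len(s.variants) for s in synsets)
print(n_synsets, n_variants)  # 3 7
```

The variant count is always at least the synset count, which is why the per-POS variant figures above exceed the synset figures.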
3.2 EuroWordNet
EWN (Vossen, 1998) is a project funded by the European Union with the aim of developing a multilingual database of inter-connected wordnets for several European languages (initially Dutch, Italian, Spanish and English; later extended to German, French, Estonian and Czech). This project is inspired by Princeton WordNet but introduces important improvements to its original design. EWN contains new types of relationships, including some across different parts of speech. Moreover, EWN is a multilingual resource; a module called Inter-Lingual-Index (ILI) links “equivalent” synsets in the various wordnets by using the synsets of Princeton WordNet 1.5 as a pivot. EWN introduces a set of 1,024 top concepts common to all the languages and a language-independent Top Ontology built from 63 of the most abstract of these top concepts. In this research we use two wordnets that belong to the EWN model: the Italian and Spanish wordnets. These are described in the following paragraphs.
3.2.1 Italian WordNet
Italian WordNet (IWN) (Alonge et al., 1999) was built from different Italian lexical and corpus sources, such as the Italian Machine Dictionary, the Italian Reference Corpus and the PAROLE lexicon. IWN was originally created in the framework of the EWN project and then further extended in the Italian national project Integrated System for the Automatic Language Processing (SI-TAL). In its current status, IWN provides the semantic description for around 67,000 Italian word senses (9,096 verbs, 32,099 common nouns, 3,450 proper nouns, 4,356 adjectives, and 513 adverbs, either single or multi-word units), which are clustered in approximately 50,000 synsets. IWN employs the same set of semantic relations used in EWN, and there are currently 117,068 instances of language-internal relations. One of the salient features of the resource is the connection of all IWN synsets to the Princeton WordNet database, version 1.5, in the form of an unstructured list of synsets, or ILI. Synsets are classified in terms of the IWN Top Ontology (TO), which is slightly different from the EWN TO, and is a hierarchy of 65 language-independent Top Concepts (TCs) clustered around three main categories distinguishing 1st-, 2nd- and 3rd-order entities. Their subclasses, hierarchically ordered by means of a subsumption relation, are also structured in terms of (disjunctive and non-disjunctive) opposition relations.
3.2.2 Spanish WordNet
The Spanish WordNet (Verdejo, 1999) was built within the EWN project by a research team from three universities: the Universidad Nacional de Educación a Distancia (UNED), the University of Barcelona and the Technical University of Catalonia. It was afterwards extended, enriched and mapped to WordNet 1.6. The version used in this research1 contains 30,485 synsets, 52,515 variants, 73,665 language-internal relations and 28,283 equivalence relations to the ILI.
3.3 Parole-Simple-Clips
PSC is an Italian computational lexicon which has been developed in the framework of three different projects. The first two, PAROLE (Ruimy et al., 1998) and SIMPLE (Lenci et al., 2000), were funded by the European Union and were devoted to the research and development of wide-coverage, multipurpose and harmonised computational lexicons for twelve European languages2. While PAROLE dealt with the morphological and syntactic layers, SIMPLE, a follow-up to PAROLE, added a semantic layer to a subset of the PAROLE data. Lastly, Corpora e Lessici di Italiano Parlato e Scritto (CLIPS) (Ruimy et al., 2002) was an Italian national project in which the Italian lexicon developed within PAROLE and SIMPLE was enlarged and refined. The semantic layer of PSC, the relevant one for the current research, contains about 55,000 semantic units (i.e. senses) organised in an ontology made up of 153 semantic types (i.e. ontology nodes). From a theoretical point of view, the linguistic background of PSC is based on the Generative Lexicon (GL) theory (Pustejovsky, 1991). In the GL, the sense is viewed as a complex bundle of orthogonal dimensions that express the multidimensionality of word meaning. The most important component for representing the lexical semantics of a word sense is the qualia structure, which consists of four qualia roles:

• Formal role. Makes it possible to identify an entity.
• Constitutive role. Expresses the constitution of an entity.
• Agentive role. Provides information about the origin of an entity.
• Telic role. Specifies the function of an entity.
1 Available for research at http://www.lsi.upc.edu/~nlp
2 Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish.
Each qualia role can be considered an independent element or dimension of the vocabulary for semantic description. The qualia structure enables us to express different, orthogonal aspects of word sense, whereas a one-dimensional inheritance can only capture standard hyperonymic relations. Within SIMPLE, the qualia structure was extended by assigning subtypes to each of the qualia roles (e.g. “Usedfor” and “Usedby” are subtypes of the telic role).
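As an illustration only (the class and slot names below are ours, not the actual PSC encoding), a GL-style qualia structure for a single word sense can be sketched as:

```python
# Sketch of a qualia structure in the spirit of SIMPLE/GL.
# The encoding of 'novel' below is illustrative, not taken from PSC.
from dataclasses import dataclass, field

@dataclass
class QualiaStructure:
    formal: list = field(default_factory=list)        # what the entity is
    constitutive: list = field(default_factory=list)  # what it is made of
    agentive: list = field(default_factory=list)      # how it comes about
    telic: list = field(default_factory=list)         # what it is for

novel = QualiaStructure(
    formal=["book"],                     # a novel is a kind of book
    constitutive=["narrative", "chapter"],
    agentive=["write"],                  # origin: created by writing
    telic=["read"],                      # function: meant to be read
)
print(novel.telic)  # ['read']
```

Each list holds fillers of one orthogonal dimension of meaning; the SIMPLE extension with subtypes (e.g. “Usedfor”) would refine these slots further.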
3.3.1 Ontology
The SIMPLE ontology is made up of nodes called semantic types. There are two kinds of nodes: simple types, which are identified by a single one-dimensional aspect of meaning (formal) expressed by hyperonymic relations, and unified types, for which additional dimensions of meaning (e.g. constitutive) are needed. The top nodes of the ontology are shown in figure 3.1. These are “Entity” and the three qualia roles that might be involved in the definition of unified types (“Constitutive”, “Agentive” and “Telic”).
Figure 3.1: Top nodes of the SIMPLE ontology
3.3.2 Lexicon
The PSC lexicon consists of 57,000 semantic entries structured in terms of a semantic type system: the SIMPLE ontology. Out of the total number of entries, 28,500 are fully encoded with the whole wealth of mandatory and optional semantic features and relations foreseen by the SIMPLE model; the other 28,500 entries bear the main semantic information, namely ontological classification, type-defining features and predicative representation. To express the relationships holding between word senses, the SIMPLE model offers a large set of semantic relations. Among these are 60 Extended Qualia relations, whose relevance and expressive power are largely acknowledged. These relations allow for the expression of very fine-grained distinctions, both for structuring the information regarding the componential aspect of word meanings and for capturing the nature of the relationships holding among word senses. The PSC lexicon instantiates more than 80,000 semantic relations. A recent work (Ruimy & Toral, 2008) automatically enriches PSC with 5,000 new relations regarding conceptual links holding between events and their participants and among co-participants in events.
3.4 Mapping between Parole-Simple-Clips and ItalWordNet
Although PSC and IWN follow different lexical models, they also present compatible aspects (as a matter of fact, the ontologies of SIMPLE and EWN are compatible). Linking both resources offers the end-user more exhaustive lexical information, combining features offered by the two lexical models. It provides not only reciprocal enhancements but also a validation of the two resources. Moreover, the linking has a multilingual dimension: on the one hand, IWN is linked to wordnets for other languages through the ILI; on the other, PSC shares the theoretical model, the representation language, the building methodology and a set of core entries with 11 other European lexicons. Regarding the current status of this linking, 72.37% of word senses denoting concrete entities and 69.59% of word senses denoting abstract entities and events have been mapped (Roventini et al., 2007; Roventini & Ruimy, 2008).
3.5 The Lexical Markup Framework
LMF (Francopoulo et al., 2008) (ISO 24613, 2008) is an ISO standard for the representation of LRs. The goals of LMF are to provide a common model for the creation and use of LRs, to manage the exchange of data between and among them, and to enable the merging of a large number of individual resources to form extensive global electronic resources. LMF was designed specifically to accommodate as many models of lexical representation as possible. Purposefully, it is designed as a meta-model, i.e. a high-level specification for LRs defining the structural constraints of a lexicon. It is based on two main components:
• The core package, i.e. a structural skeleton to represent the basic hierarchy of information in a lexicon, in the form of core classes of objects and relations.

• A set of modular extensions to the core package, i.e. additional classes and relations required for the description of specific types of lexical resources. Available extensions include morphology, syntax, semantics, multilingual notations, paradigm classes, multi-word expression patterns and constraint expressions.
3.5.1 Core package
The core lexical objects (see figure 3.2) provide the basis for building LMF-compliant lexicons. LexicalResource is intended to represent an entire resource and, in our case, will be the container of the whole Named Entity Lexicon. Each individual monolingual lexicon is an instance of the standard Lexicon class and, in turn, is the container for the words of a given language. The LexicalEntry class represents an abstract unit of vocabulary: as a first approximation, a word. LexicalEntry functions as a bridge between Form (or, where applicable, Lemma), an abstract class representing the way a word is written (or spoken), and its related Sense(s), representing one (or more) meaning(s) of a lexical entry.
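A minimal sketch of this containment hierarchy, with illustrative attribute names rather than the normative ISO 24613 data categories:

```python
# Sketch of the LMF core classes: LexicalResource > Lexicon >
# LexicalEntry > (Lemma, Sense). Attribute names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Sense:
    definition: str               # one meaning of a lexical entry

@dataclass
class LexicalEntry:
    lemma: str                    # stands in for the Lemma/Form class
    senses: list = field(default_factory=list)

@dataclass
class Lexicon:
    language: str                 # one monolingual lexicon per language
    entries: list = field(default_factory=list)

@dataclass
class LexicalResource:            # container for the whole resource
    lexicons: list = field(default_factory=list)

resource = LexicalResource(lexicons=[
    Lexicon("en", [LexicalEntry("Rome", [Sense("capital of Italy")])]),
])
print(resource.lexicons[0].entries[0].lemma)  # Rome
```

In an actual LMF lexicon this structure is typically serialised as XML, but the containment relations are the ones shown here.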
3.5.2 Semantics package
The representation of the semantic aspects of words (see figure 3.3) is entrusted to objects related and aggregated to the Sense class, which represents lexical items as lexical semantic units. Each Sense instance describes one meaning of a LexicalEntry. Synset clusters synonymous Sense instances. Both may contain information on a specific domain and a link, via the MonolingualExternalRef class, to a semantic type in an ontology which the sense (or synset) instantiates. Semantic relatedness among words is an important property in natural language lexicons and is expressed here through the SenseRelation and SynsetRelation classes, which encode (lexical) semantic relationships among instances of the Sense or Synset class.
Figure 3.2: LMF core package
Figure 3.3: LMF semantics package
3.5.3 Multilingual package
In LMF, a separate package is devoted to multilingual notation, which can be used to represent bilingual and multilingual resources. The framework, based on the notion of Axis, accommodates both transfer (TransferAxis) and interlingual pivot (SenseAxis) approaches (see figure 3.4). The multilingual package also makes it possible to define connections between a node in a lexicon (e.g. a SenseAxis instance) and knowledge representation systems, such as ontologies or fact databases. This is accomplished by means of the InterlingualExternalRef class.
3.6 SUMO
The Suggested Upper level Merged Ontology (SUMO) (Niles & Pease, 2001) is an upper-level ontology proposed as a starter document for the Standard Upper Ontology Working Group, an Institute of Electrical and Electronics Engineers (IEEE) working group made up of collaborators from the fields of engineering, philosophy and information science. It was created by merging
Figure 3.4: LMF multilingual package
publicly available ontological content (ontologies available at the Ontolingua server, Sowa’s upper-level ontology, the ontologies developed at ITBM-CNR, etc.) into a single structure. The ontology contains 654 terms and 2,351 assertions. It has been mapped to different versions of WordNet3 (Niles & Pease, 2003). SUMO is free and can be both browsed online4 and downloaded5.
3.7 New Text and Wikipedia
By New Text we refer to “new types of text - dynamic, reactive, multilingual, with numerous cooperating or even adversarial authors and little or no editorial control”, which have arisen due to “recent advances in publication and dissemination systems” (Karlgren, 2006). These kinds of sources have emerged as a consequence of the appearance of new forms of communication; they are in fact a core element of the so-called Web 2.0. This new communication paradigm puts the emphasis on collaboration, promoting the creation of content in a bottom-up fashion. Unlike in the traditional Web model, where author and reader are completely separate roles, this distinction becomes much fuzzier, as anybody can play both roles. The best-known New Text resource types are blogs and wikis. Blogs are somewhat similar to traditional resources, as usually one person (or a team) writes the content (called posts) and a separate audience reads them. However, readers can play an active role, as they are invited to comment on the posts. Wikis, in their turn, can be defined as online texts that allow users to easily edit and change the contents. Wikipedia is a multilingual encyclopaedia that follows the wiki philosophy. The rest of this section focuses on Wikipedia. First this resource is introduced; afterwards we delve into aspects of Wikipedia that are relevant to the current research: we describe its structure, report on studies that compare Wikipedia to other encyclopaedias and, finally, survey its role in Computational Linguistics. Wikipedia is a multilingual, web-based, free-content encyclopaedia which is updated continuously in a collaborative way. It has versions for more than
3 http://sigmakee.cvs.sourceforge.net/sigmakee/KBs/WordNetMappings/
4 http://ontology.teknowledge.com:8080/rsigma/SKB.jsp
5 http://ontology.teknowledge.com/cgi-bin/cvsweb.cgi/SUO/
250 languages6. It was launched in 2001 by Jimmy Wales and Larry Sanger and is supported by the non-profit Wikimedia Foundation.
3.7.1 Structure
We briefly outline the main elements of Wikipedia’s structure:

• Articles. The article represents the concept of an encyclopaedic entry.
• Redirects, which can be associated to pages. They represent orthographic variants of entry titles.
• Categories, to which pages can be associated. The categories form a taxonomy; a category can have one or several subcategories and belongs to a supercategory.
• Intralingual links. They connect two pages that belong to the same language.
• Interlingual links. They connect equivalent pages that belong to different languages.

Figure 3.5 shows the Entity Relation diagram of the Wikipedia database7. Note that two tables are represented with dashed lines (CategoryLemmas and Abstracts); these tables are not part of the original Wikipedia database, but are automatically created with the wiki db access software package (see appendix A). Let us briefly describe each of these tables:
• Page is the main table. Depending on its namespace, a row represents an article (0), a redirect (1), a category (14), etc.
• Revision holds metadata for every edit made to a page. It connects the Page and Text tables.
• Text stores the text associated to a page.
• Redirect provides links from redirects to the corresponding articles.
6 http://meta.wikimedia.org/wiki/List_of_Wikipedias
7 For the sake of clarity, this diagram contains only the tables of the Wikipedia database that are used in this research.
Figure 3.5: Entity Relation diagram of the Wikipedia database

• PageLinks contains page-to-page links.
• CategoryLinks links articles and categories to the categories they belong to.
• LangLinks links articles to corresponding articles in other languages.
• CategoryLemmas contains, for each page that is a category, its lemma.
• Abstracts contains, for each page that is an article, its abstract.
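As a minimal sketch of how these tables can be queried, consider a simplified in-memory SQLite copy of the page and categorylinks tables (the real dump is MySQL and the tables have more columns; namespace 0 denotes articles and 14 categories):

```python
# Sketch: retrieving the categories of an article from simplified
# page/categorylinks tables. Schema and data are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
INSERT INTO page VALUES (1, 0, 'Portugal'), (2, 14, 'Countries_in_Europe');
INSERT INTO categorylinks VALUES (1, 'Countries_in_Europe');
""")

# Join the article row (namespace 0) to the categories it belongs to.
rows = db.execute("""
    SELECT cl_to FROM page JOIN categorylinks ON page_id = cl_from
    WHERE page_namespace = 0 AND page_title = 'Portugal'
""").fetchall()
print(rows)  # [('Countries_in_Europe',)]
```

The same join pattern applies to langlinks and redirect, which also key on the page id.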
3.7.2 Comparison to other encyclopaedias
We briefly report on two studies that compare Wikipedia to other resources. The first, a special report in Nature (Giles, 2005), confronts the English Wikipedia with the Encyclopaedia Britannica. A set of scientific articles covering a wide range of topics was selected from the two encyclopaedias and sent to field experts for peer review. The experts compared the corresponding articles without knowing which article came from which encyclopaedia. This led to 42 articles being reviewed. In all, 162 errors (factual errors, omissions and misleading statements) were found in Wikipedia, while 123 were found in Britannica. On average, therefore, 3.86 errors per article were found in Wikipedia and 2.92 in Britannica. It can be concluded that Wikipedia is more or less as accurate on science as the Encyclopaedia Britannica. The second, a study carried out by the WIND research institute for the Stern magazine8,9, compares the German Wikipedia with the Brockhaus Online encyclopaedia. 50 articles belonging to different topics (economy, politics, sports, science, culture, entertainment, geography, medicine and history) were rated on a scale from 1 (best) to 6 (worst). Wikipedia’s average rating was 1.7, while Brockhaus’ average rating was 2.7.
3.7.3 Wikipedia and Computational Linguistics
In the last few years there has been a growing academic interest in Web 2.0 collaborative resources, and among them especially in Wikipedia10. This is especially true for the area of Computational Linguistics, which perceives Wikipedia as a new Language Resource of huge dimensions. Several factors make Wikipedia a valuable resource for this research area:

• It is a big source of information. As of April 2009, it had over 12,800,000 entries; the English version alone had more than 2,800,000 entries.
• Its accuracy is comparable to that of traditional encyclopaedias (see 3.7.2).
• It is a general knowledge resource. Thus, it can be used to extract information for open-domain systems.
• Its data has some degree of formality and structure (e.g. categories), which helps to process it.
• It is a multilingual resource. It contains multilingual links that relate entries in an input language to their equivalents in other languages.
• It is constantly updated. This is very important for the maintenance of NLP systems.
8 http://www.stern.de/media/pdf/wiki_test_750.jpg
9 http://www.stern.de/computer-technik/internet/stern-Test-Wikipedia-Brockhaus/604423.html?q=Brockhaus%20wikipedia
10 http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_in_academic_studies#Over_time
• Its content is under a free license, meaning that it will always be available for research without restrictions.

In this field, several events where Wikipedia plays a central role have recently been organised, including the evaluation tasks Question Answering using Wikipedia (WiQA)11 and Cross-language Geographic Information Retrieval from Wikipedia (GikiCLEF)12, and the workshops NEW TEXT13, WikiAI0814 and The People’s Web Meets NLP15. The community has also developed tools that allow researchers to access the information encoded in Wikipedia and other similar resources. Examples are the Java-based Wikipedia Library (JWPL)16 and the Java-based WiKTionary Library (JWKTL)17, Application Programming Interfaces (APIs) that provide access to the information contained in Wikipedia and Wiktionary, respectively (Zesch et al., 2008). Wikipedia has been exploited for a wide range of tasks, such as monolingual (Ahn et al., 2005; Jijkoun et al., 2005; Buscaldi & Rosso, 2006) and multilingual (Ferrández et al., 2007d) Question Answering, semantic relatedness (Ponzetto & Strube, 2007; Gabrilovich & Markovitch, 2007; Milne & Witten, 2008), IE (Wu et al., 2008) and NE Disambiguation (Bunescu & Pasca, 2006). Furthermore, several researchers have used it to build LRs. Gregorowicz & Kramer (2006) mine a term-concept network from Wikipedia. Suchanek et al. (2007) introduce an ontology automatically derived from Wikipedia and WordNet. Auer et al. (2008) extract structured information from Wikipedia and make it available on the Web. Pedro et al. (2008) extract a medical ontology. Milne et al. (2006) mine a thesaurus for the agriculture domain. Medelyan & Legg (2008) integrate Cyc and Wikipedia. Jones et al. (2008) build a domain-specific multilingual dictionary by extracting the entries from a Wikipedia category, which is then used to customise a Machine Translation system.
Ruiz-Casado & Castells (2006) extract relations between entries of Wikipedia, which are added to their corresponding WordNet entries.
11 http://ilps.science.uva.nl/WiQA/
12 http://www.linguateca.pt/GikiCLEF/
13 http://www.sics.se/jussi/newtext
14 http://lit.csci.unt.edu/~wikiai08/index.php/Main_Page
15 http://www.ukp.tu-darmstadt.de/acl-ijcnlp-2009-workshop
16 http://www.ukp.tu-darmstadt.de/software/jwpl/
17 http://www.ukp.tu-darmstadt.de/software/jwktl/
Chapter 4 Acquisition of NE plain dictionaries
This chapter presents an initial experiment to automatically extract NE plain dictionaries (also known as gazetteers) from Wikipedia. In this approach, Wikipedia definitions are processed using NLP techniques; the method extracts the necessary information from linguistic resources. Our approach is based on the analysis of the entries of an on-line encyclopaedia by using a noun hierarchy and, optionally, a PoS tagger. An important motivation is to reach a high level of language independence. This restricts the techniques that can be used but makes the method useful for languages with few resources. The evaluation carried out proves that this approach can be successfully used to build NER gazetteers for the location (F 78%) and person (F 68%) categories.
4.1 Approach
In this section we present our initial approach to automatically build and maintain dictionaries of proper nouns. In a nutshell, we analyse the entries of an encyclopaedia with the aid of a noun hierarchy. Our motivation is that proper nouns which form entities can be obtained from the entries in an encyclopaedia, and that some features of their definitions in the encyclopaedia can help to classify them into their correct entity category. The encyclopaedia used is Wikipedia. We use its Simple English version1, “a version of the Wikipedia encyclopedia, written in Simple English. Articles in the Simple English Wikipedia use fewer words and easier grammar than the Ordinary English Wikipedia”2. The reason for using the Simple Wikipedia is that we attempt to process the definitions, and it is therefore expected that a simpler grammar provides better results. The noun hierarchy used is that of WordNet. From this noun hierarchy we consider the nodes (called synsets in WordNet) which in our opinion represent most accurately the different kinds of entities we are working with (location, organisation and person). For example, we consider synset 6026 as corresponding to the entity class Person. This is the information contained in synset number 6026:

person, individual, someone, somebody, mortal,
1 http://simple.wikipedia.org
2 http://simple.wikipedia.org/wiki/Simple_English_Wikipedia
human, soul -- (a human being; "there was too much for one person to do")

Given an entry from Wikipedia, a PoS tagger (Carreras et al., 2004) is applied to the first sentence of its definition. As an example, the first sentence of the entry Portugal in the Simple English Wikipedia3 is presented here (token, lemma, PoS tag):

Portugal    Portugal    NP
is          be          VBZ
a           a           DT
country     country     NN
in          in          IN
the         the         DT
south-west  south-west  NN
of          of          IN
Europe      Europe      NP
.           .           Fp

For every noun in a definition we obtain the WordNet synset that contains its first sense4. We follow the hyperonymy branch of this synset until we arrive either at a synset that we consider to belong to an entity class or at the root of the hierarchy. If we arrive at a considered synset, we take the noun as belonging to the entity class of that synset. The following example may clarify this explanation:

Portugal   --> LOCATION
country    --> LOCATION
south-west --> NONE
Europe     --> LOCATION

As mentioned above, the application of a PoS tagger is optional. The algorithm performs considerably faster with one, as with the PoS data we only need to process the nouns. If a PoS tagger is not
3 http://simple.wikipedia.org/wiki/Portugal
4 We have also carried out experiments taking into account all the senses provided by WordNet. However, the performance obtained is not substantially better, while the processing time increases notably.
available for a language, the algorithm can still be applied. The only drawback is that it performs more slowly, as it needs to process all the words. However, through our experimentation we can conclude that the results do not change significantly. Finally, we apply a weighting algorithm which takes into account the number of nouns in the definition identified as belonging to the different entity types considered, and decides to which entity type the entry belongs. This algorithm has a constant Kappa, which allows us to increase or decrease the distance required between categories in order to assign an entry to a given class. The value of Kappa is the minimum difference in the number of occurrences between the first and second most frequent categories in an entry required to assign the entry to the first category. In our example, for any value of Kappa lower than 4, the algorithm would say that the entry Portugal belongs to the location entity type. Once we have this basic approach, we apply different heuristics which we think might improve the results obtained and whose effect will be analysed in the results section (see 4.2). The first heuristic, called is_instance, tries to determine whether an entry from Wikipedia is an instance (e.g. Portugal) or a word class (e.g. country). This is done because NEs only comprise instances; we are therefore not interested in word classes. We consider that an entry from Wikipedia is an instance when it has an associated entry in WordNet and that entry is an instance. The procedure to determine whether an entry from WordNet is an instance or a word class is similar to the one used in (Magnini et al., 2002). The second heuristic is called is_in_wordnet. It simply determines whether an entry from Wikipedia has an associated entry in WordNet. If so, we may use the information from WordNet to determine its category.
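The hypernym-walk and Kappa-weighting steps described above can be sketched as follows (a toy hypernym hierarchy stands in for WordNet's noun hierarchy, and the function names are ours):

```python
# Sketch: classify each noun by walking its first-sense hypernym chain
# until an entity-class synset (or the root) is reached, then apply the
# Kappa margin to decide the entry's class. Toy data, not real WordNet.
from collections import Counter

HYPERNYM = {            # toy hypernym links (child synset -> parent)
    "portugal": "country", "country": "location", "europe": "location",
    "south-west": "direction", "direction": "entity", "location": "entity",
}
ENTITY_CLASS = {"location": "LOC", "person": "PER", "organisation": "ORG"}

def classify_noun(noun):
    node = noun
    while node is not None:
        if node in ENTITY_CLASS:
            return ENTITY_CLASS[node]
        node = HYPERNYM.get(node)   # None once we fall off the root
    return "NONE"

def classify_entry(nouns, kappa=0):
    counts = Counter(classify_noun(n) for n in nouns)
    counts.pop("NONE", None)        # NONE nouns carry no evidence
    ranked = counts.most_common()
    if not ranked:
        return "NONE"
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    # assign only if the margin over the runner-up is at least Kappa
    return ranked[0][0] if ranked[0][1] - runner_up >= kappa else "NONE"

nouns = ["portugal", "country", "south-west", "europe"]
print(classify_entry(nouns, kappa=2))  # LOC
```

With these nouns the margin between LOCATION (3) and the runner-up (0) is 3, so the entry is assigned to location for any Kappa lower than 4, matching the Portugal example in the text.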
4.2 Experiments and results
We have tested our approach by applying it to 3,517 randomly selected entries of the Simple English Wikipedia. These entries were manually tagged with the expected entity category. The distribution by entity class can be seen in table 4.1. As shown there, the number of entities of the categories person and location is balanced, but this is not the case for the type organisation, of which there are very few instances. This is understandable, as in an
Entity type    Number of instances    Percentage (of entity instances)
NONE           2822                   -
LOC            404                    58
ORG            55                     8
PER            236                    34

Table 4.1: Distribution by entity classes

            LOC                       ORG                       PER
k     P      R      Fβ=1       P      R      Fβ=1       P      R      Fβ=1
0     66.90  94.55  78.35      28.57  18.18  22.22      61.07  77.11  68.16
2     86.74  56.68  68.56      66.66  3.63   6.89       86.74  30.50  45.14

Table 4.2: Experiment 1. Results applying the is_instance heuristic
encyclopaedia, locations and people are defined, but this is not usually the case for organisations (such as companies). Following our comments in section 4.1, we tested the heuristics explained there by carrying out two experiments. In the first one we applied the is_instance heuristic. The second experiment considers the two heuristics explained in section 4.1 (is_instance and is_in_wordnet). We do not present results without the first heuristic, as our experimentation showed it to increase both recall and precision for every entity category. For each experiment we considered two values of the constant Kappa used in our algorithm. The values are 0 and 2, as through experimentation we found these to be the values which provide the highest recall and the highest precision, respectively. Results for the first experiment can be seen in table 4.2 and results for the second experiment in table 4.3. As shown in these tables, the best recall for all classes is obtained in experiment 2 with Kappa 0 (table 4.3), while the best precision is obtained in experiment 1 with Kappa 2 (table 4.2). The results for both the location and person categories are in our opinion good enough for the purpose of building and maintaining good-quality gazetteers after manual supervision. However, the results obtained for the organisation class are very low. This is mainly due to the high interaction between this category and location, combined with the near absence of
            LOC                       ORG                       PER
k     P      R      Fβ=1       P      R      Fβ=1       P      R      Fβ=1
0     62.88  96.03  76.00      16.17  20.00  17.88      43.19  84.74  57.22
2     77.68  89.60  83.21      13.95  10.90  12.24      46.10  62.71  53.14

Table 4.3: Experiment 2. Results applying the is_instance and is_in_wordnet heuristics

              Guessed
Tagged    NONE    LOC    ORG    PER
NONE      2777    33     1      11
LOC       175     229    0      0
ORG       52      1      2      0
PER       163     1      0      72

Table 4.4: Error analysis (experiment 1, k=2)
traditional entities of the organisation type, such as companies, in the Simple Wikipedia. This interaction can be seen in the in-depth results presented below. In order to clarify these results, we present more detailed data in tables 4.4 and 4.5. These tables present an error analysis, showing the false positives, false negatives, true positives and true negatives among all the categories for the configuration that provides the highest recall (experiment 2 with Kappa 0) and for the one that provides the highest precision (experiment 1 with Kappa 2). In tables 4.4 and 4.5 we can see that the interaction between classes (occurrences tagged as belonging to one class other than NONE but guessed as belonging to a different class other than NONE) is low. The only case where it is significant is between location and organisation. In table 4.5 we can see that 12 entities tagged as organisation are classified as location, while 20 tagged as organisation are guessed correctly. Likewise, 5 entities tagged as location were classified as organisation. This is due to the fact that countries and related entities such as “European Union” can be considered either organisations or locations depending on their role in a text.
          Guessed
Tagged    NONE   LOC   ORG   PER
NONE      2220   196   163   243
LOC          8   387     5     4
ORG         20    12    20     3
PER         30     9     2   195
Table 4.5: Error analysis (experiment 2, k=0)
Chapter 5 Key elements to overcome the knowledge acquisition bottleneck
The knowledge acquisition bottleneck is one of the major challenges that the field of Computational Linguistics faces nowadays. The state of the art regarding this topic, with special emphasis on the enrichment of existing LRs, has already been introduced (see 2.2). In chapter 4 we presented an initial approach to automatically acquiring plain dictionaries of NEs from Wikipedia. In this chapter we consider the problem from a more abstract perspective and try to identify solutions that could lead to a step forward. Instead of a narrow analysis of state-of-the-art algorithms in search of ways to improve their results, we depart from a higher level: we look at the current situation of the Computational Linguistics area, as well as its neighbouring areas, and try to identify key elements that it is sensible to exploit for knowledge acquisition. In the next chapter we will apply the solutions found at this abstract level to the specific case of the acquisition of NE resources.

Nowadays there is a huge amount of digital information available at different levels of formalisation: structured information (for example in LRs), semi-structured information (such as the emerging Web 2.0 sources) and unstructured information (plain text). The challenge therefore consists in finding efficient and accurate mechanisms to acquire, integrate and relate these different sources so as to automatically generate new resources from them. From the analysis carried out we have identified three key elements: LRs, Web 2.0 resources and standards. The next sections motivate the choice of these three elements and discuss how knowledge acquisition could benefit from them.
5.1 Language Resources Content to Support Their Enrichment
During the last two decades a lot of effort has been devoted to the construction of LRs. In Europe, some large projects have been entirely dedicated to the development of LRs, e.g. PAROLE, PAROLE-2, SIMPLE, EWN. Because of the specific nature and high difficulty of this task, it has been carried out by highly qualified personnel. The outcome is therefore a set of high-quality LRs with valuable knowledge covering the different levels of language (see chapter 3 for a description
of the LRs used in this thesis). However, except for WordNet, the big effort devoted to LRs has not paid off, as the knowledge contained in them has not been widely exploited by the research community. The reasons are of a diverse nature (Choukri & Odijk, 2009), including, among others, political reasons, reasons related to distribution, and technical reasons. Regarding the technical reasons, those that have had a negative impact on the present research are format errors found in LRs and the lack of automated access methods. The first is related to the concept of formal validation, a phase in the development of LRs in which the data is checked for formal errors. Despite being very important for automatic applications of LRs, it has not been systematically applied in LR development, and it is only lately that it has emerged as a key phase in LR building (Odijk & Mariani, 2009). The second is the lack of libraries that provide automatic access to LRs. In fact, we needed to develop APIs to provide automatic access to some of the LRs covered in this thesis (see appendix A). A further step concerning automatic access that has been gaining strength in recent years is the logical formalisation of the LRs themselves. In this way LRs can be used to support logical reasoning. LRs such as WordNet (van Assem et al., 2006) and MeSH (Soualmia et al., 2004) have been converted into formal languages (mainly Web Ontology Language (OWL)). The SIMPLE ontology has been formalised and enriched with lexicon information generalised in a bottom-up fashion (Toral et al., 2007; Toral & Monachini, 2007). As we have seen, the knowledge present in LRs is of high quality due to the way in which they are built, and it can thus be exploited for NLP tasks in general and for knowledge acquisition in particular.
The problems that LRs present for such an application have been identified and can be solved without much effort, although it would be much better if the phases necessary to prevent these problems from appearing were included in the LR development process itself. Whilst LRs are of high quality, their size is usually moderate. This contrasts with another kind of resource, the so-called New Text, whose size is very large but whose content is less reliable (see section 3.7). Another feature that distinguishes these two kinds of resources is the very manner in which they are populated: while LRs, being developed by a small number of experts, follow an authoritative process, New Text sources are built by large communities following a decentralised and collaborative approach. Because of these complementary characteristics, we hypothesise that
combining both kinds of resources might bring a relevant advance for knowledge acquisition. The next section motivates the use of New Text sources within knowledge acquisition.
5.2 Exploiting New Text for Broad Coverage and High Quality Extraction
Up to now, research devoted to the automatic population of LRs has mostly focused on extracting the required information from two kinds of sources: MRDs and raw corpora. However, both present disadvantages. While MRDs are small in size and thus limit the quantity of information that can be extracted, corpora consist of unstructured text and therefore make it harder to extract valuable information. Relations found in unrestricted text tend to be subjective judgements compared to the more established statements present in dictionaries and encyclopaedias (Hearst, 1998). This is in line with a study that analysed the Wall Street Journal Treebank Corpus, dividing it into opinion pieces (editorials, letters to the editor, arts & leisure reviews and viewpoints) and non-opinion pieces (the rest) (Wiebe et al., 2004). The authors discovered that 70% of the sentences in opinion pieces are subjective and 30% objective, whereas in non-opinion pieces only 44% of the sentences are subjective and 56% are objective. Therefore, unless some post-processing is carried out, methods that extract semantic relations from these kinds of textual sources are not appropriate for an automatic acquisition process. A possible solution to this problem is to create subjective and objective sentence classifiers (Wiebe & Riloff, 2005). Nevertheless, the results obtained are far from perfect; the best classifier, which is supervised, obtains 76% accuracy, while the best unsupervised one achieves 73.8%. In addition, corpus-based methods might, if no special treatment is applied, acquire the same instance under different lexical forms (Fleischman et al., 2003) (e.g. Bill Clinton and William Clinton) and therefore include them as different instances in the created resource.
In Wikipedia, by contrast, if an instance that constitutes an entry can be referred to by different lexical forms, all of them are linked, so it is straightforward to extract them as different strings referring to the same entity (e.g. William Clinton links to Bill Clinton). New types of text, the so-called New Text, have emerged as a
consequence of the appearance of new forms of communication (see section 3.7). We are interested in using these kinds of sources because (i) they tend to have some degree of structure, which facilitates the extraction of valuable information, and (ii) they are dynamic and thus a sensible source for guaranteeing up-to-date information. Making use of these new kinds of information could present important advantages for Information Extraction compared to the aforementioned kinds of sources. New types of sources such as folksonomies (aka social tagging) and wikis contain semi-structured semantic information (categorisation tags, interlingual and multilingual links, attribute-value tables, etc.) that is useful not only to recognise the elements to be extracted but also to disambiguate and normalise them. Besides, these sources are dynamic, thus changing with time, and, because they are collaboratively built, they reflect language variety. The challenge consists of adapting state-of-the-art extraction techniques in order to derive the maximum benefit from these new kinds of sources. Compared to research that relies on corpora, the use of New Text avoids problems due to subjective judgements1 and inconsistencies due to instances being named in different manners, whereas compared to research that uses MRDs, it avoids the limitation imposed by the small size of the input resource. One of these new kinds of text is known as the wiki. Wikis are collaborative websites that allow users both to read and to change the contents. These characteristics make them an effective tool for collaborative authoring. The most widely known example of a wiki resource is Wikipedia (see section 3.7).
Wikipedia could be an interesting textual source for the automatic creation of LRs because, being an encyclopedia, it contains facts covering the entire range of human knowledge and, as it is developed by a large number of people2, it reflects the variations of language and human thought. Table 5.1 compares the relevant characteristics of corpora, MRDs and Wikipedia for their application to knowledge acquisition. Taking into account the four features considered (structure, subjectivity, size and nature), Wikipedia emerges as the resource offering the best trade-off.
1 Specifically, Wikipedia, being an encyclopaedia and having strong policies regarding neutrality, does not suffer from such problems.
2 On 2008/03/11 the English version had 9,141,485 registered users.
Table 5.1: Comparison of corpora, MRDs and Wikipedia

              corpora  MRDs    Wikipedia
structure     none     high    medium
subjectivity  high     low     low
size          big      small   big
nature        static   static  dynamic
5.3 Standards for Building Generic Information Repositories
Apart from the knowledge bottleneck, another important problem in the field has to do with interoperability. Solving the latter could, in turn, have a positive effect on solving the former. The lack of long-term planning in the field has led to a variety of LRs in different, often incompatible, formats, aimed at specific subfields. Interoperability has recently come to be recognised as perhaps the most pressing current need in language processing research (Ide & Pustejovsky, 2009). Interoperability is especially critical at this time because of the widely recognised need to create and merge annotations and information at different linguistic levels in order to study interactions and interleave processing at these levels. It has also become critical because new data and tools for emerging and strategic languages such as Chinese and Arabic, as well as minor languages, are in the early stages of development. Recognition of the urgency of interoperability of LRs and tools is apparent in the recent flurry of activity within the community aimed at achieving this goal (see 2.4 for details). Like (Nyberg, 2009), we see resource interoperability as "a first step in a general trend toward greater sharing and reuse of components in language technologies applications". In fact, an added value of our proposal is the use of standards, both to make the procedures more generic and independent from the specific resource(s) used and to improve the interoperability and future sharing of different LRs. Concerning this matter, we study the use of LMF as the representation format of the resulting NE resource. This choice, in turn, facilitates the use of the produced resource in NLP tasks.
Chapter 6 Acquisition of a multilingual NE lexicon
In this chapter we explain thoroughly the procedure followed to derive a lexicon of NEs from existing LRs and Wikipedia. It consists of several sequential phases, which we will refer to as: mapping, disambiguation, extraction, NE identification and post-processing. A graphic depicting the overall process is presented in figure 6.1. As previously stated, the method puts into practice the three elements identified in the analysis carried out in chapter 5; our approach (i) takes advantage of information already present in LRs, (ii) exploits the semi-structured nature of New Text and (iii) integrates the acquired information making use of standards in order to guarantee interoperability.
Figure 6.1: Method diagram

Our method maps the noun is-a hierarchy of LRs to Wikipedia categories, disambiguates any ambiguous mappings, extracts the articles present in the mapped categories and identifies which of them are NEs. Several pieces of information about the NEs, such as written variants, definitions, etc., are introduced into a NE lexicon. In a post-processing phase, (i) additional NEs are extracted by exploiting the interlingual links of Wikipedia and (ii) the extracted NEs are linked to ontologies. The following subsections deal with each phase. Afterwards we present the structure of the resulting NE lexicon.
6.1 Mapping
In this first step the instantiable nouns present in the LRs are mapped to Wikipedia categories. These mappings are obtained by comparing the lemmas of the nouns from the LR to those of the Wikipedia categories. In order to do this, the categories of Wikipedia are lemmatised with Freeling (Atserias et al., 2004), as this tool provides PoS-tagging machinery for the different languages considered (English, Spanish and Italian). Once we have the subset of nouns that are instantiable and the categories have been lemmatised, we map the LR nouns to Wikipedia categories by matching their lemmas. For example, the noun “country” would be mapped to the category “Countries” as the PoS-tagger would obtain the same lemma for both words.
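The lemma-matching step can be sketched as follows. The toy lemmatiser below merely stands in for Freeling, and the data structures are illustrative, not those of the actual implementation.

```python
def map_nouns_to_categories(lr_nouns, categories, lemmatise):
    """Map LR nouns to Wikipedia categories by comparing lemmas.
    `lemmatise` stands in for the Freeling lemmatiser used in the thesis."""
    mapping = {}
    for category in categories:
        lemma = lemmatise(category)
        if lemma in lr_nouns:
            mapping[lemma] = category
    return mapping

# toy lemmatiser covering only the worked example; Freeling does this properly
def toy_lemmatise(word):
    return {"countries": "country"}.get(word.lower(), word.lower())
```

With `lr_nouns = {"country"}` and the categories `["Countries", "Philosophy"]`, the noun "country" would be mapped to the category "Countries", mirroring the example above.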
6.2 Disambiguation
Once the nouns of the LR have been mapped to categories, a further mandatory step must be carried out for those nouns that are polysemous: the sense that corresponds to the mapped category should be identified. We have devised two different approaches to do this automatically. The first looks for common instances in the hyponym trees of both the noun senses and the category, while the second performs text similarity between the definitions of the noun senses and the category. The following subsections present both approaches.
6.2.1 Instance Intersection
We hypothesise that instances could be useful to disambiguate nouns from LRs with respect to Wikipedia categories. For a polysemous noun that is mapped to a category, we consider the instances of each of its senses and check whether any of these instances is also present as an article belonging to the category. If an instance is found both in the LR and in the mapped category, then the sense chosen is the one that contains this instance. E.g. the English word “obelisk” is mapped to the category “Obelisks”. It has two senses in WordNet (1. stone pillar, 2. character used in printing). The first sense has one instance (“Washington Monument”) while the second has none. In the category “Obelisks” we find the instance “Washington Monument”.
Thus, the sense chosen for the mapping would be the first one. As the taxonomy of Wikipedia is usually deeper than that of WordNet, we look for instances not only in the mapped categories but also in their hyponyms (subcategories). However, the subcategory relation in the category taxonomy of Wikipedia does not always follow the hyponymy relation1. Therefore, in order to exploit subcategories, we need to identify whether or not they are hyponyms. We propose to apply regular expression patterns which can hold both lexical and Part-of-Speech elements. If a subcategory matches a pattern then it is considered a hyponym. From studying the category structure of Wikipedia, we came up with the following patterns for English (for each pattern we provide an example of a matching subcategory for the category “Philosophers”):

• ^category " by ", e.g. “philosophers by nationality”
• ^category " (in|from) "
• ^category " of ", e.g. “philosophers of mind”
• ^category " stubs"$, e.g. “philosophers stubs”
• ^(JJ|JJR|NN|NP)+ (CC (JJ|JJR|NN|NP)+)* category$, e.g. “Spanish philosophers”

As an example, we show how the word philosopher (1. specialist in philosophy, 2. wise person who is calm and rational) is disambiguated with respect to the category “Philosophers”. The first sense contains several instances, such as “Averroes”, while the second contains none. “Averroes” is not present in the mapped category but it is found in a subcategory that follows the hyponymy relation (“Philosophers” -> “philosophers by nationality” -> “Spanish philosophers”).
1 E.g. in the category “Philosophers” there are subcategories that follow the hyponymy relation (e.g. “Philosophers by country”) but there are also others that do not (e.g. “Philosophy academics”).
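A minimal sketch of the instance-intersection idea follows; the data structures are toy stand-ins for the WordNet senses and the Wikipedia category (plus hyponym subcategory) articles that the real procedure traverses.

```python
def disambiguate_by_instances(sense_instances, category_instances):
    """Return the first sense whose LR instances also occur among the
    articles of the mapped category (or its hyponym subcategories)."""
    for sense, instances in sense_instances.items():
        if instances & category_instances:
            return sense
    return None

# the "obelisk" example from the text
obelisk_senses = {
    "obelisk#1 (stone pillar)": {"Washington Monument"},
    "obelisk#2 (printing character)": set(),
}
```

With the articles of the category “Obelisks” containing “Washington Monument”, the first sense is selected, as in the worked example above.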
Equivalent patterns have also been built for the other languages considered, i.e. Spanish:

• ^category " por ", e.g. “Filósofos por idioma”
• ^category " (de|del|en) ", e.g. “Filósofos de la Edad Antigua”
• ^"Wikipedia:esbozo " category$, e.g. “Wikipedia:Esbozo filósofos”
• ^category " " (AQ[0-9A-Z]+|N[0-9A-Z]+)+ (CC|SP[0-9A-Z]+ (AQ[0-9A-Z]+|N[0-9A-Z]+)+)*$, e.g. “Filósofos árabes”

and for Italian:

• ^category " per ", e.g. “Filosofi per secolo”
• ^category " (di|del|dell'|della|delle|degli) ", e.g. “Filosofi del XX secolo”
• ^"stub " category$, e.g. “stub Filosofi”
• ^category " " (AQ[0-9A-Z]+|N[0-9A-Z]+)+ (CC|SP[0-9A-Z]+ (AQ[0-9A-Z]+|N[0-9A-Z]+)+)*$, e.g. “Filosofi atei”
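The English patterns can be approximated with plain regular expressions. This sketch drops the PoS constraints of the thesis patterns (the last pattern accepts any single premodifying word instead of checking for JJ/NN/NP tags), so it is only an illustration.

```python
import re

def is_hyponym_subcategory(category, subcategory):
    """Rough lexical approximation of the English hyponymy patterns."""
    c = re.escape(category.lower())
    s = subcategory.lower()
    patterns = [
        rf"^{c} by ",        # "philosophers by nationality"
        rf"^{c} (in|from) ",
        rf"^{c} of ",        # "philosophers of mind"
        rf"^{c} stubs$",     # "philosophers stubs"
        rf"^[\w-]+ {c}$",    # "Spanish philosophers" (PoS check omitted)
    ]
    return any(re.search(p, s) for p in patterns)
```

For instance, “Philosophers by nationality” and “Spanish philosophers” are accepted as hyponyms of “Philosophers”, while “Philosophy academics” is rejected.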
6.2.2 Text similarity
The second disambiguation approach relies on the definitions of the mapped elements: it applies text similarity to identify the correct sense of the polysemous nouns mapped to Wikipedia categories. For each such noun, it computes the similarity between the gloss of each of its senses and the abstract of the mapped category. As there are different approaches to computing text similarity, we have decided to consider a set of representative methods in order to find out which works best for the current task:

• A TE system.
• A LR-based algorithm which applies Personalised PageRank to WordNet.
• A Latent Semantic Analysis-like algorithm based on Random Projection.

The next paragraphs present each of these systems. We also present two baseline systems against which the aforementioned ones will be compared in order to evaluate them. Finally, we introduce three combinations of the different approaches based on voting, unsupervised and supervised schemes.

Textual Entailment

TE has been defined as a generic framework for modelling semantic variability, which appears when a concrete meaning is described in different manners, as proposed by (Dagan & Glickman, 2004). Therefore, semantic similarity is addressed by defining the concept of TE as a one-way meaning relation between two snippets. Moreover, a series of workshops, the Recognising Textual Entailment (RTE)2 challenges and the Answer Validation Exercise (AVE)3 competitions, have recently been proposed with the objective of providing suitable frameworks to evaluate TE systems (Rodrigo et al., 2008; Giampiccolo et al., 2008). To address the specific semantic similarity phenomenon we are dealing with in this research work (i.e. semantic similarity between WordNet glosses and Wikipedia categories), we used the TE system presented in (Ferrández et al., 2007a; Balahur et al., 2008). This system has previously been used to support other NLP applications beyond pure TE tasks, for instance QA (Ferrández et al., 2008) and automatic text summarisation (Lloret et al., 2008). As a brief system overview, it is worth mentioning the most relevant inferences implemented, aimed at solving entailment relations:4

• Lexical inferences based on lexical distance measures.
For instance, the Needleman-Wunsch algorithm, the Smith-Waterman algorithm, a matching of consecutive subsequences, Jaro distance, Euclidean distance, Inverse Document Frequency (IDF) specificity based on word frequencies extracted from corpora, etc.

2 http://pascallin.ecs.soton.ac.uk/Challenges/RTE/
3 http://nlp.uned.es/clef-qa/ave/
4 Further details of these inferences can be found in the referenced papers.
• Semantic inferences focused on semantic distances between concepts. These inferences implement several well-known WordNet-based similarity measures, verb similarities based on the relations encoded in VerbNet5 (Kipper et al., 2006) and VerbOcean6 (Chklovski & Pantel, 2004), and reasoning about named entity correspondences between texts.

For the final application of the system to the target task of this work, we adapted it in order to manage bidirectional meaning relations. Linking WordNet glosses to Wikipedia categories is not a clear-cut entailment phenomenon: it can occur that the gloss is implied by the Wikipedia category, that the category is deduced from the gloss, or that the entailment holds in both directions. Therefore, to handle these situations we opted for computing the average of the system outputs for the two unidirectional relations, as shown in equation 6.1:

Sim(Gloss_i, Cat_j) = [sim(Gloss_i → Cat_j) + sim(Cat_j → Gloss_i)] / 2    (6.1)
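Equation 6.1 amounts to averaging the two directional scores; a one-line sketch, where `sim` is a placeholder for the TE system's directional score function:

```python
def bidirectional_similarity(sim, gloss, cat):
    """Average of the two unidirectional entailment scores (equation 6.1)."""
    return (sim(gloss, cat) + sim(cat, gloss)) / 2
```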
Personalised PageRank over WordNet

WordNet is a lexical database of English which groups nouns, verbs, adjectives and adverbs into sets of synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by conceptual-semantic and lexical relations, including hypernymy, meronymy, causality, etc. Given a pair of texts and a graph-based representation of WordNet, our method has basically two steps: we first compute the Personalised PageRank over WordNet separately for each of the texts, producing a probability distribution over WordNet synsets; we then compare how similar these two discrete probability distributions are by encoding them as vectors and computing the cosine between them. We represent WordNet as a graph G = (V, E) as follows:

• Graph nodes represent WordNet concepts (synsets) and dictionary words.
• Relations among synsets are represented by undirected edges.
5 http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
6 http://demo.patrickpantel.com/Content/verbocean/
• Dictionary words are linked to their associated synsets by directed edges.

For each text in the pair we first compute a personalised PageRank vector over the graph G (Haveliwala, 2002). Basically, personalised PageRank is computed by modifying the random jump distribution vector in the traditional PageRank equation; in our case, we concentrate all the probability mass on the words present in the target text. Regarding the PageRank implementation details, we chose a damping value of 0.85 and finish the calculation after 30 iterations. We have not optimised these values for this task. We used all the relations in WordNet 3.07, including the disambiguated glosses8. This similarity method was applied to word similarity in (Agirre et al., 2009), which reports very good results on word similarity datasets.

Semantic Vectors

Semantic Vectors (Widdows & Ferraro, 2008)9 is an open source (Berkeley Software Distribution (BSD) license) software package that creates WORDSPACE models from plain text. Its aim is to provide an easy-to-use and efficient tool that can suit both research and production users. It uses a random projection algorithm to perform dimensionality reduction, as this is a simpler and more efficient technique than alternatives such as Singular Value Decomposition. It relies on Apache Lucene10 for tokenisation and indexing in order to create a term-document matrix. Once the reference corpus has been tokenised and indexed, Semantic Vectors creates a WORDSPACE model from the resulting matrix by applying random projection. For the current task we have gathered a corpus made up of WordNet glosses and Wikipedia abstracts. On the one hand, it contains the glosses of all the synsets present in WordNet 2.1, i.e. 117,598 glosses; on the other, it contains the abstracts of all the entries present in a Wikipedia dump obtained in January 2008, i.e. 2,179,275 abstracts. The final corpus has 1,292,447 terms.
7 Available from http://wordnet.princeton.edu/
8 http://wordnet.princeton.edu/glosstag
9 http://code.google.com/p/semanticvectors
10 http://lucene.apache.org
Semantic Vectors provides a class (CompareTerms) that calculates the similarity between two terms (which can be words or texts); we have used this directly in our experiments.

Baselines

We provide two baselines, based on sense predominance and word overlap. First sense follows the assumption that the senses in WordNet are ordered according to their usage predominance (i.e. the first sense is the most frequent). First Sense always chooses the first WordNet sense as the one corresponding to the mapped Wikipedia category. Wikipedia being a general resource, it is expected that the words that identify categories refer to their most common sense; e.g. it is really unexpected that the category “Banks” would refer to the sense “a flight manoeuvre” (tenth and last sense of the noun bank in WordNet). Word overlap calculates the similarity between two texts by counting the number of overlapping words. In order to do this we have used the software package Text::Similarity11. This method has been applied both considering all the words that appear in the texts and discarding stop words. For the latter, we have used the stop word list of the English stemmer Snowball12.

Combinations

Because the methods presented belong to different paradigms, we hypothesise that their results could be complementary, and we therefore consider it sensible to study possible combinations of them. The first step in this direction has been the construction of an optimal combination, which we refer to as the oracle. Given the outputs of the different systems and the gold standard, for each instance the oracle outputs the senses present in the gold standard whenever any of the systems returns them. The oracle thus represents an upper bound: the optimum result that could be obtained by combining the different systems.
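The oracle can be computed as follows; the data structures (instance to returned sense, instance to gold-standard senses) are illustrative, not those of the actual evaluation code.

```python
def oracle_output(system_outputs, gold):
    """Optimal combination: for each instance, output the gold-standard
    senses that at least one of the systems returned."""
    result = {}
    for instance, gold_senses in gold.items():
        returned = {out[instance] for out in system_outputs if instance in out}
        hits = returned & set(gold_senses)
        if hits:
            result[instance] = sorted(hits)
    return result
```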
Once we have gained an insight into the improvement that could be achieved by combining the diverse systems, we propose three combination strategies:

• Voting. For each mapping it ranks senses according to the number of times they are returned by the different systems being combined.
11 http://text-similarity.sourceforge.net
12 http://snowball.tartarus.org/algorithms/english/stop.txt
Finally, it outputs the first-ranked sense. Voting returns more than one sense if two or more senses are ranked first with the same score.

• Unsupervised combination. In this combination all the methods taken into account have the same relevance: a simple average is computed over the outputs of the considered methods (i.e. TE, the WordNet-based method, Semantic Vectors and/or Word Overlap). The value returned by the average function is associated with the corresponding Wikipedia category-WordNet sense pair.

• Supervised combination. The whole set of inferences carried out by the TE system, together with the scores returned by the WordNet-based, Semantic Vectors and/or Word Overlap methods, are used as features for a machine learning algorithm. We specifically used the BayesNet implementation provided by Weka (Witten & Frank, 2005), and we obtained 10-fold cross-validation results over our gold standard corpus.
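The voting scheme can be sketched as follows; each inner list holds the sense(s) one system returned for the mapping, and ties for the top count yield more than one output sense.

```python
from collections import Counter

def vote(per_system_senses):
    """Rank senses by the number of systems returning them; output all
    senses tied for the top count."""
    counts = Counter(s for senses in per_system_senses for s in senses)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(s for s, c in counts.items() if c == top)
```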
6.3 Extraction
Once the mapping has been carried out, NEs can be extracted from the mapped categories. For each mapped category we extract all of its subcategories which are hyponyms (see section 6.2.1). From the resulting set of categories (i.e. the mapped category plus all its hyponyms), we obtain the articles they contain and identify which of them are NEs, as explained in 6.4. From the articles which are NEs we gather further relevant information, such as their abstracts and their redirects (this information is explicitly available in structured form from the Wikipedia database dumps, and thus obtaining it is straightforward). Finally, all this information is uploaded to the NE lexicon (see section 6.6). Let us take the example of the mapped category “Countries”. First, the procedure obtains all the hyponym subcategories: “Fictional countries”, “Countries by language”, “Arabic-speaking countries”, etc. Subsequently, all the articles from the resulting category set are extracted: “Neverland”, “Algeria”, “Fictional country”, etc., and only those which are NEs are kept (in this example “Fictional country” would be discarded). From the articles that are NEs we gather other information; from “Neverland” we would get the redirects “Never Land” and “Never Never Land”
and the abstract “Neverland (also spelled Never Land or expanded as Never Never Land) is a fictional world featured in the works of J. M. Barrie and those based on them”.
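The extraction step can be sketched as a traversal of the mapped category and its hyponym subcategories; all data structures and the `is_hyponym`/`is_ne` predicates below are toy stand-ins for the pattern matching of section 6.2.1 and the NE identification of section 6.4.

```python
def extract_nes(category, subcats, articles, is_hyponym, is_ne):
    """Collect the mapped category plus its hyponym subcategories, then
    keep only the contained articles identified as NEs."""
    selected, todo = {category}, [category]
    while todo:
        cat = todo.pop()
        for sub in subcats.get(cat, []):
            if sub not in selected and is_hyponym(category, sub):
                selected.add(sub)
                todo.append(sub)
    found = {a for c in selected for a in articles.get(c, [])}
    return {a for a in found if is_ne(a)}
```

On the toy version of the “Countries” example, “Fictional country” is filtered out while “Algeria” and “Neverland” are kept.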
6.4 NE identification
We have explored three different possibilities for identifying which of the extracted articles are NEs. The first relies on a web search engine, the second on the content of Wikipedia articles, while the third combines the two. All three share a common aspect, though: they exploit the capitalisation norms followed in some languages, i.e. that proper nouns begin with uppercase while common nouns begin with lowercase. A detailed explanation of each of them follows.
6.4.1 Web Search
The article’s title is searched in the World Wide Web using a web search engine. The first 50 results in which the title is found are returned, and an algorithm counts the number of times the article’s title appears in the result’s description (i) with all words beginning with a capital letter, (ii) with some words beginning with a capital letter and (iii) with no word beginning with a capital letter. A threshold is then established in order to distinguish between articles that are instances and those that are not, according to the different capitalisation profiles.
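The capitalisation count can be sketched as follows. The snippet data and the 0.75 threshold are illustrative assumptions of this sketch; the thesis establishes its threshold empirically from the first 50 search results.

```python
import re

def ne_by_web_capitalisation(title, snippets, threshold=0.75):
    """Classify `title` as a NE when the share of snippet occurrences with
    every word capitalised reaches `threshold` (illustrative value)."""
    pattern = re.compile(re.escape(title), re.IGNORECASE)
    total = all_capitalised = 0
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            total += 1
            if all(w[:1].isupper() for w in match.group(0).split()):
                all_capitalised += 1
    return total > 0 and all_capitalised / total >= threshold
```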
6.4.2 Wikipedia search
This approach also takes advantage of orthographic features (capitalisation norms), as the previous one does, but instead of looking for occurrences of the article’s title in the World Wide Web, it looks for them in the body of the article, following (Bunescu & Pasca, 2006). The difference is that our method is language independent thanks to the use of Wikipedia’s interlingual links. For a given Wikipedia article title, whatever its language, we obtain its equivalents in a set of ten languages that follow the aforementioned capitalisation rules (Catalan, Dutch, English, French, Italian, Norwegian, Portuguese, Romanian, Spanish and Swedish). Apart from the language independence,
considering the article in ten languages presents another important advantage: the amount of text in which we look for occurrences of the article’s title is bigger, hence the results are more representative. In order to obtain the article’s title in each of these languages we use the interlingual links of Wikipedia, which connect the same article across different languages. In the bodies of the article translations, we look for occurrences of the article title and compute the percentage of times it begins with uppercase. Finally, as in the previous approach, if the percentage is higher than an empirically established threshold then the article title is classified as a NE. From a technical point of view, it is worth mentioning that the article bodies of Wikipedia are not in plain text but in the mediawiki markup format and thus are not directly processable by text tools. In order to carry out this procedure, we first transformed the Wikipedia article bodies into plain text using two Perl modules, Text::MediawikiFormat13 and html2text14. All in all, this approach presents two advantages over the previous one:
• Language independence. Whatever the language we apply these procedures to, we can obtain the Wikipedia entry titles for the languages which follow the aforementioned capitalisation norms.

• Avoidance of sense variation. A problem of the previous method is that some nouns have senses in which they are instances and others in which they are classes. If an extracted article is a NE but also has a class sense, the method could fail to classify it as a NE, since on the Web we would find both senses. E.g. the Wikipedia article "Children's Machine" is a NE referring to a laptop developed by the OLPC (acronym of One Laptop Per Child). However, this term can also be found in the string "The children's machine", the title of a book by Seymour Papert in which "children" and "machine" are classes. With the new method we look for "Children's Machine" in the body of its own article, where it is very unlikely that this string refers to the book.
¹³ http://search.cpan.org/~dprice/Text-MediawikiFormat-0.05/lib/Text/MediawikiFormat.pm
¹⁴ http://search.cpan.org/~awrigley/html2text-0.003/html2text.pl
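The core of the procedure can be sketched as follows, assuming the article bodies have already been converted from MediaWiki markup to plain text and the interlingual titles already collected; all names are illustrative.

```python
import re

def uppercase_ratio(titles_by_lang, bodies_by_lang):
    """Pool occurrences of the article title over all language versions
    (linked via Wikipedia's interlingual links; bodies assumed already
    converted to plain text) and return the fraction of occurrences
    that begin with an uppercase letter."""
    total = upper = 0
    for lang, title in titles_by_lang.items():
        body = bodies_by_lang.get(lang, "")
        for m in re.finditer(re.escape(title), body, re.IGNORECASE):
            total += 1
            if m.group(0)[0].isupper():
                upper += 1
    return upper / total if total else 0.0

def is_named_entity(titles_by_lang, bodies_by_lang, threshold=0.95):
    """Classify as NE when the pooled uppercase ratio exceeds the
    empirically established threshold (0.95 is illustrative)."""
    return uppercase_ratio(titles_by_lang, bodies_by_lang) > threshold
```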
6.4.3 Combining Wikipedia and the Web
While searching for occurrences of the title in Wikipedia avoids noise due to possible sense variation, the web method presents an important advantage: the amount of text available is considerably larger, and therefore more occurrences can be found. The advantages of the two approaches can be combined by extracting salient terms from the body text of the article in Wikipedia (we apply the tf-idf measure) and then searching the Web for pages where both the entry title and these terms appear. Our combination method therefore consists of the web search method refined with significant terms from Wikipedia. Following the example presented for the previous method (the entry "Children's Machine"), of the first ten results from Google, six correspond to the computer and the remaining four to Papert's book. However, if we extract significant terms from the body text of the Wikipedia entry, such as "OLPC" and "$100 laptop", and search for the three terms together in Google, then all of the first ten results correspond to the computer.
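A minimal sketch of this combination, with tf-idf computed from scratch against a small background corpus (in practice the rest of the Wikipedia dump would play that role); the function names and the query format are illustrative assumptions.

```python
import math
import re
from collections import Counter

def top_tfidf_terms(article_text, corpus, k=2):
    """Return the k most salient terms of an article by tf-idf,
    computed against a background corpus of other article texts."""
    tokenize = lambda t: re.findall(r"\w+", t.lower())
    docs = [tokenize(d) for d in corpus]
    words = tokenize(article_text)
    tf = Counter(words)
    n = len(docs)

    def idf(w):
        # smoothed inverse document frequency
        df = sum(1 for d in docs if w in d)
        return math.log((1 + n) / (1 + df)) + 1

    scored = {w: (c / len(words)) * idf(w) for w, c in tf.items()}
    return [w for w, _ in sorted(scored.items(), key=lambda x: -x[1])[:k]]

def refined_query(title, article_text, corpus, k=2):
    """Web query combining the entry title with salient Wikipedia terms."""
    terms = top_tfidf_terms(article_text, corpus, k)
    return '"{}" {}'.format(title, " ".join(terms))
```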
6.5 Postprocessing
The aim of the postprocessing phase is to improve and extend the information extracted and introduced into the NE lexicon. Two different actions are carried out: (i) introducing additional NEs and (ii) linking NEs to ontologies. Additional NEs are introduced into the lexicon by exploiting Wikipedia's multilingual links. If a NE has been extracted for language a but its equivalent in language b has not, we gather the NE for language b and add it to the NE lexicon. Furthermore, we select from the NE lexicon the class(es) associated with the NE for language a in order to extract the equivalent class(es) from the LR for language b and add them to the lexicon. Links to ontologies present in some of the LRs can be exploited to connect the extracted NEs to them. On the one hand, the English WordNet has been linked to SUMO (Niles & Pease, 2003); on the other, the Italian PSC contains an ontology itself. Therefore, the extracted NEs are connected to these two ontologies.
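The first action can be sketched as follows; the data structures are illustrative simplifications, and the extraction of equivalent classes from the target-language LR is elided.

```python
def propagate_nes(lexicons, interlingual_links):
    """Add to each monolingual NE lexicon the NEs extracted for another
    language but missing for this one, following interlingual links.
    `lexicons` maps language -> {NE title: set of classes};
    `interlingual_links` maps (lang_a, title_a) -> {lang_b: title_b}."""
    added = []
    for (lang_a, title_a), equivalents in interlingual_links.items():
        if title_a not in lexicons.get(lang_a, {}):
            continue  # NE was never extracted for the source language
        for lang_b, title_b in equivalents.items():
            if lang_b in lexicons and title_b not in lexicons[lang_b]:
                # inherit the classes of the source-language NE; mapping
                # those classes onto the target-language LR is elided
                lexicons[lang_b][title_b] = set(lexicons[lang_a][title_a])
                added.append((lang_b, title_b))
    return added
```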
CHAPTER 6. ACQUISITION OF A MULTILINGUAL NE LEXICON
6.6 The Named Entity Lexicon
In order to make the procedures independent of specific LRs and to facilitate interoperability with other LRs, we provide a standard-compliant output format. The elements that are part of this output are mainly NEs, orthographic variants of these NEs and the classes to which these NEs belong (by means of "instance of" relations). Because this data can naturally be represented as a LR, and because we build a LR from it, we have decided to encode the output following LMF, an ISO standard for the representation of computational lexicons (see 3.5). A description of the LMF elements we have considered and their role in our NE lexicon follows. The "Lexicon" element holds each monolingual NE dictionary. "LexicalEntry" acts as a container for all the information regarding each NE and contains two child elements. The first, "Lemma", contains the lemma of the NE and its orthographic variant(s) (by making use of "FormRepresentation"). The second, "Sense", holds semantic information, which can be of two types: (i) relations to other lexical entries ("SenseRelation") and (ii) links to other resources ("MonolingualExternalRef"). Furthermore, we make use of the NLP multilingual notations extension of LMF (see 3.5.3) to create a multilingual lexicon in which NEs for different languages can be related by means of interlingual links. The LMF element employed for this purpose is the "SenseAxis"; it represents the relationships between closely related senses in different languages (each of these senses is contained in a "SenseAxisElements") and groups together monolingual senses that correspond to one another. Within a "SenseAxis" element we use the "InterlingualExternalRef" in order to link its elements to ontologies. As for the output structure, we have designed the NE lexicon as a database whose structure is compliant (isomorphic) with LMF. Figure 6.2 presents the Entity Relationship diagram of this database.
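A schematic sketch of how such an entry could be serialised, using the LMF element names described above. The attribute names (writtenForm, externalSystem, externalReference) and the synset identifier are illustrative assumptions, not the exact serialisation used in the thesis (for that, see appendix C).

```python
import xml.etree.ElementTree as ET

def add_lexical_entry(lexicon, lemma, variants=(), synset=None):
    """Build one LMF-style LexicalEntry under a Lexicon element:
    a Lemma with FormRepresentation children for the lemma and its
    orthographic variants, and a Sense optionally carrying a
    MonolingualExternalRef link to a LR synset."""
    entry = ET.SubElement(lexicon, "LexicalEntry")
    lem = ET.SubElement(entry, "Lemma")
    ET.SubElement(lem, "FormRepresentation", writtenForm=lemma)
    for v in variants:
        ET.SubElement(lem, "FormRepresentation", writtenForm=v)
    sense = ET.SubElement(entry, "Sense")
    if synset is not None:
        ET.SubElement(sense, "MonolingualExternalRef",
                      externalSystem="WordNet", externalReference=synset)
    return entry

lexicon = ET.Element("Lexicon", language="en")
# "wn-synset-id" is a placeholder identifier, not a real WordNet offset
add_lexical_entry(lexicon, "Rome", variants=["Roma"], synset="wn-synset-id")
```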
As an example of the information extracted, we provide the LMF compliant eXtensible Markup Language (XML) notation and the corresponding database entries for a NE for English, Spanish and Italian (see appendix C).
Figure 6.2: Database Entity Relationship diagram of the NE lexicon
Chapter 7 Evaluation
This chapter evaluates the different parts of the method introduced in the previous chapter.
7.1 Data
In the current research we use database dumps of the English, Italian and Spanish Wikipedia from January 2008.¹ From these dumps, we have used the page, pagelinks, categorylinks, text and abstract data. Table 7.1 shows the number of categories and articles for each of the Wikipedia dumps. With regard to LRs, we have used WordNet 2.1, Spanish WordNet 1.6 and PSC.

Table 7.1: Number of categories and articles in Wikipedia per language

Language   Categories   Articles
English    312,948      2,183,497
Italian    39,019       388,717
Spanish    45,888       305,366
Apart from the aforementioned Wikipedia dumps and LRs, some of the experiments rely on specific test data; in those cases, the datasets are described together with the experiments they were created for. Several of the experiments use the same set of measures, defined as follows:

Precision = (number of correct answers found by the system) / (number of answers found by the system)   (7.1)

Recall = (number of correct answers found by the system) / (number of answers in the corpus)   (7.2)

Fβ=1 = (2 × Precision × Recall) / (Precision + Recall)   (7.3)

Fβ=0.5 = ((1 + 0.5²) × Precision × Recall) / (0.5² × Precision + Recall)   (7.4)

¹ Downloaded from http://download.wikimedia.org
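These measures translate directly into code; a small reference implementation (function names are ours):

```python
def precision(correct, returned):
    """Correct answers found by the system / answers found by the system."""
    return correct / returned if returned else 0.0

def recall(correct, in_corpus):
    """Correct answers found by the system / answers in the corpus."""
    return correct / in_corpus if in_corpus else 0.0

def f_measure(p, r, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R);
    beta=0.5 weights precision twice as much as recall."""
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```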
7.2 Mapping
In order to map the English WordNet, we departed from its noun classes that contain instances (as we are interested in extending a LR with NEs, we decided to consider a set of instantiable nouns; clearly, if a noun is instantiated it is instantiable). Apart from this set of noun classes, following the inheritance principle of the hyponymy relation (i.e. if a noun class is instantiable, its hyponyms are too), we also consider the noun classes that are hyponyms of this set. For Spanish, as the EuroWordNet model does not include a specific type of relation for instantiation, rather than begin with the Spanish WordNet we are forced to start with the English one. From it we extract the nouns that contain instances and their hyponyms. We obtain the equivalent synset offsets in WordNet 1.6 by using the mapping sets between WordNet versions provided by (Daudé et al., 2003). From the set of 15,906 synsets with instances (and their hyponyms) present in WordNet 2.1, 2,140 cannot be mapped to WordNet 1.6 because no mapping is found, 13,640 have 1-to-1 mappings and 126 have 1-to-n (n > 1) mappings. In the last case the mapping with the highest score is preserved. We end up with a set of 13,278 instantiable synsets of WordNet 1.6. From these, exploiting the ILI, we obtain the corresponding 15,094 variants of the Spanish WordNet. 2,966 of them are proper nouns (e.g. "África", "Nuevo Testamento") and are therefore discarded. This leads us to a set of 12,128 variants, of which 7,739 are monosemous and the remaining 4,389 polysemous. Subsequently, these variants can be mapped to categories of the Spanish Wikipedia. For Italian, we proceed in an analogous manner. We departed from the set of instantiable nouns of the English WordNet. From this we obtained the equivalent synsets in WordNet 1.5, which are connected to the Italian WordNet through the ILI. Finally, from the Italian WordNet, we obtained the equivalent entries of PSC.
From the set of 15,906 synsets with instances (and their hyponyms) present in WordNet 2.1, 2,806 cannot be mapped to WordNet 1.5 because no mapping is found, 12,946 have 1-to-1 mappings and the remaining 154 have 1-to-n (n > 1) mappings. In the last case the mapping with the highest score is preserved. This leads to a set of 13,183 synsets of WordNet 1.5. Following the ILI we gather 12,488 corresponding synsets from ItalWordNet. Finally, exploiting the ItalWordNet-PSC mapping, we obtain 10,498 variants of the latter. After discarding instances we have 6,977 monosemous and 3,067 polysemous nouns.
Table 7.2 shows the mapping results obtained for English, Spanish and Italian. The table presents two types of results: columns nh, without considering hyponyms of the initial synsets, and columns h, considering them.

Table 7.2: Mapping for English, Spanish and Italian

                          English          Spanish          Italian
                          nh      h        nh      h        nh      h
Monosemous  Total         1,012   14,855   627     7,739    777     6,977
            Mapped        557     2,860    195     446      159     529
            Percentage    55.03   19.25    31.10   5.76     20.46   7.58
Polysemous  Total         628     6,903    473     4,389    386     3,067
            Mapped        282     1,429    103     490      103     358
            Percentage    44.90   20.70    21.77   11.16    26.68   11.67
The number of mapped nouns is notably lower for Spanish and Italian than for English, both for monosemous and polysemous nouns. This is expected because both the total numbers of monosemous and polysemous nouns and the numbers of Wikipedia categories (45,796 and 39,019 vs. 312,941) are substantially lower for these two languages. The percentages are considerably lower when considering hyponyms. This is expected, as in doing so we map very specific nouns from deep nodes of the LRs' taxonomy, for which it is less likely that a corresponding Wikipedia category exists (e.g. a category for the noun "sword" is likely to exist, but one for a more specific hyponym such as "rapier" is less likely). However, considering hyponyms boosts the total number of nouns mapped in all cases.
7.2.1 Mapping Analysis
We present an analysis of the mapping results for English (column nh in table 7.2). Table 7.3 shows the percentages of monosemous words, polysemous words and synsets that get mapped to Wikipedia categories for three different dumps, from April 2007, November 2007 and January 2008. As can be seen, the continuous growth of Wikipedia allows us to increase the mapping percentage: 57.44% of the synsets were mapped to the April 2007 dump; this percentage increases to 60.02% for the November 2007 dump and to 65.39% for the January 2008 dump (the one we are currently working with).
Table 7.3: WordNet class nouns to Wikipedia categories mapping percentages

                          Wikipedia dump date
                          200704    200711    200801
Monosemous  Total         1,012
nouns       Mapped        491       509       557
            Percent.      48.51%    50.29%    55.03%
Polysemous  Total         628
nouns       Mapped        249       265       282
            Percent.      39.64%    42.19%    44.90%
Synsets     Total         893
            Mapped        513       536       584
            Percent.      57.44%    60.02%    65.39%
In order to get a better understanding of the mapping procedure, we have manually analysed a randomly selected set of WordNet classes which do not get mapped to any Wikipedia category. In most of the cases (75%), although there is no matching category, there is a matching article in Wikipedia to which the class could be mapped. E.g. "oracle" could be mapped to the article "Oracle". In 13% of the cases there is neither a matching category nor a matching article (e.g. "formal garden"). In 10% of the cases there is a matching category, but the class is not mapped to it due to a PoS tagger error. E.g. the class "aquarium" is not mapped to the category "Aquaria" because the tagger fails to obtain "aquarium" as the lemma of "Aquaria". The remaining 2% is due to the class and the matching category being in different English variants. E.g. the class "railroad tunnel" (American English) should be mapped to the category "railway tunnels" (British English) but is not, as their lemmas do not match.
7.3 Disambiguation
For English we have evaluated the two automatic methods, instance intersection and semantic similarity, described in 6.2.1 and 6.2.2 respectively.
7.3.1 Data
The evaluation data consists of a set of polysemous nouns from WordNet which are mapped to Wikipedia categories. Additional information is provided for both nouns and categories: for the former their glosses, for the latter their abstracts. The disambiguation task should then identify, for each noun, which of its senses, if any, corresponds to the mapped category(ies). The resulting gold standard files (corpus and key) are available for research purposes. The corpus file has the following format:

{sense gloss}
[...]
{sense gloss}
{category abstract}
[...]
{category abstract}
while the key file is made up of lines in the format of the Senseval-3 scorer²:

word category sense_number+

In order to build the corpus file we departed from a set of 254 nouns mapped to categories. 54 of them were discarded because the abstracts of the corresponding categories were empty. The final dataset contains 200 polysemous nouns mapped to 207 categories; thus we have an evaluation set with 207 mappings. Regarding Wikipedia abstracts, they are readily available for articles but not for categories. Therefore we developed a procedure to gather them:

if (category has referent_article)
    if (referent_article has abstract)
        return abstract of referent_article
if (category has article_with_same_lemma)
    if (article_with_same_lemma has abstract)
        return abstract of article_with_same_lemma
if (category_body longer than N characters)
    return category_body
return empty_string
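A runnable Python rendering of this fallback procedure, under illustrative assumptions about the category and article objects (the attribute names are ours); the length threshold N is kept as a parameter since its value is not specified in the text.

```python
def category_abstract(category, n):
    """Gather an abstract for a Wikipedia category via the fallback
    chain above: referent article's abstract, then the abstract of an
    article with the same lemma, then the category body itself if it
    is longer than n characters, otherwise the empty string.
    The attributes on `category` are illustrative assumptions."""
    article = category.referent_article
    if article is not None and article.abstract:
        return article.abstract
    article = category.article_with_same_lemma
    if article is not None and article.abstract:
        return article.abstract
    if len(category.body) > n:
        return category.body
    return ""
```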
² http://www.senseval.org/senseval3/scoring
Furthermore, we have manually created a key file. It contains the correct sense(s) for each mapping. In most cases (154, 74.4%) there is a one-to-one correspondence. For 37 (17.9%) mappings, more than one sense corresponds to the mapped category; this usually occurs because the WordNet senses tend to be finer-grained than the Wikipedia categories. For the remaining 16 (7.7%) mappings, no sense corresponds to the mapped category. Let us look at an example of each of these three cases:

• One sense corresponds to one category

Sense glosses: "the supreme commander of a fleet; ranks above a vice admiral and below a fleet admiral"; "any of several brightly colored butterflies"

Category abstract: "Admiral is the rank, or part of the name of the ranks, of the highest naval officers. It is usually considered a full admiral (equivalent to full general) and four-star rank above Vice Admiral and below Admiral of the Fleet/Fleet Admiral."

Key: admiral Admirals 1
• More than one sense corresponds to a category

Sense glosses: "a member of the communist party"; "a socialist who advocates communism"

Category abstract: "This category lists people who have, at one time or another, been active in communist politics through either identifying themselves as communists or being members of parties identifying themselves as communist. It should not be taken for granted that inclusion in this category implies that figures remained their whole life or continue to be communists. Note: communist activists should only be featured in this category if no existing subcategory(-ies) suits them better - the comprehensive subcategory is :Category:Communists by nationality. For more information on categories, see: Wikipedia:Categorization."

Key: communist Communists 1 2
• No sense corresponds to the category

Sense glosses: "the person who holds the office of head of state of the United States government; 'the President likes to jog every morning'"; "the office of the United States head of state; 'a President is elected every four years'"

Category abstract: "Chief executives determine and formulate policies and provide the overall direction of companies or private and public sector organizations within the guidelines set up by a board of directors or similar governing body. They plan, direct, or coordinate operational activities at the highest level of management with the help of subordinate executives and staff managers."

Key: chief_executive Chief_executives 0
7.3.2 Instance Intersection
This algorithm disambiguates 39% of the words. This low recall, which is due to the low number of instances present in WordNet, is compensated by a very high precision: in fact, all the disambiguated entries were correct. We analysed the reasons why 61% of the words were not disambiguated. There are two main causes:

• One of the senses from WordNet corresponds to the category but no common instance is found. This happens in 78% of the cases. For 74% of the words there is simply no common instance in the two resources. For the remaining 4% a common instance does exist, but it lies in a subcategory that, although a hyponym of the mapped category, the hyponymy patterns fail to identify as such. E.g. "Colosseum, Amphitheatrum Flavium" is an instance of the second sense of "amphitheater", which is mapped to the category "Amphitheaters". "Colosseum" is present in the category "Roman amphitheatre buildings", which is a subcategory of "Amphitheaters". However, the aforementioned patterns do not identify "Roman amphitheatre buildings" as a hyponym of "amphitheater".

• No sense from WordNet corresponds to the category, or the category has been changed. This occurs in the remaining 22%. An example of no sense corresponding to the mapped category is the word "assemblage", which has four senses: "a group of persons together in one place", "a system of components assembled together for a particular purpose", "the social act of assembling" and "several things grouped together or considered as a whole". The mapped category, "Assemblage", is "for assemblage artists". As an example of a category change, the word "college" is mapped to the category "Colleges", but that category has been moved to "Universities and colleges". Obviously we cannot map "college" to "Universities and colleges", as by doing so we would end up with instances of universities under the class college.
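The intersection test at the heart of the algorithm can be sketched as follows; the data structures are illustrative, and gathering the category members through the hyponymy patterns is elided.

```python
def disambiguate_by_instances(senses, category_members):
    """Instance-intersection disambiguation (sketch): select the
    WordNet sense(s) whose instance sets share at least one member
    with the mapped Wikipedia category. `senses` maps sense id -> set
    of instance names; `category_members` is the set of article titles
    in the category and its recognised subcategories."""
    matched = [sense_id for sense_id, instances in senses.items()
               if instances & category_members]
    return matched  # an empty list means "not disambiguated"
```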
7.3.3 Semantic similarity
Table 7.4 presents the scores obtained by the different systems and the baselines introduced in section 6.2.2. Regarding the application of the textual
entailment system, three different experiments were carried out, each one with a specific setting:

• TE (trained AVE '07-'08 + RTE-3): for this experiment the system was trained with the corpora provided in the AVE competitions (2007 and 2008 editions) and the RTE-3 Challenge. This configuration uses a BayesNet algorithm, and it shows the capability of the system to solve the task when specific textual entailment corpora are used for training.

• No training phase: in order to assess whether the training corpora are appropriate to the final decision with regard to the task tackled in this work, we also ran an experiment without a training phase. In this case, for each word the sense-category pair with the highest entailment coefficient returned by the system is tagged as the correct link. These coefficients are obtained by computing the set of lexical and semantic measures integrated into the system.

• Supervised (10-fold cross-validation): a BayesNet algorithm was trained with the corpus described in section 7.3.1, which is intended to evaluate the task. We evaluated this experiment by 10-fold cross-validation, using each textual entailment inference as a feature for the machine learning algorithm. This experiment shows the system's behaviour when it is trained with a corpus specific to our task.

The first thing to note is the high score obtained by the 1st sense baseline (64.7%). In fact, leaving aside supervision, only one system is able to reach its score: TE without training. Also important is the role of stop words: by filtering them, substantially better results are obtained, as can be seen for both the Word Overlap (62.7% vs. 56.3%) and the Personalised PageRank (64.3% vs. 61.8%) systems. The results also show that the AVE and RTE corpora are not appropriate for this task (52.8%); the idiosyncrasies of those corpora differ from ours, resulting in a poor training stage.
Nevertheless, when the entailment coefficient returned by the system without training is used (64.7%), a considerable improvement in accuracy is achieved. This shows that the textual entailment inferences are suitable to support our research. Finally, as expected, the best TE results are obtained when the dataset created for
Table 7.4: Text Similarity System Results

Run                                          Accuracy
Baseline 1st sense                           64.7%
Baseline Word overlap                        56.3%
Baseline Word overlap (without stop words)   62.7%
Semantic Vectors                             54.1%
Personalised PageRank                        61.8%
Personalised PageRank (without stop words)   64.3%
TE (trained AVE 07-08 + RTE-3)               52.8%
TE (no training)                             64.7%
TE (supervised)                              77.74%
the evaluation is also used as training data and evaluated by 10-fold cross-validation (77.74%). Table 7.5 presents the scores obtained by the different combinations introduced in section 6.2.2. The score achieved by the upper-bound oracle (84.5%), nearly 7 points higher than the best supervised system (77.74%), seems to indicate that there is room for improving performance by using a supervised combination of the systems. However, none of the supervised combinations is even able to reach the score obtained by the supervised TE. The reason is that the TE system implements some inferences which are somewhat similar to the knowledge contributed by the other methods (e.g. the Smith-Waterman algorithm vs. word overlap, and WordNet Similarity measures vs. the WordNet-based method). Therefore, the information gain supplied by the other methods is not enough to improve on the TE performance. Regarding unsupervised combinations, the best result (65.7%) is obtained by discarding Semantic Vectors, as expected given that, of the three systems, it obtained the lowest score (54.1%). Some of these combinations improve on the unsupervised TE configuration, but, as mentioned before, the improvement is slight owing to the inner characteristics of the TE inferences.
Table 7.5: Text Similarity Combination Results

Run                                 Accuracy
Oracle (PPR + SV + TE + WO)         84.5%
Voting (PPR + SV + TE + WO)         66.5%
Voting (PPR + SV + TE)              64.7%
Voting (PPR + SV + WO)              66.1%
Voting (PPR + TE + WO)              68%
Unsupervised (PPR + SV + TE + WO)   65.2%
Unsupervised (PPR + SV + TE)        64.7%
Unsupervised (PPR + SV + WO)        64.7%
Unsupervised (PPR + TE + WO)        65.7%
Supervised (PPR + SV + TE + WO)     77.24%
Supervised (PPR + SV + TE)          76.99%
Supervised (PPR + SV + WO)          75.12%
Supervised (PPR + TE + WO)          77.11%
Furthermore, it is worth noting that the simple voting approach outperforms the unsupervised combination. The best result (68%) again is obtained when discarding Semantic Vectors.
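The voting combination can be sketched as follows; breaking ties in favour of the first-listed system is our assumption, as the text does not specify a tie-breaking rule.

```python
from collections import Counter

def vote(predictions):
    """Simple voting combination of system outputs: each system
    predicts a sense number for a word, and the majority answer wins.
    Ties are broken in favour of the first system listed (an
    assumption; the text does not specify tie-breaking)."""
    counts = Counter(predictions)
    best = max(counts.values())
    for p in predictions:  # first-listed system breaks ties
        if counts[p] == best:
            return p
```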
7.4 Extraction
We have extracted NEs for the mapped nouns (see table 7.3) for each language. Table 7.6 provides quantitative data about the NEs extracted. We show not only the number of NEs added to the repository, but also the number of orthographic variants (written forms) of these NEs and the number of instance relations extracted that link to the LRs used. The number of NEs extracted for Spanish and Italian is notably lower than for English. This result was expected because both the number of pages in Wikipedia (305,000 and 388,000 vs. 2,100,000) and the number of mapped categories (see 7.2) are significantly lower. Table 7.7 provides results about the nature of the NEs for English added
Table 7.6: Extracted NEs

                     English     Spanish   Italian
NEs                  948,410     99,330    78,638
Written forms        1,541,993   128,796   104,745
Instance relations   1,366,899   128,796   139,190
to the repository. It shows the number of instances added according to the different noun lexicographic files of WordNet. For each lexicographic file to which a substantial number of instances is added, we include an example instance together with the synset it is attached to.
7.5 NE identification
A set of articles from the English Wikipedia was randomly selected, and these articles were manually tagged as being instances or classes. The set contains 278 articles and was used to evaluate the different methods we applied to NE identification: the Web-based, Wikipedia-based and combined methods (see section 6.4). Concerning the capitalisation model, we chose the one that considers the number of times the first word of the string begins with a capital letter. A threshold is used: the minimum percentage of occurrences in which the article title begins with a capital letter for it to be considered a NE. The next sections report on the results obtained by these methods.
7.5.1 Web
Table 7.8 shows the results obtained by the web method. For several values of the threshold, precision, recall and F-measure (β=1 and β=0.5) are included. Fβ=1 weights precision and recall evenly, whereas Fβ=0.5 weights precision twice as much as recall. The highest Fβ=0.5 is obtained when the threshold is set to 0.87, reaching 79.04% with a precision of 76.74%. Although other values of the threshold provide higher values of Fβ=1, as the aim of the approach is to extend a knowledge resource, we consider precision more important than recall: it is better to link a smaller number of NEs to LRs while ensuring that the quality of the final resource is good enough.
Table 7.7: Number of English NEs per lexicographic file

Lex. file       NEs       Example
act             43,005    Project Pluto instanceOf project
artifact        55,454    Akinada Bridge instanceOf suspension bridge
communication   18,361    Flower of Scotland instanceOf national anthem
event           2,146     Sino-Soviet split instanceOf schism
group           81,373    Medici instanceOf family
location        111,564   Incense Route instanceOf trade route
object          39,321    Pyxis instanceOf constellation
person          520,422   Vladimir Kotelnikov instanceOf electrical engineer
time            1,169     Black Saturday (France) instanceOf en s days
ambiguous       485,542   Barachiel instanceOf archangel
7.5.2 Wikipedia
We have evaluated this new approach by looking for entry occurrences both in the English Wikipedia only and in the English Wikipedia plus the other nine aforementioned Wikipedias. The aim is to increase the precision (76.74%) of the web-based method without harming its recall (89.80%). Table 7.9 presents the results obtained for each of the two scenarios. The best Fβ=0.5 is obtained for thresholds 0.91 to 0.95 when only using the English Wikipedia (75.08%) and for threshold 0.95 when using ten Wikipedias (78.60%). For this threshold, using more text yields 6% better precision (76.47% vs. 71.81%) while losing 3.7% recall (88.44% vs. 91.84%), which supports our hypothesis of using different Wikipedias to
Table 7.8: Evaluation results of the web-search method

Threshold   Precision   Recall   Fβ=1    Fβ=0.5
0.81        74.73       92.52    82.67   77.71
0.83        75.84       91.84    83.08   78.58
0.85        76.74       89.80    82.76   79.04
0.87        76.74       89.80    82.76   79.04
0.89        76.65       87.07    81.53   78.53
0.91        77.12       80.27    78.67   77.73
0.93        76.81       72.11    74.39   75.82
0.95        76.92       61.22    68.18   73.17
Table 7.9: Evaluation results using Wikipedia

        only English                       ten languages
Thr     P       R       Fβ=1    Fβ=0.5     P       R       Fβ=1    Fβ=0.5
0.81    70.83   92.52   80.24   74.32      74.30   90.48   81.60   77.06
0.83    70.68   91.84   79.88   74.09      74.72   90.48   81.85   77.42
0.85    70.68   91.84   79.88   74.09      75.57   90.48   82.35   78.14
0.87    71.43   91.84   80.36   74.75      75.57   90.48   82.35   78.14
0.89    71.43   91.84   80.36   74.75      76.16   89.12   82.13   78.44
0.91    71.81   91.84   80.60   75.08      76.16   89.12   82.13   78.44
0.93    71.81   91.84   80.60   75.08      76.02   88.44   81.76   78.22
0.95    71.81   91.84   80.60   75.08      76.47   88.44   82.02   78.60
increase the text size. Compared to the web search approach, the current one obtains 1.5% lower recall (88.44% vs. 89.80%) and practically the same precision (76.47% vs. 76.74%). By analysing the results, we have found a drawback of the current approach compared to web search: the number of occurrences found per article is quite low, 7.97 when only using the English Wikipedia and 13.59 when also using the others. These values contrast with those obtained for the web search; in that experiment we set the limit of occurrences per article to 100, and this limit was reached for all the articles of the evaluation set.
7.5.3 Wikipedia + Web
Finally, we present the results obtained when combining both methods. We refined the web method by adding salient words from the Wikipedia article to the query. Table 7.10 presents the results of adding one, two and three words from the article to the query.
Table 7.10: Evaluation results of the combination method

        1 word                     2 words                    3 words
Thr     P       R       F0.5       P       R       F0.5       P       R       F0.5
0.81    76.67   93.88   79.59      76.80   94.56   79.79      76.16   89.12   78.44
0.83    77.10   93.88   79.95      77.53   93.88   80.32      76.33   87.76   78.37
0.85    77.71   92.52   80.28      77.14   91.84   79.70      76.79   87.76   78.75
0.87    78.44   89.12   80.37      77.46   91.16   79.86      77.58   87.07   79.31
0.89    77.91   86.39   79.47      79.17   90.48   81.20      77.50   84.35   78.78
0.91    77.99   84.35   79.18      79.63   87.76   81.13      78.21   82.99   79.12
0.93    78.57   82.31   79.29      79.87   86.39   81.10      79.47   81.63   79.89
0.95    78.08   77.55   77.98      80.82   80.27   80.71      78.47   76.87   78.15
Of the three configurations, the best Fβ=0.5 is obtained when considering two additional words from the article body (81.20% with threshold 0.89). The best results obtained with one and three words are slightly lower: 80.37% (threshold 0.87) and 79.89% (threshold 0.93) respectively. Compared to the other methods, this one obtains both better precision (79.17% vs. 76.47% and 76.74%) and better recall (90.48% vs. 88.44% and 89.80%).
7.6 Postprocessing
To close the evaluation chapter, we present the results of adding NEs by exploiting multilingual links, and of linking to the SUMO and SIMPLE ontologies. By exploiting Wikipedia's multilingual links we extract 26,157 additional NEs for English, 38,253 for Spanish and 47,168 for Italian. After this step the lexicon therefore contains 974,567 NEs for English, 137,583 for Spanish and 125,806 for Italian.
In this step we also connect sets of equivalent NEs in different languages (encoded in the "SenseAxis" element) to two ontologies through the "InterlingualExternalRef" element. 814,251 such sets are linked to SUMO, while 42,824 are linked to SIMPLE. The substantial difference in the number of entities connected to these ontologies (roughly a 20-to-1 factor) is due to the fact that, in order to connect a set of NEs to SIMPLE, it has to contain an Italian NE linked to PSC, whereas to connect a set to SUMO it needs to contain an English NE linked to WordNet. The results are expected, then, as the NE lexicon contains many more NEs linked to WordNet (1,366,899) than to PSC (139,190). Table 7.11 shows the number of NE sets linked to the different nodes of the SIMPLE ontology: for each ontology node to which a substantial number of NE sets are linked, it shows the actual number of NEs, an example of an Italian NE and the PSC word sense to which this NE is connected.
Table 7.11: Number of NE sets linked to the SIMPLE ontology

Ontology node                   NEs    Example
Artwork                       1,221    Las Meninas (Velazquez) instanceOf USem837dipinto
Agent of persistent activity  2,890    Carl Lewis instanceOf USem2018atleta
Building                        748    Arena di Verona instanceOf USem70845anfiteatro
3-D location                  1,023    Eufrate instanceOf USem5089fiume
Domain                        1,804    Martini Racing instanceOf USem77024automobilismo
Ideo                            596    Henri Bergson instanceOf UsemTH08517esistenzialista
Institution                   2,126    Paramount Pictures instanceOf USem61226azienda
Instrument                      751    Intel 80286 instanceOf USem75625microprocessore
Metalanguage                    586    ENIAC instanceOf USem67411acronimo
Profession                   18,383    Lukas Moodysson instanceOf USem3641regista
Purpose act                     689    Coppa UEFA instanceOf USemD6042competizione
Social status                 6,749    Franco Turigliatto instanceOf USem3581senatore
Vehicle                       1,667    Toyota Prius instanceOf USem843automobile
Chapter 8 Application to Question Answering
8.1 Motivation
With the aim of applying our repository to a real-world NLP task and validating its usefulness, we have added the knowledge encoded in our NE lexicon to a QA process. The main idea is to plug the NE lexicon into a QA system and to use its knowledge to validate the answers given by the system. The next section introduces Question Answering using the Inter Lingual Index module of EuroWordNet and Wikipedia (BRILIW), the Cross-Lingual QA system that we have used for our experiments.
8.2 The BRILIW Question Answering System
The BRILIW system (Ferrández et al., 2007d; Ferrández et al., 2009) was designed to localise answers in documents where the answers and the input questions are written in different languages. BRILIW was presented at CLEF 2006, where it ranked first in the bilingual English–Spanish QA task (Magnini et al., 2006; Ferrández et al., 2006b). The BRILIW architecture is built on three main pillars which set it apart from other state-of-the-art Cross-Lingual QA systems: (i) the use of several multilingual knowledge resources to reference words between languages (the ILI module of EWN and the multilingual knowledge encoded in Wikipedia); (ii) the consideration of more than one translation per word when searching for candidate answers; and (iii) the analysis of the question in its original language, without any translation process. The architecture of BRILIW is organised as a sequential set of modules, depicted in figure 8.1. First, the language of the input question is detected. Next, the NEs of the input question are identified and classified with a NER tool and then translated by using Wikipedia. This is followed by an analysis of the input question, where its answer type and its main syntactic blocks are detected. Later on, the equivalents in the target language for the words of the input question are extracted by exploiting the ILI module of EWN. This is done for common nouns and verbs, but not for NEs, as these have been previously translated. Subsequently, using as input the translations from the ILI (common nouns and verbs) and Wikipedia (NEs), the passages relevant to the input question are fetched by an IR tool. Finally, an ordered list of answers is extracted from the set of relevant
passages by applying syntactic patterns. A more detailed description of these modules can be found in the following subsections.
[Figure 8.1: BRILIW diagram. The user's question passes sequentially through Language Identification, NE Translation (using Wikipedia), Question Analysis, Interlingual References (using the ILI), Passage Selection (over the document collection) and Answer Extraction (using syntactic patterns), producing the answer.]
8.2.1 Language Identification
The language identification module automatically determines the language of the question and of the documents. It is based on the use of a dictionary (which contains language-specific stop-words) and on characteristic parts of words (for example, "ing" in the case of English).
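The two signals just described can be combined in a simple scoring scheme. The sketch below is not the authors' implementation; the stop-word lists and suffixes are illustrative stand-ins for the module's dictionaries.

```python
# Minimal sketch of dictionary-plus-suffix language identification.
# Stop-word lists and suffixes here are tiny illustrative samples.
STOPWORDS = {
    "en": {"the", "of", "and", "what", "did"},
    "es": {"el", "la", "de", "que", "los"},
}
SUFFIXES = {"en": ("ing",), "es": ("ción", "miento")}

def identify_language(text: str) -> str:
    words = text.lower().replace("?", "").split()
    scores = {}
    for lang in STOPWORDS:
        hits = sum(w in STOPWORDS[lang] for w in words)       # stop-word evidence
        hits += sum(w.endswith(SUFFIXES[lang]) for w in words)  # word-ending evidence
        scores[lang] = hits
    return max(scores, key=scores.get)

print(identify_language("What award did Pulp Fiction win?"))  # → en
print(identify_language("Quién descubrió el cometa?"))        # → es
```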
8.2.2 Named Entity Translation
NEs, like common nouns and verbs, could be translated by using the ILI of EWN, but this presents an important drawback: ILI contains very few proper nouns. Instead, BRILIW exploits Wikipedia in order to translate NEs. As an example, we present the procedure followed to translate the NEs in question 151 at CLEF 2005, "What award did Pulp Fiction win at the Cannes Film Festival?". The question contains two NEs: "Pulp Fiction" and "Cannes Film Festival". Neither is referenced in ILI. However, both have an entry in the English version of Wikipedia, and both entries contain a reference to their Spanish equivalents: "Pulp Fiction" and "Festival de Cine de Cannes" respectively. Had this question instead been translated by an MT service¹, the string "Pulp Fiction" would have been converted to "la Ficción de la Pulpa" and the string "Cannes Film Festival" to "Cannes la fiesta cinematográfica".

[Figure 8.2: Named Entity Translation Process. The NE Recogniser identifies "Pulp Fiction" and "Cannes Film Festival" in the question; the Translator looks them up in Wikipedia, yielding 'What award did "Pulp Fiction" win at the "Festival Internacional de Cine de Cannes"?'.]
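The translation step above can be sketched as a lookup over Wikipedia's interlanguage links. In this hedged sketch the lookup table stands in for the actual Wikipedia data, and the function name is illustrative, not BRILIW's.

```python
# Hypothetical sketch of NE translation via Wikipedia interlanguage links.
INTERLANGUAGE_LINKS = {  # English article title -> Spanish article title
    "Pulp Fiction": "Pulp Fiction",
    "Cannes Film Festival": "Festival Internacional de Cine de Cannes",
}

def translate_nes(question: str, nes: list[str]) -> str:
    """Replace each recognised NE with its quoted target-language equivalent."""
    for ne in nes:
        target = INTERLANGUAGE_LINKS.get(ne)
        if target is not None:          # leave the NE untouched otherwise
            question = question.replace(ne, f'"{target}"')
    return question

q = "What award did Pulp Fiction win at the Cannes Film Festival?"
print(translate_nes(q, ["Pulp Fiction", "Cannes Film Festival"]))
```

The output matches the translated question shown in Figure 8.2.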
8.2.3 Question Analysis
The Question Analysis phase relies on a shallow parser. From the output of this tool, BRILIW extracts the Syntactic Blocks² (SBs) of the question. A set of syntactic patterns over these SBs is designed for the analysis of the question. The Question Analysis Module performs two tasks:
• The detection of the expected answer type. The system detects the type of information that the answer needs to contain in order to be a candidate answer (proper noun, quantity, date, ...).
¹ http://freetranslation.com (consulted in January 2008)
² Such as Verb Phrase (VP), Nominal Phrase (NP), and Prepositional Phrase (PP)
• The identification of the main SBs of the question. The system extracts the SBs that are necessary to find the answers.
With regard to the first task, and with the aim of classifying the questions, a taxonomy based on WordNet Based-Types and EuroWordNet TopConcepts has been designed. This taxonomy consists of the following categories: person, profession, group, object, place city, place country, place capital, place, abbreviation, event, numerical economic, numerical age, numerical measure, numerical period, numerical percentage, numerical quantity, temporal year, temporal month, temporal date and definition. The "expected answer type" is determined using a set of syntactic patterns. BRILIW has circa 300 English patterns for the determination of the different semantic categories of the taxonomy; three examples are shown in Table 8.1. The system compares the SBs of the patterns with the SBs of the question, and the resulting comparison determines the category of the question.

Table 8.1: Syntactic patterns to determine the "expected answer type"

Pattern                         Expected Answer Type
[WHO] [TO KNOW] [...]           person
[WHEN] [TO BE] [...]            date
[WHICH or WHAT] [YEAR] [...]    date–year
In the second task, the objective is to select the main SBs of the question, which make it possible to locate the solution in the documents that may contain the answer. Furthermore, these SBs contain the keywords, i.e. the words that must be translated into the language in which the documents are written (this process is shown in the next subsection). Table 8.2 shows several examples (questions of type place country, event and amount) which describe the behaviour of this module. The 3rd column shows the syntactic pattern used, the 4th names the expected answer type and the 5th column describes the SBs extracted as necessary to find the answer. Furthermore, the Question Analysis module discards some SBs that are not useful for the search for the answer. E.g., in question 033 at CLEF 2004 (see 2nd row), the words "can" and "be" are discarded for the answer extraction phase. The question analysis is carried out in the original language, without any translation process, performing the aforementioned two main tasks in
Table 8.2: Question Analysis Module

  Official Question             Pattern              Expected        SBs
                                                     Answer Type
1 006 at CLEF 2006:             [WHICH] [synonym     place country   [NP Iraq] [VP to invade]
  Which country did Iraq        of COUNTRY] [...]                    [PP in 1990]
  invade in 1990?
2 033 at CLEF 2004:             [HOW] [TO CAN]       event           [VP to treat] [NP allergy]
  How can an allergy            [...]
  be treated?
3 107 at CLEF 2006:             [HOW MANY]           amount          [NP soldier] [VP to have]
  How many soldiers does        [TO HAVE] [...]                      [NP España]
  "España" have?
the Question Analysis module. Analysing the question in its own language lets the system avoid the lexical and syntactic noise that wrong question translations can introduce into the process. Obviously, the scalability of this module to other languages would entail the study, design and generation of a syntactic pattern set which determines the expected answer type and the relevant SBs for each specific language.
8.2.4 Inter Lingual Reference
This module defines a methodology to create references between words in different languages. It is based on the use of the ILI module of EWN. The strategy followed by the ILR module introduces a main improvement over other current Cross-Lingual QA systems: the consideration of more than one translation per word, by means of using the different synsets of each word in the ILI module of EWN. Using question 031 at CLEF 2006 as an example, "Who discovered the Shoemaker-Levy comet?" (see Figure 8.3), we can see the references provided by the ILI module between English and Spanish for the noun "comet" (synset: cometa) and the verb "to discover" (synset1: descubrir, ver; synset2: descubrir, encontrar; synset3: descubrir, observar, encontrar; ...). In some cases (as for "to discover") the ILR
module obtains more than one equivalent for each original word. The proposal to use several translations to find the correct answer consists of assigning a weight to each of the word's translations in the ILI.
[Figure 8.3: Inter Lingual Reference module. After NE translation, question analysis passes "comet" (noun) and "discover" (verb) to the ILR module, which returns from the ILI "cometa" for "comet" and several alternatives for "discover": "descubrir, ver"; "descubrir, encontrar"; "descubrir, observar, encontrar".]
8.2.5 Selection of Relevant Passages
An open-domain IR tool able to retrieve passages from documents is used with the aim of speeding up the search for answers. For instance, using the previous example (question 031 at CLEF 2006, "Who discovered the Shoemaker-Levy comet?"), the IR tool receives as input the translations of the verb "to discover" ("descubrir", "encontrar", ...) and the translation of the noun "comet" ("cometa"). The IR tool returns a
list of passages in which the Answer Extraction module will try to find the candidate answers using a complex syntactic pattern matching process.
8.2.6 Answer Extraction
In this phase, the SBs of the question and different sets of syntactic patterns with lexical, syntactic and semantic information are used to search for the correct answer. The syntactic pattern matching process was presented in (Roger et al., 2005), together with the strategy used to sort and select the answers. BRILIW uses a similar technique, with the proviso that the Answer Extraction Module considers more than one translation per word.
8.3 Methodology
At this point, we have added a validation module which uses the knowledge encoded in the NE lexicon to validate the correctness of the answers, with the possibility of reordering the list of answers provided by BRILIW, with the aim of improving the effectiveness of the whole system. Using the NE lexicon, the Validation module is able to validate two types of questions: i) those that expect a NE as the answer (e.g. Who is the General Secretary of Interpol?); and ii) those which ask for definitions of NEs (e.g. Who is Vigdis Finnbogadottir?). With this objective, the module assesses each answer as:
• UNKNOWN: if the NE expected as the answer (type i) or the NE of the question (type ii) is not present in the NE lexicon.
• CORRECT: if the NE expected as the answer or the NE of the question is present in the NE lexicon and its type (person, location, etc.) matches the type tagged by BRILIW.
• INCORRECT: if the NE expected as the answer or the NE of the question is present in the NE lexicon and its type does not match the type tagged by BRILIW.
Once the answers are tagged, the Validation module reorders the list of answers provided by the system according to the following preferential ranking: CORRECT, UNKNOWN and INCORRECT.
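The tagging and reordering logic can be sketched as follows. This is a minimal illustration, not the actual module: `lexicon`, mapping NE surface forms to their types, stands in for the NE lexicon.

```python
# Minimal sketch of the Validation module's tagging and reordering logic.
RANK = {"CORRECT": 0, "UNKNOWN": 1, "INCORRECT": 2}  # preferential ranking

def validate(answer: str, expected_type: str, lexicon: dict[str, str]) -> str:
    if answer not in lexicon:
        return "UNKNOWN"
    return "CORRECT" if lexicon[answer] == expected_type else "INCORRECT"

def reorder(answers: list[str], expected_type: str, lexicon: dict[str, str]):
    tagged = [(validate(a, expected_type, lexicon), a) for a in answers]
    # Stable sort: CORRECT first, then UNKNOWN, then INCORRECT,
    # preserving the system's original order within each tag.
    return sorted(tagged, key=lambda t: RANK[t[0]])

lexicon = {"Enrique Gómez": "person"}
answers = ["Organización Internacional de Policía Criminal",
           "Enrique Gómez", "Jefe de la Policía Interna"]
print(reorder(answers, "person", lexicon)[0])  # → ('CORRECT', 'Enrique Gómez')
```

Python's `sorted` is stable, so ties within a tag keep BRILIW's original ranking, as the preferential reordering requires.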
8.4 Results
We have evaluated the effectiveness of the Validation module, and how its knowledge improves the overall precision of the system, using the official set of questions of CLEF 2006 and our results in that competition. The results obtained are very promising: BRILIW obtains an improvement of 28.1% compared to the former official results (Ferrández et al., 2006b). Using an official question of CLEF 2006, we show an example of the process in table 8.3. This example shows how, by using the knowledge encoded in the NE lexicon, the correct answer is returned in first place, thereby improving the overall accuracy of the system. In this case, BRILIW returns four possible answers. The validation module tags as UNKNOWN all the answers but the second, whose type corresponds to the expected type of the question. Therefore, the validation module reorders the first two answers. The right answer to the question is the one that the validation module promoted to first place.

Table 8.3: Example of the Validation module process. Question 072 at CLEF 2006: Who is the General Secretary of Interpol?

Answer                                          Validation tag   Ranking
Organización Internacional de Policía Criminal  UNKNOWN          2
Enrique Gómez                                   CORRECT          1
Jefe de la Policía Interna                      UNKNOWN          3
Policía Internacional                           UNKNOWN          4
Chapter 9 Conclusions and Future Work
This thesis has tackled the knowledge acquisition bottleneck in the field of Computational Linguistics and presented a case study on the acquisition of a NE Lexicon. This chapter presents the conclusions of the work that has been carried out and proposes lines of future work.
9.1 Main Contributions
The research conducted in this PhD leads to three main contributions: the definition of a methodology to acquire NE plain dictionaries, the identification of three key elements for overcoming the knowledge acquisition bottleneck, and the practical application of these elements through the development of a technique to acquire multilingual NE lexicons whose main pillars are, in fact, these three elements. The next sections discuss these contributions in detail.
9.1.1 Acquisition of NE plain dictionaries
We have presented an initial method to automatically create and maintain plain dictionaries of NEs using as resources an encyclopedia, a noun hierarchy and, optionally, a PoS tagger. The method proves helpful as it facilitates the creation and maintenance of this kind of resource. The principal drawback of this initial acquisition proposal is its low precision in the configuration for which it obtains an acceptable value of recall. Therefore, the automatically created gazetteers need to undergo a manual supervision step in order to reach good quality. On the positive side, we can conclude that our method is helpful, as it takes less time to automatically create gazetteers with our method and then supervise them than to create those dictionaries from scratch. Another important factor is that the method has a high degree of language independence; in order to apply this approach to a new language, we need a version of Wikipedia and WordNet for that language, but the algorithm and the process do not change. Therefore, we think that our method can be useful for the creation of gazetteers for languages for which NER gazetteers are not available but Wikipedia and WordNet resources exist.
9.1.2 Identification of key elements in Knowledge Acquisition
After this initial proposal, we carried out a high-level study in order to identify guidelines to overcome the knowledge acquisition issue. From this study we came up with three key elements:
• LRs Content to Support Their Enrichment. A lot of effort has been devoted to the construction of LRs during the last two decades. The outcome is a set of high-quality LRs with valuable knowledge covering the different levels of language, which can thus be exploited for NLP tasks in general and for knowledge acquisition in particular.
• Exploiting New Text for Broad Coverage and High Quality Extraction. New kinds of text that have emerged from the Web 2.0 collaborative paradigm present several advantages over other sources, such as MRDs and corpora, for carrying out acquisition procedures.
• Standards for Building Generic Information Repositories. The use of standards makes procedures more generic and independent from specific resources, and improves the interoperability and future sharing of different LRs. We have studied the use of LMF as the representation format of the acquired information. This choice, in its turn, facilitates the use of the output in NLP tasks.
9.1.3 Acquisition of multilingual NE lexicons
Once the problem was analysed from a more abstract perspective and guidelines were identified, we applied these guidelines to the concrete problem of NE acquisition. The main outcome of the thesis is, in fact, the design of a generic methodology to automatically create a NE Lexicon by exploiting New Text sources and existing LRs. We have applied this methodology to three LRs for three languages: WordNet for English, the Spanish WordNet for Spanish, and PSC for Italian. The different phases of the construction of this lexicon have been discussed in detail and evaluated. These include an initial mapping procedure, the disambiguation of polysemous nouns, the extraction and identification of NEs and a post-processing step. Finally, we have built a lexicon
of NEs that holds the extracted information and whose representation is compliant with the ISO LMF standard, thus enhancing interoperability. We have presented an automatic approach to disambiguation when linking WordNet to Wikipedia. The proposal calculates the semantic similarity between the definitions of WordNet senses and the abstracts of Wikipedia categories in order to determine which of the senses corresponds to the category. We have applied several methods based on different approaches and also explored their combination. In order to evaluate their performance, we have manually annotated a gold standard. We have also considered two baselines. The results obtained are encouraging: compared to the accuracy of the best baseline (64.7%), an unsupervised combination obtains 68%, while among supervised schemes a TE system applied bidirectionally achieves 77.74%. In order to identify which of the extracted articles constitute NEs, we have devised three different methods which share the same basic idea: they exploit the capitalisation norms followed in some languages, i.e. that proper nouns begin with uppercase while common nouns begin with lowercase. The first relies on a web search engine, the second on the content of Wikipedia articles, while the third combines the two. The best results are obtained with the combination scheme: 81.20% Fβ=0.5, 79.17% precision and 90.48% recall. In the postprocessing step, the extracted NEs are connected to two ontologies: SUMO and SIMPLE. Moreover, additional NEs are introduced into the lexicon by exploiting Wikipedia's multilingual links: if a NE has been extracted for language a but its equivalent in language b has not, we gather the NE for language b and add it to the NE lexicon.
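The capitalisation heuristic shared by the three NE-identification methods can be illustrated with a toy sketch: count how often the candidate title occurs capitalised versus lowercased in some body of text, and keep it as a NE when the capitalised form dominates. The counting here is naive substring matching and the threshold is illustrative, not the tuned value from the evaluation chapter.

```python
# Hedged sketch of the capitalisation heuristic behind the NE
# identification methods; counting and threshold are illustrative only.
def is_named_entity(title: str, corpus: str, threshold: float = 0.9) -> bool:
    upper = corpus.count(title)            # capitalised occurrences
    lower = corpus.count(title.lower())    # lowercased occurrences
    total = upper + lower
    if total == 0:
        return False                       # no evidence either way
    return upper / total >= threshold

body = "Toyota Prius is a hybrid car. The Toyota Prius was launched in 1997."
print(is_named_entity("Toyota Prius", body))                  # → True
print(is_named_entity("Car", "a car is a car. Cars and a car."))  # → False
```

In the actual methods the evidence comes from web search engine hit counts, from Wikipedia article bodies, or from both combined.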
The resulting lexicon contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian respectively, and 1,366,860, 141,055 and 139,190 "instance of" relations to the LRs WordNet, Spanish WordNet and PSC respectively. Moreover, equivalent NEs in different languages have been connected by exploiting Wikipedia's multilingual links. These equivalent sets have been connected to two ontologies: SUMO (814,251 sets) and SIMPLE (42,824 sets). This resource, together with two APIs (C++ and PHP), is publicly available (see appendix A). An important feature of the proposed approach is its high degree of language independence. The method can be directly applied to any language for which there is a version of Wikipedia, a LR with a noun taxonomy and a lemmatiser. In fact, we have applied it to LRs based on different theories and covering three languages (English, Spanish and Italian). Moreover, the updating of
the NE lexicon is straightforward: just by executing the procedure again, the new articles in Wikipedia (the entries that did not exist the last time the procedure was performed) would be analysed and, from this set, the ones detected as NEs would be added to the NE lexicon and connected to the corresponding senses of the LRs and ontologies considered. Finally, we have tested the usefulness of the created resource for real-world applications by applying it to validate the answers produced by a state-of-the-art QA system. With the knowledge of the NE lexicon, the performance of the system increases by 28.1%. The repository could be exploited by NER systems that attempt to classify named entities across a high number of categories. Also, as we provide a classification of entities in the nodes of a taxonomy instead of isolated lists of entities for each category, the resource can be used with different levels of granularity for entity recognition.
9.2 Future work
During the development of this research, several possibilities for future work have emerged. Regarding the acquisition of the NE Lexicon, we plan to carry out new experiments incorporating other kinds of structured data that Wikipedia provides, such as links between pairs of articles and infoboxes. The inclusion of additional data structures would allow us to extract further pieces of information and could also be used to validate the information extracted from the other structures. It would also be interesting to extract additional types of information in order to enrich the resulting lexicon, for example relations between the extracted NEs. In this case, we plan to identify relations between pairs of Wikipedia articles and to detect their types. The usefulness of the acquired NE lexicon has been demonstrated by applying it to QA. It would be interesting to apply the resource to other NLP tasks; for example, we would like to add the knowledge of the lexicon to a machine-learning NER system. This could be done by using a gazetteer feature. The format of the resulting lexicon conforms to the LMF ISO standard, and we provide it as a database repository with two APIs to access it from two different programming languages. Hence the interoperability of the resource and the ease of integration with other tools and resources
are guaranteed. A step forward in this direction would be to provide web services to access the NE lexicon remotely. As has been said, the methodology introduced has a high degree of language independence. This has been demonstrated by applying it to a set of Indo-European languages, including two Romance languages (Spanish and Italian) and a Germanic language (English). A further step to prove this would be to apply our approach to a language that belongs to a different family; in this direction, on-going work is being carried out to exploit the methodology introduced in order to extract Arabic NEs. As for the text similarity approach used for disambiguation, because some inferences of the TE system overlap with the other methods (e.g. the WordNet Similarity measures of the TE system vs. the WordNet-based system), we plan to explore further combinations in which such inferences of the TE system are discarded. Also, we would like to study the cases in which none of the systems is able to obtain the correct disambiguation, i.e. the 15.5% of the dataset for which the oracle fails. This will give an insight into the kind of data that none of the methods is able to disambiguate. Finally, it would be interesting to measure the statistical significance of the differences between the scores obtained by the different systems.
References Agichtein, Eugene, & Gravano, Luis. 2000. Snowball: extracting relations from large plain-text collections. Pages 85–94 of: DL ’00: Proceedings of the fifth ACM conference on Digital libraries. New York, NY, USA: ACM. Agirre, E., Soroa, A., Alfonseca, E., Hall, K., Kravalova, J., & Pasca, M. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In: Proceedings of NAACL. Ahn, D., Jijkoun, V., Mishne, G., de Rijke, K. M¨oller M., & Schlobachz, S. 2005. Using Wikipedia at the TREC QA Track. In: Proceedings of the Thirteenth Text Retrieval Conference (TREC 2004). Alonge, A., Bertagna, F., Calzolari, N., & Roventini, A. 1999. The Italian Wordnet, EuroWordNet Deliverable D032D033 part B5. Tech. rept. Alshawi, Hiyan. 1987. Processing Dictionary Definitions with Phrasal Pattern Hierarchies. Computational Linguistics, 13(3-4), 195–202. Androutsopoulos, I., Ritchie, G.D., & Thanisch, P. 1995. Natural Language Interfaces to Databases–an introduction. Journal of Language Engineering, 1(1), 29–81. Aristotle. 1908. Metaphysics. In: Ross, W. D. (ed), The Works of Aristotle translated into English, Volume VIII. Oxford: Oxford University Press. Atserias, J., Villarejo, L., Rigau, G., Agirre, E., Carroll, J., Magnini, B., & Vossen, P. 2004. The meaning multilingual central repository. In: Proceedings of GWC. 105
References Auer, S¨oren, Bizer, Christian, Kobilarov, Georgi, Lehmann, Jens, Cyganiak, Richard, & Ives, Zachary. 2008. DBpedia: A Nucleus for a Web of Open Data. Baeza-Yates, R., & Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison Wesley. ´ Balahur, Alexandra, Lloret, Elena, Ferr´andez, Oscar, Montoyo, Andr´es, Palomar, Manuel, & Mu˜ noz, Rafael. 2008. The DLSIUAES Team’s Participation in the TAC 2008 Tracks. In: Notebook Papers of the Text Analysis Conference, TAC 2008 Workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology. Bikel, D., Miller, S., Schwartz, R., & Weischedel, R. 1997. Nymble: a highperformance learning name-finder. Pages 194–201 of: Proceedings of the 5th Conference on Applied Natural Language Processing. Black, W., Rinaldi, F., & Mowatt, D. 1998. FACILE: Description of the NE System Used for MUC-7. In: Proceedings of the 7th conference on Message understanding. Borthwick, Andrew, Sterling, John, Agichtein, Eugene, & Grishman, Ralph. 1998. Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. In: Proceedings of the 6th Workshop on Very Large Corpora, WVLC-98. Bunescu, Razvan C., & Pasca, Marius. 2006. Using Encyclopedic Knowledge for Named entity Disambiguation. In: EACL. The Association for Computer Linguistics. Burger, John, Cardie, Claire, Chaudhri, Vinay, Gaizauskas, Robert, Harabagiu, Sanda, Israel, David, Jacquemin, Christian, Lin, ChinYew, Maiorano, Steve, Miller, George, Moldovan, Dan, Ogden, Bill, Prager, John, Riloff, Ellen, Singhal, Amit, Shrihari, Rohini, Strzalkowski, Tomek, Voorhees, Ellen, & Weishedel, Ralph. 2001. Issues, Tasks and Program Structures to Roadmap Research in Question Answering. Tech. rept. NIST. Buscaldi, Davide, & Rosso, Paolo. 2006. Mining Knowledge from Wikipedia for the Question Answering Task. In: Proceedings of The fifth international conference on Language Resources and Evaluation. 106
References Calzolari, Nicoletta. 1992. Acquiring and Representing Semantic Information in a Lexical Knowledge Base. Pages 235–243 of: Proceedings of the First SIGLEX Workshop on Lexical Semantics and Knowledge Representation. London, UK: Springer-Verlag. Calzolari, Nicoletta, Zampolli, Antonio, & Lenci, Alessandro. 2002. Towards a Standard for a Multilingual Lexical Entry: The EAGLES/ISLE Initiative. Pages 264–279 of: Gelbukh, Alexander F. (ed), CICLing. Lecture Notes in Computer Science, vol. 2276. Springer. Carreras, X., Chao, I., Padr´o, L., & Padr´o, M. 2004. FreeLing: An OpenSource Suite of Language Analyzers. In: Proceedings of the 4th International Language Resources and Evaluation Conference. Carreras, Xavier, M`arquez, Llu´ıs, & Padr´o, Llu´ıs. 2002. Named Entity Extraction using AdaBoost. Pages 167–170 of: Proceedings of CoNLL2002. Taipei, Taiwan. Carreras, Xavier, M`arquez, Llu´ıs, & Padr´o, Llu´ıs. 2003. A Simple Named Entity Extractor using AdaBoost. Pages 152–155 of: In Proceedings of CoNLL-2003. Chinchor, N. 1998. Overview of MUC-7. In: Proceedings of the Seventh Message Understanding Conference (MUC-7). Chklovski, Timothy, & Pantel, Patrick. 2004. VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations. In: Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP04). Choukri, Khalid, & Odijk, Jan. 2009. S6 - Enhancing Market/Models for LRs: New Challenges, New Services. Pages 99–100 of: The European Language Resources and Technologies Forum: Shaping the Future of the Multilingual Digital Europe. Collins, M., & Singer, Y. 1999. Unsupervised models for named entity classification. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. Cowie, J., & Lehnert, W. 1996. Information Extraction. Communications of the ACM, 39(1), 80–91. 107
References Dagan, Ido, & Glickman, Oren. 2004. Probabilistic textual entailment: Generic appied modelling of language variability. In: Proceedings of the PASCAL Workshop on Learning Methods for Text Understanding and Mining. Daud´e, Jordi, Padr´o, Llu´ıs, & Rigau, German. 2003. Making Wordnet Mappings Robust. In: Proceedings of the 19th Congreso de la Sociedad Espa˜ nola para el Procesamiento del Lenguage Natural, SEPLN. de Loupy, Claude, Crestan, Eric, & Lemaire, Elise. 2004. Proper Nouns Thesaurus for Document Retrieval and Question Answering. In: Atelier Question-R´eponse, Traitement Automatique des Langues Naturelles (TALN). Dolan, William B., & Brockett, Chris. 2005. Automatically constructing a corpus of sentential paraphrases. In: Proceedings of The 3rd International Workshop on Paraphrasing (IWP2005). Etzioni, Oren, Banko, Michele, Soderland, Stephen, & Weld, Daniel S. 2008. Open information extraction from the web. Communications of the ACM, 51(12), 68–74. ´ Ferr´andez, Oscar, Toral, Antonio, & Mu˜ noz, Rafael. 2006a. Fine Tuning Features and Post-processing Rules to Improve Named Entity Recognition. In: Proceedings of 11th International Conference on Applications of Natural Language to Information Systems, NLDB 2006. ´ Ferr´andez, Oscar, Micol, Daniel, Mu˜ noz, Rafael, & Palomar, Manuel. 2007a. A Perspective-Based Approach for Solving Textual Entailment Recognition. Pages 66–71 of: Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Prague: Association for Computational Linguistics. ´ Ferr´andez, Oscar, Kozareva, Zornitsa, Toral, Antonio, Mu˜ noz, Rafael, & Montoyo, Andr´es. 2007b. Tackling HAREM’s Portuguese Named Entity Recognition task with Spanish resources. Linguateca. Chap. 11, pages 137–144. ´ Ferr´andez, Oscar, Izquierdo, Rub´en, Ferr´andez, Sergio, & Vicedo, Jos´e Luis. 2008. Addressing ontology-based question answering with collections 108
of user queries. Information Processing and Management, in press, corrected proof available online 31 October 2008.
Ferrández, S., Ferrández, O., Ferrández, A., & Muñoz, R. 2007c (September). The Importance of Named Entities in Cross-Lingual Question Answering. In: Proceedings of the 6th International Conference on Recent Advances in Natural Language Processing.
Ferrández, Sergio, López-Moreno, P., Roger, Sandra, Ferrández, Antonio, Peral, Jesús, Alvarado, X., Noguera, Elisa, & Llopis, Fernando. 2006b. Monolingual and Cross-Lingual QA Using AliQAn and BRILI Systems for CLEF 2006. In: (Peters et al., 2007).
Ferrández, Sergio, Toral, Antonio, Ferrández, Óscar, Ferrández, Antonio, & Muñoz, Rafael. 2007d. Applying Wikipedia's Multilingual Knowledge to Cross-Lingual Question Answering. Pages 352–363 of: Kedad, Zoubida, Lammari, Nadira, Métais, Elisabeth, Meziane, Farid, & Rezgui, Yacine (eds), NLDB. Lecture Notes in Computer Science, vol. 4592. Springer.
Ferrández, Sergio, Toral, Antonio, Ferrández, Óscar, Ferrández, Antonio, & Muñoz, Rafael. 2009. Exploiting Wikipedia and EuroWordNet to Solve Cross-Lingual Question Answering. Information Sciences, 68–74.
Finkelstein, Lev, Gabrilovich, Evgeniy, Matias, Yossi, Rivlin, Ehud, Solan, Zach, Wolfman, Gadi, & Ruppin, Eytan. 2002. Placing search in context: the concept revisited. ACM Transactions on Information Systems, 20(1), 116–131.
Fleischman, M., Echihabi, A., & Hovy, Eduard. 2003. Offline Strategies for Online Question Answering: Answering Questions Before They Are Asked. In: Proceedings of the ACL Conference. Sapporo, Japan.
Florian, R., Ittycheriah, A., Jing, H., & Zhang, T. 2003. Named Entity Recognition through Classifier Combination. Pages 168–171 of: Proceedings of CoNLL-2003. Edmonton, Canada.
Francopoulo, G., Bel, N., George, M., Calzolari, N., Monachini, M., Pet, M., & Soria, C. 2008 (forthcoming). Multilingual resources for NLP in the Lexical Markup Framework (LMF).
Language Resources and Evaluation Journal.
Gabrilovich, Evgeniy, & Markovitch, Shaul. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Pages 1606–1611 of: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence.
Gaizauskas, R., Wakao, T., Humphreys, K., Cunningham, H., & Wilks, Y. 1995. University of Sheffield: Description of the LaSIE System as Used for MUC-6. In: Proceedings of the Sixth Message Understanding Conference (MUC-6).
Gaizauskas, Robert, & Humphreys, Kevin. 1997. Using A Semantic Network for Information Extraction. Journal of Natural Language Engineering, 3, 147–169.
Giampiccolo, D., Dang, H.T., Magnini, B., Dagan, I., & Dolan, B. 2008. The Fourth PASCAL Recognizing Textual Entailment Challenge. In: Notebook Papers of the Text Analysis Conference, TAC 2008 Workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology.
Giles, Jim. 2005. Internet encyclopaedias go head to head. Nature, 438(7070), 900–901.
Gregorowicz, Andrew, & Kramer, Mark A. 2006. Mining a Large-Scale Term-Concept Network from Wikipedia. Tech. rept. MITRE.
Grishman, Ralph. 1986. Computational Linguistics: An Introduction. Cambridge University Press.
Grishman, Ralph, & Sundheim, Beth. 1996. Message Understanding Conference-6: a brief history. Pages 466–471 of: Proceedings of the 16th Conference on Computational Linguistics.
Hatzivassiloglou, Vasileios, Klavans, Judith L., Holcombe, Melissa L., Barzilay, Regina, Kan, Min-Yen, & McKeown, Kathleen R. 2001. SimFinder: A flexible clustering tool for summarization. Pages 41–49 of: Proceedings of the NAACL Workshop on Automatic Summarization.
Haveliwala, T. H. 2002. Topic-sensitive PageRank. Pages 517–526 of: WWW '02: Proceedings of the 11th International Conference on World Wide Web. New York, NY, USA: ACM.
Hearst, Marti A. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. Pages 539–545 of: COLING.
Hearst, Marti A. 1998. Automated Discovery of WordNet Relations. In: WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
Hobbs, Jerry R., Appelt, Douglas E., Bear, John, Israel, David, Kameyama, Megumi, Stickel, Mark E., & Tyson, Mabry. 1996. FASTUS: A cascaded finite-state transducer for extracting information from natural-language text. Pages 383–406 of: Roche, Emmanuel, & Schabes, Yves (eds), Finite-State Language Processing. MIT Press.
Humphreys, K., Gaizauskas, R., Azzam, S., Huyck, C., Mitchell, B., Cunningham, H., & Wilks, Y. 1998. University of Sheffield: Description of the LaSIE-II System as Used for MUC-7.
Hutchins, W. John, & Somers, Harold L. 1992. An Introduction to Machine Translation. Academic Press.
Nakamura, Jun-ichi, & Nagao, Makoto. 1988. Extraction of semantic information from an ordinary English dictionary and its evaluation. COLING-88, 459–464.
Ide, Nancy, & Pustejovsky, James. 2009. S4 - Interoperability and Standards. Pages 71–73 of: The European Language Resources and Technologies Forum: Shaping the Future of the Multilingual Digital Europe.
Islam, Aminul, & Inkpen, Diana. 2008. Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data, 2(2), 1–25.
ISO 24613. 2008. Language Resources Management – Lexical Markup Framework (LMF), Rev. 15, ISO TC37/SC4 FDIS. [Online; accessed 25 March 2008].
Jiang, J. J., & Conrath, D. W. 1997 (September). Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: Proceedings of the International Conference on Research in Computational Linguistics (ROCLING X).
Jijkoun, V., Tjong Kim Sang, E., Ahn, D., Möller, K., & de Rijke, M. 2005. The University of Amsterdam at QA@CLEF 2005. In: Working Notes of the CLEF 2005 Workshop.
Jones, Gareth, Fantino, Fabio, Newman, Eamonn, & Zhang, Ying. 2008. Domain-Specific Query Translation for Multilingual Information Access Using Machine Translation Augmented With Dictionaries Mined From Wikipedia. In: 2nd International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies.
Karlgren, J. (ed). 2006. NEW TEXT: Wikis and Blogs and Other Dynamic Text Sources.
Kaszkiel, M., & Zobel, J. 1997. Passage Retrieval Revisited. Pages 178–185 of: Proceedings of the 20th Annual International ACM SIGIR Conference. Philadelphia.
Kipper, Karin, Korhonen, Anna, Ryant, Neville, & Palmer, Martha. 2006 (June). Extending VerbNet with Novel Verb Classes. In: Fifth International Conference on Language Resources and Evaluation (LREC 2006).
Kozareva, Zornitsa, Ferrández, Óscar, Montoyo, Andrés, Muñoz, Rafael, & Suárez, Armando. 2005. Combining Data-Driven Systems for Improving Named Entity Recognition. Pages 80–90 of: Proceedings of the 10th International Conference on Applications of Natural Language to Information Systems, NLDB 2005.
Krstev, Cvetana, Vitas, Duško, Maurel, Denis, & Tran, Mickaël. 2005. Multilingual Ontology of Proper Names. Pages 116–119 of: Proceedings of the Language and Technology Conference.
Landauer, Thomas K., Foltz, Peter W., & Laham, Darrell. 1998. An Introduction to Latent Semantic Analysis. Discourse Processes, 259–284.
Leacock, Claudia, & Chodorow, Martin. 1998. Combining Local Context and WordNet Similarity for Word Sense Identification. In: WordNet: An Electronic Lexical Database, 265–283.
Lenat, D.B. 1998. From 2001 to 2001: Common Sense and the Mind of HAL. Pages 193–208. Cambridge, MA: MIT Press.
Lenci, A., Bel, N., Busa, F., Calzolari, N., Gola, E., Monachini, M., Ogonowski, A., Peters, I., Peters, W., Ruimy, N., Villegas, M., & Zampolli, A. 2000. SIMPLE: A general framework for the development of multilingual lexicons. International Journal of Lexicography, 13(4), 249–263.
Lesk, Michael. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. Pages 24–26 of: SIGDOC '86: Proceedings of the 5th Annual International Conference on Systems Documentation. New York, NY, USA: ACM Press.
Li, Yuhua, McLean, David, Bandar, Zuhair A., O'Shea, James D., & Crockett, Keeley. 2006. Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1138–1150.
Lin, Chin-Yew, & Hovy, Eduard. 2003. Automatic evaluation of summaries using N-gram co-occurrence statistics. Pages 71–78 of: NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Morristown, NJ, USA: Association for Computational Linguistics.
Lloret, Elena, Ferrández, Óscar, Muñoz, Rafael, & Palomar, Manuel. 2008. A Text Summarization Approach under the Influence of Textual Entailment. Pages 22–31 of: Proceedings of the 5th International Workshop on Natural Language Processing and Cognitive Science, NLPCS 2008, in conjunction with ICEIS 2008.
Magnini, B., Negri, M., Prevete, R., & Tanev, H. 2002. A WordNet-based approach to Named Entities Recognition. Pages 38–44 of: Proceedings of SemaNet '02: Building and Using Semantic Networks.
Magnini, Bernardo, Giampiccolo, Danilo, Forner, Pamela, Ayache, Christelle, Jijkoun, Valentin, Osenova, Petya, Peñas, Anselmo, Rocha, Paulo, Sacaleanu, Bogdan, & Sutcliffe, Richard F. E. 2006. Overview of the CLEF 2006 Multilingual Question Answering Track. In: (Peters et al., 2007).
Mann, G. 2002. Fine-Grained Proper Noun Ontologies for Question Answering. In: Proceedings of SemaNet '02: Building and Using Semantic Networks.
Maurel, Denis. 2008 (May). Prolexbase: a Multilingual Relational Lexical Database of Proper Names. In: European Language Resources Association (ELRA) (ed), Proceedings of the Sixth International Language Resources and Evaluation (LREC'08).
McDonald, D. 1996. Internal and external evidence in the identification and semantic categorization of proper names. Corpus Processing for Lexical Acquisition, chapter 2, 21–39.
Medelyan, Olena, & Legg, Catherine. 2008. Integrating Cyc and Wikipedia: Folksonomy Meets Rigorously Defined Common-sense. In: AAAI 2008 Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy.
Mihalcea, Rada, Corley, Courtney, & Strapparava, Carlo. 2006. Corpus-based and knowledge-based measures of text semantic similarity. Pages 775–780 of: Proceedings of AAAI'06.
Mikheev, A., Grover, C., & Moens, M. 1998. Description of the LTG system used for MUC-7. In: Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference held in Fairfax, Virginia, 29 April – 1 May.
Miller, G. A. 1995. WordNet: A Lexical Database for English. Communications of the ACM, 39–41.
Miller, G. A., & Hristea, F. 2006. WordNet Nouns: Classes and Instances. Computational Linguistics, 32(1), 1–3.
Milne, David, & Witten, Ian H. 2008. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: AAAI 2008 Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy.
Milne, David, Medelyan, Olena, & Witten, Ian H. 2006. Mining Domain-Specific Thesauri from Wikipedia: A Case Study. Pages 442–448 of: WI '06: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence. Washington, DC, USA: IEEE Computer Society.
Moreno, Lidia, Palomar, Manuel, Molina, Antonio, & Ferrández, Antonio. 1999. Introducción al procesamiento del Lenguaje Natural. Universidad de Alicante.
Negri, M., & Magnini, B. 2004. Using WordNet Predicates for Multilingual Named Entity Recognition. Pages 169–174 of: Proceedings of the Second Global WordNet Conference.
Niles, Ian, & Pease, Adam. 2001. Towards a standard upper ontology. Pages 2–9 of: FOIS '01: Proceedings of the International Conference on Formal Ontology in Information Systems. New York, NY, USA: ACM.
Niles, Ian, & Pease, Adam. 2003. Linking lexicons and ontologies: Mapping WordNet to the Suggested Upper Merged Ontology. Pages 23–26 of: Proceedings of the 2003 International Conference on Information and Knowledge Engineering (IKE 03), Las Vegas.
Nyberg, Eric. 2009. Interoperability, Standards and Open Advancement. Pages 76–77 of: The European Language Resources and Technologies Forum: Shaping the Future of the Multilingual Digital Europe.
Odijk, Jan, & Mariani, Joseph. 2009. S3 - Evaluation and Validation. Pages 51–52 of: The European Language Resources and Technologies Forum: Shaping the Future of the Multilingual Digital Europe.
Ourioupina, O. 2002. Extracting geographical knowledge from the internet. In: Proceedings of the ICDM-AM International Workshop on Active Mining.
Pasca, M. 2004. Acquisition of categorized named entities for web search. Pages 137–145 of: Proceedings of the Third International Conference on Language Resources and Evaluation.
Pasca, Marius, & Harabagiu, Sanda M. 2001. The Informative Role of WordNet in Open-Domain Question Answering. Pages 138–143 of: Proceedings of the NAACL 2001 Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations.
Pedersen, Ted, Patwardhan, Siddharth, & Michelizzi, Jason. 2004. WordNet::Similarity - Measuring the Relatedness of Concepts. Pages 38–41
of: Proceedings of the North American Chapter of the Association for Computational Linguistics.
Pedro, Vasco, Niculescu, Stefan, & Lita, Lucian. 2008. Okinet: Automatic Extraction of a Medical Ontology From Wikipedia. In: AAAI 2008 Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy.
Perrault, Raymond C., & Grosz, Barbara J. 1986. Natural Language Interfaces. Annual Review of Computer Science, 1, 47–82.
Peters, Carol, Clough, Paul, Gey, Fredric C., Karlgren, Jussi, Magnini, Bernardo, Oard, Douglas W., de Rijke, Maarten, & Stempfhuber, Maximilian (eds). 2007. Evaluation of Multilingual and Multi-modal Information Retrieval, 7th Workshop of the Cross-Language Evaluation Forum, CLEF 2006, Alicante, Spain, September 20–22, 2006, Revised Selected Papers. Lecture Notes in Computer Science, vol. 4730. Springer.
Philpot, Andrew, Hovy, Eduard, & Pantel, Patrick. 2005. The Omega Ontology. Pages 59–66 of: IJCNLP Workshop on Ontologies and Lexical Resources (OntoLex-05).
Ponzetto, Simone P., & Strube, Michael. 2007. Knowledge Derived from Wikipedia for Computing Semantic Relatedness. Journal of Artificial Intelligence Research, 30, 181–212.
Pustejovsky, James. 1991. The generative lexicon. Computational Linguistics, 17(4), 409–441.
Resnik, Philip. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Pages 448–453 of: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence.
Richardson, Stephen D., Dolan, William B., & Vanderwende, Lucy. 1998. MindNet: Acquiring and Structuring Semantic Information from Text. Pages 1098–1102 of: COLING-ACL.
Rigau, G. 1998. Automatic Acquisition of Lexical Knowledge from MRDs. Ph.D. thesis, Universitat Politècnica de Catalunya.
Rodrigo, Álvaro, Peñas, Anselmo, & Verdejo, Felisa. 2008 (September). Overview of the Answer Validation Exercise 2008. In: Peters, C., et al. (eds), CLEF 2008, Lecture Notes in Computer Science, to appear.
Roger, Sandra, Ferrández, Sergio, Ferrández, Antonio, Peral, Jesús, Llopis, Fernando, Aguilar, A., & Tomás, David. 2005. AliQAn, Spanish QA System at CLEF-2005. Pages 457–466 of: Peters, Carol, Gey, Fredric C., Gonzalo, Julio, Müller, Henning, Jones, Gareth J. F., Kluck, Michael, Magnini, Bernardo, & de Rijke, Maarten (eds), CLEF. Lecture Notes in Computer Science, vol. 4022. Springer.
Roventini, Adriana, & Ruimy, Nilda. 2008 (May). Mapping Events and Abstract Entities from PAROLE-SIMPLE-CLIPS to ItalWordNet. In: European Language Resources Association (ELRA) (ed), Proceedings of the Sixth International Language Resources and Evaluation (LREC'08).
Roventini, Adriana, Ruimy, Nilda, Marinelli, Rita, Ulivieri, Marisa, & Mammini, Michele. 2007. Mapping Concrete Entities from PAROLE-SIMPLE-CLIPS to ItalWordNet: Methodology and Results. Pages 161–164 of: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Prague, Czech Republic: Association for Computational Linguistics.
Rubenstein, Herbert, & Goodenough, John B. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.
Ruimy, N., Corazzari, O., Gola, E., Spanu, A., Calzolari, N., & Zampolli, A. 1998. The European LE-PAROLE Project: The Italian Syntactic Lexicon. In: Proceedings of the First International Conference on Language Resources and Evaluation (LREC'98).
Ruimy, N., Monachini, M., Distante, R., Guazzini, E., Molino, S., Ulivieri, M., Calzolari, N., & Zampolli, A. 2002. CLIPS, a Multi-level Italian Computational Lexicon: A Glimpse to Data. In: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02).
Ruimy, Nilda, & Toral, Antonio. 2008. More Semantic Links in the SIMPLE-CLIPS Database. In: Calzolari, Nicoletta, Choukri, Khalid, Maegaard, Bente, Mariani, Joseph, Odjik, Jan, Piperidis, Stelios, & Tapias, Daniel (eds), Proceedings of the Sixth International Language Resources and Evaluation (LREC'08).
Marrakech, Morocco: European Language Resources Association (ELRA).
Ruiz-Casado, Maria, Alfonseca, Enrique, & Castells, Pablo. 2006. From Wikipedia to Semantic Relationships: a Semi-automated Annotation Approach. In: Proceedings of ESWC2006.
Sarmento, Luís, Pinto, Ana Sofia, & Cabral, Luís. 2006. REPENTINO - A wide-scope gazetteer for entity recognition in Portuguese. Pages 31–40 of: Vieira, Renata, Quaresma, Paulo, da Graças Volpes Nunes, Maria, Mamede, Nuno, Oliveira, Claudia, & Dias, Maria Carmelita (eds), Proc. of the 7th Workshop on Computational Processing of Written and Spoken Portuguese, PROPOR 2006. Itatiaia, Rio de Janeiro, Brazil: Springer.
Sekine, Satoshi. 2004. Named Entity: History and Future. Tech. rept. Proteus project.
Sekine, Satoshi, Grishman, Ralph, & Shinnou, Hiroyuki. 1998. A Decision Tree Method for Finding and Classifying Names in Japanese Texts. In: Proceedings of the Sixth Workshop on Very Large Corpora.
Sekine, Satoshi, Sudo, Kiyoshi, & Nobata, Chikashi. 2002. Extended Named Entity Hierarchy. In: Proceedings of the Third International Conference on Language Resources and Evaluation.
Sheremetyeva, Svetlana, Cowie, Jim, Nirenburg, Sergei, & Zajac, Rémi. 1998. Multilingual Onomasticon as a Multipurpose NLP Resource. In: Proceedings of the First International Conference on Language Resources and Evaluation.
Soualmia, Lina Fatima, Golbreich, Christine, & Darmoni, Stéfan Jacques. 2004. Representing the MeSH in OWL: Towards a semi-automatic migration. Pages 81–87 of: KR-MED.
Suchanek, Fabian M., Kasneci, Gjergji, & Weikum, Gerhard. 2007. Yago: a core of semantic knowledge. Pages 697–706 of: WWW '07: Proceedings of the 16th International Conference on World Wide Web. New York, NY, USA: ACM Press.
Sundheim, Beth M., Mardis, Scott, & Burger, John. 2006. Gazetteer Linkage to WordNet. Pages 103–104 of: Proceedings of the Third International WordNet Conference.
Takeuchi, K., & Collier, N. 2002. Use of support vector machines in extended named entity recognition. In: Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002).
Terra, Egidio, & Clarke, C. L. A. 2003. Frequency estimates for statistical word similarity measures. Pages 244–251 of: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology.
Tjong Kim Sang, Erik F. 2002. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. Pages 155–158 of: Proceedings of CoNLL-2002. Taipei, Taiwan.
Tjong Kim Sang, Erik F., & De Meulder, Fien. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. Pages 142–147 of: Daelemans, Walter, & Osborne, Miles (eds), Proceedings of CoNLL-2003. Edmonton, Canada.
Toral, Antonio. 2005. DRAMNERI: a free knowledge based tool to Named Entity Recognition. Pages 27–32 of: Proceedings of the 1st Free Software Technologies Conference.
Toral, Antonio, & Monachini, Monica. 2007. SIMPLE-OWL: a Generative Lexicon Ontology for NLP and the Semantic Web. In: Workshop on Cooperative Construction of Linguistic Knowledge Bases, 9th Congress of the Italian Association for Artificial Intelligence.
Toral, Antonio, & Muñoz, Rafael. 2007 (September). Towards a Named Entity Wordnet (NEWN). Pages 604–608 of: Proceedings of the 6th International Conference on Recent Advances in Natural Language Processing.
Toral, Antonio, & Muñoz, Rafael. 2006. A proposal to automatically build and maintain gazetteers for Named Entity Recognition using Wikipedia. In: Proceedings of the Workshop on New Text, 11th Conference of the European Chapter of the Association for Computational Linguistics.
Toral, Antonio, Noguera, Elisa, Llopis, Fernando, & Muñoz, Rafael. 2005a. Improving Question Answering using Named Entity Recognition. In: Proceedings of the 10th NLDB Congress.
Lecture Notes in Computer Science. Alicante, Spain: Springer-Verlag.
Toral, Antonio, Noguera, Elisa, Llopis, Fernando, & Muñoz, Rafael. 2005b. Reducing Question Answering input data using Named Entity Recognition. In: Proceedings of the 8th International Conference on Text, Speech and Dialogue. Lecture Notes in Computer Science. Karlovy Vary, Czech Republic: Springer-Verlag.
Toral, Antonio, Monachini, Monica, & Muñoz, Rafael. 2007. Automatically converting and enriching a computational lexicon Ontology for NLP semantic tasks. Pages 216–220 of: Proceedings of the 3rd Language and Technology Conference.
Toral, Antonio, Muñoz, Rafael, & Monachini, Monica. 2008. Named Entity WordNet. In: Calzolari, Nicoletta, Choukri, Khalid, Maegaard, Bente, Mariani, Joseph, Odjik, Jan, Piperidis, Stelios, & Tapias, Daniel (eds), Proceedings of the Sixth International Language Resources and Evaluation (LREC'08). Marrakech, Morocco: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/.
Toral, Antonio, Ferrández, Óscar, Agirre, Eneko, & Muñoz, Rafael. 2009 (September). A study on Linking and disambiguating Wikipedia categories to Wordnet using text similarity. In: Proceedings of the 7th International Conference on Recent Advances in Natural Language Processing.
Tran, Mickaël, Grass, Thierry, & Maurel, Denis. 2004. An Ontology for Multilingual Treatment of Proper Names. In: Proceedings of OntoLex 2004.
Turney, Peter D. 2001. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. Pages 491–502 of: EMCL '01: Proceedings of the 12th European Conference on Machine Learning. London, UK: Springer-Verlag.
Uryupina, O. 2003. Semi-supervised learning of geographical gazetteers from the internet. Pages 18–25 of: Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References.
van Assem, Mark, Gangemi, Aldo, & Schreiber, Guus. 2006 (May). Conversion of WordNet to a standard RDF/OWL representation.
In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06).
Verdejo, M. Felisa. 1999. The Spanish WordNet, EuroWordNet Deliverable D032/D033, part B3. Tech. rept.
Voorhees, E. 1998. Using WordNet for Text Retrieval. In: Fellbaum, C. (ed), WordNet: An Electronic Lexical Database. MIT Press.
Vossen, P. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers.
Widdows, Dominic, & Ferraro, Kathleen. 2008 (May). Semantic Vectors: a Scalable Open Source Package and Online Technology Management Application. In: European Language Resources Association (ELRA) (ed), Proceedings of the Sixth International Language Resources and Evaluation (LREC'08).
Wiebe, Janyce, & Riloff, Ellen. 2005. Creating Subjective and Objective Sentence Classifiers from Unannotated Texts. Pages 475–486 of: Proceedings of CICLing-05, International Conference on Intelligent Text Processing and Computational Linguistics. Lecture Notes in Computer Science, vol. 3406. Mexico City, MX: Springer-Verlag.
Wiebe, Janyce M., Wilson, Theresa, Bruce, Rebecca F., Bell, Matthew, & Martin, Melanie. 2004. Learning subjective language. Computational Linguistics, 30(3), 277–308.
Witten, Ian H., & Frank, Eibe. 2005. Data Mining: Practical Machine Learning Tools and Techniques. 2nd edition. San Francisco, CA, USA: Morgan Kaufmann.
Wu, Fei, Hoffmann, Raphael, & Weld, Daniel S. 2008. Augmenting Wikipedia-Extraction with Results from the Web. In: AAAI 2008 Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy.
Wu, Zhibiao, & Palmer, Martha Stone. 1994. Verb Semantics and Lexical Selection. Pages 133–138 of: ACL.
Zesch, Torsten, Müller, Christof, & Gurevych, Iryna. 2008 (May). Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In: European Language Resources Association (ELRA) (ed), Proceedings of the Sixth International Language Resources and Evaluation (LREC'08).
Appendix A Software developed
This appendix reports on the software that has been developed within this PhD thesis. All the software reported here is contributed to the research community and is available under a free license.
• wiki db access. Scripts to download Wikipedia dumps and import them into a MySQL database, and a C++ API to query this database.
• sumo wn mapping. A script to convert the SUMO–WordNet mappings to a MySQL database, and a C++ API to query this database.
• A C++ API to query WordNet (wraps WordNet's C API).
• upc wn db. A C++ API to query the Spanish and Catalan wordnets.
• A Java API to query ItalWordNet.
• A Java API to query PSC.
• A Java API to query the ItalWordNet–PSC mappings.
• A C++ wrapper for a Python tf-idf library, and a tf-idf corpus built from the article bodies of the English Wikipedia dump.
• A C++ API to query and modify the Named Entity Repository.
• A PHP web interface to the Named Entity Repository, including an API to query this resource (see figure A.1).
• An evaluation framework for the identification of NEs. It contains a gold standard of approximately 300 articles of the English Wikipedia manually tagged as being NEs or not, and a C++ program to evaluate results.
• An evaluation framework for the classification of NEs. It contains a gold standard of 3,517 articles of the Simple English Wikipedia manually tagged with the expected entity category (person, location, organisation, none), and a C++ program to evaluate results.
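Both evaluation frameworks reduce to comparing system output against a manually tagged gold standard. The sketch below is a hypothetical Python illustration of the kind of precision/recall computation such an evaluator performs for the classification task; it is not the C++ program listed above, and the sample titles and category labels are only examples.

```python
def evaluate(gold, predicted):
    """Compare predicted NE categories against a gold standard.

    gold, predicted: dicts mapping article title -> category
    ('person', 'location', 'organisation' or 'none').
    Returns (precision, recall, f1), treating 'none' as the
    negative class.
    """
    tp = fp = fn = 0
    for title, gold_cat in gold.items():
        pred_cat = predicted.get(title, "none")
        if gold_cat == "none":
            if pred_cat != "none":
                fp += 1  # system labelled a non-entity article
        else:
            if pred_cat == gold_cat:
                tp += 1
            else:
                fn += 1  # the gold category was missed
                if pred_cat != "none":
                    fp += 1  # and a wrong category was asserted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one correct classification out of two gold entities.
gold = {"Florence": "location", "IBM": "organisation", "Apple pie": "none"}
pred = {"Florence": "location", "IBM": "person", "Apple pie": "none"}
print(evaluate(gold, pred))  # (0.5, 0.5, 0.5)
```

In practice the gold standards above contain hundreds or thousands of such title–category pairs rather than three.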
Figure A.1: Snapshot of the Named Entity Repository web interface
Appendix B Publications
1. Journal articles
• Antonio Toral, Sergio Ferrández, Monica Monachini, Rafael Muñoz. Web 2.0, existing LRs and standards to automatically build LRs: A case study on the construction of a multilingual Named Entity Repository. In Language Resources and Evaluation. 2009. (pending)
• Sergio Ferrández, Antonio Toral, Óscar Ferrández, Antonio Ferrández, Rafael Muñoz. Exploiting Wikipedia and EuroWordNet to Solve Cross-Lingual Question Answering. In Information Sciences. Available online 30 June 2009, ISSN 0020-0255, DOI: 10.1016/j.ins.2009.06.031. SCI journal.
2. Book chapters
• Óscar Ferrández, Zornitsa Kozareva, Antonio Toral, Rafael Muñoz, Andrés Montoyo. Tackling HAREM's Portuguese Named Entity Recognition task with Spanish resources. In Santos, Diana & Nuno Cardoso (eds.). Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. pp. 137-144. September 2008. ISBN 978-989-20-1297-1.
3. Conference articles
• Antonio Toral, Óscar Ferrández, Eneko Agirre, Rafael Muñoz. A study on Linking and disambiguating Wikipedia categories to Wordnet using text similarity. In Proceedings of the 7th International Conference on Recent Advances in Natural Language Processing (RANLP). Borovets (Bulgaria). September 2009.
• Antonio Toral, Rafael Muñoz, Monica Monachini. Named Entity WordNet. In Proceedings of the 6th Conference on Language Resources and Evaluation. Marrakech (Morocco). May 2008.
• Antonio Toral, Rafael Muñoz. Towards a Named Entity Wordnet (NEWN). In Proceedings of the 6th International
Conference on Recent Advances in Natural Language Processing (RANLP). Borovets (Bulgaria). pp. 604-608. September 2007.
• Sergio Ferrández, Antonio Toral, Óscar Ferrández, Antonio Ferrández, Rafael Muñoz. Applying Wikipedia's Multilingual Knowledge to Cross-Lingual Question Answering. In Proceedings of the 12th International Conference on Applications of Natural Language to Information Systems. Paris (France). pp. 352-363. June 2007.
• Óscar Ferrández, Antonio Toral and Rafael Muñoz. Fine Tuning Features and Post-processing Rules to Improve Named Entity Recognition. In Proceedings of the 11th International Conference on Applications of Natural Language to Information Systems, NLDB 2006. Klagenfurt (Austria). pp. 176-185. June 2006.
• Antonio Toral, Rafael Muñoz. A proposal to automatically build and maintain gazetteers for Named Entity Recognition using Wikipedia. Workshop on New Text, 11th Conference of the European Chapter of the Association for Computational Linguistics. Trento (Italy). April 2006.
• Elisa Noguera, Antonio Toral, Fernando Llopis, Rafael Muñoz. Reducing Question Answering input data using Named Entity Recognition. In Proceedings of the 8th International Conference on Text, Speech and Dialogue. Karlovy Vary (Czech Republic). pp. 428-434. September 2005.
• Antonio Toral. DRAMNERI: a free knowledge based tool to Named Entity Recognition. In Proceedings of the 1st Free Software Technologies Conference. A Coruña (Spain). pp. 27-32. July 2005.
• Antonio Toral, Elisa Noguera, Fernando Llopis, Rafael Muñoz. Improving Question Answering Using Named Entity Recognition. In Proceedings of the 10th International Conference on Applications of Natural Language to Information Systems. Alacant (Spain). pp. 181-191. June 2005.
Other related publications that have complemented the research carried out for this PhD thesis:
1. Journal articles
• Francesca Bertagna, Antonio Toral, Nicoletta Calzolari. The All-Words WSD Task. In the journal Intelligenza Artificiale, IV(2), pp. 50-52. June 2007.
2. Conference articles
• Antonio Toral, Stefania Bracale, Monica Monachini, Claudia Soria. Rejuvenating the Italian WordNet: upgrading, standardising, extending. In Proceedings of the 5th Global WordNet Conference. Bombay (India). January 2010. (forthcoming)
• Antonio Toral, Monica Monachini, Aitor Soroa, German Rigau. Studying the role of Qualia Relations for Word Sense Disambiguation. In Proceedings of the 5th International Conference on Generative Approaches to the Lexicon. Pisa (Italy). September 2009.
• Eneko Agirre, Oier Lopez de Lacalle, Christiane Fellbaum, Andrea Marchetti, Antonio Toral and Piek Vossen. SemEval-2010 Task 17: All-words Word Sense Disambiguation on a Specific Domain. In SEW-2009 - Semantic Evaluations: Recent Achievements and Future Directions, Workshop of the NAACL HLT 2009 conference. Boulder, Colorado (USA). June 2009.
• Antonio Toral, Valeria Quochi, Riccardo Del Gratta, Monica Monachini, Claudia Soria and Nicoletta Calzolari. Lexically-based Ontologies and Ontologically Based Lexicons. In Proceedings of the 10th Congress of the Italian Association for Artificial Intelligence. Cagliari (Italy). pp. 49-59. September 2008.
• Nilda Ruimy, Antonio Toral. More semantic links in the SIMPLE-CLIPS database. In Proceedings of the 6th Conference on Language Resources and Evaluation. Marrakech (Morocco). May 2008.
• Riccardo del Gratta, Nilda Ruimy, Antonio Toral. SIMPLE-CLIPS ongoing research: more information with less data by implementing inheritance. In Proceedings of the 6th Conference on Language Resources and Evaluation. Marrakech (Morocco). May 2008.
• Bernardo Magnini, Amedeo Cappelli, Fabio Tamburini, Cristina Bosco, Alessandro Mazzei, Vincenzo Lombardo, Francesca Bertagna, Nicoletta Calzolari, Antonio Toral, Valentina Bartalesi Lenzi, Rachele Sprugnoli and Manuela Speranza. Evaluation of Natural Language Tools for Italian: EVALITA 2007. In Proceedings of the 6th Conference on Language Resources and Evaluation. Marrakech (Morocco). May 2008.
• Antonio Toral, Monica Monachini, Rafael Muñoz. Automatically converting and enriching a computational lexicon Ontology for NLP semantic tasks. In Proceedings of the 3rd Language and Technology Conference. Poznan (Poland). pp. 216-220. October 2007.
• Antonio Toral, Monica Monachini. Formalising and Bottom-up Enriching the Ontology of a Generative Lexicon. In Proceedings of the 6th International Conference on Recent Advances in Natural Language Processing (RANLP). Borovets (Bulgaria). pp. 599-603. September 2007.
• Antonio Toral, Monica Monachini. SIMPLE-OWL: a Generative Lexicon Ontology for NLP and the Semantic Web. Workshop on Cooperative Construction of Linguistic Knowledge Bases, 10th Congress of the Italian Association for Artificial Intelligence. Rome (Italy). September 2007.
• Antonio Toral, Óscar Ferrández, Elisa Noguera, Zornitsa Kozareva, Andrés Montoyo, Rafael Muñoz. Geographic IR helped by Structured Geospatial Knowledge Resources. In Working Notes of CLEF 2006. Alacant (Spain). September 2006.
• Antonio Toral, Georgiana Puscasu, Lorenza Moreno, Rubén Izquierdo, Estela Saquete. University of Alicante at WiQA 2006. In Working Notes of CLEF 2006. Alacant (Spain). September 2006.
• Manuel García-Vega, Miguel A. García-Cumbreras, L. Alfonso Ureña-López, José M. Perea-Ortega, F. Javier Ariza-López, Óscar Ferrández, Antonio Toral, Zornitsa Kozareva, Elisa Noguera, Andrés Montoyo, Rafael Muñoz, Davide Buscaldi, Paolo Rosso.
APPENDIX B. PUBLICATIONS R2D2 at GeoCLEF 2006: a mixed approach. In Working notes of CLEF 2006. Alacant (Spain). September 2006. ´ • Oscar Ferr´andez, Zornitsa Kozareva, Antonio Toral, Elisa Noguera, Andr´es Montoyo, Rafael Mu˜ noz y Fernando Llopis. The University of Alicante at GeoCLEF 2005. Lecture Notes in Computer Science, CLEF 2005. Pendiente de publicaci´on. Vienna, Austria. September 2005. • Antonio Toral, Rafael Mu˜ noz, Andr´es Montoyo. Weak Named Entities Recognition using morphology and syntax. In Proceedings of the 5th International Conference on Recent Advances in Natural Language Processing (RANLP). Borovets (Bulgaria). pp.547-550. September 2005. • Antonio Toral, Sergio Ferr´andez, Andr´es Montoyo. EAGLES compliant tagset for the morphosyntactic tagging of Esperanto. In Proceedings of the 5th International Conference on Recent Advances in Natural Language Processing (RANLP). Borovets (Bulgaria). pp. 551-555. September 2005. • Fernando Llopis, H´ector Garc´ıa Puigcerver, Mariano Cano, Antonio Toral, H´ector Esp´ı. IR-n System, a Passage Retrieval Architecture. In Proceedings of the 7th International Conference on Text, Speech & Dialogue. Brno (Czech Republic). pp. 57-64. September 2004.
Appendix C LMF output
This appendix contains an output sample in LMF format and in the database. It is made up of three monolingual lexicons whose entries are linked by using the “SenseAxis” object of the LMF multilingual extension.
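As an illustration of the LMF output, a minimal fragment consistent with the database tables below can be built and inspected programmatically. This is a hypothetical sketch: the element names (Lexicon, LexicalEntry, Lemma, Sense, SenseAxis) come from LMF, but the ids, the attribute-based layout and the values are assumptions for illustration, not the actual thesis output.

```python
# Hedged sketch: an LMF-style fragment for the "Firenze"/"Florence" example.
# Ids and attributes are illustrative only.
import xml.etree.ElementTree as ET

LMF_SAMPLE = """\
<LexicalResource>
  <Lexicon language="it">
    <LexicalEntry id="it_le_Firenze" partOfSpeech="PN">
      <Lemma writtenForm="Firenze"/>
      <Sense id="it_s_Firenze"/>
    </LexicalEntry>
  </Lexicon>
  <Lexicon language="en">
    <LexicalEntry id="en_le_Florence" partOfSpeech="PN">
      <Lemma writtenForm="Florence"/>
      <Sense id="en_s_Florence"/>
    </LexicalEntry>
  </Lexicon>
  <SenseAxis id="1" type="eq_synonym" senses="it_s_Firenze en_s_Florence"/>
</LexicalResource>
"""

root = ET.fromstring(LMF_SAMPLE)
axis = root.find("SenseAxis")
# The SenseAxis groups equivalent senses across the monolingual lexicons.
print(axis.get("senses").split())  # ['it_s_Firenze', 'en_s_Florence']
```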
Table C.1: NE Repository. LexicalEntry table

LE id            LE pos
it le Firenze    PN
it le città      N
en le Florence   PN
en le city       N
es le Florencia  PN
Table C.2: NE Repository. FormRepresentation table

LE id            written form  variant type
it le Firenze    Firenze       full
it le città      città         full
en le Florence   Florence      full
en le city       city          full
es le Florencia  Florencia     full
Table C.3: NE Repository. Sense table

S id            LE id            ext. resource  resource id     def.
it s Firenze    it le Firenze    it Wikipedia   1118816         ...
it s città1     it le città      it PSC         USem2234citta1  ...
en s Florence   en le Florence   en Wikipedia   11525           ...
en s city1      en le city       en WordNet     noun.loc:city0  ...
es s Florencia  es le Florencia  es Wikipedia   32511           ...

Table C.4: NE Repository. SenseRelation table

source id      target id    relation
it s Firenze   it s città1  instanceOf
en s Florence  en s city1   instanceOf
Table C.5: NE Repository. SenseAxis table

SA id  type
1      eq synonym
Table C.6: NE Repository. SenseAxisElements table

SA id  element
1      it s Firenze
1      en s Florence
1      es s Florencia
Table C.7: NE Repository. SenseAxisExternalRef table

SA id  resource  resource id            relation
1      SUMO      city                   at
1      SIMPLE    Geopolitical location  at
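The cross-lingual lookup implied by these tables can be sketched in a few lines: given a lexical entry, follow its sense into a SenseAxis group and return the equivalent senses in the other languages. This is a minimal, hypothetical re-implementation; the dictionaries mirror Tables C.3 and C.6, and the function name is an assumption, not part of the thesis code.

```python
# Data transcribed from the tables above.
sense_of_entry = {            # Table C.3: LE id -> S id
    "it le Firenze": "it s Firenze",
    "en le Florence": "en s Florence",
    "es le Florencia": "es s Florencia",
}
sense_axis_elements = {       # Table C.6: SA id -> member senses
    1: ["it s Firenze", "en s Florence", "es s Florencia"],
}

def equivalents(lexical_entry):
    """Return the senses linked to this entry's sense via any SenseAxis."""
    sense = sense_of_entry[lexical_entry]
    for members in sense_axis_elements.values():
        if sense in members:
            return [s for s in members if s != sense]
    return []

print(equivalents("en le Florence"))  # ['it s Firenze', 'es s Florencia']
```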
Appendix D Questions of the English-Spanish QA@CLEF task (2006)
1. What is Atlantis?
2. What is Hubble?
3. What is “Nike Zeus”?
4. What is Linux?
5. Who is Iosif Kobzon?
6. Which country did Iraq invade in 1990?
7. How many countries are members of NATO at the moment?
8. Which organization is led by Yaser Arafat?
9. Of which organization does Taiwan want to become a member?
10. Name a movie with Kirk Douglas from the period 1946-1960.
11. Give the name of a winner of the Nobel prize for Literature for the period 1945-1990.
12. When did the official coronation of Elizabeth The Second take place?
13. When was the Maastricht treaty signed?
14. When did Stalin die?
15. Who played the main role in the film called “Seven Years in Tibet”?
16. Who wrote the fantasy epic called “Lord of the Rings”?
17. What is the name of the woman who first climbed the Mt. Everest without an oxygen mask?
18. Who was called the “Iron Chancellor”?
19. In what year did the catastrophe in Chernobyl happen?
20. What year was Helmut Kohl born?
21. What year was Martin Luther King murdered?
22. Which country was pope John Paul II born in?
23. What party does the British prime minister Tony Blair belong to?
24. How high is Kanchenjunga?
25. What is Down syndrome also called?
26. Where did the Olympic Winter Games take place in 1994?
27. Which environmental organisation was founded in 1971?
28. For whom did Bremen football club manager Willi Lenke pretend to work between 1970 and 1975?
29. How high is the Eiffel Tower?
30. In what American state is Everglades National Park?
31. Who discovered the Shoemaker-Levy comet?
32. What planet did the Shoemaker-Levy comet hit?
33. What is Quinoa?
34. Which company took over Barings after its collapse in February 1995?
35. Who is Nick Leeson?
36. Who was president of France at the time of the nuclear weapons tests in the South Pacific?
37. What is leprosy?
38. To which organisation is Peter Anderson the alcohol adviser?
39. Which political party was Willy Brandt a member of?
40. What is the longest German word?
41. What is the name of the Latvian currency?
42. What is GI Joe?
43. What is the main religion of East Timor?
44. How many houses are expected to be built under the Stirling Initiative between 1993 and 1998?
45. In which city did the runner Ben Johnson test positive for Stanozol during the Olympic Games?
46. Who was Alexander Graham Bell?
47. Who is Danuta Walesa?
48. Who is Vigdis Finnbogadottir?
49. Who is Marc Forné?
50. Who is Neil Armstrong?
51. Who is Fernando Masone?
52. Who is Javier Clemente?
53. What is Lufthansa?
54. What is Medicus Mundi?
55. What is Airbus?
56. What does IBRD stand for?
57. What is Christie's?
58. What is the CERN?
59. What is Deep Blue?
60. What is toner?
61. What is Eurovision?
62. What is the Big Bang?
63. Who were the Cossacks?
64. What is the green ECU?
65. What is the drachma?
66. What person awarded by the Goethe Institute did not accept the award?
67. Who chairs RAI?
68. Who won the Battle of Alamein?
69. Who is NASA mission operations director?
70. Who is the president of Latvia?
71. Who is known as the “Iron Lady” of Hong Kong?
72. Who is the General Secretary of Interpol?
73. What doctor accompanied Miguel Induráin during his displacement to Colorado?
74. Who discovered acetylsalicylic acid?
75. In which year was the Football World Cup celebrated in the United States?
76. In what year did Titanic sink?
77. In which period did the Second World War take place?
78. On which date did the United States invade Haiti?
79. On which day did Jordan and Israel sign a peace treaty?
80. In which year did Bernard Montgomery die?
81. When was the Hubble Space Telescope launched?
82. In which year was the Russian Revolution?
83. In which year was the Dunkirk withdrawal?
84. In which year did Einstein win the Physics Nobel Prize?
85. In which city is the Johnson Space Center?
86. In which city is the Sea World aquatic park?
87. In which city is the opera house La Fenice?
88. In which street does the British Prime Minister live?
89. Which Andalusian city wanted to host the 2004 Olympic Games?
90. In which country is Nagoya airport?
91. In which city was the 63rd Oscars ceremony held?
92. Where is Interpol's headquarters?
93. Where did Braque and Picasso work together?
94. What organization made an appeal for the Olympic Truce?
95. What organization was Jacques Diouf main director of?
96. What organization is Michel Camdessus managing director of?
97. What organization is César Gaviria General Secretary of?
98. What organization was Sergio Balanzino deputy General Secretary of?
99. What is the name of the German company that markets Hero products?
100. What organization holds a trade embargo on Iraq?
101. What film studio was Cedric Gibbons chief art director of?
102. What group does Aviaco belong to?
103. Of what company has Martín Bustamante been an executive board member?
104. How many inhabitants are there in Longyearbyen?
105. How many pieces does the Carambolo Treasure contain?
106. How many Oscar Academy Awards did Star Wars win?
107. How many soldiers does Spain have?
108. How many Formula One World Championships did Fangio win?
109. How many categories does the Grammy Awards have?
110. How many Grammy Awards did The Lion King win?
111. How much money is made by drug trafficking?
112. What is the Interpol's budget?
113. What was the name of the first nuclear submarine?
114. In which prison was Mario Conde held?
115. What is the main component of aspirin?
116. What is the name of the greatest engraving of Picasso?
117. What team is run by Luca Cordero Di Montezemolo?
118. What did Yogi Bear steal?
119. What is the most emblematic species of Doñana?
120. What is the name of the bicycle that was used by Miguel Induráin to break the world one hour record?
121. Who was the first president of the United States to visit China?
122. Who was the president of Peru between 1985 and 1990?
123. Who won the Tour de France in 1988?
124. Which Russian Tsar died in 1584?
125. In which city did the inaugural match of the 1994 USA Football World Cup take place?
126. What port did the aircraft carrier Eisenhower leave when it went to Haiti?
127. Which country did Roosevelt lead during the Second World War?
128. Name a country that became independent in 1918.
129. Which organization did Shimon Peres chair after Yitzhak Rabin's death?
130. Which organization did Primo Nebiolo chair during the Athletics World Championships in Gothenburg?
131. What force was Luis Roldan director from 1986 to 1993?
132. What organization did Greece join in 1952?
133. How old was Umberto Bossi when he resigned as general secretary of Northern League?
134. How long in kilometres was the 1926 Tour de France?
135. How many countries did Nixon visit from 1953 to 1959?
136. What was the population of Hong Kong in 1993?
137. What position did Francois Mitterrand hold when he was admitted to the Cochin hospital?
138. What is the name of the collection of pictures painted by Goya from 1819 to 1823?
139. Which war took place from 1939 to 1945?
140. In which sport did Europe beat America during 1987?
141. Who took part in the Yalta Conference?
142. Which countries signed the North American Free Trade Agreement?
143. Name the three Beatles that are still alive.
144. Name the three Baltic states.
145. Which are the three Slavic republics?
146. When was the Norwegian EU referendum?
147. When was the Berlin Wall torn down?
148. What is the LZ 129 Hindenburg?
149. When did Tom Twyker win the Nobel Peace Prize?
150. Who was the Prime Minister of England before John Major?
151. When was the Halifax G7 summit held?
152. What is the RKA?
153. How many times has Zinedine Zidane won the US Open?
154. Which astronaut has spent the longest time in space?
155. Which are the seven most industrialized countries in the world?
156. What is Gianni Versace's profession?
157. What famous French event is celebrated on the 11th of November?
158. How many people attended the world soccer championship final in 1994?
159. How many journalists did Diego Maradona injure with an air rifle?
160. What film won the Golden Bear in 1988?
161. Who is Fernando Henrique Cardoso?
162. Who was the president of Deutsche Bank during the bankruptcy of the property speculator Juergen Schneider?
163. What French multinational changed its name to Groupe Danone?
164. Who is Rolf Ekeus?
165. What countries are members of the Gulf Cooperation Council?
166. Who was George Starckmann?
167. What is the penalty imposed on Italy for over production with respect to milk quota?
168. What is Partnership for Peace?
169. How many Oscar nominations did “In the Name of the Father” get?
170. How many separations were there in Norway in 1992?
171. When was the referendum on divorce in Ireland?
172. Who was the favourite personage at the Wax Museum in London in 1995?
173. Who are the interpreters of the film “Color of the Night”?
174. When was the 51st Venice Film Festival held in 1994?
175. When did Suriname become independent?
176. What is the nickname of Eddy Merckx?
177. Where is James Ensor buried?
178. Where is the Hermitage?
179. Where is Ystad located?
180. Who created the operating system OS/2?
181. What is CBGB?
182. What is CD-i?
183. In which year did Glenn Gould die?
184. Who is Jan Tinbergen?
185. Name sumo wrestlers.
186. In which town in Zeeland did Jan Toorop spend several weeks every year between 1903 and 1924?
187. What is a samovar?
188. What is the Bundesgrenzschutz?
189. What is Roque Santeiro?
190. What is cachupa?
191. What was annexed after the Six Days' War?
192. In which state lies Porto Alegre?
193. What country invaded the Greater Hanish?
194. How much was John Fashanu fined?
195. What is the world record in the high jump?
196. What is the French railway company called?
197. What was the result of the game Italy-Nigeria at the 1994 World Cup?
198. What is Geoffrey Oryema's nationality?
199. What does Vital do Rego teach?
200. Since when is Portugal a republic?
Apéndice E Resumen
Este capítulo presenta un resumen de un mínimo de 5.000 palabras del contenido de la tesis doctoral en un idioma oficial de la universidad, como requisito de la Comisión de Doctorado (02/03/2005) para la admisión de Tesis Doctorales en lenguas no oficiales.
E.1 Introducción
El lenguaje es una parte fundamental en la vida del ser humano. Mediante éste se establece la comunicación natural entre distintos individuos, ya sea de forma hablada o escrita. Además, es el mecanismo utilizado para la representación de nuestros pensamientos y, por tanto, determina en cierto modo nuestra forma de ser y de actuar.

El Procesamiento del Lenguaje Natural (PLN) (Moreno et al., 1999) es una rama de la Inteligencia Artificial que estudia el tratamiento computacional de los lenguajes humanos. Se trata, por una parte, de explotar computacionalmente grandes volúmenes de datos que están representados como lenguaje natural. Por otra, de hacer la comunicación entre el ser humano y los computadores menos rígida y más cómoda para el primero.

El PLN requiere conocer la estructura del lenguaje natural. Las aplicaciones que se desarrollan en este campo necesitan comprender el lenguaje en sus distintos niveles¹: léxico-morfológico, sintáctico, semántico y contextual. Por lo tanto, el desarrollo y la investigación en analizadores automáticos para los distintos niveles lingüísticos es básico para el desarrollo de este campo de investigación. De los resultados que puedan alcanzar estos bloques básicos de PLN dependerá en gran medida el éxito o fracaso de aplicaciones más avanzadas que se puedan desarrollar tomando estos analizadores como base.

La importancia del PLN en la actualidad se debe en parte al gran aumento de información digital disponible a consecuencia del gran crecimiento experimentado por el uso de las redes informáticas en general y de Internet en particular. Esta gran cantidad de información precisa mecanismos para ser manejada. Así, se desarrollan sistemas para buscar la información que pueda interesar al usuario, para traducir la información y así poder entenderla, etcétera.

Según Grisham (Grisham, 1986), tres clases de aplicaciones han sido de gran importancia en el desarrollo del PLN. Éstas son la recuperación de información, la traducción automática y las interfaces hombre-máquina. Otros autores incluyen, además de estas tres, una cuarta aplicación: la extracción de información. Veamos brevemente en qué consiste cada una de ellas:

• Recuperación de Información (RI) (Baeza-Yates & Ribeiro-Neto, 1999). Estos sistemas devuelven, a partir de una pregunta y una colección de documentos, una lista ordenada de documentos extraídos de dicha colección según la relevancia de estos documentos con la pregunta de entrada. Una mejora de los sistemas de RI la constituyen los sistemas de recuperación de pasajes (Kaskziel & Zobel, 1997), los cuales, en vez de devolver una lista de documentos, devuelven una lista de pasajes, i.e. fragmentos de documentos, y solventan así varios problemas de los sistemas clásicos de RI. Otra extensión de los sistemas de RI, aunque hoy puede considerarse una aplicación distinta, es la búsqueda de respuestas (BR) (Burger et al., 2001). En este caso, ante una pregunta, el sistema devuelve la respuesta a ésta en vez de los documentos o pasajes relevantes.

• Extracción de Información (EI). Cowie y Wilks definen la EI como “el nombre dado a cualquier proceso que estructura y combina selectivamente datos que se encuentran, bien de manera explícita, bien implícitamente, en un texto o en una colección de ellos” (Cowie & Lehnert, 1996). Los sistemas de EI tienen como objetivo encontrar y relacionar información relevante, mientras ignoran otras informaciones no relevantes, a partir del análisis lingüístico de textos.

• Traducción automática (Hutchins & Somers, 1992). La Asociación Europea para la Traducción Automática² define ésta como “la aplicación de computadoras a la tarea de traducir textos de una lengua natural a otra”. A día de hoy existen sistemas que son capaces de producir una salida que, si bien no es perfecta, es de calidad suficiente para ser útil para determinados usos y en determinadas situaciones.

• Interfaces en lenguaje natural (Perrault & Grosz, 1986). Permiten al usuario interrelacionarse con una computadora en una lengua natural, facilitando la comunicación persona-máquina y abstrayendo al usuario el funcionamiento real de ésta. Una aplicación concreta son las interfaces en lenguaje natural para la realización de consultas a bases de datos (Androutsopoulos et al., 1995).

¹ No en todos los casos se requiere el conocimiento de todos los niveles del lenguaje.
² http://www.eamt.org/
El conocimiento del mundo es un requisito para tratar con el nivel semántico de los lenguajes naturales. Las conceptualizaciones de la realidad han ocupado al ser humano desde la antigua Grecia, en la que el término Ontología (del griego ὄν, genitivo ὄντος: del ser (parte de εἶναι: ser) y -λογία: ciencia, estudio, teoría) fue introducido por Aristóteles (Aristotle, 1908). Mucho tiempo después, a finales del siglo veinte, tuvieron lugar los primeros intentos de proveer de sentido común a los ordenadores por medio de bases de conocimiento, en el marco del campo de la inteligencia artificial.

En el caso del lenguaje, este conocimiento se encuentra en los Recursos Lingüísticos (RLs). Éstos juegan un papel central en el campo de la lingüística computacional, pues son prácticamente imprescindibles para llevar a cabo cualquier proceso de comprensión automática del lenguaje. Estos recursos se aplican extensivamente en las diferentes tareas del PLN. Por ejemplo, en RI se pueden usar para expandir la consulta incluyendo términos relacionados (e.g. sinónimos). En EI, la semántica de la plantilla a rellenar se puede enlazar a un RL, y así las propiedades a extraer se pueden basar en el conocimiento de este último.

Dada la importancia de los RLs, la comunidad investigadora ha dedicado mucho esfuerzo a construir este tipo de recursos manualmente durante las últimas dos décadas. A pesar de este esfuerzo, que sin duda ha redundado en la disponibilidad de RLs robustos y precisos, algunos tipos de elementos lingüísticos no están cubiertos exhaustivamente. Dos ejemplos son las Entidades Nombradas (ENs) y los términos de dominio específico. La inserción y el mantenimiento manual de estos dos tipos de términos sería inviable, ya que la cantidad de términos es enorme y su naturaleza, en especial la de las ENs, es mucho más volátil que la de los términos normalmente tratados (nombres comunes, adjetivos, verbos y adverbios).
Esto está relacionado con la siguiente afirmación: “la construcción de una ontología de nombres propios es más difícil que la construcción de una ontología de nombres comunes, pues el conjunto de nombres propios crece más rápidamente”. Para aclarar el asunto del mantenimiento de ENs, veamos el estado de las ENs en WordNet (el RL para el inglés más usado en la actualidad). Mientras la cobertura de WordNet para nombres comunes es considerablemente alta, éste contiene muy pocos nombres propios: solamente 7.669 en la versión 2.1.

Las ENs son importantes en diversas tareas del PLN como resúmenes automáticos, construcción de índices invertidos, etc. Las ENs son especialmente importantes en EI (de hecho, el Reconocimiento de Entidades Nombradas nació como una fase intermedia de la EI) y juegan un rol especial en la traducción. La mayoría de la investigación dedicada a las ENs se refiere a su reconocimiento en textos. Por cuanto respecta a recursos de ENs, existen repositorios maduros de ENs geográficas, pero hay una falta de diccionarios de ENs de carácter más general. No obstante, la disponibilidad de RLs de carácter general con ENs de granularidad fina podría ser muy útil para el PLN; por ejemplo, consideremos la pregunta 161 de la tarea de BR del CLEF 2006: “¿Quién es Fernando Henrique Cardoso?”. Esta pregunta podría ser fácilmente contestada si la EN de tipo persona a la que se refiere la pregunta estuviese presente en un RL con relaciones semánticas que la conectasen a otras entradas como “brasileño”, “político”, “presidente” y “ministro”.
E.1.1 Motivación
Esta tesis de doctorado trata el problema del cuello de botella de la adquisición de conocimiento y, específicamente, dada la importancia de las ENs en el PLN y también los problemas de su mantenimiento manual, considera la adquisición de este tipo de elemento lingüístico.

La tesis parte de las conclusiones del proyecto RECENT³. En el marco de este proyecto desarrollé una herramienta de reconocimiento de entidades llamada DRAMNERI. En su diseño se prestó una particular atención a la modularidad, con el objetivo de facilitar su adaptación a diferentes idiomas y dominios. La herramienta fue aplicada con éxito a diversas tareas: fue usada para reconocer entidades temporales y de cantidad en portugués, para mejorar los resultados de un sistema de búsqueda de respuestas y para construir un sistema de extracción de información en el dominio notarial. Sin embargo, encontramos un problema: mientras el esfuerzo necesario para construir los diccionarios y reglas requeridos en dominios específicos, como por ejemplo el notarial, era abordable, el tratamiento de dominios de carácter más general requeriría mucho tiempo de etiquetado manual.

Éste es el punto de partida de esta tesis: nos planteamos investigar y desarrollar una metodología para adquirir diccionarios de entidades de manera automática. La tesis generaliza el problema de la adquisición de entidades a la adquisición en general y analiza la situación actual con el fin de identificar vías plausibles para atacar el problema. Una vez concluida esta fase de índole más teórica, se aplican los conocimientos obtenidos para desarrollar una metodología que nos permita adquirir un léxico de entidades. Finalmente, para evaluar la utilidad del recurso adquirido, lo aplicamos a la tarea de búsqueda de respuestas.

³ http://dlsi.ua.es/eines/projecte.cgi?id=val&projecte=35
E.2 Trabajo de investigación y resultados obtenidos
E.2.1 Elementos clave para tratar la adquisición de conocimiento
El cuello de botella de la adquisición de conocimiento es uno de los mayores retos que afronta el campo de la Lingüística Computacional en la actualidad. En esta parte de la tesis consideramos el problema desde una perspectiva más abstracta y tratamos de identificar soluciones que puedan llevar a un paso adelante. Inspeccionamos la situación actual de este campo y de sus áreas vecinas e intentamos identificar elementos clave que puedan ser explotados en tareas de adquisición. A partir del análisis llevado a cabo surgen tres elementos clave: RLs, recursos de la Web 2.0 y estándares. A continuación motivamos la elección de estos elementos y valoramos en qué manera la adquisición de conocimiento se puede beneficiar de ellos.
Uso de recursos lingüísticos para respaldar su enriquecimiento

Durante las dos últimas décadas se ha dedicado un gran esfuerzo a la construcción manual de RLs computacionales. El resultado es un conjunto de recursos de alta calidad, los cuales contienen un valioso conocimiento lingüístico y pueden ser explotados para tareas de PLN en general y de adquisición en particular. Sin embargo, a excepción de WordNet, el gran esfuerzo dedicado a la construcción de RLs no ha tenido como contrapartida un uso generalizado del conocimiento etiquetado por parte de la comunidad investigadora. Esto se debe a varios factores de índole diversa, como la falta de validación formal, la falta de mecanismos de acceso automático adecuados, problemas de distribución, etc.
Explotar la Web 2.0 para una adquisición de alta cobertura y calidad

Hasta el momento, la investigación dedicada a la adquisición de conocimiento se ha focalizado en extraer la información requerida de dos tipos de fuentes: diccionarios de máquina y corpus. Ambos tienen importantes desventajas: mientras los diccionarios de máquina son pequeños y por tanto limitan la información que se puede extraer de ellos, los corpus consisten en texto no estructurado y por ello es más complicado extraer de ellos información valiosa. Normalmente los corpus utilizados pertenecen al dominio periodístico, y presentan por tanto un alto grado de subjetividad.

Con el advenimiento del paradigma colaborativo de la Web 2.0 han surgido nuevos tipos de texto tales como blogs, folksonomías y wikis. Estas fuentes de información presentan importantes ventajas sobre los tipos de fuentes anteriormente mencionados para tareas de adquisición: su tamaño es grande, la información se presenta en forma semi-estructurada, son dinámicos (están en continua evolución) y se desarrollan de manera colaborativa, por lo que reflejan la variedad del lenguaje.
Estándares para construir repositorios de conocimiento genéricos

Aparte del cuello de botella de la adquisición, otro problema actual se refiere a la interoperabilidad. La falta de planificación a largo plazo ha llevado a la existencia de una gran variedad de RLs en formatos diferentes (a menudo incompatibles). De hecho, este problema ha sido reconocido como uno de los más críticos para el futuro de la investigación en PLN (Ide & Pustejovsky, 2009). Al igual que (Nyberg, 2009), vemos la interoperabilidad de RLs como “un primer paso en la tendencia general hacia un mayor intercambio y reuso de componentes en las aplicaciones de PLN”. De hecho, un valor añadido de nuestra propuesta lo constituye el uso de estándares con el fin de hacer los procedimientos más generales e independientes de recursos específicos y mejorar la interoperabilidad y el futuro intercambio de diferentes RLs. Hemos estudiado el uso del estándar ISO LMF como formato de representación para la información adquirida. Esta elección facilita la explotación del recurso generado en tareas de PLN.
E.2.2 Método
Nuestro método mapea la jerarquía formal de un recurso lingüístico respecto a la taxonomía de categorías de Wikipedia, desambigua posibles casos ambiguos, extrae los artículos presentes en dichas categorías e identifica cuáles de ellos constituyen ENs. A continuación, se introducen en el léxico de entidades diversos tipos de información de estos artículos, tales como variantes ortográficas, definiciones, etc. En una fase de post-proceso, (i) se extraen entidades adicionales haciendo uso de los enlaces interlingua de Wikipedia y (ii) se conectan las entidades extraídas a dos ontologías. Se tratan a continuación las fases más importantes de este proceso.

Desambiguación

Una vez que los nombres del recurso lingüístico de partida han sido mapeados con respecto a las categorías, los nombres polisémicos deben ser desambiguados: hay que identificar el sentido que corresponde a la categoría mapeada. Hemos concebido dos aproximaciones para llevar a cabo la desambiguación de manera automática. La primera busca instancias comunes en los árboles de hiponimia de los nombres de partida y la categoría mapeada, mientras que la segunda calcula la similitud textual entre las definiciones de cada sentido del nombre y la definición de la categoría.

La primera aproximación, dado un nombre polisémico mapeado a una categoría, extrae las instancias de cada uno de los sentidos del nombre y comprueba si alguna de ellas coincide con algún artículo de la categoría. Si se halla una instancia común, el sentido desambiguado es aquel que contiene dicha instancia. Debido a que la taxonomía de Wikipedia es normalmente más profunda que la de WordNet, este método no busca instancias comunes solamente en las categorías mapeadas sino también en sus subcategorías hipónimas. Ahora bien, la relación de subcategoría no implica necesariamente hiponimia, y por tanto se hace necesario detectar si una subcategoría es un hipónimo de una categoría.
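La primera aproximación se puede esbozar en unas pocas líneas (esbozo mínimo e ilustrativo: los datos y los nombres de función son hipotéticos, no los de la implementación de la tesis):

```python
def desambiguar_por_instancias(sentidos, articulos_categoria):
    """Devuelve el sentido que comparte alguna instancia con la categoría."""
    for sentido, instancias in sentidos.items():
        # Intersección entre las instancias del sentido y los artículos
        # extraídos de la categoría de Wikipedia.
        if instancias & articulos_categoria:
            return sentido
    return None

# Datos hipotéticos: dos sentidos de "banco" y sus instancias.
sentidos = {
    "banco#1 (entidad financiera)": {"Banco Santander", "Deutsche Bank"},
    "banco#2 (asiento)": set(),
}
articulos = {"Deutsche Bank", "Banco de España"}  # artículos de la categoría
print(desambiguar_por_instancias(sentidos, articulos))
```

Queda fuera del esbozo la comprobación de si una subcategoría es hipónimo de la categoría.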
Para ello hemos aplicado expresiones regulares que pueden contener tanto elementos léxicos como morfosintácticos (Part-of-Speech).

Debido a que hay diferentes métodos para calcular la similitud textual, hemos decidido tener en cuenta un conjunto representativo de éstos para determinar cuál funciona mejor para nuestra tarea: un sistema de implicación textual, un algoritmo que aplica la medida de personalised PageRank sobre WordNet y un algoritmo del tipo análisis semántico latente basado en proyección aleatoria. Además de estos sistemas, se contemplan dos sistemas baseline: uno elige siempre el primer sentido de WordNet, pues es el más frecuente; el otro calcula la similitud entre los textos contando simplemente el número de palabras que ambos tienen en común. Finalmente, y dado que los sistemas pertenecen a diferentes paradigmas, se ha experimentado con diversas aproximaciones para combinar sus salidas: se ha aplicado un esquema de votación, otro de combinación no supervisada y otro de combinación supervisada.

Identificación de entidades

Hemos explorado tres métodos para identificar cuáles de los artículos extraídos son ENs. El primero se basa en un motor de búsqueda web, el segundo en el contenido de los artículos de Wikipedia, mientras que el tercero combina los dos precedentes. Los tres comparten un aspecto común: explotan ciertas características ortográficas seguidas en varios idiomas, i.e. que los nombres propios empiezan con mayúscula mientras que los nombres comunes empiezan con minúscula.

En el primer método, el título del artículo se busca en la Web haciendo uso de un motor de búsqueda. Se consideran los 50 primeros resultados devueltos por el motor, y en éstos un algoritmo calcula el número de veces que el título empieza con mayúscula y con minúscula. Se define un umbral de forma empírica para distinguir entre artículos que son entidades y artículos que no lo son. La desventaja de este método se refiere a la variación de sentidos: un nombre puede tener un sentido en el que se refiere a un nombre común y otro en el que se refiere a un nombre propio. Por ejemplo, el artículo “Children's Machine” es una entidad referida a un ordenador.
Sin embargo, si buscamos esta cadena en internet podemos encontrar p´aginas en las que “The children’s machine” es el t´ıtulo de un libro en el cual “children” y “machine” son nombres comunes, y por tanto es muy probable que la identificaci´on no funcione correctamente. El segundo m´etodo procede de manera an´aloga pero busca las ocurrencias del t´ıtulo en el cuerpo de su entrada en Wikipedia. Una caracter´ıstica a destacar es que hacemos uso de los enlaces que conectan la entrada de Wikipedia con su equivalente en otros idiomas, por tanto no solo buscamos las ocurrencias del t´ıtulo en su entrada en el idioma que tratamos sino en diez idiomas que siguen la caracter´ıstica ortogr´afica anteriormente introducida 159
(Spanish, Catalan, French, Dutch, English, Italian, Norwegian, Portuguese, Romanian and Swedish). The drawback of this method is that the number of occurrences obtained tends to be much lower than in the first method, and the results are therefore less significant. The third method combines the previous two: first, it extracts the most significant terms from the article body using the tf-idf algorithm; it then searches the Web for the article title together with these terms. This largely guarantees that the texts obtained refer to that article, while the number of occurrences remains high.

The entity lexicon

As a result of the acquisition process, an output is produced in a format compliant with the LMF standard. The elements that make up our output are mainly NEs, orthographic variants of these NEs, and classes to which these NEs belong (by means of instantiation relations). Since these data can be represented by means of a lexicon, and since we are in fact building a lexicon, we decided to follow the LMF format, the ISO standard for the representation of computational lexicons. The elements of this format that we have used, and the role they play in our lexicon, are described below.

The "Lexicon" element contains each of the monolingual NE dictionaries. "LexicalEntry" contains all the information related to each NE; it has two child nodes. The first, "Lemma", contains the lemma of the NE and its orthographic variants (by means of the "FormRepresentation" element). The second, "Sense", contains the semantic information related to the NE, which can be of two kinds: (i) semantic relations with other lexical entries ("SenseRelation") and (ii) links to external resources ("MonolingualExternalRef").
In addition, we use the LMF "NLP multilingual notations" extension to create a multilingual lexicon in which NEs from different languages can be connected through interlingual links. The LMF element used for this purpose is "SenseAxis": it represents the relations between equivalent NEs belonging to different languages (each of these NEs is contained in a "SenseAxisElements"). The "InterlingualExternalRef" element, which can be instantiated in a "SenseAxis", makes it possible to connect a set of equivalent NEs to external resources, in our case to ontologies.

We have designed the NE lexicon as a database whose structure
conforms (isomorphically) to LMF.
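An entry of the kind described above might be rendered roughly as follows in LMF-style XML. The element names are those cited in the text; the attribute names, identifiers and the sample entity are illustrative, not taken from the actual lexicon.

```xml
<!-- Hypothetical fragment; attributes and values are illustrative. -->
<Lexicon lang="en">
  <LexicalEntry>
    <Lemma>
      <FormRepresentation writtenForm="United Nations"/>
      <FormRepresentation writtenForm="UN"/> <!-- orthographic variant -->
    </Lemma>
    <Sense id="en-UnitedNations">
      <!-- instantiation relation: this NE is an instance of a class -->
      <SenseRelation type="instance" target="organisation"/>
      <!-- link to an external resource, e.g. a WordNet sense -->
      <MonolingualExternalRef externalSystem="WordNet"
                              externalReference="organisation-sense"/>
    </Sense>
  </LexicalEntry>
</Lexicon>

<!-- Multilingual layer: equivalent NEs across languages share a SenseAxis -->
<SenseAxis id="axis-UnitedNations">
  <SenseAxisElements senses="en-UnitedNations es-NacionesUnidas"/>
  <InterlingualExternalRef externalSystem="SUMO"
                           externalReference="Organization"/>
</SenseAxis>
```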
E.2.3
Evaluation
Evaluation setting

We used the database dumps of Wikipedia for Spanish, English and Italian generated in January 2008 [4]. From these dumps we used the "page", "pagelinks", "categorylinks", "text" and "abstracts" tables. As regards LRs, we used the English WordNet (version 2.1), the Spanish WordNet (version 1.6) and PSC.

Disambiguation

The method that looks for common instances disambiguates only 39% of the nouns. This low coverage, due to the low number of instances present in WordNet, is compensated by a very good precision: in fact, all the disambiguated nouns are correct. We analysed the reasons why the method fails to disambiguate the remaining 61% of the nouns. There are two main causes: (i) one of the WordNet senses corresponds to the category but no common instance is found (this happens in 78% of the cases); and (ii) no WordNet sense corresponds to the category, or the category has changed its name (the remaining 22%).

As for the approach that computes textual similarity, the first thing that stands out is the high score achieved by the first baseline system (64.7%), which simply always picks the first WordNet sense, as it is the most frequent one. The other baseline system obtains a noticeably lower score (62.7%). Of the three systems applied, the one based on latent semantic analysis obtains the worst result (54.1%), while the remaining two obtain similar results: the PageRank-based system obtains 64.2%, while the one applying textual entailment matches the baseline, also obtaining 64.7%, without using any training corpus. Using such a corpus, this system obtains a significantly better result: 77.74%.
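The second baseline mentioned above, plain word overlap between the two texts, amounts to something like the following sketch. Whitespace tokenisation, lower-casing and the Jaccard-style normalisation are assumptions; the summary does not specify the exact counting scheme.

```python
def overlap_similarity(text_a: str, text_b: str) -> float:
    """Word-overlap baseline: shared word types over all word types.

    Naive whitespace tokenisation; lower-casing and the Jaccard-style
    normalisation are assumptions, as the summary only says the baseline
    counts the words the two texts have in common.
    """
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

The sense whose gloss maximises this score against the category text would then be selected.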
As for the combinations, voting obtains 68%, the unsupervised scheme 65.7% and the supervised one 77.24%.
[4] Downloaded from http://download.wikimedia.org
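The voting scheme over the senses proposed by the individual systems can be sketched as follows. The tie-breaking policy, here favouring the earliest-listed system, is an assumption not stated in the summary.

```python
from collections import Counter

def vote(predictions: list[str]) -> str:
    """Majority vote over the senses proposed by the individual systems.

    On a tie, the sense proposed by the earliest-listed system wins;
    the actual tie-breaking policy is not stated in the summary.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for sense in predictions:  # iteration order preserves system order
        if counts[sense] == best:
            return sense
```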
Extraction of NEs

We extracted NEs from the categories mapped to the starting nouns for each language. Table E.1 provides quantitative information about the extracted NEs. Besides showing the number of NEs extracted and included in the lexicon, it also shows the number of orthographic variants and instantiation relations acquired.

Table E.1: Extracted NEs

              English      Spanish    Italian
  NEs         948,410      99,330     78,638
  Variants    1,541,993    128,796    104,745
  Relations   1,366,899    128,796    139,190
The number of NEs extracted for both Spanish and Italian is considerably lower than the number extracted for English. This result was expected, since both the number of Wikipedia articles (305,000 in Spanish, 388,000 in Italian, 2,100,000 in English) and the number of mapped categories are significantly lower for these two languages.

Identification of NEs

We evaluated the three methods proposed to identify NEs. The best result is obtained with the method that combines Wikipedia and the Web: an F(β=0.5) of 81.20%, with 79.17% precision and 90.48% recall. The first method (Web) obtains a precision of 74.37% and a recall of 88.44%, while the second (Wikipedia) obtains a precision of 76.47% and a recall of 89.80%.

Post-processing

By exploiting Wikipedia's interlingual links, 26,157 additional NEs are extracted for English, 38,253 for Spanish and 47,168 for Italian. After this process, the lexicon therefore contains a total of 974,567 NEs in English, 137,583 in Spanish and 125,806 in Italian.

In this post-processing phase we also connect sets of NEs belonging to different languages to two ontologies. 814,251 sets are linked
to the SUMO ontology and 42,824 to the SIMPLE ontology. The large difference in the number of entities connected to these two ontologies (roughly a 20-to-1 factor) is due to the fact that, in order to connect a set to SIMPLE, it must contain an Italian NE linked to PSC, whereas to connect a set to SUMO it must contain an English NE linked to WordNet. The results are therefore consistent, since the NE dictionary contains many more NEs linked to WordNet (1,366,899) than to PSC (139,190).
E.2.4
Application to Question Answering
With the aim of applying the NE lexicon created to an NLP task, and thereby validating its usefulness, we added the knowledge contained in the lexicon to a QA process. The idea is to use this knowledge to validate and rerank the answers returned by the QA system.

The system used, BRILIW, was designed to locate answers in documents even if the answers and the questions are written in different languages. This system took part in the English-Spanish bilingual QA task of CLEF 2006 and was the system that obtained the best results. The architecture of BRILIW rests on three main pillars that distinguish it from the rest of the systems of this kind: (i) the use of several multilingual knowledge sources (the interlingual module of EuroWordNet and the multilingual knowledge in Wikipedia); (ii) the use of more than one possible translation of each question word; and (iii) the analysis of the question in its original language, without any translation step.

We added to this system a validation module which uses the knowledge in the NE lexicon to validate the answers, with the possibility of reranking them. This module is able to validate two types of questions: (i) those that expect an NE as answer type (e.g. Who is the Secretary General of Interpol?); and (ii) those that ask for the definition of an NE (e.g. Who is Vigdis Finnbogadottir?). The module classifies each answer into one of the following types: (i) unknown, if the lexicon does not contain the NE of the question (type i) or the NE of the answer (type ii); (ii) correct, if the lexicon contains the NE of the question or of the answer and the NE type encoded in the lexicon matches the one tagged by BRILIW; and (iii) incorrect, if the lexicon contains the NE of the question or of the answer and the NE type encoded in the lexicon does not match the one tagged by BRILIW.
Once the answers have been tagged, the validation module reranks the answer list according to the following order: correct, unknown, incorrect.

We evaluated the effectiveness of the validation module using the official CLEF 2006 question set, taking as reference the results the system obtained in that competition. The results obtained with validation are promising, as BRILIW achieves a 28.1% improvement over its previous results.
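The classify-and-reorder logic of the validation module can be sketched as follows, modelling the lexicon as a plain mapping from NE to type, a strong simplification of the actual LMF lexicon.

```python
# Reranking order described in the text: correct, then unknown, then incorrect.
ORDER = {"correct": 0, "unknown": 1, "incorrect": 2}

def classify(ne: str, tagged_type: str, lexicon: dict[str, str]) -> str:
    """Label one candidate answer against the NE lexicon.

    `lexicon` maps an NE to its type, a simplification of the LMF lexicon;
    `tagged_type` is the type assigned by the QA system (BRILIW).
    """
    if ne not in lexicon:
        return "unknown"
    return "correct" if lexicon[ne] == tagged_type else "incorrect"

def rerank(answers: list[tuple[str, str]], lexicon: dict[str, str]) -> list[str]:
    """Reorder (NE, tagged_type) candidates: correct, unknown, incorrect.

    The sort is stable, so the QA system's original ranking is preserved
    within each class (an assumption; the summary does not say so).
    """
    return [ne for ne, tagged in
            sorted(answers, key=lambda a: ORDER[classify(a[0], a[1], lexicon)])]
```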
E.3
Conclusions
This PhD thesis has addressed the knowledge acquisition bottleneck problem and has presented a case study on the acquisition of an NE lexicon. The first step consisted of a high-level study aimed at identifying guidelines that help overcome the acquisition problem. Once the problem had been analysed from a more abstract perspective, we applied these guidelines to the specific problem of NE acquisition.

The main contribution of this thesis consists, in fact, in the design of a general methodology for the automatic creation of NE lexica that uses Web 2.0 resources and pre-existing LRs. We applied this methodology to three such resources covering three languages: WordNet for English, WordNet for Spanish and PSC for Italian. The resulting lexicon contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian respectively, and 1,366,860, 141,055 and 139,190 instantiation relations to WordNet, the Spanish WordNet and PSC respectively. In addition, we connected sets of equivalent NEs belonging to different languages to two ontologies: SUMO (814,251 sets) and SIMPLE (42,824 sets). The resulting resource, together with two access APIs (in C++ and PHP), is publicly available.

An important characteristic of the proposed approach is its high degree of language independence. The method can be applied directly to any language for which a Wikipedia version, a linguistic resource with a noun taxonomy and a lemmatiser exist. In fact, we have applied it to LRs based on different theories and covering three languages. Moreover, updating the NE lexicon is trivial: by simply re-running the method, the new Wikipedia articles (those that did not
exist when the method was last run) would be analysed and, of those, the ones detected as NEs would be added to the lexicon and linked to the LRs and the ontologies.

Finally, we evaluated the usefulness of the created lexicon in a real application: it was used to validate the answers generated by a QA system. With the knowledge in the lexicon, the QA system improves its results by 28.1%.

Regarding future work, there are several possible research lines. On the one hand, we are considering carrying out new experiments that incorporate other types of structured information provided by Wikipedia, such as links between pairs of articles and infoboxes. On the other, it would be interesting to extract additional types of information to enrich the acquired lexicon, for example semantic relations between the extracted NEs. Finally, we would like to apply the lexicon to other NLP tasks; one possibility would be to use it in a learning-based NER system as a dictionary-type feature.
Appendix F Abbreviations
ACE  Automatic Content Extraction
ACQUILEX  Acquisition of Lexical Knowledge for Natural Language Processing Systems
ACQUILEX-II  Acquisition of Lexical Knowledge
API  Application programming interface
AVE  Answer Validation Exercise
BRILIW  Question Answering using Inter Lingual Index module of EuroWordNet and Wikipedia
BSD  Berkeley Software Distribution
CLEF  Cross Language Evaluation Forum
CLIPS  Corpora e Lessici di Italiano Parlato e Scritto
CoNLL  Conference on Natural Language Learning
CNR  Consiglio Nazionale delle Ricerche
DRAMNERI  Dictionary, Rule-based And Multilingual Named Entity Recognition Implementation
EAGLES  Expert Advisory Group for Language Engineering Standards
EWN  EuroWordNet
GikiCLEF  Cross-language Geographic Information Retrieval from Wikipedia
GL  Generative Lexicon
ICGL  International Conference on Global Interoperability for Language Resources
IDF  Inverse Document Frequency
IE  Information Extraction
IEEE  Institute of Electrical and Electronics Engineers
IGDB  Integrated Gazetteer Database
ILI  Inter-Lingual-Index
IR  Information Retrieval
IREX  Information Retrieval and Extraction Exercise
ISLE  International Standards for Language Engineering
ISO  International Organization for Standardization
ITBM  Istituto di Tecnologie Biomediche
IWN  Italian WordNet
JWPL  Java-based Wikipedia Library
JWKTL  Java-based WiKTionary Library
KB  Knowledge Base
LaSIE  Large Scale Information Extraction system
LMF  Lexical Markup Framework
LR  Language Resource
MEANING  Developing Multilingual Web-scale Language Technologies
MET  Multilingual Entity Task Conference
MRDs  Machine Readable Dictionaries
MUC  Message Understanding Conference
NLP  Natural Language Processing
NE  Named Entity
NER  Named Entity Recognition
OLIF  Open Lexicon Interchange Format
OWL  Web Ontology Language
PAROLE  Preparatory Action for Linguistic Resources Organisation for Language Engineering
PMI  Pointwise Mutual Information
PoS  Part-of-Speech
PSC  PAROLE-SIMPLE-CLIPS
QA  Question Answering
REPENTINO  REPositório para reconhecimento de ENTidades com NOme
RTE  Recognising Textual Entailment
SIMPLE  Semantic Information for Multifunctional Plurilingual Lexica
SI-TAL  Integrated System for the Automatic Language Processing
SPARKLE  Shallow Parsing and Knowledge Extraction for Language Engineering
SUMO  Suggested Upper level Merged Ontology
TCs  Top Concepts
TE  Textual Entailment
TEI  Text Encoding Initiative
TO  Top Ontology
UIMA  Unstructured Information Management Architecture
UNED  Universidad Nacional de Educación a Distancia
WiQA  Question Answering using Wikipedia
WSD  Word Sense Disambiguation
WWW  World Wide Web
XML  eXtensible Markup Language