DRAMNERI: a free knowledge based tool to Named Entity Recognition
Antonio Toral
Grupo de investigación en Procesamiento del Lenguaje Natural y Sistemas de Información
Departamento de Lenguajes y Sistemas Informáticos
University of Alicante, Spain
[email protected]
Abstract. In this paper we present DRAMNERI, a free software application that uses rules and gazetteers to perform Named Entity Recognition. The system is multilingual and fully customizable to any specific domain. It has been successfully applied in a domain-specific Information Extraction system and in a Question Answering task.
1
Introduction
Named Entity Recognition (NER) is nowadays an important task for solving more complex problems such as Information Retrieval (IR), Information Extraction (IE) and Question Answering (QA), among others. Initially, however, NER was used only as a subtask of IE, the Natural Language Processing (NLP) task that consists of retrieving relevant information from unstructured texts and producing a structured set of data, usually referred to as templates. Several subtasks are applied to achieve this goal, and NER is one of them. As defined in the Message Understanding Conference [3], NER consists of identifying and categorizing entity names, which can also include temporal and/or numerical expressions.
As with other NLP techniques, there are two approaches to NER [1]. One is knowledge based, while the other uses a supervised learning algorithm. Regarding resources, the former usually relies on gazetteers and rules, whereas the latter needs an annotated corpus. The knowledge-based model obtains good results in specific domains, because the gazetteers can be adapted very precisely, and it is able to detect complex entities, because the rules can be tailored to meet nearly any requirement. For an unrestricted domain, however, the learning approach is preferable, since building rules and gazetteers would be very tedious and time consuming.
Because our aim is to classify complex entities in restricted texts, we have adopted the knowledge-based model. We also wanted our system to be highly flexible and adaptable, so we have made almost all parameters customizable (e.g. the dictionaries to use, the entity categories, the length of the contexts). The system can therefore be easily configured to work with different languages and domains, and it can deal with an open set of entity categories¹.
Regarding software licenses, we strongly agree with the Freeling [2] and Weka [7] developers that the free availability of basic NLP tools would speed up progress in our area of research. We modestly contribute to this goal by releasing this software under a free license².
2
Architecture
DRAMNERI stands for Dictionary, Rule-based And Multilingual Named Entity Recognition Implementation. It is a multiplatform³ NER system written in C++. It is organized as a sequential set of modules with a high degree of flexibility: individual modules may be used or skipped depending on the input⁴. Moreover, most of the actions it performs, as well as the dictionaries and rules it uses, are configurable through parameter files. The main modules of the system are briefly outlined in the following subsections.
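The sequential organization with optional modules described above can be illustrated with a minimal Python sketch. The module names, the `enabled` switches and the placeholder module bodies are assumptions made for illustration only; they do not reproduce DRAMNERI's C++ implementation or its parameter file format.

```python
# Illustrative sketch only: module names and bodies are hypothetical stand-ins.

def tokenizer(text):
    """Split raw text into tokens on whitespace (placeholder logic)."""
    return text.split()


def sentence_splitter(tokens):
    """Group tokens into sentences; here every "." closes a sentence."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token == ".":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# Modules run in this fixed order; each consumes the previous module's output.
MODULES = [("tokenizer", tokenizer), ("sentence_splitter", sentence_splitter)]


def run_pipeline(data, enabled):
    """Apply each module in order, skipping those switched off in `enabled`."""
    for name, module in MODULES:
        if enabled.get(name, True):
            data = module(data)
    return data


# Raw text goes through every module.
print(run_pipeline("Vive en Alicante . Trabaja allí .", {}))
# Input that is already tokenized skips the tokenizer.
print(run_pipeline(["Vive", "en", "Alicante", "."], {"tokenizer": False}))
```

In this sketch, switching a module off simply passes the data through unchanged, which mirrors the case of input that has already been tokenized or split into sentences.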
Fig. 1. DRAMNERI architecture
¹ NER systems are usually limited to a closed set of categories, typically Person, Location, Organization, and time, date and numerical expressions.
² This software is distributed under the terms of the GNU General Public License (GPL).
³ It has been successfully tested on GNU/Linux and Win32, but it should work on any system that has a C++ compiler with STL support.
⁴ For example, we could process text that is already tokenized and/or already split into sentences.
2.1
Tokenizer
The built-in tokenizer was designed with efficiency and simplicity in mind, and it can be used for correctly punctuated, common texts in languages that use Latin character encodings. If the program is used for unusual domains, for languages with different encodings, or with other tokenization demands, an external module should be used instead. Free tokenizers that can handle this task are available, such as the one included in Freeling [2].
2.2
Sentence Splitter
We split sentences using an algorithm based on the method used for this task in the EXIT system [5]. For every token in the text that can delimit a sentence (e.g. a full stop or a question mark), the two preceding tokens and the following token are considered. With this context information, rules and dictionaries are applied to decide whether or not the token marks the end of a sentence.
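The context-based decision described above can be sketched in Python as follows. The delimiter set, the abbreviation dictionary and the three rules are hypothetical, since the concrete rules and dictionaries are not listed here; only the context window (two preceding tokens and one following token) comes from the description.

```python
# Hypothetical resources: the actual rules and dictionaries are not given.
DELIMITERS = {".", "?", "!"}
ABBREVIATIONS = {"Sr", "Sra", "Dr", "etc"}


def is_sentence_end(prev2, prev1, nxt):
    """Decide whether a delimiter ends a sentence, given its context.

    prev2 and prev1 are the two preceding tokens and nxt is the following
    token; any of them is None when the text has no token at that position.
    """
    if nxt is None:
        return True  # nothing follows, so the text ends here
    if prev1 in ABBREVIATIONS:
        return False  # dictionary rule: "Sr ." does not close a sentence
    starts_chain = prev2 is None or prev2 in DELIMITERS
    if prev1 is not None and len(prev1) == 1 and prev1.isupper() and starts_chain:
        return False  # initials such as "J . M . Pérez"
    return nxt[0].isupper()  # a new sentence starts with a capital letter


def split_sentences(tokens):
    """Split a token list into sentences using is_sentence_end."""
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token in DELIMITERS:
            prev2 = tokens[i - 2] if i >= 2 else None
            prev1 = tokens[i - 1] if i >= 1 else None
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if is_sentence_end(prev2, prev1, nxt):
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences


tokens = "El Sr . Pérez vive en Alicante . Compró un piso .".split()
for sentence in split_sentences(tokens):
    print(" ".join(sentence))
```

Here the first full stop is not treated as a sentence boundary because the preceding token is in the abbreviation dictionary, while the other two are.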
2.3
Named Entity Identification
This task is applied to each sentence in the given text, and its goal is to identify the named entities that appear in it. We use regular expressions to do this. Groups of tokens that match any NEI regular expression, possibly joined by prepositions⁵, are detected and identified as generic entities. Both the list of prepositions to consider and the maximum number of prepositions allowed between two tokens that match an NEI regular expression are configurable. For example, if 'de' and 'la' are in the preposition list and the maximum number of prepositions between identified tokens is 1, then in the string "en la Universidad de Alicante" the entity "Universidad de Alicante" is identified as a whole, whereas in "Pedro de la Viuda" the tokens "Pedro" and "Viuda" are identified as two separate entities rather than as the single entity "Pedro de la Viuda", because two prepositions separate them and the maximum is 1.
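The grouping of matching tokens across a bounded number of prepositions can be sketched in Python as follows. The NEI regular expression used here (a token starting with a capital letter) and the function name are assumptions for illustration; the preposition list and the maximum are the configurable parameters mentioned above.

```python
import re

# Assumed NEI regular expression: a token that starts with a capital letter.
NEI_PATTERN = re.compile(r"^[A-ZÁÉÍÓÚÑ]")


def identify_entities(tokens, prepositions, max_preps):
    """Return generic entities: runs of matching tokens that may be joined
    by at most max_preps consecutive prepositions."""
    entities = []
    i = 0
    while i < len(tokens):
        if not NEI_PATTERN.match(tokens[i]):
            i += 1
            continue
        end = i  # index of the last token currently in the entity
        while True:
            j = end + 1
            # Skip up to max_preps prepositions after the current entity end.
            while (j < len(tokens) and tokens[j] in prepositions
                   and j - end - 1 < max_preps):
                j += 1
            # Extend the entity only if a matching token follows.
            if j < len(tokens) and NEI_PATTERN.match(tokens[j]):
                end = j
            else:
                break
        entities.append(" ".join(tokens[i:end + 1]))
        i = end + 1
    return entities


preps = {"de", "la"}
print(identify_entities("en la Universidad de Alicante".split(), preps, 1))
print(identify_entities("Pedro de la Viuda".split(), preps, 1))
print(identify_entities("Pedro de la Viuda".split(), preps, 2))
```

With a maximum of 1 the second call yields two separate entities, and raising the maximum to 2 joins them into one.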
2.4
Named Entity Recognition
The goal of this phase is to assign a category to each of the entities detected in the previous step. Note that the boundaries of the identified entities can be altered in this phase. To perform the classification, we take into account external and internal evidence [4]; that is, we look for any information that helps us classify each entity by studying its left and right contexts (external evidence) and the entity itself (internal evidence). These two actions are performed sequentially and are detailed below.
⁵ Extracted from a preposition list specific to NEI.
Classification using external evidence
We use trigger gazetteers. By a trigger we mean a word or group of words that appears before or after an entity and determines its category. For trigger-driven classification, left and right contexts of the identified entity, of configurable length, are considered. Front-trigger gazetteers are applied to the left context and back-trigger gazetteers to the right context. If a trigger is found, the entity is classified with the category of the gazetteer to which the matching trigger belongs. For example, given the string "calle Mayor", where calle is a location trigger, "calle Mayor" is classified as a location entity, and the output is the string "calle Mayor" marked with that category.
If an entity is classified by a front trigger, we then try to extend it by using rules, which follow the standard egrep syntax. An example follows.
  rule: ^nº [0-9]+
  identified entity: "calle Mayor nº 27"
"calle Mayor" is classified as an address using triggers. It is then extended because "nº 27" matches the rule. Thus, the final entity classified with type address is "calle Mayor nº 27".
Classification using internal evidence
For this classification we use two resources: gazetteers and rules. Like the rules used for trigger-driven classification, these follow the standard egrep syntax, and they may also contain elements that refer to gazetteers. Each rule is linked to an entity category, so that if a rule matches an entity, the entity is assigned the category linked to the rule. The following example matches a first name and surname in Spanish:
  rule: ^PER (PER)? ((PREP)? (PREP)? PER)
  category: PER
This rule matches an entity that consists of a token in the Person gazetteer, optionally followed by a second such token and by up to two tokens present in the prepositions gazetteer, and ending with another token in the Person gazetteer. If an entity matches, it is assigned the category PER. Example strings that would match are "Alberto Pérez" and "Pedro Mario de la Viuda".
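The two classification steps described in this section can be sketched in Python as follows. All gazetteer contents and function names are hypothetical. Two further assumptions are made: the front trigger becomes part of the classified entity, as the "calle Mayor" example suggests, and the internal rule is rewritten with its whitespace folded into the optional groups so that Python's `re` module can match it against a space-separated string of gazetteer labels. How DRAMNERI itself resolves gazetteer references inside rules is not specified here.

```python
import re

# Hypothetical resources, for illustration only.
FRONT_TRIGGERS = {"calle": "address"}
EXTENSION_RULES = [re.compile(r"^nº [0-9]+")]
GAZETTEERS = {
    "PER": {"Alberto", "Pérez", "Pedro", "Mario", "Viuda"},
    "PREP": {"de", "la"},
}
# Adapted form of "^PER (PER)? ((PREP)? (PREP)? PER)" (see assumptions above).
INTERNAL_RULES = [(re.compile(r"^PER( PER)?(( PREP)?( PREP)? PER)"), "PER")]


def classify_external(left_tokens, entity, right_text, context_len=1):
    """Classify an entity from a front trigger found in its left context,
    then try to extend it to the right with the extension rules."""
    window = left_tokens[-context_len:] if context_len > 0 else []
    for offset, token in enumerate(window):
        category = FRONT_TRIGGERS.get(token.lower())
        if category is None:
            continue
        # The trigger (and any tokens after it) joins the entity.
        text = " ".join(window[offset:] + [entity])
        for rule in EXTENSION_RULES:
            match = rule.match(right_text)
            if match:
                text = text + " " + match.group(0)
                break
        return text, category
    return entity, None  # no trigger found: entity stays unclassified


def classify_internal(entity):
    """Classify an entity by matching rules against its gazetteer labels."""
    labels = []
    for token in entity.split():
        label = next((name for name, words in GAZETTEERS.items()
                      if token in words), token)
        labels.append(label)
    label_string = " ".join(labels)
    for rule, category in INTERNAL_RULES:
        if rule.match(label_string):
            return category
    return None


print(classify_external(["en", "la", "calle"], "Mayor", "nº 27 de Alicante"))
print(classify_internal("Pedro Mario de la Viuda"))
```

Tokens that belong to no gazetteer keep their surface form in the label string, so an entity such as "Banco Santander" matches no PER rule and is left unclassified by this step.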
3
Application to specific tasks
We have not directly evaluated DRAMNERI, as it is intended to be a generic tool that can be adapted to specific tasks. Instead, we present two tasks in which we have applied it. First, we outline the use of this NER system in an Information Extraction system whose domain consists of notarial documents. Second, we discuss the use of DRAMNERI in a Question Answering (QA) system.
In the first task we had a collection of notarial documents from which we wanted to build templates containing the relevant data in an ordered way. To do this, DRAMNERI was applied between a preprocessing and a postprocessing step. In the preprocessing step we divided the documents into several sections, because the relevant data changed from section to section. For each section we built adapted gazetteers and rules with which to apply DRAMNERI. Finally, we applied a postprocessing step to the classified entities of each section. Roughly, it consisted of filling templates by taking into account the different entity categories, the order in which the entities were classified, and so on. An output template is shown as an example:
PEDRO PEREZ LOPEZ JOSEFA PEREZ calle Zepelin número 5 3º 25526996-S 69962777-U José María González Banco Santander Central S.A URBANA.- CATORCE.- Piso calle Sevilla 9932023UT2959S0050PS
In our second task we applied DRAMNERI to Question Answering [6]. The aim of a QA system is to find a specific answer to a given query in a collection of documents. These systems are usually made up of an Information Retrieval (IR) module and a QA algorithm. The IR module retrieves from the collection the documents most relevant to a given query, and QA is applied only to these documents, as its computational cost is quite high. Our approach consisted of applying NER between IR and QA. In this way we filtered out every document returned by IR that did not contain any entity of the same category as the expected answer. For example, if the query were "Who is the president of France?", the answer category would be Person, and we would therefore filter out all the retrieved documents in which no Person entity was found. By applying NER we achieved a 26% data reduction and, moreover, increased data relevance by 9%.
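The filtering step between IR and QA can be sketched in Python as follows. The mapping from question words to answer categories, the document representation and the function names are assumptions made for illustration; the way the expected answer category is actually determined is described in [6], not here.

```python
# Hypothetical mapping from the first question word to an answer category.
ANSWER_CATEGORY = {"who": "PER", "where": "LOC", "when": "DATE"}


def expected_category(question):
    """Guess the answer category from the first word of the question."""
    words = question.split()
    if not words:
        return None
    return ANSWER_CATEGORY.get(words[0].lower())


def filter_documents(documents, question):
    """Keep only documents that contain an entity of the expected category.

    Each document is a dict whose "entity_categories" entry holds the set of
    categories that the NER step found in it (an assumed representation).
    """
    category = expected_category(question)
    if category is None:
        return list(documents)  # unknown answer type: discard nothing
    return [doc for doc in documents if category in doc["entity_categories"]]


retrieved = [
    {"id": "d1", "entity_categories": {"PER", "LOC"}},
    {"id": "d2", "entity_categories": {"LOC"}},
    {"id": "d3", "entity_categories": set()},
]
kept = filter_documents(retrieved, "Who is the president of France?")
print([doc["id"] for doc in kept])
```

Only the documents that survive this filter are passed on to the more expensive QA algorithm.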
4
Conclusions and Future Work
We have fulfilled our main objective, namely to have an easily customizable NER system that performs well at detecting complex entities in specific domains.
Besides, we think that providing a NER tool such as this one as free software could help the research community to focus on investigating new algorithms and techniques, rather than spending time implementing again and again basic algorithms that have little to contribute to the field. One line of future research could consist of investigating how to combine this NER system with a learning-based NER system in order to improve the results of the latter. Another line of research would be to determine how the addition of linguistic information (lexical, morphological and syntactic) could improve the performance of our NER system.
Acknowledgements
This research has been partially funded by the Valencia Government under project number GV04B-268.
References
1. A. Borthwick. A Maximum Entropy Approach to Named Entity Recognition. PhD thesis, New York University, September 1999.
2. X. Carreras, I. Chao, L. Padró, and M. Padró. Freeling: An Open-Source Suite of Language Analyzers. In Proceedings of the 4th LREC Conference, 2004.
3. N. Chinchor. Overview of MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7), 1998.
4. D. McDonald. Internal and external evidence in the identification and semantic categorization of proper names. Corpus Processing and Lexical Acquisition, 1996.
5. R. Muñoz and M. Palomar. Sentence Boundary and Named Entity Recognition in EXIT system: Information extraction system of notarial texts. In Proceedings of IV Int. Conference on Artificial Intelligence and Emerging Technologies in Accounting, 1998.
6. A. Toral, E. Noguera, F. Llopis, and R. Muñoz. Improving question answering using named entity recognition. In Proceedings of the 10th International Conference of Applications of Natural Language to Information Systems, pages 181–191, 2005.
7. I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.