Automating Event Recognition for Statistical Machine Translation Systems
Emna HKIRI, Souheyl MALLAT, Mohsen MARAOUI and Mounir ZRIGUI
Abstract. Event named entity recognition (NER) differs from most past research on NER in Arabic texts. Most efforts in named entity recognition have focused on specific domains and on general classes, especially the categories Organization, Location and Person. In this work, we build a system for the annotation and recognition of Event named entities. To reach this goal, we combine linguistic resources and tools. Our method is fully automatic and aims to improve the performance of our machine translation system.
1
Introduction
Information extraction (IE), as defined in the Message Understanding Conferences (MUC), aims to analyze natural language texts and extract useful information for a particular domain or application. IE is used in Arabic text processing systems such as search engines, clustering, classification, text mining, question answering, information retrieval, text summarization and indexing (Cimiano et al, 2004), (Cohen et al, 2009). The main task that we tackle in this paper is to develop a system that extracts events and to test the applicability of our method to Arabic texts. Arabic is considered a difficult language to handle in NLP. It is a Semitic language with specific morphological, syntactic, phonetic and phonological features. To overcome these challenges, we propose an event detection system in order to improve the performance of an SMT system. Our system takes a set of unclassified articles from Arabic news websites and generates a set of events with their attributes. In this work we adopted the definition of the Automatic Content Extraction (ACE) model to extract and annotate Event named entities. ACE defines an event as an action, a process in which participants are connected (George, 2004).
The rest of the paper is organized as follows. Section (2) deals with the definition of the event and reviews related work on event detection. In section (3), we present our method for the automatic extraction of events. In section (4), we discuss the main elements of our proposed system. In section (5), we present our experiments and discuss the results. Finally, we conclude and highlight future work.
2
State of art
Extracting information from text has received considerable attention in recent years. While the detection and extraction of named entities is relatively well understood, event extraction remains far from solved, because an event mention can be expressed by several linguistic expressions and can span several sentences (Béatrice, 2012)(Aymen, 2007). The very definition of an event varies with the application domain: software engineering, history, philosophy, linguistics, etc. Event extraction has attracted a number of works, mostly restricted to English (Gabriel, 2006) or French (Gabriel, 2008), (Ludovic, 2011). Although this research area is widespread, it is still little studied for Arabic, because the language has several unique features that do not exist in other languages (Soraya, 2013). For other languages, different approaches have proposed different definitions, which can be grouped into two main streams: the TimeML standard and the ACE model.
The ACE (Automatic Content Extraction) model defines an event as an action, a process in which participants are connected. Events are represented by their attributes and their participants. An ACE event can have a number of participants, each characterized by a role (agent, object, source, target). The attributes of an event are its type (destruction, creation, transfer, movement, interaction) and its modality (real or not real). Several systems and projects are founded on this model, such as the approach of Aone and Ramos-Santacruz, which addresses events and the relations between them and relies on patterns for labeling. The idea is to find all the information about an event and to fill in the grid corresponding to it; for example, for a "buying" event, their REES tool extracts from the text the time, the place, the seller and the buyer (Aone and Ramos-Santacruz, 2000).
The approach of Ahn is founded on machine learning. It is composed of four modules, each generated with the TiMBL tool: trigger identification, event anchors, attribute identification, and event coreference (Ahn, 2006). The same modular principle is used in Béatrice's approach for French event extraction. The anchor identification module is based on prepositions or determiners, and events may be nouns, verbs, adjectives or adverbs. Anchors are identified by classifying the words of the text. The basic features for this classification are the morpho-syntactic categories of words, lexical features (lemma, form, depth in the dependency tree), features drawn from WordNet, the words surrounding the event, etc.
The TimeML standard (Roser, 2006) is based on three fundamental concepts: time, events and relations. The TimeML project was created to improve question-answering systems so that they can handle questions about the temporal order of entities and events. TimeML adopts a broad conception of events. The EVENT tag classifies an annotated event according to the TimeML ontology, which includes different classes of events (aspectual, intentional action, intentional state, perception, and state). Parent et al. used the TimeML guidelines to manually annotate English texts. Their model is based on patterns and syntactic analysis, and covers adverbial, verbal, adjectival and nominal events. Their approach relies on action verbs and on syntactic rules to annotate nominal events that depend on temporal prepositions, which are initially taken from the TimeML guidelines (Parent et al., 2008).
The approach of Saurí (Saurí et al., 2005) focuses on disambiguating nouns that may have an event interpretation, using the SemCor corpus and TimeBank 1.2. They extracted 25 subtrees of the semantic network that contain essentially nouns referring to events. This approach also relies on a statistical module and WordNet for event annotation. If the search for an event in WordNet is unsuccessful, a Bayesian classifier is used; it is founded on a set of rules learned from the SemCor corpus. The results are evaluated by comparison with the tags of TimeBank 1.2.
In our work we are interested in the EVITA system (Sauri, 2005). This system is part of the TARSQI project (Verhagen, 2005) and follows the TimeML specification. EVITA is an event detection tool for English. It considers verbs, nouns and adjectives as events, since these textual elements are regarded as the most meaningful. For each type of "trigger", an extraction method is applied that takes grammatical and textual features into account. EVITA first tests whether the category of a word is verb, noun or adjective, and then classifies it among the predefined event types. These processes are complemented by the identification of the various arguments, whose objective is to identify the event participants. The identification is made by detecting pairs between the trigger and the other entities of the sentence. Pairs are detected using indices such as the morpho-syntactic category of the two elements, the type of event, the type of entity or the dependency relations in the pair.
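To make the ACE event model reviewed in this section more concrete, the following minimal Python sketch represents an event with its type, modality and role-bearing participants. The class and field names are illustrative assumptions, not part of ACE or of any tool cited here, and the mapping of the "buying" example to a transfer event with an agent and a source is likewise assumed for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class EventType(Enum):
    # Event types listed in the description of the ACE model above.
    DESTRUCTION = "destruction"
    CREATION = "creation"
    TRANSFER = "transfer"
    MOVEMENT = "movement"
    INTERACTION = "interaction"


class Role(Enum):
    # Participant roles listed in the description of the ACE model above.
    AGENT = "agent"
    OBJECT = "object"
    SOURCE = "source"
    TARGET = "target"


@dataclass
class Participant:
    text: str
    role: Role


@dataclass
class Event:
    trigger: str                  # word that anchors the event in the text
    event_type: EventType
    is_real: bool = True          # modality: real or not real
    participants: List[Participant] = field(default_factory=list)

    def by_role(self, role: Role) -> List[str]:
        """Return the text of every participant holding the given role."""
        return [p.text for p in self.participants if p.role is role]


# Hypothetical instance for the "buying" example (role choices are assumptions).
purchase = Event(
    trigger="buying",
    event_type=EventType.TRANSFER,
    participants=[
        Participant("the buyer", Role.AGENT),
        Participant("the seller", Role.SOURCE),
    ],
)
print(purchase.by_role(Role.AGENT))
```

Keeping roles and types as enumerations makes it harder to attach a participant with a role outside the fixed inventory.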
3
Proposed Method
In our work, as already mentioned, we were inspired by the EVITA system. EVITA considers verbs, nouns and adjectives as the most meaningful events. We consider only verbs and nouns as events and treat adjectives as less significant. Our event extraction system was built using GATE (General Architecture for Text Engineering). It identifies predefined named entities (names of "people", "places", "organizations" and "dates") and the relations between these entities and the defined events. Extracting an event consists of discovering the links between the "trigger" of the event and its arguments/attributes. These links are established from a syntactic analysis ("dependency analysis") and from extraction rules that exploit this analysis. To implement this task we used the JAPE transducer (Java Annotation Patterns Engine) provided by the GATE toolkit. Our event detection system is composed of four main phases, summarized below:
The first phase is lemmatization: we first cut the Arabic text into words, and then assign each word its lemma.
The second phase is the identification of triggers, which is done with gazetteers. These lists are composed of verbal lemmas for verbal triggers and nominal lemmas for nominal triggers. This phase proceeds as follows: we compare each lemma to the list of triggers (the gazetteer lists); if it matches, we annotate the corresponding lemmatized word as an event trigger.
In the third phase, as in the EVITA system, we associate each trigger with the class that it represents. At this stage of our work we do not deal with polysemous triggers; we developed gazetteers for monosemous triggers only.
The fourth phase covers the identification of participants and their semantic roles. For identification, the parser is used to extract the main dependencies ("subject", "object", "preposition", "agent"). The purpose of the semantic analysis is to assign semantic roles to the participants extracted in this way.
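The first three phases can be illustrated with a small Python sketch. It is not the GATE implementation: the lemma table, the two gazetteers and all names below are hypothetical, and English placeholder words stand in for Arabic lemmas. It only shows the lookup logic, where each token is lemmatized, compared with the verbal and nominal trigger lists, and labeled with the single event class of a monosemous trigger.

```python
from typing import Dict, List, NamedTuple

# Hypothetical gazetteers mapping a trigger lemma to its event class.
VERBAL_TRIGGERS: Dict[str, str] = {"attack": "Attack", "kidnap": "Kidnapping"}
NOMINAL_TRIGGERS: Dict[str, str] = {"bombing": "Bombing", "war": "War"}

# Hypothetical lemma table standing in for a morphological analyzer.
LEMMAS: Dict[str, str] = {"attacked": "attack", "bombings": "bombing"}


class Trigger(NamedTuple):
    token: str
    lemma: str
    kind: str         # "verbal" or "nominal"
    event_class: str


def lemmatize(token: str) -> str:
    """Phase 1: return the lemma of a token (the token itself if unknown)."""
    return LEMMAS.get(token, token)


def find_triggers(text: str) -> List[Trigger]:
    """Phases 2 and 3: match lemmas against the gazetteers and attach a class."""
    triggers = []
    for token in text.lower().split():
        lemma = lemmatize(token)
        if lemma in VERBAL_TRIGGERS:
            triggers.append(Trigger(token, lemma, "verbal", VERBAL_TRIGGERS[lemma]))
        elif lemma in NOMINAL_TRIGGERS:
            triggers.append(Trigger(token, lemma, "nominal", NOMINAL_TRIGGERS[lemma]))
    return triggers


for trigger in find_triggers("Rebels attacked the city after the bombings"):
    print(trigger)
```

Because each lemma maps to exactly one class, the sketch shares the limitation stated above: polysemous triggers would need a disambiguation step before the class is assigned.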
4
System Implementation
4.1
Work Environment
4.1.1
GATE
Most existing text engineering tools, if not all, were not originally developed for processing Arabic. Most were built for English, French or other languages, such as ACABIT, LEXTER, FASTER, ANA and EXIT (Roche, 2004). Some tools, such as Nooj and GATE, can support Arabic or be adapted to it. We chose the latter to implement our system. GATE¹ (General Architecture for Text Engineering) is an open-source Java platform dedicated to text engineering, which appeared well suited to the development of our system. It is a natural language processing toolkit that is very useful for information extraction. Through a system of plugins, GATE provides users with a variety of modules dedicated to text analysis. The most commonly used are tokenizers (segmenters), part-of-speech taggers (morpho-syntactic taggers), gazetteers (lexicons) and transducers (JAPE). The named entities extracted by GATE correspond to person names, organizations, locations, dates, etc. To produce new annotations, GATE makes it possible to load new resources and plugins, and to combine and configure them within the same processing pipeline.
4.1.2
External resources
For the Arabic language, GATE does not have a satisfactory number of instances in its predefined gazetteers of named entities. For example, Gazetteers_Personne is composed of only two types (female_names.lst and male_names.lst), whereas for English there are almost 20000 entries, classified in nine different types (person_ambig.lst, person_female_cap.lst, person_female.lst, person_full.lst, person_male_cap.lst, person_male.lst, person_relig.lst, person_sci.lst, person_spur.lst). To overcome this deficiency, we used external resources to enrich the predefined GATE gazetteers. The enriched gazetteers are of five different types, all built manually using our corpus and web resources.
Person gazetteer: a list of 3257 complete names of people found in Wikipedia and other websites. These names were normalized and formalized to obtain a list of 3000 names (first and last names).
Organization gazetteer: we collected 2400 names of organizations.
Location gazetteer: we enriched it from Arabic Wikipedia, taking the page labeled "countries of the world in Arabic" to retrieve location names. It consists of 2500 names of continents, cities, countries, etc.
Date gazetteer: we enriched this predefined gazetteer with 910 new entities.
Event gazetteer: as mentioned above, GATE does not annotate events and has no predefined gazetteer for them, which is why we created a new one, composed of verbal and nominal lists of triggers. These lists were collected from our domain corpus. To enrich them, we added to the extracted lemmas their synonyms from Arabic WordNet, as well as the argument and preposition structures (for verbs) from the Arabic dictionary. The final gazetteer contains 700 entries.
¹ http://gate.ac.uk/
Table 1: Enrichment of Gate Gazetteers
Gazetteer       Predefined entries   Enriched entries   Overall
Person          1700                 3000               4700
Organization    96                   2400               2496
Location        485                  2500               2985
Date            84                   910                994
Event           0                    700                700

4.2
Implementation of the Method
Implementing our method on the GATE platform required additional gazetteers and tools, and therefore the installation of new plugins. For the first phase, lemmatization, we used a morphological analyzer (GATE Morphological Analyser), which gave us the lemma associated with each word of the Arabic text. These lemmas are then used in the next phase. The second phase, trigger identification, is performed with the Flexible Gazetteer resource, which allowed us to compare the tokens of the Arabic text with the lemmas in the two gazetteers we created (one for verbal lemmas and one for nominal lemmas).
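As an illustration of how the two trigger gazetteers might be prepared, the sketch below builds the text of a gazetteer list file (one entry per line) and the matching line of a GATE-style index file, in which each list is declared as file name, major type and minor type separated by colons. The file names, the type labels and the English placeholder lemmas are assumptions made for this example, not the resources used in our system.

```python
from typing import Iterable, Tuple


def gazetteer_files(name: str, major_type: str, minor_type: str,
                    lemmas: Iterable[str]) -> Tuple[str, str]:
    """Return the content of `<name>.lst` and its declaration line for the index file."""
    # Drop blank entries and duplicates, and sort for a stable file content.
    unique = sorted({lemma.strip() for lemma in lemmas if lemma.strip()})
    list_text = "\n".join(unique) + "\n"
    index_line = f"{name}.lst:{major_type}:{minor_type}"
    return list_text, index_line


# Hypothetical verbal and nominal trigger lists (placeholders for Arabic lemmas).
verbal_text, verbal_index = gazetteer_files(
    "event_verbs", "event", "verbal", ["attack", "kidnap", "attack", " "])
nominal_text, nominal_index = gazetteer_files(
    "event_nouns", "event", "nominal", ["war", "bombing"])

print(verbal_index)
print(nominal_index)
```

Declaring the verbal and nominal lists under different minor types lets later rules tell the two kinds of trigger apart from the lookup annotation alone.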
In the third phase, we classified the verbal and nominal lemmas into subclasses of the "Event" class (Attack, Military_Operation, Crash, Shooting, Damage, Bombing, Death, Kidnapping, War, Injure). In the fourth phase, to identify participants, sentences were segmented with the Noun Phrase Chunker and VP Chunker resources. For syntactic analysis, the Stanford Parser was used and configured to process Arabic. In the last phase, we developed our linguistic rules with the JAPE² transducer (Java Annotation Patterns Engine). The role of the transducer is to identify named entities (Person, Organization, Location and Date); in our case, JAPE executes the developed rules to extract the various arguments of the event. Writing these rules is followed by a test phase that aims to detect annotation errors and correct the rules accordingly. Examples of the predefined event classes annotated by our tool are Death, Attack, InjureEvent, Military_Status and War.
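The participant identification step can be pictured with the following Python sketch, which maps dependency relations attached to a trigger onto semantic roles. It is a simplified stand-in for the JAPE rules, not a transcription of them: the relation-to-role table, the preposition table and the English example are all assumptions chosen for illustration.

```python
from typing import Dict, List, NamedTuple, Optional


class Dependency(NamedTuple):
    relation: str                     # "subject", "object", "preposition" or "agent"
    head: str                         # governing word, e.g. the event trigger
    dependent: str                    # candidate participant
    preposition: Optional[str] = None


# Assumed mappings; a real rule set would be tuned on the annotated corpus.
ROLE_BY_RELATION = {"subject": "agent", "agent": "agent", "object": "object"}
ROLE_BY_PREPOSITION = {"from": "source", "to": "target"}


def assign_roles(trigger: str, dependencies: List[Dependency]) -> Dict[str, List[str]]:
    """Collect the participants of `trigger`, grouped by semantic role."""
    roles: Dict[str, List[str]] = {}
    for dep in dependencies:
        if dep.head != trigger:
            continue  # dependency belongs to another word of the sentence
        if dep.relation == "preposition":
            role = ROLE_BY_PREPOSITION.get(dep.preposition)
        else:
            role = ROLE_BY_RELATION.get(dep.relation)
        if role is not None:
            roles.setdefault(role, []).append(dep.dependent)
    return roles


parsed = [
    Dependency("subject", "attacked", "rebels"),
    Dependency("object", "attacked", "convoy"),
    Dependency("preposition", "attacked", "north", "from"),
    Dependency("subject", "said", "officials"),
]
print(assign_roles("attacked", parsed))
```

Dependencies whose relation or preposition is not covered by the tables are simply skipped, which favors precision over recall in this sketch.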
5
System Evaluation
We used MILTcorp to test our system. MILTcorp is annotated specifically for the event extraction task. This corpus belongs to our research domain, the military domain. Its articles were collected from news websites and news wires (aljaziraa, al_arabia, al_manar, France24 (Arabic)) and from electronic journals (al-Quds).
Table 2: data collection
Test corpus
Number of sentences         1650
Number of words             45350
Number of words/sentences   24

The evaluation of our system proceeds as follows:
- The texts of the test corpus are all manually annotated. We then annotated the same texts automatically with our system.
- We compared these two annotations with Corpus Quality Assurance, a GATE tool that compares two annotations on a corpus.
- The comparison yields an overall F-measure of 0.48, which is computed from recall and precision.
Precision is defined as the percentage of the entities found by the system that are correct, and recall as the ratio of the number of correct entities found to the number of entities extracted from the reference articles.

\[ \text{Precision} = \frac{\text{number of correct entities found}}{\text{number of entities found}} \qquad (1) \]

² http://gate.ac.uk/sale/tao/splitch8.html#chap:jape
The overall evaluation of the entities extracted by our system against the reference articles is based on the F-measure. This measure combines precision and recall and is defined as follows (we used $F_{\beta=1}$, so precision and recall are weighted equally):

\[ F\text{-measure} = \frac{(\beta + 1) \cdot \text{Precision} \cdot \text{Recall}}{\beta \cdot \text{Precision} + \text{Recall}} \]

Table 3 shows the results for the different entities annotated by GATE (using only the basic predefined gazetteers, without the enriched gazetteers).
Table 3: Gate baseline results
NE              Precision   Recall   F-measure
Date            51%         29%      36.97%
Person          33%         15%      20.62%
Organization    52%         33%      40.73%
Location        59%         44%      50.40%
Event           0%          0%       0%
Overall         39%         24.2%    29.67%
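The F-measure used in this evaluation can be computed with a few lines of Python. The sketch below follows the form of the formula given in this section with β = 1 and is checked against the Location row of the baseline table (precision 59%, recall 44%). The function names are illustrative. The averaging helper reflects an observation rather than a stated procedure: the Overall precision and recall of the baseline appear to equal the unweighted mean of the five per-class values.

```python
from typing import Sequence


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure in the form used in this section; beta = 1 weights P and R equally."""
    denominator = beta * precision + recall
    if denominator == 0:
        return 0.0  # no correct entity found, as for the Event row of the baseline
    return (beta + 1) * precision * recall / denominator


def macro_average(values: Sequence[float]) -> float:
    """Unweighted mean over entity classes."""
    return sum(values) / len(values)


# Location row of the baseline table.
location_f = f_measure(0.59, 0.44)
print(f"Location F-measure: {location_f:.2%}")

# Per-class baseline precision and recall (Date, Person, Organization, Location, Event).
baseline_precision = [51, 33, 52, 59, 0]
baseline_recall = [29, 15, 33, 44, 0]
print(macro_average(baseline_precision), macro_average(baseline_recall))
```

With β = 1 the expression reduces to the harmonic mean of precision and recall, so a class with high precision but low recall is penalized accordingly.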
Table 4 shows the results obtained by using the enriched gazetteers (person, organization, location, date) and our method of event extraction.
Table 4: Results with event, date, person, organization and location gazetteers
Named entities  Precision   Recall   F-measure
Date            84.03%      77.19%   80.64%
Person          83.04%      79.34%   81.14%
Organization    78.6%       62.11%   69.38%
Location        86.14%      77.42%   81.54%
Event           70%         35%      46.66%
Overall         80.36%      66.21%   71.87%
Comparing the two tables above shows the effect of the enriched gazetteers and of event annotation on the named entities extracted by GATE. Overall precision improves from 39% to 80.36%, and overall recall from 24.2% to 66.21%. These results are satisfactory given the limitations due to the complexity of Arabic sentences and the lack of adequate tools for processing Arabic. In the near future, we plan to complete the event gazetteers (verbal and nominal), to disambiguate polysemous triggers, and to increase the size of our corpus in order to improve the performance of the system.
6
Conclusion and Perspectives
In this paper we have presented an integrated system to detect events in Arabic texts. Event identification was performed in several stages: data collection, preprocessing, classification, and identification of participants and their semantic roles. Experiments were conducted on our Arabic corpus to evaluate the effectiveness of the proposed system. The system can be generalized to enrich decision making in many areas, such as information intelligence or crisis management. In future work we aim to improve the results and to compare them with other work on event detection. We also aim to annotate different types of relations between named entities in order to improve the performance of the proposed system.
References
Ahn D. (2006). The stages of event extraction. In ARTE '06: Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1-8, Morristown, NJ, USA, July 2006. Association for Computational Linguistics.
Atwell E., Al-Sulaiti L., Al Osaimi S., Abu Shawar B. (2004). "Un Examen d'Outils pour l'Analyse de Corpus Arabes". JEP TALN, Session on Arabic Language Processing, Fès, 2004.
Aone C., Ramos-Santacruz M. (2000). REES: a Large-Scale Relation and Event Extraction System. In Proceedings of the Sixth Conference on Applied Natural Language Processing, ANLC '00, pages 76-83, Stroudsburg, PA, USA. Association for Computational Linguistics.
Aymen Elkhlifi, Rim Faiz (2007). Machine Learning Approach for the Automatic Annotation of the Events. FLAIRS Conference 2007: 362-367.
Bethard S., Martin J. H. (2006). Identification of event mentions and their semantic class. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), p. 146-154, Sydney.
Borsje J., Hogenboom F., Frasincar F. (2010). Semi-Automatic Financial Events Discovery Based on Lexico-Semantic Patterns. International Journal of Web Engineering and Technology 6(2), 115-140.
Capet P., Delavallade T., Nakamura T., Sandor A., Tarsitano C., Voyatzi S. (2008). A Risk Assessment System with Automatic Extraction of Event Types. In Intelligent Information Processing IV, IFIP International Federation for Information Processing, vol. 288, pp. 220-229. Springer Boston.
Cimiano P., Staab S. (2004). Learning by Googling. SIGKDD Explorations Newsletter 6(2), 24-33.
Cohen K.B., Verspoor K., Johnson H.L., Roeder C., Ogren P.V., Baumgartner Jr. W.A., White E., Tipney H., Hunter L. (2009). High-Precision Biological Event Extraction with a Concept Recognizer. In Workshop on BioNLP: Shared Task collocated with the NAACL-HLT 2009 Meeting, pp. 50-58. Association for Computational Linguistics.
Debili F. (2001). Traitement automatique de l'arabe voyellé ou non. IRMC, Tunis, 2001.
Debili F., Achour H., Souici E. (2002). "La langue arabe et l'ordinateur : de l'étiquetage grammatical à la voyellation automatique". Correspondances de l'IRMC, N° 71, pp. 10-28, 2002.
Frasincar F., Borsje J., Levering L. (2009). A Semantic Web-Based Approach for Building Personalized News Services. International Journal of E-Business Research 5(3), 35-53.
Gabriel Parent, Michel Gagnon, Philippe Muller (2008). Annotation d'expressions temporelles et d'événements en français. TALN 2008, Avignon, 9-13 juin 2008.
George Doddington, Alexis Mitchell, Mark Przybocki (2004). The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation. Proceedings of LREC 2004, pp. 837-840.
Kamijo S., Matsushita Y., Ikeuchi K., Sakauchi M. (2000). Traffic monitoring and accident detection at intersections. IEEE Transactions on Intelligent Transportation Systems 1(2), 108-118.
Khoja S. (2001). "APT: Arabic part-of-speech tagger". Proceedings of the Student Workshop at the 2nd Meeting of the NAACL (NAACL'01), Carnegie Mellon University, Pennsylvania, 2001.
Larkey L. S., Ballesteros L., Connell M. (2002). "Improving Stemming for Arabic Information Retrieval: Light Stemming and Cooccurrence Analysis". In Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval (SIGIR 2002), Tampere, Finland, pp. 275-282, 2002.
Tuerlinckx L. (2004). "La lemmatisation de l'arabe non classique". JADT 2004: 7es Journées internationales d'Analyse statistique des Données Textuelles, 2004.
Ludovic Jean-Louis (2011). Approches supervisées et faiblement supervisées pour l'extraction d'événements et le peuplement de bases de connaissances. Thèse, Université Paris Sud - Paris XI, Décembre 2011.
Ouersighni R. (1998). "An approach for the conception of an arabic parser based on affix grammars over finite lattice". ICEMCO 98, Cambridge, 1998.
Ouersighni R. (2000). "A major offshoot of the DIINAR-MBC project: AraParse, a morphosyntactic analyzer for unvowelled Arabic texts". ENSSIB, 2000.
Ouersighni R. (2002). "Analyse syntaxique robuste de la langue arabe". Université Lyon 2, thèse sous la direction de M. Hassoun, 2002.
Parent Gabriel, Michel Gagnon, Philippe Muller (2008). Annotation d'expressions temporelles et d'événements en français. In Frédéric Béchet, editor, Traitement Automatique des Langues Naturelles (TALN), Avignon, 09/06/08-13/06/08, page (support électronique), http://www.atala.org/, 2008. Association pour le Traitement Automatique des Langues (ATALA).
Roche M., Heitz T., Matte-Tailliez O., Kodratoff Y. (2004). EXIT : Un système itératif pour l'extraction de la terminologie du domaine à partir de corpus spécialisés, 2004.
Roser Sauri, Jessica Littman, Bob Knippen, Robert Gaizauskas, Andrea Setzer, James Pustejovsky (2006). TimeML Annotation Guidelines Version 1.2. 2006.
Saurí R., Knippen R., Verhagen M., Pustejovsky J. (2005). Evita: A Robust Event Recognizer for QA Systems. In Proceedings of HLT/EMNLP 2005, p. 700-707.
Soraya Zaidi-Ayad (2013). Une plateforme pour la construction d'ontologie en arabe : Extraction des termes et des relations à partir de textes (Application sur le Saint Coran). Thèse, Université Badji Mokhtar, Annaba, 2013.
Verhagen M., Mani I., Saurí R., Knippen R., Littman J., Pustejovsky J. (2005). Automating Temporal Annotation with TARSQI. In Proceedings of the ACL 2005.
Wei C.P., Lee Y.H. (2004). Event detection from Online News Documents for Supporting Environmental Scanning. Decision Support Systems 36(4), 385-401.
Yakushiji A., Tateisi Y., Miyao Y. (2001). Event Extraction from Biomedical Papers using a Full Parser. In 6th Pacific Symposium on Biocomputing, pp. 408-419.