Using Conditional Random Fields to segment texts into events

Ludovic Jean-Louis, Romaric Besançon and Olivier Ferret
CEA, LIST, Vision and Content Engineering Laboratory, Fontenay-aux-Roses, F-92265, France
1 Introduction
Information extraction aims at automatically extracting structured information from texts. The work we present here focuses more specifically on the extraction of events from news articles. The content of this kind of text is organized to provide answers to factual questions (When? Where? How much? Who?) about the events it refers to [3]. Our goal is to gather the pieces of information, usually identified as named entities, related to the main event discussed in a news article. A frequent characteristic of news articles is that they present the features of their main event by comparing them to the features of similar events. In this context, an important issue for information extraction systems is to be able to associate a piece of information with the right event. We propose a two-step approach to tackle this problem: 1) segmenting texts into parts related to different events; 2) within each segment, associating the named entities that express its features with the event of the segment. Our approach is illustrated on a French news article in Figure 1. This article mainly focuses on the segmentation of texts into events.

[Figure 1 shows an example French news article reporting a magnitude 5.5 earthquake recorded near Oran. Sentences describing the Main Event, a Sub Event (a magnitude 5.3 earthquake that shook Oran in January) and Background information (Algeria's seismicity and the 2003 Algiers earthquake) are distinguished, and the earthquake, magnitude, damages and location entities are highlighted.]
Figure 1: Segmentation of a text into events

Following this perspective, we consider a text as a sequence of sentences, each of which is associated with an event we want to determine. This problem can also be considered as a sequence classification task, for which Conditional Random Fields (CRF) [1] often provide good results. In the next two sections, we report an application of CRFs to the segmentation of texts into events, with an evaluation in the field of earthquake events.
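As a simple illustration of this framing (not part of the original paper), the Python sketch below shows the target representation: a document is a sequence of sentences and the model predicts one event label per sentence. The sentences are an invented, translated paraphrase of the article of Figure 1.

```python
# Minimal sketch (not the authors' code): a document is represented as a
# sequence of sentences and segmentation assigns one event label per sentence.
LABELS = ("MAIN_EVENT", "SUB_EVENT", "BACKGROUND")

# Hypothetical example loosely mirroring the article of Figure 1 (translated).
sentences = [
    "A 5.5-magnitude earthquake was recorded Friday evening near Oran.",
    "Oran had already been shaken in January by a 5.3-magnitude quake.",
    "Algeria, whose north lies in a seismic zone, is regularly hit by earthquakes.",
]
labels = ["MAIN_EVENT", "SUB_EVENT", "BACKGROUND"]

# Each (sentence, label) pair is one step of the sequence to be predicted.
for sentence, label in zip(sentences, labels):
    print(f"{label:12s} {sentence}")
```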
2 Segmentation into events
In accordance with the context of our task, we adopted an event-oriented representation of texts: a text is viewed as a sequence of sentences in which each sentence is characterized by the presence or absence of an event (the hypothesis "one sentence = one event" is a simplification, but it is globally not too simplistic in our application domains). In addition, we focus in this work only on the identification of the named entities associated with the main event of a news article. We have therefore decided not to differentiate one sub event from another. Thus, we propose to classify sentences according to the following three categories:

• Main Event: all sentences referring to the main event of the text;
• Sub Event: all sentences containing data related to an event different from the Main Event;
• Background: all sentences that belong neither to the Main Event nor to a Sub Event.

To perform this classification, we assume that the most relevant criteria rely not only on the nature of the sentences but also on how they are linked at the discursive level. A graphical model for sequence annotation (HMM, CRF) is particularly suitable for capturing such links. Compared to Hidden Markov Models (HMM), CRF models have the advantage of modeling arbitrary knowledge through features and are consequently well suited for integrating several classification criteria. In order to test our hypothesis on the importance of the discursive level, we compare the results of these models with a Maximum Entropy approach using the same features.
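As an illustration of this comparison (not part of the original experiments), the sketch below feeds the same per-sentence features to a sequence model and to an independent per-sentence classifier. Here sklearn-crfsuite and scikit-learn's logistic regression are used as hypothetical stand-ins for the CRF++ and MaxEnt toolkits actually used in the paper, and the toy features and labels are invented.

```python
# Minimal sketch (not the authors' implementation): the same per-sentence
# features are given to a sequence model (CRF), which also models label
# transitions, and to a per-sentence classifier (logistic regression, i.e.
# a MaxEnt model), which labels each sentence independently.
import sklearn_crfsuite
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one document = one sequence of feature dicts.
X_docs = [
    [{"tense=present": 1, "has_date": 1},
     {"tense=pluperfect": 1, "has_temporal_mwe": 1},
     {"tense=present": 1}],
]
y_docs = [["MAIN_EVENT", "SUB_EVENT", "BACKGROUND"]]

# CRF: trained on whole sequences, so successive labels constrain each other.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_docs, y_docs)

# MaxEnt: the sequences are flattened and each sentence is classified alone.
vec = DictVectorizer()
X_flat = vec.fit_transform([feats for doc in X_docs for feats in doc])
y_flat = [label for doc in y_docs for label in doc]
maxent = LogisticRegression(max_iter=1000).fit(X_flat, y_flat)
```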
Concerning the segmentation of texts into events, we propose to rely mainly on temporal information: sub events are generally past events whose mention requires an explicit temporal reference marked by linguistic cues. More specifically, we included the following features in our model (a simplified sketch of their extraction is given after the list):

• verb tense: for each sentence, a binary feature is associated with each possible grammatical tense. This feature is set to 1 when the sentence contains at least one verb in the corresponding tense, and 0 otherwise;
• date entities: this feature indicates whether or not the sentence contains a named entity of type DATE (in the current model, the value of the date is not used);
• temporal multiword expressions: this feature accounts for the presence of a temporal multiword expression in the sentence. We manually built a dictionary of such expressions from the corpus presented in [2]. This dictionary contains expressions such as au début de l'année or ces dernières années (the beginning of the year, in recent years).
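The sketch below illustrates how these three feature types could be computed for one sentence. It is a simplified reconstruction, not the original code: the verb tenses and named entity types are assumed to be provided by an upstream linguistic analysis, and the tense inventory and dictionary entries shown are only examples.

```python
# Simplified sketch of the per-sentence features (not the original code).
TENSES = ("present", "imperfect", "perfect", "pluperfect", "future")

# Hypothetical excerpt of the hand-built dictionary of temporal multiword
# expressions collected from the corpus of [2].
TEMPORAL_MWES = {"au début de l'année", "ces dernières années"}

def sentence_features(text, verb_tenses, entity_types):
    """Build the binary feature dict for one sentence.

    verb_tenses: grammatical tenses of the verbs found in the sentence.
    entity_types: types of the named entities found in the sentence.
    """
    feats = {}
    # One binary feature per grammatical tense.
    for tense in TENSES:
        feats[f"tense={tense}"] = int(tense in verb_tenses)
    # Presence of a DATE named entity (its value is not used).
    feats["has_date"] = int("DATE" in entity_types)
    # Presence of a temporal multiword expression from the dictionary.
    lowered = text.lower()
    feats["has_temporal_mwe"] = int(any(mwe in lowered for mwe in TEMPORAL_MWES))
    return feats
```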
3 Evaluation
We evaluated our model for the segmentation of texts into events on a corpus of French news articles about earthquake events. The articles were collected between the end of February 2008 and early September 2008 and manually annotated by domain analysts (the annotation only concerned entities related to the main event). For the implementation of our models, we used a set of Python scripts together with several reference toolkits: CRF++ (http://crfpp.sourceforge.net/), NLTK (http://www.nltk.org/) and MaxEnt (http://webdocs.cs.ualberta.ca/~lindek/downloads.htm), respectively for the CRF, HMM and Maximum Entropy (MaxEnt) models. A subset of 140 news articles was tagged with the types of events defined in Section 2: Main Event (70%), Sub Event (17%), Background (13%). In Tables 1 and 2, we report the results of segmentation into events in terms of recall and precision, obtained with a 5-fold cross-validation process. More precisely, Table 1 gives our baseline results, obtained with an HMM trained using only verb tenses, while Table 2 shows our results for both the Maximum Entropy and the CRF models, trained using all the features described in the previous section. Although the HMM model and the two others cannot be compared directly, since the former relies on a single observation while the latter use several features, we note that grammatical tense is a strong enough criterion for identifying sentences that belong to the Main Event but is not sufficient for the other event types. Moreover, taking successive sentence labels into account, as done by the CRF model, improves results over the Maximum Entropy model.
Event type     Recall (%)   Precision (%)
Main Event     82.95        93.56
Sub Event      37.84         9.63
Background     49.15        39.97
Table 1: Results of segmentation into events using HMM
               MaxEnt segmentation             CRF segmentation
Event type     Recall (%)   Precision (%)      Recall (%)   Precision (%)
Main Event     94.82        78.69              98.69        87.39
Sub Event      33.58        54.74              52.65        95.76
Background     22.02        84.17              69.31        92.96
Table 2: Results of segmentation into events using MaxEnt and CRF
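As background on the experimental setup, CRF++ reads its training data in a simple column format: one item (here, one sentence) per line, whitespace-separated feature columns with the label in the last column, and a blank line between sequences. The sketch below, with a hypothetical feature layout, shows how the sentence-level features could be serialized in that format; it is not the actual experiment script.

```python
# Illustrative sketch (not the experiment scripts): writing sentence-level
# features in the column format expected by CRF++.
def write_crfpp_file(path, documents):
    """documents: list of documents, each a list of (feature_dict, label)."""
    # Hypothetical fixed column order for the feature values.
    keys = ["tense=present", "tense=pluperfect", "has_date", "has_temporal_mwe"]
    with open(path, "w", encoding="utf-8") as out:
        for doc in documents:
            for feats, label in doc:
                cols = [str(feats.get(k, 0)) for k in keys]
                out.write("\t".join(cols + [label]) + "\n")  # label in last column
            out.write("\n")  # blank line separates sequences (documents)
```

A companion template file then tells CRF++ which columns, possibly taken from neighbouring sentences, to use as features.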
References

[1] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[2] Eric Laporte, Takuya Nakamura, and Stavroula Voyatzi. A French Corpus Annotated for Multiword Expressions with Adverbial Function. In Proceedings of the Language Resources and Evaluation Conference (LREC), pages 48–51, Marrakech, Morocco, 2008.

[3] Nadine Lucas. La rhétorique des dépêches de presse à travers les marques énonciatives du temps, du lieu et de la personne. In Actes de la Semaine du Document Numérique (SDN 2004), Journée ATALA, June 2004.