Extracting temporal references to assign document event-time periods*

D. Llidó (1), R. Berlanga (1) and M. J. Aramburu (2)

(1) Department of Languages and Computer Systems
(2) Department of Engineering and Science of Computers
Universitat Jaume I, E-12071, Castellón (Spain)
{dllido, berlanga, aramburu}@uji.es

Abstract. This paper presents a new approach for the automatic assignment of document event-time periods. This approach consists of extracting temporal information from document texts, and translating it into temporal expressions of a formal time model. From these expressions, we are able to approximately calculate the event-time periods of documents. The obtained event-time periods can be useful for both retrieving documents and finding relationships between them, and their inclusion in Information Retrieval Systems can produce significant improvements in their retrieval effectiveness.

1. Introduction

Many documents tell us about events and topics that are associated with well-known time periods. For example, newspaper articles, medical reports and legal texts contain many temporal references, both for placing occurrences in time and for relating them to other events. Clearly, using this temporal information can be helpful in retrieving documents as well as in discovering new relationships between document contents (e.g. [1], [2] and [3]).

Current Information Retrieval Systems can only deal with the publication date of documents, which can be used in queries as a further search field. As an alternative approach, a new object-oriented document model, named TOODOR, is presented in [4]. In this model two time dimensions are considered: the publication date, and the event-time period of documents. Furthermore, by means of its query language, called TDRL [5], it is possible to retrieve documents by specifying conditions on their contents, structure and time attributes.

However, TOODOR assumes that the event-time period of a document is manually assigned by specialists, which is an important limitation. On the one hand, this task is subjective, as it depends on the reader's particular interpretation of the document texts. On the other hand, in applications where the flow of documents is too high, the manual assignment of event-time periods is impracticable. Consequently, it is necessary to define an automatic method for extracting event-time periods from document contents.

In this paper we present an approach to extracting temporal information from document contents, and its application to automatically assigning event-time periods to documents. Moreover, with this work we show the importance of these attributes in the retrieval of documents.

The paper is organized as follows. Section 2 describes the semantic models on which the extraction system relies. Section 3 presents our approach to extracting temporal references from texts. Section 4 describes how event-time periods can be calculated with the extracted dates. Section 5 reviews related work. Finally, Section 6 presents some conclusions.

* This work has been funded by the Bancaixa project with contract number PI.1B2000-14 and the CICYT project with contract number TIC2000-1568-C03-02.

2. Semantic Models

This section describes the semantic models on which the proposed information extraction method relies: a representation model for documents, and a time model for representing the temporal information extracted from texts.

2.1. Documents and their time dimensions

This work adopts the document model of TOODOR [4]. Under this model, complex documents are represented by means of object aggregation hierarchies. The main novelty of this model is that document objects have two associated time attributes: the publication time and the event time. The former indicates when the document was published, whereas the latter expresses the temporal coverage of the topics of the document. The publication time plays an important role in the extraction of temporal expressions, because several of them, such as "today" and "tomorrow", take it as their point of reference.

The event-time period of a document must express the temporal coverage of the relevant events and topics reported by its contents. Since the relevance of a topic depends on the interpretation of the document contents, event-time periods are inherently indeterminate. As a general rule, we assume that the location of these periods will coincide approximately with the temporal references appearing in the document, where a temporal reference is either a date or a period mentioned in the document texts. In this way, event-time periods can be either extracted automatically from the texts, or manually assigned by users.

2.2. Time Model

Temporal sentences in natural language usually involve the use of calendar granularities. Specifically, we can express time instants, intervals, and spans at several granularity levels. In this section we provide a time model that takes into consideration the time entities appearing in temporal sentences.

2.2.1. Granularities

The proposed time model relies on the granularity system of Figure 1. From now on, we will denote each granularity of this system by a letter: day (d), week (w), month (m), quarter (q), semester (s), year (y), decade (x) and century (c). As shown in Figure 1, these granularities can be arranged according to the finer-than relationship, which is denoted with ≺ [6]. Note that, unlike in other time models of the literature, in written text it is usual to relate granularities that do not satisfy this relationship (e.g. "the first week of the year"). In Figure 1 these relations are represented with dashed lines.

[Figure 1: diagram of the granularities century, decade, year, semester, quarter, month, week and day, linked by the finer-than relationship ≺.]

Fig. 1. Granularity System

In our model, two types of granularity domains are distinguished: relative and absolute domains. A relative domain for a granularity g is defined in terms of another coarser granularity g' (g is finer than g', written g ≺ g'), and is denoted with dom(g, g'). For instance, the domain of days relative to weeks is defined as dom(d, w) = {1,…,7}. Relative domains are always represented as finite subsets of the natural numbers. Thus, we will denote with first(g, g') and last(g, g') the first and last elements of the domain dom(g, g') respectively. An absolute domain for a granularity g, denoted with dom(g), is always mapped onto integer numbers (e.g. centuries and years). Time models from the literature associate absolute domains to granularities (called ticks), defining over them the necessary mapping functions to express the finer-than relationship [6].

2.2.2. Time Entities

In this section, we define the time entities of our time model in terms of the granularity system described above. A time point is expressed as the following alternating sequence of granularities and natural numbers:

T = g1 n1 g2 n2 ... gk nk

In this expression, if gi is a relative granularity then ni must belong to the domain dom(gi, gi−1) with 1 < i ≤ k; otherwise ni ∈ dom(gi). Consequently, the sequence of granularities must be ordered by the finer-than relationship, i.e. gi+1 ≺ gi with 1 ≤ i < k. From now on, the finest granularity of a time point T is denoted with gran(T).
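The time-point notation can be made concrete with a small parser. The following is a minimal sketch, not the authors' implementation: the function names and the single linear ordering of the granularity letters (in particular placing quarter after semester and week after month) are assumptions made only for illustration.

```python
import re

# Granularity letters from Section 2.2.1, coarsest first.
# ASSUMPTION: one linear order is used here; the paper's Figure 1 is a
# hierarchy in which some pairs (e.g. week and month) are only loosely related.
ORDER = "cxysqmwd"

POINT = re.compile(r"(?:[cxysqmwd]\d+)+")
PAIR = re.compile(r"([cxysqmwd])(\d+)")


def parse_time_point(text):
    """Parse e.g. 'y2000m3d1' into [('y', 2000), ('m', 3), ('d', 1)]."""
    if not POINT.fullmatch(text):
        raise ValueError(f"not a time point: {text!r}")
    pairs = [(g, int(n)) for g, n in PAIR.findall(text)]
    ranks = [ORDER.index(g) for g, _ in pairs]
    # Each granularity must be strictly finer than the one before it.
    if any(a >= b for a, b in zip(ranks, ranks[1:])):
        raise ValueError(f"granularities must go from coarse to fine: {text!r}")
    return pairs


def gran(point):
    """Finest granularity of a parsed time point, i.e. gran(T)."""
    return point[-1][0]
```

For example, `parse_time_point("y2000m3d1")` yields three (granularity, number) pairs and `gran` of that result is `"d"`, while a string such as `"m3y2000"` is rejected because its granularities are not ordered from coarse to fine.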

A time interval is an anchored span of time that can be expressed with two time points having the same sequence of granularities:

I = [T1, T2], where T1 = g1 n1 ... gk nk, T2 = g1 n'1 ... gk n'k, and ni ≤ n'i for all 1 ≤ i ≤ k

We will use the functions start(I) and end(I) to denote the starting and end points of the interval I respectively. Besides, the finest granularity of I, denoted with gran(I), is defined as the finest granularity of its time points.

Finally, a span of time is defined as an unanchored and directed interval of time. This is expressed as S = ± n1 g1 ... nk gk, where the sign (±) indicates the direction of the span (+ towards the future, − towards the past), ni (1 ≤ i ≤ k) are natural numbers, and the granularities gi are ordered by the finer-than relationship (i.e. gi+1 ≺ gi with 1 ≤ i < k).

2.2.3. Operators

This section describes the main operators that are used during the resolution of temporal sentences from the text. Firstly, we define the refinement of a time point T = g1 n1 ... gk nk to a finer granularity g as follows:

refine(T, g) = [T1, T2], where T1 = g1 n1 ... gk nk g first(g, gk), and T2 = g1 n1 ... gk nk g last(g, gk)

Note that this operation can only be applied to granularities with relative domains. Similarly, we define the refinement of a time interval I to a finer granularity g (g finer than gran(I)) as follows:

refine(I, g) = [start(refine(start(I), g)), end(refine(end(I), g))]

Abstraction is the inverse operation of refinement. Applying it, any time entity can be abstracted to a coarser granularity. We will denote this operation with the function abstract(T, g), where g is a granularity that must be contained in T. This operation is performed by truncating the sequence of granularities up to the granularity g. For example, abstract(y2000m3d1, y) = y2000.

Finally, the shift of a time point T = g1 n1 ... gk nk by a time span S = ± n g is defined as follows:

shift(T, S) = g1 n'1 ... gi n'i gi+1 ni+1 ... gk nk

where gi = g, and n'1 ... n'i are the new quantities associated with the granularities, obtained by adding the signed quantity n to ni and propagating the overflow to the coarser granularities. These are some examples:

shift(y1999m3, +10m) = y2000m1
shift(y2001, -2y) = y1999
shift(y1998m2w2, -3w) = y1998m1w4
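The three operators can be sketched in a few lines for the year/month/day chain. This is only an illustration under stated assumptions: time points are lists of (granularity, number) pairs that start with a year, refinement walks through the intermediate granularities one step at a time, and shifts are supported for year and month spans only. All function names are hypothetical.

```python
import calendar

CHAIN = "ymd"  # ASSUMPTION: only year -> month -> day is handled in this sketch


def fmt(point):
    """Render [('y', 1999), ('m', 3)] as 'y1999m3'."""
    return "".join(f"{g}{n}" for g, n in point)


def last(g, point):
    """last(g, g') for the two relative domains supported here."""
    values = dict(point)
    if g == "m":
        return 12
    return calendar.monthrange(values["y"], values["m"])[1]  # days in that month


def refine(point, g):
    """refine(T, g): interval [T1, T2] covering T at the finer granularity g."""
    finest = point[-1][0]
    if point[0][0] != "y" or finest not in CHAIN or g not in CHAIN:
        raise ValueError("unsupported time point or granularity")
    if CHAIN.index(g) < CHAIN.index(finest):
        raise ValueError("g must not be coarser than gran(T)")
    start, end = list(point), list(point)
    while start[-1][0] != g:
        nxt = CHAIN[CHAIN.index(start[-1][0]) + 1]
        start.append((nxt, 1))              # first(g, g') is 1 here
        end.append((nxt, last(nxt, end)))
    return start, end


def abstract(point, g):
    """abstract(T, g): truncate T after the granularity g."""
    grans = [gr for gr, _ in point]
    if g not in grans:
        raise ValueError("g must be contained in T")
    return point[: grans.index(g) + 1]


def shift(point, sign, n, g):
    """shift(T, S) for S = sign * n g, with g in {'y', 'm'}."""
    values = dict(point)
    if g not in values or "y" not in values or g not in "ym":
        raise ValueError("unsupported shift")
    if g == "y":
        values["y"] += sign * n
    else:
        # Count months from year 0 so the overflow carries into the year.
        months = values["y"] * 12 + (values["m"] - 1) + sign * n
        values["y"], values["m"] = divmod(months, 12)
        values["m"] += 1
    # Finer granularities are left unchanged, as in the definition.
    return [(gr, values[gr]) for gr, _ in point]
```

With these definitions, `fmt(shift([("y", 1999), ("m", 3)], +1, 10, "m"))` gives `"y2000m1"` and `fmt(abstract([("y", 2000), ("m", 3), ("d", 1)], "y"))` gives `"y2000"`, in line with the examples in the text. The week example is not covered, since it depends on how dom(w, m) is defined.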

3. Temporal Information Extraction

To calculate the event-time period of a document we apply a sequence of two modules. The first module, named date extraction module, first searches for temporal expressions in the document text, then extracts dates, and finally inserts XML tags with the extracted dates. Figure 2 shows an example of a tagged document. In our approach we use the tag TIMEX defined in [7], to which we have added the attribute VALUE to store the extracted dates. Figure 3 presents the different stages of the date extraction module.

19990625 El Gobierno británico está decidido a impedir que la marcha de los unionistas de la Orden de Orange prevista para el próximo domingo .
[English gloss: The British Government is determined to prevent the march of the Orange Order unionists planned for next Sunday.]

Fig. 2. Example of XML tagged document.

The second module, named event-time extraction module, processes all the TIMEX tags of the document to obtain an approximation of its event-time period. This section focuses on describing how the first module works, whereas Section 4 describes the second module.

[Figure 3: pipeline diagram with the stages Document Segmentation, Identifying Simple Time Entities, Grouping Time Entities and Temporal Reference Resolution. Labels on the diagram: patterns for extracting dates; documents; sentences; dates; granularities; points, intervals and lists; tagged documents.]

Fig. 3. Stages of the Date Extraction Process

In the date extraction module, the main problem to solve is the same as in any natural language processing system: ambiguity. It appears in several contexts:

• Syntactic ambiguity. We need to know which words belong to the same temporal expression. After testing several syntactic analysers, we have concluded that they are not able to identify whole phrases such as "In May of this year".
• Word sense disambiguation. We need to fix indefinite phrases like "in the last years", vague adverbial words like "now" and "recently", and references to events like "since the beginning of these negotiations".
• Semantic ambiguity. We need to distinguish between temporal expressions that identify spans, intervals or dates.

The approach we propose in this work consists of applying a shallow semantic-syntactic parser to extract temporal information. As in Information Extraction systems [8], we begin with a lexical analysis that looks up words related to temporal expressions in a dictionary (time granularities, days of the week, months, holidays, etc.) and recognises standard date expressions. This is followed by a partial syntactic analysis of the sentences that contain these words, in order to search for more words that probably belong to the same temporal expression. Afterwards, the selected words are coded with their semantic meaning in terms of the formal time model. Finally, these codes are combined to obtain dates, intervals and time spans. The next section presents the grammatical elements needed for this process, and the stages are described afterwards.

3.1. Grammatical Elements

By analysing the range of temporal expressions in natural language, we have classified the words belonging to these expressions into four categories that give us the semantic information necessary to assign the corresponding date. These are:

• Granularities, which are words that identify calendar granularities (e.g. "day", "month", "years", "semester", etc.).
• Time head nouns, which are words closely related to the calendar granularities. Specifically, these words represent the granularities themselves and their synonyms (e.g. "journey"), the granularity values (e.g. "July", "Monday"), as well as relevant dates and periods like "Halloween night", "Christmas", "autumn", etc.
• Quantifiers, which are the cardinal, ordinal and indefinite adjectives, as well as the Roman numerals (e.g. "first", "second", "two", etc.).
• Modifiers, which are words that grammatically can take part in a temporal expression. In this group we can find words for expressing intervals or periods like "during" and "between", words for indicating the temporal direction of spans like "past" and "next", and words for specifying a position within a time interval like "beginning" and "end".

All these elements are always translated into codes representing their temporal meaning in the formal time model of Section 2.2.
We use the notation e ⇒ c to denote the translation of a temporal expression e into its corresponding representation c in the formal model. This translation is performed as follows:

• Time head nouns are always encoded as time entities. For example, since Monday is the first day of the week, we encode it as "Monday" ⇒ "wd1". Other head nouns can be encoded as time intervals, for instance "autumn" ⇒ "[m9d21, m12d26]".
• Quantifiers are all encoded as natural numbers. Additionally, ordinal and cardinal numbers must be distinguished in order to identify the time entity they are referring to. For instance, "first day" is encoded as "d1" (time point), whereas "two days" is encoded as "2d" (span). The order of a quantifier with respect to a granularity changes its meaning. For instance, we must distinguish between "day two" ⇒ "d2" (time point) and "two days" ⇒ "2d" (span).
• Modifiers are used to express the direction of time spans, namely: towards the past '−', towards the future '+', and at present time '0'. For instance, "last Monday" is encoded as "−wd1", and "next three days" as "+3d". Besides, modifiers can also refer both to other time entities, denoted with the prefix r, and to events, denoted with the prefix R. For example, consider the translations "that day" ⇒ "rd" and "two days before the agreement" ⇒ "R−2d the agreement".

3.2. Date Extraction Module

The basic structural unit in our document model is the paragraph. However, in the date extraction module, as in most Information Extraction systems, it is necessary to split paragraphs into smaller units in order to extract complex temporal expressions. For this purpose, we make use of the usual sentence separators (e.g. '!', '¡', '.', '?', '-', ':', etc.). Since some of these symbols are also used for other purposes, such as numeric expressions, we need to define and apply a set of patterns to split sentences correctly.

3.2.1. Extraction of dates

During this stage, regular expressions are applied in order to extract basic temporal expressions for dates. These are common date formats (e.g. \d{2,4}\/\d{1,2}\/\d{1,2}) and relative temporal expressions that refer to the publication date (e.g. "today", "this morning", "weekend", etc.). These regular expressions have been obtained by analysing the most frequent temporal sentences.

3.2.2. Identifying simple temporal expressions

In this stage all the sentences having temporal head nouns are analysed to extract simple time entities. Sometimes these head nouns appear in usual temporal expressions like "every Monday", "each weekend" or "each morning", which do not denote any time entity of our model. To avoid misinterpreting such expressions and to improve the efficiency of the extraction process, a list of patterns for rejecting them has been defined. Once it has been checked that a temporal expression does not match any of these patterns, the algorithm proceeds to search for modifiers and quantifiers in the words adjacent to the head. As a result, the identified head and its modifiers/quantifiers are translated into a single time entity.

3.2.3. Grouping simple time expressions

Once the simple time entities of a sentence are extracted, we have to analyse them in order to detect whether they are the components of a more complex time entity. Thus, in this phase we must determine whether they constitute a single date (e.g. "May last year" ⇒ "y0m5−1y"), a time interval (e.g. "from May to July" ⇒ "from m5 to m7" ⇒ "[m5, m7]"), a list of dates (e.g. "On Wednesday and Friday" ⇒ "on wd3 and wd5" ⇒ "{wd3, wd5}"), or two different expressions (e.g. "I won yesterday and you today"). Starting from a set of temporal expressions, we have defined a list of regular expressions for grouping simple time entities. For instance, the pattern 'from \entity to \entity' is used to identify a time interval. In this way, when a sentence contains several encoded time entities, the algorithm tries to apply these patterns to identify complex time entities.
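The grouping step can be illustrated with two regular expressions applied to a sentence whose simple time entities are already encoded. This is a minimal sketch: only the 'from \entity to \entity' pattern is named in the text, so the list pattern, the `ENTITY` sub-expression and the function name are assumptions introduced for illustration.

```python
import re

# ASSUMPTION: an encoded time point is one or more groups of granularity
# letters followed by digits, e.g. 'm5', 'wd1' or 'y1999m3'.
ENTITY = r"(?:[cxysqmwd]+\d+)+"

GROUPING_PATTERNS = [
    # 'from m5 to m7' -> '[m5, m7]' (time interval)
    (re.compile(rf"\bfrom ({ENTITY}) to ({ENTITY})\b", re.IGNORECASE),
     lambda m: f"[{m.group(1)}, {m.group(2)}]"),
    # 'on wd1 and wd5' -> '{wd1, wd5}' (list of dates); hypothetical pattern
    (re.compile(rf"\bon ({ENTITY}) and ({ENTITY})\b", re.IGNORECASE),
     lambda m: "{" + f"{m.group(1)}, {m.group(2)}" + "}"),
]


def group_entities(sentence):
    """Rewrite matching groups of encoded entities as intervals or lists."""
    for pattern, build in GROUPING_PATTERNS:
        sentence = pattern.sub(build, sentence)
    return sentence
```

A sentence without encoded entities, such as "I won yesterday and you today", passes through unchanged, which corresponds to the case of two independent expressions.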

3.2.4. Resolution of temporal references

Most of the identified time entities can finally be translated into concrete dates, which will be used by the event-time generator. More specifically, only those time entities that contain the granularities of year or century are translated into dates. In this process we take into account the relationships and operations specified between time entities, as well as the time references of the document. To perform these tasks, the system makes use of regular expressions as follows:

• If the sentence matches the pattern '\granularity[0-9]+', the date (or date interval) is extracted by applying the refine operation to the temporal expression. Example: "y1999" ⇒ refine(y1999, d) = [y1999m1d1, y1999m12d31]
• If the sentence matches the pattern '(+|-)?\granularity0\D ', the date is extracted by applying the denoted shift operation to the publication date. If the shift sign is omitted, the system tries to determine it by using the tense of the verb within the same sentence. Example: "The meeting will be on Monday" ⇒ "The meeting will be on +w0d1"
• If the sentence matches the pattern 'r(+|-)\granularity\d+', the date is extracted by applying the denoted shift operation to the most recently cited date.
• If the sentence matches the pattern 'r(+|-)\d+\granularity', we proceed as before.

The remaining cases are not currently analysed to extract concrete dates. However, their study can be of interest in order to extract further knowledge about events and their relationships. For instance, temporal expressions containing references to events, for example "R+2d the agreement", can be very useful for identifying named events and their occurrences. This analysis will be carried out in future work.
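Two of the resolution rules above can be sketched with the standard `datetime` module. The sketch rests on assumptions that the text does not confirm: a code such as "+w0d1" is read as the first Monday strictly after the publication date (and "-w0d1" as the last Monday strictly before it), weekdays are numbered from Monday = 1, and the function name `resolve` is hypothetical.

```python
import re
from datetime import date, timedelta

YEAR = re.compile(r"y(\d{4})")
WEEKDAY = re.compile(r"([+-])w0d([1-7])")


def resolve(code, publication):
    """Return (first_day, last_day) for a supported code, else None."""
    m = YEAR.fullmatch(code)
    if m:
        # refine(yNNNN, d): the whole year as a day-level interval.
        year = int(m.group(1))
        return date(year, 1, 1), date(year, 12, 31)
    m = WEEKDAY.fullmatch(code)
    if m:
        # ASSUMED semantics: nearest matching weekday in the given direction,
        # excluding the publication date itself.
        step = 1 if m.group(1) == "+" else -1
        target = int(m.group(2))  # isoweekday(): Monday is 1, Sunday is 7
        day = publication + timedelta(days=step)
        while day.isoweekday() != target:
            day += timedelta(days=step)
        return day, day
    return None  # references to other entities or events are not handled
```

Under these assumptions, a document published on 25 June 1999 (a Friday) that mentions "next Sunday", encoded as "+w0d7", resolves to 27 June 1999.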

4. Generating event-time periods

In this section we describe the module in charge of analysing the extracted dates of each document, and of constructing the event-time period that covers its relevant topics. As in Information Retrieval models, we assume that the relevance of each extracted date is given by its frequency of appearance in the document (i.e. the TF factor). Thus, the most relevant date is considered as the reference time point of the whole document. If all dates have a similar relevance, the publication date is taken as the reference point. This approach differs from others in the literature, where the publication date is always taken as the reference time point.

The algorithm for constructing the event-time period of a document groups all the consecutive dates that are located around the reference time point and whose relevance is greater than a given threshold.

Currently, both the date extraction module and the event-time generator have been implemented in Python. To perform the dictionary look-ups when solving temporal references, the date extraction module uses the TACAT system [9], which is implemented in Perl.
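The construction of the event-time period can be sketched as follows. The text leaves several details open, so this block fixes them by assumption: "similar relevance" is read as all distinct dates having the same frequency, "consecutive" as dates at most `max_gap` days apart, and the period is grown outwards from the reference point. The function name and both parameters are hypothetical.

```python
from collections import Counter
from datetime import date


def event_time_period(dates, publication, threshold=0, max_gap=1):
    """Return (start, end) of the event-time period, or None without dates."""
    counts = Counter(dates)  # frequency of each date: the TF factor
    if not counts:
        return None
    top = max(counts.values())
    if len(counts) > 1 and top == min(counts.values()):
        reference = publication  # ASSUMPTION: equal counts = similar relevance
    else:
        reference = max(counts, key=counts.get)

    relevant = sorted(d for d, c in counts.items() if c > threshold)
    start = end = reference
    # Extend towards the past while the dates stay consecutive.
    for d in reversed([d for d in relevant if d < reference]):
        if (start - d).days > max_gap:
            break
        start = d
    # Extend towards the future in the same way.
    for d in [d for d in relevant if d > reference]:
        if (d - end).days > max_gap:
            break
        end = d
    return start, end
```

For instance, if 27 June 1999 is cited three times and 26 June, 28 June and 1 January once each, the reference point is 27 June and the resulting period runs from 26 to 28 June; the isolated January date is left out because it is not consecutive with the others.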

4.1. Preliminary results

To evaluate the performance of the date extraction module we have analysed four newspapers containing 1,634 time expressions. The overall precision (valid extracted dates / total extracted dates) of the evaluated set was 96.2 percent, while the overall recall (valid extracted dates / valid dates in the set) was 95.2 percent. Regarding the execution times, each news item is tagged in 0.1 seconds. These results, obtained on a dual Pentium III-600 MHz, are very satisfactory for our applications.

To study the properties of the generated event-time periods, we have applied the extraction modules to 4,274 news items. Then we have classified them into the following four classes:

1. Class A: news whose event-time periods contain the publication date and are smaller than three days.
2. Class B: news whose event-time periods do not contain the publication date and are smaller than three days.
3. Class C: news whose event-time periods are between four and fourteen days.
4. Class D: news whose event-time periods are greater than fourteen days.

Table 1. Classification of documents according to their event-time period.

Class A    Class B    Class C    Class D
21%        53%        9%         11%

The obtained results are given in Table 1. It is worth pointing out that nearly 6% of the articles have no event-time period assigned. These cases are due to the lack of dates in the document contents. Moreover, around 42% of the articles contain dates located at least 14 days before or after the publication date. These dates are references to other past or future events, probably described in other newspaper articles. The extraction of these dates can be very useful for automatically linking documents through their time references.

5. Related Work

The extraction of temporal information from texts is a recent research field within the Information Retrieval area. In [7] it has been shown that nearly 25% of the tagged tokens in documents are time entities, whereas nearly 31% of the tags correspond to person names. The relevance of temporal information is also demonstrated in [2], where the impact of time attributes on Information Retrieval systems is analyzed.

Extracting temporal information is also important in the topic detection and tracking tasks. However, the methods proposed in the literature (e.g. [1]) use the publication date as the event time. The work presented in [2] tries to calculate event-time periods by grouping similar news located in consecutive publication dates. This approach can produce errors because an event is published one or more days after its occurrence.

There are other works in the literature dedicated to automatically extracting dates from dialogues [11] and news [12]. The main limitation of these approaches is that only absolute temporal expressions [7] are analyzed to extract dates. In [12], some simple relative expressions can also be analyzed by applying the tense of verbs to disambiguate them.

6. Conclusions

In this paper a new method for extracting temporal references from texts has been presented. With this method event-time periods can be calculated for documents, which can be used in turn for retrieving documents and discovering temporal relationships. The proposed method is based on the shallow parsing of natural language sentences containing time entities. These are translated into a formal time model where calculations can be performed to obtain concrete dates.

Future work is focused on the automatic recognition of events by using the extracted dates and the chunks of text where they appear. Another interesting task consists of resolving the temporal expressions that refer to other events.

References

1. J. Allan, R. Papka and V. Lavrenko. "On-Line New Event Detection and Tracking". 21st ACM SIGIR Conference, pp. 37-45, 1998.
2. R. Swan and J. Allan. "Extracting Significant Time Varying Features from Text". CIKM Conference, pp. 38-45, 1999.
3. R. Berlanga, M. J. Aramburu and F. Barber. "Discovering Temporal Relationships in Database of Newspapers". In Tasks and Methods in Applied Artificial Intelligence, LNAI 1416, Springer Verlag, 1998.
4. M. J. Aramburu and R. Berlanga. "Retrieval of Information from Temporal Document Databases". ECOOP Workshop on Object-Oriented Databases, Lisboa, 1999.
5. M. J. Aramburu and R. Berlanga. "A Retrieval Language for Historical Documents". 9th DEXA Conference, LNCS 1460, pp. 216-225, Springer Verlag, 1998.
6. C. Bettini et al. "A glossary of time granularity concepts". In Temporal Databases: Research and Practice, LNCS 1399, Springer-Verlag, 1998.
7. "The task definitions, Named Entity Recognition Task Definition". Version 1.4, http://www.itl.nist.gov/iad/894.01/tests/ie-er/er_99/doc/ne99_taskdef_v1_4.ps
8. R. Grishman. "Information Extraction: Techniques and Challenges". International Summer School SCIE-97. Edited by Maria Teresa Pazienza, Springer-Verlag, pp. 10-27, 1997.
9. Castellón, M. Civit and J. Atserias. "Syntactic Parsing of Unrestricted Spanish Text". International Conference on Language Resources and Evaluation, Granada (Spain), 1998.
10. J. Wiebe et al. "An empirical approach to temporal reference resolution". Second Conference on Empirical Methods in Natural Language Processing, Providence, 1997.
11. M. Stede, S. Haas and U. Küssner. "Understanding and tracking temporal descriptions in dialogue". 4th Conference on Natural Language Processing, Frankfurt, 1998.
12. D. B. Koen and W. Bender. "Time frames: Temporal augmentation of the news". IBM Systems Journal, Vol. 39 (3/4), pp. 597-616, 2000.
