Applying Semantic Classes in Event Detection and Tracking

Juha Makkonen, Helena Ahonen-Myka and Marko Salmenkivi
Department of Computer Science, Teollisuuskatu 23, P.O. Box 26, 00014 University of Helsinki, Finland
Email: {jamakkon,hahonen,salmenki}@cs.helsinki.fi

Abstract

Event detection and tracking is a fairly recent area of information retrieval research. Detection is about spotting new, previously unreported real-life events in an online news-feed, while tracking assigns documents to previously spotted events. We propose a new vector model that divides the terms of a document into four semantic classes (locations, proper names, temporal expressions and normal terms), each stored in a designated sub-vector. We also propose a new similarity measure that exploits these semantic classes. Moreover, because the concept of an event is vague, we run our experiments with several different definitions of it. In our experiments on a Finnish online news-stream corpus, we find that the use of semantic classes improves the performance significantly. Furthermore, the granularity at which the events are labeled influences the efficiency of the TDT tasks.
1 Introduction
A fairly novel area of information retrieval called event detection and tracking (also topic detection and tracking, TDT) attempts to design methods that automatically (1) spot new, previously unreported events (first story detection, FSD), and (2) follow the progress of the previously spotted events (tracking) (see e.g. [Allan et al., 1998a; Yang et al., 1999]). For instance, think of an information worker or a specialist who receives several incoming news-streams reporting various things taking place in the world. The worker might want to follow the course of events regarding bush fires in Australia or the development of the presidential elections in France, or just be informed if anything new takes place in Africa or in the metal industry, for example. Much of the current research owes to the initial TDT project started in 1996 [Allan et al., 1998a], which spawned projects involving multi-lingual material (see, e.g., [Geutner et al., 1998]) and projects combining radio and television broadcasts with digitally published newspapers (e.g. [Jin et al., 1999]). Most of the approaches try to establish a statistical model based on a training set, and then
apply that model to a test set (e.g., [Allan et al., 1998a; Yang et al., 1999]). Typically, the news stories are represented using the vector model. In this paper we introduce a modification of the traditional vector model for TDT tasks. Instead of building a single document vector of terms in general, we split the domain of terms into four semantic classes: temporal expressions, proper names, locations and the 'normal' words. Accordingly, a document is represented by four vectors, and the comparison of two documents is executed class-wise. We also introduce a new similarity measure exploiting the extracted semantic information. In addition, we draw attention to the definition of an event, which is often neglected in TDT research. Empirical results are provided from trials in which several event definitions and similarity measures were used. In Section 2 we outline the previous approaches to TDT. In Section 3 we discuss the definition of an event. Section 4 portrays the semantic class approach in detail. The corpus is described in Section 5. Section 6 deals with the application of the semantic classes. In Section 7 we present the results of our experiments. Section 8 is a brief conclusion.
2 Related Work
The great majority of the approaches in TDT have relied on some sort of clustering: single-pass clustering [Allan et al., 1998a; Yang et al., 1999] and hierarchical group-average clustering [Yang et al., 1999]. Hidden Markov models [van Mulbregt et al., 1998], Rocchio, k-nearest neighbours [Yang et al., 2000], naive Bayes [Seymore and Rosenfeld, 1997] and Kullback–Leibler divergence [Lavrenko et al., 2002] have also been used. An event is hence some sort of centroid, i.e., a compilation of the vectors assigned to the event. The terms have been weighted in numerous ways: tf-idf variants [Allan et al., 2000; Yang et al., 2000], surprisingness [Allan et al., 1998b], and time decay [Allan et al., 1998a]. Allan et al. claim that these systems have shown modest performance because ". . . it [tracking based on current technology] is not sufficient to achieve effective FSD" [Allan et al., 2000]. A high-performance FSD system would require a nearly perfect tracking system due to the interrelation of the tasks. The prevailing methods, which are based on full-text similarity comparison, have failed to attain this ideal.
3 Concept of an Event
Initially, TDT aimed at finding dynamically changing topics in the newsfeed, but the scope was then limited to events. Events take place in the world, and news documents reflect them. Obviously, a detection/tracking system does not perceive the events themselves, but rather tries to deduce them by examining their appearance in the news-stream. There are several ways of defining an event:

Definition 1 An event is something that happens at some specific time and place [Yang et al., 1999].
This definition was adopted in the TDT project, and it is intuitively quite sound. Practically all of the events in the TDT test set exhibit temporal proximity ("burstiness") and compactness. However, there are also a number of problematic cases that this definition seems to neglect: events that have a long-lasting nature (the Intifada, Kosovo–Macedonia, the struggle in Colombia), escalate into several large-scale threads or campaigns (September 11), or are not tightly spatio-temporally constrained (the BSE epidemic). The events in the world are not as autonomous as this definition assumes. They are often interrelated and do not necessarily decay within weeks or a few months. Some of these problematic events would qualify as activities [Papka, 1999], but when encountering a piece of news, we do not know a priori whether it is a short-term event or a long-term activity, the start of a complex chain of events or just a simple incident.

Definition 2 An event is a specific thing that happens at a specific time and place along with all necessary preconditions and unavoidable consequences [Cieri, 2000].

This is basically a variant of Definition 1 that in some sense tries to address the autonomy assumption. Yet it opens a number of questions as to what the necessary precondition of a certain event, an oil crisis, for example, is. What are the necessary preconditions and unavoidable consequences of Secretary Powell's visit to the Middle East? Drawing the line between an activity and an event is at least as problematic as above. In the worst case, there would be only a few events that would cover everything.

Definition 3 An event is a dynamic topic, a subject that is discussed intensely in the news at some time.

Here the event is defined in terms of the documents, as a characteristic or quality that a certain group of documents shares while another group does not.
An event is not so much a single incident one can point to, but rather a topic that springs up and needs to be detected. This resembles an alternative definition of a topic as a dynamically changing event [Yang et al., 1999]; Lavrenko et al., for example, see the topic as being "centered around a specific event" [Lavrenko et al., 2002].

Definition 4 An event is a dynamic topic that evolves and might later fork into several distinct events.

This definition enables splitting an event into several sub-streams that then have courses of their own. The terrorist attacks of September 11 have, in addition to the immediate consequences, led to a full-scale war in Afghanistan, the overthrow of the Taliban regime, a less intense global war against terrorism, several top-level meetings, and a political crisis in Germany, just to name a few. The connection between the various outcomes and the initial cause becomes less and less obvious as the events progress. Nevertheless, this definition is not investigated here. However, we will apply the first three definitions to our corpus. Definition 1 will lead to events that are temporally and spatially quite compact, such as the floods in Siberia, the first space tourists, various elections, accidents, storms, and strikes; generally self-contained happenings that are not strongly related or linked to anything else. In our interpretation, Definition 2 broadens the first definition slightly. In addition to the previous, it will result in events that are longer in duration, are spatially more spread
out, or lead to some other events, e.g., the conflicts in Colombia, Congo and Kashmir, the problems of Swissair, and the Summit of the Americas in Quebec. Finally, the events of Definition 3 contain topics that are discussed intensely for some period of time (the Kyoto Climate Protocol), themes or changes that occur in multiple locations (the BSE epidemic, the introduction of the Euro currency), and long-lasting struggles or wars that are reported very frequently (the Intifada, Macedonia/Kosovo).
4 Semantic Classes
A lot of the previous efforts in this field have relied on statistical learning and have emphasized different kinds of words somewhat uniformly. It has been difficult to detect two distinct train accidents or ice hockey games as different events [Allan et al., 1998a]: the terms occurring in the two documents are so similar that the term-space in use fails to represent the required, very delicate distinction. Intuitively, when two different train accidents are reported, the location and the time, and possibly some names of people, are the terms that make up the difference. Papka observes that increasing the weights of noun phrases and dates improves the classification accuracy, while decreasing them makes the accuracy decline [Papka, 1999]. This suggests that it would be beneficial to attribute higher weights to the spatio-temporal terms in order to detect and track events. A news document reporting an event states, at the barest, what happened, where it happened, when it happened, and who was involved. The automatic extraction of these facts can be quite troublesome and time-consuming, and can still perform poorly. Previous detection and tracking approaches have tried to encapsulate these facts in a single vector. In order to attain the delicate distinctions mentioned above, to avoid the problems of term-space maintenance, and still to maintain robustness, we assign each of these questions a semantic class, i.e., a set of words whose meanings are of the same type. The semantic class of locations contains all the places mentioned in the document, and thus gives an idea of where the event took place. Similarly, temporals, i.e., temporal expressions, name a point of time and bind the document onto the time-axis. Names are proper names and tell who was involved. What happened is represented by 'normal' words, which we call terms. We represent each document with four sub-vectors, as illustrated in Figure 1. We call this an event vector.
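As a sketch, the four-class representation can be captured in a small data structure. The class name and fields below are illustrative, not from the paper; each sub-vector maps a feature to the 1-based sentence ordinals at which it occurs, anticipating the rank-based similarity measure of Section 6.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the event vector: one sub-vector per semantic
# class, each mapping a feature to the 1-based sentence ordinals at which
# it occurs in the document.
@dataclass
class EventVector:
    locations: dict = field(default_factory=dict)  # where it happened
    names: dict = field(default_factory=dict)      # who was involved
    temporals: dict = field(default_factory=dict)  # when it happened
    terms: dict = field(default_factory=dict)      # what happened

# The Figure 1 example: all features occur in the first sentence.
doc = EventVector(
    locations={"pacific ocean": [1], "california": [1]},
    names={"u.s. navy": [1]},
    temporals={"wednesday": [1]},
    terms={"submarine": [1], "fire": [1], "record": [1], "research": [1]},
)
```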
If two documents coincide as to temporal expressions and locations, for example, it serves as strong evidence that they relate to the same event. Naturally, a temporal expression needs to be evaluated with respect to the moment of utterance, that is, the time of publication. The meaning of 'Wednesday' in Figure 1 varies depending on when the news was published. We map each temporal expression onto the time-axis as a range expressed by a pair of dates (start, end). Thus 'Wednesday' would result in the pair (20020828, 20020828), whereas 'this week' would yield (20020826, 20020901).
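This mapping can be sketched as follows. The function below is a minimal illustration, not the paper's actual normalizer; it resolves only a few expression types relative to the publication date.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# Resolve a temporal expression to a (start, end) date pair relative to
# the publication date. Only a few expression types are handled here.
def resolve(expression, published):
    expr = expression.lower()
    if expr == "today":
        return (published, published)
    if expr == "this week":
        monday = published - timedelta(days=published.weekday())
        return (monday, monday + timedelta(days=6))
    if expr in WEEKDAYS:
        # the nearest such weekday at or before the publication date
        delta = (published.weekday() - WEEKDAYS.index(expr)) % 7
        day = published - timedelta(days=delta)
        return (day, day)
    raise ValueError(f"unhandled expression: {expression}")

# Published on Wednesday, August 28, 2002:
start, end = resolve("this week", date(2002, 8, 28))
print(start, end)  # 2002-08-26 2002-09-01
```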
5 Corpus
Our corpus consists of 3958 Finnish online news documents from a single Internet source, dating from April 1 to June 30, 2001. We have manually assigned the documents to a total of 168 events of varying size and type. We also have three levels of granularity, based on Definitions 1–3 presented in Section 3.
Table 1 shows, for each definition, the number of events, the number of documents assigned to an event, and the average number of documents in an event. In addition, the average length indicates the number of days between the first and the last document, and the average gap is the average number of days between two consecutive documents of the same event. The events corresponding to Definition 1 are the smallest, temporally shortest, and most compact. Definition 3 introduces several temporally short events that are not accepted by the other definitions, and therefore the gap between two documents of the same event becomes smaller. The largest event is the Intifada, comprising 129 documents at granularity 3. There are also several events containing only one document. In addition, we have manually classified the documents into 17 categories that form the first level of the International Press Telecommunications Council (IPTC) taxonomy. The distribution of the classes is illustrated in Figure 2. On average, a document is assigned to 1.46 categories.
Figure 1: An example of an event vector. "The U.S. Navy diesel research submarine that holds the world's deep-diving record caught fire in the Pacific Ocean off California on Wednesday. . . " (Washington Post, May 22, 2002)
Figure 2: The distribution of IPTC classes in the corpus.
We have employed the Connexor Functional Dependency Grammar parser for Finnish (FI-FDG) to conduct the syntactic and morphological analysis that is required in mapping the temporal expressions as well as in selecting the normal terms. The class of terms consists of nominal heads and premodifying adjectives and nouns. The recognition of locations and proper names relies on the Connexor Term Extractor (FI-BRACKETS). Most of the documents contain about 100 words. Of these, nearly half are regarded as terms, which in our case are nouns and adjectives. There are about 4 references to places and nearly 7 references to people or organizations in an average document. Each document can be expected to contain 2 or 3 temporal expressions. Table 2 shows the total and the average number of instances in each semantic class in the corpus.
6 Applying Semantic Classes

6.1 Examining the Intersection
Typically, one semantic class is not enough to determine whether two documents refer to the same event. Two documents reporting ice hockey results have a large number of common names. Weather forecasts coincide as to locations. In online news, the temporal expressions are typically references to the current date and are thus similar for news published on the same day. The same terms can describe two different car crashes, and so on. We investigated the number of common locations, proper names, terms, and temporal expressions in any two documents of the corpus. A summary is presented in Table 3. The 'yes' row indicates that the two documents refer to the same event, while 'no' indicates the opposite. We see that for documents referring to the same event, the number of common instances of locations is 16-fold (1.332 : 0.083) compared to the opposite case. For names the ratio is 24:1 (0.846 : 0.035). The ratio between the number of common terms in the 'yes' and 'no' cases is considerably lower (5:1) than those of locations and names. The temporals appear to have a low ratio (3.5:1) as well. The analysis strongly suggests that locations and proper names are the most important separators between documents in the event detection and tracking tasks.

defn  events  docs  size   length  gap
1     77      398   5.169  14.5 d  3.47 d
2     109     605   5.550  18.2 d  3.98 d
3     168     1071  6.375  19.0 d  3.53 d

Table 1: Granularity of events.

semantic class  instances  Avg(X)
locations       16703      4.220
names           26431      6.736
terms           185613     47.314
temporals       10107      2.577
total words     400999     102.578

Table 2: Corpus statistics.
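The intersection analysis above can be sketched as a per-class count of shared features between two documents. Representing a document as a dict of class name to set of features is an assumption made for this illustration.

```python
CLASSES = ("locations", "names", "terms", "temporals")

# Count the features shared by two documents in each semantic class.
def common_counts(doc_x, doc_y):
    return {c: len(doc_x.get(c, set()) & doc_y.get(c, set()))
            for c in CLASSES}

x = {"locations": {"helsinki"}, "names": {"powell"},
     "terms": {"visit", "talks"}, "temporals": {"monday"}}
y = {"locations": {"helsinki"}, "names": {"powell"},
     "terms": {"talks"}, "temporals": {"tuesday"}}
print(common_counts(x, y))  # {'locations': 1, 'names': 1, 'terms': 1, 'temporals': 0}
```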
6.2 Similarity Measures for the Intersection
There are several ways to conduct a similarity test based on the intersection of two documents. Some of the most typical are the Dice, Jaccard, Cosine and Overlap coefficients [van Rijsbergen, 1980], expressed by the formulas

D(X, Y) = 2|X ∩ Y| / (|X| + |Y|),
J(X, Y) = |X ∩ Y| / |X ∪ Y|,
C(X, Y) = |X ∩ Y| / √(|X| × |Y|),
O(X, Y) = |X ∩ Y| / min(|X|, |Y|),

where X and Y are documents. However, because of the structure of a news document, the occurrences are not of equal importance: the core of the content is usually expressed in the headline and in the first few sentences. We also want to take the number of instances of a co-occurring term into account. Thus, we propose a new similarity measure. Instead of introducing several variables, we wrap the rank of an occurrence, i.e., the ordinal of the sentence in which the occurrence takes place, and the number of occurrences into a single value. To measure the similarity of a semantic class α in two documents X and Y, we propose the following formula:
sim(X_α, Y_α) = Σ_{i=1}^{n} Σ_{k=1}^{m_i} 1 / 2^{ln p_ik},

where X_α is the vector of features (words) belonging to the semantic class α in document X. Each of the n ∈ ℕ terms in the intersection X_α ∩ Y_α has m_i ∈ ℕ instances, and p_ik is the rank of the kth instance of the ith feature. The weight is 'inverted' so that occurrences in the first sentence (1/2^{ln 1} = 1) are of greater value than, for example, those in the fifth sentence (1/2^{ln 5} ≈ 0.328). Moreover, the natural logarithm dampens the growth of the denominator: for instance, the difference in weight between occurrences in the eighth and the ninth sentence is only 0.019 (= 0.237 − 0.218).

Table 3 shows the average pairwise similarity under the proposed measure and under the Dice, Jaccard, Cosine, and Overlap coefficients in each semantic class in the corpus. With the exception of temporals, all the ratios between the 'yes' and 'no' sections seem to have grown. Hence, exploiting the rank and the number of occurrences has strengthened the similarity of documents within the same event compared to those of different events.

class      same  |X ∩ Y|  Dice   Jaccard  Cosine  Overlap  proposed
locations  yes   1.332    0.421  0.038    0.441   0.547    4.919
           no    0.083    0.022  0.017    0.024   0.035    0.173
names      yes   0.846    0.065  0.046    0.076   0.131    2.063
           no    0.035    0.004  0.012    0.005   5.545    0.051
terms      yes   5.801    0.009  0.005    0.028   0.158    10.364
           no    1.137    0.002  0.011    0.005   0.032    1.540
temporals  yes   0.279    0.051  0.036    0.054   0.079    0.315
           no    0.083    0.021  0.016    0.022   0.217    0.021

Table 3: The average pairwise similarity of documents of the same and of different events in each of the semantic classes: average size of the intersection, Dice, Jaccard, Cosine, Overlap, and the proposed measure.

We define the total event similarity measure as a sum of all of the distinct semantic class similarities:
ESM(X, Y) = Σ_{α∈S} w_α · sim(X_α, Y_α),    (1)

where w_α is the weight attributed to the semantic class and S is the set of semantic classes. The weight is a parameter to be tuned, and it reflects the relative importance of a semantic class.
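A minimal sketch of the proposed measure and of Eq. 1 follows. The formula leaves implicit whose instances p_ik are summed; this sketch counts the occurrences of a shared feature in both documents, which should be read as an assumption rather than as the authors' exact implementation.

```python
import math

# Sub-vectors map a feature to the 1-based sentence ordinals of its
# occurrences. Each occurrence in sentence p contributes 1 / 2**ln(p);
# summing the occurrences from both documents keeps the measure symmetric
# (an assumption, see the lead-in above).
def sim(x_class, y_class):
    total = 0.0
    for feature in x_class.keys() & y_class.keys():    # X_a ∩ Y_a
        for p in x_class[feature] + y_class[feature]:  # instances p_ik
            total += 1.0 / 2 ** math.log(p)
    return total

# Eq. 1: weighted sum of the per-class similarities.
def esm(x, y, weights):
    return sum(w * sim(x.get(c, {}), y.get(c, {}))
               for c, w in weights.items())

# The heuristically chosen weights of Section 7.
WEIGHTS = {"locations": 2.0, "names": 2.0, "terms": 0.8, "temporals": 1.0}

# A first-sentence occurrence weighs 1, a fifth-sentence one about 0.328.
print(round(1 / 2 ** math.log(5), 3))  # 0.328
```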
6.3 Detection and Tracking Algorithm
Our detection and tracking system uses single-pass clustering [van Rijsbergen, 1980]: as a new document is encountered, it is compared to the existing events. We start from the latest event and proceed toward the earliest. If the similarity exceeds a given threshold, the document is assigned to the corresponding event. If there is no event with sufficient similarity, the document establishes a new event, i.e., it is considered a first story. There could be several events to which the document is similar, but we content ourselves with the first hit. Here, an event is represented by the union of the first and the last story of the given event. If a new document is regarded as discussing some particular event, it replaces the last document in the event representation.
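The single-pass loop can be sketched as follows; the function names and the dict-based event representation are illustrative, and the toy similarity in the usage example stands in for the ESM measure.

```python
# Single-pass detection and tracking: compare each incoming document to
# the existing events, latest first, and accept the first hit. An event
# is represented by its first and its most recent story.
def process(doc, events, similarity, threshold):
    for event in reversed(events):       # latest event first
        if similarity(doc, event) >= threshold:
            event["last"] = doc          # tracking: update representation
            return False                 # not a first story
    events.append({"first": doc, "last": doc})
    return True                          # first story: a new event

# Toy run: documents as term sets, Jaccard over the union of the event's
# first and last stories standing in for ESM.
def jaccard_to_event(doc, event):
    rep = event["first"] | event["last"]
    return len(doc & rep) / len(doc | rep)

events = []
print(process({"submarine", "fire", "california"}, events, jaccard_to_event, 0.5))  # True
print(process({"submarine", "fire", "navy"}, events, jaccard_to_event, 0.5))        # False
```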
7 Experiments
In this section we provide results from the experiments conducted to evaluate the system and the different similarity measures. First, we tested the performance of the system using the novel similarity measure against baseline methods that ignored the separation between semantic classes, i.e., employed single document vectors as the data representation. Second, we applied the Cosine, Dice, Jaccard, and Overlap coefficients with semantic classes. These were not weighted, however, since they provide normalized output themselves; the similarity measures of the semantic classes were simply added together without scaling. The weights for the ESM measure (Eq. 1) were at this point found more or less heuristically: we used 2.0 for locations, 2.0 for names, 0.8 for terms and 1.0 for temporals. In addition, since brute-force pairwise comparison becomes infeasible with large corpora, we first categorized the news documents according to the IPTC taxonomy, and the detection and tracking task was then restricted to the resulting categories. Since we are interested here in testing the performance of the detection and tracking system, the categorization was taken as given (see Figure 2). The resulting F1-measures are presented in Table 4. Each row is produced by one threshold value, i.e., the same threshold is used in both tasks for all of the event definitions (defn). In order to compare the methods, we combined the F1-measures to indicate the overall efficiency of detection and tracking. The average of the two is listed on the right, and it was used as the criterion in selecting the optimal threshold. The methods exploiting semantic classes, especially ESM, outperform the other methods in tracking. Also, the precision of the released documents in first story detection is higher than that of the baseline methods. The baseline methods, in contrast, seem to yield higher recall.
This is, however, due to the linkage of the tracking and detection tasks: as the recall of the baseline methods in tracking is poor, a large number of relevant documents is missed, leading to an increase in the recall of first story detection. Clearly, the event similarity measure outperforms the rest of the methods: it consistently provides the highest F1-measures. The similarity coefficients behave quite uniformly under each definition. The baseline methods work better with compact events, but as the events become lengthier and more complex, their performance drops. The experimental results indicate, as expected, that the detection and tracking tasks become harder when the simple definition is broadened, and even harder when the requirement of a specific location and time is dropped. Nevertheless, the differences are not very large. Comparing these results with previous TDT research is not very reasonable. Our corpus of about 4000 documents is only one quarter of the TDT corpus of the University of Massachusetts and Carnegie Mellon University [Allan et al., 1998a]. However, our corpus has 3 to 6 times more events, depending on the definition, and our events are considerably smaller. These characteristics alter the problem slightly.
                        Detection              Tracking
defn  method            P      R      F1_D    P      R      F1_T    (F1_D + F1_T)/2
1     baseline-cosine   0.493  0.948  0.649   0.365  0.340  0.352   0.500
      baseline-dice     0.484  0.974  0.647   0.397  0.347  0.370   0.508
      baseline-jaccard  0.518  0.948  0.670   0.355  0.346  0.351   0.510
      baseline-overlap  0.477  0.805  0.599   0.285  0.394  0.330   0.465
      cosine            0.406  0.635  0.495   0.507  0.290  0.369   0.432
      dice              0.387  0.565  0.459   0.422  0.252  0.316   0.387
      jaccard           0.368  0.671  0.475   0.500  0.194  0.280   0.377
      overlap           0.550  0.647  0.595   0.463  0.462  0.462   0.528
      esm               0.627  0.812  0.708   0.751  0.561  0.642   0.675
2     baseline-cosine   0.420  0.927  0.578   0.373  0.244  0.295   0.436
      baseline-dice     0.497  0.691  0.578   0.235  0.473  0.314   0.446
      baseline-jaccard  0.446  0.909  0.599   0.346  0.259  0.297   0.448
      baseline-overlap  0.412  0.809  0.546   0.285  0.247  0.265   0.405
      cosine            0.370  0.627  0.465   0.510  0.289  0.369   0.417
      dice              0.348  0.534  0.421   0.410  0.268  0.324   0.373
      jaccard           0.325  0.627  0.428   0.477  0.201  0.283   0.355
      overlap           0.486  0.602  0.538   0.445  0.463  0.454   0.496
      esm               0.578  0.754  0.654   0.692  0.484  0.570   0.612
3     baseline-cosine   0.358  0.833  0.501   0.304  0.163  0.213   0.357
      baseline-dice     0.349  0.881  0.500   0.345  0.159  0.217   0.359
      baseline-jaccard  0.383  0.810  0.520   0.291  0.224  0.253   0.387
      baseline-overlap  0.366  0.738  0.489   0.227  0.168  0.193   0.341
      cosine            0.449  0.428  0.438   0.328  0.517  0.402   0.420
      dice              0.480  0.367  0.416   0.329  0.589  0.422   0.419
      jaccard           0.500  0.307  0.381   0.271  0.693  0.390   0.385
      overlap           0.461  0.530  0.493   0.465  0.546  0.502   0.498
      esm               0.513  0.602  0.554   0.616  0.600  0.608   0.581

Table 4: The detection and tracking results (P = precision, R = recall).
8 Conclusions
In this paper, we presented a novel approach to event detection and tracking. While most of the previous approaches in the area were based on a single term vector as the document representation, we extract locations, proper names, and temporal expressions from the data. Every document is represented with four vectors: one for each of these semantic classes, and one for the general terms. We also introduced a new similarity measure that utilizes the new type of vectors and attributes larger weights to words occurring in the beginning of the document. We also put forward several definitions of events to meet the criteria of various use-cases. The experiments indicate that broadening the criteria slightly decreases the performance.
The new vector type with the proposed similarity measure encouragingly outperforms the baseline methods. In the future, in addition to continuing to use semantic classes, we will delve deeper into what constitutes an event, in what manner the documents within an event are similar, what types of events there are, and how the type could be recognized and later exploited automatically.
References

[Allan et al., 1998a] James Allan, Jaime Carbonell, George Doddington, Jonathan Yamron, and Yiming Yang. Topic detection and tracking pilot study final report. In Proc. DARPA Broadcast News Transcription and Understanding Workshop, February 1998.

[Allan et al., 1998b] James Allan, Victor Lavrenko, and Ron Papka. Event tracking. Technical Report IR-128, Department of Computer Science, University of Massachusetts, 1998.

[Allan et al., 2000] James Allan, Victor Lavrenko, and Hubert Jin. First story detection in TDT is hard. In Proc. 9th Conference on Information and Knowledge Management (CIKM), pages 374–381, McLean, VA, USA, 2000.

[Cieri, 2000] Christopher Cieri. Multiple annotations of reusable data resources: Corpora for topic detection and tracking. In Proc. Journées Internationales d'Analyse Statistique des Données Textuelles (JADT), 2000.

[Geutner et al., 1998] Petra Geutner, Michael Finke, Peter Scheytt, Alex Waibel, and Howard Wactlar. Transcribing multilingual broadcast news using hypothesis driven lexical adaptation. In Proc. DARPA Broadcast News Workshop, 1998.

[Jin et al., 1999] Hubert Jin, Rich Schwartz, Sreenivasa Sista, and Frederick Walls. Topic tracking for radio, TV broadcast, and newswire. In Proc. DARPA Broadcast News Workshop, 1999.

[Lavrenko et al., 2002] Victor Lavrenko, James Allan, Edward DeGuzman, Daniel LaFlamme, Veera Pollard, and Stephen Thomas. Relevance models for topic detection and tracking. In Proc. Human Language Technology Conference (HLT), 2002.

[Papka, 1999] Ron Papka. On-line New Event Detection, Clustering and Tracking. PhD thesis, Department of Computer Science, University of Massachusetts, 1999.

[Seymore and Rosenfeld, 1997] Kristie Seymore and Ronald Rosenfeld. Large-scale topic detection and language model adaptation. Technical report, School of Computer Science, Carnegie Mellon University, 1997.

[van Mulbregt et al., 1998] Paul van Mulbregt, Ira Carp, Lawrence Gillick, Steve Lowe, and Jon Yamron. Text segmentation and topic tracking on broadcast news via a hidden Markov model approach. In Proc. 5th Intl. Conference on Spoken Language Processing (ICSLP'98), 1998.

[van Rijsbergen, 1980] C. J. van Rijsbergen. Information Retrieval. Butterworths, 2nd edition, 1980.

[Yang et al., 1999] Yiming Yang, Jaime Carbonell, Ralf Brown, Thomas Pierce, Brian T. Archibald, and Xin Liu. Learning approaches for detecting and tracking news events. IEEE Intelligent Systems Special Issue on Applications of Intelligent Information Retrieval, 14(4):32–43, 1999.

[Yang et al., 2000] Yiming Yang, Thomas Ault, Thomas Pierce, and Charles Lattimer. Improving text categorization methods for event detection. In Proc. ACM SIGIR, pages 65–72, 2000.