Research Track Paper
Time-Dependent Event Hierarchy Construction∗
Gabriel Pui Cheong Fung, The Chinese University of Hong Kong, Hong Kong, China, [email protected]
Jeffrey Xu Yu, The Chinese University of Hong Kong, Hong Kong, China, [email protected]
Huan Liu, Arizona State University, Arizona, USA, [email protected]
Philip S. Yu, IBM T. J. Watson Research Center, USA, [email protected]
ABSTRACT
In this paper, an algorithm called Time Driven Documents-partition (TDD) is proposed to construct an event hierarchy in a text corpus based on a given query. Specifically, assume that a query contains only one feature: Election. Election is directly related to events such as the 2006 US Midterm Elections Campaign, the 2004 US Presidential Election Campaign and the 2004 Taiwan Presidential Election Campaign, and these events may be further divided into several smaller events (e.g. the 2006 US Midterm Elections Campaign can be broken down into events such as campaign for vote, election results and the resignation of Donald H. Rumsfeld). As such, an event hierarchy results. Our proposed algorithm, TDD, tackles the problem in three major steps: (1) Identify the features that are related to the query according to both the timestamps and the contents of the documents. The features identified are regarded as bursty features; (2) Extract the documents that are highly related to the bursty features based on time; (3) Partition the extracted documents to form events and organize them in a hierarchical structure. To the best of our knowledge, there is little work on constructing a feature-based event hierarchy for a text corpus. Practically, event hierarchies can help us locate our target information in a text corpus efficiently. Again, assume that Election is used as a query. Without an event hierarchy, it is very difficult to identify the major events related to it, when these events happened, and the features and news articles that are related to each of these events. We have archived two years of news articles to evaluate the feasibility of TDD. The encouraging results indicate that TDD is practically sound and highly effective.
General Terms
Algorithms, Design, Management
Keywords
Hierarchies, Events, Time, Clustering, Text, Retrieval, Presentation
1. INTRODUCTION
In this century of information overload, information becomes ever more pervasive and important. While there are some excellent search engines, like Google, that help us retrieve our target information from a few keywords, the sheer volume of information surrounding us makes it harder and harder to locate the right piece of information efficiently. As an example, suppose we want to identify what happened when the virus SARS (Severe Acute Respiratory Syndrome) broke out in Hong Kong. By specifying the word SARS as the search query, we may obtain a result similar to the one shown in Figure 1 (a) through existing news search engines such as Factiva¹. The figure shows the headlines and the first few lines of the news articles whose contents contain the word SARS. Yet, the search result in Figure 1 (a) does not capture any of the "time-dependent information", although this information is critical in helping us locate our target efficiently. For example, it is very difficult to identify from Figure 1 (a) the major events related to the keyword SARS (e.g. the closure of schools, the travel warnings issued by WHO, the resignation of the Hong Kong Secretary of Health, etc.), the periods when these events happened, and the keywords and news articles that are related to each of these events. On the other hand, time-related information is often associated with documents, such as the time when a web page is updated, the time when a news article is released or the time when a blog entry is written. It is desirable to group the documents in a corpus according to both their contents and their timestamps. This urges us to think critically about whether we can extend the capability of existing search engines to incorporate more information, such as the time dimension.
Specifically, we are curious whether it is possible to have an algorithm that groups the retrieved documents into events according to the similarity of their contents and their timestamps, so as to construct an event hierarchy for a particular user's query. Let us continue with our previous example about SARS. The questions that we have raised before can all be answered efficiently by constructing an event hierarchy which is similar to the one shown
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—clustering; H.5 [Information Interfaces and Presentation]: Miscellaneous
∗The work was supported by a grant of RGC, Hong Kong SAR, China (No. 418206).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD’07, August 12–15, 2007, San Jose, California, USA. Copyright 2007 ACM 978-1-59593-609-7/07/0008 ...$5.00.
1 http://www.factiva.com
[Figure 1 panels. (a): a flat list of search results for SARS, headlines 1 - 100 of 4671, each showing the headline, source (South China Morning Post), date, word count and first lines of the article. (b): a block diagram of Events A to K over time, with the current event (Event J; period 2003/03/22 - 2003/06/02; associated features SARS, School, College, University, Close, Teachers, Parents, ...) selected and its 204 headlines listed below.]
Figure 1: The result of searching. Panel (a) shows the search results of a traditional search engine for the feature SARS. Panel (b) organizes the search results in (a) by using an event hierarchy.
feld, the former US Secretary of Defense. Furthermore, an event is usually associated with more than one feature. For instance, the 2004 US Presidential Election Campaign is further associated with features such as President, Campaign, Bush, Senator and Kerry. Eventually, an event hierarchy for the feature Election, such as the one in Figure 2, could be formulated. As a result, we will have a much clearer picture of the relationships among the features. Formally speaking, given a user query, we propose an algorithm to construct an event hierarchy by grouping the retrieved documents into events according to their timestamps and their contents. A search query is defined as a set of keywords, which we call features. An event is defined as an object which consists of the following three components: (1) A set of documents with similar contents; (2) A set of representative features extracted from the documents; and (3) Two timestamps denoting when the event begins and ends. Conceptually, given a query, our problem can be readily solved by the following straightforward framework: (1) Identify all of the documents that are related to the query; (2) Group the highly related documents together to form events; (3) Arrange the events in a hierarchical structure; (4) For each event, extract a set of features that can represent it. Intuitively, this framework works well, and can be implemented by combining some of the existing techniques [1, 2, 4, 7, 9, 12, 16, 17, 18, 20, 21, 27, 28, 30, 31, 32, 33, 34, 36, 37, 38]. Unfortunately, we claim that this framework is ineffective for solving our problem. The reason is the imprecise nature of the search results.
The search results returned in Step (1) of the straightforward framework usually contain a sizable number of unrelated documents. These unrelated documents are hard to clean
in Figure 1 (b). In the figure, the upper block diagram represents when the events happened. The x-axis is the time. The blocks in the diagram represent events. All events in the diagram are related to the keyword SARS. They are arranged in a hierarchical structure, where the links denote their relationships. For instance, Event A and Event B are at the same level, where Event A and Event B respectively contain Event C and Event D. There is no direct relationship between Event C and Event D as they are not connected with each other. Event C is further broken down into Event E and Event F, and the latter two events are connected. A similar description applies to the other events. The lower half of Figure 1 (b) is similar to Figure 1 (a) except that it only shows the news articles that are related to a particular event in the upper block diagram (Event J for this figure). As such, we can mitigate the problem of information overload by displaying the information related to a particular event only. It is worth noting that we have preserved all information, as we can easily switch from one event to the others. Last but not least, in Figure 1 (b), every event is associated with a set of keywords. For instance, Event J is associated with keywords such as School, College and University. Practically, it is very useful to associate a set of keywords with each event because we can identify their relationships efficiently. Let us present one more example to illustrate our motivation. The keyword Election is related to events such as the 2006 US Midterm Elections Campaign, the 2004 US Presidential Election Campaign and the 2004 Taiwan Presidential Election Campaign. Usually, an event may be further broken down into several sub-events. In this example, the 2006 US Midterm Elections Campaign may be broken down into events such as campaign for vote, election results and the resignation of Donald H. Rums-
[Figure 2 diagram: a layer of features (Election, President, Vote, ...) linked to a layer of events (US Midterm Election Campaign, US Presidential Election Campaign, Taiwan Presidential Campaign, Vote for Election, Resignation of Donald H. Rumsfeld, ...).]
Figure 2: The relationships among the keywords.
up, making it difficult to group the documents into events so as to construct an appropriate event hierarchy and identify a set of representative features for each event. We will discuss the details in Section 2. In contrast to the above straightforward approach, we propose an algorithm called Time Driven Documents-partition (TDD) to solve our problem. Given a query, TDD will: (1) Identify the features that are related to the query according to the timestamps and contents of the documents in the corpus. These features are regarded as bursty features; (2) Extract the documents that are highly related to the bursty features based on time; (3) Partition the extracted documents and organize them to construct an event hierarchy. We will show that bursty features are less prone to noise, so our approach starts by discovering the bursty features. We will discuss the details in Section 2. We have conducted extensive experiments to evaluate TDD's effectiveness using two years of news articles. We chose news articles because they are indexed by timestamps and their contents have a strong temporal structure, which makes it easy to evaluate whether TDD is feasible. Nevertheless, TDD can be extended to other areas, such as grouping retrieved web pages according to their latest modification dates. This is left for our future work. The rest of the paper is organized as follows: Section 2 presents TDD; Section 3 evaluates the effectiveness of TDD; Section 4 briefly discusses some of the major related work; Section 5 summarizes and concludes this paper.
Sym.    Description
𝒟       a text corpus
D       a set of documents, D ⊂ 𝒟
d_i     a document, d_i ∈ D
t_i     the timestamp of d_i
W       a set of windows
w       a window, w ∈ W
L       number of windows in the text corpus
F       a set of all different features in the text corpus
f       a feature, f ∈ F
n_fw    number of documents containing f in w
N_w     the feature count of w
Q       a user query, Q ⊂ F
B       a set of bursty periods for Q
b       a bursty period, b ∈ B
F_b     a set of features associated to b, F_b ⊂ F
D_b     a set of documents in b that are related to Q, D_b ⊂ D
E       a set of events
e       an event, e ∈ E
Table 1: Notation used in this paper.
chy, the first step arguably is to identify all of the events that are related to Q. A possible approach is to first retrieve all of the documents that are related to Q, and then cluster the documents into groups so as to formulate events. We argue that this approach is inappropriate. Consider the following situation. A news article released by the South China Morning Post (SCMP)² with the headline "Two chiefs, two systems of getting the job done" contained the feature SARS. Interestingly, this article has nothing to do with the virus SARS, severe acute respiratory syndrome. In fact, this news article was released on 2002/01/05 (a date well before the SARS outbreak) and was related to the appraisals of the two chief executives of the two special administrative regions (Hong Kong and Macau) in China. From our experience, when we issue the query Q = {SARS}, we are almost always trying to retrieve information related to the virus (Severe Acute Respiratory Syndrome), not the places (Special Administrative Regions). Querying for the latter will usually involve some other features, such as Hong, Kong or Macau. Even though a document, d ∈ D, matches the user query, Q, the user may not be interested in it. Accordingly, if we follow the approach of retrieving the related documents first and
2. PROPOSED WORK
Table 1 summarizes the notation used throughout this paper. Let 𝒟 = {d_1, d_2, ...} be a text corpus, where d_i is a document with the timestamp t_i. Following the existing direction [7, 30, 31], 𝒟 is partitioned into L consecutive and non-overlapping windows according to the timestamps of the documents. Let W be a set containing all the windows. Let F be a set containing all different features in 𝒟. A feature, f ∈ F, is a word in the text corpus. |F| is the number of different features in 𝒟. The input of our algorithm is a query, Q, where Q ⊂ F. The output is an event hierarchy (which is similar to Figure 2). We first describe how we formulate the problem, and why it is formulated in such a way, by presenting some real-life examples in Section 2.1. We then present the implementation details in the sections thereafter.
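As an illustration of the windowing step, the partition of a corpus into consecutive, non-overlapping windows could be sketched as follows. This is only a sketch under assumptions: the function name `partition_into_windows`, the `(doc_id, date)` input format and the fixed window length `window_days` are hypothetical and are not specified in the text.

```python
from datetime import date


def partition_into_windows(docs, window_days=1):
    """Group (doc_id, timestamp) pairs into consecutive, non-overlapping windows.

    `docs` is a list of (doc_id, datetime.date) pairs. The fixed window length
    in days is an assumption made for this sketch. The number of windows
    returned plays the role of L in the notation table.
    """
    if not docs:
        return []
    start = min(t for _, t in docs)
    end = max(t for _, t in docs)
    num_windows = (end - start).days // window_days + 1
    windows = [[] for _ in range(num_windows)]
    for doc_id, t in docs:
        # Integer division maps every timestamp to exactly one window.
        windows[(t - start).days // window_days].append(doc_id)
    return windows


docs = [
    ("a", date(2003, 1, 1)),
    ("b", date(2003, 1, 1)),
    ("c", date(2003, 1, 3)),
    ("d", date(2003, 1, 8)),
]
weekly = partition_into_windows(docs, window_days=7)
```

With weekly windows, the first three documents fall into the first window and the last document into the second; empty windows are kept so that window indices stay aligned with time.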
2.1 An Overview
Given a user query, Q, in order to construct an event hierar-
2 A news agent: http://www.scmp.com
3 See http://en.wikipedia.org/wiki/2004_Indian_Ocean_earthquake.
[Figure 4 panel titles: (a) binomial pmf using normal approximation; (b) binomial cdf using normal approximation. y-axis: Probability of Bursty. Regions in each panel are labeled A, B and C.]
then identify the events by clustering, we may need some other heuristics for filtering out the documents that match Q but do not coincide with the user's interest. This is difficult. Similarly, even though a document does not match Q, the user may sometimes be interested in it (we will explain this later). In this paper, we claim that we should first identify the events, E, that satisfy Q, and then map the related documents back to the events. This is the reverse of the ineffective approach just mentioned. Yet, if we follow this new direction, a question immediately ensues: how do we identify the events that are related to Q in the text corpus without retrieving any of the documents? From our observations, the emergence of an event is always associated with a burst of features, where some features suddenly appear frequently when the event emerges, and their frequencies drop when the event fades away. For instance, from 2003/01/01 to 2004/12/26 (726 days), the feature Tsunami only appeared 28 times in 23 news articles in SCMP². Yet, from 2004/12/27 to 2004/12/31 (5 days), Tsunami appeared 86 times in 41 news articles. Both the number of news articles and the feature frequency increased dramatically in only 5 days. History tells us that there was an undersea earthquake that occurred at 00:58:53 UTC on 2004/12/26 with an epicenter off the west coast of Sumatra, Indonesia. The earthquake triggered a series of devastating tsunamis that spread throughout the Indian Ocean, killing large numbers of people and inundating coastal communities across South and Southeast Asia.³ This event is usually referred to as the Asian Tsunami. Its emergence can be identified by observing the distribution of Tsunami across the timeline. Hence, by monitoring the feature distributions in 𝒟, we can identify whether any event has occurred. If a feature, f ∈ Q, suddenly appears frequently, we can then conclude that some events related to Q have emerged.
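The size of the Tsunami burst can be made concrete with a few lines of arithmetic. The counts below are the ones quoted in the text; the per-day normalization is only an illustrative choice and is not the burst measure defined later in Section 2.2.

```python
# Counts quoted in the text for the feature "Tsunami" in SCMP.
before_days, before_articles, before_occurrences = 726, 23, 28
burst_days, burst_articles, burst_occurrences = 5, 41, 86

# Average number of articles per day that contain the feature.
before_article_rate = before_articles / before_days
burst_article_rate = burst_articles / burst_days

# Average number of occurrences of the feature per day.
before_occurrence_rate = before_occurrences / before_days
burst_occurrence_rate = burst_occurrences / burst_days

# How many times higher the daily rates are during the burst.
article_ratio = burst_article_rate / before_article_rate
occurrence_ratio = burst_occurrence_rate / before_occurrence_rate

print(f"articles/day: {before_article_rate:.3f} -> {burst_article_rate:.1f}")
print(f"occurrences/day: {before_occurrence_rate:.3f} -> {burst_occurrence_rate:.1f}")
```

Both daily rates rise by a factor of a few hundred during the five-day period, which is the kind of jump the phrase "suddenly appears frequently" is meant to capture.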
Here, another question immediately arises: how do we define the phrase "suddenly appears frequently"? We will provide the implementation details in Section 2.2. Without loss of generality, assume that we know all of the periods where Q suddenly appears frequently. Let B be a set of such periods and D_b be a set of documents which satisfy Q and reside in b ∈ B. Intuitively, the next step is to construct an event hierarchy (e.g. Figure 2) by recursively dividing each D_b with some partitioning algorithm. However, we claim that this step should not be conducted at this moment. Consider the following situation. Assume that Q = {Iraq}. There was an article released by SCMP² on 2003/03/18 titled "A letter to Saddam: Be more like me". Obviously, even judging from the title alone, this article is related to Iraq. But the feature Iraq does not appear anywhere in the article. Some documents may be highly related to Q even though their contents do not match Q. So, before formulating the hierarchical structure, we should extract all of the documents that are highly related to each D_b. Our problem now is: given a set of documents, D_b, extract all of the documents that have similar contents. A simple but ineffective method is to compare the documents in b with D_b one by one: for each document in b, see whether it is similar to any of the documents in D_b. Unfortunately, since the feature distributions are sparse in the text domain, two documents with similar features may not necessarily belong to the same event [6, 29]. As the result obtained by comparing two sets of documents will usually be more reliable than comparing on a per-document basis [29], we should select a proper subset of documents in b and compare it with D_b. Now, another question arises: how do we select a proper subset of documents in b? At first glance, this question can be answered by the tech-
[Figure 4 x-axis, both panels: No. of times the feature appears in a window.]
Figure 4: Two binomial distributions.
niques used in query refinement [14, 22, 25]: find all of the features that are highly associated with Q, and then, for the documents that belong to these features, try to align them to D_b, ∀ b ∈ B. Unfortunately, the existing techniques merely rely on the co-occurrence of features in the text domain, and cannot include the time dimension. Yet, time is important in solving our problem. For instance, the feature Virus is highly related to the feature Bird (Bird Flu, H5N1, a kind of virus) for some bursty periods, but not all of the bursty periods of Virus coincide with those of Bird. It is unlikely that two features have a high association for all of their bursty periods. In this paper, we try to solve the above problem with the help of the bursty patterns of the other features in 𝒟. Firstly, we identify the bursty patterns of all features, F, in 𝒟. Then, for each f ∈ F, we determine whether any of the bursty periods of f is similar to any b ∈ B. Finally, we compare the similarity between D_b and the documents that belong to f by using two novel coefficients: intra-document similarity and inter-document similarity. Their formulation is motivated by a newly defined concept called "map-to". Details will be discussed in Section 2.3. Thus, we can obtain a set of documents, D_b, that are related to Q and are released within the period b ∈ B. As discussed previously, each D_b may further be divided into several events. In this paper, we identify these sub-events by implementing the bisecting K-Means clustering algorithm [29]. Details are discussed in Section 2.4.
2.2 Identify the Bursty Periods of the Features
Let f ∈ Q. In order to determine in which windows the feature f "suddenly appears frequently", we compute the "probability of bursty" for each window w ∈ W. Let P(w, f; p_e) be the probability that the feature f is bursty in window w according to p_e, where p_e is the probability that f would appear in a time window given that it is not bursty. The intuition is that if the frequency of f appearing in w is significantly higher than the expected probability p_e, then w is regarded as a bursty period. We compute P(w, f; p_e) according to the cumulative distribution function of the binomial distribution [19]:

P(w, f; p_e) = \sum_{k=1}^{n_{fw}} p(k; N_w, p_e),    (1)

p(k; N_w, p_e) = \binom{N_w}{k} p_e^{k} (1 - p_e)^{N_w - k},    (2)

where N_w is the total count of the features that appear in w and n_fw is the frequency of f appearing in w. P(w, f; p_e) is the cumulative distribution function of the binomial distribution and p(k; N_w, p_e) is the corresponding probability mass function. Figure 4 (a) and (b) respectively show the probability mass function (p(k; N_w, p_e)) and the cumulative distribution function (P(w, f; p_e)) of the binomial distribution with p_e = 0.3 and N_w = 1000 using a
[Figure 3 panels (a) to (d): bursty state (0 or 1) of the feature SARS over time, January to November 2003.]
Figure 3: The bursty pattern of the feature SARS. Panel (a) shows all of the bursty periods related to SARS. From (b) to (d), bursty periods are gradually removed, retaining only those periods with many related documents.

Algorithm 1 computeBaselineProbability(R)
Input: R = {r_1, r_2, ..., r_L}
Output: The expected probability p_e
1: repeat
2:   p_e = mean(R);
3:   σ = standardDeviation(R);
4:   for each r_i ∈ R do
5:     if r_i > p_e + m·σ then
6:       remove r_i from R;
7:     end if
8:   end for
9: until r_i ≤ p_e + m·σ, ∀ r_i ∈ R
10: return p_e;
[Figure 5 panels: feature percentage over time, January 2003 to January 2005, for (a) Iraq and (b) Tsunami.]
Figure 5: The distributions of Iraq and Tsunami.
Finally, Algorithm 1 may encounter the following problem: p_e may be equal to 0. Consider the feature Tsunami. Its distribution from 2003/01/01 to 2004/12/31 (730 days) is shown in Figure 5 (b). It only appears on 25 days, with a total of 114 occurrences. After completing Algorithm 1, p_e would be 0. If p_e = 0, then the result of Eq. (1) would be undefined. So, we redefine r_i by using Laplacian smoothing [3]. As a result, r_i and Eq. (1) become:
normal approximation. The x-axis represents the number of times f appears in w. Their shape depends on p_e only. p(k; N_w, p_e) is at its maximum if the probability of f appearing in w equals p_e. Let r = n_fw / N_w be the probability of f appearing in w. If r ≪ p_e, P(w, f; p_e) approaches 0. In contrast, if r ≫ p_e, P(w, f; p_e) approaches 1. In other words, if the probability of f appearing in w is well below the expected probability, p_e, we do not consider f to be bursty in w. On the other hand, if the probability is much higher than p_e, we conclude that f exhibits some abnormal behavior and hence is bursty in w. In order to compute p_e in Eq. (1), a simple yet ineffective approach is to explicitly define a fixed threshold, x, such that if f appears more than x% in w, f is bursty in w. This approach is impractical, as different features have different thresholds. In order to assign different thresholds to different features automatically, we may attempt to rely on the mean probability: if f is distributed evenly over all windows in the text corpus, then the probability of f in any window is p_e = (1/L) Σ_{w∈W} (n_fw / N_w). Unfortunately, this may not be appropriate. For features with many bursty periods, such as SARS or Iraq, the mean probabilities will be heavily biased by the bursty periods, and cannot be used to model the situation where the features occur by chance. In this paper, we ignore the windows in which the feature has a significantly high frequency when computing p_e. Algorithm 1 shows how we compute p_e. Let R = {r_1, r_2, ..., r_L} be a sequence where each element, r_i, denotes the probability of f appearing in w_i (r_i = n_fw_i / N_w_i), where w_i is the i-th window in W. In lines 2 – 3, we compute the mean of R (p_e) and its standard deviation. Then, we check whether all r_i fall within m standard deviations of the mean of R.
We reject those points in R that are above this range, and repeat until all remaining r_i are less than or equal to it. In this paper, we set m = 3.
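The iterative rejection described for Algorithm 1 could be sketched in Python as follows. This is a sketch under assumptions rather than the authors' implementation: the rejection bound is taken to be the mean plus m standard deviations (following the prose description), and the population standard deviation is used because the text does not say which estimator is meant.

```python
from statistics import mean, pstdev


def compute_baseline_probability(ratios, m=3.0):
    """Estimate p_e by repeatedly discarding unusually high window ratios.

    `ratios` holds r_i = n_fw_i / N_w_i for every window. Values above
    mean + m * std are treated as bursty and removed, and the mean is
    recomputed until no value is removed.
    """
    kept = list(ratios)
    while True:
        p_e = mean(kept)
        sigma = pstdev(kept)  # population std; an assumption of this sketch
        survivors = [r for r in kept if r <= p_e + m * sigma]
        if len(survivors) == len(kept):
            return p_e
        kept = survivors


# Twenty quiet windows and one bursty window.
window_ratios = [0.01] * 20 + [0.5]
baseline = compute_baseline_probability(window_ratios)
```

The loop always terminates because the smallest ratio can never exceed the mean, so at least one value survives each pass. In the example the single bursty window is rejected on the first pass and the baseline settles at the quiet level.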
r_i = \frac{n_{fw_i} + 1}{N_{w_i} + |F|}, \qquad p(k; N_w, p_e) = \binom{N_w + |F|}{k + 1} p_e^{k} (1 - p_e)^{N_w - k}.    (3)
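To make the computation concrete, the smoothed ratio and the probability of bursty could be sketched as below. This is a sketch under assumptions: the function names are hypothetical, the sum uses the plain binomial mass function of Eq. (2) with the summation range of Eq. (1) rather than the modified coefficient of Eq. (3), and the terms are evaluated exactly instead of through a normal approximation, so it is only practical for modest window sizes.

```python
from math import comb


def smoothed_ratio(n_fw, N_w, vocab_size):
    """Laplace-smoothed probability of f in a window: (n_fw + 1) / (N_w + |F|)."""
    return (n_fw + 1) / (N_w + vocab_size)


def binomial_pmf(k, N_w, p_e):
    """Probability mass function of Eq. (2)."""
    return comb(N_w, k) * p_e ** k * (1 - p_e) ** (N_w - k)


def burst_probability(n_fw, N_w, p_e):
    """Sum of Eq. (1): pmf terms for k = 1, ..., n_fw.

    Exact evaluation is an assumption of this sketch; for very large N_w
    the binomial coefficient no longer fits in a float.
    """
    return sum(binomial_pmf(k, N_w, p_e) for k in range(1, n_fw + 1))


# A feature expected in 30% of the slots of a window with 100 feature counts.
low = burst_probability(10, 100, 0.3)   # far fewer occurrences than expected
high = burst_probability(60, 100, 0.3)  # far more occurrences than expected
```

Counts well below the expectation give a value near 0 and counts well above it give a value near 1, matching the behavior described for r ≪ p_e and r ≫ p_e.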
In order to decide whether f is bursty in w, we check where n_fw falls in Figure 4 (b). If n_fw falls in Region D, then f is bursty in w, as its occurrence is significantly higher than the expected probability, p_e. If n_fw falls in Region A, then f is not bursty in w, as it appears in w less often than the expected probability, p_e. Finally, we claim that user contribution to the identification of bursty periods is necessary. Let us take the feature SARS as an example. Its bursty patterns are shown in Figure 3. The x-axes denote the days in 2003 and the y-axes denote the bursty state (0 or 1). When the bursty state is 0, no important event related to SARS happened. When the bursty state is 1, some important events emerged. Figure 3 (a) shows all of the bursty periods. The number of bursty periods decreases gradually from Figure 3 (b) to Figure 3 (d). We only retain those periods with many related documents. If the number of bursty periods is reduced, the number of events identified must also be reduced. Hence, we may locate our target information more easily if there is a threshold to control the number of events to be retrieved. Let θ ∈ [0, 1] be such a threshold. We re-scale Region B into this range. For instance, if θ = 0.8 and Region B is from 0.4 to 1.0, then all windows w with P(w, f; p_e) ≥ 0.667 are regarded as bursty periods. Our proposed algorithm includes this kind of user contribution, which has not been reported elsewhere.
cantly similar to". Given two sets of documents, D_b and D, we determine whether they are significantly similar to each other using two components: intra-similarity (S_I) and mapping-similarity (S_M). We first present how they are computed, then explain why they are computed in this way. For the intra-similarity (S_I):
S_I(D) = \frac{1}{|D|} \sum_{\forall d \in D} C(d, d'), \quad d' \in D \text{ and } d' \neq d,    (4)

where d′ is the nearest neighbor of d ∈ D, and C(d, d′) measures the similarity between d and d′ (e.g. cosine coefficient [5]). Eq. (4) computes the average similarity between every document in D and its nearest neighbor in D. This is why S_I is termed intra-similarity. For the mapping-similarity:

S_M(D) = \frac{1}{|D|} \sum_{\forall d \in D} C(d, d'), \quad d' \in D_b \text{ and } d' \neq d,    (5)

where d′ ∈ D_b is the nearest document with respect to d ∈ D. S_I(·) and S_M(·) differ in the arguments of their similarity functions, C(·, ·). S_I computes the similarity within the same set of documents, whereas S_M computes the similarity between two different sets of documents. Specifically, S_M(·) finds the set of the most similar documents in D that map-to D_b. This is why S_M is termed mapping-similarity. Intuitively, if S_M(D) ≥ S_I(D), we would be in favor of grouping D_b and D together, and would reject grouping them otherwise. Unfortunately, directly comparing S_M(·) and S_I(·) is inappropriate. Both S_M(·) and S_I(·) are averaged values, and an averaged value may easily be affected by outliers. Thus, the standard deviations of both components must be used when we conduct the comparison. Furthermore, the overlap between D_b and D may also affect the comparison. For instance, if two sets of documents overlap heavily, they may be grouped together even if S_M(·) is a bit smaller than S_I(·), i.e. some relaxation should be allowed in this case. On the other hand, no relaxation should be given if D_b ∩ D = ∅. To capture these ideas, we use a one-tail t-test with H0: S_M ≥ S_I and H1: S_M < S_I, with the test statistics:

[Figure 6 diagram: four bursty features (Bird, Flu, Cold, H5N1) with their document sets B1, B2, B3, F1, F3, C3 and H1.]
Figure 6: Four bursty features and their corresponding documents.

Feature   Relationship
Bird      B1 ⇒ H1, B1 ⇒ F1, B3 ⇒ C3
Flu       F3 ⇒ C3
Cold      –
H5N1      H1 ⇒ B1, H1 ⇒ F1
Table 2: The map-to relationships in Figure 6
2.3 Identify the Associated Features and the Associated Documents Let us assume that the bursty periods related to Q are identified. Let B be a set of bursty periods and Db be a set of documents that satisfy Q and reside in b ∈ B. In this section, we describe how we identify the features and the documents that are highly related to every b. For the features that are highly related to b, we call them as associated feature, Fb , of period b. Similarly, for the documents that are highly related to b, we call them as associated documents, Db , of period b. As stated in Section 2.1, we identify Fb based on the following steps: (1) For all f ∈ F, identify their bursty periods according to the procedure described in the previous section; (2) Let D be a set of documents that resides in one of the bursty periods of f . We conclude that Db can be enlarged by D if D map-to Db (D ⇒ Db ):
t0 =
SM − SI , σ/ |Db |
(6)
where σ is the standard deviation within SI (D). H0 would be rejected (D should not be mapped to Db ) if t0 > Tα , where 0 < α ≤ 0.5 is the significance level. α cannot be greater than 0.5 since this problem is a one tailed t-test. α must be chosen carefully, as it controls the relaxation of the t-test. The higher the value of α, the tighter is the control. As we discussed above, α should be determined dynamically based on the overlapping area between Db and D. Intuitively, α should behave as the pattern shown in Figure 7, where the y-axis is the degree of confidence (α) and the x-axis is the percentage of overlapping between Db and D. Befor we continue, let us define a variable, δ:
D EFINITION 1. (M AP -T O ⇒ ) Let Db be a set of documents that satisfies Q and resides in b ∈ B. Let D be another set of documents. Db and D may be overlapped. We say that D map-to Db (D ⇒ Db ) if and only if D is also resides in b and is significantly similar to Db , where Db ⊂ Db . Note that D is significantly similar to Db does not imply D is signifantly similar to Db , i.e. D ⇒ Db does not imply Db ⇒ D.
δ=
E XAMPLE 1 (M AP -T O ) Figure 6 shows four features: two bursts for Bird and Flu, and one burst for Cold and H5N1. According to Definition 1, a list of map-to relationships are identified, and are listed in Table 2. For instance, B1 and H1 are highly overlapped, B1 ⇒ H1. Similarly, as H1 and B1 are also highly overlap, H1 ⇒ B1. Since B1 is a subset of F1, B1 ⇒ F1. In contrast, F1 cannot map to B1 or H1, as it is a superset of both of them.
|Db ∩ D| . |Db |
(7)
Figure 7 captures the idea that if the overlapping area is large (δ 1), the required degree of confidence would be small (α 0). This would result in a more relax situation. If the overlapping area is small (δ 0), a much higher degree of confidence (α 0.5) would be resulted. If the overlapping area is halved (δ = 0.5), the required degree of confidence would be ambiguous (0.25). Logically, the relationship between δ and α should not be linear because we should be more certain with the decisions toward both ends. Eventually,
In Defintion 1, we have to carefully define the phrase “signifi-
articles in a day are categorized as noise. All features are weighted using the tf · idf schema [24] whenever necessary. To facilitate the computation, a news article with fewer than 35 different features is removed. Since the news articles arrive every day, a window, w, naturally corresponds to one day.
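The preprocessing thresholds described here can be illustrated with a minimal sketch. It assumes each article is already a collection of stemmed features and that one day's articles are processed together; the function names, the parameter names, and the choice to count distinct features before any filtering are assumptions made for illustration, not details given in the paper.

```python
from collections import Counter


def daily_feature_classes(day_articles, stop_ratio=0.75, noise_ratio=0.08):
    """Classify the features of one day's articles by document frequency.

    day_articles: list of collections of stemmed features (one per article).
    Returns (stopwords, noise): features appearing in more than `stop_ratio`
    of the day's articles, and features appearing in less than `noise_ratio`.
    """
    n = len(day_articles)
    if n == 0:
        return set(), set()
    # Document frequency: count each feature at most once per article.
    df = Counter(f for article in day_articles for f in set(article))
    stopwords = {f for f, c in df.items() if c / n > stop_ratio}
    noise = {f for f, c in df.items() if c / n < noise_ratio}
    return stopwords, noise


def drop_short_articles(articles, min_features=35):
    """Remove articles with fewer than `min_features` distinct features."""
    return [a for a in articles if len(set(a)) >= min_features]
```

How the per-day stopword and noise labels are combined across the two-year archive is not specified in the text, so the sketch stops at a single window of one day.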
Figure 7: Distribution of the confidence interval (α). (x-axis: δ, from 0 to 1; y-axis: α, from 0 to 0.5; curve: cosine function.)

3.1 An Overview
Figure 8 and Figure 9 show the event hierarchies for Q1 = {SARS} and Q2 = {Bush} when θ = 0.9. There are 32 and 15 events in total associated with Q1 and Q2, respectively. Due to the space limit, we only show the events up to the fourth level for Q1 and the second level for Q2. For each event, we select the top 10 features to represent it. These features then propagate from the last level to the first level via their parents. Table 3 and Table 4 list the stemmed features for each event in the first level of the hierarchies.

Figure 8 corresponds to the events related to SARS. The first level of the event hierarchy contains four periods. The major event lasts from 2003/03/28 to 2003/07/02 and further divides into many smaller events. The other three events in the first level each last for only 3 to 4 days and have no further branches. The following is a brief review of these events:
Event 1 (2003/03/28 – 2003/07/02): SARS breaks out in Mainland China and spreads to other countries. On 2003/03/27, a CDC official presents the first evidence that a virus associated with upper respiratory infections and another virus called corona might be the cause of SARS. On 2003/07/02, the aforementioned virus is under control, and the WHO (World Health Organization) removes all countries, except Taiwan, from its list of areas with recent local SARS transmission, 20 days after the last SARS case was reported and isolated.
Event 2 (2003/12/26 – 2003/12/28): The first Christmas after the SARS outbreak.
Event 3 (2004/01/11 – 2004/01/14): Guangdong has a suspected case of SARS, the first suspected case since SARS broke out in 2003.
Event 4 (2004/03/27 – 2004/03/29): One year after SARS broke out.

Figure 9 corresponds to the events related to the US President George Bush. The patterns in Figure 8 and Figure 9 are very different. Figure 8 has one large major period and a “deep” hierarchical structure. Figure 9 is flat, and each event in the first level lasts for a few days; most events are related to the Iraq War. Features such as iraq, saddam, hussein and weapon are usually associated with these events. Events in late 2004 are mainly related to the 2004 US Presidential Election, where the associated features are mostly ker (John Kerry, a presidential candidate), candid (candidate), sen (senate), voter and whit (white house). The following is a brief review of these events:
Event 1 (2003/01/28 – 2003/02/06): President Bush announces that he is ready to attack Iraq even without a UN mandate.
Event 2 (2003/02/22): Hans Blix orders Iraq to destroy its Al Samoud 2 missiles by 2003/03/01.
Event 3 (2003/03/14 – 2003/03/27): The Iraq War begins after President Bush delivers an ultimatum to Saddam Hussein to leave the country within 48 hours and Saddam refuses.
Event 4 (2003/04/07 – 2003/04/10): British forces take control of Basra and U.S. forces take control of Baghdad.
Event 5 (2003/05/29 – 2003/06/01): Weapons of mass destruction (WMD) have not been found, but U.S. secretary of state Colin Powell and British prime minister Tony Blair deny that the WMD claims were distorted or exaggerated to justify an attack on Iraq.
Event 6 (2003/07/07): The Bush administration concedes that the evidence that Iraq was pursuing a nuclear weapons program, cited in the January State of the Union address and elsewhere, was unsubstantiated.
Event 7 (2003/09/10 – 2003/09/11): Memorial of the September 11 attack.
Event 8 (2003/10/05): The White House reorganizes its reconstruction efforts in Iraq, placing National Security Adviser Condoleezza Rice in charge and diminishing the role of the Pen-
we can model the situation described so far by using a cosine function. Mathematically,

$$ \alpha = \frac{1}{4}\cos(\pi \cdot \delta) + \frac{1}{4}. \qquad (8) $$

Since it does not make sense for the degree of confidence to be smaller than 0 or larger than 0.5, the cosine function is re-scaled to lie within 0 to 0.5. This is why the two factors of 1/4 appear.
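The quantities in Eqs. (4) to (8) can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: documents are assumed to be sparse term-weight dictionaries keyed by a document id, the similarity C is taken to be the cosine coefficient, σ is assumed to be the population standard deviation of the nearest-neighbor similarities behind S_I, and the denominator of Eq. (6) is assumed to be σ/√|D_b|. All function and variable names are hypothetical.

```python
import math
from statistics import pstdev


def cosine(u, v):
    """Cosine coefficient between two sparse vectors given as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_similarities(doc_ids, pool_ids, vectors):
    """For each d in doc_ids, similarity to its nearest d' in pool_ids, d' != d."""
    sims = []
    for d in sorted(doc_ids):
        sims.append(max((cosine(vectors[d], vectors[p])
                         for p in pool_ids if p != d), default=0.0))
    return sims


def overlap_alpha(delta):
    """Eq. (8): significance level as a re-scaled cosine of the overlap."""
    return 0.25 * math.cos(math.pi * delta) + 0.25


def map_to_statistics(D, Db, vectors):
    """Compute S_I, S_M, t0, delta and alpha for document-id sets D and Db."""
    intra = nearest_similarities(D, D, vectors)     # terms of Eq. (4)
    mapping = nearest_similarities(D, Db, vectors)  # terms of Eq. (5)
    s_i = sum(intra) / len(D)
    s_m = sum(mapping) / len(D)
    sigma = pstdev(intra)  # assumption: population standard deviation
    if sigma > 0:
        t0 = (s_m - s_i) / (sigma / math.sqrt(len(Db)))  # Eq. (6)
    elif s_m == s_i:
        t0 = 0.0
    else:
        t0 = math.copysign(math.inf, s_m - s_i)
    delta = len(Db & D) / len(Db)                    # Eq. (7)
    return {"S_I": s_i, "S_M": s_m, "t0": t0,
            "delta": delta, "alpha": overlap_alpha(delta)}
```

The comparison of t_0 with the critical value T_α is deliberately left out, since it needs a t-distribution table or a statistics library. For a one-tailed test with H1: S_M < S_I, the conventional rejection region lies in the lower tail, so a caller would need to decide how that convention lines up with the rejection condition stated in the text.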
2.4 Construct Event Hierarchies

In this section, we describe how to construct an event hierarchy for a particular query, Q. Let E be the set of all events that appear in the event hierarchy. An event, e ∈ E, is an object that consists of the following three components: (1) a period delimited by two timestamps; (2) a set of representative features; and (3) a set of similar documents. After the previous two steps, we obtain a set of documents, D_b, that is highly related to Q and resides in b ∈ B, where B is the set of periods in which Q suddenly appears frequently. Some documents in D_b match Q directly; the others do not match Q but are very similar to the documents in D_b that do. As discussed previously, each D_b may be further broken down into several events. Accordingly, for each D_b, we use the bisecting K-Means clustering algorithm [29] to partition it, so that the documents with similar contents in D_b are grouped together to form events. Bisecting K-Means is particularly suitable for partitioning a text corpus and generates a dendrogram automatically. Details of the algorithm can be found in [29]. As in most clustering problems, the only remaining issue is how to specify a stopping criterion for the partitioning process: how many events should be left? Defining a fixed value is unreasonable, as it is impossible to predict in advance how many events exist in D_b. In this paper, the stopping criterion of the bisecting K-Means depends on the timestamps of the documents in D_b. Let t_i be the timestamp of d_i. We recursively partition D_b until max{t_{i+1} − t_i} > m, where m is a predefined threshold. In this paper, we set m = 3, based on our observations of the dataset that we used. The details of the dataset are described in Section 3. Eventually, each resulting partition is regarded as an event.
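The quantity used in the stopping criterion, max{t_{i+1} − t_i}, is simply the largest gap between consecutive document timestamps in a partition. The sketch below computes it and compares it with m; it assumes timestamps are calendar dates and that the gap is measured in days (consistent with a window of one day). The function names are hypothetical, and the way the check is wired into the bisecting K-Means recursion is not shown.

```python
from datetime import date


def max_gap_days(timestamps):
    """Largest gap, in days, between consecutive timestamps of a partition."""
    ordered = sorted(timestamps)
    return max(((later - earlier).days
                for earlier, later in zip(ordered, ordered[1:])), default=0)


def exceeds_gap_threshold(timestamps, m=3):
    """True if max{t_{i+1} - t_i} > m for the given partition."""
    return max_gap_days(timestamps) > m
```

For example, a partition with documents on 2003/03/28, 2003/03/29 and 2003/03/30 has a maximum gap of one day, whereas adding a document dated 2003/04/17 raises the maximum gap to 18 days.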
Since a dendrogram is generated automatically by the bisecting K-Means, a hierarchical tree structure similar to Figure 2 is obtained naturally. The representative features of each event are extracted using information gain [26]. We extract the top k features for each event, e ∈ E. The value of k is set by the users according to their preferences, and it does not affect the structure of the event hierarchy under any circumstances.
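Selecting the top k features by information gain can be sketched as follows. This is an illustration under stated assumptions: each document is a set of features, the class variable is binary (the document belongs to the event or it does not), and a feature is treated as a binary presence indicator. The paper does not spell out these choices, and the function names are hypothetical.

```python
import math


def entropy(p):
    """Binary entropy (in bits) of a probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(feature, event_docs, other_docs):
    """IG of a feature's presence with respect to event membership."""
    labelled = [(d, True) for d in event_docs] + [(d, False) for d in other_docs]
    n = len(labelled)
    with_f = [in_event for d, in_event in labelled if feature in d]
    without_f = [in_event for d, in_event in labelled if feature not in d]

    def cond_entropy(labels):
        return entropy(sum(labels) / len(labels)) if labels else 0.0

    prior = entropy(len(event_docs) / n)
    return (prior
            - len(with_f) / n * cond_entropy(with_f)
            - len(without_f) / n * cond_entropy(without_f))


def top_k_features(event_docs, other_docs, k=10):
    """Features of the event's documents ranked by information gain."""
    vocabulary = set().union(*event_docs)
    ranked = sorted(vocabulary,
                    key=lambda f: (-information_gain(f, event_docs, other_docs), f))
    return ranked[:k]
```

Because k only truncates the ranked list, changing it leaves the partitioning untouched, which matches the remark that k does not affect the structure of the hierarchy.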
3. EXPERIMENTAL STUDY
We have archived 78,695 news articles from the South China Morning Post from 2003/01/01 to 2004/12/31. All features are stemmed using the Lovins stemmer [15]. Features that appear in more than 75% of the total news articles in a day are categorized as stopwords. Features that appear in less than 8% of the total news
tagon.
Event 9 (2003/10/23 – 2003/10/24): The Madrid Conference, an international donors’ conference of 80 nations to raise funds for the reconstruction of Iraq.
Event 10 (2003/11/07): Japan may send a total of about 1,200 troops and civilian staff to Iraq to help rebuild the country.
Event 11 (2003/12/17): US President Bush opposes attempts by Taiwan separatists to change the status quo.
Event 12 (2004/01/23 – 2004/01/28): The United States asks the UN to intercede in the dispute over the elections process in Iraq, and it is reported that no WMD has been found in Iraq and that prewar intelligence was “almost all wrong”.
Event 13 (2004/03/07 – 2004/03/08): The Iraqi Governing Council signs the interim constitution.
Event 14 and Event 15 (2004/10/02 – 2004/10/24 and 2004/10/29 – 2004/11/22): The 2004 United States presidential election.

Figure 8: The result of the event hierarchy for Q = {SARS}. (Nodes shown in the figure; deeper levels are elided in the original.)
SARS
  1 - 2003/03/28 - 2003/07/02
    1.1 - 2003/03/28 - 2003/04/30
      1.1.1 - 2003/03/28 - 2003/03/30
      1.1.2 - 2003/04/17 - 2003/04/30
        1.1.2.1 - 2003/04/17 - 2003/04/23
        1.1.2.2 - 2003/04/25 - 2003/04/30
    1.2 - 2003/04/03 - 2003/06/28
      1.2.1 - 2003/04/03 - 2003/05/23
        1.2.1.1 - 2003/04/03 - 2003/04/08
        1.2.1.2 - 2003/04/15 - 2003/05/23
      1.2.2 - 2003/04/17 - 2003/06/28
        1.2.2.1 - 2003/04/17 - 2003/05/09
        1.2.2.2 - 2003/04/22 - 2003/06/28
  2 - 2003/12/26 - 2003/12/28
  3 - 2004/01/11 - 2004/01/14
  4 - 2004/03/27 - 2004/03/29

Table 3: The features associated with Q = {SARS}.
Event | Period | Feature
1 | 2003/03/28 – 2003/07/02 | advisor, airl, airport, amo, anim, apec, athles, atypic, august, budges, cabinet, catha, childr, clean, collect, cris, cultur, deficit, diseas, doct, doctor, easter, elder, epidem, famil, festiv, fever, flight, flu, food, fung, garden, gradu, guangdong, health, hu, hospit, hygi, infect, isol, labour, legc, lung, macau, margares, martin, mask, medicin, nur, outbreak, parent, passenger, pati, patient, pneumon, postpon, princ, quarant, ral, relief, respir, restaur, sack, sar, school, seat, shu, sick, sing, siu, speech, spread, subsid, suspect, symptom, syndrom, taipe, tang, taskforc, tickes, tol, transmis, travel, traveller, treatm, tung, virus, wal, ward, wear, welfar, wen, xinhu, yam, zhang
2 | 2003/12/26 – 2003/12/28 | border, christm, diseas, fever, guangdong, guangzhou, health, isol, medicin, pan, pati, producer, sar, suspect, symptom, test, travel, traveller, treatm, ward
3 | 2004/01/11 – 2004/01/14 | colleg, guangdong, sar, suspect
4 | 2004/03/27 – 2004/03/29 | amo, compens, epidem, garden, hospit, husband, inquir, isol, sar, victim

Figure 9: The result of the event hierarchy for Q = {Bush, George}. (First-level events as listed in the figure legend.)
Bush
  1 - 2003/01/28 - 2003/02/06
  2 - 2003/02/22 - 2003/02/22
  3 - 2003/03/14 - 2003/03/27
  4 - 2003/04/07 - 2003/04/10
  5 - 2003/05/29 - 2003/06/01
  6 - 2003/07/07 - 2003/07/07
  7 - 2003/09/10 - 2003/09/11
  8 - 2003/10/05 - 2003/10/05
  9 - 2003/10/23 - 2003/10/24
  10 - 2003/11/07 - 2003/11/07
  11 - 2004/01/23 - 2004/01/28
  12 - 2004/03/07 - 2004/03/08
  13 - 2004/08/22 - 2004/08/28
  14 - 2004/10/02 - 2004/10/24
  15 - 2004/10/29 - 2004/11/22

Table 4: The features associated with Q = {Bush, George}.
Event | Period | Feature
1 | 2003/01/28 – 2003/02/06 | hussein, iraq, kore, north, regim, washingt, weapon
2 | 2003/02/22 – 2003/02/22 | destory, peac, war
3 | 2003/03/14 – 2003/03/27 | gulf, hussein, iraq, milit, saddam, war, washingt, weapon
4 | 2003/04/07 – 2003/04/10 | baghdad, diplom, iraq, saddam, war, weapon
5 | 2003/05/29 – 2003/06/01 | hussein, iraq, saddam, war, wmd
6 | 2003/07/07 – 2003/07/07 | diplom, hu, hussein, iraq, milit, saddam, unsubstan, weapon
7 | 2003/09/10 – 2003/09/11 | attach, su, terror
8 | 2003/10/05 – 2003/10/05 | adviser, black, colleg, inquir, learn, philip, rac, scand, washingt
9 | 2003/10/23 – 2003/10/24 | Madrid, Iraq
10 | 2003/11/07 – 2003/11/07 | japan, iraq, troop
11 | 2003/12/17 – 2003/12/17 | taiwan, chen
12 | 2004/01/23 – 2004/01/28 | gulf, invas, iraq, nuclear, wmd
13 | 2004/03/07 – 2004/03/08 | coun, iraq, hussein, ker, saddam
14 | 2004/10/02 – 2004/10/24 | iraq, terror, ker, sen, troop
15 | 2004/10/29 – 2004/11/22 | candid, ker, lien, peac, sen, voter, whit

3.2 Further Discussion

In general, the events identified are justifiable. The documents in each group share a high degree of similarity (they discuss the same event), and the features associated with the event can represent it. According to our experience, as well as the evidence, the bursty patterns of some features are very similar, such as SARS and Iraq. SARS bursts from late March to early July, whereas Iraq bursts from late February to early May. Yet, these two features are never grouped together in our experimental results. This is because, in Section 2.3, we do not simply consider the bursty patterns of two features; we further apply a t-test based heuristic to determine whether two sets of documents are similar to each other. Furthermore, a feature can be assigned to multiple events. This differs from some of the existing works, such as [7], where a feature can only be assigned to one event.

4. RELATED WORK

Although there are advanced systems built for browsing information in a text corpus [13, 23, 35], most of them do not include the time dimension. Although some works make use of the time dimension [10, 31], they seldom discuss how their parameters are estimated. Topic detection and tracking (TDT) is the major area that tackles the problem of discovering events from news articles. Most TDT work focuses on online detection [1, 2, 4, 28]. The offline approaches [32, 37, 38] do not target organizing the events into hierarchical structures, but aim at ranking the events in a text corpus without human intervention. [9] proposed an algorithm for constructing a hierarchical structure for the features in the text corpus by using an infinite-state automaton. Our problem is different from [9] since we aim at constructing event hierarchies, not feature hierarchies. [21, 27, 30, 31] proposed using the χ²-test to construct an overview timeline for the features in the text corpus. Our outcome differs from theirs in that we construct hierarchical structures, whereas their structure is flat. [33] proposes methods for mining knowledge from the query logs of the MSN search engine by building a time series for each query term based on the similarity of the time-series patterns, but does not pay attention to the contents of the web pages. We find the bursty events based on both the time and the content information. [11] proposed a model for searching the features that are used to identify whether a specific message is a response to others. This model is based on analyzing the dynamics of sending and receiving messages over some time intervals. Grosz and Sidner [8] proposed a model that organizes documents in a nested structure. From a high-level point of view, all of these works try to extract meaningful structures from the data. Yet, their focus and ours are different. We are neither targeting organizing the documents, nor aiming at detecting the relationships among documents. The problem of extracting features that are associated with a particular event is similar to that of identifying features that are related to a search query. Query expansion [22], query feedback [25] and query refinement [14] are some popular techniques. Similar to these approaches, our proposed algorithm is based on the co-occurrence of features. Unlike these approaches, we take the dimension of time and the period of events into consideration.
5. SUMMARY AND CONCLUSION

We proposed an algorithm called time driven documents-partition (TDD) to construct an event hierarchy in a text corpus based on a user query. An event is an object that consists of: (1) a set of documents with similar contents; (2) a set of representative features extracted from the documents; and (3) the period of the event. Given a query, TDD will: (1) identify the features, called bursty features, that are related to the query according to the timestamps and the contents of the documents; (2) extract the documents that are highly related to the bursty features based on time; (3) partition the extracted documents and construct an event hierarchy. Event hierarchies can help us locate our target information efficiently. Without them, it is very difficult to identify the major events that are related to the query, the periods when the events happened, and the features and the documents that are related to each of the events. Extensive experiments were conducted to evaluate TDD. The results indicate that it is practically sound and highly effective.
6. REFERENCES

[1] J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’98), 1998.
[2] T. Brants and F. Chen. A system for new event detection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’03), 2003.
[3] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics (ACL’96), 1996.
[4] M. Connell, A. Feng, G. Kumaran, H. Raghavan, C. Shah, and J. Allan. UMass at TDT 2004. In 2004 Topic Detection and Tracking Workshop (TDT’04), Gaithersburg, Maryland, USA, 2004.
[5] W. B. Frakes. Stemming algorithm. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval Data Structures & Algorithms, pages 131–160. Prentice Hall PTR, 1992.
[6] G. P. C. Fung, J. X. Yu, H. Lu, and P. S. Yu. Text classification without negative examples revisit. IEEE Transactions on Knowledge and Data Engineering, 18(1):6–20, 2006.
[7] G. P. C. Fung, J. X. Yu, P. S. Yu, and H. Lu. Parameter free bursty events detection in text streams. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB’05), 2005.
[8] B. J. Grosz and C. L. Sidner. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–204, 1986.
[9] J. M. Kleinberg. Bursty and hierarchical structure in streams. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’02), 2002.
[10] V. Kumar, R. Furuta, and R. B. Allen. Metadata visualization for digital libraries: interactive timeline editing and review. In Proceedings of the 3rd ACM Conference on Digital Libraries, 1998.
[11] D. D. Lewis and K. A. Knowles. Threading electronic mail: a preliminary study. Information Processing and Management, 33(2):209–217, 1997.
[12] Z. Li, B. Wang, M. Li, and W. Y. Ma. A probabilistic model for retrospective news event detection. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’05), 2005.
[13] X. Lin, D. Soergel, and G. Marchionini. A self-organizing semantic map for information retrieval. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’91), 1991.
[14] Y. Liu, H. Chen, J. X. Yu, and N. Ohbo. Using stem rules to refine document retrieval queries. In Proceedings of the 3rd International Conference on Flexible Query Answering Systems (FQAS’98), 1998.
[15] J. B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22–31, 1968.
[16] S. A. Macskassy and F. Provost. Intelligent information triage. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’01), 2001.
[17] Q. Mei and C. Zhai. Discovering evolutionary theme patterns from text – an exploration of temporal text mining. In Proceedings of the 11th International Conference on Knowledge Discovery and Data Mining (KDD’05), 2005.
[18] Q. Mei and C. Zhai. Topics over time: A non-markov continuous-time model of topical trends. In Proceedings of the 12th International Conference on Knowledge Discovery and Data Mining (KDD’06), 2006.
[19] D. C. Montgomery and G. C. Runger. Applied Statistics and Probability for Engineers. John Wiley & Sons, Inc., second edition, 1999.
[20] S. Morinaga and K. Yamanishi. Tracking dynamics of topic trends using a finite mixture model. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’04), 2004.
[21] R. Papka and J. Allan. On-line new event detection using single pass clustering. Technical Report IR–123, Department of Computer Science, University of Massachusetts, 1998.
[22] H. J. Peat and P. Willett. The limitations of term co-occurrence data for query expansion in document retrieval systems. Journal of the American Society for Information Science (JASIS), 41(4):378–383, 1991.
[23] E. Rennison. Galaxy of news: An approach to visualizing and understanding expansive news landscapes. In Proceedings of the 7th ACM Symposium on User Interface Software and Technology (UIST’94), 1994.
[24] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management (IPM), 24(5):513–523, 1988.
[25] G. Salton and C. Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science (JASIS), 41(4):288–297, 1990.
[26] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
[27] D. A. Smith. Detecting and browsing events in unstructured text. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’02), 2002.
[28] M. Spitters and W. Kraaij. TNO at TDT2001: Language model-based topic detection. In 2001 Topic Detection and Tracking Workshop (TDT’01), Gaithersburg, Maryland, USA, 2001.
[29] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’00) Workshop on Text Mining, 2000.
[30] R. C. Swan and J. Allan. Extracting significant time varying features from text. In Proceedings of the 1999 ACM CIKM International Conference on Information and Knowledge Management (CIKM’99), 1999.
[31] R. C. Swan and J. Allan. Automatic generation of overview timelines. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’00), 2000.
[32] D. Trieschnigg and W. Kraaij. Hierarchical topic detection in large digital news archives. In Proceedings of the 5th Dutch Belgian Information Retrieval Workshop, pages 55–62, Utrecht, the Netherlands, 2005.
[33] M. Vlachos, C. Meek, Z. Vagena, and D. Gunopulos. Identifying similarities, periodicities and bursts for online search queries. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD’04), 2004.
[34] E. Wiener, J. O. Pedersen, and A. S. Weigend. A neural network approach to topic spotting. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval (SDAIR’95), pages 317–332, Las Vegas, USA, 1995.
[35] J. A. Wise, J. J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur, and V. Crow. Visualizing the non-visual: Spatial analysis and interaction with information from text documents. In Proceedings of the 1995 IEEE Symposium on Information Visualization (INFOVIS’95), 1995.
[36] Y. Yang, T. Ault, T. Pierce, and C. W. Lattimer. Improving text categorization methods for event tracking. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’00), 2000.
[37] Y. Yang, J. G. Carbonell, R. D. Brown, T. Pierce, B. T. Archibald, and X. Liu. Learning approaches for detecting and tracking news events. IEEE Intelligent Systems, 14(4):32–43, 1999.
[38] Y. Yang, T. Pierce, and J. Carbonell. A study on retrospective and on-line event detection. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’98), 1998.