Life Cycle Modeling of News Events Using Aging Theory¹

Chien Chin Chen², Yao-Tsung Chen², Yeali Sun³, Meng Chang Chen²

² Institute of Information Science, Academia Sinica, Taiwan
{paton,ytchen,mcc}@iis.sinica.edu.tw
³ Dept. of Information Management, National Taiwan University, Taiwan
[email protected]

Abstract. In this paper, an adaptive news event detection method is proposed. We consider a news event as a life form and propose an aging theory to model its life span. A news event becomes popular with a burst of news reports, and it fades away with time. We incorporate the proposed aging theory into the traditional single-pass clustering algorithm to model life spans of news events. Experimental results show that the proposed method has fairly good performance for both long-running and short-term events compared to other approaches.

1 Introduction

The Web has become a huge store of information. With the simple HyperText Markup Language (HTML) [1], people can publish and share valuable knowledge conveniently. However, as the number of Web documents increases, obtaining desired information from the Web becomes time-consuming and sometimes requires specific knowledge to make the best use of search engines and the results they return. On-line news reflects this information explosion problem: it is difficult to access and assimilate desired information from the hundreds of news documents that different agencies generate each day.

Techniques such as classification [7][9] and personalization [5][6] were invented to facilitate news reading. However, classification is not fully effective, because readers generally follow news by threads of interest, not by categories. Moreover, unexpected events, such as accidents, awards and sports championships, fall outside a learned user profile. Therefore, to reduce both search time and the volume of search results, a precise event detection method that discovers news events automatically is necessary.

Event detection is part of Topic Detection and Tracking (TDT) [2], in which a news event is defined as an incident that occurs at some place and time and is associated with some specific actions. In contrast with a category in traditional text classification, events are localized in space and time. The job of event detection is to find new events in several news streams.

Besides discussing TDT techniques for on-line news, in this paper we also discuss one interesting issue about news events: the event life cycle. Usually, new news events appear in a news burst and gradually die out as time goes on [8]. Ignoring the temporal relations of news events degrades the performance of a TDT system. Previous works [3][14] were aware of the importance of the temporal information of news events to TDT. Their experimental results showed that modeling temporal information of news events could discriminate between similar but distinct events efficiently. In this paper, we propose the concept of aging theory to model life cycles of news events. Experiments show that our approach can remedy the deficiencies of other methods.

The rest of the paper is organized as follows. In Section 2, we give a review of related works. In Section 3, we propose the concept of aging theory. Section 4 describes the algorithms that apply the aging theory to a news reading system. We evaluate the system performance in Section 5. Finally, conclusions and future work are given in Section 6.

¹ This research was partly supported by NSC under grant NSC 91-2213-E-001-019.

2 Related Works

Topic Detection and Tracking (TDT) [2] is a DARPA-sponsored project to detect and track news events from streams of broadcast news stories. It consists of three major tasks: segmentation, detection and tracking. Our focus, the retrospective detection task [3][14], is oriented toward unsupervised learning [11]: without any labeled training examples, the job of retrospective detection is to identify events from a news corpus.

The traditional hierarchical agglomerative clustering (HAC) algorithm [13] is suitable for retrospective detection. However, the computation cost of HAC, which is quadratic in the number of input documents when using group average clustering [14], makes it infeasible when the number of news documents per day is high. Yang, et al. [14] used bucketing and re-clustering to speed up HAC. However, information from a long-running event may be spread over too many buckets, which splits the event into several events [14].

Another popular approach to retrospective detection is single-pass clustering (or incremental clustering) [4]. The single-pass clustering method processes the input documents iteratively and chronologically. A news document is merged with the most similar detected cluster if the similarity between them is above a pre-defined threshold; otherwise, the document is treated as the seed of a new cluster. However, considering only the similarity between clusters and documents leads context-similar, but event-different, stories to be merged together. To obtain better clusters, temporal relations between news documents (or clusters) must be incorporated into the clustering algorithm.

Allan, et al. [3] proposed a time-based threshold approach to model the temporal relation. By progressively raising the detection threshold, distant documents become difficult to align with existing clusters, so different events can be discerned. Yang, et al. [14] modeled the temporal relation with a time window and a decaying function. The size of the time window specifies the number of prior documents (or events) to be considered when clustering. The decaying function weights the influence of a document in the window based on the gap between it and the examined document. As in the time-based threshold approach, distant documents in the time window have less impact on clustering than nearby ones.

Even though the above methods enhance the result of the single-pass clustering algorithm, they are not adaptable to all types of event detection. The increasing threshold of the time-based threshold method keeps distant stories of long-running events from being tracked, while the large window size of the time window method may mix up many expired, context-similar, short-term events. To balance this tradeoff and handle both long-running and short-term events, a self-adaptive event life cycle management mechanism is necessary. We present an aging theory for the event life cycle in Section 3. For more information about TDT, [4] gives a detailed survey of existing systems and approaches in recent years.
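To make the baseline concrete, the following is a minimal sketch of plain single-pass clustering as described in this section, without any temporal information. It is an illustration under stated assumptions and not the implementation used in the paper: documents are represented as raw term counts, similarity is cosine similarity, a cluster is represented by the sum of its members' term counts, and the threshold value 0.3 as well as the names cosine and single_pass_cluster are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass_cluster(documents, threshold=0.3):
    """Cluster documents in one chronological pass.

    documents: list of token lists, assumed sorted by publication time.
    threshold: assumed similarity cut-off (illustrative value only).
    Returns a list of clusters, each a list of document indices.
    """
    centroids = []  # one summed term-count Counter per cluster (assumption)
    clusters = []   # document indices belonging to each cluster
    for index, tokens in enumerate(documents):
        vector = Counter(tokens)
        best_cluster, best_similarity = None, 0.0
        # Find the most similar existing cluster.
        for cluster_id, centroid in enumerate(centroids):
            similarity = cosine(vector, centroid)
            if similarity > best_similarity:
                best_cluster, best_similarity = cluster_id, similarity
        if best_cluster is not None and best_similarity > threshold:
            # Similar enough: merge the document into that cluster.
            clusters[best_cluster].append(index)
            centroids[best_cluster].update(vector)
        else:
            # Otherwise the document seeds a new cluster.
            centroids.append(Counter(vector))
            clusters.append([index])
    return clusters


docs = [
    ["quake", "taiwan", "rescue"],
    ["quake", "taiwan", "damage"],
    ["election", "vote", "result"],
    ["election", "vote", "count"],
]
clusters = single_pass_cluster(docs)
```

Because the merge decision here depends on content similarity alone, two reports with similar wording but months apart would land in the same cluster, which is the weakness that the time-based threshold and time window methods try to address.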

3 Aging Theory

A news event is considered a life form with stages of birth, growth, decay and death. To track the life cycles of events, we use the concept of an energy function. Like the endogenous fitness of an artificial life agent [10], the value of the energy function indicates the liveliness of a news event in its life span. The energy of an event increases when the event becomes popular, and it diminishes with time. Therefore, a function of the number of news documents can be used to model the growing stage of events. On the other hand, to model the diminishing, or aging, stage, a decay factor is required.

3.1 Notations and Definitions

News documents are to an event what food is to a life form. As various foods do not contribute the same nutrition to a life form, different news documents make different contributions to an event's liveliness (i.e. popularity). The degree of similarity between a news document and an event is used to represent the nutrition contribution. The accumulated similarity between news documents and event V in a time slot t is denoted by x_t. The time slot t can be any time interval. In the implementation, we use one day as a time slot. We then define α as the nutrition transferred factor and β as the nutrition decayed factor, 0
