Trending Words Based Event Detection in Sina Weibo Xinjiang Lu, Zhiwen Yu, Bin Guo, Jiafan Zhang
Alvin Chin, Jilei Tian, Yang Cao
School of Computer Science Northwestern Polytechnical University, Xi’an, China
Microsoft Beijing, China
[email protected], {zhiwenyu, guob}@nwpu.edu.cn,
[email protected]
{alvin.chin, jilei.tian, yang.1.cao}@microsoft.com
ABSTRACT Online social networks provide an unprecedented volume of data, a result of the pervasive adoption of online social applications. In particular, because they promote content sharing, microblogging social networks offer a new proxy for detecting and tracking events taking place in the real world. Although it contains a large amount of social babble, microblog data carries fresh news from human sensors at a very high rate. However, because these platforms produce fast-changing streaming data, it is hard to discover meaningful events in time under such noisy conditions. In this paper, we study the problem of determining keywords for event detection and propose a novel and more effective method for discovering bursty words in microblogging social networks by leveraging temporal dynamics information. Based on this, we propose an efficient event detection framework and apply it to Sina Weibo, a Chinese microblogging site similar to Twitter. With experiments on real data from Sina Weibo, we show the effectiveness and feasibility of the proposed method and framework.
Categories and Subject Descriptors H.2.8 [Database Applications]: Data mining
General Terms Measurement, Experimentation.
Keywords Bursty words detection, Event detection, Sina Weibo
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. BigDataScience ’14, August 04 - 07 2014, Beijing, China. Copyright 2014 ACM 978-1-4503-2891-3/14/08 ...$15.00. http://dx.doi.org/10.1145/2640087.2644156.
1. INTRODUCTION
Sina Weibo (or Weibo for short), similar to Twitter, has become one of the most popular social networking and microblogging services in China. On this microblogging platform, people can write about anything they are interested in as status messages and share these messages with their followers and friends. Important events such as disasters, traffic accidents, concerts and football games can have immediate impact through direct status updates on a given topic. Furthermore, with its short documents and SMS-like writing style, and with mechanisms that let users follow, mention, re-publish and comment on whomever or whatever they are interested in, Weibo has been ranked as one of the most visited websites in China¹. As a result, Weibo provides an effective crowdsourced medium for situation awareness through the posts written by millions of real-world users.
In Weibo (as in Twitter), a message is limited to 140 characters, made up of a mix of Chinese and English characters, so information can be updated at extremely low cost and in real time, making Weibo a timely source of fresh information. Furthermore, the intensive real-time interaction between users enables timely event detection by monitoring status updates, since many prominent events are quickly spotlighted by Weibo users. Event detection from Weibo helps us gain a timely understanding of users’ opinions and sentiments about the detected events, making it possible for companies and organizations to react quickly to any emerging crisis. Event detection from the Weibo stream also helps us study mass communication by analyzing the types of events that general users are most interested in, as well as the reactions of users in different geographical regions.
However, due to the noisy content, the diverse and fast-changing topics, and the large data volume, detecting events in time from Weibo streams is a challenging task. Although event detection has been studied extensively on formal texts such as news articles, blog posts and academic papers [8][9][10][12], microblogs differ significantly from well-written texts because of their informal writing style and code mixing (a mix of multiple languages and symbols).
To address the challenge of detecting timely events from noisy Weibo streams, we present an event detection framework based on the discovery of bursty words. Our contributions are the following. First, we uniquely characterize an event by the combination of core words and subordinate words extracted from the Weibo posts.
Second, we develop a much more effective and practical method for discovering the representative words of an event by determining the basic frequency of words, which is implemented by examining the temporal dynamics information in online social networks. Third, we utilize the LDA framework to examine the lexical similarity of candidate events discovered, taking both the core words and subordinate words into account. Finally, we employ the agglomerative clustering method to obtain a set of hot events in that time interval and each event detected is represented by a ranking list of words for easy interpretation. We evaluated our framework with more than 17.8 million posts published by Sina Weibo users over a 5 month period. Our experimental results demonstrate the effectiveness of our method and the accuracy of the events detected. The rest of the paper is organized as follows. Section 2 presents the related work, while Section 3 describes the proposed event detection framework and its components in detail. Section 4 presents the experimental results, and we conclude our work in Section 5. 1
http://www.alexa.com/siteinfo/weibo.com
2. RELATED WORK Recently, event detection on microblogging social networks (especially Twitter) has become a hot research topic. Mathioudakis and Koudas [11] presented a trend detection system over Twitter stream. They first identified the bursty terms based on queueing theory, then bursty terms are grouped into the events based on their co-occurrences. For a detected trend, PCA, SVD and entity extraction techniques are then applied to derive contextual information for the trend description. Petrović et al. [13] tracked events on the Twitter stream by applying locality sensitive hashing (LSH). LSH is applied to each tweet to measure the similarity to existing tweets. The tweets similar to each other are grouped as events. Phuvipadawat and Murata [14] proposed an approach for breaking news detection and tracking by clustering the similar tweets together. The approach only focuses on the tweets with specific hashtag #breakingnews. The similarity between two tweets of breaking news is measured by using a variant of TFIDF scheme where the named entities detected by a Named Entity Recognizer (NER) are further boosted. Popescu et al. [15] proposed a method for entity-based event detection on Twitter streams. A set of tweets containing the predefined target entity are processed and machine learning techniques are used to predict whether the tweets constitute an event regarding the entity. Li et al. [16] proposed to detect crime and disaster related events (CDE) from tweets. Conventional text mining techniques are applied to extract the meta information (e.g., geo-location names, temporal phrase, and keywords) for event interpretation. To summarize, most existing approaches for detecting events from tweets are applicable to certain types of tweets (e.g., having a specific hashtag, containing a predefined entity, or related to crime and disaster). The other solutions including [11] and [13] involve complicated processing and lead to heavy computational cost. 
However, little work focuses on discovering events in a timely manner from microblog streams, even though timeliness significantly affects the usefulness of event detection from user-generated online content. Some researchers have studied event detection on Sina Weibo (e.g., [17] and [18]), but we are the first to study this problem systematically on Sina Weibo.
3. EVENT DETECTION FRAMEWORK
Before giving an overview of our event detection framework, we first define a few basic concepts.
3.1 Terminology
Event: What is an event? According to Cambridge Dictionary Online, an event is defined as “anything that happens, especially something important and unusual”². Here, we define an event e as a real-world event expressed by a set of words chosen from the pre-defined vocabulary V, which is similar to the definition proposed by Gupta et al. [1]. Similar to [1], we also use “core words + subordinate words” as the set of words representing the event, which can be formulated as $e = \{C_e, S_e\}$, where $C_e$ is the set of core words and $S_e$ is the set of subordinate words.
Core Words: Core words capture the main body of an event and have a high probability mass over the words that emerge in a certain time interval. These are bursty keywords identified by an approach based on temporal TF-IDF, which is detailed in Section 3.3.
Subordinate Words: Subordinate words are words that appear very frequently together with core words in a certain time interval. We aim to capture the descriptive aspects of an event with subordinate words. Since the circumstances or context of an event may change as time elapses, we extract the subordinate words by examining the probabilities of words conditioned on co-occurring with core words in a certain time interval, as described in Section 3.4. Also, a subordinate word for an event e1 might be a core word for some other similar event e2.
² http://dictionary.cambridge.org/dictionary/british/event?q=event
3.2 Event Detection Methodology
Fig. 1 summarizes our methodology for detecting currently popular events. We apply our methodology to posts from Weibo.
Fig. 1. Event detection methodology
After receiving a post from the Weibo feeds, we first preprocess it with word segmentation and stop-word removal. The word segmentation component splits the Weibo post into non-overlapping segments, each of which may or may not represent a semantic unit. We use NLPIR³, a publicly available tool for Chinese word segmentation, for this step. Stop-word removal is then applied with a general Chinese stop-word list⁴ (provided by ICT in China) to filter out meaningless words. We treat the most common words, such as ‘嗯’, ‘我’, ‘的’, ‘人’ and ‘而且’, as stop-words; in addition, if a word lies among the top few most frequent words in a large number of time intervals of the Weibo stream, we also treat it as a stop-word. Afterwards, we use the core words detection module to find currently popular keywords in a certain time interval. Next, we find the subordinate words that form the context for the core words. Finally, with the candidate events, each expressed by a set consisting of one core word plus several subordinate words, we use the clustering module to group the word sets that share the same topic. We now describe our methods for core words detection and subordinate words detection.
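The stream-based stop-word rule described above (a word that is among the top few most frequent words in many time intervals) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the parameters `top_k` and `min_fraction` are hypothetical, and the input is assumed to be already segmented into words (for example by NLPIR).

```python
from collections import Counter


def dynamic_stop_words(interval_tokens, top_k=3, min_fraction=0.8):
    """Return words that rank among the top_k most frequent words
    in at least min_fraction of the time intervals.

    interval_tokens: list of token lists, one list per time interval.
    top_k and min_fraction are illustrative parameters (assumptions).
    """
    hits = Counter()
    for tokens in interval_tokens:
        # Count each word once per interval in which it is a top-k word.
        for word, _ in Counter(tokens).most_common(top_k):
            hits[word] += 1
    needed = min_fraction * len(interval_tokens)
    return {word for word, n in hits.items() if n >= needed}


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]
```

In practice the returned set would be merged with a static stop-word list before the core words are computed.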
³ http://www.nlpir.org/
⁴ http://www.datatang.com/data/43894
3.3 Core Words Detection
If one can quickly and accurately capture suddenly popular words, which are the best representatives of an emerging event, then identifying and tracking such events becomes easier and more timely. We aim to discover the core words dynamically with an approach similar to temporal TF-IDF. For the core words, i.e., the key words of an emerging event, the term frequencies should be high in the current time interval and regular (not elevated) across multiple time intervals. The general words (or stop-words), on the other hand, have high frequencies and remain popular across multiple time intervals. Note that we remove stop-words before discovering core words, because many frequent colloquial and free-style Chinese words used in Weibo do not describe the event itself but rather express opinions or emotions.
If we take the bag of words extracted from the Weibo feeds in a certain time interval as a single document, then we get one document per time interval. Similar to [1], we rank words by a word score $WS(w)$ defined as follows:
$$WS(w) = \frac{tf(w)}{bf(w)} \qquad (1)$$
where $tf(w)$ is the frequency of the word $w$ in the current time interval, and $bf(w)$ is the basic frequency of the word across multiple time intervals. Then we set a threshold to determine the candidate core words with high WS scores, and choose the top few with the highest DF scores as the core words (here, we treat each Weibo post as a single document, i.e., a word with a high DF score is being talked about in many Weibo posts currently). Algorithm 1 gives the details of choosing the core words. The threshold is determined by the maxima, which is calculated using the IQR (interquartile range) [5], a common metric for detecting outliers. The maxima is defined as follows:
$$maxima(S) = Q_3(S) + 1.5 \times IQR(S) \qquad (2)$$
where $Q_3(S)$ is the upper quartile of S (here, S is the set consisting of the term frequencies of the words to be examined), and $IQR(S)$ is the distance between the upper quartile and the lower quartile $Q_1(S)$. Then we rank the candidate core words by their DF score, $DF(w) = |\{m \in D : w \in m\}|$, where D is the set of N posts published in the current time interval and m is a post containing the word w. After computing the DF values of the candidate core words, we find that the DF values of the words in CS (the set of candidate core words) are highly skewed, so we apply a logarithmic transformation before calculating the maxima (see line 8 in Algorithm 1).
So far, we have not described how to calculate the basic frequency $bf(w)$. One simple method is to average the word frequencies over the entire dataset. However, this is not practical, since the entire dataset is unavailable when we try to discover currently trending events by monitoring the data stream from Weibo. We propose a more practical and effective method that works recursively. With this method, we can estimate the basic frequencies of the words being examined in time, since only the temporal information of the last and the current time interval is needed. The details of this method are given in Eqns. (3) and (4):
[Eqns. (3) and (4)]
where $bf_t(w)$ is the basic frequency of word w in time slot t. Term frequencies in social media often vary temporally in a rise-and-fall pattern (rising exponentially but falling as a power law) [4]; here we use a decay factor, set at 1.5 as found by earlier work on real data from real bloggers [3] and on the response times to mail of Einstein and Darwin [2]. We also add a noise term as a tuning parameter (see Eqn. 4). To capture the term frequency dynamics immediately, we use an incremental addition term (Eqn. 3).
Algorithm 1: Discovering Core Words
input: Bag of words S after preprocessing
output: The set of core words C
1   foreach word w in S
2     put WS(w) into Ŝ
3   end
4   θ₁ ← maxima(Ŝ)
5   CS ← { w | WS(w) > θ₁ }
6   D̂ ← ∅
7   foreach word w in CS
8     put log(DF(w)) into D̂
9   end
10  θ₂ ← maxima(D̂)
11  C ← { w ∈ CS | log(DF(w)) > θ₂ }
12  return C
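The two-stage selection in Algorithm 1 can be sketched in Python as follows. This is a hedged illustration under stated assumptions: the word score is taken to be the ratio of the current frequency to the basic frequency, the maxima is taken to be the usual upper outlier fence Q3 + 1.5 × IQR, and the function names, the quartile method and the small-input guards are choices made for this sketch rather than details from the paper.

```python
import math
from statistics import quantiles


def maxima(values):
    """Upper outlier fence Q3 + 1.5 * IQR (assumed form of the threshold)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def candidate_core_words(tf, bf):
    """Stage 1: keep words whose score tf/bf is an upper outlier.

    tf: word -> frequency in the current interval.
    bf: word -> basic frequency (assumed to be estimated elsewhere).
    """
    ws = {w: tf[w] / bf[w] for w in tf if bf.get(w, 0) > 0}
    if len(ws) < 2:  # quartiles need at least two values
        return set(ws)
    theta = maxima(list(ws.values()))
    return {w for w, score in ws.items() if score > theta}


def core_words(candidates, df):
    """Stage 2: keep candidates whose log document frequency is an outlier.

    df: word -> number of posts in the current interval containing the word.
    """
    logs = {w: math.log(df[w]) for w in candidates if df.get(w, 0) > 0}
    if len(logs) < 2:
        return set(logs)
    theta = maxima(list(logs.values()))
    return {w for w, value in logs.items() if value > theta}
```

The logarithm in the second stage mirrors the transformation applied to the skewed DF values before the threshold is computed.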
3.4 Subordinate Words Detection
We aim to discover the subordinate words in terms of high co-occurrence with core words in microblogging social media. We examine the co-occurrence of a word w with a core word c through the conditional probability $P(w \mid c)$, which is determined from DF counts (here, we again treat a post as a single document, as when choosing core words from the candidate core words in Section 3.3). Then we select the top few words with the highest co-occurrence frequencies as the subordinate words. We select subordinate words from the candidate core words set (i.e., CS in Algorithm 1), so that we can settle the number of subordinate words easily without making a difficult decision.
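A minimal sketch of this step follows, assuming the conditional probability is estimated as the fraction of posts containing the core word that also contain the candidate word. The function name, the `top_k` parameter, the tie-breaking rule and the exclusion of the core word from its own list are assumptions of this sketch.

```python
from collections import Counter


def subordinate_words(posts, core_word, candidates, top_k=3):
    """Rank candidate words by an estimate of P(w | core_word).

    posts: list of token lists (one list per post, already segmented).
    candidates: the candidate core word set (CS in Algorithm 1).
    Returns up to top_k (word, probability) pairs.
    """
    # Treat each post as a set so a word counts once per post (DF-style).
    with_core = [set(p) for p in posts if core_word in p]
    if not with_core:
        return []
    co_df = Counter(
        w for p in with_core for w in p
        if w in candidates and w != core_word
    )
    n = len(with_core)
    # Sort by descending co-occurrence count, then by word for stable ties.
    ranked = sorted(co_df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, count / n) for w, count in ranked[:top_k]]
```

Restricting the ranking to the candidate set keeps the list short, which matches the motivation given above for drawing subordinate words from CS.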
3.5 Clustering on Candidate Events
After the core words and subordinate words have been determined, we have candidate events, each represented by a set of words consisting of one core word plus several subordinate words. Because there are many candidate events and their contexts overlap and are incomplete, we use clustering to discover more meaningful and compact events, which are again represented by core words and subordinate words. First, we use the LDA [6] framework to map each candidate event into a feature vector, which is represented by a set of probabilities over the entire bag of words drawn from the candidate events. We weigh the words (core words and subordinate words) forming a candidate event by their DF in the current time interval (i.e., for a word w, we count the number of Weibo posts that include w). As the core words are the main representatives of an event, they should weigh more than the subordinate words. With our core words detection and subordinate words detection modules, the core words do indeed weigh more than the subordinate words in a candidate event according to their DF scores. Next, with the JSD (Jensen-Shannon divergence) [6] as the distance measure, we use agglomerative clustering to find meaningful clusters. Fig. 2 shows the entire procedure described above.
Fig. 2. Clustering on candidate events. [Diagram: candidate events D1, D2, D3, ..., Dn (core word i + subordinate words) → LDA → feature vectors (probability distributions) for D1, ..., Dn → agglomerative clustering (based on JSD) → Cluster 1, Cluster 2, ..., Cluster m]
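The last stage of this procedure can be illustrated with a small pure-Python sketch. It assumes that each candidate event has already been mapped to a probability vector (for example by an LDA implementation, which is not shown), and it uses average linkage with a fixed distance threshold; the linkage choice, the function names and the threshold value are assumptions of this sketch, not details given in the paper.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs (value in [0, 1])."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def agglomerative(vectors, threshold):
    """Average-linkage agglomerative clustering on JSD distances.

    Merging stops once the closest pair of clusters is farther apart
    than `threshold`. Returns clusters as lists of vector indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def dist(a, b):
        total = sum(jsd(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min(
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

The quadratic search over cluster pairs is adequate for the modest number of candidate events produced per time interval, but a library implementation would be preferable at larger scale.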
4. EXPERIMENTS AND DISCUSSION
In this section, we report the experiments evaluating our proposed event detection method. Our experimental results show that the method can effectively detect real-world events. Further, we evaluate our proposed method for estimating the basic frequency of the core words.
4.1 Dataset and Experimental Setting
Dataset. We use the dataset published by Zhang et al. [7], which was collected from the microblogging site Weibo.com. Table 1 lists the basic statistics of this dataset. The creation times of the posts (including reposts) in the dataset range from August 28, 2009 to December 26, 2012, and their distribution is illustrated in Fig. 3. We choose the Weibo posts published from July 2012 to November 2012, as this timeframe has the largest post volume (more than 17.8 million posts).
Parameter Setting. Several parameters in our experiments could affect the performance of the proposed event detection method. The size of the time window is a basic parameter for detecting core words and subordinate words. In our evaluation, we fix the window size to one day. For the agglomerative clustering, we set the threshold for determining the clusters from the candidate events extracted in the current time interval⁵.
Table 1. Dataset from Sina Weibo
#users | #follow relationships | #original posts | #re-posts
1,787,443 | 413,503,687 | 300,000 | 37,372,573
Fig. 4. (a), (b) RMSE and CV(RMSE) of the estimated basic frequency against the number of days we estimated
Recall that, in Algorithm 1, we retain the top bursty candidate core words as the core words by using the common IQR-based metric for detecting outliers. However, we observed that this still output more than 50 core words per day. Thus, to make the discovered events easier to interpret, we arbitrarily take the top 1% of bursty candidate core words as the core words for event interpretation (see Section 4.3.2).
4.2 Evaluation on Basic Frequency Estimation
We first report the evaluation of estimating the basic frequency used for detecting the core words. In Eqn. (3), the basic frequency is defined recursively, so we need to assign it an initial value. We use a global TF as the initial basic frequency of a word, computed from the posts published over a relatively long time before the day on which we start the estimation. Then, we compare the estimated basic frequency to the dynamic-global-TF (we take the dynamic-global-TF as the ground truth and calculate it in the same way as the global TF, but only over 30 days). We report the results in terms of RMSE and CV(RMSE)⁶ in Fig. 4 (for a fixed setting of the tuning parameter in Eqn. 4). As shown in Fig. 4, both RMSE and CV(RMSE) climb as the number of days to estimate increases, which indicates that it is much more challenging to estimate the basic frequency for keywords that stay popular over a long run. However, due to the fast-changing nature of Weibo, such long-running popular keywords are very rare, and for a relatively short period (e.g., within 5 days) the performance of estimating the basic frequency is acceptable.
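For reference, the two error measures used in this evaluation can be computed as in the sketch below, where CV(RMSE) is the RMSE normalized by the mean of the observed values. The function names and the toy inputs in the usage note are illustrative only and are not taken from the experiments.

```python
import math


def rmse(estimated, observed):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(e - o) ** 2 for e, o in zip(estimated, observed)]
    return math.sqrt(sum(squared) / len(observed))


def cv_rmse(estimated, observed):
    """RMSE divided by the mean of the observed values."""
    mean_observed = sum(observed) / len(observed)
    return rmse(estimated, observed) / mean_observed
```

With estimated values `[2, 4]` and observed values `[1, 5]`, the errors are 1 and -1, so the RMSE is 1.0 and, since the observed mean is 3, the CV(RMSE) is one third.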
4.3 Event Detection Results
The dataset does not come with ground truth labels for all real-world events within the data collection period. Since it is infeasible to manually label the 300,000 original posts in the dataset, we choose to manually evaluate the events detected by our proposed method.
4.3.1 Event Detection Performance
We arbitrarily select the posts created on 4 days (2012-07-27, 2012-07-29, 2012-09-15 and 2012-11-22), use the posts including the core words on each day as the testing set, and manually label each post with one specific event; we take these labels as the ground truth. Specifically, we first identify the events detected (detailed in Section 4.3.2) on that day. Then we label each original post on that day (or the day before, since some original posts are published earlier and receive many reposts later) according to the identified events. As reposts of a single post talk about the same topic as the original post with high probability, we label the reposts according to the original post. Next, we assign the estimated labels to posts and reposts by determining whether the original text of the post or repost contains the core words of a detected event. We evaluate the performance of our proposed method in terms of precision and recall, as illustrated in Fig. 5. For some events, both precision and recall are very high; for others, the detection performance, especially the recall, is not as good. After the event interpretation (Section 4.3.2), we will see that the results with poor performance are mainly related to real-life events.
Fig. 3. Post volume against month of year
Fig. 5. Evaluation of the illustrated event detection in terms of precision and recall
⁵ Here, the set in the threshold expression consists of the candidate events extracted from the posts published in the current time interval.
⁶ Coefficient of variation of the RMSE, defined as the RMSE normalized by the mean of the observed values.
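The evaluation protocol above (a post is assigned to an event when its text contains a core word of that event, then compared with the manual labels) can be sketched as follows. The data layout, function name and example strings are hypothetical; the sketch assumes a post is assigned when it contains any one of the event's core words, which is one possible reading of the description.

```python
def precision_recall(posts, truth, event, core_words):
    """Precision and recall for one detected event.

    posts: post id -> original text.
    truth: post id -> manually assigned event label.
    core_words: the core words of the detected event.
    """
    # Estimated label: the text contains at least one core word.
    predicted = {
        pid for pid, text in posts.items()
        if any(word in text for word in core_words)
    }
    actual = {pid for pid, label in truth.items() if label == event}
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

A core word that is shared by several events lowers precision under this rule, which is one way the matching step can misassign posts.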
4.3.2 Event Interpretation
We argue that the notion of core words plus subordinate words makes the detected events much easier to interpret. Taking the discovered words as search keywords, we use the Baidu⁷ search engine to find the stories related to these events. We now give some examples of events detected by our proposed method (Table 2). These examples are chosen because the 4 days cover a variety of important events. From Table 2, we make two observations. First, many events contain multiple core words and subordinate words, and events described by several words are easier to interpret than events described by only a few. Second, the detected events cover a wide range, from sports to natural disasters. As the topics of microblogs are highly diverse, many of them do not concern actual events, which makes event detection much harder (e.g., several entries in Table 2 are not actual events).
⁷ http://www.baidu.com
5. CONCLUSION AND FUTURE WORK
In this work, we proposed a real-time event detection framework based on discovering bursty keywords. By taking social dynamics information into account, we proposed an effective method for discovering the bursty keywords. We then took the words that frequently co-occur with the keywords as the contextual words of events. After detecting the representative words of events, we used LDA and JSD to measure the similarity of the detected candidate events, and employed agglomerative clustering to generate a compact form of the events. We evaluated our event detection framework on data from Sina Weibo, and the experimental results showed that the proposed method can effectively discover events with high interpretability. In the future, we plan to improve the performance by systematically taking temporal information dynamics into account, and to use the posts from which the events are sourced as the social and temporal context of events. With these dimensions, we can obtain a more robust (less reliant on the performance of Chinese phrase identification) and effective event detection method. We also aim to improve the detection performance on real-life events by taking into account the conversation structure, reposting behavior patterns and text properties related to a specific event (e.g., the diversity of texts related to one event).
6. ACKNOWLEDGMENTS This paper was partially supported by the National Basic Research Program of China (No. 2012CB316400), the National Natural Science Foundation of China (No. 61103063, 61222209, 61373119), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20126102110043), and the project funding from Microsoft.
Table 2. Illustrated examples of events detected during the period of July 2012 to November 2012 (dates covered: 2012-07-27, 2012-07-29, 2012-09-15, 2012-11-22)
Core Words (may be a subset) | Subordinate Words (may be a subset) | Event Description
宜家(a famous furniture brand, Yi Jia), 托儿所(nursery) | 蹂躏(ravage), 折服(be convinced), 涵养(quality), 托儿所(nursery) | Not a realistic event; people talking about a well-known furniture brand (Yi Jia), which is famous in China
奥运(Olympic), 开幕式(opening ceremony), 绿洲(Oasis) | 伦敦(London), 奥运(Olympic), 开幕式(opening ceremony), 乐队(band), 手枪(gun), 阿黛尔(Adele), 绿洲(Oasis) | The 2012 London Olympics Opening Ceremony
鲁迅文学奖(Lu Xun Literary Prize) | 鲁迅文学奖(Lu Xun Literary Prize) | Not a realistic event
暴雨(rain storm), 自然灾害(natural disaster), 受灾(hit by disaster) | 遇难(died), 暴雨(rain storm), 天津(Tianjin), 自然灾害(natural disaster), 致哀(lament sb’s death), 受灾(hit by disaster) | Torrential rain in Tianjin, China
应予(should be) | 严惩(punish), 扰民(disturbing residents), 应予(should be) | No clear corresponding real-life event
荒山(barren hill), 建厂(constructing plant) | 荒山(barren hill), 培育(cultivate), 建厂(constructing plant), 造纸(papermaking), 地表(the surface of earth), 改种(replant), 砍伐(fell trees) | People complain about a highly polluting paper mill
奥运(Olympic), 伦敦(London), 金牌(gold medal) | 奥运(Olympic), 伦敦(London), 金牌(gold medal), 奥运会(Olympics), 孙杨(Sun Yang), 朴泰桓(Park Tae-hwan) | A Chinese swimmer (Sun Yang) won the gold medal in the men’s 400m freestyle final
冒雨(working in the rain), 乐山市(Leshan City), 高新(high-tech) | 视察(inspect), 冒雨(working in the rain), 高新(high-tech), 乐山市(Leshan City), 摄像机(camera), 高新技术(high technologies), 水枪(water gun), 伪造(forge) | In Leshan, Sichuan, some government officials became “actors”: they pretended to examine the situation of local people in bad weather, using a fire truck to fake a picture of rain
同胞(compatriot), 砸(smash), 钓鱼岛(Diaoyu Island) | 日本(Japan), 同胞(compatriot), 砸(smash), 钓鱼岛(Diaoyu Island), 日货(Japanese goods), 财产(property), 抵制(resist), 岛(island), 国人(compatriot), 流氓(villain), 打砸抢(beating, smashing and looting) | Anti-Japanese demonstrations happened in multiple cities in China
掠夺(plunder) | 日本(Japan), 同胞(compatriot), 财产(property), 掠夺(plunder), 法治(rule of law) | Related to
毕节(a city named Bijie), 李元龙(Yuanlong Li), 毕节市(Bijie City), 失业者(jobless) | 儿童(children), 毕节(Bijie), 贵州(Guizhou), 流浪(roam), 机会(chance), 李元龙(Yuanlong Li), 冻死(freeze to death), 毕节市(Bijie City) | A journalist named Yuanlong Li was arrested after he reported the death of 5 wandering children in Bijie, Guizhou
胺(melamine), 蒋卫锁(a Chinese name), 蒋卫(a Chinese name), 遇害(being murdered), 掺假(adulterated) | 胺(melamine), 蒋卫锁(a Chinese name), 打假(crack down on fake products), 蒋卫(a Chinese name), 彻(not a word), 遇害(being murdered), 掺假(adulterated), 轰动(shocking), 终年(all the year round), 乳品(dairy) | An entrepreneur (Weisuo Jiang) was murdered and buried on Nov. 21, 2012. Jiang is famous for exposing the scandal of counterfeit dairy production in China
感恩(be thankful) | 机会(chance), 感恩(thank), 感谢(thanks), 好友(friends), 乡(not a word), 抽取(select), 博(not a word), 礼(gift) | Not a realistic event
雷政富(a Chinese name), 北碚区(Beibei District), 垫江(Dianjiang City) | 雷政富(a Chinese name), 书记(secretary), 北碚区(Beibei District), 举报(inform), 重庆市(Chongqing), 垫江(Dianjiang) | Zhengfu Lei, formerly the Chongqing Beibei District Party Secretary, was sacked from his position for taking bribes
翻开(open sth. like a book), 名著(famous book), 吴承恩(Cheng’en Wu, a famous ancient novelist), 伏笔(a hint foreshadowing later development in a story) | 翻开(open), 闲(leisure), 名著(famous book), 西游记(the Story of a Journey to the West), 吴承恩(Cheng’en Wu), 伏笔(a hint) | Not a realistic event
开刀(perform an operation), 佛罗里达(Florida), 钢条(iron rod) | 郎咸平(Xianping Lang, a famous economist), 切开(incision), 骨髓(marrow), 开刀(perform an operation), 钛(titanium), 佛罗里达(Florida), 钢条(iron rod), 刀刃(blade) | Not a realistic event
7. REFERENCES
[1] M. Gupta, J. Gao, C. Zhou, and J. Han. Predicting future popularity trend of events in microblogging platform. In ASIST, 2012.
[2] A. L. Barabasi. The origin of bursts and heavy tails in human dynamics. Nature, 2005.
[3] J. Leskovec, M. McGlohon, C. Faloutsos, N. S. Glance, and M. Hurst. Patterns of cascading behavior in large blog graphs. In SDM, 2007.
[4] J. Yang and J. Leskovec. Patterns of temporal variation in online media. In WSDM, 2011.
[5] G. Upton and I. Cook. Understanding Statistics, p. 55. Oxford University Press, 1996.
[6] G. Heinrich. Parameter estimation for text analysis. Technical report, 2005.
[7] J. Zhang, B. Liu, J. Tang, T. Chen, and J. Li. Social influence locality for modeling retweeting behaviors. In IJCAI, 2013.
[8] G. P. C. Fung, J. X. Yu, H. Liu, and P. S. Yu. Time-dependent event hierarchy construction. In SIGKDD, 2007.
[9] G. P. C. Fung, J. X. Yu, P. S. Yu, and H. Lu. Parameter free bursty events detection in text streams. In VLDB, 2005.
[10] Q. He, K. Chang, and E.-P. Lim. Analyzing feature trajectories for event detection. In SIGIR, 2007.
[11] M. Mathioudakis and N. Koudas. Twittermonitor: trend detection over the twitter stream. In SIGMOD, 2010.
[12] J. Kleinberg. Bursty and hierarchical structure in streams. In SIGKDD, 2002.
[13] S. Petrović, M. Osborne, and V. Lavrenko. Streaming first story detection with application to twitter. In HLT, 2010.
[14] S. Phuvipadawat and T. Murata. Breaking news detection and tracking in twitter. In WI-IAT, 2010.
[15] A.-M. Popescu, M. Pennacchiotti, and D. Paranjpe. Extracting events and event descriptions from twitter. In WWW, 2011.
[16] R. Li, K. H. Lei, R. Khadiwala, and K. C.-C. Chang. Tedas: A twitter-based event detection and analysis system. In ICDE, 2012.
[17] H. Tu and J. Ding. An efficient clustering algorithm for microblogging hot topic detection. In CSSS, 2012.
[18] D. Shan, W. X. Zhao, R. Chen, B. Shu, Z. Wang, J. Yao, H. Yan, and X. Li. Eventsearch: a system for event discovery and retrieval on multi-type historical data. In SIGKDD, 2012.