
2013 IEEE Seventh International Conference on Semantic Computing

Detection, Clustering and Tracking of Life Cycle Events on Twitter Using Electric Fields Analogy

Diego Terrana, Giovanni Pilato
ICAR - Istituto di Calcolo e Reti ad Alte Prestazioni
CNR - Consiglio Nazionale delle Ricerche
Viale delle Scienze - Edificio 11 - 90128 Palermo, Italy
Email: {terrana, pilato}@pa.icar.cnr.it

Abstract—With the recent explosion of social networks, there is a growing need for systems capable of extracting useful information from this amount of data. Social networks generate a large amount of text content over time because of continuous interaction between people. Given the amount and cadence of the data generated by those platforms, classical text mining techniques are not suitable. "Events" can be deduced from aggregations of tweets in the stream. In this paper, we address the detection, clustering and tracking of events in a stream of tweets. We present an online framework that considers a tweet post as an electric charge and a new event as an electric field. A new event on Twitter is created when several tweets deal with the same topic. This event disappears over time when there are no more tweets discussing it. A corpus of 400 million tweets has been created and analyzed using our algorithm. The results show the effectiveness of the technique, both in terms of time and memory performance.

Index Terms—Event Detection, Twitter Stream Analysis, Electric Fields Analogy, Machine Learning

978-0-7695-5119-7/13 $26.00 © 2013 IEEE. DOI 10.1109/ICSC.2013.46

I. INTRODUCTION

The Web is the largest data repository available today: every day millions of users share their opinions and experience by posting to blogs or microblogs. The users of social network services can connect with people who share interests, activities and experiences with their network of friends. Compared to traditional blogs, microblogging has shown a tremendous growth in popularity in recent years. Microblogging users upload new messages more frequently and instantly. One of the most representative examples is Twitter because of its popularity and data volume. Currently, more than 500 million users around the world are using it to share information, opinions, news, moods, concerns, facts, rumors, and general events of public interest such as earthquakes, political events, and deaths of famous people [1]. Corporations use Twitter to announce products, services and events. Twitter is also a news medium: many news outlets have accounts on Twitter to report news. Tweets can be seen as a source of data enabling users and corporations to stay informed of what is happening now or what is being said about them. Twitter employs a social-networking model named "following", in which a user is allowed to follow any other user she wants without requiring any authorization. Being a follower on Twitter means receiving all the messages from the followed person. The Twitter public stream is an interesting data set for event detection based on text mining techniques.

Given the volume of data and the real-time nature of the Twitter flow, the extraction of events must take place in on-line mode. Unlike traditional data mining algorithms, our analysis is based on texts (tweets) of limited length (not more than 140 characters), and the events are not known a priori. No training dataset is available in advance: the input is a potentially infinite stream of data that arrives in an order that cannot be controlled. The requirements that an algorithm must satisfy in these cases are [2]:

• Process and monitor a sample and inspect it only once. No random access to the data is allowed. The algorithms must keep the computation requirements as low as possible. Algorithms that require more than one step to operate are typically not suitable for data stream mining;
• Use a limited amount of memory to store the current model;
• Work in real time: it must process data as fast as tweets are detected;
• It should be able to produce the best model after analyzing any number of examples. Final model generation should be direct and it should avoid the re-computation of the model.

In this paper we propose an approach to identify the occurrence of any event of public interest in real time. The huge volume of data (tweets) makes it very difficult to identify and manage all events. Therefore, it is necessary to have an algorithm capable of handling the clustering of a large number of messages. Traditional clustering algorithms group the tweets related to current events into several different groups. Unfortunately, they do not indicate when to delete an event from the system memory, and this causes high processing time and memory exhaustion in a real system. In our work, we have used a combination of text mining techniques to process a tweet and to work in a limited period of time. Our system is capable of predicting an event at any time without requiring any post-processing. In this paper, we illustrate a clustering algorithm that models the life cycle of a news-related event as an electric field and a tweet post as an electric charge. Each event is associated with an energy that generates an attractive field for new tweets. When an event absorbs a tweet, its energy is increased.

As time goes on, the energy of an event decays until it is exhausted. Once its energy is exhausted, the event disappears and is deleted from the system memory. We have applied our method to a corpus of 400 million tweets collected over several months without saturating the available memory and computational resources. Our system has detected events such as sports competitions, the election of the new Pope, the bombing at the Boston Marathon, and the earthquake in the province of Frosinone.

II. RELATED WORK

Cordeiro [2] has presented a system of event detection using signal analysis of hashtag occurrences in the Twitter public stream. He describes detected events using a Latent Dirichlet Allocation (LDA) topic inference model based on Gibbs Sampling. Peak detection using the Continuous Wavelet Transformation achieved good results in the identification of abrupt increases in the mentions of specific hashtags. Aggarwal et al. [3] present an online approach for clustering massive text and categorical data streams with the use of a statistical summarization methodology. They provide a framework that stores statistical data at regular intervals. Petrovic et al. [4] present a first story detection system that works in the streaming model, based on locality sensitive hashing. Lu et al. [5] consider a news event as a natural life form, and use an energy function to evaluate its activity. A news event on Twitter becomes more active with a burst of tweets discussing it, and it fades away with time. These changes of activity are well captured by the energy function. They incorporate this energy function into the traditional single-pass clustering algorithm, and propose an on-line news event detection method. Hsieh et al. [6] propose a solution to recognize real-time events from sport games by analyzing the messages posted on microblogging services. They apply moving-threshold burst detection to tweets in order to detect highlights in sport games using a TF-IDF method.

Aggarwal et al. [7] present a model for event mining in social streams. Lee et al. [8] develop several algorithms for detecting and grouping emerging topics by making use of real-time messages and geolocation data provided by social network services. Lee et al. [9] develop a novel algorithm for ranking topics and messages on microblogs to identify emerging topics and events which might yield the most recently trending topics. Tsolmon et al. [10] describe a method for extracting social events based on timeline and sentiment analysis from social streams. Ilina et al. [11] focus on detecting social events from Twitter messages. The approach shown in [12] identifies anomalous tweets in Twitter streams by determining the divergence between the topic of a tweet and the actual topic of the document pointed to by the URLs in the tweet.

III. MODELING STREAM AS ELECTRIC FIELD

In physics, the Coulomb force is the force exerted by an electric field whose source is an electric charge. It acts on electrically charged objects, and it is defined by Coulomb's law, which quantitatively expresses the interaction between two point-like electrical charges. Let $q_1$ and $q_2$ be point charges; the electrostatic force exerted by a charged particle $q_2$ on a charge $q_1$ is:

$$F = k \frac{q_1 q_2}{d^2} \tag{1}$$

where $k$ is Coulomb's constant and $d$ is the distance between the charges. The law says that the force between two charges is proportional to their product, and inversely proportional to the square of their distance.

By analogy, in our algorithm we consider the space of events as a set of electric fields, while tweets are seen as electrostatic charges. Each field (event) has an energy given by the composition of the charges (tweets) that compose it. The individual tweets are then point particles that are attracted by the various existing events according to a law similar to Coulomb's law. In our case, $k = 1$, $q_1$ represents the energy of the individual fields and $q_2$ is the charge of the last received tweet.

A new event on Twitter is created when several tweets deal with the same topic. Our model assigns an initial energy to each new event. Each new incoming tweet which is related to an event increases the energy of the event. The energy of an event decreases over time according to a decay rate. The whole lifetime of an event (the period between its first tweet and the current time) is divided into intervals of equal size. In each time slice $T$, the energy of an event is equal to the energy possessed in the previous period $T-1$, to which is added the energy absorbed from all the tweets that have been added in the interval $T$. The decay phase of an event begins as the number of tweets starts to decrease. If $N_T$ represents the number of time intervals elapsed from the first tweet of an event and $N$ is the number of tweets representative of the event, the decay phase of the event begins when the ratio $\tau = N_T / N$ reaches its first local minimum (point P in Figure 1).

The energy $E$ of an event $i$ in the time slice $T$ is equal to:

$$E_i(T) = \begin{cases} E_i(T-1) + \sum_k \alpha_k(T)\,\varepsilon_k & \text{before the first minimum of } \tau \\ E_i(T-1) + \Big(\sum_k \alpha_k(T)\,\varepsilon_k\Big) - \beta_i & \text{after the first minimum of } \tau \end{cases} \tag{2}$$

where
$\varepsilon_k$ = energy of the $k$-th tweet,
$\alpha_k(T)$ = transfer factor of the $k$-th tweet in time slice $T$,
$\beta_i$ = decay rate for the $i$-th event.

Each new tweet is labeled with an energy value $\varepsilon$. The number of followers has a central role in the calculation of the energy associated with tweets. A tweet will have a higher energy if there are many users who read it. Only a percentage $\alpha$ of this energy actually increases the energy of the event. The energy assigned to each tweet is calculated according to the formula used by Lu et al. [5]:

$$\varepsilon = \lambda_1 + \lambda_2 \frac{\log(numFoll)}{\log(numFollMax)} \tag{3}$$

where $0.0 \le \lambda_1 \le 1.0$, $0.0 \le \lambda_2 \le 1.0$, $\lambda_1 + \lambda_2 = 1.0$, $numFoll$ = number of followers of the user, and $numFollMax$ = maximum number of followers, which acts as a normalization factor. Currently the Twitter user with the greatest number of followers is Lady Gaga with 31,634,549 followers [13].
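As a minimal sketch of formula (3), the following Python function computes the energy of a tweet from the follower count of its author. The function name, the clamping of follower counts to the range [1, numFollMax] and the argument checks are illustrative assumptions; the default weights are the values 0.7 and 0.3 reported in the experimental section.

```python
import math


def tweet_energy(num_followers, max_followers, lam1=0.7, lam2=0.3):
    """Energy of a single tweet according to formula (3)."""
    if not (0.0 <= lam1 <= 1.0 and 0.0 <= lam2 <= 1.0):
        raise ValueError("lam1 and lam2 must lie in [0, 1]")
    if abs(lam1 + lam2 - 1.0) > 1e-9:
        raise ValueError("lam1 + lam2 must be equal to 1")
    if max_followers <= 1:
        raise ValueError("max_followers must be greater than 1")
    # Assumption: follower counts are clamped to [1, max_followers] so that
    # the logarithm is defined and the ratio stays within [0, 1].
    followers = min(max(num_followers, 1), max_followers)
    return lam1 + lam2 * math.log(followers) / math.log(max_followers)


# Example with the maximum follower count quoted in the text.
MAX_FOLLOWERS = 31_634_549
print(tweet_energy(1_000, MAX_FOLLOWERS))
```

With these weights every tweet receives at least the base energy λ1, and the follower-dependent term grows logarithmically, so very popular accounts do not dominate an event.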

The percentage $\alpha$ is inversely proportional to both the distance $d_i$ and the energy $E_i$:

$$\alpha(T) = \frac{1}{d_i \, E_i(T-1)} \tag{4}$$

where $d_i$ = distance between the incoming tweet and the centroid vector representing the $i$-th event, and $E_i$ = energy of event $i$ in the interval $(T-1)$.

An energy decay factor must be subtracted from the energy of an event that is already in the decay phase:

$$\beta_i = \tau \, \frac{\log(numFollAverage)}{\log(numFollMax)} \tag{5}$$

where $numFollAverage$ = average number of followers.

At the beginning, after the birth of an event, the energy of the new event grows with the aggregation of new tweets semantically related to it. Generally, the number of tweets related to an event decreases over time, and thus so does its energy. When the energy of an event falls below a certain threshold θ, the event is declared ceased and is deleted from the system memory. The system is able to locate and track different events in real time without excessive consumption of resources. At predetermined intervals of time, a snapshot of the events present in that interval is saved in memory.

A. Algorithm description

Figure 2 shows the proposed procedure to perform the detection of events in the Twitter stream. The following sections describe each of the steps in detail.

1) Data acquisition: The dataset under analysis is retrieved by using the Twitter APIs (statuses/sample method) with the default access level. The default access level (Spritzer) returns a random sample of all public tweets [14]. This level of access provides a small proportion of all public tweets (1%). The Twitter APIs provide two other levels of access, the Firehose (100%) and the Gardenhose (10%), which require special accounts. The data returned is a set of documents, one for each tweet, in JavaScript Object Notation (JSON). These documents, in addition to the text of the tweet, contain other data, such as date, source, type, profile, location, number of favorites, friends, followers, URLs, hashtags, etc. Given the average number of 170 million tweets sent per day [15], the data retrieved through the Streaming APIs (1%) is expected to amount to about 1,700,000 tweets in 24 hours. The JSON tweet document contains attributes describing the tweet, user information, the relations of the tweet with other tweets, and lists of the URLs, hashtags and user mentions contained in the tweet. In some cases, information related to the location of the user is also provided in the document.

2) Language recognition: The language of every incoming tweet is checked. To recognize the language of a tweet we used the Cybozu library [16], which detects the language of a text using a naive Bayesian filter with over 99% precision for 53 languages. In our tests, the recognition accuracy did not always prove to be so high because of the limited number of characters in a tweet (not more than 140). In particular, the problem has been observed for Latin languages such as Spanish, Portuguese, Romanian and Italian. Therefore, we exploit both the recognized language and the declared nationality of the Twitter user in order to be sure that the examined tweet is expressed in the language of interest (LANG).

3) Text pre-processing: The text content of the tweet is pre-processed. Stop words are filtered out; links and hashtags are removed before processing the text because they often hide off-topic posts or even spam. Tweets containing mainly abnormal sequences of characters are discarded. All the words belonging to one or more of the following categories are identified: abbreviations, adverbs, conjunctions, prepositions, adjectives, articles, pronouns, letters, verbs.

4) Text processing: An energy $\varepsilon_j$ is assigned to each incoming tweet $j$ according to formula (3), and a set $\chi_j$ of the ten words with the highest entropy is extracted by using as a measure of information the following formula taken from information theory:

$$Entropy_j = -f_{ij} \log_2 f_{ij} \tag{6}$$

where $f_{ij}$ is the frequency of the word $i$ in tweet $j$. All the words belonging to at least one of the categories identified in the previous pre-processing phase are excluded from the calculation of entropy, in order to identify the most significant words that characterize a tweet.

5) Nearest Neighbor Event Search: Only active events (clusters), i.e., those events having an energy greater than the threshold θ, are stored in the system memory for each time slice of observation. For each incoming tweet, the corresponding set χ of the ten words with the highest entropy is compared with all the sets representing the active events to calculate the distance. The distance between the tweet $j$ and the event $i$ is calculated as follows:

$$d_{ij} = 1 - tanimotoCoefficient(i, j) \tag{7}$$

where $tanimotoCoefficient(i, j)$ is the Tanimoto coefficient. It measures the similarity between sample sets, and it is defined as the size of the intersection divided by the size of the union of the sample sets. If $A$ is the set of words with the highest entropy values representing a tweet $j$ and $B$ is the set of words with the highest entropy values representing the event, then the Tanimoto coefficient is defined as:

$$tanimotoCoefficient(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{8}$$

[Fig. 1 plot: Time Windows Number / Tweets Number (y axis, 0 to 1.2) against Time Windows Number (x axis, 1 to 44), with the point P marked.]

Fig. 1: Example of a trend of the ratio $\tau = N_T / N$. From the point P, an energy decay factor, a function of τ, is subtracted from the energy of an event.

Fig. 2: Event Detection Procedure
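Formulas (6) to (8) can be illustrated with a short Python sketch. It assumes that $f_{ij}$ is the relative frequency of a word within the tweet, that ties are broken alphabetically, and that two empty keyword sets are at distance 1; none of these details is specified in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def top_entropy_words(tokens, n=10):
    """Return up to n words with the highest -f * log2(f), as in formula (6).

    Assumption: f is the relative frequency of the word in the token list,
    which is expected to be already pre-processed (stop words removed, etc.).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    scores = {}
    for word, count in counts.items():
        f = count / total
        scores[word] = -f * math.log2(f)
    # Highest score first; alphabetical order breaks ties deterministically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:n])


def tanimoto_distance(a, b):
    """Distance of formula (7): 1 - |A intersection B| / |A union B|."""
    union = a | b
    if not union:
        return 1.0  # assumption: nothing in common when both sets are empty
    return 1.0 - len(a & b) / len(union)


tweet = top_entropy_words(["earthquake", "rome", "shock"])
event = {"earthquake", "rome", "naples"}
print(tanimoto_distance(tweet, event))
```

Because the coefficient only uses set sizes, the distance is cheap to evaluate against every active event, which matters when each tweet is inspected only once.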

The event $i$ with the shortest distance $d_{ij}$ from the tweet $j$ is the candidate for the update.

6) Update Event List: At the beginning, we start with an empty set of events (clusters). As new tweets arrive, unit events containing individual tweets are created. Once a maximum number initialNumber of such events has been created, we begin the process of online event maintenance. These events are updated over time with the arrival of new tweets. Let cand(j) be the event candidate for the update and $d_j$ the distance between that event and the tweet $j$. Our system creates a new event if $d_j$ is less than a threshold γ obtained empirically or if tweet $j$ is approximately equidistant from all events; otherwise the tweet is assigned to the nearest event, cand(j), and its energy is updated according to formula (2). An identification code is assigned to each event. For each event, our system stores:

• The creation date;
• The number of aggregated tweets;
• The textual content of tweets;
• The average number of followers;
• The timestamp of the last update.
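The maintenance step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Event` class and `assign_tweet` function are hypothetical names, the keyword set of an event is simply the one of its first tweet, the equidistance test is omitted, and the threshold comparison is read as "open a new event when the nearest event is farther than γ", which is an interpretation of the wording in the text. The default γ = 0.67 is the value reported in the experimental section.

```python
from dataclasses import dataclass, field
from datetime import datetime
from itertools import count

_next_id = count(1)  # source of identification codes


@dataclass
class Event:
    keywords: set
    created: datetime
    last_update: datetime
    texts: list = field(default_factory=list)
    follower_sum: int = 0
    event_id: int = field(default_factory=lambda: next(_next_id))

    @property
    def num_tweets(self):
        return len(self.texts)

    @property
    def avg_followers(self):
        return self.follower_sum / max(len(self.texts), 1)


def tanimoto_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


def assign_tweet(events, keywords, text, followers, now, gamma=0.67):
    """Attach a tweet to its nearest event, or open a new event for it."""
    nearest, d_min = None, None
    for ev in events:
        d = tanimoto_distance(keywords, ev.keywords)
        if d_min is None or d < d_min:
            nearest, d_min = ev, d
    # Assumption: a tweet farther than gamma from every event starts a new one.
    if nearest is None or d_min > gamma:
        nearest = Event(keywords=set(keywords), created=now, last_update=now)
        events.append(nearest)
    nearest.texts.append(text)
    nearest.follower_sum += followers
    nearest.last_update = now
    return nearest
```

Each stored field maps onto one item of the list above; the energy of the event is deliberately left out here because it is updated separately through formula (2).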

For each event, the energy level is checked at the end of each time slice of observation (TimeRange), whose length is determined experimentally. If an event is already in the decay phase, its energy is decreased by subtracting its energy decay factor (5). If the energy of an event falls below a threshold η, the event is deleted and removed from the list of currently active events.

7) Event Summary: At any given time, our algorithm keeps in main memory only the current snapshot of the events. Snapshots are taken at regular time intervals. Snapshots of events at different times allow us to reconstruct and analyze the events over time. In addition, a report file is created daily together with the trend of the energies of the events. This allows us to plot the trend of events over time and to analyze an event in offline mode, if required. Finally, we assign to each event a label obtained by concatenating the words with the highest entropy extracted from the textual content of the event by using formula (6).
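The end-of-slice energy check can be sketched in Python by combining formulas (2), (4) and (5). The function names are hypothetical, and several details are assumptions made only for this illustration: the transfer factor is capped at 1 and set to 1 when its denominator is zero, the decay phase is taken to start when τ first rises after having decreased, and average follower counts below 1 are clamped to 1. The default η = 0.1 is the value reported in the experimental section.

```python
import math


def transfer_factor(distance, prev_energy):
    """Fraction alpha(T) of a tweet's energy absorbed by an event, formula (4)."""
    denom = distance * prev_energy
    # Assumption: alpha is a percentage, so it is capped at 1.0, and a zero
    # denominator (for example identical keyword sets) means full transfer.
    if denom <= 0.0:
        return 1.0
    return min(1.0, 1.0 / denom)


def decay_rate(tau, avg_followers, max_followers):
    """Decay factor beta_i of an event, formula (5)."""
    return tau * math.log(max(avg_followers, 1.0)) / math.log(max_followers)


def in_decay_phase(tau_history):
    """True once tau = N_T / N has risen after having decreased."""
    decreased = False
    for prev, cur in zip(tau_history, tau_history[1:]):
        if cur < prev:
            decreased = True
        elif cur > prev and decreased:
            return True
    return False


def step_energy(prev_energy, absorbed, tau_history, avg_followers, max_followers):
    """Energy E_i(T) computed from E_i(T-1) as in formula (2).

    absorbed: (tweet_energy, distance) pairs for tweets absorbed in slice T.
    tau_history: values of tau for all slices up to and including T.
    """
    gained = sum(transfer_factor(d, prev_energy) * e for e, d in absorbed)
    energy = prev_energy + gained
    if tau_history and in_decay_phase(tau_history):
        energy -= decay_rate(tau_history[-1], avg_followers, max_followers)
    return energy


def prune(energies, eta=0.1):
    """Keep only the events whose energy has not fallen below eta."""
    return {event_id: e for event_id, e in energies.items() if e >= eta}
```

An event that receives no tweets after entering the decay phase loses β at every slice, so it is eventually pruned and its memory is released.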

The tweets are processed as they are received, in their order of arrival: this ensures a limited use of resources in both time and memory. Figure 4 shows the number of events residing in main memory over time. Figure 5 shows the memory required by our system over time. In all test cases, the required resources are bounded within a narrow range of values: for example, from 10 December 2012 to 7 January 2013 the number of events ranges between 500 and 3500, while the memory required ranges between 0.2 MB and 1.4 MB. This ensures an approximately constant processing time of about 60 minutes for each day of scanning. Figure 6 shows the processing time for each day from 10 December 2012 to 7 January 2013. During testing, thousands of events have been identified.

IV. DATA PREPARATION

Our system is capable of analyzing a stream of tweets without any particular condition or restriction. Our algorithm processes tweets sequentially, one at a time and in real time, and no reprocessing of the data is required. The approach requires limited resources in both time and memory. In our experiments, the Twitter Spritzer stream [14] was monitored during different periods over several months, in order to verify the quality of the results obtained with our algorithm. For this purpose, we have also created a large dataset of tweets to be used for testing and tuning the algorithm¹. Table I reports the time windows analyzed. In these time intervals, we collected more than 400 million tweets, partitioned into files of ten thousand lines (about 15.5 megabytes each) stored in folders, one for each day of scanning. Each file contains one row for each tweet in JSON (JavaScript Object Notation), which stores, in addition to the text, other information such as date, type, source, user profile, location, number of followers, number of friends and number of favorites. All the information is encoded as key/value pairs. Figure 3 shows an example of a tweet as it is saved in our dataset. Different sections compose this tweet notation:

• A general part with text, date, source, id;
• Two sections, geolocation and place, with location information;
• A final section with all the user information such as id, account creation date, description, followersCount, friendsCount, profileImageURL, etc.

Some data, such as location and profile information, will be used in future work.
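A dataset organized in this way (one folder per day, one JSON document per line) can be read as a stream with a few lines of Python. The sketch below is only illustrative: the function name `iter_tweets` and the decision to skip malformed rows are assumptions, and the exact field layout of the stored documents is not specified in the text beyond the names listed above.

```python
import json
from pathlib import Path


def iter_tweets(day_dir):
    """Yield one decoded tweet (a dict) per non-empty line of every file."""
    for path in sorted(Path(day_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    yield json.loads(line)
                except json.JSONDecodeError:
                    # Assumption: malformed rows are skipped, not fatal.
                    continue
```

Because the function is a generator, each tweet is decoded, handed to the caller and then discarded, which matches the single-pass requirement stated in the introduction.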

[Fig. 4 plot: Events Number (y axis, 0 to 4000) against Time (x axis, from Mon Dec 10 21:06:42 CET 2012 to Mon Jan 07 04:50:27 CET 2013).]

Fig. 4: Trend of the number of events in main memory from 10 December 2012 to 7 January 2013

V. EXPERIMENTAL RESULTS

About 2250 hours of streaming were monitored during the tests. The tests were carried out by analyzing only the tweets in Italian. We analyzed time windows of one hour. The parameters λ1 and λ2, which weight the energy contribution of a single tweet in (3), have been set to 0.7 and 0.3 respectively. The values chosen for the three thresholds θ, η and γ are 0.9, 0.1 and 0.67 respectively. In our experiments, we have analyzed more than 1.7 million tweets. Our algorithm processes each tweet and inspects it only once.

[Fig. 5 plot: Memory (Byte) (y axis, 0 to 1400000) against Time (x axis, from Mon Dec 10 21:06:42 CET 2012 to Mon Jan 07 04:50:27 CET 2013).]

Fig. 5: Progress in bytes of memory required from 10 December 2012 to 7 January 2013

According to studies conducted by Pear Analytics [17], about 80% of the identified clusters are related to friendly chats.

¹ The dataset will be available at http://lithium.pa.icar.cnr.it/twitter dataset/.

From | To | Hours | Italian Tweets | Total Tweets
21:06 on the 10th of December 2012 | 07:50 on the 7th of January 2013 | 659 | 488220 | 118031802
19:23 on the 11th of February 2013 | 23:59 on the 20th of February 2013 | 220 | 195458 | 40664066
20:43 on the 21st of February 2013 | 19:47 on the 26th of February 2013 | 119 | 111626 | 22912291
00:00 on the 27th of February 2013 | 23:59 on the 4th of April 2013 | 960 | 713604 | 164966495
12:45 on the 7th of April 2013 | 23:59 on the 19th of April 2013 | 300 | 263912 | 54905490
Total Number: | | 2258 | 1772820 | 401480144

TABLE I: Time windows analyzed during the tests.
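The totals of Table I can be re-derived from its rows with a few lines of Python. The snippet is only a consistency check on the published figures; the variable and function names are illustrative.

```python
# Rows of Table I: (hours, Italian tweets, total tweets) per time window.
WINDOWS = [
    (659, 488220, 118031802),
    (220, 195458, 40664066),
    (119, 111626, 22912291),
    (960, 713604, 164966495),
    (300, 263912, 54905490),
]


def column_totals(rows):
    """Sum each column of the table."""
    return tuple(sum(column) for column in zip(*rows))


hours, italian, total = column_totals(WINDOWS)
italian_share = italian / total  # fraction of collected tweets that are Italian
print(hours, italian, total, f"{italian_share:.2%}")
```

The Italian tweets actually analyzed are therefore a small fraction of the collected stream, which is consistent with the "more than 1.7 million tweets" quoted in the experimental results.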

Fig. 3: Diagram of a tweet within the dataset

EARTHQUAKE SHOCK ROME ZONE SHOCK OSCILLATION PARIOLI CINECITTÀ CHANDELIERS SHOCK NAPLES ROME EARTHQUAKE WARNED PROVINCE LIGHTWEIGHT FLICKER MAGNITUDE 2KM SOURCE DEPTH FROSINONE MAGNITUDE MAGNITUDE TIVOLI EPICENTER EARTHQUAKE RICHTER SCHOCK LAZIO DEGREES ZONE STRONG EARTHQUAKE FROSINONE SCHOCK SORA WARNED CASTEL EPICENTER LIRI EARTHQUAKE RICHTER 8/5 SCOCK CIOCIARIA LAZIO MAGNITUDE VALLEY LIRI EARTHQUAKE

TABLE II: Some of the identified events for the earthquake of 16 February.

CONCLAVE DAY WAITING SMOKE HABEMUS BELLS AFTERNOON COUNTRY PAPAM ANNUNTIO VOBIS PAPAM HABEMUS MAGNUM GAUDIUM VOICES 06 CARDINALS HABEMUS PAPAM TG NINETEEN MARIO JORGE BERGOGLIO BUENOS POPE AIRES FRANCISCO ARCHBISHOP CARDINAL CONCLAVE JESUIT BERGOGLIO POPE FRANCISCO BIG CHANGE CARDINAL JESUIT ANNI MARIO JORGE BERGOGLIO FRANCISCO 76 CARDINALS POPE 266-TH HUMILITY LIVE FRANCISCOI MUCH SIMPLE COMMUNICATION POPE POPE FRANCISCO HUMBLE INNOVATIVE HUMAN

TABLE III: Some of the identified events for the election of new Pope.

MARATHON BOSTON BOMBS PHOTOFINISH LINE INJURED SOME MARATHON BOSTON BOMBS PHOTOFINISH ESPLOSIONS BOSTON FOX NEWS EXCLUDING DEAD 13 OBAMA WEAPONS BOSTON ESPLOSION TEAM CONTROLLED UNEXPLODED BOMBSQUAD RECOVERED OBAMA ESPLOSION BOSTON PRESIDENT BARACK FINISH LINE MARATHON BOSTON BOMBS PHOTOFINISH SEQUENCE BOMBS WARNED DEAD BOSTON STRONG BOMB INJURED MARATHON BOSTON BOMBS PHOTOFINISH ESPLOSIONS TENS DEAD BOSTON GARBAGE WEAPONS JFK LOCATED CONFERENCE YEARS FATHER RICHARD MARATHON BOSTON BOMBS PHOTOFINISH MARTIN CHILD FINISH

TABLE IV: Some of the identified events for Boston Marathon Attack.

The remaining 20% of the identified clusters relate to facts of public interest [17]. The charts of some of the events identified using our algorithm are shown in Figures 7, 8 and 9. Figure 7 shows an excerpt of the events related to the earthquake of magnitude 4.8 on the Richter scale that was recorded between Rome and Naples, in the province of Frosinone, at 22:16 on 16 February 2013. The earthquake was felt in a large area, from the north to the south of the province and in several municipalities, with an epicenter between the towns of Sora and Isola Liri, in the province of Rome. Our system detected a peak at 22:26, exactly 10 minutes after the first shock. Table II reports some of the identified events, labeled using the words with the highest entropy value. Figure 8 shows an excerpt of the events related to the second day of the Conclave that led to the election of Pope Francis: the black smoke at 12:00, the white smoke at 19:06 and the announcement of the name of the new Pope. Table III reports some of the identified events, labeled using the words with the highest entropy value. Figure 9 and Table IV show the trend of the main events related to the news of the two bombs that exploded at around 20:50 Italian time near the finish line of the Boston Marathon.

[Fig. 6 plot: Time Processing (minutes) (y axis, 0 to 80) against Days (x axis).]

Fig. 6: Processing time cost in minutes

VI. CONCLUSION

In this paper, we discussed a method for the detection and management of events in a stream of tweets. We have shown that, by using our system, it is possible to keep both the required memory and the execution time approximately constant. We used our system for detecting new events from over 400 million tweets with reasonable results. Our method can also be generalized to deal with other data streams in the future. Although the proposed method is good enough for a real system, there are still some aspects to improve, such as the nearest neighbor search, the filtering of identified events and usability.

[Fig. 7 plot: Event Energy (y axis, 0 to 120) against Time (x axis, from Sat Feb 16 00:28:13 CET 2013 to Sun Feb 17 23:06:25 CET 2013).]

VII. ACKNOWLEDGEMENTS

This work has been partially supported by the PON01 01687 - SINTESYS (Security and INTElligence SYSstem) Research Project.

REFERENCES

[1] "Twitter for business," Resources available at: https://business.twitter.com/.
[2] M. Cordeiro, "Twitter event detection: combining wavelet analysis and topic inference summarization," 2012.
[3] C. C. Aggarwal and P. S. Yu, "A framework for clustering massive text and categorical data streams," in Proc. SIAM Conference on Data Mining, 2006, pp. 477–481.
[4] S. Petrovic, M. Osborne, and V. Lavrenko, "Streaming first story detection with application to twitter," in Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 16, 2010, Los Angeles, 2010, pp. 181–189.
[5] R. Lu, Z. Xu, Y. Zhang, and Q. Yang, "Life activity modeling of news event on twitter using energy function," in Advances in Knowledge Discovery and Data Mining - 16th Pacific-Asia Conference, PAKDD 2012, Kuala Lumpur, Malaysia, May 29 - June 1, 2012, Proceedings, Part II. Lecture Notes in Computer Science 7302, Springer, ISBN 978-3-642-30219-0, 2012, pp. 73–84.
[6] L.-C. Hsieh, C.-W. Lee, T.-H. Chiu, and W. H. Hsu, "Live semantic sport highlight detection based on analyzing tweets of twitter," in IEEE International Conference on Multimedia Expo (ICME), 9th-13th July 2012, Melbourne, Australia, 2012, pp. 949–954.
[7] C. C. Aggarwal and K. Subbian, "Event detection in social streams," in SIAM 2012 Int Conf on Data Mining, April 27-28, 2012, Anaheim, California, USA, 2012, pp. 624–635.
[8] C.-H. Lee, H.-C. Yang, T.-F. Chien, and W.-S. Wen, "A novel approach for event detection by mining spatio-temporal information on microblogs," in International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2011, Kaohsiung, Taiwan, 25-27 July 2011. IEEE Computer Society, 2011, pp. 254–259.
[9] C.-H. Lee, T.-F. Chien, and H.-C. Yang, "An automatic topic ranking approach for event detection on microblogging messages," in IEEE Int Conf on Systems, Man, and Cybernetics, Oct 9-12, 2011, Anchorage, Alaska, 2011, pp. 1358–1363.
[10] B. Tsolmon, A.-R. Kwon, and K.-S. Lee, "Extracting social events based on timeline and sentiment analysis in twitter corpus," in 18th International Conference on Application of Natural Language to Information Systems (NLDB2013), 19-21 June 2013, University of Salford, MediaCity, UK, 2012, pp. 265–270.
[11] E. Ilina, C. Hauff, I. Celik, F. Abel, and G.-J. Houben, "Social event detection on twitter," in 12th International Conference on Web Engineering ICWE 2012, July 23-27, Berlin, Germany, 2012, pp. 169–176.
[12] P. Anantharam, K. Thirunarayan, and A. P. Sheth, "Topical anomaly detection from twitter stream," in ACM Web Science 2012, June 22-24, Evanston, IL, USA, 2012, pp. 11–14.
[13] "Twitaholic," Resources available at: http://twitaholic.com/top100/followers/.
[14] "Twitter developers: Streaming api methods," Resources available at: https://dev.twitter.com/docs/streaming-api/methods.
[15] "Internet 2012 numbers," Resources available at: http://royal.pingdom.com/2013/01/16/internet-2012-in-numbers/.
[16] "Language detection library," Resources available at: http://developer.cybozu.co.jp/oss/.
[17] "Ryanakelly: Pearanalytics - twitter study - august 2009 (2009)," Resources available at: http://www.pearanalytics.com/wpcontent/uploads/2012/12/Twitter-Study-August-2009.pdf.

Fig. 7: The earthquake of 16 February at 22:16 in the province of Frosinone

Fig. 8: Election of new Pope Francis of 16 March. (Plot of event energy over time.)

Fig. 9: Boston Marathon Attack of 15 April. (Plot of event energy over time, from Mon Apr 15 to Tue Apr 16, 2013, CEST.)
