2013 Second IIAI International Conference on Advanced Applied Informatics

Detecting Location-based Enumerating Bursts in Georeferenced Micro-Posts

Hajime Kitakami
Graduate School of Information Sciences, Hiroshima City University, Hiroshima, Japan

Keiichi Tamura
Graduate School of Information Sciences, Hiroshima City University, Hiroshima, Japan

978-0-7695-5071-8/13 $26.00 © 2013 IEEE. DOI 10.1109/IIAI-AAI.2013.36

Abstract: Nowadays, a large number of georeferenced micro-posts, i.e., short messages including location information, are posted on social media sites. People transmit and collect information over the Internet through these georeferenced micro-posts, which are usually related to not only personal topics but also local topics and events. Detecting local topics and events in georeferenced micro-posts is beneficial for many different geo-mobile application domains. Burstiness is one of the simplest and most effective criteria for extracting hot topics and events in micro-posts. In this paper, we propose a novel burst detection algorithm for detecting location-based enumerating bursts in georeferenced micro-posts. To evaluate the proposed burst detection algorithm, we used an actual set of georeferenced micro-posts consisting of tweets crawled from the Twitter site. The experimental results show that our new burst detection algorithm can detect location-based enumerating bursts.

Keywords: georeferenced micro-posts; burst detection; enumerating bursts; spatiotemporal data; topic detection and tracking.

I. INTRODUCTION

With the increasing attention being paid to social media, as well as the widespread use of smart phones, a huge number of short messages are posted on social media sites (e.g., Twitter and Facebook) through the Internet. The short messages posted on social media sites, such as tweets on the Twitter site and articles on the Facebook site, are called micro-posts [1]. A micro-post usually consists of multiple items such as posted time, geo-tags, text data, and links to images on image-sharing Web services. People transmit and collect information through these micro-posts, the contents of which are usually related to not only personal topics, but also social topics and events [2]. In other words, micro-posts posted by people through social media sites are becoming a new type of social sensor, with a collective intelligence.

In recent years, we have witnessed the emergence of a new type of micro-post, in the form of georeferenced micro-posts, which include location information (e.g., address, place name, geo-tags, and GPS data). Because of the widespread use of smart phones equipped with a GPS, as well as the increasing interest in consumer-generated media, a large number of georeferenced micro-posts are posted by people on social media sites [3], [4]. This development presents us with new research challenges in different application domains, because georeferenced micro-posts are becoming the medium that has the most influence on people.

The georeferenced micro-posts on social media sites are usually arranged in temporal order, and hence are represented as a document stream. Burstiness is one of the simplest and most effective criteria for extracting hot topics and events in a document stream. The number of micro-posts related to a topic or an event increases gradually as the topic or the event attracts the interest of more people, and conversely, gradually decreases when interest in the topic or the event diminishes.

The most well-known burst detection algorithm is Kleinberg’s burst detection algorithm [5]. In [5], enumerating bursts, which are the bursts of discrete series in a batch document stream, are defined. The batch document stream is represented as a discrete series of document sets. When a keyword related to an attention-attracting topic or event becomes highly bursty in the i-th element of a discrete series, the number of documents including the keyword in the i-th document set becomes large during a particular discrete period. Kleinberg’s burst detection algorithm is designed to find the discrete periods in which the number of documents including a keyword is larger than usual.

Kleinberg’s burst detection algorithm is the simplest but most useful algorithm for detecting enumerating bursts in batch document streams. However, it does not consider users’ locations. Suppose that the keyword “snow” is bursty in a particular area, “A.” Then, “snow” is a hot topic near area “A.” However, it is not a useful topic for users located far away from “A.” In this case, we need to detect burstiness considering the users’ locations. Whereas the keyword “snow” should be presented as a highly bursty topic for users close to area “A,” it should be presented as not highly bursty for users far away from that region.

In this paper, a novel algorithm for detecting location-based enumerating bursts in georeferenced micro-posts is proposed. The key contributions of this study are as follows:

• To allow easy handling of a batch document stream that consists of georeferenced micro-posts, a new document stream model is defined as a georeferenced batch document stream (GBDS). In a GBDS, georeferenced micro-posts are treated as both a batch document stream and geographical document sets.

• To detect location-based enumerating bursts in a GBDS, we extend Kleinberg’s model. In our extension, the influence rate of a georeferenced document, which is determined by the distance between a user and the georeferenced document, is integrated into Kleinberg’s model. The number of georeferenced documents and the number of relevant georeferenced documents are adjusted to reflect their influence rates in order to vary burstiness according to users’ locations.

• To evaluate the new burst detection algorithm, an actual GBDS composed of tweets crawled from the Twitter site was used. The experimental results show that the algorithm can detect location-based enumerating bursts by considering users’ locations. Moreover, the proposed algorithm can recognize bursts with consideration for the regional dispersion of the number of posted georeferenced micro-posts.
Figure 1. Example of batch document stream.

The rest of the paper is organized as follows: In Section 2, related work is reviewed. In Section 3, brief explanations of batch document streams, bursts, and Kleinberg’s burst detection algorithm are presented. In Section 4, we explain the problem definition of location-based burst detection and the proposed burst detection algorithm. In Section 5, the experimental results are discussed. In Section 6, we present our conclusions.

II. RELATED WORK

Burst detection in document streams has attracted many researchers, because burstiness is the simplest but most effective criterion for topic detection and tracking. The algorithm that has had the most significant impact on many studies is Kleinberg’s burst detection algorithm [5], which is based on queuing theory for detecting bursty network traffic. Kleinberg’s burst detection algorithm is used for analyzing document streams from various sources, such as e-mail [5], blogs [6], online publications [7], bulletin boards, and social tags [8]. Moreover, in some studies Kleinberg’s burst detection algorithm was extended. In particular, Qi He et al. [9] proposed a clustering algorithm for documents in a document stream that uses bursty feature representation as a feature vector for clustering. Leskovec et al. [10] formulated memes as patterns of words using a scalable clustering approach. These algorithms are beneficial for extracting events and topics; however, they cannot recognize bursts related to location.

Recently, geographical topic detection and tracking [11], [12], [13] have been attracting increasing research attention. Sakaki et al. [11] focused on tweets about typhoons and earthquakes to estimate a typhoon’s trajectory and an earthquake’s epicenter. Yin et al. [12] proposed a method to discover different topics in geographical regions. Furthermore, Yang et al. [13] developed a method to reveal the appearance and disappearance of topics in different regions.

There are numerous studies on burst detection and on geographical topic detection and tracking. However, to the best of our knowledge, no study has yet attempted to detect location-based enumerating bursts. In our previous work [14], we proposed a location-based burst detection algorithm for a continuous time space. That algorithm can extract bursts related to location; however, it does not take into account the regional difference in the number of posted micro-posts, and it cannot detect location-based enumerating bursts.

III. ENUMERATING BURSTS DETECTION

This section presents the definitions of a batch document stream and enumerating bursts, and briefly explains Kleinberg’s burst detection algorithm.

A. Batch Document Stream

A batch document stream is similar to a data stream. It is defined as a sequence of sets of documents arranged in order. Suppose that there are $n$ sets of documents, $BDS = \{D_1, D_2, \cdots, D_n\}$. Each set of documents is posted discretely, and the time interval between $D_t$ and $D_{t+1}$ is not continuous. Examples of a batch document stream include, but are not limited to, conference papers on publication sites. Moreover, tweets posted on the Twitter site at every fixed time interval (e.g., per hour, day, or month) are referred to as a batch document stream. Figure 1 shows an example of a batch document stream, where the sets of documents are posted in order. In this example, $BDS = \{D_1, D_2, \cdots, D_5\}$ and each batch arrives per day.

B. Enumerating Burst

Enumerating bursts identify bursts for a periodical time interval, such as per hour, day, or month. The number of documents that include some particular keywords related to a topic or an event increases gradually when the topic or event attracts the interest of many people. As interest in a topic or an event grows, the number of documents in the batch document stream that include a related keyword becomes large. The keyword is considered highly bursty during a period in which the number of documents including the keyword is larger than usual.

C. Kleinberg’s Burst Detection Algorithm

Let $R_i$ be the set of relevant documents in $D_i$. The numbers of documents in $D_i$ and $R_i$ are denoted by $nd_i$ and $nr_i$, respectively. The sequence of the number of documents is $nds = (nd_1, nd_2, \cdots, nd_n)$ and the sequence of the number of relevant documents is $nrs = (nr_1, nr_2, \cdots, nr_n)$. Let the total number of documents and that of relevant documents be denoted by $ND = \sum_{t=1}^{n} nd_t$ and $NR = \sum_{t=1}^{n} nr_t$, respectively. For example, in Figure 1, $nds = (2, 2, 5, 4, 3)$ and $nrs = (0, 0, 1, 3, 2)$.

Kleinberg defined a model with an infinite-state automaton in which bursts are represented as state transitions. Assuming that there are $m$ states in the automaton, the number of documents and the number of relevant documents are probabilistic outputs that depend on the internal states of the automaton. The problem is defined as finding an optimal state-transition sequence $s = (s_1, s_2, \cdots, s_n)$ that minimizes the cost function

$$C(s \mid nds, nrs) = \sum_{i=1}^{n-1} \tau(s_i, s_{i+1}) + \sum_{i=1}^{n} \left(-\ln \sigma_{s_i}(nd_i, nr_i)\right). \tag{1}$$

The function $\tau(i, j)$ returns the state-transition cost from state $i$ to state $j$. It is defined as

$$\tau(i, j) = \begin{cases} (j - i)\gamma, & \text{if } j > i, \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

where $\gamma\,(> 0)$ is a user-given parameter. Equation 2 indicates that moving to a higher state incurs a cost and moving to a lower state incurs no cost. The function $\sigma_k(nd_i, nr_i)$ is the probability density function (binomial in form) for outputting $nd_i$ and $nr_i$ in state $k$, and is defined as

$$\sigma_k(nd_i, nr_i) = \binom{nd_i}{nr_i} p_k^{nr_i} (1 - p_k)^{nd_i - nr_i}, \tag{3}$$

where $p_k$ is the arrival rate of relevant documents associated with state $k$ and is defined as

$$p_k = \frac{NR}{ND}\,\beta^{k}, \tag{4}$$

where $\beta\,(> 1.0)$ is a user-given parameter. Equation 4 indicates that a higher state has a higher rate of relevant documents.

The Viterbi algorithm for hidden Markov models, which is a dynamic programming approach, is the most effective solution for determining an optimal state-transition sequence $s = (s_1, s_2, \cdots, s_n)$ that minimizes Equation 1. First, we calculate the cost $C_j(i)$:

$$C_j(i) = -\ln \sigma_j(nd_i, nr_i) + \min_{l}\left(C_l(i-1) + \tau(l, j)\right), \tag{5}$$

where $C_j(i)$ is the minimum cost of a state-transition sequence that ends with state $j$ at the $i$-th time interval in the document stream. Equation 5 can be calculated using the previous costs $C_l(i-1)$ $(0 \le l \le m-1)$. Second, we find the minimum cost among $C_j(n)$ $(0 \le j \le m-1)$; let this minimum be $C_{min}(n)$. Finally, we trace back with $C_{min}(n)$ as the starting point.

IV. LOCATION-BASED ENUMERATING BURST DETECTION

This section presents the problem definition and a novel location-based burst detection method.

A. Data Model

A georeferenced document $j$ in the $i$-th document set includes location information as well as text data. In this study, a georeferenced document $gd_{i,j}$ consists of three items: batch number, text data, and location information, $gd_{i,j} = \langle i, text_{i,j}, l_{i,j} \rangle$, where $text_{i,j}$ is the text data (e.g., title, posted short message, and tags) and $l_{i,j}$ is the location where $gd_{i,j}$ was created or is located (i.e., the latitude and longitude). A georeferenced micro-post is referred to as a georeferenced document.

A georeferenced batch document stream is a batch document stream in which each document set consists of georeferenced documents. Suppose that there are $n$ georeferenced document sets in a GBDS. Let $GD_i$ be the $i$-th georeferenced document set in the GBDS: $GD_i = \{gd_{i,1}, gd_{i,2}, \cdots, gd_{i,numd(i)}\}$, where $numd(i)$ is the number of documents in it. Figure 2 shows an example of a GBDS comprising five georeferenced document sets. The georeferenced document $gd_{2,3}$ is in the second document set, which is represented as May 3rd’s posts and has a location on the geographical coordinate space.

B. Problem Definition

Let $GR_i$ be the $i$-th georeferenced relevant document set in a GBDS, in which each georeferenced document’s text data include the keyword: $GR_i = \{gr_{i,1}, gr_{i,2}, \cdots, gr_{i,numr(i)}\}$, where $numr(i)$ is the number of documents in the $i$-th georeferenced relevant document set. The set of relevant georeferenced documents $GR_i$ is a subset of $GD_i$. Each relevant document is identified with its position in $GD_i$ by the mapping

$$\phi_i : GR_i \to GD_i, \quad gr_{i,j} \mapsto gd_{i,\phi_i(j)}. \tag{6}$$
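The data model of Section IV-A and the mapping $\phi_i$ of Equation 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `GeoDoc`, the helper `relevant_with_phi`, the case-insensitive substring match used to decide relevance, and the four toy documents are all hypothetical and do not come from the paper's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoDoc:
    """A georeferenced document gd_{i,j}: batch number, text, and location."""
    batch: int   # i, the index of the document set GD_i
    text: str    # text_{i,j}
    lat: float   # latitude part of l_{i,j}
    lon: float   # longitude part of l_{i,j}


def relevant_with_phi(gd_i, keyword):
    """Return (GR_i, phi_i) for one document set.

    phi_i lists, for each relevant document gr_{i,j}, its 1-based position
    in GD_i. Relevance is a plain substring test here (an assumption).
    """
    phi = [j for j, d in enumerate(gd_i, start=1) if keyword in d.text.lower()]
    gr = [gd_i[j - 1] for j in phi]
    return gr, phi


# Made-up document set with four documents, three of which mention "snow".
gd_4 = [
    GeoDoc(4, "Heavy snow this morning", 35.68, 139.69),
    GeoDoc(4, "Lunch at the station", 34.69, 135.50),
    GeoDoc(4, "snow is piling up", 35.18, 136.91),
    GeoDoc(4, "Snow festival photos", 43.06, 141.35),
]
gr_4, phi_4 = relevant_with_phi(gd_4, "snow")
print(len(gd_4), len(gr_4), phi_4)  # prints: 4 3 [1, 3, 4]
```

Storing $\phi_i$ as a list of positions keeps $GR_i$ tied to $GD_i$, so any per-document quantity computed over $GD_i$ can later be looked up for the relevant documents without recomputation.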

Figure 2. Example of a georeferenced document stream.

In Figure 2, there are four georeferenced documents in $GD_4$ and three georeferenced relevant documents in $GR_4$. In this example, $gr_{4,1} = gd_{4,1}$, $gr_{4,2} = gd_{4,2}$, and $gr_{4,3} = gd_{4,4}$. Therefore, $\phi_4(1) = 1$, $\phi_4(2) = 3$, and $\phi_4(3) = 4$. The sequence of the number of georeferenced documents is $ngds = (ngd_1, ngd_2, \cdots, ngd_n)$ and the sequence of the number of georeferenced relevant documents is $ngrs = (ngr_1, ngr_2, \cdots, ngr_n)$. Let the total number of georeferenced documents and that of georeferenced relevant documents be denoted by $NGD = \sum_{t=1}^{n} ngd_t$ and $NGR = \sum_{t=1}^{n} ngr_t$, respectively.

The goal of this study is to detect the location-based burst by considering the user’s direction, which varies according to the user’s location $ul$. The problem is to find an optimal state-transition sequence $s = (s_1, s_2, \cdots, s_n)$ that minimizes the cost $C(s \mid ngds, ngrs)$ associated with $ul$:

$$C(s \mid ngds, ngrs) = \sum_{i=1}^{n-1} \tau(s_i, s_{i+1}) + \sum_{i=1}^{n} \left(-\ln \sigma_{s_i}(ngd_i, ngr_i)\right), \tag{7}$$

where

$$p_k = \frac{NGR}{NGD}\,\beta^{k}. \tag{8}$$

For instance, in Figure 2, User1 is located close to the georeferenced relevant documents that are in bursty discrete periods. For User1, we need to show that the keyword is highly bursty. Conversely, User2 is far away from the georeferenced relevant documents that are in bursty discrete periods. For User2, we need to show that the keyword is not highly bursty.

C. Influence Rate

To consider users’ locations, we integrate the influence rate of a georeferenced document, which is determined by its distance from the user, into Kleinberg’s burst detection algorithm. In this study, the forgetting theory [15] is used to calculate the influence rate. Georeferenced documents gradually lose their weight (or memory) as their distance from the user increases. The influence rate of a georeferenced document $gd_{i,j}$ is defined as

$$infr_{i,j} = \alpha^{\,calc\_distance(l_{i,j},\, ul)}, \tag{9}$$

where $\alpha\,(< 1.0)$ is a user-given parameter. For $0 < \alpha < 1$, the influence rate equals 1 at distance zero and decreases exponentially as the distance grows.

D. Algorithm

The sequence of the influence rates for $GD_i$ is

$$INFR_i = \{infr_{i,1}, infr_{i,2}, \cdots, infr_{i,|GD_i|}\}. \tag{10}$$

In the proposed algorithm, we replace $ngd_i$ and $ngr_i$ by the total amount of their influence rates:

$$ngd_i = \sum_{j=1}^{|GD_i|} infr_{i,j}, \qquad ngr_i = \sum_{j=1}^{|GR_i|} infr_{i,\phi_i(j)}. \tag{11}$$

The steps of the proposed burst detection algorithm are as follows.

(1) For each $GD_i$ in a GBDS, the following sub-steps are executed.
  (a) The sequence of the influence rates $INFR_i$ for $GD_i$ is created by calculating the influence rates of the documents in $GD_i$.
  (b) Then, $ngd_i$ and $ngr_i$ are calculated using $INFR_i$.
(2) The sequences $ngds$ and $ngrs$ are created using all the elements $ngd_i$ and $ngr_i$, respectively.
(3) We find an optimal state-transition sequence $s = (s_1, s_2, \cdots, s_n)$ that minimizes $C(s \mid ngds, ngrs)$ using the Viterbi algorithm for hidden Markov models.

Figure 3. Map of Japan.

V. EXPERIMENTS

We collected geo-tagged tweets from the Twitter site using its API. To evaluate the location-based burst detection algorithm that considers the user’s location, we used an actual GBDS composed of these crawled geo-tagged tweets. The number of collected tweets was 480,000. The tweets were posted from December 2011 to February 2012 (UTC). In the experiments, we used the keyword “snow,” which had the highest score in the tf*idf results. The five major cities of
Japan, Fukuoka, Osaka, Nagoya, Tokyo, and Sapporo were set as the users’ positions (Figure 3). Figures 4(a), 4(b), and 4(c) are the results at Fukuoka. Figures 4(d), 4(e), and 4(f) are the results at Osaka. Figures 4(g), 4(h), and 4(i) are the results at Nagoya. Figures 4(j), 4(k), and 4(l) are the results at Tokyo. Figures 4(m), 4(n), and 4(o) are the results at Sapporo. These graphs show the burstiness from 1-Jan to 15-Feb (UTC).

Figure 4. Results of keyword “snow.” Panels: (a)-(c) Fukuoka, (d)-(f) Osaka, (g)-(i) Nagoya, (j)-(l) Tokyo, (m)-(o) Sapporo. For each city, the three panels cover 1-Jan to 15-Jan, 16-Jan to 31-Jan, and 1-Feb to 15-Feb, in that order.

On January 19 and 20, the Tokyo metropolitan region had heavy snow. Tokyo was highly bursty, as shown in Figure 4(k). Fukuoka, Osaka, and Nagoya did not have snow, and therefore, bursts did not appear. Sapporo had snow, and therefore, Sapporo was a little bursty. On January 23 and 24, the Tokyo metropolitan region had heavy snow and Osaka had its first snow. Tokyo was highly bursty, as shown in Figure 4(k), and Osaka was highly bursty, as shown in Figure 4(e). On January 26, the Kinki and Chubu regions, in which Osaka and Nagoya are located, respectively, had heavy snow. Figure 4(e) and Figure 4(h) show a high level of burstiness on January 26. On February 1 and 2, many regions, except for the Tokyo metropolitan region and Hokkaido region, had snow. Therefore, the burstiness levels of Fukuoka, Osaka, and Nagoya are high.

Figure 5 shows the data plots on the map. From January 19 to January 20, there are many georeferenced relevant micro-posts that include “snow” in their text data in the Tokyo metropolitan region. This result agrees with the results of the graph of burstiness. From January 23 to 24, there are many georeferenced relevant micro-posts in the Tokyo metropolitan region and Osaka. This result also agrees with the result of the graph of burstiness. From January 25 to 26, there are many georeferenced relevant micro-posts in the Kinki and Chubu regions, in which Osaka and Nagoya are located, respectively. This result also agrees with the result of the graph of burstiness. From February 1 to 2, there are many georeferenced relevant micro-posts in many regions except for the Tokyo metropolitan region and Hokkaido region. Therefore, the burstiness levels for Fukuoka, Osaka, and Nagoya are high.

Figure 5. Data plots of georeferenced relevant micro-posts. Panels: (a) from 19-Jan to 20-Jan, (b) from 23-Jan to 24-Jan, (c) from 25-Jan to 26-Jan, (d) from 1-Feb to 2-Feb.

In Sapporo, from February 2 to 12, the results show that the level of extracted burstiness is high. This is because a snow festival was being held in Sapporo, and some georeferenced relevant micro-posts were posted. The number of georeferenced micro-posts posted in Sapporo is smaller than that in other regions. However, because our proposed burst detection algorithm integrates the influence rate of each document, it can extract this event with high burstiness.
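The location dependence discussed above can be reproduced on toy data. The following Python sketch strings together the influence rate (Equation 9), the weighted counts (Equation 11), and the dynamic-programming minimization of Equation 7 via the recurrence of Equation 5. It is a minimal illustration, not the implementation used in the experiments, and it rests on several assumptions that the paper does not state: `calc_distance` is taken to be the great-circle distance in kilometres, $\alpha = 0.99$, $m = 2$ states, $\beta = 2$, $\gamma = 1$, and $p_k$ is clamped just below 1. The binomial coefficient of Equation 3 is omitted because it is identical for every state at a given step and therefore cannot change which state sequence is cheapest. All function names, coordinates, and post counts are made up.

```python
import math

TOKYO, SAPPORO = (35.68, 139.69), (43.06, 141.35)  # approximate (lat, lon)


def haversine_km(a, b):
    """Great-circle distance in km; stands in for calc_distance (assumption)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def weighted_counts(batches, user, alpha=0.99):
    """Eqs. 9 and 11: replace counts by sums of alpha ** distance.

    Each batch is a list of ((lat, lon), is_relevant) pairs.
    """
    ngds, ngrs = [], []
    for batch in batches:
        rates = [(alpha ** haversine_km(loc, user), rel) for loc, rel in batch]
        ngds.append(sum(r for r, _ in rates))
        ngrs.append(sum(r for r, rel in rates if rel))
    return ngds, ngrs


def detect_states(ngds, ngrs, m=2, beta=2.0, gamma=1.0):
    """Minimise Eq. 7 with the recurrence of Eq. 5; returns one state per batch."""
    if sum(ngrs) == 0:
        return [0] * len(ngds)  # no relevant weight, nothing to detect
    p0 = sum(ngrs) / sum(ngds)
    # Eq. 8, clamped below 1 so that log(1 - p) stays finite (assumption).
    p = [min(p0 * beta ** k, 1 - 1e-9) for k in range(m)]

    def emit(k, nd, nr):
        # -ln sigma_k without the state-independent binomial coefficient.
        return -(nr * math.log(p[k]) + (nd - nr) * math.log(1 - p[k]))

    def tau(i, j):
        return (j - i) * gamma if j > i else 0.0  # Eq. 2

    cost = [emit(k, ngds[0], ngrs[0]) for k in range(m)]
    back = []
    for nd, nr in zip(ngds[1:], ngrs[1:]):
        prev = [min(range(m), key=lambda l: cost[l] + tau(l, j)) for j in range(m)]
        cost = [cost[prev[j]] + tau(prev[j], j) + emit(j, nd, nr) for j in range(m)]
        back.append(prev)
    state = min(range(m), key=cost.__getitem__)
    path = [state]
    for prev in reversed(back):  # trace back from the cheapest end state
        state = prev[state]
        path.append(state)
    return path[::-1]


def toy_batch(sapporo_relevant):
    """Made-up batch: 20 Tokyo posts (2 relevant) and 5 Sapporo posts."""
    tokyo = [(TOKYO, j < 2) for j in range(20)]
    sapporo = [(SAPPORO, sapporo_relevant)] * 5
    return tokyo + sapporo


# Six batches; the Sapporo posts mention the keyword only in batches 2 and 3.
batches = [toy_batch(i in (2, 3)) for i in range(6)]
for name, user in (("Sapporo", SAPPORO), ("Tokyo", TOKYO)):
    print(name, detect_states(*weighted_counts(batches, user)))
```

With these made-up numbers, the user placed in Sapporo is assigned the state sequence 0, 0, 1, 1, 0, 0, whereas the user placed in Tokyo stays in state 0 throughout, even though Sapporo contributes far fewer posts per batch than Tokyo. The choice of $\alpha$ and of the distance unit controls how quickly remote documents stop contributing, so both would need tuning for real data.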

VI. CONCLUSION

This study focused on a batch document stream that consists of georeferenced micro-posts. We call this type of document stream a georeferenced batch document stream (GBDS). We proposed a novel algorithm for detecting location-based enumerating bursts in a GBDS that considers the users’ locations. To detect location-based enumerating bursts in a GBDS, we extended Kleinberg’s model. In our extension, the influence rate of a georeferenced document, which is determined by the distance between a user and the georeferenced document, is integrated into Kleinberg’s model. To evaluate the new location-based burst detection algorithm, we used an actual GBDS composed of tweets crawled from the Twitter site. The experimental results showed that the proposed algorithm can detect location-based enumerating bursts that vary with the users’ locations. In future work, we intend to conduct more performance evaluations and comparisons between our results and those of other studies.

REFERENCES

[1] D. Reis, F. Goldstein, and F. Quintao, “Extracting unambiguous keywords from microposts using web and query logs data,” in Making Sense of Microposts (at WWW 2012), 2012.

[2] A. Java, X. Song, T. Finin, and B. Tseng, “Why we twitter: understanding microblogging usage and communities,” in Proceedings of WebKDD/SNA-KDD ’07, 2007, pp. 56–65.

[3] J. Chon and H. Cha, “Lifemap: A smartphone-based context provider for location-based services,” IEEE Pervasive Computing, vol. 10, no. 2, pp. 58–67, Apr. 2011.

[4] M. Naaman, “Geographic information from georeferenced social media data,” SIGSPATIAL Special, vol. 3, no. 2, pp. 54–61, Jul. 2011.

[5] J. M. Kleinberg, “Bursty and hierarchical structure in streams,” in Proceedings of SIGKDD’02, 2002, pp. 91–101.

[6] R. Kumar, J. Novak, P. Raghavan, and A. Tomkins, “On the bursty evolution of blogspace,” in Proceedings of WWW’03, 2003, pp. 568–576.

[7] K. K. Mane and K. Börner, “Mapping topics and topic bursts in PNAS,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, suppl. 1, pp. 5287–5290, 2004.

[8] J. Yao, B. Cui, Y. Huang, and X. Jin, “Temporal and social context based burst detection from folksonomies,” in Proceedings of AAAI 2010, 2010.

[9] Q. He, K. Chang, E.-P. Lim, and J. Zhang, “Bursty feature representation for clustering text streams,” in Proceedings of the Seventh SIAM International Conference on Data Mining, 2007.

[10] J. Leskovec, L. Backstrom, and J. Kleinberg, “Meme-tracking and the dynamics of the news cycle,” in Proceedings of SIGKDD’09, 2009, pp. 497–506.

[11] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes twitter users: real-time event detection by social sensors,” in Proceedings of WWW ’10, 2010, pp. 851–860.

[12] Z. Yin, L. Cao, J. Han, C. Zhai, and T. Huang, “Geographical topic discovery and comparison,” in Proceedings of WWW’11, 2011, pp. 247–256.

[13] H. Yang, S. Chen, M. R. Lyu, and I. King, “Location-based topic evolution,” in Proceedings of MLBS ’11, 2011, pp. 89–98.

[14] K. Tamura and H. Kitakami, “Location-based burst detection algorithm in spatiotemporal document stream,” in Proceedings of DMIN’12, 2012, pp. 195–201.

[15] Y. Ishikawa, Y. Chen, and H. Kitagawa, “An on-line document clustering method based on forgetting factors,” in Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries, 2001, pp. 325–339.