An Efficient Online Event Detection Method for Microblogs via User Modeling

Motivation
▶ Detecting events in microblogs is important but still challenging.
  • The tweet stream is a mixture of user interests and external events, which are difficult to distinguish.
  • Existing methods are ineffective because they either ignore user interests or model interests and events only on a fixed dataset, which does not scale.
▶ We introduce an online learning model, the User Modeling Based Interest and Event Topic Model (UMIETM).
  • It exploits user profiles to discover events by filtering out user interest-related tweets, and it treats the arriving data as a stream so that detection runs in an online learning style.

Online Learning
▶ Gibbs Sampling
  • 1st phase: sample the user profile’s hidden topic $s_{un}$.
  • 2nd phase: jointly sample the tweet’s hidden topic $z_{ud}$ and $y_{ud}$.
▶ Batch Learning is Expensive: $O(I_1 K |P| + I_2 (K + E) |W|)$
  • $I_1$: number of iterations of the first sampling phase.
  • $I_2$: number of iterations of the second sampling phase.
  • $K$, $E$: number of user interest-related topics and number of events in each time window.
  • $|P|$: total number of user profile tokens; $|W|$: total number of tweet tokens.
▶ Online Learning
  • Reuse the word-topic distribution from the last time window.
  • Smooth for new words.
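The two online learning ideas (reusing the word-topic distribution from the last time window, and smoothing for new words) can be illustrated with a minimal sketch. The poster does not give the update rule, so everything here is an assumption: the function names `build_prior` and `word_given_topic`, the damping factor `weight`, and the pseudo-count `beta` are hypothetical and chosen only for illustration.

```python
def build_prior(prev_counts, vocab_now, num_topics, weight=0.5, beta=0.01):
    """Return per-word topic pseudo-counts for the current time window.

    prev_counts: {word: [count per topic]} kept from the previous window.
    vocab_now:   iterable of words observed in the current window.
    weight, beta: hypothetical knobs, not values taken from the poster.
    """
    prior = {}
    for word in vocab_now:
        if word in prev_counts:
            # Reuse the previous window's word-topic statistics, damped by `weight`.
            prior[word] = [beta + weight * c for c in prev_counts[word]]
        else:
            # Smoothing for a new word: a flat pseudo-count over all topics.
            prior[word] = [beta] * num_topics
    return prior


def word_given_topic(prior, topic):
    """Normalize pseudo-counts into p(word | topic) over the current vocabulary."""
    total = sum(counts[topic] for counts in prior.values())
    return {word: counts[topic] / total for word, counts in prior.items()}


# Toy usage: "reed" did not appear in the previous window.
prev = {"earthquake": [0.0, 8.0], "football": [6.0, 0.0]}
prior = build_prior(prev, ["earthquake", "football", "reed"], num_topics=2)
dist = word_given_topic(prior, topic=1)
```

In this sketch a new word starts with a small but non-zero probability under every topic, so the sampler for the current window can still assign it somewhere.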
Relationship between Profile and User Interests
▶ A user’s profile is highly correlated with the user’s interests.
(c) online processing
(d) smoothing for new words
Complexity: $O(I_1 K |P| + I_2 (K + E) |W_t|)$, where $|W_t| \ll |W|$
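To make the saving concrete, the sketch below plugs token counts from the poster's dataset statistics (whole year versus week 2) into the cost expression. The values of `I1`, `I2`, `K` and `E` are placeholders chosen for illustration only, since the poster does not report them, and `|P|` is kept the same in both cases because the expression above leaves it unchanged.

```python
def sampling_cost(i1, i2, k, e, profile_tokens, tweet_tokens):
    """Evaluate I1*K*|P| + I2*(K+E)*|W| as an operation count (constants ignored)."""
    return i1 * k * profile_tokens + i2 * (k + e) * tweet_tokens


# Hypothetical hyperparameters, for illustration only.
I1, I2, K, E = 100, 100, 50, 10

# Token counts from the dataset statistics: whole year vs. week 2.
P_TOKENS = 1_470_080
W_ALL = 251_686_571
W_WEEK2 = 3_679_979

batch = sampling_cost(I1, I2, K, E, P_TOKENS, W_ALL)
online = sampling_cost(I1, I2, K, E, P_TOKENS, W_WEEK2)
print(f"batch/online cost ratio: {batch / online:.1f}")
```

The ratio depends on the placeholder hyperparameters, but the tweet-token term dominates in both cases, so shrinking it to one window accounts for most of the difference.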
Experiments
▶ Weibo Dataset
(a) Andrew Ng (Computer Scientist)
(b) Van Persie (Soccer Player)
▶ Users who have similar profiles also have similar interests.
  • 46.2% of #MachineLearning users also choose the #DataMining tag in their profiles.
▶ Users’ interests can be enriched by their followees’ profiles.
  • e.g. Biz Stone (Co-founder of Twitter, Medium, and now
▶ From Jan 2012 to Dec 2012
▶ Split the dataset by week
▶ Segment Chinese words
▶ Remove stop words and low-frequency words
▶ Remove tweets with fewer than 3 tokens
Table: statistics of the processed dataset
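The filtering steps of the preprocessing listed above can be sketched as follows. This is only an illustration under stated assumptions: tweets are taken to be already segmented into word tokens by an external tool, the function name `preprocess` and the frequency threshold `min_word_freq` are hypothetical, and the order of the steps is a guess since the poster does not specify it.

```python
from collections import Counter


def preprocess(tweets, stop_words, min_word_freq=2, min_tokens=3):
    """Filter already-segmented tweets (each a list of word tokens).

    min_word_freq is a hypothetical threshold; min_tokens=3 follows the
    rule of removing tweets with fewer than 3 tokens.
    """
    # Drop stop words first, then count what remains across the corpus.
    no_stop = [[w for w in tweet if w not in stop_words] for tweet in tweets]
    freq = Counter(w for tweet in no_stop for w in tweet)
    # Drop low-frequency words.
    kept = [[w for w in tweet if freq[w] >= min_word_freq] for tweet in no_stop]
    # Drop tweets that end up with fewer than min_tokens tokens.
    return [tweet for tweet in kept if len(tweet) >= min_tokens]


tweets = [
    ["the", "earthquake", "hit", "japan", "today"],
    ["earthquake", "japan", "hit", "rare"],
    ["the", "hello"],
]
cleaned = preprocess(tweets, stop_words={"the"})
print(cleaned)
```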
  • The profiles of users and their followees reflect users’ interests more stably than tweets do, whereas external events draw global attention for a short time.
(An example of a generated user profile and tweets)
(Graphical model figure. Recoverable node and plate labels: α, θ_u, s_un, p_un, z_ud, w_udn, γ, π_u, y_ud, t_ud, x_udn; plates P_u, N_ud, D_u, U, K, E, T.)

(User profile tokens) Biz Stone @biz Co-founder of Twitter, Medium, and now Co-founder and CEO of Askjelly.com. Following 696 accounts.
User Modeling Information: Social Media, Business, Technology

(User interest related microblog) Biz Stone @biz - Feb 24 Today’s Suggestion: Let’s not call people who use social media “users,” let’s call them “participants.” It’s nicer and more accurate.

(Event related microblog) Biz Stone @biz - Mar 14 Remember @SXSW, “Ballroom D is the place to be!” at 12:30 pm. It’s a big room so bring some friends too. #NextSearch
▶ Generative process
  • The user draws a hidden profile topic $s_{un}$ from the user-interest distribution $\theta_u$, then generates the profile token $p_{un}$ from the multinomial distribution $\psi_{s_{un}}$.
  • The user decides whether to post an interest-related tweet or an event-related tweet according to the switcher $y_{ud}$.
  • If $y_{ud} = 0$, the tweet’s hidden topic $z_{ud}$ is drawn uniformly from the user’s profile hidden topics $\{s_{u1}, \cdots, s_{un}\}$, and the tweet tokens are generated from the word-interest distribution $\varphi_{z_{ud}}$.
  • If $y_{ud} = 1$, the tweet’s hidden topic $z_{ud}$ is drawn from the event-time window distribution $\eta_t$, and the tweet tokens are generated from the word-event distribution $\phi_{t, z_{ud}}$.
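The generative process above can be written as a small sampler. This is a sketch under stated assumptions, not the authors' code: the names `draw` and `generate_user` are hypothetical, every distribution is a plain dict from outcome to probability, the switcher is modeled as a Bernoulli draw with a made-up parameter `pi_u`, and all tweets get the same hypothetical length `tweet_len`.

```python
import random


def draw(dist, rng):
    """Draw one key from a {outcome: probability} dict."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]


def generate_user(theta_u, psi, phi_interest, eta_t, phi_event, pi_u,
                  n_profile, n_tweets, tweet_len, rng):
    """Sample one user's profile tokens and tweets.

    theta_u:      user-interest distribution over interest topics.
    psi[k]:       profile-token distribution of interest topic k.
    phi_interest: word-interest distributions, one per interest topic.
    eta_t:        distribution over events in the current time window.
    phi_event:    word-event distributions, one per event in this window.
    pi_u:         assumed probability that a tweet is event-related.
    """
    # Profile: draw a hidden topic s_un, then a profile token p_un.
    s = [draw(theta_u, rng) for _ in range(n_profile)]
    profile = [draw(psi[s_un], rng) for s_un in s]

    tweets = []
    for _ in range(n_tweets):
        y = 1 if rng.random() < pi_u else 0  # switcher y_ud
        if y == 0:
            z = rng.choice(s)  # uniform over the user's profile topics
            words = [draw(phi_interest[z], rng) for _ in range(tweet_len)]
        else:
            z = draw(eta_t, rng)  # event topic of this time window
            words = [draw(phi_event[z], rng) for _ in range(tweet_len)]
        tweets.append((y, z, words))
    return profile, tweets


# Toy usage with made-up distributions.
rng = random.Random(0)
theta = {"tech": 0.7, "sports": 0.3}
psi = {"tech": {"founder": 1.0}, "sports": {"fan": 1.0}}
phi_i = {"tech": {"startup": 0.5, "app": 0.5}, "sports": {"goal": 1.0}}
eta = {"quake": 1.0}
phi_e = {"quake": {"earthquake": 0.6, "japan": 0.4}}
profile, tweets = generate_user(theta, psi, phi_i, eta, phi_e, 0.3, 5, 4, 6, rng)
```

Drawing `z` with `rng.choice(s)` means a topic that occurs more often among the profile tokens is proportionally more likely to drive an interest-related tweet, which is how the profile constrains the tweet topics in this sketch.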
▶ Effectiveness
  • Example events detected by UMIETM

Time window               Top words of example events
The first week of 2012    Japan, earthquake, occur, the first day, January, 7.0, 2012
                          New, year, happy, 2012, New Year’s Day, healthy, blessing, happiness
Co-founder and CEO of Askjelly.com) follows 60 founders, 27 CEOs, 21 Google-related, and 9 Medium-related accounts, etc.
User Modeling Based Interest and Event Topic Model
▶ Main idea
             #user     #profile token   #tweet       #tweet token
whole year   252,369   1,470,080        16,421,167   251,686,571
week1        9,785     73,307           31,503       440,217
week2        29,721    222,280          242,554      3,679,979
week3        30,891    231,042          254,698      3,881,633
The second week of 2012   reed, steel, appearance, engineering, shoddy construction, criminal

Example events
  • On January 1, 2012, a magnitude-7 earthquake occurred in Japan.
  • Everyone exchanged Happy New Year blessings on the first day of 2012.
  • In an accident, a car crashed through a guardrail into a river. People found that the guardrail had been built with reed where steel bars should have been used.
▶ TimeUserLDA[1] and EDCoW[3] failed to discover the shoddy construction event in the second week.
  • Comparisons

Method      precision   recall
UMIETM      0.894       0.913
UMIETM(-)   0.847       0.697
IETM        0.824       0.536
LSH[2]      0.394       0.913
EDCoW[3]    0.731       0.435

▶ UMIETM(-): a degraded version of UMIETM that does not make full use of the user profile
▶ IETM, LSH, EDCoW: do not use the user profile
▶ LSH: favors recall over precision
▶ EDCoW: favors precision over recall

▶ Efficiency
(e) Convergence of the complete log likelihood of UMIETM and LDA. x: round of iteration, y: complete log likelihood.
(f) Efficiency of UMIETM. x: time window, y: duration
References
[1] Qiming Diao, Jing Jiang, Feida Zhu, and Ee-Peng Lim. Finding bursty topics from microblogs. In: ACL 2012.
[2] Streaming first story detection with application to Twitter. In: HLT-NAACL 2010.
[3] Event Detection in Twitter. In: ICWSM 2011.
Weijing Huang, Wei Chen, Lamei Zhang, Tengjiao Wang
EECS, Peking University, Beijing
{pekingchenwei,tjwang}@pku.edu.cn