DECK: Detecting Events from Web Click-through Data

Ling Chen, L3S Research Center, University of Hannover, 30167 Hannover, Germany, [email protected]

Yiqun Hu, School of Computer Engineering, Nanyang Technological University, Singapore 639798, [email protected]

Wolfgang Nejdl, L3S Research Center, University of Hannover, 30167 Hannover, Germany, [email protected]

Abstract

In the past few years there has been increasing interest in detecting previously unidentified events from Web resources. While most existing research detects events by analyzing the content or structural information of Web documents, a recent direction is to study usage data, e.g., the click-through data generated by Web search engines. However, detecting events from Web click-through data is challenging: click-through data may not be as informative as Web contents or structures, and a large fraction of the queries issued to a Web search engine do not correspond to any real event. In this paper, we propose a novel approach that effectively detects events from click-through data based on robust subspace analysis. We first transform click-through data into a 2D polar space. A new algorithm, based on Generalized Principal Component Analysis (GPCA), then estimates subspaces of the transformed data such that each subspace contains query sessions of similar topics. Next, we prune uninteresting subspaces, i.e., those that do not contain query sessions corresponding to real events, by considering both the semantic certainty and the temporal certainty of the query sessions in each subspace. Finally, events are detected from the interesting subspaces using the non-parametric clustering technique Mean Shift. Our experimental results on real-life click-through data show that, compared with the existing approach, the proposed approach is more accurate in detecting real events and more effective in determining the number of events.

1 Introduction

The problem of event detection is part of a broader initiative called Topic Detection and Tracking (TDT) [2]. The objective of event detection is to discover new or previously unidentified events, where each event refers to a specific thing that happens at a specific time and place [1]. In particular, event detection can be divided into two categories: retrospective detection and on-line detection [17].


The former refers to the detection of previously unidentified events from an accumulated historical collection, while the latter entails the discovery of the onset of new events from live feeds in real time. Our focus in this paper is retrospective event detection.

With the prevalence of publishing activities over the Internet, today's Web covers almost every object and event in the real world. For example, regarding the Asian tsunami disaster at the end of 2004, a query to Google News returned more than 80,000 online news articles about this event within one month (Jan. 17 through Feb. 17, 2005) [12]. This phenomenon has recently motivated the event detection community to discover knowledge such as topics, events, and stories from large volumes of Web data [11][3][10][15]. Most existing work can be divided into two categories: content-based event detection and structure-based event detection. Basically, the former analyzes the textual information of Web documents using natural language processing techniques to extract meaningful knowledge [11][3], while the latter utilizes structural information, such as website structures and hyperlink structures, to discover groups of Web documents corresponding to events [10][15]. Events detected from Web resources are useful in applications such as organizing Web search results and restructuring Web sites [18].

1.1 Motivation

A recent research direction in detecting events from Web resources is to study Web usage data. For example, Zhao et al. [18] proposed to detect events from Web click-through data, i.e., the log data generated by Web search engines. We believe that click-through data serve as a good alternative data source for event detection for the following three reasons. Firstly, similar to the content and structural information of Web documents, Web click-through data can be regarded as a sensor of the real world: click-through data containing users' query keywords and clicked pages often reflect users' responses to recent real-world events. Secondly, the sharply accentuated role of the Web search engine as an entry point to the Web has given rise to a huge volume of click-through data, which enables effective knowledge discovery. Thirdly, using click-through data, complicated data analysis techniques, such as natural language processing of detailed Web content, can be avoided.

Each entry of Web click-through data basically records four types of information: an anonymous user identity, the query issued by the user, the time at which the query was submitted, and the URL of the clicked search result [13]. Table 1 shows an example excerpt of click-through data. It can be observed that Web click-through data provide two kinds of knowledge useful for event detection: the semantics of events (e.g., the knowledge indicated by the queries and the corresponding clicked pages) and the time of events (e.g., the knowledge indicated by the timestamps at which the queries are issued).

User   Query   Time               Page
337    jojo    3/19/2006 15:02    http://www.jojoonline.com
337    jojo    3/19/2006 15:12    http://www.azlyrics.com
178    poker   3/14/2006 21:44    http://www.pokerroom.com
178    poker   3/14/2006 21:58    http://www.pokerstars.com
178    poker   3/14/2006 21:58    http://www.partypoker.com

Table 1. Example of click-through data.

To the best of our knowledge, the work by Zhao et al. [18] is the first and only one that detects events from Web click-through data. They performed two phases of clustering to detect events. Firstly, query-page pairs are clustered based on their semantics. Secondly, within each semantically similar cluster, query-page pairs are clustered again based on their frequency evolution over some predefined time intervals. The clustering results are returned as detected events. However, after inspecting real-life click-through data collected by AOL [13], we observed two limitations of this approach:

• The number of events cannot be determined. The two-phase-clustering approach borrows the normalized graph cut algorithm [14] to perform clustering in each phase. However, the authors did not discuss how many clusters should be generated in each phase. This is an important issue that should not be ignored, since the number of clusters determines the number of detected events, and an inappropriately selected cluster number damages the quality of the detected events.

• Discovered clusters may not correspond to real events.

The two-phase-clustering approach studies Web click-through data in units of query-page pairs. Table 2 shows the top 10 most frequent query-page pairs in the click-through data logged by AOL in March 2006. None of them corresponds to any real event, yet such query-page pairs would probably be returned erroneously as events. For example, the first, fifth, and ninth query-page pairs in Table 2 are similar in semantics (all of them are about Google), and their daily frequencies in March 2006, normalized by their overall frequency in that month, evolve similarly as well, as shown in Figure 1. Nevertheless, such query-page pairs do not represent any meaningful event.

ID   Query             Page                            Freq.
1    google            http://www.google.com           38608
2    mapquest          http://www.mapquest.com         16518
3    yahoo             http://www.yahoo.com            13860
4    ebay              http://www.ebay.com             11669
5    google.com        http://www.google.com            9284
6    yahoo.com         http://www.yahoo.com             8361
7    myspace.com       http://www.myspace.com           7169
8    myspace           http://www.myspace.com           6678
9    www.google.com    http://www.google.com            4739
10   bank of america   http://www.bankofamerica.com     3957

Table 2. Frequent query-page pairs.

[Figure 1. Frequency evolution of queries: normalized daily frequency of "google", "google.com", and "www.google.com" over the days of March 2006.]

In this paper, we aim to address these two problems by proposing a novel event detection algorithm, DECK (Detecting Events from ClicK-through data). In order to automatically decide the correct number of events, we develop a new subspace analysis method based on Generalized Principal Component Analysis (GPCA) [16], which is capable of segmenting click-through data into distinct topics without any prior knowledge. We handle the second problem based on the "burst of activity" premise proposed in [8]: the appearance of an event is signaled by a burst of activity, with certain features rising sharply in frequency as the event emerges. In particular, DECK simultaneously considers the burst in semantics as well as the burst in time of queries to filter out click-through data that do not correspond to real events.

1.2 Overview and Contributions

Given a collection of Web click-through data, an overview of DECK is presented in Figure 2. There are four steps: polar transformation, subspace estimation, subspace pruning, and cluster generation. Firstly, we transform the click-through data into 2D polar space. Each query session (a query session, formally defined in Section 2, contains a query and the set of corresponding pages clicked by a user) is mapped to a point in polar space such that the angle θ and the radius r of the point reflect the semantics and the occurring time of the query session, respectively.

[Figure 2. Overview of DECK: polar transformation (θ: semantics, r: time), followed by subspace estimation, subspace pruning, and cluster generation.]
Then, we propose a new method, KNN-GPCA, which improves the robustness of GPCA by embedding a constraint on the distribution of the K nearest neighbors within the GPCA framework, to estimate the subspaces of the transformed click-through data. Each estimated subspace contains query sessions of some broad topic.¹ Next, by considering both the semantic certainty and the temporal certainty of the query sessions in each subspace, we prune uninteresting subspaces, i.e., those that do not contain query sessions corresponding to real events. Finally, we detect events in the interesting subspaces using a non-parametric clustering method called Mean Shift [4].

The main contributions of this paper are summarized as follows.

• We propose a novel algorithm to segment Web click-through data into broad topics without prior knowledge of the number of topics. The proposed algorithm, KNN-GPCA, takes into account the distribution of the K nearest neighbors of data points and estimates subspaces with a weighted least squares approximation.

• We propose an interestingness measure, which simultaneously considers the semantic certainty and the temporal certainty of click-through data, to quantify the likelihood that a subspace contains query sessions corresponding to events.

• We evaluate the performance of DECK with extensive experiments on real-life click-through data collected by AOL, and compare it with the existing two-phase-clustering approach [18].

The rest of the paper is organized as follows. In Section 2, we describe the polar transformation of Web click-through data. Section 3 presents our linear subspace estimation algorithm, KNN-GPCA. In Section 4, we define the interestingness measure of subspaces and discuss the detection of events from interesting subspaces. Experimental results on real-life data sets are presented in Section 5. We review related work in Section 6 and conclude in Section 7.

¹ TDT has its own topic definition. In this paper, however, we refer to a topic as a general theme, which may or may not be related to an event.

2 Polar Space Representation

Given a collection of Web click-through data, Zhao et al. [18] considered the data in units of query-page pairs, each consisting of a user query and a page clicked by the user. For example, the click-through data in Table 1 would be considered as five distinct query-page pairs. In DECK, by contrast, we study Web click-through data in units of query sessions, where a query session refers to an episode of interaction between a Web user and a Web search engine and consists of a query issued by the user and the set of pages the user clicked on the search result.

Definition 1 (Query Session) A query session is a pair $Q = (q, P)$, where $q$ is a bag of keywords representing a user query issued to a search engine and $P = \{p_1, p_2, \cdots, p_n\}$ is the set of corresponding pages clicked by the user on the search result. In addition, a query session $Q$ is associated with a time point $T(Q)$ at which the query was issued.

For example, the first two entries in Table 1 indicate that, after issuing the query "jojo", user 337 clicked the two pages http://www.jojoonline.com and http://www.azlyrics.com. The two entries are therefore considered as one query session. For simplicity, we take the timestamp of the first entry as the occurring time of a query session. The reason we adopt the unit of query session is the following intuition: the multiple query-page pairs corresponding to a query session usually represent the same event about which the user inquired.

Given a query session $Q$, we aim to map it to a point $(\theta, r)$ in polar space such that the angle $\theta$ and the radius $r$ reflect the semantics and the occurring time of $Q$, respectively. In particular, given two query sessions $Q_1$ and $Q_2$ mapped to points $(\theta_1, r_1)$ and $(\theta_2, r_2)$, the more similar the two query sessions are in semantics, the smaller the angle $|\theta_1 - \theta_2|$; furthermore, if the two query sessions are exactly the same in semantics, the closer their occurring times, the smaller the distance $|r_1 - r_2|$.

We first consider mapping the semantics of a query session $Q$ to an angle $\theta$. Since the angle between two query sessions reflects how similar they are in semantics, we need to define the semantic similarity between two query sessions first. Recall that a query session contains a query and a set of clicked pages.


In order to avoid further processing of query keywords, we determine the similarity between two query sessions based on their two sets of clicked pages (which suffices according to our experimental results). We use an adapted Jaccard coefficient to measure the similarity between two query sessions.

Definition 2 (Semantic Similarity) Given two query sessions $Q_1 = (q_1, P_1)$ and $Q_2 = (q_2, P_2)$, the semantic similarity between $Q_1$ and $Q_2$, denoted $\mathrm{Sim}(Q_1, Q_2)$, is
$$\mathrm{Sim}(Q_1, Q_2) = \frac{|P_1 \cap P_2|}{\max\{|P_1|, |P_2|\}}$$

For example, consider the example query sessions in Figure 3 (a), where each row of the table represents a query session. $\mathrm{Sim}(Q_1, Q_2) = 2/3$, while $\mathrm{Sim}(Q_1, Q_3) = 1/2$. Given a set of $n$ query sessions $\{Q_1, Q_2, \cdots, Q_n\}$, an $n \times n$ semantic similarity matrix $M$ can be computed such that each element $m_{ij} = \mathrm{Sim}(Q_i, Q_j)$. In other words, the relative semantics of a query session $Q_i$ is represented by the $n$-dimensional row vector $R_i = \langle m_{i1}, m_{i2}, \cdots, m_{in} \rangle$ of $M$. In order to map the semantics of $Q_i$ to an angle $\theta_i$ in polar space, we need to reduce the dimension of $R_i$ to 1. For dimension reduction, we perform Principal Component Analysis (PCA) on the semantic similarity matrix $M$; the first principal component preserves the dominant variance in the semantic similarities. Let $\{f_1, f_2, \cdots, f_n\}$ be the values of the first principal component corresponding to the query sessions $\{Q_1, Q_2, \cdots, Q_n\}$. A query session $Q_i$ is then mapped to a point $(\theta_i, r_i)$ where $\theta_i$ is computed as
$$\theta_i = \frac{f_i - \min_j(f_j)}{\max_j(f_j) - \min_j(f_j)} \times \frac{\pi}{2}$$
where $\min_j(f_j)$ and $\max_j(f_j)$ are the minimum and maximum values in the first principal component. Clearly, $\theta_i$ is restricted to $[0, \pi/2]$.

The mapping from the occurring time of a query session to the radius of a point is handled directly. Given a set of query sessions $\{Q_1, Q_2, \cdots, Q_n\}$, the radius of the point $(\theta_i, r_i)$ corresponding to the query session $Q_i$ is given by
$$r_i = \frac{T(Q_i) - \min_j(T(Q_j))}{\max_j(T(Q_j)) - \min_j(T(Q_j))}$$
where $\min_j(T(Q_j))$ and $\max_j(T(Q_j))$ are the earliest and latest occurring times of all query sessions, respectively. $r_i$ takes values in the range $[0, 1]$.

For example, consider again the set of query sessions in Figure 3 (a). Query sessions $Q_1$ to $Q_3$ are similar in semantics as well as in occurring time. Query sessions $Q_4$ through $Q_6$ are similar in semantics only. Query session $Q_7$ is dissimilar to all other query sessions in semantics. Figure 3 (b) shows the polar transformation of this set of query sessions.² It can be observed from the figure that our polar transformation has the following two features:

² In order to show the polar transformation clearly, we constrain the angles of the query sessions to $[\pi/9, 4\pi/9]$ in this example.

ID   Query   Pages        Time
Q1   q1      p1 p2        2006-03-15 14:08:16
Q2   q2      p1 p2 p3     2006-03-15 21:20:09
Q3   q2      p2 p3        2006-03-16 10:42:25
Q4   q3      p4 p5        2006-03-09 19:22:18
Q5   q4      p4 p6 p7     2006-03-18 11:35:52
Q6   q3      p4 p8        2006-03-29 12:05:31
Q7   q5      p9 p10       2006-03-22 20:41:11

(a) Example query sessions

[(b) Example polar transformation: the points of Q1 to Q3 and of Q4 to Q6 lie around two dotted 1D subspace lines, while the point of Q7 lies apart from both.]

Figure 3. Example polar transformation.

• Subspace consistency. The mapping from the semantics of a query session to the angle of a point in polar space causes query sessions of similar semantics to lie on one and only one 1D subspace. For example, the points of Q1 to Q3 and the points of Q4 to Q6 lie around the two dotted lines in Figure 3, respectively, whereas the point of Q7 appears as an outlier.

• Cluster consistency. The mapping from the occurring time of a query session to the radius of a point in polar space forces query sessions of similar semantics and similar occurring time to appear as clusters within subspaces. For example, the points of Q1 to Q3 form a cluster on the lower dotted line, while the points of Q4 to Q6 are spread along the upper dotted line.
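To make the transformation concrete, the following sketch computes the polar coordinates of a set of query sessions. It is an illustrative implementation of Definition 2 and the two mapping formulas above, not the authors' code; the function and variable names are ours, and we realize the PCA step with an SVD of the column-centered similarity matrix.

```python
import numpy as np

def polar_transform(page_sets, times):
    """Map query sessions to polar coordinates (theta, r) as in Section 2.

    `page_sets` is a list of sets of clicked pages, `times` a list of the
    sessions' occurring times (any numeric unit).
    """
    n = len(page_sets)
    # Semantic similarity matrix M (Definition 2, adapted Jaccard).
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            common = len(page_sets[i] & page_sets[j])
            M[i, j] = common / max(len(page_sets[i]), len(page_sets[j]))
    # First principal component of M preserves the dominant variance.
    Mc = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    f = Mc @ Vt[0]                     # scores on the first component
    # Angle: rescale the first component to [0, pi/2].
    theta = (f - f.min()) / (f.max() - f.min()) * (np.pi / 2)
    # Radius: rescale the occurring time to [0, 1].
    t = np.asarray(times, dtype=float)
    r = (t - t.min()) / (t.max() - t.min())
    return theta, r

# The seven sessions of Figure 3(a); times in hours relative to Q1 (rounded).
page_sets = [{"p1", "p2"}, {"p1", "p2", "p3"}, {"p2", "p3"},
             {"p4", "p5"}, {"p4", "p6", "p7"}, {"p4", "p8"}, {"p9", "p10"}]
times = [0.0, 7.2, 20.6, -138.8, 69.5, 334.0, 174.5]
theta, r = polar_transform(page_sets, times)
```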

3 Subspace Estimation

According to the subspace consistency of the polar transformation, our objective now is to estimate subspaces of the transformed data such that each subspace contains query sessions of similar semantics. For this purpose, we propose a new subspace estimation algorithm, called KNN-GPCA, based on Generalized Principal Component Analysis (GPCA) [16]. GPCA is an algebro-geometric approach that simultaneously estimates subspace bases and assigns data points to subspaces. The reasons we build our algorithm on GPCA are as follows. Unlike prior work on subspace analysis [5, 6], GPCA does not require initialization, which usually leads to local optima, nor does it restrict the subspaces to be either orthogonal or trivially intersecting. Furthermore, GPCA is capable of estimating subspaces without prior knowledge of the number of subspaces.

However, as analyzed in the following subsection, the performance of GPCA degrades in the presence of outliers. In order to improve the robustness of GPCA, KNN-GPCA takes into account the distribution of the K nearest neighbors of data points, which yields a weighted least squares estimation of subspaces. In this section, we first review the GPCA algorithm. Then, the embedding of the K-nearest-neighbor distribution is described. Finally, we estimate subspaces using the weighted least squares technique.

3.1 Review of GPCA

Given a set of sample data points, GPCA estimates a mixture of linear subspaces in three steps: fitting polynomials to the data, estimating the number of subspaces, and estimating the normal vectors of the subspaces.

Based on the fact that each data point $x \in \mathbb{R}^D$ satisfies $b_i^T x = 0$, where $b_i$ is the normal vector of the subspace to which $x$ belongs, a data point lying on one of the $n$ subspaces satisfies

$$p_n(x) = \prod_{i=1}^{n} (b_i^T x) = 0 \qquad (1)$$

where $\{b_i\}_{i=1}^{n}$ are the normal vectors of the subspaces. Equation (1) can be converted to a linear expression by expanding the product of all $b_i^T x$ and viewing all monomials $x^n = x_1^{n_1} x_2^{n_2} \cdots x_D^{n_D}$ (with $0 \le n_j \le n$ for $j = 1, \ldots, D$ and $n_1 + n_2 + \cdots + n_D = n$) as system unknowns. By introducing the Veronese map $v_n: [x_1, \ldots, x_D]^T \mapsto [\ldots, x^n, \ldots]$, where $x^n$ ranges over the monomials of the form $x_1^{n_1} x_2^{n_2} \cdots x_D^{n_D}$, equation (1) becomes the linear expression

$$p_n(x) = v_n(x)^T c = \sum c_{n_1 \cdots n_D}\, x_1^{n_1} \cdots x_D^{n_D} = 0 \qquad (2)$$

where the entries of $c$ are the coefficients of the monomials $x^n$. Since the polynomial $p_n(x) = v_n(x)^T c$ must be satisfied by every data point, a collection of $N$ sample points $\{x_j\}_{j=1}^{N}$ yields the linear system

$$L_n c = \begin{bmatrix} v_n(x_1)^T \\ v_n(x_2)^T \\ \vdots \\ v_n(x_N)^T \end{bmatrix} c = 0 \qquad (3)$$

After fitting polynomials to the data, the number of subspaces is estimated based on the condition that there is a unique solution for $c$. Thus, when there is no noise, the number of subspaces can be computed as the minimum value $i$ such that the rank of the polynomial embedding matrix $L_i$ equals $M_i - 1$, where $M_i = \binom{i+D-1}{D-1}$ is the number of distinct monomials in $L_i$. In the presence of noise, GPCA relies on a pre-defined threshold $\epsilon$ to combine subspaces that are close to each other: for example, the rank of $L_i$ is determined as $r$ if $\frac{\sigma_{r+1}}{\sigma_1 + \cdots + \sigma_r} < \epsilon$, where $\sigma_j$ is the $j$-th singular value of $L_i$.
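As a concrete illustration of equations (1)-(3), the sketch below builds the Veronese embedding and the matrix $L_n$, and estimates the number of subspaces from the relative singular-value gap. This is a simplified rendering of the GPCA steps just described, under our own naming, not the authors' implementation.

```python
import numpy as np
from itertools import combinations_with_replacement

def veronese_map(x, n):
    """v_n(x): all monomials of degree n in the entries of x (eq. (2))."""
    return np.array([np.prod([x[i] for i in idx])
                     for idx in combinations_with_replacement(range(len(x)), n)])

def embedding_matrix(X, n):
    """Stack v_n(x_j)^T over all sample points: the matrix L_n of eq. (3)."""
    return np.vstack([veronese_map(x, n) for x in X])

def estimate_num_subspaces(X, eps=1e-3, max_n=5):
    """Smallest n for which L_n drops rank (rank M_n - 1), judged by the
    relative singular-value gap sigma_{r+1} / (sigma_1 + ... + sigma_r).
    Assumes enough samples: N must exceed the number of monomials M_n."""
    for n in range(1, max_n + 1):
        s = np.linalg.svd(embedding_matrix(X, n), compute_uv=False)
        if s[-1] / s[:-1].sum() < eps:   # last singular value ~ 0
            return n
    return max_n
```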

Once the number of subspaces is determined, the coefficient vector $c$ can be computed accordingly. Then, the normal vectors $\{b_i\}_{i=1}^{n}$ can be solved for by polynomial differentiation in the absence of noise. Taking noise into account, the estimation of the normal vectors is cast as a constrained nonlinear optimization problem, initialized with the normal vectors obtained by polynomial differentiation. Further details of the GPCA algorithm are available in [16].

While GPCA provides an elegant solution to the problem of linear subspace estimation, it has some inherent limitations. In particular, we observed that when outliers are present, the accuracy of the following steps of GPCA is impaired:

• Estimation of the number of subspaces. In the presence of noise, GPCA determines the rank $r$ of $L_i$ based on the threshold $\epsilon$. When noise and outliers are moderate, the difference between the $r$-th and the $(r+1)$-th singular values of $L_i$ may be slight. The appropriate threshold is then hard to specify, and the number of subspaces can be determined erroneously.

• Estimation of the normal vectors $\{b_i\}_{i=1}^{n}$. GPCA estimates the coefficients $c$ before estimating the normal vectors of the subspaces. In the presence of noise, $c$ is computed from the singular vectors of $L_i$ associated with its smallest singular values. When outliers are included in the calculation, the computed singular values and vectors are prone to error; consequently, the coefficients and the normal vectors may not be determined appropriately.

Therefore, in order to improve the robustness of GPCA, the impact of noise and outliers on the two estimation steps should be demoted. We achieve this by assigning weight coefficients to data points to distinguish true data from noise and outliers. In particular, we weight each data point based on the distribution of its K nearest neighbors.

3.2 Weight Coefficient Assignment

The K-th Nearest Neighbor Distance (kNND) metric proposed in [7] detects and removes outliers based on the following fact: in a cluster containing more than K points, the kNND of a data point is small; otherwise, it is large. Recall that our polar transformation has the cluster consistency property, which implies that true data points (e.g., query sessions corresponding to real events) lie in clusters inside subspaces. The kNND of true data points should therefore be small, while the kNND of noise and outliers should be large. However, instead of simply using the kNND to differentiate outliers from inliers, we consider the distribution of the K nearest neighbors of each data point. Given a data point $x_i$, let its K nearest neighbors be $NN_K(x_i)$. Both the variance of the K nearest neighbors along the direction of the subspace of $x_i$ (i.e., the direction from the origin to the point $x_i$), denoted $svar(NN_K(x_i))$, and the variance of the K nearest neighbors along the direction orthogonal to the subspace direction, denoted $nvar(NN_K(x_i))$, can be computed by PCA.

[Figure 4. Distribution of 3 nearest neighbors: three data points x1, x2 and x3 in polar space, with arrows svar and nvar marking the directions along which svar(NN3(x1)) and nvar(NN3(x1)) are computed.]

For example, Figure 4 shows the distribution of the 3 nearest neighbors of three data points $x_1$, $x_2$ and $x_3$. If $x_i$ is a true data point, it forms a cluster together with its K nearest neighbors; thus both $svar(NN_K(x_i))$ and $nvar(NN_K(x_i))$ should be small, and so should their sum $S(NN_K(x_i)) = svar(NN_K(x_i)) + nvar(NN_K(x_i))$. On the contrary, if $x_i$ is an outlier (e.g., $x_2$ in Figure 4), $S(NN_K(x_i))$ will be large. However, even if a data point forms a cluster together with its K nearest neighbors, it may not be a true data point if its neighbors spread along the orthogonal direction of the subspace (e.g., $x_3$ in Figure 4); that is, the cluster does not lie inside a subspace. (Note that this type of noisy data point cannot be identified by using the kNND directly.) In this case, $svar(NN_K(x_i))$ is small but $nvar(NN_K(x_i))$ is large, so the ratio $R(NN_K(x_i)) = nvar(NN_K(x_i)) / svar(NN_K(x_i))$ is large. Hence, we assign a weight $W(x_i)$ to a data point $x_i$ based on $S(NN_K(x_i))$ and $R(NN_K(x_i))$ as follows:

$$W(x_i) = \frac{1}{1 + S(NN_K(x_i)) \times R(NN_K(x_i))} \qquad (4)$$

The value of $W(x_i)$ ranges from 0 to 1. When the data point $x_i$ lies in a cluster whose points spread along the direction of the subspace, the value of $R(NN_K(x_i))$ is 0 in the absence of noise and very small in the presence of noise, and the value of $S(NN_K(x_i))$ is small as well; hence the weight $W(x_i)$ is close to 1. In other words, we assign a large weight to a true data point. Otherwise, $R(NN_K(x_i))$ and/or $S(NN_K(x_i))$ are large, which results in a small $W(x_i)$, so the impact of noise and outliers is reduced by small weight values.

Note that, since noise and outliers are usually far away from true data points, any small value of K can be used, and the selection of K can be fixed for the computation of the weights of all data points. In addition, due to the high cost of directly finding K nearest neighbors, our implementation first performs a bisecting k-means clustering and then discovers the K nearest neighbors within each cluster only.
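The following sketch computes the weight of equation (4) for one point. Since the candidate subspace direction of $x_i$ in the 2D polar-transformed space is simply the direction from the origin to $x_i$, we project the neighbors onto that direction and its normal instead of running a full PCA; this shortcut, like the names used, is our own.

```python
import numpy as np

def knn_weight(points, i, K=5):
    """Weight W(x_i) of equation (4). `points` holds the 2D Cartesian
    coordinates of all transformed query sessions."""
    x = points[i]
    dist = np.linalg.norm(points - x, axis=1)
    nn = points[np.argsort(dist)[1:K + 1]]   # K nearest neighbors of x_i
    # Candidate subspace direction of x_i (origin -> x_i) and its normal.
    u = x / np.linalg.norm(x)
    v = np.array([-u[1], u[0]])
    centered = nn - nn.mean(axis=0)
    svar = (centered @ u).var()              # spread along the subspace
    nvar = (centered @ v).var()              # spread orthogonal to it
    S = svar + nvar
    R = nvar / max(svar, 1e-12)              # guard against svar == 0
    return 1.0 / (1.0 + S * R)
```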

3.3 KNN-GPCA

Taking into account the weight of each data point, the linear system of equation (3) is modified as

$$W L_n c = \begin{bmatrix} W(x_1) & & & \\ & W(x_2) & & \\ & & \ddots & \\ & & & W(x_N) \end{bmatrix} \begin{bmatrix} v_n(x_1)^T \\ v_n(x_2)^T \\ \vdots \\ v_n(x_N)^T \end{bmatrix} c = 0 \qquad (5)$$

where $W(x_i)$ is the weight of data point $x_i$. We then perform a Singular Value Decomposition (SVD) on $W L_n$ to estimate the number of subspaces. Note that if $x_i$ is an outlier, its small weight pulls it back toward the origin; since the origin lies in every null space, $x_i$ will not inflate the singular values. Consequently, the selection of the threshold $\epsilon$ is no longer as crucial as before.

After estimating the number of subspaces, the coefficient vector $c$ can be computed using the weighted least squares technique. In particular, we express the left-hand side of equation (5) as

$$W \begin{bmatrix} v_n(x_1)(2..M)^T \\ v_n(x_2)(2..M)^T \\ \vdots \\ v_n(x_N)(2..M)^T \end{bmatrix} \begin{bmatrix} c_2 \\ c_3 \\ \vdots \\ c_M \end{bmatrix} + \begin{bmatrix} W(x_1)\,v_n(x_1)(1)\,c_1 \\ W(x_2)\,v_n(x_2)(1)\,c_1 \\ \vdots \\ W(x_N)\,v_n(x_N)(1)\,c_1 \end{bmatrix} \qquad (6)$$

where $v_n(x_i)(2..M)$ is the vector containing all elements of $v_n(x_i)$ except the first element $v_n(x_i)(1)$, and $M = \binom{r+D-1}{D-1}$ with $r$ the estimated number of subspaces. In order to calculate a basis of the coefficient vector, let $c_1 = 1$. Then equation (5) can be expressed as

$$W \begin{bmatrix} v_n(x_1)(2..M)^T \\ v_n(x_2)(2..M)^T \\ \vdots \\ v_n(x_N)(2..M)^T \end{bmatrix} \begin{bmatrix} c_2 \\ c_3 \\ \vdots \\ c_M \end{bmatrix} = \begin{bmatrix} -W(x_1)\,v_n(x_1)(1) \\ -W(x_2)\,v_n(x_2)(1) \\ \vdots \\ -W(x_N)\,v_n(x_N)(1) \end{bmatrix} \qquad (7)$$

The above equation can be succinctly written as $W A c = d$, where $A$ is the matrix whose rows are $v_n(x_i)(2..M)^T$, $i = 1, \ldots, N$, and $d$ is the right-hand side of the equation. By minimizing the objective function $\|d - A c\|_W$, we obtain the weighted least squares approximation

$$c_1 = 1, \qquad [c_2, \cdots, c_M]^T = (A^T W^T W A)^{-1} (A^T W^T W d) \qquad (8)$$

Note that, since we use the diagonal weight matrix $W$ to demote the impact of noise and outliers, the estimation error of the coefficient vector $c$ is reduced.

To estimate the normal vectors $\{b_i\}_{i=1}^{n}$, we first compute them as in the noise-free case using polynomial differentiation, as in the original GPCA. The computed vectors serve to initialize the following constrained nonlinear optimization, which differs from GPCA in the introduction of the weight coefficients:

$$\min \sum_{j=1}^{N} W(x_j) \|\tilde{x}_j - x_j\|^2 \quad \text{subject to} \quad \prod_{i=1}^{n} (b_i^T \tilde{x}_j) = 0, \quad j = 1, \cdots, N \qquad (9)$$

where $\tilde{x}_j$ is the projection of $x_j$ onto its closest subspace. Using a Lagrange multiplier $\lambda_j$ for each constraint, the above optimization problem is equivalent to minimizing

$$\sum_{j=1}^{N} \Big( W(x_j) \|\tilde{x}_j - x_j\|^2 + \lambda_j \prod_{i=1}^{n} (b_i^T \tilde{x}_j) \Big) \qquad (10)$$

Taking the partial derivative with respect to $\tilde{x}_j$ and equating it to 0, we can solve for $\lambda_j / 2$ and $W(x_j) \|\tilde{x}_j - x_j\|^2$. Substituting these back into the objective function (9), the simplified objective function on the normal vectors can be derived as

$$E_n(b_1, \cdots, b_n) = \sum_{j=1}^{N} \frac{W(x_j) \left( \prod_{i=1}^{n} b_i^T x_j \right)^2}{\left\| \sum_{i=1}^{n} b_i \prod_{l \neq i} (b_l^T x_j) \right\|^2} \qquad (11)$$

We observed that the convergence of equation (11) is slow. Hence, in our implementation, we employ a weighted k-means iteration to estimate the normal vectors: we assign weighted data points to their nearest subspaces and update the normal vectors of the subspaces, repeating until the subspaces no longer change. According to our experimental results, this method achieves the same performance as equation (11) with a faster convergence rate.
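A minimal sketch of the weighted least squares step of equation (8). Following equation (8) literally, the weights enter only through the diagonal matrix $W$, so we take $d_j = -v_n(x_j)(1)$; the function and variable names are ours.

```python
import numpy as np

def weighted_coefficients(Ln, w):
    """Weighted least squares estimate of the coefficient vector c of
    equation (8), with c_1 fixed to 1.

    `Ln` is the embedding matrix of equation (3) (rows v_n(x_j)^T) and
    `w` the per-point weights of equation (4).
    """
    W = np.diag(w)
    A = Ln[:, 1:]          # rows v_n(x_j)(2..M)^T
    d = -Ln[:, 0]          # -v_n(x_j)(1), since c_1 = 1
    WtW = W.T @ W          # = diag(w**2)
    c_rest = np.linalg.solve(A.T @ WtW @ A, A.T @ WtW @ d)
    return np.concatenate(([1.0], c_rest))
```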

4 Event Detection

After estimating subspaces in the transformed polar space, each subspace contains query sessions of a similar topic. However, not every subspace is interesting in the sense of containing clusters that correspond to real events. Consider, for example, query sessions whose queries target popular public portals (e.g., google). Due to their persistently high frequency throughout the observed time period, such query sessions form a subspace although they do not represent any real event; the subspace s1 in Figure 5 is such an uninteresting subspace. In this section, we first discuss the pruning of uninteresting subspaces, and then the detection of events from the interesting subspaces.

[Figure 5. Interestingness of subspaces: two subspaces s1 and s2 in polar space, with histogram bins h1, ..., hm along the subspace direction and v1, ..., vn along the orthogonal direction.]

4.1 Subspace pruning

We distinguish interesting from uninteresting subspaces based on the intuitive premise used in [8]: the appearance of an event is signaled by a "burst of activity", with certain features rising sharply in frequency when the event emerges. In our work, we simultaneously consider two features: the occurring time of query sessions and the semantics of query sessions. In particular, if a subspace is interesting, i.e., it contains query sessions corresponding to events, both the occurring time and the semantics of the query sessions in the subspace should exhibit certain "bursts". Recall that our polar transformation maps the occurring time and the semantics of a query session to the radius and the angle of a point in polar space, respectively. A temporal "burst" and a semantic "burst" should therefore be reflected by concentrated distributions of data points along the subspace direction and along the orthogonal direction of the subspace, respectively. For example, consider the subspace s2 in Figure 5: s2 might be an interesting subspace since its data points are concentrated along both the subspace direction and the orthogonal direction.

In order to measure the certainty of the distribution of data points along the two directions, we project the data points onto the two directions and compute the respective histograms of the distributions. Let $\langle h_1, h_2, \cdots, h_m \rangle$ and $\langle v_1, v_2, \cdots, v_n \rangle$ be the two corresponding histograms, where $h_i$ and $v_i$ are individual bins. We employ the entropy measure to define the interestingness of a subspace as follows:

$$I(s_i) = 1 - \Big[ -p \sum_{i=1}^{m} h_i \log h_i - (1-p) \sum_{i=1}^{n} v_i \log v_i \Big] \qquad (12)$$

where $p \in [0, 1]$ is a weight, determined experimentally, that assigns different importance to the entropy values in the two directions; for example, if $p = 1$, only the temporal "burst" is considered. The interestingness measure takes values from 0 to 1: the more concentrated the distributions in the two directions, the smaller the entropies in the brackets of equation (12), and the greater the interestingness.

Given a threshold ζ, a subspace $s_i$ is pruned as uninteresting if $I(s_i) < \zeta$. We observed in our experiments that the $I(s_i)$ of an interesting subspace is usually much greater than that of an uninteresting subspace, so it is not difficult to select an appropriate threshold ζ.

Note that the calculation of the histograms may be biased by noisy data points. In our implementation, we remove noise with an inlier-growing method before projecting the data points. Recall that the weight coefficient associated with a data point (as computed in the previous section) indicates the likelihood that the point belongs to the true data. Hence, we sort the points according to their weights and pick the top $m$ points as an initial set of inliers. We then measure the Euclidean distance between each inlier and its nearest neighbor in the set of inliers, and define a normalized average minimum distance $\bar{d} = D_s / W_s$, where $D_s$ is the sum of the minimum distances obtained for all inliers and $W_s$ is the sum of the weights of all inliers. Next, we grow the set of inliers by examining the remaining data points. For each point $x_i$, we calculate its distance $d(x_i)$ to the nearest inlier. If

$$\frac{d(x_i)}{W(x_i)} \leq z \times \bar{d} \qquad (13)$$

then $x_i$ is included in the set of inliers; here $z$ is a parameter that extends the threshold beyond the average minimum distance. If a point does not satisfy the condition in equation (13), it is discarded as noise. Note that the normalization by the weight of each data point in equation (13) increases the discriminatory power, since a small (large) weight associated with noise (an inlier) makes the left-hand side of equation (13) even larger (smaller).
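The interestingness computation of equation (12) reduces to two histogram entropies. The sketch below is our own illustration: it takes the polar coordinates of a subspace's denoised points, uses the radius as the projection onto the subspace direction and the angle as the projection onto the orthogonal direction, and normalizes each entropy by the log of the bin count so the measure stays in [0, 1] — a normalization we assume, since the paper does not state the log base.

```python
import numpy as np

def interestingness(theta, r, p=0.5, bins=20):
    """I(s_i) of equation (12) for one subspace, from the polar
    coordinates (theta, r) of its denoised member points."""
    # Radius ~ projection onto the subspace direction (temporal burst);
    # angle ~ projection onto the orthogonal direction (semantic burst).
    h, _ = np.histogram(r, bins=bins)
    v, _ = np.histogram(theta, bins=bins)
    h = h / h.sum()
    v = v / v.sum()

    def entropy(q):
        q = q[q > 0]
        # Normalized by log(#bins) so each entropy lies in [0, 1]
        # (an assumption; the paper does not state the log base).
        return -np.sum(q * np.log(q)) / np.log(bins)

    return 1.0 - (p * entropy(h) + (1 - p) * entropy(v))
```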

4.2 Query session clustering

After pruning the uninteresting subspaces, events can be detected from the remaining subspaces by clustering data points, based on the cluster consistency property of the polar transformation. Although a plethora of clustering techniques have been published, methods relying on a priori knowledge of the number of clusters are not applicable here, because the number of events in each subspace cannot be decided easily. Hence, we employ a non-parametric technique called mean shift clustering [4]. Mean shift clustering is an application of the mean shift procedure, which successively computes the mean shift vector, a vector that always points in the direction of maximum increase of the density, and converges to a point where the gradient of the density function is zero. Based on this procedure, we perform mean shift clustering on the data points in each subspace as follows:

• Firstly, the mean shift procedure is run on all the data points to find the stationary points of the density estimate.

• Secondly, the discovered stationary points are pruned so that only local maxima are retained. The set of all points that converge to the same mode defines one cluster.

The returned clusters are expected to represent real events. The complete DECK algorithm is shown in Algorithm 1.

Algorithm 1 DECK
Input: A set of query sessions
Output: A set of query session clusters corresponding to real events
 1: Transform query sessions to polar space
 2: Estimate subspaces of query sessions using KNN-GPCA
 3: for each estimated subspace do
 4:   Project inliers onto the subspace direction and the orthogonal direction
 5:   Compute the entropy of the distribution histogram in each direction
 6:   if the interestingness of the subspace is less than the threshold then
 7:     Prune the subspace
 8:   end if
 9: end for
10: for each interesting subspace do
11:   Perform mean shift clustering
12: end for
13: Return the clustering results
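As an illustration of the clustering step, the sketch below applies scikit-learn's MeanShift, a standard implementation of the mean shift procedure of [4], to the points of one interesting subspace. The bandwidth heuristic and the conversion to Cartesian coordinates are our choices, not prescribed by the paper.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def cluster_subspace(theta, r):
    """Cluster the query sessions of one interesting subspace; every
    returned cluster of points is a candidate event."""
    # Convert polar coordinates to Cartesian so that the Euclidean
    # distances used by mean shift are meaningful.
    X = np.column_stack((r * np.cos(theta), r * np.sin(theta)))
    # The bandwidth (kernel size) is a free parameter of mean shift;
    # this quantile-based heuristic is an assumption on our part.
    bandwidth = estimate_bandwidth(X, quantile=0.2)
    labels = MeanShift(bandwidth=bandwidth).fit_predict(X)
    return labels
```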

5 Performance Study

In this section, we study the performance of DECK. We first describe the data set used in our experiments, and then present and analyze the experimental results.

5.1 Data Set

We use the real-life Web click-through data collected by AOL [13] from March 2006 through May 2006. According to [13], if a user clicked on more than one page in the result list returned for a single query, these pages appear as successive entries in the data. Hence, in our experiments, we simply extract successive pages corresponding to the same query and the same user as a query session, and take the timestamp of the first clicked page as the occurring time of the session. We manually identified a set of events from the data set, including both predictable events (e.g., Memorial Day on May 29, 2006) and unpredictable events (e.g., the death of Dana Reeve, an American actress, on March 6, 2006). For each event, we identified a set of query keywords and selected all query sessions that contained the query keywords and occurred close to the date of the event. After filtering out events represented by fewer than 50 query sessions, a total of 35 events are used in our experiments; the complete list is shown in Figure 6. We then randomly select query sessions that do not represent any real event and combine them with the query sessions corresponding to real events to generate five data sets, containing 5K, 10K, 20K, 50K and 100K query sessions respectively.
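A sketch of the query-session extraction described above, assuming the log has already been parsed into (user, query, time, clicked_url) tuples in their original order. The field layout is illustrative; the real AOL log also records the rank of the clicked result.

```python
from itertools import groupby

def extract_sessions(entries):
    """Group successive log entries of the same user and query into
    query sessions (Definition 1). `clicked_url` may be None when no
    result was clicked."""
    sessions = []
    for (user, query), group in groupby(entries, key=lambda e: (e[0], e[1])):
        group = list(group)
        pages = {e[3] for e in group if e[3]}   # set P of clicked pages
        if pages:
            sessions.append({
                "query": query,
                "pages": pages,
                # Timestamp of the first clicked page = occurring time T(Q).
                "time": next(e[2] for e in group if e[3]),
            })
    return sessions
```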

Event                                              Timestamp
Ash Wednesday                                      March 01, 2006
Jack Wild, an English actor, died                  March 02, 2006
Dubai Tennis Open ended                            March 04, 2006
2006 World Baseball Classic                        Mar 03 - Mar 20, 2006
Philadelphia Flower Show                           Mar 05 - Mar 12, 2006
James Blunt performance @ Oprah Winfrey Show       March 08, 2006
Ides of March                                      March 15, 2006
Saint Patrick's Day                                March 17, 2006
V for Vendetta, a film, released in the USA        March 17, 2006
She's the Man, film opening weekend                March 19, 2006
Los Angeles Marathon                               March 19, 2006
Debra Jean Beasley case: plea deal rejected        March 21, 2006
Washington D.C. Cherry Blossom Festival            March 25, 2006
Buck Owens, a singer and guitarist, died           March 25, 2006
Rocio Durcal, a Spanish actress, died              March 25, 2006
2006 Indy Racing League season began               March 26, 2006
Crash kills Indy driver Paul Dana                  March 27, 2006
Basic Instinct 2, released                         March 31, 2006
April Fools' Day                                   April 01, 2006
Good Friday                                        April 14, 2006
Scary Movie 4, released                            April 14, 2006
Easter                                             April 16, 2006
Boston Marathon                                    April 17, 2006
The 58th Annual Primetime Emmy Awards              April 27, 2006
Steve Howe, a baseball player, died                April 28, 2006
David Blaine performance @ Lincoln Center          May 01, 2006
The 132nd Kentucky Derby                           May 06, 2006
Chris Daughtry, eliminated from American Idol      May 10, 2006
Soraya, a Colombian-American singer, died          May 10, 2006
Mother's Day                                       May 14, 2006
The 41st Academy of Country Music Awards           May 23, 2006
Fleet Week                                         May 24 - May 30, 2006
UFC 60: Hughes vs. Gracie                          May 27, 2006
The 90th Indianapolis 500                          May 28, 2006
Memorial Day                                       May 29, 2006

Figure 6. Event list.

5.2 Result Analysis

Performance of DECK. We first compare the performance of DECK with the existing two-phase-clustering algorithm [18]. Given the set of clusters returned as detected events, following [18] we find a best match between discovered clusters and true events, where the best match of a true event is the cluster that has the maximum overlap with the true event in terms of the number of common query-page pairs (query sessions, respectively). We further require that the number of common query-page pairs (query sessions) be no less than a specified threshold; in our experiments, we set the threshold to 50% of the query-page pairs (query sessions) representing the true event. The evaluation metrics, precision and recall, are then computed as follows: precision is the ratio of the number of correctly detected events to the total number of discovered clusters, and recall is the ratio of the number of correctly detected events to the total number of true events. The experimental results are shown in Figures 7 (a) and (b), respectively, where the existing algorithm is referred to as 2PClustering (please ignore the other two algorithms for now). We observe that DECK outperforms 2PClustering in both precision and recall. Since the 2PClustering work does not discuss how to decide the number of clusters, we use the number of events generated by DECK; then, since many of the clusters generated by 2PClustering do not represent any event, both its precision and recall values are low.

[Figure 7. Precision (a), recall (b) and entropy (c) of DECK, 2PClustering, DECK-GPCA and DECK-NP on data sets of 5K to 100K query sessions.]

We further evaluate the performance using an entropy measure. For each generated cluster $i$, we compute $p_{ij}$ as the fraction of its query-page pairs (query sessions) representing the true event $j$. The entropy of cluster $i$ is then $E_i = -\sum_j p_{ij} \log p_{ij}$, and the total entropy is the sum of the cluster entropies weighted by cluster size: $E = \sum_{i=1}^{m} \frac{n_i \times E_i}{n}$, where $m$ is the number of clusters, $n$ is the total number of query-page pairs (query sessions) and $n_i$ is the size of cluster $i$. The experimental results are shown in Figure 7 (c). Again, DECK works better than 2PClustering, for a similar reason as before: given the number of clusters generated by DECK, 2PClustering generates clusters of larger size, and since it does not prune any data, its entropy is higher than that of DECK.
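For reference, the total-entropy measure just defined can be computed as below; `clusters` maps each discovered cluster to the list of true-event labels of its members (one label per query-page pair or query session), with a placeholder label such as None for members matching no event. This is our own sketch of the metric, not the authors' evaluation code.

```python
from collections import Counter
import math

def clustering_entropy(clusters):
    """Total entropy E = sum_i (n_i / n) * E_i of a clustering."""
    n = sum(len(members) for members in clusters.values())
    E = 0.0
    for members in clusters.values():
        counts = Counter(members)
        # Entropy E_i of cluster i over the fractions p_ij.
        Ei = -sum((c / len(members)) * math.log(c / len(members))
                  for c in counts.values())
        E += (len(members) / n) * Ei
    return E

# Example: two clusters; the second mixes an event with unmatched sessions.
print(clustering_entropy({0: ["easter"] * 40,
                          1: ["boston_marathon"] * 30 + [None] * 10}))
```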

Performance of Subspace Estimation. We also study the performance of the subspace estimation step of DECK, i.e., the effectiveness of KNN-GPCA. An alternative version of DECK, referred to as DECK-GPCA, employs the original GPCA algorithm to estimate subspaces. As shown in Figure 7, DECK-GPCA performs worse than DECK, which indicates that KNN-GPCA effectively demotes the impact of outliers and improves the robustness of subspace estimation. DECK-GPCA is nevertheless better than 2PClustering, because it too prunes data that do not represent events.

As analyzed in Section 3, after assigning lower weights to outliers, the parameter $\epsilon$, which determines the number of subspaces, can be selected more easily. Hence, we further investigate how KNN-GPCA enlarges the gap between the $r$-th and $(r+1)$-th singular values of the polynomial embedding matrix. We conduct the experiments on the five data sets, varying the number of true events. Note that, in the presence of noise, both GPCA and KNN-GPCA decide the number of subspaces by first assuming there are two subspaces, checking whether $\frac{\sigma_{r+1}}{\sigma_1 + \cdots + \sigma_r} < \epsilon$, and, if the condition is not satisfied, recursively increasing the number of subspaces until it is. In our experiments, the value $r$ is decided automatically by fixing $\epsilon$ to $1.0 \times 10^{-3}$ for both GPCA and KNN-GPCA. The ratios of the gap computed by KNN-GPCA to the gap computed by GPCA are shown in Figure 8 (a). We notice that KNN-GPCA does enlarge the gap and thus improves the chance of selecting an appropriate $\epsilon$.

Besides improving the robustness of estimating the number of subspaces, KNN-GPCA is also expected to improve the accuracy of estimating the normal vectors of the subspaces. Hence, we randomly generate $n = 2$ subspaces, each containing $N = 400$ points on 1-dimensional subspaces of $\mathbb{R}^2$. Zero-mean Gaussian noise with a standard deviation of 20 is added to the sample points. We consider six cases in which outliers are randomly added to the data, with outlier fractions of 0%, 2.5%, 5.0%, 7.5%, 10% and 12.5% respectively. For each case, we run KNN-GPCA and GPCA 500 times. The error between the true subspace normals $\{b_i\}_{i=1}^{n}$ and their estimates $\{\tilde{b}_i\}_{i=1}^{n}$ is computed for each run as [16]

$$\text{error} = \frac{1}{n} \sum_{i=1}^{n} \cos^{-1}(b_i^T \tilde{b}_i) \qquad (14)$$

Figure 8 (b) plots the mean error over all 500 trials as a function of the percentage of outliers. When there are no outliers, KNN-GPCA and GPCA have the same estimation error; however, as the fraction of outliers increases, the estimation error of KNN-GPCA becomes much smaller than that of GPCA.

[Figure 8. Performance of subspace estimation: (a) the ratio of the singular value gap of KNN-GPCA to that of GPCA on the five data sets with 10, 20 and 35 events; (b) the mean normal-vector estimation error of KNN-GPCA and GPCA versus the percentage of outliers.]

Performance of Subspace Pruning. In order to examine the subspace pruning step of DECK, we implement another alternative version of DECK, referred to as DECK-NP (DECK with No Pruning), which skips the subspace pruning step. The performance of DECK-NP is also shown in Figure 7. Since DECK-NP does not prune any uninteresting subspaces, it generates more clusters; hence, although it achieves recall similar to DECK, its precision is even lower than that of 2PClustering. However, since it employs KNN-GPCA to estimate subspaces, the correctly discovered clusters are of high quality, so its entropy is better than that of DECK-GPCA.

We further conduct experiments to examine whether the threshold ζ, which is used to prune uninteresting subspaces, can be selected easily. The experiments are conducted on the five data sets, varying the number of events. For each data set, we order the estimated subspaces according to their interestingness values and compute the interestingness ratio between the pair of successive subspaces with the largest difference in interestingness. The results are shown in Figure 9 (a), where the high ratios indicate that appropriate ζ values can be decided with ease. We also examine how the interestingness ratio varies with the parameter $p$, which decides how much the entropies of the temporal dimension and the semantic dimension respectively contribute to the interestingness of a subspace. The results in Figure 9 (b) show that considering both the temporal and the semantic dimension is better than considering only one of them.

[Figure 9. Performance of subspace pruning: the interestingness ratio on the five data sets with 10, 20 and 35 events, (a) versus the number of query sessions and (b) versus the parameter p.]

6 Related Work

The general task of event detection is divided into two categories by Yang et al. [17]: retrospective detection and on-line detection, also known as Retrospective Event Detection (RED) and New Event Detection (NED). Most existing research focuses on NED [17][2]; only a few approaches have been proposed for RED. For example, Li et al. [11] recently proposed a multi-model RED algorithm that explicitly models both the contents and the time information of documents. The particular feature of their algorithm is that the model of timestamps works like auto-adaptive sliding windows on the timeline, which overcomes the inflexible use of timestamps in traditional RED algorithms.

The most significant difference between our work and the existing work on NED and RED is that we use a different data source for event detection. Existing work on NED and RED usually analyzes the contents of documents using natural language processing techniques, which can be avoided by analyzing Web click-through data instead.

With the growing richness of the contents on the Web, the Topic Detection and Tracking community has been attracted to Web data as well [3][15][9]. For example, Kumar et al. [9] viewed the pages returned by search engines as good reflections of the quality of the "Web presence" of topics, and proposed a graph-theoretic approach to extract storylines from Web search results. Although they also utilized search results as a data source, their algorithm still needs to analyze the content of the returned pages. Furthermore, their objective is to find storylines as considered in news media, whereas we aim to discover specific events that happen at specific times and places.

The work most similar to ours is that of Zhao et al. [18]. They performed two phases of clustering such that the semantic and temporal features of queries are considered separately. In contrast, we consider the information from the two dimensions simultaneously in subspace estimation and pruning. Furthermore, our approach automatically decides the number of events and filters out queries that do not correspond to events.

7 Conclusions

Web click-through data have recently been identified as a potential source for event detection. In this paper, we propose a novel and effective approach for detecting events from Web click-through data. We study Web click-through data in units of query sessions and transform them into a 2D polar space. A novel subspace analysis method, KNN-GPCA, is proposed to estimate subspaces containing query sessions of similar topics. We prune uninteresting subspaces by defining an interestingness measure that considers the entropies in the semantic dimension and the temporal dimension. Finally, events are detected by clustering the query sessions in the interesting subspaces. Our experimental results on real-life Web click-through data show that our approach, based on robust subspace analysis, is effective in detecting events from Web click-through data.

References

[1] J. Allan, J. G. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study: Final report. In DARPA Broadcast News Transcription and Understanding Workshop, 1998.
[2] J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In SIGIR, 1998.
[3] J. Allan, C. Wade, and A. Bolivar. Retrieval and novelty detection at the sentence level. In SIGIR, 2003.
[4] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 2002.
[5] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. J. Kriegman. Clustering appearances of objects under varying illumination conditions. In CVPR, 2003.
[6] K. Kanatani. Motion segmentation by subspace separation and model selection. In ICCV, 2001.
[7] Q. Ke and T. Kanade. Robust subspace clustering by combined use of kNND metric and SVD algorithm. In CVPR, 2004.
[8] J. M. Kleinberg. Bursty and hierarchical structure in streams. In KDD, 2002.
[9] R. Kumar, U. Mahadevan, and D. Sivakumar. A graph-theoretic approach to extract storylines from search results. In KDD, 2004.
[10] W.-S. Li, K. S. Candan, Q. Vu, and D. Agrawal. Retrieving and organizing web pages by "information unit". In WWW, 2001.
[11] Z. Li, B. Wang, M. Li, and W.-Y. Ma. A probabilistic model for retrospective news event detection. In SIGIR, 2005.
[12] Q. Mei and C. Zhai. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In KDD, 2005.
[13] G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. In The First International Conference on Scalable Information Systems, 2006.
[14] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 2000.
[15] A. Sun and E.-P. Lim. Web unit mining: finding and classifying subgraphs of web pages. In CIKM, 2003.
[16] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis. In CVPR, 2003.
[17] Y. Yang, T. Pierce, and J. G. Carbonell. A study of retrospective and on-line event detection. In SIGIR, 1998.
[18] Q. Zhao, T.-Y. Liu, S. S. Bhowmick, and W.-Y. Ma. Event detection from evolution of click-through data. In KDD, 2006.
