Context-Sensitive Modeling of Web-Surfing Behaviour using Concept Trees
Sreangsu Acharyya and Joydeep Ghosh
Dept. of Electrical and Computer Engineering, University of Texas at Austin, Austin, Texas 78712
email: [email protected], [email protected]
Abstract
Early approaches to mathematically abstracting web-surfing behavior were largely based on first-order Markov models. Most humans, however, do not surf in a "memoryless" fashion; rather, they are guided by their time-dependent situational context and associated information needs. This belief is corroborated by the non-exponential revisit times observed in many site-centric weblogs. In this paper, we propose a general framework for modeling users whose surfing behavior is dynamically governed by their current topic of interest. This allows a modeled surfer to behave differently on the same page, depending on his situational context. The proposed methodology involves mapping each visited page to a topic or concept, (conceptually) imposing a tree hierarchy on these topics, and then estimating the parameters of a semi-Markov process defined on this tree, based on the observed transitions among the underlying visited pages. The semi-Markovian assumption imparts additional flexibility by allowing for non-exponential state re-visit times, and the concept hierarchy provides a natural way of capturing context and user intent. Our approach is computationally much less demanding than the alternative of using higher-order Markov models to capture history-sensitive surfing behavior. Several practical applications are described. The application of better predicting which outlink a surfer may take is illustrated using web-log data from a rich community portal, www.sulekha.com, as an example, though the focus of the paper is on forming a plausible generative model rather than on solving any specific task.
1. Introduction
Modeling internet surfing behaviour as a uniform random walk on the web graph has met with considerable success, leading to link-analysis-based search result ranking schemes such as Pagerank [16] and HITS [10]. Pagerank, for instance, assumes that a user surfs the web randomly by following outlinks with equal probability, superposed over a weak tendency to visit an arbitrary web page chosen uniformly at random. Given a query, the Pagerank algorithm returns a list of pages containing the query term and
orders them according to the number of transitions a random surfer would make to them if allowed to surf indefinitely. Models have been proposed to augment such "uniformly at random" behavior with intentional or purposeful ones; we continue to use the term "random" as an indicator of stochastic behaviour, not to imply any lack of intent. These purely link-analysis-based models have been extended to incorporate non-uniform sampling of outlinks based on the content of the pages to which the outlinks point [5, 18]. However, such models were not meant to address the effect of the past, or that of the dynamically changing interest of a surfer over the course of a surfing session. For example, the concept drift phenomenon, where a user switches to, say, non-sport pages after having spent enough time on sports pages, proves difficult to capture under memoryless Markovian assumptions. The above-mentioned techniques are also closely related to Markov chain [20] and probabilistic grammar [4] based link prediction approaches. The general consensus in these works has been that the next page cannot be predicted accurately from the current URL alone. Higher-order models [8] can alleviate this problem somewhat, but are difficult to estimate because of the exponential growth of the state space, thereby requiring even larger amounts of data to find good estimates. Even for models of moderate complexity, the training data requirement can far exceed what can be supported in practice. Researchers have therefore tried various smoothing schemes, for example combining the error-prone higher-order Markovian estimates with predictions by simpler models [20, 8], with some success. Many non-generative models have been used effectively in specific tasks related to web usage mining, such as personalization, link prediction and page pre-fetching. These are primarily based on similarities between clickstream sequences, using for instance longest repeating subsequences [17, 3], or identifying various sequential as well as non-sequential patterns [13] using techniques such as frequent itemset graphs [12], Association Rule Mining [1] and Sequential Pattern Mining [2]. Our goal in this paper is to develop a simple generative model for surfing patterns of a user who is guided dynamically by his current topic of interest, thereby allowing
the modeled surfer to behave differently on the same page, depending on his situational context. These different concepts/topics (the terms "concept" and "topic" are used synonymously here), drawn from the set of all possible concepts, called the concept space, constitute the states of the model. Estimating the contextual state occupied by the surfer should then allow better prediction and understanding of the surfer's behaviour. The observation that topic ontologies improve the identification of web usage patterns [6] motivates the rooted structure given to the state space. This approach is explored in detail in the following sections.
2. Overview
Instead of developing complex higher-order Markov models to capture the history, an additional state variable, called the concept, is used to decouple the surfer's present behaviour from the past and to describe it at any chosen or appropriate level of detail. Assuming that each page can be labeled with a concept, one can introduce two views of a surfing session: (1) the (observable) random walk on the URLs, and (2) a (possibly hidden) corresponding random walk on the concept space.
In the following sections we describe the probabilistic model used to define the two sets of random transitions mentioned above. The first occurs in the concept or topic space, reflecting the immediate interest of the surfer, and the second on the actual web graph. We explain both the generative model and the method of estimating its parameters. This is followed by examples of the use of this model to predict the surfing behaviour of some web users.
3. Model
The first subsection below describes the generative model for the state transitions in the concept space; that is, it explains how the modeled surfer selects a topic to transit to. Two important aspects are emphasized: (i) evaluating the probability of an event given a model, and (ii) estimating the parameters of the model from training data. In the subsequent subsection we describe how, depending on the surfer's topic selection, probabilities are apportioned to the outlinks, proportional to the relevance of each link to the concept state to be occupied next.
3.1 Random Walk on the Concept Space
We restrict the concept space to have a tree structure, which simplifies the model and also has some basis in popular ontological divisions of topics. For certain web-sites where the content is arranged as an ontology, e.g., dmoz.org or the Yahoo directory, the two views may converge. In this paper we examine the case where the concept tree is provided a priori and need not be induced from data. Cases where the states are not identified can also be dealt with, and would be similar to extending Markov models to hidden Markov models using the EM (Baum-Welch) algorithm. The web documents may be hand-labelled, as in our experiments, or identified by a classifier from the content of the documents; we briefly describe such an automated scheme in section 3.2.1. The simplest model for state transition dynamics is the Markovian model. As a model of web-surfing, however, it is restrictive for a couple of reasons. Markovian models limit us to exponential state re-visit times, which may not be flexible enough to model surfing behaviour, so we relax this condition slightly and model the behaviour as a coupled semi-Markov process on trees. Non-Markovian models may be motivated further by noting that, to utilize the structural information provided by complex trees, we need models that are non-Markovian: for purely Markovian state transitions over the leaves of a tree, a flat and a multi-level tree are equivalent [11]. That is, any Markovian state transition distribution on the leaves of a multi-level tree can be represented equally well on a flat tree, as shown in Fig. 1, where the two trees have the same state transition behaviour defined on them by the probabilities on the edges.
In the random walk model on the concept space, the surfer occupies some leaf node of the concept tree and can transit to another leaf node in the next time step. The leaves of this tree correspond to different concepts or topics, whereas their parents denote supersets of their children and therefore represent higher, more abstract concepts owning these leaf concepts. Typically several web pages will map to each leaf. However, unlike in traditional Markov models, here a surfer may choose to remain on a concept subtree over several time steps independently of the transition probabilities. That is, once the surfer is in a state, the probability of revisiting the same state in the very next step (a revisit event) is no longer governed by the transition matrix but by a different probability distribution, which may depend on the state selected for the next transition. This semi-Markovian behaviour is governed by an occupation time probability distribution that we explain later in this section. In the hierarchical description, we view the surfer as logically resident on all the nodes of the tree leading from the root to the leaf. To specify the state completely we need the n-tuple $\sigma = \{v_{root}, \ldots, v_{leaf}\}$ consisting of all the vertices on the path from the root to the leaf node $v_{leaf}$, where $n$ is the length of this path. However, since the leaf specifies this entire path on a tree, we will often specify just the leaf for brevity. We define the random walk on the concept space in terms of the following quantities:
- States $\sigma$, i.e., root-to-leaf paths of the concept tree, as defined above.
- A (concept) tree $G(V, E)$ with a set of vertices (concepts) $V = \{v_i\}$ and edges $E = \{e_i\}$, on which the surfer makes a state transition at each time step.
- Time $T$.
- Occupation time $t(v_i, T)$ at the vertex $v_i$, abbreviated to $t_{v_i}$ when $T$ is clear from the context. This is the number of time steps accumulated by the surfer visiting $v_i$ or any of its children; it is reset to zero once the surfer pops to the parent level of $v_i$. This quantity is also called the "holding" or "sojourn" time in the literature, but we will primarily use the term occupation time.
- Events $\{o, p\}$: the event $o$ signifies that the surfer continues to occupy some subtree of the current vertex, whereas $p$ signifies that it pops to the parent level.
- A conditional occupation time probability distribution $Q_{v_i}(o \mid t_{v_i})$ (abbreviated to $Q_{v_i}(t_{v_i})$) for each vertex $v_i$, defining the probability of drawing an "occupy" event at the node $v_i$, given that the surfer's occupation time at $v_i$ is $t_{v_i}$ and that the surfer has decided to continue to occupy the subtree of the parent of $v_i$. This distribution governs how long the surfer stays on the leaves of the corresponding subtree. The conditional form is not standard, but it is equivalent to a semi-Markov model and allows us to maintain the consistency of the probabilistic model. Instead of directly choosing a parametric form for the conditional distribution, we assume a form for the unconditioned distribution $q_{v_i}(t_{v_i})$, from which $Q_{v_i}(o \mid t_{v_i})$ may be derived.
- Transition probabilities $Pr$ at each non-leaf node, governing transitions to its children.
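To make the state bookkeeping concrete, the following is a minimal Python sketch of these quantities; the names (`ConceptNode`, `state_of`, the `pr` and `t` fields) are our own illustration rather than anything specified in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ConceptNode:
    """A vertex v_i of the concept tree G(V, E)."""
    name: str
    parent: Optional["ConceptNode"] = None
    children: Dict[str, "ConceptNode"] = field(default_factory=dict)
    pr: Dict[str, float] = field(default_factory=dict)  # Pr over children (non-leaf nodes)
    t: int = 0  # occupation time t(v_i, T); reset when the surfer pops to v_i's parent

    def add_child(self, name: str, prob: float) -> "ConceptNode":
        child = ConceptNode(name, parent=self)
        self.children[name] = child
        self.pr[name] = prob
        return child

def state_of(leaf: ConceptNode) -> List[ConceptNode]:
    """The state sigma = {v_root, ..., v_leaf}: the root-to-leaf path.
    Since the leaf determines the whole path, the leaf alone identifies the state."""
    path = []
    node = leaf
    while node is not None:
        path.append(node)
        node = node.parent
    return path[::-1]
```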
3.1.1 Generation
In this model, at each time step the surfer, having occupied the state $\sigma = \{v_{root}, \ldots, v_{leaf}\}$ with corresponding occupation times $\{t_{root}, \ldots, t_{v_{leaf}}\}$, chooses the next event by sampling sequentially from the distributions $Q_{v_i}(t_{v_i})$ for $v_i \in \sigma$ to decide whether or not to remain on the corresponding subtree. This distribution is assumed to be independent of the occupation time accumulated on any of the child vertices and is applied top-down. That is, we first sample from the distribution corresponding to the root, with the index $i$ going sequentially from the root to the leaf. If the event chosen is "pop to parent", a node belonging to the parent vertex $v_{i-1}$ is chosen according to the corresponding transition probabilities $Pr$. The chosen node may be a leaf or a subtree; in the latter case the time is not incremented until the surfer finally transits to a leaf, the transitions being governed by the transition probabilities $Pr$ at each of the subsequent nodes. On the other hand, if the event chosen is "occupy", we move on to the next vertex $v_{i+1}$ in $\sigma$ and repeat the sampling procedure. If the current vertex $v_i$ under consideration is a leaf, the sample drawn from $Q_{v_i}(t_{v_i})$ decides whether or not the surfer stays on the same leaf vertex.
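A sketch of this sampling procedure, reusing the `ConceptNode` structure above. The `q_occupy` stand-in and the exact points at which occupation times are incremented and reset are our assumptions; the paper derives $Q_{v_i}(o \mid t_{v_i})$ from a gamma occupation-time density (Section 3.1.3) and leaves these bookkeeping details implicit.

```python
import random

def q_occupy(v: ConceptNode) -> float:
    # Stand-in for Q_{v}(o | t_v); any value in (0, 1) that depends on v.t
    # fits the interface.  The paper derives it from a gamma density.
    return 0.8 ** (v.t + 1)

def pick_child(v: ConceptNode) -> ConceptNode:
    names = list(v.pr)
    weights = [v.pr[n] for n in names]
    return v.children[random.choices(names, weights=weights)[0]]

def step(state):
    """One time step: sample occupy/pop top-down along sigma; on a pop at v_i,
    reset the occupation times at and below v_i, re-enter at the parent level
    via Pr, and descend (without incrementing time) until a leaf is reached."""
    for depth, v in enumerate(state):
        if random.random() < q_occupy(v):
            v.t += 1                      # event o: remain on v's subtree
            continue
        for u in state[depth:]:           # event p at v: times below are reset
            u.t = 0
        node = pick_child(v.parent if v.parent else v)  # root pop re-enters at root
        while node.children:
            node = pick_child(node)
        return state_of(node)
    return state                          # occupy drawn everywhere: stay on the leaf
```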
3.1.2 Example
[Figure 1. Transitions on a Concept Tree: a flat tree over the leaf concepts A-E and an equivalent multi-level tree with root 1, internal vertices 1.1, 1.2, 1.3, 1.3.1, 1.3.2 and leaves A (1.1), B (1.2), C (1.3.1), D (1.3.2.1), E (1.3.2.2); the edges are labeled with transition probabilities.]
Let us consider a toy example. Consider the tree whose vertices are numbered as shown in Fig. 1. Following the top-down approach explained above, we define the transition matrix for the topmost level of the tree by collapsing (marginalizing) subtrees into single states, as follows.
$$
\begin{array}{c|cccc}
 & 1 & 1.1 & 1.2 & 1.3 \\
\hline
1   & 0 & Pr(1.1) & Pr(1.2) & Pr(1.3) \\
1.1 & 1 - Q_{1.1}(t_{1.1}) & Q_{1.1}(t_{1.1}) & 0 & 0 \\
1.2 & 1 - Q_{1.2}(t_{1.2}) & 0 & Q_{1.2}(t_{1.2}) & 0 \\
1.3 & 1 - Q_{1.3}(t_{1.3}) & 0 & 0 & Q_{1.3}(t_{1.3})
\end{array}
$$

Because of the hierarchical structure we can further expand $Q_{1.3}(t)$ to include the next level of the tree. Using the independence assumption we can write it as follows (the subscripts on the times are omitted, as they are obvious from the context):

$$
Q_{1.3}(t) \cdot
\begin{bmatrix}
0 & Pr(1.3.1) & Pr(1.3.2) \\
1 - Q_{1.3.1}(t) & Q_{1.3.1}(t) & 0 \\
1 - Q_{1.3.2}(t) & 0 & Q_{1.3.2}(t)
\end{bmatrix}
$$
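As a quick sanity check, every row of the collapsed matrix must sum to one regardless of the current values of the occupation-time terms. A small sketch with arbitrary illustrative values (not taken from the paper) for $Pr$ and $Q$:

```python
# Row-stochasticity check for the collapsed top-level matrix of Section 3.1.2.
Pr = {"1.1": 0.2, "1.2": 0.2, "1.3": 0.6}   # transitions out of the root
Q = {"1.1": 0.7, "1.2": 0.5, "1.3": 0.9}    # Q_{v}(t_v) at the current occupation times

rows = {
    "1":   [0.0, Pr["1.1"], Pr["1.2"], Pr["1.3"]],
    "1.1": [1 - Q["1.1"], Q["1.1"], 0.0, 0.0],
    "1.2": [1 - Q["1.2"], 0.0, Q["1.2"], 0.0],
    "1.3": [1 - Q["1.3"], 0.0, 0.0, Q["1.3"]],
}
for name, row in rows.items():
    assert abs(sum(row) - 1.0) < 1e-12, name
```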
So far we have explained the sample generation process of this model. Now we want to evaluate the probability of transiting from the current vertex $v_1$ to $v_2$ in the next step. Consider the unique top-down path from the root to the destination vertex $v_2$, given by the destination state $\sigma_d = \{v_{root}, \ldots, v_2\}$, and let the corresponding source state be $\sigma_s = \{v_{root}, \ldots, v_1\}$. At each vertex on $\sigma_d$, we evaluate the probability of drawing an occupy event according to $Q_{v_i}(t(v_i))$, after which the surfer transits to the next vertex in the path according to the specified transition probability, until it reaches the destination. Let $i_0$ be the smallest index such that the $i_0$-th vertices differ, $\sigma_d(i_0) \neq \sigma_s(i_0)$; we know that a pop event must have been drawn at $v_{i_0}$, the probability of which is given by $1 - Q_{v_{i_0}}(t(v_{i_0}))$. The probability of the entire transit is thus the product of these values. For instance, the probability of transiting from 1.3.1 to 1.3.2.2 would be
$$
Q_{1.3}(t_{1.3}) \cdot (1 - Q_{1.3.1}(t_{1.3.1})) \cdot Pr(1.3.2) \cdot Pr(1.3.2.2).
$$
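This path-product rule is straightforward to mechanize. The sketch below reuses `state_of` from Section 3.1 and takes `Q` and `Pr` as plain dicts keyed by node name, an interface chosen purely for illustration:

```python
def transit_prob(src_leaf, dst_leaf, Q, Pr):
    """Probability of transiting between two distinct leaves: occupy terms Q
    along the shared prefix (below the root), one pop term (1 - Q) at the
    first source vertex off the shared prefix, and Pr terms down the remainder
    of the destination path.  Q maps node name -> Q_v(t_v) at the current
    occupation time; Pr maps child name -> its parent-to-child probability."""
    s = [v.name for v in state_of(src_leaf)]
    d = [v.name for v in state_of(dst_leaf)]
    i0 = next(i for i, (a, b) in enumerate(zip(s, d)) if a != b)
    p = 1.0
    for name in s[1:i0]:       # occupy events along the shared prefix
        p *= Q[name]
    p *= 1.0 - Q[s[i0]]        # the pop event at v_{i0}
    for name in d[i0:]:        # descend via Pr to the destination leaf
        p *= Pr[name]
    return p

# For the example above (1.3.1 -> 1.3.2.2) this evaluates exactly to
# Q["1.3"] * (1 - Q["1.3.1"]) * Pr["1.3.2"] * Pr["1.3.2.2"].
```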
3.1.3 Estimation of Parameters
Given the structure of the tree and the family of distributions to which $q$ belongs, the parameters can be estimated from frequency counts using maximum likelihood. However, for small training sets only a handful of the events with positive probability will be sampled, and maximum-likelihood estimation suffers in particular from assigning zero probability to all unobserved events, leading to biased estimates. This prompts us to use Bayesian estimation, which performs regularization or smoothing to compensate for these effects. The estimation procedure consists of:
- estimating the state transition probabilities $Pr$ at each node, and
- estimating the parameters of the occupation time density function $q$.
The state transition probabilities $Pr$ associated with a vertex are estimated from relative counts, where a count is recorded whenever the surfer chooses to "pop to parent" from a child vertex. Note that each row of the transition matrix constitutes a multinomial distribution, which we estimate using Laplace smoothing with Dirichlet priors. The occupation time distributions $q(\cdot)$ are estimated from the occupation events occurring at the nodes; the parameter values are set to the expected value of the parameter, where the expectation is taken over the posterior distribution, and the frequencies of the events are used to calculate the posteriors for the corresponding vertices. We now come to the choice of a parametric form for the distributions $q(t_{v_i})$. This choice needs to be as unbiased as possible; that is, given the training data set, we choose the family of parametric distributions that makes the fewest extra assumptions. The most noncommittal distribution on $\mathbb{R}^+$ under a constraint on the first moment is the exponential distribution, which is memoryless and therefore too restrictive for our purpose. Additional constraints on higher moments can be incorporated into our model; note, however, that the confidence in the values of higher-order moments will be lower. Adding a restriction on the first logarithmic moment, i.e. $E[\ln(x)]$, one can solve
$$
\max_{q}\; -\int q(x)\,\ln q(x)\,dx \qquad \text{s.t.}\quad q(x) \geq 0,\quad \int q(x)\,dx = 1,\quad \int x\,q(x)\,dx = \mu_1,\quad \int \ln(x)\,q(x)\,dx = \mu_2,
$$
which yields the two-parameter gamma distribution
$$
q(t) \;=\; \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, t^{\alpha-1} e^{-\beta t}.
$$
If we have enough information to have confidence in the higher moments, we can incorporate them as constraints; for example, incorporating the second moment yields the Rayleigh distribution. The parameter $\alpha$ is known as the shape parameter, whereas $\beta$ is known as the scale parameter. The mean and second moment are $E[t] = \alpha/\beta$ and $E[t^2] = \alpha(\alpha+1)/\beta^2$. Note that the exponential distribution is the special case in which the shape parameter equals 1. Because of the sample sparsity problem explained before, we use Bayesian estimation to learn the parameter values. Let $\{t_1, \ldots, t_N\}$ be the training samples (the subscripts $v_k$ have been suppressed to simplify notation). The gamma family of distributions is self-conjugate for the scale parameter $\beta$, so we use a gamma prior $G(\alpha_0, \beta_0)$ on the scale parameter to obtain
$$
P(\beta \mid \{t_i\}) \;\propto\; P(\beta)\,\prod_{i=1}^{N} q(t_i \mid \beta) \;=\; G\!\left(N\alpha + \alpha_0,\; \beta_0 + \sum_i t_i\right).
$$
The estimated parameter is set to the mean of the posterior,
$$
E[\beta] \;=\; \frac{N\alpha + \alpha_0}{\beta_0 + \sum_i t_i},
$$
which also imposes the prior estimate $\beta = \alpha_0/\beta_0$ when no data are observed. The prior parameters may be estimated from the global behaviour of the entire training sequence.
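A one-line sketch of this update; the function and argument names are ours, and the shape $\alpha$ is assumed to be known or estimated separately, as in the derivation above.

```python
def posterior_scale(samples, alpha, alpha0, beta0):
    """Posterior-mean estimate of the gamma scale parameter beta under a
    gamma prior G(alpha0, beta0), with the shape alpha held fixed:
        E[beta | t_1..t_N] = (N * alpha + alpha0) / (beta0 + sum(t_i)).
    With no samples this falls back to the prior mean alpha0 / beta0."""
    n = len(samples)
    return (n * alpha + alpha0) / (beta0 + sum(samples))

# Example: occupation times observed at one concept node (illustrative values).
beta_hat = posterior_scale([3, 1, 4, 2], alpha=1.5, alpha0=2.0, beta0=4.0)
```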
3.2 Random Walk on the URL Space
In this section we examine how the random walk on the concept space influences the links that are followed by the surfer. We give two schemes for apportioning probabilities to the outlinks based on the state occupied in the concept space. The first is a direct application of the Richardson and Domingos model of an intelligent surfer [18] and makes use of a relevance function, whereas the second scheme uses the oracle that maps the content of a page to its concept label.
3.2.1 Relevance Function-based Browsing
We use the distribution over the concept state $v_{t+1}$ to be occupied by the surfer at time step $t+1$ to model the surfing behaviour on the web graph. Let the current page be $p$ and the set of pages pointed to by its outlinks be $L_p = \{l_1, l_2, \ldots, l_k\}$. We are given a relevance function $R(v, l)$ that measures the relevance of the link $l$ to the concept state $v$, and which may be based either on the anchor text or on the content of the pointed-to page. The relevance function can be calculated in any suitable manner, for example using a tf-idf [19] or LSI [7] score; we only require that $R(v, l) \geq \sum_{v' \in \mathrm{children}(v)} R(v', l)$. The probability of following an outlink, conditioned on the concept state to be occupied by the surfer in the next step, $Pr(l_i \mid v_{t+1})$, can now be evaluated as
$$
Pr(l_i \mid v_{t+1}) \;=\; \epsilon\, \frac{R(v_{t+1}, l_i)}{\sum_{j} R(v_{t+1}, l_j)} \;+\; (1 - \epsilon)\, P_{rnd}(\cdot),
$$
where $P_{rnd}(\cdot)$ is the distribution over the pages if the surfer chooses to select the next page at random, which it does with probability $(1-\epsilon)$. The link-following probabilities are then obtained by taking the expectation over the distribution of the vertices in the concept space. The advantage of having a tree structure with a top-down inference scheme is that we can not only curtail this computation at some depth of the tree according to the computational resources available [14], but also give bounds on the approximation involved.
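A sketch of this mixture for a single candidate next state $v_{t+1}$. The paper leaves $P_{rnd}(\cdot)$ general; we take it uniform over the $k$ outlinks purely for illustration, and `eps` stands for the mixing weight $\epsilon$.

```python
def link_distribution(relevance, eps):
    """Mix the normalized relevance scores R(v_{t+1}, l_i) with a uniform
    random-surfer component.  `relevance` maps each outlink to its
    (nonnegative) score; eps weights the relevance-driven part."""
    total = sum(relevance.values())
    k = len(relevance)
    return {
        link: eps * score / total + (1.0 - eps) / k
        for link, score in relevance.items()
    }

# Hypothetical scores for three outlinks under the predicted concept state.
probs = link_distribution({"l1": 2.0, "l2": 1.0, "l3": 1.0}, eps=0.85)
assert abs(sum(probs.values()) - 1.0) < 1e-12
```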
3.2.2 Oracle-based Browsing
In this model we use the oracle to label each page $l_i$ linked from the current page $p$ with a topic in the concept space, $L(l_i)$. Once provided with this mapping, the random walk model on the concept space allows us to assign probabilities $P(p \to l_i) \propto P(L(p) \to L(l_i))$ to these events; re-normalizing, we obtain the probabilities of following the links. Implicit in this formulation is the fact that the probabilities $P(p \to l_i)$ now depend on the occupation times of the vertices on the path from the root to the current topic $L(p)$.
$$
P(p \to l_i) \;=\; \epsilon\, \frac{P(L(p) \to L(l_i))}{\sum_j P(L(p) \to L(l_j))} \;+\; (1 - \epsilon)\, P_{rnd}(\cdot)
$$
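The oracle scheme composes the concept-space transition probabilities of Section 3.1 with the page labels. A sketch, again with a uniform stand-in for $P_{rnd}(\cdot)$:

```python
def oracle_link_distribution(page_label, outlink_labels, concept_step_prob, eps):
    """Score each outlink l_i by P(L(p) -> L(l_i)), normalize, and mix with a
    random-surfer component.  `concept_step_prob(a, b)` should return the
    concept-space transition probability, e.g. the transit_prob sketch above
    (or the occupy probability when the two labels coincide)."""
    scores = {link: concept_step_prob(page_label, label)
              for link, label in outlink_labels.items()}
    total = sum(scores.values())
    k = len(scores)
    return {link: eps * s / total + (1.0 - eps) / k
            for link, s in scores.items()}
```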
4. Applications
Once the parameters are estimated, the model can be used for tasks such as:
- Predicting the next concept state of the surfer: depending on whether we have a probabilistic or a deterministic label for the content of the current page, we can estimate the next concept state from the model either by taking the expectation over the probabilistic distribution mentioned above or directly from the model.
- Estimating the link-following probabilities / recommending URLs in advance: the link-following probabilities are conditioned on the next concept state $v_{t+1}$ and can be evaluated by taking the expectation over the distribution over the next concept state. In a general graph this would mean taking the expectation over the huge set of possible values that the next state can take; the tree model, however, allows us to calculate it at different resolutions of the hierarchy.
- Customized web page rankings: the link-following probabilities calculated by this model have obvious applications in ranking strategies similar to Pagerank. In the Pagerank algorithm the transition matrices are not customized, and researchers have looked at ways of achieving some level of customization. Customizing the random-jump component of the Pagerank algorithm has a closed-form solution in terms of some precomputed pageranks [9]. Customization of the full transition matrix does not yield easily to closed-form solutions, but can nonetheless be implemented by linearly combining pageranks precomputed from some basis transition matrices [18]. With this model we can in principle carry this customization one step further and incorporate the temporal dependencies, i.e., the time-varying nature of these transition matrices. Given the parameters of the model, these time-dependent transition matrices are fairly easy to characterize; the principal issue would be to devise algorithms by which their stationary distributions can be computed efficiently.
- Clustering users: using EM on a mixture of semi-Markov web-surfing processes to cluster users. Hidden Markov models have been used as an underlying model for clustering users [21, 23], and they are subsumed by the corresponding mixture of semi-Markov models; it would be interesting to see whether better results are obtained with the more flexible technique.
In this paper we explore only the link prediction task.

5. Experimental Results
To evaluate our model's predictive accuracy, ideally one should have client-side logs of web pages visited on a large variety of topics (and hence reflecting visits to many sites), as well as the location of these pages on the concept tree. Despite an extensive search we could not find rich enough client-side log data in the public domain with the desired characteristics. Available log datasets are typically from a specific site. This has several drawbacks for client-side modeling, since one does not get a 360-degree view of the user, including what he does at other sites [15]. More specifically, in our framework, the occupation times the surfer has accumulated on each vertex of the concept tree (because of visits to other sites) prior to entering the current site are unknown for site-centric data. We simply set the initial occupation times to zero in such cases. In the absence of client-side data, we used web-log data from the web-community portal www.sulekha.com for our purpose [3]. This data set consists of webserver logs collected over February 2001, during the initial phases of the formation of the portal, and comprises 423,000 hits. The site hosts multiple, diverse topics and has an implicit hierarchical organization which is fairly easy to extract. Due to the presence of several dedicated visitors to the site, some fairly long sessions can be found.
We used the well-separated hierarchical classification of content provided by Sulekha as our concept tree, although some manual intervention was required to label the web documents with their hierarchical locations. We ran the algorithm on a three-level hierarchy whose structure is as follows:
-album
-articles
-articles-authors
-coffeehouse
-coffeehouse-biztech
-coffeehouse-biztech-messages
-coffeehouse-books
-coffeehouse-books-messages
-coffeehouse-contests
-coffeehouse-contests-messages
-coffeehouse-creative
-coffeehouse-creative-messages
-coffeehouse-fun
-coffeehouse-fun-messages
-coffeehouse-messages
-coffeehouse-movies
-coffeehouse-movies-messages
-coffeehouse-personal
-coffeehouse-personal-messages
-coffeehouse-philosophy
-coffeehouse-philosophy-messages
-coffeehouse-wo-men
-coffeehouse-wo-men-messages
-favorite
-favorite-favorites
-news
The revisit-time histograms for some of the concepts are shown for aggregated users in Fig. 2:A and Fig. 2:B, followed by histograms for an individual user in Fig. 2:C and Fig. 2:D. The X axis represents the number of times the state was revisited consecutively (occupation time), whereas the Y axis denotes how frequently this was observed. For instance, if in the entire history of a surfer there are 5 occasions on which he stayed on a particular state for 10 consecutive time steps, the point plotted would be (X=10, Y=5). Note that the revisit times for the individual user are not monotonically decreasing, showing that pure Markov-chain models, with exponentially decaying re-visit times, are inappropriate. It is on these kinds of user behaviours that the current model hopes to capitalize.
[Figure 2. Occupation Probabilities at Different Leaves. X axis denotes occupation time (length of a repeated sequence) and Y axis denotes frequency of observation of such an event. The occupation time probabilities for aggregate behaviour at the leaves "coffeehouse" and "coffeehouse/movies" (A, B) are seen to be close to exponential, whereas for individual users (C, D) they are not.]
5.1 Estimation and Validation
As a measure of validation we report the leave-one-out cross-validation error in predicting the next link visited by a user. That is, from the collection of web sessions of n users we remove one test session and use the rest as training data for our model, which is then used to predict the links followed in the removed session. At each step of the test session the trained model is used to estimate the probability distribution over the leaves of the tree. The error reported is the total variational distance between the estimated distribution and the actual one, averaged over the length of the session. The prior distributions used for smoothing the parameters were estimated from the data averaged over all users. Entries that could be identified as coming from proxies and dial-up users were removed, as they may not represent the single users whom we are trying to model. Standard sessionization procedures [22, 3] were followed. Users with very short sessions were ignored, since they do not generate adequate data for parameter estimation; the threshold for sequence length was set to 75.
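A sketch of the reported error measure, under the assumption (ours) that the "actual" distribution at each step is a point mass on the leaf that was in fact visited next:

```python
def total_variation(p: dict, q: dict) -> float:
    """Total variational distance between two distributions over the leaves,
    given as dicts mapping leaf name -> probability."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def session_error(predictions, visited) -> float:
    """Per-session error: TV distance between each predicted next-leaf
    distribution and a point mass on the observed next leaf, averaged over
    the steps of the held-out session (leave-one-out over sessions)."""
    errors = [total_variation(p, {leaf: 1.0})
              for p, leaf in zip(predictions, visited)]
    return sum(errors) / len(errors)
```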
Table 1 lists the 10 users for which the (absolute) decrease in the total variational error of link prediction is the highest. Over the entire population of users considered, the average improvement was 3.57% with a standard deviation of 3.24%.

Table 1. Total Variational Error of Prediction
User   Gamma Distribution   Memoryless   Improvement
1      0.676                0.758        12.15%
2      0.640                0.683        6.74%
3      0.626                0.667        6.43%
4      0.635                0.671        5.66%
5      0.666                0.688        3.37%
6      0.785                0.810        3.23%
7      0.525                0.542        3.17%
8      0.662                0.681        2.9%
9      0.702                0.714        1.71%
10     0.743                0.754        1.45%
To give some anecdotal feel for the kind of user on whom the semi-Markovian model performs better, the sequence of a specific user with a high value of lift is given below. The number in parentheses represents the number of times the corresponding concept was revisited contiguously.
01 Feb 08:00:29 /coffeehouse/fun/(3)
01 Feb 08:01:40 /coffeehouse/contests/(2)
01 Feb 08:10:53 /coffeehouse/biztech/(1)
01 Feb 08:11:48 /coffeehouse/contests/(4)
01 Feb 08:49:04 /articles/articlesl/(2)
01 Feb 09:15:13 /coffeehouse/fun/(1)
01 Feb 09:15:28 /coffeehouse/contests/(40)
01 Feb 10:21:50 /coffeehouse/fun/(1)
01 Feb 10:22:07 /coffeehouse/messages/(5)
01 Feb 10:23:07 /coffeehouse/wo-men/(9)
01 Feb 10:29:37 /coffeehouse/creative/(4)
01 Feb 10:30:21 /coffeehouse/personal/(6)
01 Feb 10:35:10 /coffeehouse/contests/(1)
01 Feb 10:36:08 /coffeehouse/biztech/(13)
01 Feb 10:41:57 /coffeehouse/books/(10)
01 Feb 10:46:38 /coffeehouse/philosophy/(39)
01 Feb 11:35:38 /coffeehouse/contests/(11)
The important feature to note in the list above is the tendency to remain on the same topic over several pages, as well as the eventual concept drift. This is clearly non-memoryless behaviour, for which the semi-Markovian model performs better. We expect loyal users of a web portal to have browsing characteristics similar to this, and therefore to be better suited to modeling by a semi-Markov process, whereas simpler Markovian models may be adequate for the casual surfer. The revisit-time histograms for a loyal user on some of the topics are shown in Fig. 3.
[Figure 3. Occupation Probabilities at Different Leaves for an individual user. Panels: /coffeehouse/ (A), /articles/article (B), /coffeehouse/philosophy (C), /coffeehouse/wo-men (D); X axes denote sequence length, Y axes frequency.]
6. Conclusion
First-order Markov models are inadequate for accurately describing the surfing behavior of many users. The use of higher-order Markov models, or their equivalent probabilistic grammars, is on the other hand computationally very demanding, as the size of the state space grows exponentially with the order of the model. In this paper, we utilize an additional "concept" variable and an imposed hierarchy to build a model which is computationally simpler yet effective. The only extra computation, as compared to a Markov model with the same number of states, involves estimating the occupation time probability distributions corresponding to each state, and is therefore linear in the number of states. This extra computation is justified by the ability to model the changing interest of a user over a surfing session, observed particularly in long-term or loyal users. Support for this observation is provided by empirical studies of link prediction capability within a rich web-community portal. Lack of time prevented the more interesting comparison of this model with Kth-order Markov models, and experiments were also hampered by the lack of publicly available data sets; if rich client-side user logs were available, more elaborate studies could have been conducted. An immediate extension of the proposed model would be to use it for clustering users based on a mixture model. Since this model subsumes memoryless behaviour, we expect that, given adequate data and a mixture of both types of users, it can identify the user groups for whom complex models are necessary as well as those for whom simpler Markov models are adequate.
7. Acknowledgement
This research was supported in part under grant #8032 from Intel Corporation.
References
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data Bases (VLDB), pages 487–499. Morgan Kaufmann, 1994.
[2] R. Agrawal and R. Srikant. Mining sequential patterns. In P. S. Yu and A. S. P. Chen, editors, Eleventh International Conference on Data Engineering, pages 3–14, Taipei, Taiwan, 1995. IEEE Computer Society Press.
[3] A. Banerjee and J. Ghosh. Clickstream clustering using weighted longest common subsequences. In Workshop on Web Mining, 1st SIAM Conference on Data Mining, pages 33–40, April 2001.
[4] J. Borges and M. Levene. Mining navigation patterns with hypertext probabilistic grammars. Technical Report RN/99/08, Dept. of Computer Science, University College London, Gower Street, London, UK, February 1999.
[5] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Neural Information Processing Systems 13, 2001.
[6] H. K. Dai and B. Mobasher. Using ontologies to discover domain-level web usage profiles. In Second Workshop on Semantic Web Mining, at the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02), pages 61–82, 2002.
[7] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.
[8] M. Deshpande and G. Karypis. Selective Markov models for predicting web-page accesses. In Proceedings SIAM Int. Conference on Data Mining (SDM'2001), April 2001.
[9] T. Haveliwala. Topic-sensitive pagerank. In Eleventh International World Wide Web Conference, Honolulu, Hawaii, May 2002.
[10] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[11] T. Mitchell. Conditions for the equivalence of hierarchical and flat Bayesian classifiers. Technical report, Center for Automated Learning and Discovery, Carnegie Mellon University, 1998.
[12] B. Mobasher, H. Dai, T. Luo, and M. Nakagawa. Effective personalization based on association rule discovery from web usage data. In Web Information and Data Management, pages 9–15, 2001.
[13] B. Mobasher, H. Dai, T. Luo, and M. Nakagawa. Using sequential and non-sequential patterns in predictive web usage mining tasks. In Proceedings of the IEEE International Conference on Data Mining (ICDM'02), Maebashi City, Japan, 2002.
[14] A. W. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research, 8:67–91, 1998.
[15] B. Padmanabhan, Z. Zheng, and S. Kimbrough. Personalization from incomplete data: What you don't know can hurt. In KDD, pages 154–163, 2001.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford University, Stanford, CA, 1998.
[17] J. E. Pitkow and P. Pirolli. Mining longest repeating subsequences to predict world wide web surfing. In USENIX Symposium on Internet Technologies and Systems, 1999.
[18] M. Richardson and P. Domingos. The intelligent surfer: Probabilistic combination of link and content information in PageRank. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[19] G. Salton. Automatic Information Organization and Retrieval. McGraw-Hill, New York, 1968.
[20] R. R. Sarukkai. Link prediction and path analysis using Markov chains. Computer Networks, 33(1-6):377–386, 2000.
[21] P. Smyth. Clustering sequences with hidden Markov models. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 648. The MIT Press, 1997.
[22] J. Srivastava, R. Cooley, M. Deshpande, and P. Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12–23, 2000.
[23] A. Ypma and T. Heskes. Categorization of web pages and user clustering with mixtures of hidden Markov models. In Proceedings of the International Workshop on Web Knowledge Discovery and Data Mining (WEBKDD'02), Edmonton, Canada, July 2002.