Efficient Hybrid Web Recommendations Based on Markov Clickstream Models and Implicit Search∗

Zhiyong Zhang

Olfa Nasraoui

Abstract

In this paper, we present novel methods that combine (1) Markov models and (2) web page content search techniques to generate web navigation recommendations. For clickstream modeling, both first-order and second-order Markov models were studied, and a compact storage format for Markov transition matrices was used. For content-based search, a search engine was used to obtain similar-content pages for recommendation, compensating for the sparsity of the Markov model and thus improving coverage. Experiments were conducted on real web clickstream logs and confirmed the efficiency of the proposed methods.

ACM/IEEE Web Intelligence 2007

1. Introduction

Markov models have been successfully used for modeling and predicting users' web browsing behavior. One of the assumptions of a simple Markov model is its "memoryless" property: a user's future visit depends only on the current visit and not on past visits. This corresponds to the first-order Markov model. But for predicting browsing behavior, such models do not look far enough into the past to correctly discriminate between different access patterns. Hence, higher-order models are often used. However, higher-order models require higher computational complexity and more memory, and some studies [5] have shown that they may also lead to lower coverage, even though they improve accuracy. On the other hand, using a search engine for recommendations has not been studied much; yet using a search engine to efficiently retrieve similar-content pages can improve coverage. Markov model based methods tend to group URL links with similar usage patterns (collaborative filtering), while content-based filtering tends to group links with similar content. Hence, the two methods can complement each other to form a hybrid recommender system. In this paper, we use a mixed-order Markov model that can adaptively select the right order for specific user sessions. To address the memory requirements of higher-order Markov models, we use a compact storage format that takes advantage of the sparsity of the session data to save space and improve efficiency. More importantly, we also exploit content similarity, through an implicit query to a fast search engine, as a complementary mechanism for solving the low-coverage problem tied to higher-order Markov models.

∗ The authors are with the Knowledge Discovery & Web Mining Lab, Dept. of Computer Science and Engineering, University of Louisville, Louisville, KY 40292, USA

2. Related Work and Contributions
One family of link prediction and recommendation techniques is based on data mining: in a training phase, classification or clustering methods are used to obtain user profiles; later, in the recommendation phase, once a new user session is available, techniques such as nearest neighbors or decision trees are used to select the relevant profiles and give recommendations. In [11], Nasraoui et al. presented several single-step and context-sensitive two-step recommender systems by training neural networks through back-propagation. Association rule mining and collaborative filtering techniques have also been used for web page recommendation [10]. These data mining techniques generally do not put much emphasis on the order of page visits and are thus suitable for e-commerce recommendations and for "content" or "e-media" websites. But for applications where the sequential order is important, they may not be sufficient.

Another family of recommendation techniques uses Markov models. In [15], an early work was presented using Markov chains for link prediction and path analysis. This work and another work [6] use first-order Markov models. To improve prediction accuracy, higher-order Markov models were used. In [13][7], hybrid-order Markov models were used to overcome the low overall accuracy of first-order Markov models. In [5], Deshpande et al. used selective Markov models to reduce the state space complexity of All-Kth-Order Markov models while striving to keep a high prediction accuracy. In [16], transition matrix compression techniques were used to reduce the computational complexity incurred by high-order Markov models. Meanwhile, other efforts began to combine Markov models with other data mining techniques in order to improve prediction accuracy and system efficiency. In [2], K-means clustering was used to clone new states when the original states were not accurate enough. In [9], the EM clustering method was used to group request sequences into K clusters and learn the parameters of a mixture Markov model from the training data. In [17], the CitationCluster algorithm was used to build a conceptual hierarchy for link prediction. Later work tried to combine Markov models with content-based recommendation. In [12], an N-order Markov model was combined with the site content for prediction, and in [1], relational Markov models, which are closely tied to the site structure, were used for web recommendation.

Combining content information with Markov model based usage patterns can provide better recommendations, since Markov models capture users' access behavior while content-based methods take into account what users have accessed. However, the above attempts to integrate content used the site map or site hierarchy instead of the page content: they require dividing the web-site links into thematic or semantic categories in advance, and then use this constructed site map to help the Markov models make better predictions. No work has combined the page content with sequential access patterns on the fly for web recommendation. In our work we make the following contributions:

1. Using a hybrid-order (first-order and second-order) Markov model in combination with web page content to provide web recommendations.
2. Using a compact structure for storing the Markov state space, which reduces the computational complexity of giving recommendations.
3. Using a search engine to index the content and retrieve similar web pages for web recommendation.

We perform experiments on a real web site with one month's worth of web logs for training and one day's logs for testing. We have also implemented a working demo system using the Squid web proxy. The rest of the paper is organized as follows. Section 3 discusses the data preprocessing; Section 4 introduces hybrid Markov models for web access prediction; Section 5 presents content-based recommendation using a search engine; Section 6 presents the combination of the two methods; Section 7 presents our experimental results. Finally, Section 8 concludes the paper.

3. Data Preprocessing and Session Extraction

A web log file contains the requesting IP address, date, time, requested URL, etc. These records may contain requests for embedded objects such as graphics, CGI, or Javascript files, as well as crawler/robot visits and bad requests (with HTTP error code 4xx); such records are cleaned from the data. After cleaning the weblog records, we start our session extraction process. Let V = (i, t, u, r), where i, t, u, r are the IP address, visit time, URL, and referrer, represent one visit entry from the log file. Given two visits V1 = (i1, t1, u1, r1) and V2 = (i2, t2, u2, r2) with t1 < t2, we consider them sequential visits in the same session only if they satisfy the following conditions: (1) i1 = i2; (2) t2 − t1 < τ; (3) r2 = u1, where τ is the session timeout threshold, set to 10 minutes in our experiments. We also apply two additional rules for identifying a session: (1) as soon as a repeated URL is detected in a session, we start a new session from the revisited URL, similar to Maximal Forward References [4]; (2) longer sessions are divided into shorter sessions with a maximum length of 10 (in our experiments). The last two rules ensure that there are no loops within a session.
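As an illustration, the session extraction rules above can be sketched as follows. This is a minimal sketch under our own naming assumptions; the `Visit` record fields and the `extract_sessions` helper are hypothetical, not from the paper.

```python
from collections import namedtuple

# Hypothetical field names for the paper's visit entry V = (i, t, u, r).
Visit = namedtuple("Visit", ["ip", "time", "url", "referrer"])

TIMEOUT = 600       # tau: 10-minute session timeout, in seconds
MAX_LENGTH = 10     # maximum session length used in the experiments

def extract_sessions(visits):
    """Split a time-ordered list of visits into loop-free sessions.

    Two consecutive visits stay in the same session only if they share
    an IP, arrive within TIMEOUT seconds, and the referrer of the second
    equals the URL of the first. A repeated URL or reaching MAX_LENGTH
    starts a new session, so no session contains a loop.
    """
    sessions, current = [], []
    for v in visits:
        if current:
            prev = current[-1]
            sequential = (v.ip == prev.ip
                          and v.time - prev.time < TIMEOUT
                          and v.referrer == prev.url)
            urls = [x.url for x in current]
            if not sequential or v.url in urls or len(current) >= MAX_LENGTH:
                sessions.append(current)   # close the current session
                current = []
        current.append(v)
    if current:
        sessions.append(current)
    return sessions
```

A revisited URL closes the current session and seeds the next one, mirroring the Maximal Forward Reference rule.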

Figure 1. Session Length Distribution for one week's logs (220,218 sessions in total)

Figure 1 shows the session length distribution that results from our session extraction method. Short sessions account for a large proportion of the click sessions; for example, two-click and three-click sessions account for 91.1% of the total number of sessions. This is our motivation for using first-order and second-order Markov models for recommendations.

4. Hybrid Markov Models for Web Session Analysis

We say that X_n is a Markov chain [8][14] with transition matrix p(i, j) if, for any j, i, i_{n-1}, ..., i_0,

P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = p(i, j)

which states that the future state X_{n+1} depends only on the current state X_n; any other information about the past is irrelevant for predicting X_{n+1}. In cases where the first-order Markov model is not sufficient, higher-order Markov models can be used. In a second-order Markov model, the future state depends not only on the current state but also on the previous state. For the second-order Markov model we have:

P(X_t = i_0 | X_0 = i_t, ..., X_{t-1} = i_1) = P(X_t = i_0 | X_{t-2} = i_2, X_{t-1} = i_1) = q_{i_2 i_1 i_0}

As can be seen from Figure 1, short sessions dominate typical user sessions. We can use first-order and second-order Markov models for length-2 and length-3 sessions, respectively; thus, first-order and second-order Markov models alone cover more than 90% of the sessions. We therefore do not increase the state space complexity beyond the second-order Markov model. From first order to second order, the state space of the transition matrix grows quadratically. Fortunately, for sparse web sessions there are many zeros in the state transition matrix, and many states never occur in any session. We can save tremendous space by not storing these zeros and the non-occurring states. Figure 2 gives some sample web sessions extracted from log files and their corresponding first-order and second-order Markov transition probability matrices; here, we omit the non-occurring sequences in the matrices.

Figure 2. Sample Web Sessions with the corresponding 1st & 2nd order Transition Probability Matrices

The state-transition frequencies in the web sessions are used as a voting mechanism. By omitting the zeros in the transition probability matrices of Figure 2, we get the compact storage format in Figure 3.

Figure 3. Compact Form Transition Matrix

Finally, we use the following normalization to obtain transition probabilities. For the first-order Markov model transition probability, we have:

P_{i,j} = w_{i,j} / Σ_k w_{i,k}    (1)

where w_{i,j} is the number of transitions from i to j and the denominator is the number of transitions from i to any other state. For second-order transition probabilities, we have:

P_{i,j,k} = w_{i,j,k} / Σ_l w_{i,j,l}    (2)

where w_{i,j,k} is the number of transitions from i to j to k, and the denominator is the number of transitions from i (followed by j) to all other states.

Figure 4. Hybrid Markov Models

We use a hybrid Markov recommendation process, as shown in Figure 4. When the session length is one, we use the first-order Markov model for recommendation. As the session length increases to two or more, we use a hybrid model combining the first-order and second-order models. Links that occur in both the first-order recommendation set and the second-order recommendation set accumulate weights from both models. The links in the recommendation set are sorted in decreasing order of their cumulative recommendation weights.
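The compact, sparse storage and the normalizations of equations (1) and (2) can be sketched as follows. This is our own illustrative sketch: nested dictionaries stand in for the paper's compact transition-matrix format, storing only observed transitions.

```python
from collections import defaultdict

def build_models(sessions):
    """Build sparse first- and second-order transition probabilities.

    Only observed transitions are stored (zeros and non-occurring
    states are never materialized):
      first_order[i][j]       = P(next = j | current = i)        -- eq. (1)
      second_order[(i, j)][k] = P(next = k | prev = i, cur = j)  -- eq. (2)
    """
    w1 = defaultdict(lambda: defaultdict(int))
    w2 = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        for a, b in zip(s, s[1:]):
            w1[a][b] += 1                # count transitions i -> j
        for a, b, c in zip(s, s[1:], s[2:]):
            w2[(a, b)][c] += 1           # count transitions i -> j -> k

    def normalize(w):
        # Divide each count by the row total, per equations (1) and (2).
        return {state: {nxt: n / sum(nxts.values())
                        for nxt, n in nxts.items()}
                for state, nxts in w.items()}

    return normalize(w1), normalize(w2)
```

A first-order lookup is then a single dictionary access on the current URL; a second-order lookup keys on the pair of the two most recent URLs.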

5. Content-Based Recommendation via a Devoted Search Engine

The Markov model captures users' sequential browsing behavior, but it does not address users' content needs. In order to improve coverage and to be able to give recommendations even when no previously seen visit sequences resemble the current user's, we resort to the web page content for help. The idea behind content-based filtering is that, given a few pages that a user has viewed, the system recommends other pages whose content is similar to that of the viewed pages. Our recommendation process essentially consists of transforming a new user session into an implicit query that can be understood by a devoted search engine. The search engine is devoted to retrieving relevant documents from an inverted index of the website, built using Lucene¹ in a preliminary phase. First, each URL in the user session is mapped to a set of content terms that are most characteristic of that URL. Then these terms are combined with their frequencies to form a query vector.

Figure 5. Cumulative session to content query transformation in the Content-based Filtering approach

Once the query has been formed, it is delivered to the search engine for further processing, and the search engine returns a ranked set of web pages sorted by the cosine similarity between the vector of TF-IDF-weighted terms of the query (which in turn represents the user's session) and the TF-IDF-weighted terms of the indexed pages. Hence the results accomplish the goal of content-based filtering. The user session is transformed into a query that accumulates all the term weights of the pages that are visited. However, the weight w1 of an older page P1 is weakened with the arrival of each new page in the session by multiplying it by a forgetting factor F in (0, 1) (set to 0.95 in our experiment): w1 ← F·w1. The cumulative session-to-content-query transformation is illustrated in Figure 5, and the content-based search recommendation is shown in Figure 6. The recommendation process can be summarized as: session → URLs → terms → fielded query vector → results, ranked according to the cosine similarity between result vector and query vector.

Figure 6. Content-based Filtering implemented via a Search Engine

¹ http://lucene.apache.org/java/docs/
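The cumulative session-to-query transformation with the forgetting factor can be sketched as follows. This is a minimal sketch under one assumption: `url_terms` is a hypothetical mapping from each URL to its characteristic term frequencies, which the paper obtains from the indexed page content.

```python
from collections import Counter

FORGET = 0.95  # forgetting factor F in (0, 1), as set in the experiment

def session_to_query(session, url_terms):
    """Accumulate term weights over a session, decaying older pages.

    Every time a new page arrives, all previously accumulated weights
    are multiplied by the forgetting factor F (w1 <- F * w1), so terms
    from older pages contribute less to the implicit query.
    """
    query = Counter()
    for url in session:
        for term in query:                 # decay weights of older pages
            query[term] *= FORGET
        for term, freq in url_terms.get(url, {}).items():
            query[term] += freq            # add the new page's terms
    return dict(query)
```

The resulting weighted term vector is what would be handed to the search engine as the implicit query.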

6. Combination of Markov Models and Nutch Search Engine for Navigation Recommendation

In our online recommendation system, we use the information acquired on the fly from our real-time session-tracking module and process these sessions through both Markov-based prediction and search-engine-based content-based filtering. The result is a hybrid recommendation system. We suggest two hybrid schemes: weighted combination and cascaded combination.

6.1. Weighted Combination

In the first hybrid recommendation scheme, the Markov-model based methods and the content-based methods are connected in parallel with the same input user session. When the session length is smaller than 2, the first-order Markov model is used. Simultaneously, pages that are similar in content to the visited page(s) are retrieved for recommendation. As the session length grows, the recommendation system starts exploiting hybrid-order Markov models and merging the visited page term vectors, in order to produce both Markov model based recommendations and content-based recommendations. The final recommendation set is the combination of the two sets.
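The parallel merge above can be sketched as follows. The mixing weight `alpha` is our own assumption; the paper only states that links appearing in both sets accumulate weight from both sources, not how the two sources are scaled.

```python
def weighted_combination(markov_recs, content_recs, alpha=0.5):
    """Merge two scored recommendation sets produced in parallel.

    markov_recs and content_recs map URL -> score. URLs present in
    both sets accumulate weight from both sources; the result is
    sorted in decreasing order of cumulative weight.
    """
    merged = {}
    for url, score in markov_recs.items():
        merged[url] = merged.get(url, 0.0) + alpha * score
    for url, score in content_recs.items():
        merged[url] = merged.get(url, 0.0) + (1 - alpha) * score
    return sorted(merged, key=merged.get, reverse=True)
```

With `alpha=0.5` both sources count equally; tuning it would trade off usage-based against content-based evidence.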

Figure 8. Cascaded Combination of Markov-model and Content Search for Recommendation


6.2. Cascaded Combination

In this scheme, we first use Markov models to predict which page the user is likely to click after certain visited pages, and then search for pages that are similar in content to the Markov-predicted page. Figure 8 gives this alternative combination scheme, also known as cascaded combination [3]. In this hybrid recommendation scheme, the output of the Markov-model based method is used as input to the content-based method. As shown in Figure 8, the content of the recommended URLs that result from the Markov-based methods is used to search for the final similar-in-content pages to be recommended. Basically, this scheme uses the output of the Markov model methods to improve the accuracy of the content-based similarity search.
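The cascade can be sketched as follows. The two callables are stand-ins for the paper's components: `markov_predict` returns a ranked list of predicted URLs, and `content_search` returns pages similar in content to one URL; `top=2` matches the experiments, which feed only the first two Markov recommendations into the content search.

```python
def cascaded_combination(session, markov_predict, content_search, top=2):
    """Cascade: Markov prediction first, then content-based search.

    The top few Markov-predicted URLs seed the content search; their
    similar-content results are concatenated, dropping duplicates so
    each page keeps its highest-ranked position.
    """
    final, seen = [], set()
    for url in markov_predict(session)[:top]:
        for page in content_search(url):
            if page not in seen:
                seen.add(page)
                final.append(page)
    return final
```
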


7. Experimental Results

We conducted several experiments to evaluate our Markov model based methods and our content-based methods.

7.1. Hybrid Markov Model Evaluation

In our experiments, we used 30 days' (December 01 - December 30, 2005) worth of logs for training the Markov models and another day's (January 01, 2006) logs for testing. For that day's log, our session extraction process resulted in 10,901 sessions, with a length distribution similar to the one-week distribution shown in Figure 1. For each session, we use the previous URL(s) as input to see how accurately our system can predict the following URL. If our recommended URL is the same as the real URL, we count the prediction as accurate; otherwise, we count it as a failure. If we find a match between the real URL and a recommended page, we also record the rank of the matched page in our overall recommendation set. Another way to capture the ranking is to record whether the real URL falls into the top-N recommendation set for varying N.

Figure 7. Weighted Combination of Markov-model and Content Search for Recommendation
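The top-N evaluation protocol described above can be sketched as follows; the `recommend` callable is a hypothetical stand-in for any of the recommendation policies under test.

```python
def top_n_accuracy(test_sessions, recommend, n=10):
    """Fraction of test sessions whose actual next URL is in the top-n.

    Each session is split into a history (all but the last URL) and the
    actual next URL; recommend(history) returns a ranked URL list. A
    session with no matching recommendation counts as a failure, as in
    the paper's evaluation.
    """
    hits = 0
    for session in test_sessions:
        history, actual = session[:-1], session[-1]
        if actual in recommend(history)[:n]:
            hits += 1
    return hits / len(test_sessions)
```

Sweeping n from 1 to 10 reproduces the kind of top-N accuracy curves reported in Figure 9.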

Figure 9. Recommendation Accuracy for Different Markov Model Policies (5,044 sessions with length ≥ 3 were used)

In our first experiment, we compare Markov models of different orders. These experiments were conducted using sessions of length ≥ 3, and when no recommendation was produced, the accuracy was counted as zero. Altogether we have 5,044 such sessions from one day's logs. From Figure 9, we can see that the second-order model gives the lowest accuracy in general, while our hybrid-order Markov model policy performs slightly better than the first-order model, especially when N is low (< 3). The reason for the low overall accuracy of the second-order Markov model is the larger number of null recommendation sets.

Table 1. Comparison of Different Markov-model Policies

model name     num of no match among top-20    num of no recommendation
first-order    749 (14.85%)                    66 (1.31%)
second-order   1279 (25.36%)                   363 (7.2%)
hybrid-order   744 (14.75%)                    66 (1.31%)

Table 1 compares the number of no-matches among the top-20 recommendations and the number of null recommendation sets for the different Markov model policies. As expected, higher-order Markov models are more likely to produce null recommendation sets, since longer sequences are less likely to have occurred in the visit history. Notice the slight increase in accuracy from the first-order model to the hybrid model; this slight increase can be important in some cases.

In our second experiment, we observe the accuracy for different session lengths. To do that, we used session lengths ranging from 2 to 6 with the hybrid Markov model.

7.2. Content-based Recommendations

Because the content-based search recommendations do not depend on user access behavior, but rather on content on the fly, it would be unfair to use the above session prediction accuracy for their evaluation. Since content-based recommendations rely on the assumption that a user's visited pages share similar content, we validate this assumption. The recommended pages result from retrieving all the pages whose content is similar to the previously visited pages. If the recommended pages are also similar in content to the actual (to-be-visited) pages, then our assumption is correct and supports our recommendations. In our experiments, we used Nutch to crawl our school's web site (with a crawling depth of 6), resulting in 30,405 pages. The crawling process took 8 hours. We use cosine similarity, as described in Section 5, for ranking similar-content pages. In this experiment, 450 sessions of different lengths, from one day, were used.
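The cosine similarity used for this ranking can be sketched as follows, over sparse term-weight vectors represented as dictionaries (our own representation choice; the paper's index stores TF-IDF-weighted terms).

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```
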

Figure 10. Accuracy Comparison for Different Session Lengths In One Day With Hybrid Markov Models

Figure 11. Content Similarity Comparison For Different Session Lengths

From Figure 10, we can see that session length 2 has the lowest average accuracy. This is because for this session length we can only use the first-order Markov model, recommending the second URL given the first. For longer sessions, we can use hybrid-order Markov models, which have higher accuracy. Notice that lengths 3 and 6 have higher accuracies than the other session lengths. One possible explanation is the use of the hybrid scheme, which combines second-order (corresponding to session length 3) and first-order Markov models.

Figure 11 shows that length-2 and length-6 sessions get a higher overall similarity value. There may be navigation links that lead to destination pages that are not necessarily very similar in content to the navigation pages. In our data set, length-3 and length-5 sessions contain more navigation links, which may explain their lower overall similarity; further study is needed to determine whether this pattern generalizes to other data sets. This result also shows that content-based methods promise to complement Markov models, which have their lowest accuracy at length 2.

7.3. Evaluation of the combined methods In this section, we present experimental results that show how content-based methods are improved by using Markov models. The experimental setup is similar to the above.

We use 500 sessions, randomly chosen from one day's web log, which were not used in training the Markov models. We compare the results of using the input sessions directly in the content-based search against using the Markov-based recommendations as input to the content-based search. For the Markov recommendation, we use only the first two recommended URLs as input to the content-based search.


Figure 12. Page Cosine Similarity for Weighted And Cascaded Combination of Markov-based Navigation methods and Content-based Methods

Figure 12 shows that the overall content-similarity improved noticeably by using Markov recommendations instead of the original input session as input to the contentbased search.

8. Conclusions and Future Work

In this paper, we have presented a hybrid architecture that includes both Markov models and content-based search. While Markov models capture users' navigation patterns, content-based search improves coverage. Through our experiments, we have shown the effectiveness of combining implicit search with Markov models in a recommender system. In the future, page viewing time can be taken into account to represent user sessions more accurately. We will also investigate incorporating additional recommendation sub-models such as cluster-based collaborative filtering.

9. Acknowledgments This work is partially supported by National Science Foundation CAREER Award IIS-0133948.

References

[1] C. Anderson, P. Domingos, and D. Weld. Relational Markov models and their application to adaptive web navigation. 2002.
[2] J. Borges and M. Levene. A dynamic clustering-based Markov model for web usage mining. 2004.
[3] R. Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331-370, November 2002.
[4] M.-S. Chen, J. S. Park, and P. S. Yu. Efficient data mining for path traversal patterns. Knowledge and Data Engineering, 10(2):209-221, 1998.
[5] M. Deshpande and G. Karypis. Selective Markov models for predicting web-page accesses. In Proceedings of the SIAM Int. Conference on Data Mining, 2001.
[6] D. Dhyani, S. S. Bhowmick, and W. K. Ng. Modelling and predicting web page accesses using Markov processes. In 14th International Workshop on Database and Expert Systems Applications (DEXA'03), 2003.
[7] X. Dongshan and S. Junyi. A new Markov model for web access prediction. Computing in Science and Engineering, 4(6), 2002.
[8] R. Durrett. Essentials of Stochastic Processes. Springer, 1st edition, 2001.
[9] Y. Liu, A. An, and X. Huang. Web surfing recommendations in a real application. In Proceedings of the ECML/PKDD'04 Workshop on Statistical Approaches for Web Mining (SAWM), Pisa, Italy, 2-13, 2004.
[10] B. Mobasher, R. Cooley, and J. Srivastava. Creating adaptive web sites through usage-based clustering of URLs. In KDEX '99: Proceedings of the 1999 Workshop on Knowledge and Data Engineering Exchange, page 19, Washington, DC, USA, 1999.
[11] O. Nasraoui and M. Pavuluri. Complete this puzzle: A connectionist approach to accurate web recommendations based on a committee of predictors. In Proceedings of the WebKDD-2004 Workshop on Web Mining and Web Usage Analysis, Seattle, WA, 2004.
[12] D. Oikonomopoulou, M. Rigou, and S. Sirmakessis. Real-time navigation recommendations: Integrating n-gram based pattern extraction and site content. In 11th International Conference on Human-Computer Interaction, July 2005.
[13] J. E. Pitkow and P. Pirolli. Mining longest repeating subsequences to predict World Wide Web surfing. In USENIX Symposium on Internet Technologies and Systems, 1999.
[14] S. Ross. Introduction to Probability Models. 7th edition, Harcourt Academic Press, 2000.
[15] R. R. Sarukkai. Link prediction and path analysis using Markov chains. Computer Networks, 33, 2000.
[16] J. Zhu, J. Hong, and J. G. Hughes. Using Markov chains for link prediction in adaptive web sites. In Soft-Ware, pages 60-73, 2002.
[17] J. Zhu, J. Hong, and J. G. Hughes. Using Markov models for web site link prediction. In Proceedings of the Thirteenth ACM Conference on Hypertext and Hypermedia (Hypertext '02), pages 169-170, College Park, MD, 2002.
