A Web Recommendation System Based on Maximum Entropy

Xin Jin, Bamshad Mobasher, Yanzan Zhou
Center for Web Intelligence
School of Computer Science, Telecommunication, and Information Systems
DePaul University, Chicago, Illinois, USA

Abstract

We propose a Web recommendation system based on a maximum entropy model. Under the maximum entropy principle, we can combine multiple levels of knowledge about users' navigational behavior in order to automatically generate the most effective recommendations for new users with similar profiles. This knowledge includes page-level statistics about users' historically visited pages and the aggregate usage patterns discovered through Web usage mining. In particular, we use a Web mining framework based on Probabilistic Latent Semantic Analysis to discover the underlying interests of Web users as well as temporal changes in these interests. Our experiments show that our recommendation system can achieve better accuracy when compared to standard approaches, while providing a better interpretation of Web users' diverse navigational behavior.
1 Introduction
Web recommendation systems have been shown to greatly help Web users navigate the Web, locate relevant and useful information, and receive dynamic recommendations from Web sites on possible products or services that match their interests. Web usage mining is one of the main approaches used to build Web recommendation systems. Web usage mining involves the application of data mining and machine learning techniques to discover usage patterns (or build user models) through the analysis of Web users' historical navigational activities. Given an active user's profile or activity within a site, that profile can be matched with the discovered usage patterns to find "like-minded" users (sometimes referred to as nearest neighbors). The profiles of these nearest neighbors can then be used to recommend items of interest to the current user.

The discovery of usage patterns is the central problem in Web usage mining. By applying data mining or machine learning techniques to Web users' historical navigational data, we are able to discover various usage patterns. Generally, most usage patterns capture shallow associations among pages, or among users based on the pages they visit. We call such patterns page-level patterns. For example, we may find a pattern like "Web users who have visited page A will visit page B next with a probability of 80%", or "pages A, C and D are most likely to be visited together in a user's single visit (user session)". These patterns can capture common characteristics in users' navigational behavior. However, they do not provide explicit insight into the users' underlying interests and preferences. All we know is that such patterns exist, but we cannot tell why, nor can we determine what the users' interests really are. This shortcoming limits the ability of recommender systems to assist users in their decision making.

To address these problems, some researchers have proposed approaches to incorporate
other knowledge sources from a Web site into the Web usage mining process, including domain knowledge, content information, and linkage structure [1, 2, 3]. Another approach is to directly model Web users' underlying interests and preferences. In [4], we proposed a Web usage mining framework based on Probabilistic Latent Semantic Analysis (PLSA). We used a set of hidden variables to represent users' common tasks when navigating the Web site, and applied the PLSA model to quantitatively characterize users' tasks, as well as the relationships between Web pages and these tasks, and to discover task-level usage patterns. The derived task-level patterns not only enable us to better understand users' interests and preferences, but also make it possible to model the temporal changes of users' interests. These may, in turn, help boost the accuracy of recommendations.

Page-level and task-level usage patterns provide different aspects of knowledge for modeling Web users' behavior. Page-level patterns capture Web users' common navigational paths, while task-level patterns capture the users' underlying interests, preferences, or tasks, thus providing a better understanding of users' behavior. In this paper, we present a recommendation system which combines page-level and task-level patterns using the maximum entropy principle. Our experiments show that our recommendation system can achieve better accuracy and interpretability when compared to recommendations based on individual usage patterns.

The maximum entropy model is a powerful statistical technique with solid theoretical foundations. It has been widely used in Natural Language Processing, Information Retrieval, Text Mining and other areas [5, 6, 7, 8, 9, 10, 11]. Under the maximum entropy (ME) principle, different sources of knowledge can be combined in a natural way. The basic idea of ME is simple. Suppose we have information (overlapping or non-overlapping, related or unrelated) from different sources. To create a unified model based on these information sources, one normally builds a model to represent each source and then combines these models (e.g., using a linear combination or other methods) as a post-processing step. Under the ME principle, on the other hand, we build a single, combined model to represent all the information sources we have. The combined model satisfies all the constraints imposed by each source, and meanwhile is as uniform as possible (and thus has maximum entropy). In other words, the maximum entropy approach prefers the most uniform model that, at the same time, satisfies all the given constraints.

This paper is organized as follows. In Section 2, we present our recommendation method based on the maximum entropy model. In Section 3, we present our method of discovering page-level and task-level usage patterns and of generating features from these patterns. We report our experimental results on data from two real Web sites in Section 4. Finally, before concluding the paper, we discuss some possible improvements of our model and related work in Section 5.
2 A Recommendation System Based on a Maximum Entropy Model
In this section, we present our recommendation system based on a maximum entropy model. It consists of two components. The offline component derives constraints from the training data and estimates the model parameters. The online component reads an active user session and runs the recommendation algorithm to generate recommendations (a set of pages) for that user.
Generally, we begin Web usage analysis with the data preparation process. Raw Web log data is cleaned and sessionized to generate user sessions. Each user session is a logical representation of a user's single visit to the Web site (within a certain amount of time). Suppose that, after data preparation, we have a set of user sessions U = {u_1, u_2, ..., u_m} and a set of pages P = {p_1, p_2, ..., p_n}. The Web session data can be conceptually viewed as an m × n session-page matrix UP = [w(u_i, p_j)]_{m×n}, where w(u_i, p_j) represents the weight of page p_j in user session u_i. The weights can be binary, indicating the existence or non-existence of the page in the session, or they may be a function of the occurrences or duration of the page in that session. Each user session can also be represented as a sequence of pages, u_i = ⟨p_i^1, p_i^2, ..., p_i^t⟩, where p_i^j ∈ P. To make predictions or generate recommendations for active users, we aim to compute the conditional probability Pr(p_d|H(u_i)) of a page p_d being visited next given a user u_i's recent navigational history H(u_i), where p_d ∈ P and H(u_i) is a set of pages recently visited by u_i. Since the most recent activities best reflect a user's present interests, we only use the last several pages to represent the user's history.
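As an illustration, the following is a minimal sketch of how the session-page matrix UP might be assembled from sessionized data; the function and argument names are our own and not part of the original framework, and the weights are binary or frequency-based as described above.

import numpy as np

def build_session_page_matrix(sessions, pages, binary=True):
    """Build the m x n session-page matrix UP from sessionized log data.

    `sessions` is a list of page-id sequences (one per user session) and
    `pages` is the list of all page ids; both are assumed to come from an
    earlier data-preparation step.
    """
    page_index = {p: j for j, p in enumerate(pages)}
    UP = np.zeros((len(sessions), len(pages)))
    for i, session in enumerate(sessions):
        for p in session:
            if binary:
                UP[i, page_index[p]] = 1.0   # existence weight
            else:
                UP[i, page_index[p]] += 1.0  # frequency weight
    return UP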
2.1 The Maximum Entropy Model
In the Web recommendation scenario, the goal is to build a model which can measure the probability of a page being visited next given a user's recent navigational history. In our maximum entropy model, the training data is used to set constraints and train the model. A constraint represents certain characteristics of the training data which should be satisfied by the model. To implement constraints, we define one or more feature functions ("features" for short). For example, we define a feature as

f_{p_a,p_b}(H(u_i), p_d) = \begin{cases} 1 & \text{if user } u_i \text{ visited pages } p_a \text{ and } p_b \text{ consecutively} \\ 0 & \text{otherwise} \end{cases}
Such a feature captures certain characteristics (visiting pages p_a and p_b consecutively) of user sessions in the training data. For example, from a user session u_i = ⟨p_2, p_3, p_5⟩, we can derive a pair ⟨H(u_i), p_d⟩, where H(u_i) = {p_2, p_3} and p_d = p_5. Let us say we have defined a feature f_{p_3,p_5} as above. This feature will be evaluated to 1 for session u_i, while another feature f_{p_5,p_3} will be evaluated to 0 for this session. We will discuss our feature selection methods in the next section. Based on each feature f_s, we represent a constraint as:

\sum_{u_i} \sum_{p_d \in P} \Pr(p_d \mid H(u_i)) \cdot f_s(H(u_i), p_d) = \sum_{u_i} f_s(H(u_i), D(H(u_i)))   (1)
where D(H(u_i)) denotes the page following u_i's history H(u_i) in the training data. Each constraint requires that the expected value of the feature with respect to the model distribution equal its observed value in the training data. In Equation 1, the left-hand side represents the expectation of the feature f_s(H(u_i), p_d) with respect to the distribution Pr(p_d|H(u_i)), and the right-hand side stands for the observed (empirical) value of this feature in the training data, both up to the normalization factor.
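To make the constraint concrete, the sketch below shows one way to evaluate such a page-pair feature and to accumulate its empirical count (the right-hand side of Equation 1) over training pairs ⟨H(u_i), D(H(u_i))⟩. The names and data layout are our own illustrative assumptions.

def page_pair_feature(page_a, page_b):
    """Return f_{a,b}(H, p_d): 1 if page_a ends the history and p_d == page_b."""
    def f(history, next_page):
        return 1.0 if history and history[-1] == page_a and next_page == page_b else 0.0
    return f

def empirical_count(feature, training_pairs):
    """Right-hand side of Equation 1: sum of the feature over observed
    (history, next-page) pairs extracted from the training sessions."""
    return sum(feature(history, next_page) for history, next_page in training_pairs)

# Example: a training pair derived from the session <p2, p3, p5> with history length 2.
training_pairs = [(["p2", "p3"], "p5")]
f_p3_p5 = page_pair_feature("p3", "p5")
print(empirical_count(f_p3_p5, training_pairs))  # -> 1.0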
After we have defined a set of features F = {f_1, f_2, ..., f_t}, we generate a constraint of the form in Equation 1 for each feature. Under all these constraints, it is guaranteed that a unique distribution with maximum entropy exists [12]. This distribution has the form:
\Pr(p_d \mid H(u_i)) = \frac{\exp\left(\sum_s \lambda_s f_s(H(u_i), p_d)\right)}{Z(H(u_i))}   (2)
where Z(H(u_i)) = \sum_{p_d \in P} \exp\left(\sum_s \lambda_s f_s(H(u_i), p_d)\right) is the normalization constant ensuring that the distribution sums to 1, and the λ_s are the parameters we need to estimate.
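A minimal sketch of Equation 2 follows: given estimated weights λ and a list of feature functions, it computes the conditional distribution over candidate next pages via a softmax over feature sums. The function name and the representation of features are our own illustrative choices.

import math

def conditional_probs(history, candidate_pages, features, lambdas):
    """Equation 2: Pr(p_d | H(u_i)) for every candidate page p_d.

    `features` is a list of functions f(history, next_page) -> 0/1 and
    `lambdas` is the list of corresponding weights.
    """
    scores = {}
    for page in candidate_pages:
        s = sum(lam * f(history, page) for lam, f in zip(lambdas, features))
        scores[page] = math.exp(s)
    z = sum(scores.values())  # normalization constant Z(H(u_i))
    return {page: score / z for page, score in scores.items()}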
2.2 Parameter Estimation
The maximum entropy principle requires that the expectation of each feature with respect to the model distribution be consistent with the observed value of that feature in the training data. Under these constraints, a unique distribution as given in Equation 2 is guaranteed to exist. Our goal here is to find the parameter λ_s associated with each feature. Several algorithms can be applied to estimate the λ_s, including Generalized Iterative Scaling (GIS) [13], Improved Iterative Scaling (IIS) [14], and Sequential Conditional Generalized Iterative Scaling (SCGIS) [15]. Of these algorithms, SCGIS has been shown to be more efficient, especially when the number of outcomes is not very large. In our experiments, the outcomes are the items (pages or products) to predict, with the total number on the order of several hundred, so we adopt this algorithm. We also apply the smoothing technique suggested by Chen and Rosenfeld [16], which can improve the model by using prior estimates for the model parameters.
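For intuition only, the sketch below shows one generic GIS-style update (not the SCGIS algorithm of [15] used here, and without the Gaussian-prior smoothing of [16]). It reuses the conditional_probs helper sketched above and assumes that feature counts per (history, page) pair sum to at most a constant C; in practice a slack feature is usually added so the sum is exactly C.

import math

def gis_update(lambdas, features, training_pairs, candidate_pages, C):
    """One Generalized Iterative Scaling step: lambda_s += (1/C) * log(E_emp / E_model)."""
    emp = [0.0] * len(lambdas)
    model = [0.0] * len(lambdas)
    for history, next_page in training_pairs:
        # Empirical expectations (right-hand side of Equation 1).
        for s, f in enumerate(features):
            emp[s] += f(history, next_page)
        # Model expectations under the current lambdas (left-hand side of Equation 1).
        probs = conditional_probs(history, candidate_pages, features, lambdas)
        for page, pr in probs.items():
            for s, f in enumerate(features):
                model[s] += pr * f(history, page)
    # Small epsilon guards against log(0) for features never activated.
    return [lam + (1.0 / C) * math.log((e + 1e-10) / (m + 1e-10))
            for lam, e, m in zip(lambdas, emp, model)]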
2.3 Recommendations
After we have estimated the parameters associated with each feature, we can use Equation 2 to compute the probability of any page p_i being visited next given a user's history, and then pick the pages with the highest probabilities as recommendations. Given an active user u_a, we compute the conditional probability Pr(p_i|u_a) of each Web page, sort all pages by these probabilities, and pick the top N pages as the recommendation set. The algorithm is as follows (a minimal code sketch of the procedure appears after the steps).

Input: an active user session u_a, and the parameters λ estimated from the training data.
Output: a list of N pages sorted by the probability of being visited next given u_a.

1. Identify the last page of the active user session as p_last. Consider the last L pages of the active user session as a sliding window (these pages form the user's history), and identify the dominant task using Equation 6. (L is the size of the sliding window and also the length of the user history.)

2. For each page p_i not appearing in the active session, assume it is the next page to be visited, and evaluate all the page-level and task-level features based on their definitions (in Section 3).

3. Use Equation 2 to compute Pr(p_i|u_a).
4. Sort all pages in descending order of Pr(p_i|u_a) and pick the top N pages as the recommendation set.
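A hedged sketch of this recommendation step, building on the conditional_probs helper sketched earlier; the dominant-task identification of step 1 is omitted here and covered at the end of Section 3, and the default values of L and N are illustrative.

def recommend(active_session, all_pages, features, lambdas, L=4, N=10):
    """Return the top-N pages most likely to be visited next (steps 1-4 above)."""
    history = active_session[-L:]  # sliding window over the active session
    candidates = [p for p in all_pages if p not in active_session]
    probs = conditional_probs(history, candidates, features, lambdas)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [page for page, _ in ranked[:N]]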
3 Feature Selection
Feature selection is a central problem in the maximum entropy model. Here, we take into account two kinds of usage patterns: page-level and task-level patterns. We define features based on these patterns and incorporate them into our recommendation system.
3.1 Features Based on Page-level Usage Patterns
Page-level patterns capture Web users' common navigation paths with statistical significance. As we stated before, there are many different data mining and machine learning techniques that can be applied to the users' navigational data to generate page-level usage patterns. Here we adopt a first-order Markov model to generate page transitions. For each page transition p_a → p_b where Pr(p_b|p_a) ≠ 0, we define a feature function as:

f_{p_a,p_b}(H(u_i), p_d) = \begin{cases} 1 & \text{if page } p_a \text{ is the last page in } H(u_i) \text{ and } p_b = p_d \\ 0 & \text{otherwise} \end{cases}
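For illustration, the sketch below shows one way the first-order transition probabilities Pr(p_b|p_a) could be estimated from the training sessions and how the corresponding page-level features could be generated; it reuses the page_pair_feature helper from Section 2.1, and all names are our own assumptions.

from collections import defaultdict

def estimate_transitions(sessions):
    """Maximum likelihood estimates of first-order transition probabilities Pr(p_b | p_a)."""
    counts = defaultdict(lambda: defaultdict(float))
    for session in sessions:
        for p_a, p_b in zip(session, session[1:]):
            counts[p_a][p_b] += 1.0
    return {p_a: {p_b: c / sum(nexts.values()) for p_b, c in nexts.items()}
            for p_a, nexts in counts.items()}

def page_level_features(transitions):
    """One feature per transition p_a -> p_b with non-zero estimated probability."""
    return [page_pair_feature(p_a, p_b)
            for p_a, nexts in transitions.items() for p_b in nexts]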
3.2 Features Based on Task-level Patterns
3.2.1 Discovery of Task-level Usage Patterns

Page-level usage patterns can be directly used for generating recommendations for new users. However, they provide no insight into users' underlying interests, thus limiting the accuracy of recommendations. To address this problem, some have proposed directly modeling Web users' hidden interests and preferences. In [4], we used the PLSA model to directly model Web users' common navigational tasks performed in a Web site. We employ this approach in the current paper to identify task-level usage patterns.

Let us assume there exists a set of hidden (unobserved) variables Z = {z_1, z_2, ..., z_l} corresponding to users' common interests (tasks). Each usage observation, which corresponds to an access by a user to a Web resource in a particular session and is represented as an entry of the m × n co-occurrence matrix UP, can be modeled as

\Pr(u_i, p_j) = \Pr(u_i) \cdot \Pr(p_j \mid u_i),   (3)

where

\Pr(p_j \mid u_i) = \sum_{k=1}^{l} \Pr(z_k) \cdot \Pr(u_i \mid z_k) \cdot \Pr(p_j \mid z_k),   (4)
summing over all possible choices of z_k from which the observation could have been generated. Now, in order to explain a set of observations (U, P), we need to estimate the parameters Pr(z_k), Pr(u_i|z_k), Pr(p_j|z_k) while maximizing the following log-likelihood L(U, P) of the observations:

L(U, P) = \sum_{i=1}^{m} \sum_{j=1}^{n} w(u_i, p_j) \log \Pr(u_i, p_j).   (5)
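As a concrete illustration of Equations 3-5, the sketch below computes the log-likelihood of the observed session-page matrix given already-estimated parameter arrays; the array shapes, names, and the simple estimate of Pr(u_i) are our own assumptions, and the estimation of the parameters themselves is described next.

import numpy as np

def log_likelihood(UP, pr_z, pr_u_given_z, pr_p_given_z):
    """L(U, P) from Equation 5, under the factorization of Equations 3-4.

    UP:            m x n session-page weight matrix
    pr_z:          length-l array of Pr(z_k)
    pr_u_given_z:  m x l array of Pr(u_i | z_k)
    pr_p_given_z:  n x l array of Pr(p_j | z_k)
    """
    # Pr(p_j | u_i) = sum_k Pr(z_k) Pr(u_i | z_k) Pr(p_j | z_k)   (Equation 4)
    pr_p_given_u = (pr_u_given_z * pr_z) @ pr_p_given_z.T          # m x n
    pr_u = UP.sum(axis=1, keepdims=True) / UP.sum()                # a simple estimate of Pr(u_i)
    pr_up = pr_u * pr_p_given_u                                    # Equation 3
    mask = UP > 0
    return float(np.sum(UP[mask] * np.log(pr_up[mask] + 1e-12)))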
We use the Expectation-Maximization (EM) algorithm [17] to perform maximum likelihood estimation of the parameters Pr(z_k), Pr(u_i|z_k), and Pr(p_j|z_k). These probabilities quantitatively measure the relationships among users, Web pages, and common interests (tasks). Here, Pr(p_j|z_k) represents the probability of page p_j being visited given that a certain task z_k is pursued. Recall that each user session can also be represented as a page sequence, u_i = ⟨p_i^1, p_i^2, ..., p_i^t⟩, p_i^j ∈ P. We can use the following algorithm to identify each user session's underlying tasks and the temporal changes in these interests (a code sketch appears at the end of this section).

Input: a user session u_i, the probabilities Pr(p_j|z_k), and a sliding window of size L (equal to the length of a user's history H(u_i)).
Output: the user session represented as a task sequence.

1. Take the first L pages of u_i and put them into a sliding window W. Compute the probability of each task given all the pages in the window W as

\Pr(z \mid W) = \frac{\Pr(W \mid z)\Pr(z)}{\Pr(W)} \propto \Pr(z) \prod_{p \in W} \Pr(p \mid z)   (6)
2. Pick the task with the highest probability as the current task and output it.

3. Move the sliding window one page to the right (remove the first page from the window and add the next page of u_i), recompute the probability of each task given the current window, and again pick the top task. Repeat this process until the sliding window reaches the end of the session.

After running this algorithm, each user session can be represented as a sequence of tasks, u_i = ⟨z_x, ..., z_y⟩, where z ∈ Z. This representation gives us a direct understanding of users' interests. We can now apply data mining techniques to generate task-level patterns from these task sequences. For example, we can use a simple first-order Markov model to compute the probabilities of task transitions, Pr(z_b|z_a).

3.2.2 Using the Discovered Task-Level Patterns as Features

Compared to page-level patterns, task-level patterns can capture Web users' underlying interests and their temporal changes, providing a better understanding of users' diverse behaviors. Given a task transition z_a → z_b, we define a feature f_{z_a,z_b}(H(u_i), p_d) as:

f_{z_a,z_b}(H(u_i), p_d) = \begin{cases} 1 & \text{if the dominant task of } H(u_i) \text{ is } z_a \text{, and after moving the sliding window right to include } p_d \text{, } z_b \text{ becomes the dominant task} \\ 0 & \text{otherwise} \end{cases}
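The sketch below illustrates the dominant-task identification of Equation 6 and how the resulting task sequence could feed the task-level feature defined above. The dictionary-based representation of Pr(z) and Pr(p|z), the small probability floor for unseen pages, and all names are our own illustrative assumptions; histories are assumed to be lists of page ids.

def dominant_task(window, pr_z, pr_p_given_z):
    """Equation 6: pick the task z maximizing Pr(z) * prod_{p in W} Pr(p | z)."""
    best_task, best_score = None, -1.0
    for z, prior in pr_z.items():
        score = prior
        for p in window:
            score *= pr_p_given_z[z].get(p, 1e-6)  # small floor for unseen pages
        if score > best_score:
            best_task, best_score = z, score
    return best_task

def task_sequence(session, pr_z, pr_p_given_z, L):
    """Convert a page sequence into a task sequence with a length-L sliding window."""
    return [dominant_task(session[i:i + L], pr_z, pr_p_given_z)
            for i in range(len(session) - L + 1)]

def task_transition_feature(z_a, z_b, pr_z, pr_p_given_z):
    """f_{z_a,z_b}(H(u_i), p_d) as defined above."""
    def f(history, next_page):
        before = dominant_task(history, pr_z, pr_p_given_z)
        after = dominant_task(history[1:] + [next_page], pr_z, pr_p_given_z)
        return 1.0 if before == z_a and after == z_b else 0.0
    return f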
4 Experimental Results
We use data sets from two real Web sites to empirically measure the accuracy of our recommendation system. Our accuracy metric, Hit Ratio, is used in the context of the top-N recommendation framework: for each user session in the test set, we take the first K pages as a representation of an active session and generate a set of top-N recommendations. We then compare the recommendations with page K + 1 in the test session, with a match being considered a hit. We define the Hit Ratio as the total number of hits divided by the total number of user sessions in the test set. Note that the Hit Ratio increases as the value of N (the number of recommendations) increases. Thus, in our experiments, we pay special attention to smaller recommendation set sizes (between 1 and 20) that still result in good hit ratios.

In our experiments, we use Web server log data from two Web sites. The first data set is based on the server log data from the host Computer Science department. This Web site provides various functionalities to different types of Web users. For example, prospective students can obtain program and admissions information or submit online applications. Current students can browse course information, register for courses, make appointments with faculty advisors, and log into the Intranet to do degree audits. Faculty can perform student advising functions online or interact with the faculty Intranet. After data preprocessing, we identified 21,299 user sessions (U) and 692 pageviews (P), with each user session consisting of at least 6 pageviews. We set the number of tasks to 30 and the length of the user history to 4. This data set is referred to as the "CTI data."

The second data set is from the server logs of a local affiliate (Next Generation Realty) of a national real estate company. The primary function of the Web site is to allow prospective buyers to visit various pages and view information related to some 300 residential properties. During data preprocessing, the data was filtered to limit the final data set to those users who had visited at least 3 properties. In our final data matrix, each row represented a user vector with properties as dimensions and visit frequencies as the corresponding dimension values. In this experiment, the number of tasks is set to 10 and the length of the user history is set to 2. We refer to this data set as the "NG Realty data."

Each data set was randomly divided into multiple training and test sets for use with 10-fold cross-validation. The training data is used to build the model, and we evaluate the model on the test data. We performed feature selection as discussed in Section 3 and built our recommendation system according to the algorithm of Section 2. For comparison, we built another recommendation system based on the standard first-order Markov transition model to predict and recommend which page to visit next. The Markov-based system models each page as a state in a state diagram, with each state transition labeled by the conditional probability estimated from the actual navigational data in the server logs. It generates recommendations based on the state transition probabilities of the last page in the active user session.

Figure 1 depicts the comparison of the Hit Ratio measures for the two recommender systems on each of the two data sets. The experiments show that the maximum entropy recommendation system has a clear overall advantage in terms of accuracy over the first-order Markov recommendation system on the CTI data. As for the Realty data, our model also performs better when the recommendation set size is greater than 5, but the overall advantage is not as apparent as on the CTI data. One explanation for the difference in the results is that the CTI Web site provides many distinct functionalities for users.
Users can collect admission information, submit online applications, perform degree audits, etc. The maximum entropy model, in this case, captures not only users' immediate navigational interest reflected in the next visited page, but also the functional tasks in which the users are engaged at the given point in time. Thus, when a user has just begun a certain navigational task, our system tends to recommend more pages highly related to that task. Once the user has visited an adequate number of pages associated with one task, the system is likely to predict the next task of interest to the user and recommend pages related to that task.

Figure 1: Comparison of Recommendation Accuracy: Maximum Entropy Model vs. Markov Model.
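For completeness, a minimal sketch of the Hit Ratio evaluation described at the start of this section; the recommend() helper is the one sketched in Section 2.3, and all names and default parameter values are illustrative.

def hit_ratio(test_sessions, all_pages, features, lambdas, K=4, L=4, N=10):
    """Fraction of test sessions whose (K+1)-th page appears in the top-N recommendations."""
    hits = 0
    for session in test_sessions:
        if len(session) <= K:
            continue
        active, target = session[:K], session[K]
        recommendations = recommend(active, all_pages, features, lambdas, L=L, N=N)
        if target in recommendations:
            hits += 1
    return hits / len(test_sessions)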
5 Discussion and Related Work
In this paper, our intention has been to show the benefits of combining different levels of knowledge about Web users' navigational activity using the maximum entropy principle. Therefore, here we only use a simple first-order Markov model to generate page-level usage patterns, while the PLSA model has been used to identify and characterize task-level patterns. As we mentioned before, many data mining techniques can be applied to discover more complex page-level usage patterns, potentially improving recommendation or prediction accuracy. As for task-level patterns, models such as Latent Dirichlet Allocation [18] or mixtures of Hidden Markov Models can also be adopted to model users' underlying interests and tasks. In our framework we only use a basic form of the maximum entropy model. Much work has been done to enhance this basic model, including effective methods for feature selection [7, 12, 19, 20] and parameter estimation [13, 14, 15, 21, 22], as well as approaches to create mixtures of maximum entropy models [23, 10]. We believe applying these techniques can further boost recommendation accuracy.

Our work is mainly inspired by the work in [10]. In that paper, a mixture of maximum entropy models is used, where each component is a user cluster and each cluster is modeled using the maximum entropy principle. As features of each maximum entropy model, first-order Markov page transitions (called "bigrams") and other page-level statistics are
used. For model fitting, the EM algorithm is used to estimate the associations between each user and each user cluster (a component of the mixture model), as well as the parameters of each maximum entropy component. Our model is different in that we first use a mixture model (PLSA) to identify users' common tasks and generate task-level usage patterns. Then we use a single maximum entropy model to combine these task-level patterns with the basic page-level features. Therefore, the only parameters we use are those associated with each feature in the single maximum entropy model. As a result, our parameter estimation is computationally much more efficient. However, a more systematic comparison is needed to draw further conclusions.
6 Conclusion
In this paper, we propose a Web recommendation system based on a maximum entropy model. We show that two levels of knowledge about users' historical navigational behavior can be seamlessly combined to generate recommendations for new users. This knowledge includes page-level statistics about users' historical navigation and usage patterns capturing Web users' underlying interests and their temporal changes. Our experiments show that our recommendation system can achieve better recommendation accuracy, while providing a better interpretation of Web users' diverse navigational behavior, when compared to the standard Markov model for page prediction.

References

[1] C. Anderson, P. Domingos, D. Weld, Relational Markov models and their application to adaptive Web navigation, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Alberta, Canada, 2002.

[2] H. Dai, B. Mobasher, Using ontologies to discover domain-level Web usage profiles, in: Proceedings of the 2nd Semantic Web Mining Workshop at ECML/PKDD 2002, Helsinki, Finland, 2002.

[3] R. Ghani, A. Fano, Building recommender systems using a knowledge base of product semantics, in: Proceedings of the Workshop on Recommendation and Personalization in E-Commerce, at the 2nd Int'l Conf. on Adaptive Hypermedia and Adaptive Web Based Systems, Malaga, Spain, 2002.

[4] X. Jin, Y. Zhou, B. Mobasher, Web usage mining based on probabilistic latent semantic analysis, in: Proceedings of the Tenth ACM SIGKDD Conference (KDD-2004), 2004.

[5] R. Rosenfeld, Adaptive statistical language modeling: A maximum entropy approach, PhD dissertation, CMU (1994).

[6] K. Nigam, J. Lafferty, A. McCallum, Using maximum entropy for text classification, in: Proceedings of IJCAI-1999, 1999.

[7] A. Berger, S. Della Pietra, V. Della Pietra, A maximum entropy approach to natural language processing, Computational Linguistics 22 (1).
[8] D. Pavlov, E. Manavoglu, C. L. Giles, Collaborative filtering with maximum entropy, Technical report, NEC Labs (2003).

[9] D. Pavlov, D. Pennock, A maximum entropy approach to collaborative filtering in dynamic, sparse, high-dimensional domains, in: Proceedings of Neural Information Processing Systems (NIPS-2002), 2002.

[10] E. Manavoglu, D. Pavlov, C. L. Giles, Probabilistic user behavior models, in: Proceedings of the Third IEEE Conference on Data Mining (ICDM-2003), 2003.

[11] A. McCallum, D. Freitag, F. Pereira, Maximum entropy Markov models for information extraction and segmentation, in: Proceedings of ICML-2000, 2000.

[12] F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA, 1998.

[13] J. N. Darroch, D. Ratcliff, Generalized iterative scaling for log-linear models, Annals of Mathematical Statistics 43 (1972) 1470–1480.

[14] S. Della Pietra, V. Della Pietra, J. Lafferty, Inducing features of random fields, Technical report, CMU (1995).

[15] J. Goodman, Sequential conditional generalized iterative scaling, in: Proceedings of ACL-2002, 2002.

[16] S. Chen, R. Rosenfeld, A Gaussian prior for smoothing maximum entropy models, Technical report, CMU (1999).

[17] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39 (1977) 1–38.

[18] D. Blei, A. Ng, M. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.

[19] S. Chen, R. Rosenfeld, Efficient sampling and feature selection in whole sentence maximum entropy language models, in: Proceedings of ICASSP-1999, 1999.

[20] A. McCallum, Efficiently inducing features of conditional maximum entropy distributions, in: Proceedings of UAI-2003, 2003.

[21] J. Goodman, Exponential priors for maximum entropy models, in: Proceedings of NAACL-2004, 2004.

[22] R. Malouf, A comparison of algorithms for maximum entropy parameter estimation, in: Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), 2002.

[23] D. Pavlov, A. Popescul, D. Pennock, L. Ungar, Mixtures of conditional maximum entropy models, NEC technical report, NEC Research Institute (2002).