Towards User Profiling for Web Recommendation

Guandong Xu1, Yanchun Zhang1, and Xiaofang Zhou2

1 School of Computer Science and Mathematics, Victoria University, PO Box 14428, VIC 8001, Australia
{xu,yzhang}@csm.vu.edu.au
2 School of Information Technology & Electrical Engineering, University of Queensland, Brisbane QLD 4072, Australia
[email protected]

Abstract. Collaborative recommendation is one of the most widely used recommendation techniques; it recommends items to a visitor by referring to the preferences of other users who are similar to the current user. User profiling based on Web transaction data is able to capture such informative knowledge about user tasks or interests. With the discovered usage pattern information, it becomes possible to recommend more preferred content to Web users or to customize the Web presentation for visitors via collaborative recommendation. In addition, it helps to identify the underlying relationships among Web users, items and latent tasks during Web mining. In this paper, we propose a Web recommendation framework based on user profiling. In this approach, we employ Probabilistic Latent Semantic Analysis (PLSA) to model the co-occurrence activities and develop a modified k-means clustering algorithm to build user profiles as the representatives of usage patterns. Moreover, the hidden task model is derived by characterizing the meaningful latent factor space. With the discovered user profiles, we then choose the best-matching profile, whose preferences are most similar to those of the current user, and make collaborative recommendations based on the corresponding page weights in the selected profile. Preliminary experimental results on real world data sets show that the proposed approach is capable of making recommendations accurately and efficiently.

1 Introduction

In recent years, the massive influx of information onto the World Wide Web has enabled users not only to retrieve information but also to discover knowledge. However, Web users usually suffer from information overload because of the rapid growth in the amount of information on the Web. One approach to information overload is the recommendation system, which aims to help users locate the information they need or prefer. Typically, a Web recommendation system focuses on identifying Web users or objects, collecting information about users’ preferences or interests, and adapting its service to satisfy the users’ needs. In short, Web recommendation can be used to provide better quality of service to users while they browse.

S. Zhang and R. Jarvis (Eds.): AI 2005, LNAI 3809, pp. 415 – 424, 2005. © Springer-Verlag Berlin Heidelberg 2005


To date, the problem of recommending appropriate items from a data repository to users has been studied extensively, and two paradigms have emerged: content-based filtering and collaborative filtering. Content-based filtering systems, such as WebWatcher [8], try to recommend items similar to those a given user has visited in the past, whereas collaborative filtering systems identify a user category whose tastes or preferences are close to those of the given user and recommend items that users in that category have rated historically [6]. The former often relies on traditional information filtering and information retrieval methods, while the latter employs user correlation or nearest-neighbor algorithms. In particular, collaborative filtering has been gradually adopted in Web recommendation applications and has achieved considerable success in recent years [5, 9].

Web usage mining, which applies data mining methods such as the k-Nearest Neighbor algorithm (kNN) [5], Web user or page clustering [4, 11, 12], association rule mining [1] and sequential pattern mining [2] to build models from usage data, has recently been used to build Web recommendation systems. With the usage pattern knowledge discovered in the Web usage mining process, a Web recommendation system can generate usage-based user profiles as the representatives of aggregate user behavior for collaborative recommendation. As a result, a variety of research communities have addressed this topic, and Web usage mining is becoming a promising approach for Web recommendation. To reveal the underlying relationships among Web objects, Latent Semantic Analysis (LSA) has been incorporated into the Web usage mining process, and some LSA-based algorithms have been developed for Web recommendation [13, 14].

In this paper, we propose a Web recommendation framework based on user profiling. The usage pattern knowledge, in the form of user profiles derived from Web usage mining, is incorporated into the Web recommendation system to improve the efficiency of recommendation by predicting user-preferred content and customizing the presentation. During the pattern discovery stage, a probabilistic inference method based on the Probabilistic Latent Semantic Analysis (PLSA) model, a variant of LSA, is used to model the underlying relationships among the co-occurrence activities and to identify the latent task model in terms of latent semantic factors. Through Web user session clustering, we create user profiles as the representatives of usage patterns. To make Web recommendations, we match the activity of the current active user against the discovered patterns to find the most like-minded user category and, in turn, determine the pages of potential interest as the recommendation set based on the visit probabilities exhibited by that type of user. We demonstrate the effectiveness of the proposed technique through experiments performed on real world data sets. The evaluation results show that the usage-based approach is more applicable than some traditional techniques.

The rest of the paper is organized as follows. In Section 2, we introduce the Web usage mining process, focusing on how to model Web co-occurrence activities based on PLSA. We present the algorithms for discovering usage-based user profiles and latent factors in Section 3. In Section 4, we propose the Web recommendation framework based on the user profiling approach. In Section 5, we conduct preliminary experiments on two real world data sets and compare the results against traditional work. We conclude and outline future work in Section 6.


2 Usage-Based User Profiling with PLSA

As discussed above, Web recommendation is the ultimate goal of Web usage mining conducted on the data collected at the Web log servers of a specific Web site. The whole procedure usually consists of three steps: data collection and preprocessing, pattern mining, and knowledge application. Figure 1 depicts the whole process.

Fig. 1. The process of Web Mining and Web Recommendation

2.1 Usage Data Representation

Before introducing the user profiling technique, we briefly discuss how the usage data is constructed. In general, a user's access interests are reflected by the varying degrees of visits to different Web pages during one session. Thus, we can represent a user session as a weighted vector of the pages visited by the user during a period. In this paper, we use the following notation to model the co-occurrence activities of Web users and pages:

• $S = \{s_1, s_2, \ldots, s_m\}$: a set of $m$ user sessions.

• $P = \{p_1, p_2, \ldots, p_n\}$: a set of $n$ Web pages.

• For each user, the navigational session is represented as a sequence of visited pages with corresponding weights: $s_i = \{a_{i,1}, a_{i,2}, \ldots, a_{i,n}\}$, where $a_{i,j}$ denotes the weight of page $p_j$ visited in user session $s_i$. The weight is usually determined by the number of hits or the amount of time spent on the specific page. Here, we use both of them to construct usage data from two real world data sets.

• $SP_{m \times n} = \{a_{i,j}\}$: the resulting usage data in the form of a weight matrix of dimensionality $m \times n$.
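As an illustration of this representation, the following Python sketch assembles the matrix $SP_{m \times n}$ from per-session page weights. It is only a sketch under stated assumptions: the function name `build_usage_matrix`, the dictionary-based session format and the toy values are hypothetical and do not come from the paper.

```python
import numpy as np

def build_usage_matrix(sessions, n_pages):
    """Build the m x n weight matrix SP from a list of sessions.

    sessions: list of dicts mapping a page index j to its weight a_ij
              (assumed here to be a hit count or seconds spent).
    """
    sp = np.zeros((len(sessions), n_pages))
    for i, visits in enumerate(sessions):
        for j, weight in visits.items():
            sp[i, j] = weight  # pages not visited keep weight 0
    return sp

# Toy data (hypothetical): three sessions over four pages.
sessions = [{0: 2, 2: 1}, {1: 3}, {0: 1, 1: 1, 3: 4}]
SP = build_usage_matrix(sessions, n_pages=4)
```

Each row of `SP` is one session vector $s_i$, and each column corresponds to one page $p_j$.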


2.2 PLSA Model

The PLSA model is based on a statistical model called the aspect model, which can be used to identify the hidden semantic relationships among general co-occurrence activities. Similarly, in the context of Web usage mining, we can conceptually view the user sessions over the Web page space as co-occurrence activities and use the model to discover the latent usage patterns. For the given aspect model, suppose that there is a latent factor space $Z = \{z_1, z_2, \ldots, z_k\}$ and that each co-occurrence observation $\langle s_i, p_j \rangle$ is associated with each factor $z_k \in Z$ to a varying degree. Based on these assumptions and Bayes' rule, we calculate the probability of an observed pair $\langle s_i, p_j \rangle$ by introducing the latent factor variable $z_k$ as:

$$P(s_i, p_j) = \sum_{z_k \in Z} P(z_k) \, P(s_i \mid z_k) \, P(p_j \mid z_k) \qquad (1)$$

Following the likelihood principle, the total likelihood is determined as

$$L_i = \sum_{s_i \in S,\, p_j \in P} m(s_i, p_j) \log P(s_i, p_j) \qquad (2)$$

where $m(s_i, p_j)$ is the element of the session-page matrix corresponding to session $s_i$ and page $p_j$. To maximize the total likelihood, we use the Expectation Maximization (EM) algorithm to perform maximum likelihood estimation of $P(z_k)$, $P(s_i \mid z_k)$ and $P(p_j \mid z_k)$ in the latent variable model [3]. The E-step and M-step are repeated until $L_i$ converges to a local optimum, at which point the estimates are taken as the final probabilities of the observed data. The computational complexity of this algorithm is $O(mnk)$, where $m$ is the number of user sessions, $n$ is the number of pages, and $k$ is the number of factors.
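The EM procedure can be sketched in Python as follows. This is a minimal illustration of the standard PLSA updates for the symmetric parameterization of Eq. (1), not the authors' implementation; the function name `plsa`, the random initialization, the fixed iteration count and the small constant `eps` are assumptions made for the example. The dense $(m, n, k)$ array is only practical for small matrices.

```python
import numpy as np

def plsa(M, k, n_iter=50, seed=0):
    """Fit P(z), P(s|z), P(p|z) to a session-page matrix M by EM."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    eps = 1e-12  # guards against division by zero (assumption)
    Pz = np.full(k, 1.0 / k)
    Ps_z = rng.random((m, k))
    Ps_z /= Ps_z.sum(axis=0)          # each column is a distribution over sessions
    Pp_z = rng.random((n, k))
    Pp_z /= Pp_z.sum(axis=0)          # each column is a distribution over pages
    log_liks = []
    for _ in range(n_iter):
        # E-step: joint[i, j, z] = P(z) P(s_i|z) P(p_j|z); summing over z gives Eq. (1).
        joint = Pz[None, None, :] * Ps_z[:, None, :] * Pp_z[None, :, :]
        Psp = joint.sum(axis=2)
        log_liks.append(float((M * np.log(Psp + eps)).sum()))  # Eq. (2)
        post = joint / (Psp[:, :, None] + eps)                 # P(z | s_i, p_j)
        # M-step: re-estimate the parameters from the weighted posteriors.
        weighted = M[:, :, None] * post
        Nz = weighted.sum(axis=(0, 1))
        Ps_z = weighted.sum(axis=1) / (Nz + eps)
        Pp_z = weighted.sum(axis=0) / (Nz + eps)
        Pz = Nz / Nz.sum()
    return Pz, Ps_z, Pp_z, log_liks
```

Each iteration touches every (session, page, factor) triple once, which matches the $O(mnk)$ cost stated above. In practice one would stop when the change in the recorded log-likelihood falls below a tolerance rather than after a fixed number of iterations.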

3 Discovery of Latent Factors and Usage-Based User Profiles

As discussed in Section 2, the estimated probabilities quantitatively measure the underlying relationships among Web users, pages and latent factors (i.e. tasks). It is therefore reasonable to identify the latent factors and discover the related usage-based access patterns through probabilistic inference. In this section, we describe how to derive this usage information.

3.1 Characterizing Latent Factor

First, we discuss how to capture the latent factor associated with user navigational behavior. We do so by characterizing the “dominant” pages that contribute significantly to the factor. Note that $P(p_j \mid z_k)$ represents the conditional occurrence probability over the page space for a specific factor, whereas $P(z_k \mid p_j)$ reflects the conditional probability distribution over the factor space for a specific page. Thus, we may choose the pages whose conditional probabilities $P(z_k \mid p_j)$ and $P(p_j \mid z_k)$ are both greater than a predefined threshold to form the “dominant” page set. Exploring the contents of these pages allows us to characterize the semantic meaning of each factor. In Section 5, we present examples of latent factors and their “dominant” pages derived from two real data sets.

3.2 Building Usage-Based User Profiles

Note that the set of $P(z_k \mid s_i)$ conceptually represents the probability distribution over the latent factor space for a specific user session $s_i$. We thus construct the session-factor matrix from the calculated probability estimates to reflect the relationship between Web users and latent factors, with each row expressed as follows:

$$s_i' = (b_{i,1}, b_{i,2}, \ldots, b_{i,k}) \qquad (3)$$

where $b_{i,s}$ is the occurrence probability of session $s_i$ on factor $z_s$. In this way, the distance between two session vectors reflects the similarity of the exhibited navigational behavior. We therefore define their similarity by applying the well-known cosine similarity as:

$$sim(s_i', s_j') = \frac{(s_i', s_j')}{\|s_i'\|_2 \cdot \|s_j'\|_2} \qquad (4)$$

where $(s_i', s_j') = \sum_{m=1}^{k} b_{i,m} b_{j,m}$, $\|s_i'\|_2 = \sqrt{\sum_{m=1}^{k} b_{i,m}^2}$ and $\|s_j'\|_2 = \sqrt{\sum_{m=1}^{k} b_{j,m}^2}$.
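The session-factor vectors of Eq. (3) and the similarity of Eq. (4) can be illustrated with a short Python sketch. It assumes that $P(z_k \mid s_i)$ is obtained from the fitted $P(z_k)$ and $P(s_i \mid z_k)$ by Bayes' rule, which the text implies but does not spell out; the function names `session_factor_matrix` and `cosine_sim` are hypothetical.

```python
import numpy as np

def session_factor_matrix(Pz, Ps_z):
    """Return B with B[i, k] = P(z_k | s_i); row i is the vector s_i' of Eq. (3).

    Pz:   shape (k,),   P(z_k)
    Ps_z: shape (m, k), P(s_i | z_k)
    """
    joint = Ps_z * Pz                                # P(s_i | z_k) P(z_k)
    return joint / joint.sum(axis=1, keepdims=True)  # normalize over factors

def cosine_sim(u, v):
    """Cosine similarity of Eq. (4); returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0
```

Because every row of the resulting matrix is a probability distribution over factors, the cosine similarity between two rows lies between 0 and 1.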

With the similarity measure (4), we propose a modified k-means clustering algorithm [13] to partition user sessions into corresponding clusters. As each user session is represented as a weighted page vector, it is reasonable to take the centroid of each cluster as the usage pattern in the form of a user profile. In this work, we compute the mean vector to represent the centroid. The algorithm for clustering user sessions and constructing user profiles is as follows:

Algorithm 1. Building User Profiles
Input: the set of conditional probabilities $P(z_k \mid s_i)$
Output: a set of user session clusters $SCL = \{SCL_1, SCL_2, \ldots, SCL_p\}$ and a set of user profiles $PF = \{PF_1, PF_2, \ldots, PF_p\}$

1. For all user sessions, employ the modified k-means clustering algorithm and output a set of usage-based session clusters $SCL = \{SCL_t\}$.

2. For each user session cluster, calculate the centroid of the cluster as

$$Cid_t = \frac{1}{|SCL_t|} \sum_{s_i \in SCL_t} s_i' \qquad (5)$$

where $|SCL_t|$ is the number of sessions in the cluster.

3. Treat the centroid of the generated cluster as the aggregate user profile, and sort the normalized weights in descending order to reflect the relative “significance” contributed by the corresponding pages within the selected user profile, i.e.

$$PF_t = \{\langle p_1^t, w_1^t \rangle, \langle p_2^t, w_2^t \rangle, \ldots, \langle p_n^t, w_n^t \rangle\} \qquad (6)$$

where $w_j^t = \frac{1}{|SCL_t|} \sum_{s_i \in SCL_t} a_{i,j}$, $w_1^t > w_2^t > \cdots > w_n^t$, and $p_j^t \in P$.

4. Output $PF = \{PF_t\}$.
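A rough Python sketch of Algorithm 1 follows. The paper does not detail its modified k-means here, so the sketch substitutes a plain k-means loop that assigns sessions by cosine similarity; this stand-in, the function names `cluster_sessions` and `build_profiles`, and the choice to normalize profile weights by their maximum are all assumptions.

```python
import numpy as np

def cluster_sessions(B, n_clusters, n_iter=20, seed=0):
    """Cluster session-factor rows of B by cosine similarity (plain k-means stand-in)."""
    rng = np.random.default_rng(seed)
    unit = B / np.linalg.norm(B, axis=1, keepdims=True)   # unit-length rows
    centroids = unit[rng.choice(len(unit), n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(unit @ centroids.T, axis=1)    # most similar centroid
        for t in range(n_clusters):
            members = unit[labels == t]
            if len(members):
                c = members.mean(axis=0)                  # mean vector, cf. Eq. (5)
                centroids[t] = c / np.linalg.norm(c)
    return np.argmax(unit @ centroids.T, axis=1)

def build_profiles(SP, labels, n_clusters):
    """Return one profile per cluster: (page, weight) pairs sorted by weight, cf. Eq. (6)."""
    profiles = []
    for t in range(n_clusters):
        members = SP[labels == t]
        if len(members) == 0:
            profiles.append([])
            continue
        w = members.mean(axis=0)       # w_j^t = mean of a_ij over the cluster
        if w.max() > 0:
            w = w / w.max()            # normalization choice is an assumption
        order = np.argsort(-w)
        profiles.append([(int(j), float(w[j])) for j in order])
    return profiles
```

Clustering is done in the factor space (the rows $s_i'$), while the profile weights are averaged over the original page weights $a_{i,j}$, mirroring the split between Eq. (5) and Eq. (6).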

4 Using PLSA for Web Personalization

Generally, we recommend Web items to users in a customized or preferred style based on an analysis of the interests exhibited by individuals or groups of users. In this work, we adopt a model-based technique in our Web recommendation framework. We consider the usage-based user profiles generated in Section 3.2 as the aggregated representatives of the common navigational behavior exhibited by all individuals in the same user category. For a new active user session, we use the cosine function to measure the similarity between the session and each discovered user profile. We then choose the closest profile, which shares the highest similarity with the current user session, as the pattern matched to the current user. Finally, we generate the top-N recommendation pages based on the historical visit probabilities of pages by other users in the selected profile. The detailed procedure is as follows:

Algorithm 2. Web Recommendation Based on User Profiling
Input: an active user session and a set of user profiles
Output: the top-N recommendation pages

1. Simplify the active session and the profiles to n-dimensional weight vectors $s_a$, $s_p$ over the page space within a site, instead of the page-weight pair vectors generated by Algorithm 1, i.e. $s_p = [w_1^p, w_2^p, \ldots, w_n^p]$, where $w_i^p$ is the significance weight contributed by page $p_i$ in this profile. Similarly, $s_a = [w_1^a, w_2^a, \ldots, w_n^a]$, where $w_i^a = 1$ if page $p_i$ has already been accessed, and $w_i^a = 0$ otherwise.

2. Measure the similarities between the active session and all derived usage profiles, and choose the profile with the maximum similarity as the best-matching pattern:

$$sim(s_a, s_p^{mat}) = \max_j \left( sim(s_a, s_p^j) \right) = \max_j \left( \frac{s_a \cdot s_p^j}{\|s_a\|_2 \, \|s_p^j\|_2} \right) \qquad (7)$$

3. Incorporate the selected profile $s_p^{mat}$ with the active session $s_a$, then calculate the recommendation score $rs(p_i)$ for each page $p_i$:

$$rs(p_i) = w_i^{mat}, \quad w_i^{mat} \in s_p^{mat} \qquad (8)$$

Thus, each page in the profile is assigned a recommendation score between 0 and 1. Note that the recommendation score is 0 if the page has already been visited in the current session.

4. Sort the recommendation scores calculated in step 3 in descending order, i.e. $rs = (w_1^{mat}, w_2^{mat}, \ldots, w_n^{mat})$, and select the N pages with the highest recommendation scores to construct the top-N recommendation set:

$$REC(N) = \{ p_j^{mat} \mid rs(p_j^{mat}) > rs(p_{j+1}^{mat}),\ j = 1, 2, \ldots, N,\ p_j^{mat} \in P \} \qquad (9)$$
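The four steps of Algorithm 2 can be sketched in Python as below. The sketch assumes the profiles are stored as rows of a dense weight array with values in [0, 1]; the function name `recommend`, the argument layout and the decision to drop pages with a score of 0 are hypothetical choices for illustration.

```python
import numpy as np

def recommend(visited, profiles, top_n):
    """Return (index of best-matching profile, list of top-N page indices).

    visited:  iterable of page indices already accessed in the active session
    profiles: array of shape (p, n); row j holds the weights of profile s_p^j
    """
    idx = np.array(sorted(visited), dtype=int)
    s_a = np.zeros(profiles.shape[1])
    s_a[idx] = 1.0                                   # step 1: binary active vector
    norms = np.linalg.norm(profiles, axis=1) * np.linalg.norm(s_a)
    sims = np.divide(profiles @ s_a, norms,
                     out=np.zeros(len(profiles)), where=norms > 0)
    best = int(np.argmax(sims))                      # step 2: Eq. (7)
    scores = profiles[best].copy()                   # step 3: Eq. (8)
    scores[idx] = 0.0                                # visited pages score 0
    ranked = [int(j) for j in np.argsort(-scores) if scores[j] > 0]
    return best, ranked[:top_n]                      # step 4: Eq. (9)

# Toy example (hypothetical weights): two profiles over four pages.
profiles = np.array([[1.0, 0.8, 0.1, 0.0],
                     [0.0, 0.1, 0.9, 1.0]])
best, recs = recommend({0}, profiles, top_n=2)
```

In this toy case the session that visited page 0 matches the first profile, so the remaining pages of that profile are recommended in order of weight.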

5 Experiments and Evaluations

To evaluate the effectiveness of the proposed method based on the PLSA model and to explore the discovered latent semantic factors, we have conducted preliminary experiments on two real world data sets.

5.1 Data Sets

The first data set is downloaded from the KDDCUP Web site (www.ecn.purdue.edu/KDDCUP/). After data preparation, we set up an evaluation data set comprising 9308 user sessions and 69 pages, where each session contains 11.88 pages on average. We refer to this data set as “KDDCUP data”. In this data set, the number of hits on a Web page by the given user determines the element of the session-page matrix associated with that page in the given session.

The second data set comes from the log files of an academic Web site [10]. The data is based on a 2-week Web log file from April 2002. After the data preprocessing stage, the filtered data contains 13745 sessions and 683 pages. The entries in the usage data correspond to the amount of time (in seconds) spent on pages during a given session. For convenience, we refer to this data set as “CTI data”.

5.2 Latent Factors Based on PLSA Model

We conduct experiments on the two data sets to extract the latent factors by identifying the “dominant” page sets. Table 1 shows one example of the factors derived from the KDDCUP data set together with its “dominant” page set, whose probabilities exceed the predefined threshold, whereas Table 2 presents an example from the CTI data set. From these tables, it can be seen that factor #6 in the KDDCUP data set reflects an online shopping scenario, whereas factor #13 in the CTI data set represents the activity of searching for postgraduate program information.

Table 1. Example of latent factor and its associated pages from KDDCUP

| Factor | Page # | Content | Page # | Content |
|---|---|---|---|---|
| #6 online shopping process | 27 | main/login2 | 50 | account/past_orders |
| | 32 | main/registration | 52 | account/credit_info |
| | 42 | account/your_account | 60 | checkout/thankyou |
| | 44 | checkout/expresCheckout | 64 | account/create_credit |
| | 45 | checout/confirm_order | 65 | main/welcome |
| | 47 | account/address | 66 | account/edit_credit |

Table 2. Example of latent factor and its associated pages from CTI

| Factor | Page # | Content | Page # | Content |
|---|---|---|---|---|
| #13 Postgrad program | 386 | /News | 588 | /Prog/2002/Gradect2002 |
| | 575 | /Programs | 590 | /Prog/2002/Gradis2002 |
| | 586 | /Prog/2002/Gradcs2002 | 591 | /Prog/2002/Gradmis2002 |
| | 587 | /Prog/2002/Gradds2002 | 592 | /Prog/2002/Gradse2002 |
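The selection of “dominant” pages behind Tables 1 and 2 follows the rule of Section 3.1 and could be sketched in Python as below. The sketch assumes $P(z_k \mid p_j)$ is derived from the fitted $P(z_k)$ and $P(p_j \mid z_k)$ by Bayes' rule and that a single threshold is applied to both probabilities; the function name `dominant_pages` and the toy numbers are hypothetical, and the threshold used in the experiments is not stated in the text.

```python
import numpy as np

def dominant_pages(Pz, Pp_z, factor, threshold):
    """Return indices of pages whose P(p|z) and P(z|p) both exceed the threshold.

    Pz:   shape (k,),   P(z_k)
    Pp_z: shape (n, k), P(p_j | z_k)
    """
    joint = Pp_z * Pz                                  # P(p_j | z_k) P(z_k)
    Pz_p = joint / joint.sum(axis=1, keepdims=True)    # P(z_k | p_j)
    mask = (Pp_z[:, factor] > threshold) & (Pz_p[:, factor] > threshold)
    return np.flatnonzero(mask).tolist()

# Toy example (hypothetical probabilities): three pages, two factors.
Pz = np.array([0.5, 0.5])
Pp_z = np.array([[0.6, 0.05],
                 [0.3, 0.05],
                 [0.1, 0.90]])
pages_f0 = dominant_pages(Pz, Pp_z, factor=0, threshold=0.2)
pages_f1 = dominant_pages(Pz, Pp_z, factor=1, threshold=0.2)
```

Inspecting the content of the returned pages is then what gives each factor a label such as “online shopping process”.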

5.3 Evaluation Metric of User Session Clusters and Web Recommendation

To evaluate the quality of the clusters derived from the PLSA-based approach, we adopt a metric named the Weighted Average Visit Percentage (WAVP) [8]. This evaluation method assesses each user profile individually according to the likelihood that a user session which contains any page in the session cluster will include the rest of the pages in the cluster during the same session. Suppose $T$ is a set of sessions within the evaluation set and, for a specific cluster $C$, let $T_c$ denote the subset of $T$ whose elements contain at least one page from $C$. The WAVP is computed as:

$$WAVP = \left( \sum_{t \in T_c} \frac{t \cdot C}{|T_c|} \right) \Big/ \left( \sum_{p \in PF} wt(p, pf) \right)$$
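A possible reading of the WAVP formula is sketched in Python below. It assumes that the two parenthesized sums form a ratio, that sessions are encoded as 0/1 page vectors, and that the cluster $C$ is represented by its profile weight vector; these are interpretive assumptions, and the function name `wavp` is hypothetical.

```python
import numpy as np

def wavp(sessions, profile):
    """Weighted Average Visit Percentage of one profile over a set of sessions.

    sessions: array of shape (m, n) with 0/1 entries (assumed encoding)
    profile:  array of shape (n,), page weights of the profile
    """
    in_profile = profile > 0
    # T_c: sessions that contain at least one page of the profile.
    covered = sessions[(sessions[:, in_profile] > 0).any(axis=1)]
    total_weight = profile.sum()
    if len(covered) == 0 or total_weight == 0:
        return 0.0
    avg_dot = (covered @ profile).sum() / len(covered)   # mean of t . C over T_c
    return float(avg_dot / total_weight)

# Toy example (hypothetical data): three sessions over three pages.
sessions = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
profile = np.array([1.0, 0.5, 0.0])
score = wavp(sessions, profile)
```

Under this reading, a value close to 1 means that sessions touching the profile tend to visit most of its heavily weighted pages.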

On the other hand, we use a metric called hit precision [7] to measure the precision of top-N recommendation. Given a user session in the test set, we extract the first $j$ pages as an active user session and generate a top-N recommendation set via the procedure described in Section 4. Since the recommendation set is in descending order, we then obtain the rank of the $(j+1)$-th page in the sorted recommendation list. Furthermore, for each rank $r > 0$, we count the number of test cases whose $(j+1)$-th page ranks exactly $r$-th as $Nb(r)$. Let $S(r) = \sum_{i=1}^{r} Nb(i)$ and $hitp = S(N) / |T|$, where $|T|$ represents the number of test cases in the whole test set. Thus, $hitp$ stands for the hit precision of Web recommendation.

To compare our approach with other existing methods, we implement a baseline method based on the clustering technique [11]. This method generates usage-based session clusters by performing k-means clustering directly on the usage data. The cluster centroids are then treated as the aggregated access patterns. Figures 2 and 3 show the WAVP and hitp results on the CTI data set for the two methods. The results demonstrate that the proposed PLSA-based technique consistently outperforms the standard clustering-based algorithm in terms of WAVP and hit precision. We therefore conclude that, in this scenario, our approach makes Web recommendations more accurately and effectively than the conventional method. In addition to recommendation, this approach is able to identify the hidden factors that explain why such user sessions or Web pages are grouped together in the same category.

Fig. 2. WAVP comparison for CTI

Fig. 3. Hitp comparison for CTI
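The hit precision metric defined in this section can be illustrated with a small Python sketch. It treats a test case as a hit when the $(j+1)$-th page appears anywhere in the top-N list, which equals $S(N)$ summed over ranks 1 to N. The function name `hit_precision`, the callback `recommend_fn` and the rule of skipping sessions shorter than $j+1$ pages are assumptions.

```python
def hit_precision(test_sessions, recommend_fn, j, top_n):
    """Fraction of test sessions whose (j+1)-th page is in the top-N recommendations.

    test_sessions: list of page-index sequences in visit order
    recommend_fn:  callable (active_pages, top_n) -> ranked list of page indices
    """
    hits = 0
    total = 0
    for session in test_sessions:
        if len(session) <= j:
            continue                       # no (j+1)-th page to predict (assumption)
        active, target = session[:j], session[j]
        recs = recommend_fn(active, top_n)
        total += 1
        if target in recs[:top_n]:
            hits += 1                      # contributes to S(N)
    return hits / total if total else 0.0

# Toy recommender (hypothetical): suggests the next two page indices.
def toy_recommender(active, top_n):
    last = active[-1]
    return [last + 1, last + 2][:top_n]

hitp = hit_precision([[1, 2, 3], [1, 5, 9], [4, 5]], toy_recommender, j=2, top_n=2)
```

Here two sessions are long enough to be tested and one of them is a hit, so `hitp` is 0.5. Plotting this value against N gives a hit precision curve for one recommender.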

6 Conclusion and Future Work

In this paper, we have developed a Web recommendation framework that incorporates a user profiling technique based on the PLSA model. With the proposed probabilistic method, we can measure the co-occurrence activities (i.e. user sessions) in terms of probability estimates to capture the underlying relationships among Web users, pages and latent tasks. Analysis of the estimated probabilities makes it possible to build usage-based user profiles and to identify the hidden factors associated with the corresponding interests or patterns. The discovered usage patterns, in the form of user profiles, are used to make collaborative recommendations, which in turn improves the precision and effectiveness of Web recommendation. We have demonstrated the effectiveness of our technique through preliminary experiments on real world data sets and comparisons with other existing work. Our future work will focus on the following issues: we intend to identify the primitive task of the active user and incorporate Web page categories to predict the pages a user is likely to visit, and to carry out more experiments to validate the scalability of our approach.


References

1. R. Agarwal, C. Aggarwal and V. Prasad, A Tree Projection Algorithm for Generation of Frequent Itemsets, Journal of Parallel and Distributed Computing, 61 (1999), pp. 350-371.
2. R. Agrawal and R. Srikant, Mining Sequential Patterns, in P. S. Yu and A. S. P. Chen, eds., Proceedings of the International Conference on Data Engineering (ICDE), IEEE Computer Society Press, Taipei, Taiwan, 1995, pp. 3-14.
3. A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal Royal Statist. Soc. B, 39 (1977), pp. 1-38.
4. E. Han, G. Karypis, V. Kumar and B. Mobasher, Hypergraph Based Clustering in High-Dimensional Data Sets: A Summary of Results, IEEE Data Engineering Bulletin, 21 (1998), pp. 15-22.
5. J. Herlocker, J. Konstan, A. Borchers and J. Riedl, An Algorithmic Framework for Performing Collaborative Filtering, Proceedings of the 22nd ACM Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, CA, 1999.
6. J. L. Herlocker, J. A. Konstan, L. G. Terveen and J. T. Riedl, Evaluating collaborative filtering recommender systems, ACM Transactions on Information Systems (TOIS), 22 (2004), pp. 5-53.
7. X. Jin, Y. Zhou and B. Mobasher, A Unified Approach to Personalization Based on Probabilistic Latent Semantic Models of Web Usage and Content, Proceedings of the AAAI 2004 Workshop on Semantic Web Personalization (SWP'04), San Jose, 2004.
8. T. Joachims, D. Freitag and T. Mitchell, Webwatcher: A tour guide for the world wide web, The 15th International Joint Conference on Artificial Intelligence (IJCAI'97), Nagoya, Japan, 1997, pp. 770-777.
9. J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon and J. Riedl, Grouplens: Applying Collaborative Filtering to Usenet News, Communications of the ACM, 40 (1997), pp. 77-87.
10. B. Mobasher, Web Usage Mining and Personalization, in M. P. Singh, ed., Practical Handbook of Internet Computing, CRC Press, 2004.
11. B. Mobasher, H. Dai, M. Nakagawa and T. Luo, Discovery and Evaluation of Aggregate Usage Profiles for Web Personalization, Data Mining and Knowledge Discovery, 6 (2002), pp. 61-82.
12. M. Perkowitz and O. Etzioni, Adaptive Web Sites: Automatically Synthesizing Web Pages, Proceedings of the 15th National Conference on Artificial Intelligence, AAAI, Madison, WI, 1998, pp. 727-732.
13. G. Xu, Y. Zhang and X. Zhou, A Latent Usage Approach for Clustering Web Transaction and Building User Profile, The First International Conference on Advanced Data Mining and Applications (ADMA 2005), Springer, Wuhan, China, 2005, pp. 31-42.
14. G. Xu, Y. Zhang and X. Zhou, Using Probabilistic Semantic Latent Analysis for Web Page Grouping, 15th International Workshop on Research Issues on Data Engineering: Stream Data Mining and Applications (RIDE-SDMA'2005), Tokyo, Japan, 2005.