Design of Framework for Recommender System by Incorporating Sequential Information
Pradeep Kumar, Bharat Bhasker, Rajhans Mishra
Indian Institute of Management, Lucknow
Prabandh Nagar, IIM, Lucknow, Uttar Pradesh (INDIA), Pin 226013
[email protected], [email protected], [email protected]
ABSTRACT
Recommender systems generate recommendations for users across a range of products and applications. They are now widely used in e-commerce applications to suggest appropriate products and services to users. Sequential information plays an important role in determining the interests of a user. The proposed system is a collaborative, model-based recommender system that uses the sequential information present in web logs to generate recommendations. The model combines clustering, classification and a recommendation engine. Clustering groups users on the basis of the sequential and content similarity of their web page visit sequences; each cluster represents an interest area or category. Singular value decomposition (SVD) is used for classification and for generating recommendations for new users.
Keywords
Recommender systems, Sequential information, SVD.
1. INTRODUCTION
Over the last decade, as the number of web users engaged in e-commerce has grown, the design and development of recommender systems have attracted researchers from both academia and industry. The purpose of a recommender system is to generate recommendations for users on various items, products and services. Several recommender systems are well known, such as Amazon.com for books, CDs and various other education-related products [1], MovieLens for movies [2], VERSIFI for news [3], and the PHOAKS system for finding relevant information on the web [4].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. iiWAS2011, 5-7 December, 2011, Ho Chi Minh City, Vietnam. Copyright 2011 ACM 978-1-4503-0784-0/11/12...$10.00
Recommender systems have been developed using data mining techniques, heuristics and association patterns among items [5]. Web usage data are sequential: they record the order of page visits and thereby indicate the relative importance of different page visits for each user. Web navigation patterns, genomic data, GIS data and customer buying patterns are a few examples of sequential data. To the best of our knowledge, most existing systems have not exploited the sequential information present in the data. In the current work we design a recommender system with three main steps: clustering, classification and rule generation. To handle the sequential nature of the data we adopt (1) a sequence-based similarity/distance measure during the clustering of users' web page visits, and (2) a weight matrix that takes the sequential information of a user's visits into account during the classification phase.
Based on how recommendations are generated, recommender systems can be broadly classified into two categories: content-based and collaboration-based recommender systems [5, 6]. A content-based system generates recommendations on the basis of the past preferences of the same user, while a collaboration-based system generates predictions using the preferences of similar users. Combining the two approaches yields hybrid systems, which also form a major area of research. Content-based and collaboration-based systems can be further classified into memory-based and model-based systems [5]. Memory-based systems use ad hoc heuristic rules, developed from user-preference data, to generate recommendations. Model-based systems build a model of user preferences using data mining, machine learning and statistical techniques.
The developed model is then used to generate recommendations for new users.
SVD is a widely used technique for dimensionality and noise reduction in bioinformatics. Gene expression data and protein sequences, which are highly sequential in nature, have been analyzed using SVD [7, 8]. Vozalis & Margaritis [9] used SVD with collaborative filtering (CF) techniques to enhance the performance and efficiency of a recommender system. The web log data we work with are also sequential in nature, which motivates the use of SVD in our case.
Figure 1. Proposed Model
The rest of the paper is organized as follows. Section 2 discusses the proposed methodology. Conclusions and future work are presented in Section 3.
2. METHODOLOGY
Generating recommendations from a set of users' web page visits is a challenging problem that has attracted many researchers from both academia and industry. Capturing the sequential information embedded in web page visits makes the problem more interesting from the user's as well as the designer's perspective. Because web users have different interests, they generate a sparse set of web page visits. With the growing number of web users and the increasing popularity of the internet, scalability is also a major challenge.
In this section we present our methodology for designing a recommender system. The proposed methodology consists of three phases, namely clustering, classification and rule generation. As a starting point, we group the users based on their web page visits. The web page visits can be collected from the server, and these data are pre-processed to obtain the set of web pages for each user in each session. For clustering we use a sequence-based similarity measure such as Levenshtein, LCS or the S3M measure [10]. The resulting clusters can be either hard or soft, depending on the nature of the clustering algorithm. The output of the clustering module is used to construct the response matrix. We propose a sequence-based weight matrix in which weights are assigned to the visited pages based on their order of occurrence as well as their frequency of occurrence. The response matrix is decomposed using singular value decomposition (SVD), which is then used to generate recommendations for a new user. The steps of the proposed model are outlined in Algorithm 1. Figure 1 depicts the block diagram of the proposed model.
Algorithm 1
Input: Dataset containing sequences of web log data = |U|
       Sequential visits of new user = n
       Number of top clusters = M
Output: Recommendations for new user = R
Begin
Step 1: Identify different users by their IP addresses.
Step 2: Compile the click-stream data of each user into a single sequence (each sequence represents a user), u = |U|.
Step 3: Apply a clustering algorithm to generate clusters considering sequential similarity.
Step 4: Find the top-M similar clusters for new user n.
Step 5: Construct the response matrix for new user n.
Step 6: Construct the weight vector for new user n considering the location of the various pages.
Step 7: Apply singular value decomposition (SVD) to the response matrix to decompose it.
Step 8: Apply the prediction function of SVD to generate ratings of web pages for the new user.
Step 9: Return the set of recommendations R.
End
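Steps 5, 7 and 8 of Algorithm 1 could be realised along the following lines. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes NumPy, a hypothetical 20×17 response matrix filled with random counts, example weights for five already visited pages, and a least-squares solve over the leading k singular triplets as one possible reading of the "prediction function of SVD". The name predict_weights and its parameters are illustrative only.

```python
import numpy as np

def predict_weights(A, known, k=None):
    """Estimate a weight for every page from a few known page weights.

    A     : (clusters x pages) response matrix.
    known : dict mapping 1-based page number -> known weight of the new user.
    k     : number of latent features; defaults to len(known) (an assumption).
    """
    # Thin SVD: A = U @ diag(s) @ Vt, with s sorted in decreasing order.
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    pages = sorted(known)
    k = len(pages) if k is None else k
    # M[d, f] = S_ff * V_df, the coefficient of latent feature f for page d.
    M = Vt[:k, :].T * s[:k]
    rows = [p - 1 for p in pages]
    r = np.array([known[p] for p in pages], dtype=float)
    # Solve M[rows] @ u = r for the new user's latent vector u (least squares).
    u, *_ = np.linalg.lstsq(M[rows], r, rcond=None)
    # Predicted weight of every page d: sum over f of u_f * S_ff * V_df.
    return M @ u

# Hypothetical data: 20 clusters x 17 page categories, random counts.
rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(20, 17)).astype(float)
known = {1: 0.50, 3: 0.33, 5: 0.33, 7: 0.33, 8: 0.25}
pred = predict_weights(A, known)
next_page = int(np.argmax(pred)) + 1  # 1-based page with the largest weight
```

With as many latent features as known weights, the linear system is square, so the known weights are reproduced and only the unknown entries are genuinely predicted. Choosing a smaller k would instead smooth the known weights as well.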
2.1 Grouping of users
Users are grouped on the basis of similarity measures using a clustering algorithm; various clustering algorithms can be used for this purpose. Similarity measures estimate the similarity between objects, and many such measures exist in the literature. Content-based similarity measures estimate the content similarity among users, while sequence similarity measures estimate the sequence similarity among users. The Jaccard and Dice measures are examples of content-based similarity measures, while Levenshtein distance, longest common subsequence (LCS) and Hamming distance are examples of sequence-based measures. Combining content-based and sequence-based similarity measures yields hybrid similarity measures; S3M [10] is one such measure. In our work we recommend the use of a hybrid similarity measure during clustering, so that both content similarity and sequence similarity are taken into account.
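The measures named above can be sketched in a few lines of Python. The block below is illustrative only: the hybrid function assumes that S3M takes the form p · LLCS(a, b) / max(|a|, |b|) + (1 − p) · Jaccard(a, b), which is our reading of [10] and should be checked against that reference; the function names are hypothetical.

```python
def jaccard(a, b):
    """Content similarity: shared pages over all pages, ignoring order."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def lcs_length(a, b):
    """Length of the longest common subsequence (order-sensitive)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def s3m(a, b, p=0.5):
    """Hybrid similarity (assumed form): p weights sequence, 1 - p content."""
    seq_sim = lcs_length(a, b) / max(len(a), len(b))
    return p * seq_sim + (1 - p) * jaccard(a, b)

# Two example page-visit sequences of length six.
a = [3, 7, 8, 3, 1, 9]
b = [5, 8, 3, 5, 4, 8]
hybrid = s3m(a, b, p=0.5)
```

Under this form, p = 1 reduces the measure to pure sequence similarity and p = 0 to pure Jaccard similarity, so p controls how much the order of visits matters.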
Table 1: Number of clusters and total intra-cluster distance using rough set based clustering considering hard clusters (Mins = 2 and δ = 0.1)

Similarity       Data size 1000                      Data size 2000
measure          Clusters   Intra-cluster distance   Clusters   Intra-cluster distance
Jaccard          290        53.413174                713        94.73115
Dice             290        49.786194                713        89.03695
Levenshtein      249        70.79684                 416        116.15682
S3M(p=0.5)*      228        62.632885                411        100.08952
S3M(p=0.8)*      80         36.116745                266        89.950554
S3M(p=0.9)*      82         30.130714                367        59.985477

Table 2. Frequency of pages present in Nth State

Page        1    5    8    9    10   15   16   17
Frequency   5    2    6    3    7    1    4    5
* "p" represents the value of the parameter used for the S3M measure [10].
We have implemented the clustering part of our proposed model with a rough set based clustering algorithm using similarity upper approximation [11]. Experiments were performed on a PC with an Intel Core 2 Duo processor (1.83 GHz) and 2 GB RAM, using Java as the programming language on the Windows XP platform. We used the "msnbc" web dataset [12] and sequences of length six for the experiment (the approximate average length of sequences in the "msnbc" web dataset is 6). The dataset has seventeen categories: "frontpage", "news", "tech", "local", "opinion", "on-air", "misc", "weather", "health", "living", "business", "sports", "summary", "bbs" (bulletin board service), "travel", "msn-news" and "msnsports". For the purpose of the experiment these categories were converted into the numbers 1 to 17, respectively. "Mins" is the minimum number of data points required in a cluster, and "δ" is the minimum similarity threshold considered for the similarity upper approximation.
The clusters can be validated using the total intra-cluster distance. Table 1 reports the number of clusters generated and the total intra-cluster distance for data sizes 1000 and 2000, considering hard clustering. Clusters with the least total intra-cluster distance are the most compact and may therefore be considered the best-case clustering. The experimental results show that the smallest total intra-cluster distance is obtained with the S3M similarity measure (at p = 0.9 for both data sizes), which considers both the content and the sequence similarity present among the users. This supports our assumption that sequential information plays an important role in designing a recommender system. The generated clusters are used to build the response matrix for new users, as illustrated in section 2.2.
2.2 Construction of response matrix using top M similar clusters
Once the clusters have been generated, they are used to build the response matrix for a new user. This step is illustrated in the current section using the "msnbc" web dataset. Assume that clustering has been performed on navigation sequences of length 6 from the "msnbc" web dataset. Each cluster then contains users who are similar to each other on the basis of their navigation sequences of length 6.
The top M similar clusters are selected to form the response matrix A. Each row of A has seventeen columns (as there are 17 categories in the "msnbc" web dataset). If we assume M = 20 (chosen arbitrarily for the purpose of illustration), then A is of size 20×17 (340 entries). The row vector of A corresponding to the first cluster is a1; the Mth cluster is represented by the vector aM. Pages that are not present as the 6th state of any member of the cluster are represented by 0. Table 2 shows the pages/categories that are present in the 6th state of the members of the first cluster, along with their frequencies. Accordingly, a1 = {5,0,0,0,2,0,0,6,3,7,0,0,0,0,1,4,5}.
The matrices U, S and VT are generated by the SVD of A. U is of size 20×17, S is of size 17×17 and VT is of size 17×17. S is a diagonal matrix: its diagonal elements are the singular values of A and all other elements are zero.
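The construction of one row of the response matrix can be sketched as follows. The block assumes that each cluster is available as a list of the 6th-state pages of its members; the helper name response_row is hypothetical, and the list of 6th states is rebuilt here from the frequencies in Table 2 purely for illustration.

```python
from collections import Counter

N_PAGES = 17  # number of page categories in the "msnbc" dataset

def response_row(last_states, n_pages=N_PAGES):
    """Count how often each page (1..n_pages) occurs as the last state."""
    counts = Counter(last_states)
    return [counts.get(page, 0) for page in range(1, n_pages + 1)]

# Frequencies of the 6th-state pages in the first cluster (from Table 2).
table2 = {1: 5, 5: 2, 8: 6, 9: 3, 10: 7, 15: 1, 16: 4, 17: 5}

# Expand the frequencies into one 6th-state page per cluster member.
last_states = [page for page, freq in table2.items() for _ in range(freq)]

a1 = response_row(last_states)
```

Stacking one such row per selected cluster gives the M×17 response matrix; pages that never occur as a 6th state in a cluster contribute zeros, as described above.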
2.3 Constructing the weight vector of the new user
This step is illustrated in this section. Let the sequential pattern of the new user's first five visits be {3, 8, 7, 5, 1}. The weight of any page/category i visited by the user in the jth position is denoted Wij and is calculated as per equation [1].
\[ W_{ij} = \frac{|V_{ij}|}{|V_i|} \tag{1} \]
where |Vij| = number of times page i is present in the jth position, |Vi| = number of times page i is present in the dataset, and Wij = weight of the ith page in the jth position. The calculation of Wij is explained as follows. Let T1, T2, T3 and T4 be four sequences of length 6:
T1 = 3, 7, 8, 3, 1, 9
T2 = 5, 8, 3, 5, 4, 8
T3 = 7, 1, 8, 13, 2, 6
T4 = 5, 15, 7, 13, 2, 6
The weights of the pages appearing at different positions are obtained from these sequences by counting, as per equation [1].
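The counting behind equation [1] can be written as a short Python sketch. The function name position_weights and the dictionary layout are illustrative choices, not part of the original method description; the data are the four sequences T1 to T4 and the new user's visits {3, 8, 7, 5, 1} from the text.

```python
from collections import Counter

def position_weights(sequences):
    """Return {(page, position): W} with W = |V_ij| / |V_i| (positions 1-based)."""
    total = Counter()   # |V_i|: occurrences of page i anywhere in the dataset
    at_pos = Counter()  # |V_ij|: occurrences of page i at position j
    for seq in sequences:
        for pos, page in enumerate(seq, start=1):
            total[page] += 1
            at_pos[(page, pos)] += 1
    return {key: cnt / total[key[0]] for key, cnt in at_pos.items()}

T = [
    [3, 7, 8, 3, 1, 9],    # T1
    [5, 8, 3, 5, 4, 8],    # T2
    [7, 1, 8, 13, 2, 6],   # T3
    [5, 15, 7, 13, 2, 6],  # T4
]
W = position_weights(T)

# Weight of each page the new user visited, at the position it was visited.
new_user = [3, 8, 7, 5, 1]
weights = {page: round(W.get((page, pos), 0.0), 2)
           for pos, page in enumerate(new_user, start=1)}
# weights == {3: 0.33, 8: 0.25, 7: 0.33, 5: 0.33, 1: 0.5}
```

A page that never occurs at the queried position gets weight 0.0 here; how such cases should be treated is not specified in the text, so this default is an assumption.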
For the new user's visit sequence {3, 8, 7, 5, 1}, the weights of the pages at the visited positions are:
W31 = 1/3 = 0.33, W82 = 1/4 = 0.25, W73 = 1/3 = 0.33, W54 = 1/3 = 0.33, W15 = 1/2 = 0.50.
For the next page visit, the weight Wk6 has to be calculated, where k = {1, 2, 3, …, 17}. A weight vector Y1 of length 17 is formed for the new user. The entries of this vector are the weights of the corresponding pages/categories. Initially Y1 = {0.5, X, 0.33, X, 0.33, X, 0.33, 0.25, X, X, X, X, X, X, X, X, X}. The unknown entries of Y1 (represented by 'X') are calculated as shown in section 2.4.

2.4 Generation of recommendations
This section deals with the calculation of the unknown weights of pages/categories (the entries represented as 'X'). Let the weight of the dth page be Rd, let U, S and V be the decomposed matrices of the response matrix A, let i denote the ith user and let k denote the kth feature.
\[ R_d = \sum_k U_{ik} S_{kk} V_{dk} \tag{2} \]
In this case R1 = 0.50, R3 = 0.33, R5 = 0.33, R7 = 0.33 and R8 = 0.25 are known for the new user. The mathematical formulation is shown in equations (3)-(7).
\[ R_3 = U_3 S_{33} V_{33} + U_8 S_{88} V_{38} + U_7 S_{77} V_{37} + U_5 S_{55} V_{35} + U_1 S_{11} V_{31} \tag{3} \]
\[ R_8 = U_3 S_{33} V_{83} + U_8 S_{88} V_{88} + U_7 S_{77} V_{87} + U_5 S_{55} V_{85} + U_1 S_{11} V_{81} \tag{4} \]
\[ R_7 = U_3 S_{33} V_{73} + U_8 S_{88} V_{78} + U_7 S_{77} V_{77} + U_5 S_{55} V_{75} + U_1 S_{11} V_{71} \tag{5} \]
\[ R_5 = U_3 S_{33} V_{53} + U_8 S_{88} V_{58} + U_7 S_{77} V_{57} + U_5 S_{55} V_{55} + U_1 S_{11} V_{51} \tag{6} \]
\[ R_1 = U_3 S_{33} V_{13} + U_8 S_{88} V_{18} + U_7 S_{77} V_{17} + U_5 S_{55} V_{15} + U_1 S_{11} V_{11} \tag{7} \]
The values of U3, U8, U7, U5 and U1 are obtained by solving equations (3)-(7) simultaneously. The weight of any category d can then be calculated using equation (8).
\[ R_d = U_{i3} S_{33} V_{d3} + U_{i8} S_{88} V_{d8} + U_{i7} S_{77} V_{d7} + U_{i5} S_{55} V_{d5} + U_{i1} S_{11} V_{d1} \tag{8} \]
The page with the highest weight among the newly ranked pages is the next suggested page. Table 3 presents the weights of all 17 pages for the new user's next visit, together with the ranking of the pages based on these weights. The weight of page 14 is 2.53, which is the maximum among all the pages; hence page 14 has ranking 1 and is the most likely next visit of the new user.

Table 3. Weights of different pages for next step

Page   Weight   Ranking
P1      0.50     9
P2     -0.88    17
P3      0.33    11
P4      0.80     7
P5      0.33    12
P6      0.92     6
P7      0.33    13
P8      0.25    14
P9      0.06    15
P10     1.90     3
P11     1.46     4
P12     1.11     5
P13    -0.6     16
P14     2.53     1
P15     0.69     8
P16     2.20     2
P17     0.49    10

3. CONCLUSION AND FUTURE WORK
The current paper presents a model for a recommender system that incorporates the sequential information present in web logs when generating recommendations. The paper has demonstrated the recommendation model with an example and shown the viability of the model. In this paper we have focused on illustrating the different components of the model. The clustering module uses clustering algorithms to generate clusters on the basis of similarity measures capable of capturing sequence and content similarity. Weights for the pages visited by new users are calculated from their occurrences at specific positions and their total occurrences in the dataset. SVD is used to generate recommendations for new users. In future, a full-fledged integrated recommender system can be developed using the proposed methodology and validated against existing systems.

REFERENCES
[1] Linden, G., Smith, B., York, J. 2003. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing, 7(1): 76-80.
[2] Miller, B.N., Albert, I., Lam, S.K., Konstan, J.A., Riedl, J. 2003. MovieLens Unplugged: Experiences with an Occasionally Connected Recommender System. Proc. Int'l Conf. Intelligent User Interfaces.
[3] Billsus, D., Brunk, C.A., Evans, C., Gladish, B. and Pazzani, M. 2002. Adaptive Interfaces for Ubiquitous Web Access. Comm. ACM, 45(5): 34-38.
[4] Terveen, L., Hill, W., Amento, B., McDonald, D. and Creter, J. 1997. PHOAKS: A System for Sharing Recommendations. Comm. ACM, 40(3): 59-62.
[5] Adomavicius, G., Tuzhilin, A. 2005. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6): 734-749.
[6] Su, X. and Khoshgoftaar, T.M. 2009. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, vol. 2009, Article ID 421425, 19 pages. doi:10.1155/2009/421425
[7] Holter, N.S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J.R. and Fedoroff, N.V. 2000. Fundamental patterns underlying gene expression profiles: Simplicity from complexity. PNAS, 97: 8409-8414.
[8] Wall, M.E., Dyck, P.A. and Brettin, T.S. 2001. SVDMAN: singular value decomposition analysis of microarray data. Bioinformatics, 17(6): 566-568.
[9] Vozalis, M. and Margaritis, K. 2007. Using SVD and demographic data for the enhancement of generalized Collaborative Filtering. Information Sciences, 177(15): 3017-3037.
[10] Kumar, P., Radha Krishna, P. and Raju, B.S. 2007. SeqPam: A clustering algorithm for sequential data. International Journal of Data Warehousing and Mining, 3(1): 29-53.
[11] Kumar, P., Krishna, P.R., Bapi, R.S. and De, S.K. 2006. Clustering using Similarity Upper Approximation. IEEE International Conference on Fuzzy Systems.
[12] http://archive.ics.uci.edu/ml/datasets/MSNBC.com+Anonymous+Web+Data