2010 Ninth International Conference on Machine Learning and Applications

Kernel-based Approaches for Collaborative Filtering

Zhonghang Xia∗, Wenke Zhang†, Manghui Tu‡, and I-Ling Yen§

∗Department of Math & Computer Science, Western Kentucky University, Bowling Green, KY 42101, email: [email protected]
†Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75080, email: [email protected]
‡School of Business & Info Systems, Dakota State University, Madison, SD 57042, email: [email protected]
§Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75080, email: [email protected]

Abstract—In a large-scale collaborative filtering system, pairwise similarity between users is usually measured by users' ratings on the whole set of items. However, this measurement may not be well defined due to the sparsity problem, i.e., the lack of adequate ratings on items for calculating accurate predictions. In fact, most correlated users have similar ratings only on a subset of items. In this paper, we consider a kernel-based classification approach for collaborative filtering and propose several kernel matrix construction methods that use biclusters to capture pairwise similarity between users. In order to characterize correlations among users accurately, we embed both local and global information into the similarity matrix. This similarity matrix, however, may not be a kernel matrix. Our solution is to approximate it with a nearby positive semi-definite matrix and to use low-rank constraints to control the complexity of that matrix.

Keywords-collaborative filtering; kernel method; bicluster

I. INTRODUCTION

Collaborative filtering (CF), as a major technique for recommender systems, has been studied extensively since the mid-1990s [2], [9], [16]. CF algorithms predict users' preferences by exploiting similarities among users based on their opinions on items. There are two categories of CF methods: memory-based and model-based [5]. Memory-based CF methods [16] find neighbors for a user (the active user) who needs recommendations and use the neighbors' preferences to estimate the active user's preference. In contrast, model-based CF methods first develop data models from the available data and then predict users' preferences according to those models.

Existing CF algorithms still suffer from the sparsity problem: in reality, most users rate very few items even though a large number of items may be available in the system. Model-based methods have many advantages on sparse datasets if the data models can precisely capture the features of the whole data space. The kernel-based algorithm has gained much attention because of its adaptability to a variety of data types. It builds a model based on pairwise similarities among users, specified by a positive semi-definite matrix called the kernel matrix.

The kernel matrix can be calculated as the inner product of training data over the whole set of attributes [3]. However, the inner product may not be properly defined for a sparse dataset, as some elements of the vectors are missing. Furthermore, it is not realistic to expect two users to have similar opinions on the whole itemset. Recently, the concept of the bicluster has been introduced into the CF area. Since only subsets of users and items are needed to define similarity between users or items, the sparsity problem is alleviated. Existing bicluster methods, however, are mainly memory-based.

In this paper, we tackle the sparsity problem with a kernel-based classification method, in which a classifier is first learned from users/items with given ratings, and users/items with unknown ratings are then mapped into the corresponding classes. Several kernel matrix construction methods are proposed based on biclusters. Different from other bicluster methods [18], which use only local information (users similar to the active user) for prediction, we combine local information with global information (all users). The matrix defined in this way, however, may not be positive semi-definite, and thus not a kernel matrix. Our solution is to approximate it with a nearby positive semi-definite matrix. Note that an approximation matrix with a very complex structure may cause overfitting. It is well known that matrix structure can be controlled by matrix rank, so we formulate the matrix approximation problem as a rank minimization problem. Since rank minimization is NP-hard [10], we provide heuristic solutions.

The rest of the paper is organized as follows. Related work is reviewed in Section II. Section III introduces background on memory-based and model-based CF methods and on biclusters. In Section IV, we consider four different kernel matrix construction methods; two heuristic methods are also studied in that section. Experimental studies are presented in Section V. Section VI concludes the paper.

II. RELATED WORK

Model-based methods include item-item, clustering, and classification approaches, among others. Item-item approaches [17] build models based on the correlations between items. Classification approaches [4] recast CF as a classification problem and assign unknown users (items) to the proper classes. In [3], a kernel-based learning model was employed to build the classifier, in which pairwise similarity was characterized by the kernel matrix and different attributes were integrated through multiple kernel matrices. Clustering approaches [19] simultaneously group similar users and items into clusters and derive user-item associations from those clusters. More detailed introductions to, and performance measurements of, these approaches have been reported in the survey papers [2], [8].

Much research has been devoted to the sparsity problem. A simple solution is default voting [5], which inserts default rating values for unrated items to increase the density of the user-item matrix. Normally, default voting uses a neutral or negative preference value for the unrated items. This method, however, can mislead the classifiers in most cases. Some approaches extend traditional CF to overcome the sparsity problem. Demographic filtering [14] uses personal attributes, such as gender, age, area code, and education, and makes recommendations based on demographic profiles. In [18], a memory-based bicluster method was proposed: form a neighborhood for the active user and construct a prediction function by aggregating users' opinions in the neighborhood. Recent progress on the theory of CF was reported in [1], where a framework was proposed to predict a set of users' possible preferences by using linear operators based on spectral regularization. Some low-rank matrix-completion methods for CF have been shown to be special cases of this framework.

III. PRELIMINARY STUDIES

A typical CF problem is to estimate ratings for the items that have not been evaluated by users. Consider a collaborative filtering system consisting of a set of users $U = \{u_1,\dots,u_N\}$ and a set of items $X = \{x_1,\dots,x_M\}$. Denote by $r(u,x)$ the numeric rating given by user $u \in U$ on item $x \in X$. Let $X_n$ be the set of items that have been rated by user $u_n$. The mean of $u_n$'s ratings on $x \in X_n$ is $\bar r(u_n) = \frac{1}{|X_n|}\sum_{x\in X_n} r(u_n,x)$, where $|X_n|$ is the number of items in $X_n$. The z-score of $r(u,x)$ is $z(u,x) = \frac{r(u,x)-\bar r(u)}{\sigma(u)}$, where $\sigma(u)$ is the standard deviation of $u$'s ratings. For the active user $u_a$, we aim to determine a prediction function $f(u;x)$ that calculates $u_a$'s rating on the active item $x_a \in X \setminus X_a$. The prediction function in a CF algorithm depends on similarities among users. The pairwise similarity between users $u$ and $u'$ can be characterized by a positive definite kernel function [12], denoted by $k(u,u')$.

A. Memory-based Methods

Training is not required in memory-based methods. The prediction function aggregates the weighted ratings given by the users in $u_a$'s neighborhood, which consists of users similar to $u_a$. Usually, a more similar user is assigned a larger weight in the prediction. The weight that user $u_n$ contributes to $u_a$'s prediction indicates how similar $u_a$'s and $u_n$'s opinions are on an item. Generally, the weights can be represented by a kernel function, and $u_a$'s prediction on $x_a$ can be calculated by

$$f(u_a; x_a) = \bar r(u_a) + \frac{\sum_{n=1}^{N} k(u_a,u_n)\,\big(r(u_n,x_a)-\bar r(u_n)\big)}{\sum_{n=1}^{N} k(u_a,u_n)}.$$

A widespread similarity measure is the Pearson correlation coefficient, first introduced to CF in [16]. In this case, $k(u_a,u_n)$ is

$$k(u_a,u_n) = \frac{\sum_{x\in X_a\cap X_n}\big(r(u_a,x)-\bar r(u_a)\big)\big(r(u_n,x)-\bar r(u_n)\big)}{\sqrt{\sum_{x\in X_a\cap X_n}\big(r(u_a,x)-\bar r(u_a)\big)^2\;\sum_{x\in X_a\cap X_n}\big(r(u_n,x)-\bar r(u_n)\big)^2}}.$$
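To make the two formulas above concrete, the following is a minimal sketch of the memory-based prediction, assuming a dense NumPy user-item array with NaN marking unrated entries; the function names are ours, not the paper's:

```python
import numpy as np

def pearson(r, a, n):
    """Pearson correlation k(u_a, u_n) over the co-rated items X_a ∩ X_n."""
    both = ~np.isnan(r[a]) & ~np.isnan(r[n])
    if both.sum() < 2:
        return 0.0
    # deviations from each user's overall mean rating, restricted to co-rated items
    da = r[a, both] - np.nanmean(r[a])
    dn = r[n, both] - np.nanmean(r[n])
    denom = np.sqrt((da ** 2).sum() * (dn ** 2).sum())
    return 0.0 if denom == 0 else float((da * dn).sum() / denom)

def predict(r, a, item):
    """Memory-based prediction of user a's rating on `item`."""
    num = den = 0.0
    for n in range(r.shape[0]):
        if n == a or np.isnan(r[n, item]):
            continue
        k = pearson(r, a, n)
        num += k * (r[n, item] - np.nanmean(r[n]))
        den += k  # the paper sums the raw weights; many implementations use |k|
    return np.nanmean(r[a]) + (num / den if abs(den) > 1e-12 else 0.0)
```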

B. The Model-based Method: Classification

In model-based methods, the prediction function is usually learned by training on users who have evaluated the active items. In the classification method, each active user (item) defines a separate classification problem, and a classifier is built for the active user (item) using the other users (items) as training instances. As an example of building a classifier for users, the training instance corresponding to user $u$ is represented by the vector of $u$'s ratings on items other than the active item $x$. For simplicity, we consider a binary classification problem in which users' preferences are categorized into two classes, "like" and "dislike"; the method extends easily to the multi-class case. In the binary setting, a linear prediction function can be written as $f(u) = \mathrm{sgn}(w^T\phi(u)+b)$ in the feature space $R^p$, where $w \in R^p$ is the vector of feature weights, $\phi(u): U \rightarrow R^p$ is a mapping from the user dataset to the feature space, $b \in R$ is the intercept, and $\mathrm{sgn}(u) = +1$ if $u > 0$ and $-1$ otherwise.

When a kernel function is restricted to a set of training data, its values form a kernel matrix. For the user dataset $U$, we write the kernel matrix as $K = (k(u_i,u_j))$, $i,j = 1,\dots,N$. Interestingly, without knowing the explicit expression of a kernel function, knowledge of the kernel matrix on the training data points is enough to build a learning model. In a kernel-based support vector machine (SVM) model, the prediction function for $x_a$ can be written as

$$f(u; x_a) = \mathrm{sgn}\Big(\sum_{n=1}^{N} \alpha_n k(u_n,u) + b\Big), \qquad (1)$$

where $\alpha_n$, $n = 1,\dots,N$, are the corresponding Lagrange multipliers.
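As an illustration of the point that a kernel matrix alone suffices to build the model, here is a minimal sketch using scikit-learn's precomputed-kernel SVM (an assumption on our part; the paper's experiments use LIBSVM [6], which also accepts precomputed kernels):

```python
import numpy as np
from sklearn.svm import SVC

def train_and_predict(K_train, y, K_test):
    """Train an SVM from a precomputed N x N kernel matrix and label new users.

    K_test has shape (n_test, N): kernel values k(u, u_n) between each test
    user and the N training users, matching the expansion in Eq. (1).
    """
    clf = SVC(kernel="precomputed")
    clf.fit(K_train, y)
    return clf.predict(K_test)  # sgn(sum_n alpha_n k(u_n, u) + b)

# toy usage with an inner-product kernel on z-scored ratings
rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 5))      # 6 users, 5 items, z-scored ratings
y = np.array([1, 1, -1, 1, -1, -1])
K = Z @ Z.T / Z.shape[1]         # Eq. (2)-style kernel (see Section IV)
print(train_and_predict(K, y, K[:2]))
```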

Kernel construction is the key step in kernel-based methods. In [3], the kernel matrix is calculated by

$$K(u,u') = \frac{1}{|X|}\sum_{x\in X} z(u,x)\,z(u',x). \qquad (2)$$

C. Bicluster Methods

The pairwise similarity defined in the aforementioned CF methods is measured over the entire set of items. In fact, finding a rating pattern over the entire itemset is neither necessary nor realistic; in most cases, users are correlated only over a subset of items. The bicluster method finds a set of users who partially share some interests. Different from other CF methods, bicluster methods measure pairwise similarity based only on the users and items in a bicluster. For example, consider the following user-item matrix:

          item1  item2  item3  item4  item5
  user1     ?      2      5      1      2
  user2     1      5      4      2      5
  user3     4      ?      2      3      3
  user4     3      3      1      4      ?
  user5     1      3      1      4      5

As we can see, users 2 and 5 tend to have similar ratings on items 1 and 5; hence rows 2, 5 and columns 1, 5 form a bicluster. Likewise, rows 1, 3 and columns 1, 2, 5 form another bicluster. Let $B_t(U_t, X_t)$ denote the bicluster defined by a set of users $U_t$ and a set of items $X_t$. Algorithms for computing $B_t$ have been studied in many papers [20]. In [20], a move-based algorithm (FLOC) was proposed to capture the coherence of a submatrix within a matrix. The algorithm searches for a bicluster by iteratively changing the membership of a row (or column) to reduce the residue of the submatrix until a threshold is reached.
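The paper does not spell out the residue it minimizes; a common choice in the δ-cluster/biclustering literature [20] is the mean-squared residue. A minimal sketch under that assumption:

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Mean-squared residue H of the submatrix A[rows][:, cols].

    H = avg_ij (a_ij - a_iJ - a_Ij + a_IJ)^2, where a_iJ and a_Ij are the
    row and column means of the submatrix and a_IJ is its overall mean.
    Lower H means a more coherent bicluster.
    """
    S = A[np.ix_(rows, cols)]
    row_means = S.mean(axis=1, keepdims=True)
    col_means = S.mean(axis=0, keepdims=True)
    residue = S - row_means - col_means + S.mean()
    return float((residue ** 2).mean())

# A FLOC-style move toggles the row or column whose addition/removal lowers
# H the most, stopping once H falls below a threshold.
```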

IV. BICLUSTER KERNEL METHODS FOR CF

In this section, we study a kernel-based learning model built on biclusters to overcome the sparsity problem. Since our goal is to recommend the active item to the active user, we are interested only in biclusters that include users and items similar to $u_a$ and $x_a$. To this end, we modify the FLOC algorithm by adding constraints on $u_a$ and $x_a$ in the bicluster calculation. In addition, the users and items in a bicluster are replaced with random values in the next round of the bicluster search, so that the biclusters do not overlap.

Since the kernel matrix defines the pairwise similarity of users, the performance of a kernel-based classifier depends closely on the design of the kernel matrix. We propose several kernel construction methods based on biclusters. Because the main algorithm of the classification approach is unchanged across these kernel construction methods, and the procedures for building classifiers for an active user and an active item are similar, we give only the main algorithm for the user, as follows.

Algorithm 1: Main algorithm
Input: User-item matrix, $u_a$
Output: Labels of items unrated by $u_a$
Initialization: Set SVM model parameters; normalize ratings $r(u,x)$;
for all $x_a$'s do
  Calculate biclusters $B_t$ for $u_a$ and $x_a$;
  Call a sub-algorithm to compute kernel $K$;
  Train the SVM classifier and determine parameters $\alpha$ and $b$;
  Compute $f(u;x)$ and output the label of $x_a$;
end

A. Construct Kernel with Local Information

We first construct a kernel in which only the users in a bicluster are deemed to be correlated. Given a bicluster $B_t(U_t, X_t)$, the similarity between two users $u, u' \in U_t$ other than $u_a$ can be measured by the inner product of their normalized ratings over the items in $X_t$. For the similarity between $u_a$ and the other users, we exclude item $x_a$ from the item list because $u_a$'s rating on $x_a$ is unknown. Hence, the corresponding kernel is defined as

$$K_t(u,u') = \begin{cases} \frac{1}{|X_t|}\sum_{x\in X_t} z(u,x)\,z(u',x), & \text{if } u_a \ne u,\ u' \in U_t,\\[2pt] \frac{1}{|X_t|-1}\sum_{x_a \ne x \in X_t} z(u,x)\,z(u_a,x), & \text{if } u, u' \in U_t,\ u' = u_a,\\[2pt] 0, & \text{otherwise.} \end{cases} \qquad (3)$$

There may be multiple biclusters calculated for $u_a$ and $x_a$. Suppose that $B_t(U_t,X_t)$, $t = 1,\dots,T$, are the biclusters output by the modified FLOC algorithm. We first compute each base kernel $K_t$ by (3) and then construct the kernel $K$ as a linear combination of these base kernels:

$$K = \sum_{t=1}^{T} \lambda_t K_t, \qquad (4)$$

where the $\lambda_t$ are pre-defined weights. Sub-algorithm 1 is given as follows:

Algorithm 2: Sub-algorithm 1
Input: Normalized user-item matrix, $u_a$, $x_a$, $B_t(U_t,X_t)$, $\lambda_t$, $t=1,\dots,T$
Output: Kernel $K$
Compute $K_t$ by (3);
Compute $K$ by (4);
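A minimal sketch of Sub-algorithm 1 under the definitions above, assuming `Z` is the z-scored user-item matrix (users × items), each bicluster contains $x_a$ (as the modified FLOC enforces), and biclusters are given as (user indices, item indices); the helper names are ours:

```python
import numpy as np

def base_kernel(Z, users, items, a, x_a):
    """Base kernel K_t of Eq. (3) for one bicluster (users, items)."""
    N = Z.shape[0]
    K = np.zeros((N, N))
    items = np.asarray(items)
    core = items[items != x_a]  # drop x_a for entries involving u_a
    for i in users:
        for j in users:
            if a in (i, j):
                K[i, j] = Z[i, core] @ Z[j, core] / (len(items) - 1)
            else:
                K[i, j] = Z[i, items] @ Z[j, items] / len(items)
    return K  # entries outside the bicluster stay 0, per Eq. (3)

def combine(Z, biclusters, lambdas, a, x_a):
    """K = sum_t lambda_t K_t, Eq. (4), with pre-defined weights lambdas."""
    return sum(lam * base_kernel(Z, U_t, X_t, a, x_a)
               for lam, (U_t, X_t) in zip(lambdas, biclusters))
```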

B. Integrate Global and Local Information

The kernel defined in Section IV-A reflects only local information in the neighborhood of $u_a$ and $x_a$. A model-based method, however, aims to define a function over the whole data space, and it is reasonable to assume that some correlation exists between $u_a$ (or $x_a$) and the users (or items) outside the neighborhood. Thus, for two users not in a bicluster, we measure the similarity by their ratings on the items they have both evaluated. In summary, given bicluster $B_t(U_t,X_t)$, the similarity between $u$ and $u'$ is defined as follows:

$$S_t(u,u') = \begin{cases} \frac{1}{|X_t|}\sum_{x\in X_t} z(u,x)\,z(u',x), & \text{if } u_a \ne u,\ u' \in U_t,\\[2pt] \frac{1}{|X_u\cap X_{u'}|}\sum_{x\in X_u\cap X_{u'}} z(u,x)\,z(u',x), & \text{otherwise.} \end{cases} \qquad (5)$$

However, $S_t$ defined by (5) may not be usable as a kernel matrix because positive semi-definiteness is not guaranteed. Deviating from a kernel matrix may introduce conflicting similarity measures among the data points and thus result in poor classification accuracy. Furthermore, we have more confidence in the similarities defined by biclusters than in those defined by users not in any bicluster. Our solution is therefore to find the positive semi-definite matrix nearest to $S_t$ while preserving the similarities defined by the biclusters. Given bicluster $B_t(U_t,X_t)$, let $S_{B_t} = (S_t(u,u'))_{u_a\ne u,\,u'\in U_t}$; the problem can then be formulated as

$$\begin{array}{ll}\min & \frac12\,\|K_t - S_t\|_2^2\\ \text{s.t.} & K_{B_t} = S_{B_t},\\ & K_t \succeq 0.\end{array} \qquad (6)$$

The above problem can be efficiently solved by the Newton method studied in [15], which considers the similar problem

$$\begin{array}{ll}\min & \frac12\,\|K_t - S_t\|_2^2\\ \text{s.t.} & K_t(i,i) = 1,\ i = 1,\dots,N,\\ & K_t \succeq 0.\end{array} \qquad (7)$$

Note that the constraint $K_{B_t} = S_{B_t}$ in (6) is more general than the constraint $K_t(i,i)=1$ in (7), but the original algorithm can be extended to solve (6). Sub-algorithm 2 is given as follows:

Algorithm 3: Sub-algorithm 2
Input: Normalized user-item matrix, $u_a$, $x_a$, $B_t(U_t,X_t)$, $\lambda_t$, $t=1,\dots,T$
Output: Kernel $K$
Compute $S_t$ by (5);
Compute $K_t$ by solving (6);
Compute $K$ by (4);
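Problem (6) projects $S_t$ onto the intersection of the positive semi-definite cone and an affine set of fixed entries. As a stand-in for the Newton method of [15], here is a minimal alternating-projections sketch (slower, but illustrates the structure); the bicluster entries are assumed given as a boolean mask:

```python
import numpy as np

def nearest_psd_with_fixed(S, mask, iters=200):
    """Heuristic for problem (6): K ⪰ 0 close to S with K[mask] = S[mask].

    Alternates projection onto the PSD cone (clipping negative eigenvalues)
    with re-imposing the fixed bicluster entries K_Bt = S_Bt.
    """
    K = S.copy()
    for _ in range(iters):
        # project onto the PSD cone
        w, V = np.linalg.eigh((K + K.T) / 2)
        K = (V * np.clip(w, 0, None)) @ V.T
        # re-impose the equality constraints
        K[mask] = S[mask]
    # final PSD projection so the output is a valid kernel matrix
    w, V = np.linalg.eigh((K + K.T) / 2)
    return (V * np.clip(w, 0, None)) @ V.T
```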

C. Low Rank Kernels

In a large-scale CF system, the solution to (6) usually has a complex structure, which may cause overfitting. According to Occam's razor, the simplest model is the best. Rank minimization [10] has been shown to be effective in regularizing model complexity. Hence, the kernel matrix can be obtained by solving the following optimization problem:

$$\begin{array}{ll}\min & \operatorname{Rank}(K_t)\\ \text{s.t.} & K_{B_t} = S_{B_t},\\ & K_t \succeq 0.\end{array} \qquad (8)$$

A well-known heuristic for (8) is to solve the following problem [10]:

$$\begin{array}{ll}\min & \operatorname{Trace}(K_t)\\ \text{s.t.} & K_{B_t} = S_{B_t},\\ & K_t \succeq 0.\end{array} \qquad (9)$$

Although some convex programming software packages [7] can be used to solve problem (9), they are not efficient for this particular problem. Motivated by the approximation method [13] for reconstructing a kernel matrix from a Gram-Schmidt decomposition, we propose a heuristic method for problem (9). A symmetric matrix can be represented as a linear combination of rank-one matrices: let $\nu_n$, $n = 1,\dots,N$, be the $n$th eigenvector of $S_t$ and $\beta_n$ the corresponding eigenvalue, so that $S_t = \sum_{n=1}^{N}\beta_n\nu_n\nu_n^T$. Also, let $V_t^n = \nu_n\nu_n^T$, with $V_t^n(i,j)$ the $(i,j)$th entry of $V_t^n$. The heuristic method for problem (9) then re-estimates the weights $\beta_n$ by solving the following optimization problem:

$$\begin{array}{ll}\min & r^T\mu\\ \text{s.t.} & \sum_{n=1}^{N}\mu_n V_t^n(i,j) = S_t(i,j),\ i\in I_t,\ j\in J_t,\\ & \mu_n \ge 0,\end{array} \qquad (10)$$

where $r = (r_1,\dots,r_N)^T$, $r_n = \operatorname{trace}(\nu_n\nu_n^T)$, and $I_t$, $J_t$ are the index sets of $U_t$ and $X_t$, respectively. Sub-algorithm 3 is given as follows:

Algorithm 4: Sub-algorithm 3
Input: Normalized user-item matrix, $u_a$, $x_a$, $B_t(U_t,X_t)$, $\lambda_t$, $t=1,\dots,T$
Output: Kernel $K$
Compute $S_t$ by (5);
Calculate the eigenvalues $\beta_n$ and eigenvectors $\nu_n$ of $S_t$;
Compute $\mu_n$ by solving (10);
Compute $K_t$ as $\sum_{n=1}^{N}\mu_n\nu_n\nu_n^T$;
Compute $K$ by (4);
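Since $r$ is fixed, problem (10) is a linear program in $\mu$ and can be sketched with `scipy.optimize.linprog`. A minimal sketch, treating $I_t$, $J_t$ as the row/column index sets of the constrained entries of $S_t$ (our reading of the paper's notation):

```python
import numpy as np
from scipy.optimize import linprog

def low_rank_weights(S, I_idx, J_idx):
    """Heuristic for problem (10): re-weight the rank-one terms of S.

    Minimizes r^T mu subject to sum_n mu_n * V^n[i, j] = S[i, j] for the
    constrained pairs (i, j) and mu >= 0, where V^n = nu_n nu_n^T.
    """
    beta, nu = np.linalg.eigh(S)  # S = sum_n beta_n nu_n nu_n^T
    N = S.shape[0]
    r = np.array([nu[:, n] @ nu[:, n] for n in range(N)])  # trace(nu_n nu_n^T)
    pairs = [(i, j) for i in I_idx for j in J_idx]
    # one equality row per constrained entry: V^n(i, j) = nu[i, n] * nu[j, n]
    A_eq = np.array([[nu[i, n] * nu[j, n] for n in range(N)] for i, j in pairs])
    b_eq = np.array([S[i, j] for i, j in pairs])
    res = linprog(r, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * N)
    if not res.success:
        raise ValueError("LP infeasible; relax the constrained entries")
    mu = res.x
    return (nu * mu) @ nu.T  # K_t = sum_n mu_n nu_n nu_n^T
```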


D. Kernel with More Constraints

Biclusters, the nearest positive semi-definite matrix, and low rank have all been shown to be effective for kernel construction. In addition, when a kernel is used to measure similarity, it is often helpful to normalize the kernel matrix so that all diagonal elements are 1. We now summarize all these constraints in the following optimization problem:

$$\begin{array}{ll}\min & \operatorname{rank}(K_t)\\ \text{s.t.} & K_{B_t} = S_{B_t},\\ & \operatorname{diag}(K_t) = e,\\ & \|K_t - S_t\|_2^2 \le c,\\ & K_t \succeq 0,\end{array} \qquad (11)$$

where $e$ is the vector of all ones and $c$ is a predefined small constant. The heuristic method for problem (10) can be extended to solve problem (11); instead of (11), we solve the following problem:

$$\begin{array}{ll}\min & r^T\mu\\ \text{s.t.} & \sum_{n=1}^{N}\mu_n V_t^n(i,j) = S_t(i,j),\ i\in I_t,\ j\in J_t,\\ & \sum_{n=1}^{N}\mu_n V_t^n(i,i) = 1,\\ & \sum_{n=1}^{N}(\beta_n-\mu_n)^2 \le c,\\ & \mu_n \ge 0.\end{array} \qquad (12)$$

Sub-algorithm 4 is the same as Sub-algorithm 3, except that the $\mu$'s are obtained by solving (12).

V. NUMERICAL RESULTS

We examine the efficiency of the kernel construction methods by running the main algorithm on a CF system. Although the main algorithm is presented only for the active user in Section IV, we have also experimented with the algorithm for the active item. All experiments are conducted on a PC with an Intel Core 2 Duo CPU at 2.67 GHz and 6 GB of RAM.

The training and test datasets are taken from MovieLens [11], which consists of 100,000 ratings from 943 users on 1,682 movies. The proposed methods are compared with benchmark methods on the training/test splits ua.base, ua.test, ub.base, and ub.test. The ratings range from 1 to 5. Since only two types of preference are considered, we relabel data points with rating 3, 4, or 5 as +1 and data points with rating 1 or 2 as -1. The results are compared with a standard SVM (S-SVM) classifier [6], in which pairwise similarity is defined over all items, on the measures of precision, recall, and accuracy. For equation (4), we solve the QCQP problem in [12] to obtain $\lambda_t$, $t = 1,\dots,T$, given the $T$ biclusters output by the FLOC algorithm.

The experimental results on "ua.test" with the user-based and item-based classification methods are reported in Table I, and the results on "ub.test" in Table II. As the two tables show, the bicluster methods are comparable to S-SVM in precision and significantly outperform S-SVM in recall and accuracy. We have also tried to solve (8) and (11) directly; however, existing software [7] can only handle problems of size about 100×100, and the resulting accuracy is about 77% due to the lack of training instances. With our heuristic methods, the two problems can be solved easily. Furthermore, Sub-algorithms 3 and 4 perform better overall than Sub-algorithms 1 and 2.

Table I. Performance comparison of kernel construction methods with S-SVM on "ua.test"

Method   | User precision (%) | User recall (%) | User accuracy (%) | Item precision (%) | Item recall (%) | Item accuracy (%)
S-SVM    | 86.64 | 64.68 | 62.09 | 85.75 | 66.62 | 62.79
Sub-Alg1 | 84.91 | 97.93 | 83.70 | 84.96 | 97.22 | 83.26
Sub-Alg2 | 85.04 | 97.83 | 83.78 | 85.04 | 97.21 | 83.35
Sub-Alg3 | 84.81 | 98.75 | 84.15 | 84.90 | 97.77 | 83.58
Sub-Alg4 | 85.00 | 98.43 | 84.15 | 84.91 | 97.74 | 83.57

Table II. Performance comparison of kernel construction methods with S-SVM on "ub.test"

Method   | User precision (%) | User recall (%) | User accuracy (%) | Item precision (%) | Item recall (%) | Item accuracy (%)
S-SVM    | 85.85 | 64.97 | 61.80 | 85.86 | 64.46 | 61.45
Sub-Alg1 | 84.60 | 97.92 | 83.37 | 84.62 | 97.15 | 82.87
Sub-Alg2 | 84.77 | 97.87 | 83.53 | 84.74 | 96.80 | 82.72
Sub-Alg3 | 84.38 | 98.72 | 83.67 | 84.73 | 97.26 | 83.07
Sub-Alg4 | 84.39 | 98.68 | 83.66 | 84.78 | 97.23 | 83.11
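For reference, a minimal sketch of the evaluation protocol described above, assuming ±1 labels produced from 1-5 star ratings; `y_pred` would come from any of the classifiers compared in the tables:

```python
import numpy as np

def binarize(ratings):
    """Map 1-5 star ratings to the paper's labels: {3,4,5} -> +1, {1,2} -> -1."""
    return np.where(np.asarray(ratings) >= 3, 1, -1)

def evaluate(y_true, y_pred):
    """Precision, recall, and accuracy for the +1 ('like') class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    precision = tp / max(np.sum(y_pred == 1), 1)
    recall = tp / max(np.sum(y_true == 1), 1)
    accuracy = np.mean(y_pred == y_true)
    return precision, recall, accuracy
```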

VI. CONCLUSION

We have presented four methods for constructing the kernel matrices used in a kernel-based learning model for CF. Similarities among users outside the active user's neighborhood were combined with similarities within biclusters to form a similarity matrix. Such a similarity matrix might not be a kernel matrix, so the nearest positive semi-definite matrix was used to approximate it. Among these approximation matrices, we chose one with low rank as the kernel matrix to avoid overfitting. Experimental studies have shown that the bicluster methods can significantly improve prediction recall and accuracy.

REFERENCES

[1] J. Abernethy, F. Bach, T. Evgeniou, and J. Vert, A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization, Journal of Machine Learning Research, 10, 803-826, 2009.

[2] G. Adomavicius and A. Tuzhilin, Towards the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, June 2005.

[3] J. Basilico and T. Hofmann, Unifying Collaborative and Content-based Filtering, ICML, 2004.

[4] D. Billsus and M. J. Pazzani, Learning Collaborative Information Filters, In Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 46-54, 1998.

[5] J. S. Breese, D. Heckerman, and C. Kadie, Analysis of Predictive Algorithms for Collaborative Filtering, In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43-52, 1998.

[6] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] CVX, http://cvxr.com/cvx.

[8] A. Gunawardana and G. Shani, A Survey of Accuracy Evaluation Metrics of Recommendation Tasks, Journal of Machine Learning Research, 10, 2935-2962, 2009.

[9] R. Burke, Hybrid Recommender Systems: Survey and Experiments, User Modeling and User-Adapted Interaction, vol. 12, pp. 331-370, 2002.

[10] M. Fazel, H. Hindi, and S. P. Boyd, A Rank Minimization Heuristic with Application to Minimum Order System Approximation, In Proceedings of the American Control Conference, vol. 6, 4734-4739, 2001.

[11] MovieLens, http://www.cs.umn.edu/Research/GroupLens/.

[12] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan, Learning the Kernel Matrix with Semidefinite Programming, Journal of Machine Learning Research, 5, 27-72, 2004.

[13] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor, On Kernel Target Alignment, Technical Report NeuroColt 2001-099, Royal Holloway University of London, 2001.

[14] M. Pazzani, A Framework for Collaborative, Content-based and Demographic Filtering, Artificial Intelligence Review, 13, 393-408, 1999.


[15] H.-D. Qi and D. Sun, A Quadratically Convergent Newton Method for Computing the Nearest Correlation Matrix, SIAM J. Matrix Anal. Appl., vol. 28, no. 2, pp. 360-385, 2006.

[16] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, GroupLens: An Open Architecture for Collaborative Filtering of Netnews, In Proceedings of the ACM 1994 Conference on Computer Supported Cooperative Work (CSCW '94), Chapel Hill, NC, pp. 175-186, 1994.

[17] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl, Item-based Collaborative Filtering Recommendation Algorithms, In Proceedings of the 10th International World Wide Web Conference (WWW10), Hong Kong, 2001.

[18] P. Symeonidis, A. Nanopoulos, A. Papadopoulos, and Y. Manolopoulos, Nearest-Biclusters Collaborative Filtering, WEBKDD '06, Philadelphia, Pennsylvania, USA, 2006.

[19] L. Ungar and D. Foster, Clustering Methods for Collaborative Filtering, In Proceedings of the Workshop on Recommendation Systems, AAAI Press, Menlo Park, California.

[20] J. Yang, W. Wang, H. Wang, and P. Yu, δ-clusters: Capturing Subspace Correlation in a Large Data Set, ICDE, 517-528, 2002.
