An Experimental Investigation of Graph Kernels on a Collaborative Recommendation Task

François Fouss, Luh Yen, Alain Pirotte & Marco Saerens
Information Systems Research Unit (ISYS), Université catholique de Louvain
{francois.fouss,luh.yen,alain.pirotte,marco.saerens}@ucLouvain.be

Abstract

This work presents a systematic comparison between seven kernels (or similarity matrices) on a graph, namely the exponential diffusion kernel, the Laplacian exponential diffusion kernel, the von Neumann diffusion kernel, the regularized Laplacian kernel, the commute time kernel, and finally the Markov diffusion kernel and the cross-entropy diffusion matrix (both introduced in this paper), on a collaborative recommendation task involving a database. The database is viewed as a graph whose elements are represented as nodes and whose relations are represented as links between nodes. From this graph, seven kernels are computed, leading to a set of meaningful proximity measures between nodes that can be used to answer questions about the structure of the graph under investigation and, in particular, to recommend items to users. Cross-validation results indicate that a simple nearest-neighbours rule based on the similarity measures provided by the regularized Laplacian, the Markov diffusion, and the commute time kernels performs best. We therefore recommend the use of the commute time kernel for computing similarities between elements of a database, for two reasons: (1) it has an appealing interpretation in terms of random walks and (2) no parameter needs to be adjusted.

1 Introduction

There has been a recent growth of interest in mining structured data, such as graphs or relational databases (see for instance [6], [20], and the other papers in these special issues). In this framework, new types of kernels were introduced, allowing, to a certain extent, questions about the structure of the graph under investigation to be answered. Indeed, once we have defined a meaningful kernel on a graph, a number of interesting measures are provided almost for free. For instance, a distance measure between the nodes of the graph can easily be deduced from the kernel. Computing distances or similarities between pairs of nodes makes it possible to determine the item that is most relevant (that is, most similar) to a given item and, for instance, to cluster or label items. It is worth mentioning that these distance (similarity) measures take the indirect paths between the nodes into consideration, and have the nice property of decreasing (increasing) when the number of paths connecting two nodes increases and when the “length” of any path decreases (the communication is facilitated). In short, two nodes are considered similar if there are many short paths connecting them. On the contrary, the “shortest path” (also called “geodesic” or “Dijkstra”) distance between nodes of a graph does not necessarily decrease when connections between nodes are added, and thus does not capture the fact that strongly connected nodes are closer than weakly connected nodes. Moreover, one can easily show that these distances are Euclidean [7], [8]; that is, the nodes of the graph can be embedded in a Euclidean space where the distances between nodes are exactly preserved. This property can be used to visualize the graph in a lower-dimensional space or, more generally, to define a subspace projection that has some optimality property, such as a principal components analysis [7], [8] or a discriminant analysis of the graph. In [14], we introduced such a kernel, the “commute time kernel”, and derived some of its properties. This graph kernel performs well in collaborative recommendation, as shown in [7].

Following this previous work, the objective of the present paper is twofold. First, we provide a comprehensive comparison, on a collaborative recommendation task, between five kernels that have been proposed in the literature. Second, we propose two new graph similarity matrices (the first one being a kernel matrix), the Markov diffusion kernel and the cross-entropy diffusion matrix, and compare them with the other kernels. More precisely, we first review the basic concepts behind kernel matrices on a graph. Seven recently defined kernels on a graph are then briefly reviewed, namely the exponential diffusion kernel [10], the Laplacian exponential diffusion kernel [10], [19], the von Neumann diffusion kernel [17], the regularized Laplacian kernel [9], [19], the commute time kernel [7], [14], and finally the Markov diffusion kernel and the cross-entropy diffusion matrix, both introduced in this paper and inspired by [11] and [13]. Finally, the power of this approach is illustrated by applying (and comparing) the seven kernel-based methods on a collaborative recommendation task. Comparisons with standard collaborative recommendation methods are also reported.

Section 2 reviews the basic relevant theory. Section 3 briefly introduces the seven investigated kernels. Section 4 details the experiments, a comparison between the seven above-mentioned kernels on a collaborative recommendation task. Section 5 concludes the work.

2 Basic relevant theory and notations

In this section, we briefly review the basic theory behind kernels on graphs. In a very general setting, once we have proved that some meaningful proximity measure between the nodes of a graph is a kernel matrix, a number of derived quantities and interesting results automatically follow, almost for free (see [8]). Let us consider a weighted, undirected graph G with n nodes and symmetric weights $w_{ij} > 0$ between every couple of nodes i and j that are linked by an edge. The weight $w_{ij}$ of the edge connecting node i and node j should be set to some meaningful value, with the following convention: the more important the relation between node i and node j, the larger the value of $w_{ij}$, and consequently the easier the communication through the edge. The elements $a_{ij}$ of the adjacency matrix $\mathbf{A}$ of the graph are defined in the standard way: $a_{ij} = w_{ij}$ if node i is connected to node j, and $a_{ij} = 0$ otherwise. We also introduce the Laplacian matrix $\mathbf{L}$ of the graph, defined in the usual manner, $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D} = \mathrm{Diag}(a_{i.})$ is the diagonal degree matrix, with diagonal entries $d_{ii} = [\mathbf{D}]_{ii} = a_{i.} = \sum_{j=1}^{n} a_{ij}$. We suppose that the graph has a single connected component; that is, any node can be reached from any other node of the graph. Moreover, one can easily show that $\mathbf{L}$ is symmetric and positive semidefinite (see for instance [4]). Any “meaningful” similarity coefficient defined on every couple of nodes, K(i, j), leading to a positive semidefinite matrix $\mathbf{K}$ with entries $[\mathbf{K}]_{ij} = k_{ij} = K(i, j)$, could be used as a kernel on a graph. Let us assume that we defined such a similarity matrix from the adjacency matrix $\mathbf{A}$ (seven such similarity matrices are investigated in the experimental section). In general, such a similarity matrix will take both direct links ($a_{ij} > 0$) and indirect links (that is, paths involving more than one link) connecting the nodes into account.
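As an illustration of these definitions, here is a minimal NumPy sketch (the toy graph and its weights are ours, not taken from the paper) that builds A, D, and L for a small weighted graph and checks numerically that L is positive semidefinite:

```python
import numpy as np

# Toy weighted undirected graph with 4 nodes (weights chosen arbitrarily).
A = np.array([[0., 1., 2., 0.],
              [1., 0., 1., 0.],
              [2., 1., 0., 3.],
              [0., 0., 3., 0.]])

D = np.diag(A.sum(axis=1))   # diagonal degree matrix, d_ii = a_i.
L = D - A                    # Laplacian matrix L = D - A

# L is symmetric positive semidefinite: its eigenvalues are all >= 0
# (the smallest one is 0 for a connected graph).
print(np.round(np.linalg.eigvalsh(L), 6))
```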


We review the main concepts behind this idea in [8]. These concepts basically come from the multidimensional scaling field [1], [5], and from the study of kernel methods [16], [17]. The kernel matrix $\mathbf{K}$ automatically induces a distance measure between nodes. Indeed,
$$d_{ij}^2 = \|\mathbf{x}_i - \mathbf{x}_j\|^2 = (\mathbf{x}_i - \mathbf{x}_j)^T (\mathbf{x}_i - \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_i + \mathbf{x}_j^T \mathbf{x}_j - 2\,\mathbf{x}_i^T \mathbf{x}_j = (\mathbf{e}_i - \mathbf{e}_j)^T \mathbf{K} (\mathbf{e}_i - \mathbf{e}_j).$$
This expression defines a distance between any couple of nodes of the graph. Since it corresponds to a Euclidean distance in $\mathbb{R}^n$, it verifies all the properties of a distance metric (positiveness, triangle inequality, etc.).
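As a small sketch (assuming a precomputed kernel matrix K stored as a NumPy array; the function name is ours), the induced squared distance follows directly from three kernel entries:

```python
import numpy as np

def kernel_distance_sq(K: np.ndarray, i: int, j: int) -> float:
    """Squared distance induced by a kernel matrix K:
    d_ij^2 = (e_i - e_j)^T K (e_i - e_j) = k_ii + k_jj - 2 k_ij."""
    return float(K[i, i] + K[j, j] - 2.0 * K[i, j])
```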

3 Seven kernels on a graph

In this section, we review five recently proposed kernels on a graph and introduce two new ones. These kernels will be compared on a collaborative recommendation task in the experimental section. Notice that, because of a lack of space, we only introduce the formulas for each kernel; the details are provided in [8].

The exponential diffusion kernel. The exponential diffusion kernel, introduced by Kondor & Lafferty [10], [19], is defined as
$$\mathbf{K}_{ED} = \sum_{k=0}^{\infty} \frac{\alpha^k \mathbf{A}^k}{k!} = \exp(\alpha \mathbf{A}) \qquad (1)$$

where $\mathbf{A}$ is the adjacency matrix. The parameter $\alpha$ has to be tuned, usually by a trial-and-error procedure.

The Laplacian exponential diffusion kernel. The Laplacian exponential diffusion kernel, also introduced in [10], [19], is defined as
$$\mathbf{K}_{LED} = \sum_{k=0}^{\infty} \frac{\alpha^k (-\mathbf{L})^k}{k!} = \exp(-\alpha \mathbf{L}) \qquad (2)$$
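Equations (1) and (2) can be evaluated directly with a matrix exponential; a minimal sketch (the toy adjacency matrix and the value of alpha are illustrative, not those used in the experiments):

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])            # toy adjacency matrix
L = np.diag(A.sum(axis=1)) - A          # Laplacian matrix
alpha = 0.5                             # to be tuned in practice

K_ED = expm(alpha * A)                  # exponential diffusion kernel, Eq. (1)
K_LED = expm(-alpha * L)                # Laplacian exponential diffusion kernel, Eq. (2)
```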

which is similar to Equation (1), except that it uses the Laplacian matrix instead of the adjacency matrix as base matrix. Once more, the parameter $\alpha$ has to be tuned by a trial-and-error procedure.

The von Neumann diffusion kernel. We now introduce the von Neumann diffusion kernel [17], which differs from the exponential diffusion kernel by its discounting scheme: the von Neumann diffusion kernel has an exponential discounting rate,
$$\mathbf{K}_{VND} = \sum_{k=0}^{\infty} \alpha^k \mathbf{A}^k = (\mathbf{I} - \alpha \mathbf{A})^{-1} \qquad (3)$$

where the discounting factor is now $\alpha^k$.

The regularized Laplacian kernel. The regularized Laplacian kernel $\mathbf{K}_{RL}$ [9], [3] is defined as
$$\mathbf{K}_{RL} = \sum_{k=0}^{\infty} \alpha^k (-\mathbf{L}_{\gamma})^k = (\mathbf{I} + \alpha \mathbf{L}_{\gamma})^{-1} \qquad (4)$$
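Equations (3) and (4) reduce to matrix inversions; a possible sketch (toy matrices and parameter values are ours). Note that the von Neumann series only converges when α is smaller than the reciprocal of the spectral radius of A, which is why α is scaled by ||A||_2^-1 in the experiments of Section 4:

```python
import numpy as np

A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
D = np.diag(A.sum(axis=1))
I = np.eye(A.shape[0])

# Keep alpha below 1/spectral_radius(A) so that the series in Eq. (3) converges.
alpha = 0.9 / np.max(np.abs(np.linalg.eigvalsh(A)))
K_VND = np.linalg.inv(I - alpha * A)          # von Neumann diffusion kernel, Eq. (3)

gamma = 1.0
L_gamma = gamma * D - A                       # modified Laplacian matrix
K_RL = np.linalg.inv(I + alpha * L_gamma)     # regularized Laplacian kernel, Eq. (4)
```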

where $\mathbf{L}_{\gamma} = \gamma \mathbf{D} - \mathbf{A}$ is the modified Laplacian matrix. This kernel has a nice interpretation in terms of the matrix-forest theorem [3]. Two parameters, $\alpha$ and $\gamma$, need to be tuned.

The commute time kernel. The commute time kernel, introduced in [7], [14], is defined as
$$\mathbf{K}_{CT} = \mathbf{L}^{+} \qquad (5)$$
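Equation (5) only requires a pseudoinverse; a minimal sketch (toy graph as before):

```python
import numpy as np

A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A
K_CT = np.linalg.pinv(L)    # commute time kernel, Eq. (5); no parameter to tune
```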

where $\mathbf{L}^{+}$ is the Moore-Penrose pseudoinverse of the Laplacian matrix of the graph. The commute time kernel is closely related to the average first-passage time between nodes in a Markov model based on the initial graph [7]. This time, no parameter tuning is necessary.

The Markov diffusion kernel. We now introduce a new kernel on a graph, based on [13] and [11]. In their work, the authors propose a new distance measure between nodes of a graph based on a continuous-time diffusion model. This work is adapted here to discrete-time processes, in order to define a valid kernel on a graph. Let us consider a discrete Markov model and let s(t) be a random variable representing the state of the Markov model at time step t. Thus, if the process is in state $i \in \{1, \ldots, n\}$ at time t, then s(t) = i. Let us denote the probability of being in state i at time t by $x_i(t) = \mathrm{P}(s(t) = i)$ and define $\mathbf{P}$ as the transition probability matrix with entries $p_{ij} = [\mathbf{P}]_{ij} = \mathrm{P}(s(t+1) = j \mid s(t) = i)$, where $[\mathbf{P}]_{ij}$ is element i, j of matrix $\mathbf{P}$. We consider a Markov model for which the transition probabilities are given by $\mathrm{P}(s(t+1) = j \mid s(t) = i) = a_{ij} / a_{i.}$, with $a_{i.} = \sum_{j=1}^{n} a_{ij}$. In other words, from any state or node s(t) = i, the probability of jumping to an adjacent node s(t+1) = j is proportional to the weight $a_{ij} \geq 0$ of the edge connecting i and j (this corresponds to a standard random walk model). Since we assume that the graph is connected, the Markov chain is irreducible, that is, every state can be reached from any other state. In matrix form, the evolution of the Markov chain is characterized by $\mathbf{x}(t+1) = \mathbf{P}^T \mathbf{x}(t)$, which provides the state probability distribution $\mathbf{x}(t) = [x_1(t), x_2(t), \ldots, x_n(t)]^T$ at time t once the initial probability distribution $\mathbf{x}(0) = \mathbf{x}_0$ at t = 0 is known. It is clear that after t steps, the probability distribution will be given by $\mathbf{x}(t) = (\mathbf{P}^T)^t \mathbf{x}_0$. Now, from this model, we compute a quantity, $x_{ik}(t)$, which, roughly speaking, corresponds to the average rate of presence in state k after t steps, given that the process started in state i at time t = 0: $x_{ik}(t) = \frac{1}{t} \sum_{\tau=1}^{t} \mathrm{P}(s(\tau) = k \mid s(0) = i)$. From this quantity, we define the diffusion distance at time t as
$$d_{ij}(t) = \sum_{k=1}^{n} \left( x_{ik}(t) - x_{jk}(t) \right)^2 \qquad (6)$$

which corresponds to the sum of the squared differences between the rates of presence at node k after t transitions, when starting from two different nodes, node i and node j. This is a natural definition which quantifies the dissimilarity between two nodes based on the evolution of the probability distribution. Of course, when i = j, $d_{ij}(t) = 0$. By defining $\mathbf{Z}(t) = \frac{1}{t} \sum_{\tau=1}^{t} \mathbf{P}^{\tau}$, developing Equation (6) leads to (see [8] for details)
$$d_{ij}(t) = (\mathbf{e}_i - \mathbf{e}_j)^T \mathbf{Z}(t) \mathbf{Z}^T(t) (\mathbf{e}_i - \mathbf{e}_j) \qquad (7)$$

We immediately deduce the form of the Markov diffusion kernel,
$$\mathbf{K}_{MD}(t) = \mathbf{Z}(t) \mathbf{Z}^T(t), \quad \text{with } \mathbf{Z}(t) = \frac{1}{t} \sum_{\tau=1}^{t} \mathbf{P}^{\tau} \qquad (8)$$
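A direct sketch of Equation (8), computing Z(t) by summing powers of the transition matrix rather than through the absorbing-state trick described in the next paragraph (toy graph and value of t are ours):

```python
import numpy as np

A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
P = A / A.sum(axis=1, keepdims=True)    # transition matrix, p_ij = a_ij / a_i.

def markov_diffusion_kernel(P: np.ndarray, t: int) -> np.ndarray:
    """K_MD(t) = Z(t) Z(t)^T with Z(t) = (1/t) sum_{tau=1..t} P^tau, Eq. (8)."""
    Z = sum(np.linalg.matrix_power(P, tau) for tau in range(1, t + 1)) / t
    return Z @ Z.T

K_MD = markov_diffusion_kernel(P, t=2)
```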

The parameter t has to be tuned. Notice that, in order to evaluate $\mathbf{Z}(t)$, we use a trick similar to the one used in PageRank [12]: a dummy absorbing state, linked to all the states of the Markov chain, is created with a very small probability of jumping to it. This amounts to subtracting some very small quantity from every element of $\mathbf{P}$, with the result that the matrix $\mathbf{P}$ becomes substochastic and that $\mathbf{Z}(t)$ admits the following analytical form:
$$\mathbf{Z}(t) = \frac{1}{t} (\mathbf{I} - \mathbf{P})^{-1} (\mathbf{I} - \mathbf{P}^t) \mathbf{P}$$

The cross-entropy diffusion matrix. Now, if, instead of using (6), the symmetric cross-entropy divergence is computed,
$$d_{ij}(t) = \sum_{k=1}^{n} \left[ x_{ik}(t) \log \frac{x_{ik}(t)}{x_{jk}(t)} + x_{jk}(t) \log \frac{x_{jk}(t)}{x_{ik}(t)} \right] \qquad (9)$$
one obtains the corresponding cross-entropy diffusion matrix (since the distance cancels out the asymmetric part of the matrix, the matrix can be symmetrized without change),
$$\mathbf{K}_{CED}(t) = \mathbf{Z}(t) \log(\mathbf{Z}^T(t)) + \log(\mathbf{Z}(t)) \mathbf{Z}^T(t) \qquad (10)$$

where the logarithm is taken elementwise and $\mathbf{Z}(t)$ is defined as in Equation (8). Notice, however, that we did not prove the positive semidefiniteness of this matrix; it therefore cannot be considered as a kernel at this time.
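A corresponding sketch of Equation (10); the small epsilon added before the elementwise logarithm is our own safeguard against zero entries of Z(t), not part of the original formulation:

```python
import numpy as np

A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
P = A / A.sum(axis=1, keepdims=True)
t, eps = 4, 1e-12

Z = sum(np.linalg.matrix_power(P, tau) for tau in range(1, t + 1)) / t
logZ = np.log(Z + eps)                   # elementwise logarithm
K_CED = Z @ logZ.T + logZ @ Z.T          # cross-entropy diffusion matrix, Eq. (10)
```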

4 Experiments and results

4.1 Experimental methodology

Our experiments were performed on a real database, the MovieLens data set (see [8] for additional results obtained on the Book-Crossing data set). The MovieLens data set is a real movie database from the web-based recommender system MovieLens (www.movielens.umn.edu). People are invited to visit MovieLens in order to rate movies and to ask for movie recommendations.

We used a sample of this database proposed in [15]. Enough users (943 users) were randomly selected to obtain 100,000 ratings, considering only users that had rated 20 or more movies, out of a total of 1,682 movies. Basically, the database contains two tables, one grouping the users and one grouping the items. A bipartite graph is defined from this database: each element of the people (i.e., users) table and of the item table corresponds to a node of the graph, and each node of the people table is connected by an edge to each item purchased by the corresponding user (items are watched movies in our case). The results shown here do not take into account the numerical value of the ratings provided by the users but only the fact that a user has or has not purchased the item (i.e., entries in the user-item matrix are 0's and 1's).

The experimental methodology is similar to the one used in [7]; please refer to that work for the experimental setting. Suffice it to say that the performances are assessed by a 10-fold cross-validation. For each run of the cross-validation and each user, some watched movies are removed from the training set and added to the test set. Then, a ranking of the movies from the test set is provided by each scoring algorithm. The ranking provided by the scoring algorithms is then compared to the “ideal ranking”, where watched movies present in the test set are ranked first (we indeed ideally expect watched movies to be at the top of the ranked list). To compare the ranked list provided by the scoring algorithms described in Section 4.2 with the ideal ranking, three different measures are used: (1) the degree of agreement (which is a variant of Somers' D [18]), (2) a percentile score, and (3) a recall score. These measures are detailed in [7]. Notice also that a preliminary experiment was performed to tune the parameters of each scoring algorithm (see [7]).

Three different methods were used to determine which items to suggest to a particular user, based on the similarities between pairs of nodes (details are provided in [8]): the direct method (similarities between a user and an item are computed directly), the user-based indirect method (similarities between a user and an item are computed indirectly by using user-user similarities and defining a neighbourhood of the active user), and the item-based indirect method (similarities between a user and an item are computed indirectly by using item-item similarities and defining a neighbourhood of the items chosen by the active user). A sketch of the graph construction and of the direct scoring method is given below.
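The following sketch shows the graph construction and the direct scoring method on a toy binary user-item matrix (the data, the choice of the commute time kernel, and the variable names are illustrative only):

```python
import numpy as np

# Toy binary user-item matrix (3 users x 4 items); 1 means "user watched the movie".
W = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1]], dtype=float)
n_users, n_items = W.shape

# Bipartite adjacency matrix: users are nodes 0..n_users-1, items come after.
A = np.zeros((n_users + n_items, n_users + n_items))
A[:n_users, n_users:] = W
A[n_users:, :n_users] = W.T

# Direct method: rank the unwatched items of a user by the kernel similarity
# between the user node and each item node (commute time kernel here).
K = np.linalg.pinv(np.diag(A.sum(axis=1)) - A)
user = 0
scores = K[user, n_users:].copy()
scores[W[user] == 1] = -np.inf      # never recommend already-watched items
ranking = np.argsort(-scores)       # item indices, best first
print(ranking)
```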

4.2 Scoring algorithms

The seven scoring algorithms are based on the seven graph kernels defined in Section 3 and are compared to more standard techniques: the k-nearest neighbours (kNN) and the maximum frequency (MaxF) algorithms. The k-nearest neighbours method has been chosen for comparison since it provided the best results among all the other standard methods (see [7]): the cosine coefficient method, Katz' method, and the shortest path algorithm. The maximum-frequency algorithm has been chosen because of its simplicity and its intuitive basis (it simply ranks the items by the number of users who purchased them). The details about these two standard methods, as well as the results they obtain on the MovieLens data set, are provided in [7].

Notice that, for each of the seven investigated kernels, we additionally computed four measures derived from the kernel matrix (see [8] for more details): (1) the centered kernel matrix, (2) the cosine kernel matrix, (3) the centered cosine kernel matrix, and (4) the distance measure derived from the kernel matrix. Each of these four measures has been systematically investigated in exactly the same way as the original kernel. However, we observed that the results obtained by using the original kernels were always at least as good as those obtained by the derived measures (as already observed for the commute time kernel in [7]). In other words, the derived measures (1)-(4) do not improve the results obtained by simply using the original kernel. We therefore do not report these results. A sketch of two of these derived measures is given at the end of this subsection.

Exponential diffusion kernel (ED). We compute the ED kernel by Equation (1). We systematically varied the parameter α (= 10^-6, 10^-5, ..., 10^2) in the preliminary experiment, and show in the sequel only the best results (obtained with α = 10^-6).

Laplacian exponential diffusion kernel (LED). We compute the LED kernel by Equation (2). We systematically varied the parameter α (= 10^-6, 10^-5, ..., 10^2) in the preliminary experiment, and show in the sequel only the best results (obtained with α = 10^-2).

Von Neumann diffusion kernel (VND). We compute the VND kernel by Equation (3). We systematically varied the parameter α (= (10^-6, 10^-5, ..., 10^-1, 0.99) × ||A||_2^-1) in the preliminary experiment, and show in the sequel only the best results (obtained with α = 10^-6 × ||A||_2^-1).

Regularized Laplacian kernel (RL). We compute the RL kernel by Equation (4). We first tuned the parameter γ (= 10^-6, 10^-5, ..., 1) in the preliminary experiment (fixing the parameter α = 1); the best results were obtained with γ = 1. We then systematically varied, still in the preliminary experiment, the parameter α (= 10^-6, 10^-5, ..., 10^5), and show in the sequel only the best results (obtained with α = 10^2 and γ = 1).

Commute time kernel (CT). We compute the CT kernel by taking the pseudoinverse of the Laplacian matrix; see Equation (5). No parameter needs to be tuned.

Markov diffusion kernel (MD). We compute the MD kernel by Equation (8). We systematically varied the parameter t (= 1, 2, ..., 10, 50, 100) in the preliminary experiment, and show in the sequel only the best results (obtained with t = 2).

Cross-entropy diffusion matrix (CED).

We compute the CED matrix by Equation (10). We systematically varied the parameter t (= 1, 2, ..., 10, 50, 100) in the preliminary experiment, and show in the sequel only the best results (obtained with t = 4, except for the recall 10 on the MovieLens data set, where t = 5).
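For completeness, here is a sketch of two of the derived measures mentioned at the beginning of this subsection, under the usual definitions of kernel centering and cosine normalization (the exact formulations used in [8] may differ):

```python
import numpy as np

def centered_kernel(K: np.ndarray) -> np.ndarray:
    """Centered kernel matrix: H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cosine_kernel(K: np.ndarray) -> np.ndarray:
    """Cosine-normalized kernel: k_ij / sqrt(k_ii * k_jj)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```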

4.3 Cross-validation results

All the results are summarized in Table 1, which shows the three performance measures: the degree of agreement (Agreement), the percentile score (Percentile), and the recall, considering either the top 10 of the ranked list (Recall 10) or the top 20 of the ranked list (Recall 20). The standard deviation of the results (STD) across the 10 cross-validation runs is also reported, as well as the optimal number of neighbours (Neighbours), when applicable (indirect methods). Notice that we use a paired t-test to determine if there is a significant difference (with a p-value smaller than 10^-2) between the results of the various scoring algorithms (across the 10 runs). The best results, for each measure of performance and for each method (i.e., direct, user-based indirect, or item-based indirect), are identified based on this t-test.

Table 1 shows that, when using the direct method to rank the movies for each user, the best results are obtained by MD, whatever performance measure is used (degree of agreement, percentile, or recall). For the user-based indirect method, the best results are obtained by CT and RL. In particular, the best degree of agreement is provided by both scoring algorithms with no significant difference, the best percentile score by RL, whereas the best recall scores (either the recall 10 or the recall 20) are obtained by CT. For the item-based indirect method, kNN (for the degree of agreement, the recall 10, and the recall 20) and RL (only for the percentile score) provide the best results.

When looking at the global performance of the various scoring algorithms (regardless of the direct or indirect way the similarities are computed), we observe that the best degree of agreement is provided by kNN (93.27 in the item-based indirect method), the best percentile is provided by RL (4.53 in the user-based indirect method), and the best recalls are provided by CT (21.43 for recall 10 and 32.20 for recall 20, both in the user-based indirect method). We also observe that the user-based indirect method provides better recommendations than the item-based indirect method for all the scoring algorithms and all the performance measures, except for the degree of agreement and the percentile for kNN.

The results indicate that the best kernel-based methods (CT, MD, and RL) provide results comparable to the best standard method (kNN). However, the kernel-based methods are more general since they provide a generic way of computing similarities between nodes of an undirected graph, while the kNN method can only be used in the case of bipartite graphs. Thus, the kernel-based methods could also take the characteristics of the users and the items into account, in addition to the links (purchases) between users and items. This will be investigated in a forthcoming paper. Moreover, the results obtained by using the original kernels were always at least as good as those obtained by the derived measures (see [8]). In other words, the derived measures obtained by centering the kernel matrix, by using the cosine kernel matrix, or by using the derived distance (see [8]) do not improve the results obtained by simply using the original kernel. This indicates that inner-product similarities seem to perform well for computing node similarities.

Another interesting observation is that the Laplacian matrix-based kernels (LED and RL) perform systematically better than their corresponding adjacency matrix-based kernels (ED and VND). This point is discussed in [9], and is linked to the observation that loops are penalized in Laplacian matrix-based kernels, whereas no such penalty is introduced in adjacency matrix-based kernels. Yet another conclusion is that the indirect procedures perform systematically better than the direct one. This is probably related to the fact that the indirect methods take the a priori frequency of the items (MaxF) into account with high correlation, while the direct one is only slightly correlated with item frequencies (see [2], [7]). Finally, among the kernel-based methods, CT, RL, and MD provide the best results. We, however, have a preference for CT, which has an appealing interpretation in terms of random walks (and electrical networks) and for which no parameter needs to be tuned.

5 Conclusion and further work

In this work, we investigated various kernel-based procedures for computing similarities between elements of a database or, more generally, between nodes of an undirected graph. In particular, we proposed two new graph similarity matrices (the first one being a kernel matrix): the Markov diffusion kernel and the cross-entropy diffusion matrix. All these similarity measures can be used to compare items belonging to database tables that are not necessarily directly connected. They rely on the degree and the importance of the connectivity between these items. We showed, through experiments performed on a collaborative recommendation task, that some of these quantities are competitive with standard methods.

Table 1. Average results obtained on the MovieLens data set, by performing a 10-fold cross-validation, for the various scoring algorithms and the three methods defined to use them (direct, user-based indirect, and item-based indirect). Entries marked “/” are not reported for that method.

MovieLens data set, direct method (in %):

                 ED      LED     VND     RL      CT      MD      CED     kNN    MaxF
Agreement       88.38   89.67   88.38   91.17   91.11   92.74   91.86    /     85.98
  STD            0.30    0.28    0.30    0.19    0.17    0.17    0.17    /      0.32
Percentile       8.53    7.17    8.53    5.78    5.87    4.65    5.19    /     10.97
  STD            0.31    0.30    0.31    0.16    0.15    0.18    0.15    /      0.36
Recall 10       14.98   18.06   14.98   18.85   16.31   20.66   20.13    /     11.02
  STD            0.29    0.26    0.29    0.17    0.33    0.24    0.23    /      0.23
Recall 20       23.12   26.95   23.12   28.43   26.39   31.07   30.54    /     17.43
  STD            0.41    0.39    0.41    0.45    0.48    0.46    0.48    /      0.43

MovieLens data set, user-based indirect method (in %):

                 ED      LED     VND     RL      CT      MD      CED     kNN    MaxF
Agreement       90.07   92.28   90.07   92.93   92.90   92.58   92.25   92.64    /
  STD            0.30    0.25    0.30    0.20    0.22    0.17    0.19    0.16    /
  Neighbours       30      20      30     100     100      90      40     100    /
Percentile       6.74    4.70    6.74    4.53    4.63    4.81    5.19    5.53    /
  STD            0.26    0.16    0.26    0.15    0.17    0.16    0.14    0.48    /
  Neighbours       30      30      30      70      60      40      40      50    /
Recall 10       17.07   20.58   17.07   21.00   21.43   20.66   19.44   20.76    /
  STD            0.48    0.30    0.47    0.32    0.41    0.25    0.29    0.33    /
  Neighbours       50      50      50      30      40      50      70      50    /
Recall 20       25.41   31.27   25.94   31.80   32.20   31.27   29.09   31.13    /
  STD            0.32    0.43    0.46    0.50    0.40    0.59    0.41    0.39    /
  Neighbours       30      40      40      40      50      30      30      50    /

MovieLens data set, item-based indirect method (in %):

                 ED      LED     VND     RL      CT      MD      CED     kNN    MaxF
Agreement       86.80   90.82   86.80   92.29   92.13   92.43   89.60   93.27    /
  STD            0.26    0.18    0.26    0.20    0.22    0.22    0.23    0.22    /
  Neighbours      100     100     100     100      70      70      30     100    /
Percentile      10.58    6.73   10.58    4.91    5.20    5.09    9.25    4.93    /
  STD            0.26    0.18    0.26    0.17    0.16    0.21    0.31    0.32    /
  Neighbours      100     100     100      50      40      30      30      40    /
Recall 10        5.83   13.60    5.83   16.93   16.41   16.57    8.92   18.92    /
  STD            0.32    0.29    0.32    0.29    0.65    0.24    0.34    0.27    /
  Neighbours      100      30     100      30      40      30      30      20    /
Recall 20       11.11   22.63   11.11   27.11   27.98   26.67   15.20   30.81    /
  STD            0.46    0.31    0.46    0.24    0.50    0.41    0.27    0.38    /
  Neighbours      100      30     100      30      40      20      20      20    /

References

[1] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, 1997.
[2] M. Brand. A random walks perspective on maximizing satisfaction and profit. Proceedings of the 2005 SIAM International Conference on Data Mining, 2005.
[3] P. Chebotarev and E. Shamis. The matrix-forest theorem and measuring relations in small social groups. Automation and Remote Control, 58(9):1505–1514, 1997.
[4] F. R. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
[5] T. Cox and M. Cox. Multidimensional Scaling, 2nd ed. Chapman and Hall, 2001.
[6] S. Dzeroski. Multi-relational data mining: an introduction. ACM SIGKDD Explorations Newsletter, 5(1):1–16, 2003.
[7] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph, with application to collaborative recommendation. To appear in the IEEE Transactions on Knowledge and Data Engineering, 2006.
[8] F. Fouss, L. Yen, A. Pirotte, and M. Saerens. A study of graph kernels applied to collaborative recommendation. Technical Report, IAG, Université catholique de Louvain, 2006.
[9] T. Ito, M. Shimbo, T. Kudo, and Y. Matsumoto. Application of kernels to link analysis: First results. In Proceedings of the Second Workshop on Mining Graphs, Trees and Sequences, ECML/PKDD, Pisa, pages 13–24, 2004.
[10] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. Proceedings of the 19th International Conference on Machine Learning, pages 315–322, 2002.
[11] S. Lafon and A. B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1393–1403, 2006.
[12] A. Langville and C. D. Meyer. A survey of eigenvector methods for web information retrieval. SIAM Review, 47:135–161, 2005.
[13] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. Advances in Neural Information Processing Systems 18, pages 955–962, 2006.
[14] M. Saerens, F. Fouss, L. Yen, and P. Dupont. The principal components analysis of a graph, and its relationships to spectral clustering. Proceedings of the 15th European Conference on Machine Learning (ECML 2004), Lecture Notes in Artificial Intelligence, Vol. 3201, Springer-Verlag, Berlin, pages 371–383, 2004.
[15] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering. Proceedings of the Fifth International Conference on Computer and Information Technology, 2002.
[16] B. Schölkopf and A. Smola. Learning with Kernels. The MIT Press, 2002.
[17] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[18] S. Siegel and J. Castellan. Nonparametric Statistics for the Behavioral Sciences, 2nd ed. McGraw-Hill, 1988.
[19] A. J. Smola and R. Kondor. Kernels and regularization on graphs. In M. Warmuth and B. Schölkopf, editors, Proceedings of the Conference on Learning Theory (COLT) and Kernels Workshop, 2003.
[20] R. C. Wilson, E. R. Hancock, and B. Luo. Pattern vectors from algebraic graph theory. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1112–1124, 2005.
