2012 26th International Conference on Advanced Information Networking and Applications Workshops
Link Prediction in Social Network Using Co-clustering based Approach
Elham Hoseini, Sattar Hashemi, Ali Hamzeh
Department of Computer Sciences and Engineering
Shiraz, Iran
Abstract-This paper introduces an approach for deciding whether an individual is related to an item. We use the well-known DBLP dataset and try to find skills related to an author that were not previously known. To this end, we first cluster authors and skills separately with the Spectral Graph Clustering algorithm, then obtain author and skill clusters simultaneously with the Bipartite Graph (Bigraph) Spectral Co-clustering approach, and finally generate predictions from the outputs of the clustering and co-clustering steps. In this way we exploit the advantages of both clustering and co-clustering to predict the probability that a link exists between an author and a skill. Experimental results on the DBLP dataset show that the approach works well on this task.
Keywords- Spectral Graph clustering, Bigraph Spectral Co-clustering, Link prediction.
I. INTRODUCTION
The main objective of this paper is to determine whether a given author in a social network has a predefined skill. This problem deserves considerable research because, as a link prediction problem, it has a wide variety of applications in collaborative filtering, information retrieval and other areas. Note that the relationship between authors and skills can be represented as a bigraph, that is, a graph with two disjoint parts and links between their nodes, as shown in Fig. 1. Since authors (skills) can themselves be represented as the nodes of a graph, we use the Spectral Graph Clustering algorithm [11] to cluster them. This method is a powerful technique with many advantages, such as efficient implementation for large data sets, even when the graph is sparse. Consequently, the author-skill bigraph is reduced to a bigraph whose nodes are author clusters and skill clusters. After the clustering step, we perform co-clustering on the author cluster-skill cluster bigraph, which is faster than co-clustering the author-skill bigraph directly and gives more precision, since the partitions are built from pre-formed clusters.
As experimental results on a subset of the DBLP dataset show, this approach outperforms some other approaches in terms of the RMSE metric. The overall design of the proposed algorithm is shown in Fig. 2. The remainder of the paper proceeds as follows. Section 2 describes the Spectral Graph Clustering and Bigraph Spectral Co-clustering algorithms. Section 3 introduces a model to calculate link existence probability. Section 4 presents the experimental study. Section 5 concludes the paper and, finally, Section 6 points to future directions.
Figure 1. The square and circular vertices denote the two kinds of vertices in the bipartite graph. Co-clusters are achieved by partitioning this bipartite graph.
978-0-7695-4652-0/12 $26.00 © 2012 IEEE DOI 10.1109/WAINA.2012.189
II. RELATED WORKS
Link prediction has recently been studied in a wide range of problems, such as collaborative filtering recommendation [1] (predicting user-item links based on a user-item interaction matrix), information retrieval [2] (predicting query-document links based on a document-word network), the record linkage problem [3] (predicting links among records with the same identity) and social networks [4] (predicting author-author links). To solve this problem, various methods have been introduced, including Probabilistic Relational Models (PRMs) [5], Relational Markov Networks (RMNs) [6], logistic regression models [7] and other supervised learning algorithms [8]. Almost all link prediction methods are defined on unipartite graphs, and applying them to bipartite graphs does not work well. In [9], however, an algebraic function is performed on a bipartite graph to identify link existence probability.
Figure 2. The overall design of our approach. The lines with arrows represent the work flow as well as the flow of data. (Blocks shown: Author-Author similarity, Skill-Skill similarity, Authors clustering, Skills clustering, Co-clustering, Test pair, Building the link prediction model, Probability value.)
III. ENRICHED CO-CLUSTERING
In this section we present the main part of our approach. The objective is to obtain author and skill neighborhoods simultaneously, so that predictions can be made from the results. First, a clustering algorithm is applied to authors and skills separately to reduce the data dimensions in the co-clustering process; for this, the Spectral Graph Clustering algorithm is used. Then the Bigraph Spectral Co-clustering algorithm is used to co-cluster the resulting author and skill clusters. The overall design of this section is shown in Fig. 3.
Figure 3. The overall design of clustering and co-clustering. The top layer is author space, bottom is skill space and the middle is their cluster space.
A. Spectral Graph Clustering
The Spectral Graph Clustering algorithm is used to group similar authors (or skills) together and thereby to enrich, approximately, the information available about each of them. In practice, after preprocessing the DBLP dataset, we form the author-skill bigraph $G_{p \times q}$, where $p$ ($q$) is the number of authors (skills) and the weight of the edge between author $a$ and skill $s$ is calculated as
$$w_{a,s} = \frac{n_{a,s}}{|P_a|},$$
where $P_a$ is the set of papers published by $a$, and $n_{a,s}$ is the number of papers published by $a$ containing $s$. Please note that skill $s$ can simply be extracted from the paper title. As the experimental results show, because of the sparsity and potentially huge size of the graph, performing the co-clustering algorithm on this bipartite graph without pre-clustering is not recommended. We therefore form two symmetric similarity matrices, $A$ and $S$, between pairs of authors and pairs of skills, using the Jaccard measure as follows:
$$A_{i,j} = \frac{|P_i \cap P_j|}{|P_i \cup P_j|}, \qquad S_{m,n} = \frac{|R_m \cap R_n|}{|R_m \cup R_n|} \qquad (1)$$
where $P_i$ ($P_j$) is the set of papers published by $i$ ($j$) and $R_m$ ($R_n$) is the set of authors having skill $m$ ($n$). $A$ and $S$ clearly describe undirected graphs with weighted edges between nodes, so with the well-known Spectral Graph Clustering algorithm we can find similar authors (or skills) and form clusters. The spectral graph clustering algorithm partitions a graph so that within-cluster similarity is maximized while between-cluster similarity is minimized. As a result, the following expression should be maximized:
$$\sum_{l=1}^{k} \sum_{i,j \in C_l} w_{i,j} \qquad (2)$$
where $k$ is the number of clusters, $C_l$ is the $l$-th cluster and $w_{i,j}$ is the edge weight between nodes $i$ and $j$. Equation (2) is maximized if $\mathrm{cut}(C_l, \bar{C}_l)$ is small and $\mathrm{vol}(C_l)$ is large, where $\bar{C}_l$ is all clusters except $C_l$ and:
$$\mathrm{cut}(C_l, \bar{C}_l) = \sum_{i \in C_l,\, j \in \bar{C}_l} w_{i,j}, \qquad \mathrm{vol}(C_l) = \sum_{i,j \in C_l} w_{i,j} \qquad (3)$$
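The Jaccard similarity of Eq. (1) and the cut and vol quantities of Eq. (3) can be illustrated with a minimal sketch. The function names and the toy paper sets are hypothetical and are not taken from the paper. Two assumptions are made here: the diagonal of the similarity matrix is set to zero (no self-loops), and vol counts each unordered pair of nodes once.

```python
import numpy as np

def jaccard(x, y):
    """Jaccard measure |x & y| / |x | y| of two sets (0.0 if both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def similarity_matrix(sets):
    """Symmetric matrix of pairwise Jaccard similarities.

    Assumption: the diagonal is left at zero, i.e. no self-loops.
    """
    n = len(sets)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = jaccard(sets[i], sets[j])
    return M

def cut(W, cluster):
    """Total weight of edges between `cluster` and all remaining nodes."""
    inside = np.zeros(len(W), dtype=bool)
    inside[list(cluster)] = True
    return W[np.ix_(inside, ~inside)].sum()

def vol(W, cluster):
    """Total weight of edges inside `cluster`.

    Assumption: each unordered pair is counted once (upper triangle only).
    """
    idx = sorted(cluster)
    return np.triu(W[np.ix_(idx, idx)], k=1).sum()

# Hypothetical toy data: papers[i] is the set of paper ids of author i.
papers = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {5, 6}]
A = similarity_matrix(papers)
print(A[0, 1], cut(A, {0, 1}), vol(A, {0, 1}))
```

On this toy input, authors 0 and 1 share two of four papers, so their similarity is 0.5; the cluster {0, 1} has a cut of 0.25 (the single weak link between authors 1 and 2) and a vol of 0.5, which is the kind of partition that Eq. (2) favours. The same `similarity_matrix` function would apply to skills if each set held the authors having that skill.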
In Fig. 3(a), a graph is clustered using the optimization criterion above. To obtain optimized clusters, the algorithm relies on the eigenvectors of graph Laplacians, which combine the weight matrix and the degree matrix. For a broader discussion of the mathematical properties of Laplacians, refer to [12] and [13]. The normalized symmetric Laplacian matrix is defined as
$$L_{sym} = D^{-1/2} L D^{-1/2},$$
where $L = D - W$ is the unnormalized Laplacian, $W$ is the adjacency matrix representing the graph and $D$ is the diagonal degree matrix of $W$, defined as follows:
$$D_{i,j} = \begin{cases} \deg(i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$
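As a small illustration of the normalized symmetric Laplacian, the sketch below builds it for a three-node path graph. The function name and the example graph are hypothetical. It is assumed that deg(i) is the sum of the weights in row i of W, and that isolated nodes (degree zero) contribute zero rows and columns.

```python
import numpy as np

def normalized_laplacian(W):
    """Return D^{-1/2} (D - W) D^{-1/2} for a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)                      # assumed: deg(i) = sum of row i
    d_inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0                        # guard against isolated nodes
    d_inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
    L = np.diag(deg) - W                     # unnormalized Laplacian
    # Scaling rows and columns is equivalent to the two diagonal products.
    return d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]

# Hypothetical example: the path graph 0 - 1 - 2 with unit weights.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
L_sym = normalized_laplacian(W)
eigenvalues = np.linalg.eigvalsh(L_sym)      # returned in ascending order
print(L_sym)
print(eigenvalues)
```

For a graph without isolated nodes the diagonal of the result is 1, the matrix is symmetric, and its eigenvalues lie between 0 and 2, with 0 always among them. The eigenvectors belonging to the smallest eigenvalues are the ones used for clustering.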
The corresponding pseudocode of this algorithm is shown in Algorithm 1.
Figure 3. The optimized cuts that partition these graphs are drawn with dashed lines. (a) Taking the left cluster as $C_1$ and the right one as $C_2$, we have $\mathrm{cut}(C_1, C_2) = 2$, $\mathrm{vol}(C_1) = 8$ and $\mathrm{vol}(C_2) = 7$. (b) Taking the left cluster as $C_1$ and the right one as $C_2$, we have $\mathrm{cut}(C_1, C_2) = 1$, $\mathrm{vol}(C_1) = 7$ and $\mathrm{vol}(C_2) = 6$.
Algorithm 1: Spectral Graph Clustering
Input: The adjacency matrix $W \in \mathbb{R}^{n \times n}$, number of desired clusters $k$.
1: Compute $L_{sym}$ and its first $k$ eigenvectors $v_1, \dots, v_k$.
2: Let $V \in \mathbb{R}^{n \times k}$ contain $v_1, \dots, v_k$ as columns, and let $y_i \in \mathbb{R}^k$, with $i = 1, \dots, n$, correspond to the $i$-th row of $V$. Cluster the points $y_i$ with the k-means algorithm into clusters $C_1, \dots, C_k$.
Output: Clusters $C_1, \dots, C_k$, with $C_i = \{ j \mid y_j \in C_i \}$.
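The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. "First k eigenvectors" is read as the eigenvectors of the k smallest eigenvalues, every node is assumed to have positive degree, and k-means is a plain Lloyd iteration with a deterministic farthest-point initialisation chosen here for reproducibility. All function names and the example graph are hypothetical.

```python
import numpy as np

def kmeans(Y, k, n_iter=100):
    """Plain Lloyd k-means with deterministic farthest-point initialisation."""
    centers = [Y[0]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest chosen center.
        d = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            Y[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(W, k):
    """Cluster the nodes of the weighted graph W into k groups."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)          # assumes every degree is > 0
    # With positive degrees, D^{-1/2} (D - W) D^{-1/2} = I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vectors = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    Y = vectors[:, :k]                       # row i of Y is the point y_i
    return kmeans(Y, k)

# Hypothetical example: two triangles joined by one weak edge (2 - 3).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
labels = spectral_clustering(W, k=2)
print(labels)
```

On this graph the two triangles end up in different clusters, because the weak edge is the cheapest cut. Some formulations of spectral clustering also normalise each row $y_i$ to unit length before k-means; that step is omitted here because Algorithm 1 does not list it.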
where A and S are sets of authors and skills respectively. Also, -(, ) is defined as: -(, ) = Q E, 0
Performing this algorithm on A and S separately, causes (