Fast Single-Pair SimRank Computation

Pei Li¹, Hongyan Liu², Jeffrey Xu Yu³, Jun He¹, Xiaoyong Du¹

¹ Renmin University of China. {lp,hejun,duyong}@ruc.edu.cn
² Tsinghua University. [email protected]
³ The Chinese University of Hong Kong. [email protected]
∗ This work was supported by MSRA Internet Service Funding FY09-RES-THEME-061.

Abstract
SimRank is an intuitive and effective measure for link-based similarity. Based on the random surfer model, it scores the similarity between two nodes as the first-meeting probability of two random surfers. However, when a user queries the SimRank similarity of a given node-pair, the existing approaches need to compute the similarities of other node-pairs beforehand, which we call an all-pair style. In this paper, we propose a Single-Pair SimRank approach. Without accuracy loss, this approach performs an iterative computation to obtain the similarity of a single node-pair. The time cost of our Single-Pair SimRank is always less than that of All-Pair SimRank, and it is especially efficient when we only need to assess the similarity of one or a few node-pairs. We confirm the accuracy and efficiency of our approach in extensive experimental studies over synthetic and real datasets.

1 Introduction
The measure of similarity between objects plays a significant role in many real-world applications, including clustering, classification, information retrieval, and recommendation systems. Current similarity measures fall into two broad categories: content-based similarity measures and link-based similarity measures [14]. The former are based on content, using a vector space model [16]; the latter are based on a link graph in which objects and relationships are modeled as nodes and edges. Examples of link graphs include citations between papers, social relationships among people, and web hyperlinks. Effective and efficient similarity measures between objects in a link graph can greatly assist users in searching and analyzing information [20, 9, 19, 6, 5, 13], in particular when the relationships among objects are complex. Among the link-based similarity measures in the literature, SimRank [9] has attracted considerable attention due to its intuitive basis and solid theoretical foundation. The basic intuition behind SimRank is “two objects are similar if they are referenced by similar
objects”, which naturally implies mutual reinforcement: the similarity score of $(a, b)$, denoted by $S(a, b)$, is updated according to the similarity scores of all in-neighbor pairs of $(a, b)$ from the previous iteration. Based on the random surfer model [4], SimRank has a theoretical foundation stemming from PageRank [17] and HITS [10]. Among link-based similarity measures, SimRank is considered a promising one whose impact is comparable to that of PageRank for link-based ranking [14]. However, given a pair of nodes $(a, b)$, the cost of computing the SimRank score $S(a, b)$ is an obstacle to applying it on a large graph. For a graph $G(V, E)$, the time complexity of $k$ iterations is $O(kn^2 d^2)$, where $n$ is the number of nodes in $G$ and $d$ is the average incoming degree of nodes, and $O(kn^4)$ in the worst case. Hence, new optimization techniques for SimRank computation are needed. In the literature, there are four reported studies on SimRank optimization [14, 5, 12, 2]. These optimization techniques have their own merits and work effectively to obtain the similarity scores of every pair of nodes in a graph. However, if a user only needs the similarity of a given node-pair $(a, b)$, they become cumbersome, for the following reason. SimRank computes $S(a, b)$ from the similarity of all in-neighbor pairs of $(a, b)$, which means that the SimRank scores of those pairs must be computed beforehand. The research issue we focus on in this paper is whether we can efficiently compute $S(a, b)$ without accuracy loss while avoiding the unnecessary cost of computing the similarity scores of other node-pairs. As an example, consider the simple graph $G$ shown in Fig. 1(a). For SimRank [9] and all its current optimizations, $S(a, b)$ is updated according to the similarity of all in-neighbor pairs of $(a, b)$, that is, any node-pair $(x, y) \in \{c, d\} \times \{a, d\}$.
Hence, the similarity of each node-pair $(x, y)$ must be computed beforehand, and likewise the similarity of all in-neighbor pairs of $(x, y)$. We call methods of this kind All-Pair SimRank: similarities are mutually reinforced, and we cannot obtain $S(a, b)$ without computing the similarities of other node-pairs. Another problem of All-Pair SimRank is that it adapts poorly to time-evolving graphs. The graph structure of many real-world applications changes over time, and the addition or removal of edges may change many similarity scores. This effect is amplified by mutual reinforcement and makes incremental computation of All-Pair SimRank hard.

In this paper, we propose a new single-pair approach to compute $S(a, b)$ without accuracy loss, which avoids computing the similarity of any node-pair other than $(a, b)$. We outline our approach below. From the viewpoint of the random surfer model, the SimRank score can be modeled as the first-meeting probability of two random surfers on the reversed graph [9, 6]. More specifically, the SimRank score of the $k$-th iteration is the sum of the first-meeting probabilities over the first $k$ steps. Hence, the key of our approach is to compute the probability that two surfers starting from nodes $a$ and $b$ first meet exactly on the $k$-th step, denoted by $M_k(a, b)$, as shown in Fig. 1(b). To compute $M_k(a, b)$, we first give a naive method that matches path-trees, and then discuss how to achieve higher efficiency by computing a position matrix iteratively. In addition, we propose two optimization techniques to accelerate this computation.

[Figure 1: A tiny graph $G$, and the relationship between the SimRank score $S(a, b)$ and the first-meeting probability $M_k(a, b)$ on each iteration using decay factor $C = 0.5$. Panel (a): graph $G$ with nodes $a$, $b$, $c$, $d$, $e$. Panel (b): SimRank score (vertical axis, 0.00 to 0.20) against iteration (horizontal axis, 0 to 5).]

The main contributions of this paper are summarized below. First, we provide a deep analysis of SimRank computation. Second, we propose a new single-pair approach to compute the SimRank score of a given node-pair without accuracy loss. The computational cost of our approach is always less than that of All-Pair SimRank, and it is especially efficient when we only need to assess the similarity of one or a few node-pairs.

The rest of this paper is organized as follows. In Section 2, we review SimRank, especially the relationship between SimRank and first-meeting probability. In Section 3, we propose Single-Pair SimRank via random surfing. Two methods are discussed: a naive method by path-tree matching, and an iterative method by position matrix iteration. Theoretical proofs and optimization techniques are also developed in that section. Experimental results on performance as the graph size and the average incoming degree $d$ change are provided in Section 4. We discuss related work in Section 5 and conclude in Section 6.

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

Table 1: Notations

Symbol                  Description
$S$                     Similarity matrix
$S(a, b)$               The similarity of node-pair $(a, b)$
$R_k$                   SimRank score matrix on the $k$-th iteration
$R_k(a, b)$             SimRank score of node-pair $(a, b)$ on the $k$-th iteration
$C$                     A decay factor between 0 and 1
$K$                     The maximum number of iterations/steps
$Surf(v)$               A random surfer starting from node $v$
$M_k(a, b)$             First-meeting probability of $Surf(a)$ and $Surf(b)$ on the $k$-th step
$T$                     Transition matrix
$P_k^{ab}$              Position probability matrix of $Surf(a)$ and $Surf(b)$ on the $k$-th step
$\tilde{P}_k^{ab}$      A special position matrix $P_k^{ab}$, in which only probabilities of $Surf(a)$ and $Surf(b)$ having not met before are considered

2 SimRank
For link-based similarity measures, SimRank [9] is considered a promising one whose impact is comparable to that of PageRank [17] for link-based ranking [14]. In this section, we give a brief overview of SimRank, with emphasis on its matrix computation and on first-meeting probabilities, which are not provided in [9] or other existing works. Table 1 lists the main notations used in this paper.

For a given graph, SimRank measures similarity by exploiting the structural context around objects, without requiring any human-predefined hierarchy. Given a directed graph $G(V, E)$ and a node $v \in V$, two kinds of neighbors are distinguished: in-neighbors, denoted by $I(v)$, and out-neighbors, denoted by $O(v)$. Following the intuition behind SimRank, let $R_k(a, b)$ denote the SimRank score of $(a, b)$ on the $k$-th iteration. The iterative computation is

$$R_{k+1}(a, b) = \frac{C}{|I(a)||I(b)|} \sum_{i \in I(a)} \sum_{j \in I(b)} R_k(i, j) \tag{2.1}$$

where $a \neq b$ and $C$ is a decay factor between 0 and 1. Initially, $R_0(a, b) = 0$ if $a \neq b$, and $R_0(a, b) = 1$ otherwise. The similarity is $S(a, b) = 1$ if $a = b$, and for $a \neq b$ it is

$$S(a, b) = \lim_{k \to \infty} R_k(a, b) \tag{2.2}$$
The limit in Eq. (2.2) exists because the decay factor $C < 1$, and $R_k(a, b)$ is known to converge to the similarity $S(a, b)$.

2.1 Similarity Matrix
SimRank is based on random walks on graphs, which are a special case of Markov chains [1]. A transition matrix $T$ describes the transitions between objects in a Markov chain. In SimRank, random walk paths are assumed to follow the reversed edges of the original graph [9, 5]. The transition matrix $T$ for a given graph $G$ is obtained in three steps: (1) generate the adjacency matrix $A$ of $G$; (2) take $A'$, the transpose of $A$; and (3) normalize $A'$ so that each row sums to 1, unless all elements in the row are zero.

Let $S$ be the similarity matrix representing the similarity of every pair of nodes in the graph $G(V, E)$. The element $S(i, j)$ in the $i$-th row and $j$-th column denotes the similarity between nodes $i$ and $j$. $S$ is symmetric, since $S(i, j) = S(j, i)$. Take the graph $G$ in Fig. 1(a) as an example. The transition matrix $T$ of the reversed $G$ and the similarity matrix $S$ of the nodes in $G$ are shown in Fig. 2(a) and Fig. 2(b) respectively.

(a)      a      b      c      d      e
    a    0      0      0.5    0.5    0
    b    0.5    0      0      0.5    0
    c    0      0      0      1.0    0
    d    0      0      0.5    0      0.5
    e    0      0      1.0    0      0

(b)      a      b      c      d      e
    a    1.0    0.19   0.27   0.17   0.27
    b    0.19   1.0    0.29   0.11   0.09
    c    0.27   0.29   1.0    0.09   0.04
    d    0.17   0.11   0.09   1.0    0.26
    e    0.27   0.09   0.04   0.26   1.0

Figure 2: (a) Transition matrix $T$ of the reversed $G$. (b) Similarity matrix $S$ of nodes in graph $G$ with decay factor $C = 0.5$.

Let $R_k$ denote the matrix of SimRank scores of all node-pairs in $V \times V$ on the $k$-th iteration. We give a proposition on SimRank computation in matrix notation.

Proposition 2.1: Given a graph $G(V, E)$, the similarity matrix $S$ can be computed iteratively by $S = \lim_{k \to \infty} R_k$ and

$$R_{k+1} = C \cdot T R_k T' + \Theta \tag{2.3}$$

where $T'$ is the transpose of $T$, and $\Theta$ is a correction matrix that sets every diagonal element of $R_{k+1}$ to 1, so as to satisfy the definition that $S(a, b) = 1$ if $a = b$. □

Proof Sketch: For a node-pair $(a, b) \in V \times V$, if $a = b$, Eq. (2.3) obviously holds. If $a \neq b$, according to Eq. (2.1), we have

$$\begin{aligned}
R_{k+1}(a, b) &= \frac{C}{|I(a)||I(b)|} \sum_{i \in I(a)} \sum_{j \in I(b)} R_k(i, j) \\
&= C \sum_{i \in I(a)} \sum_{j \in I(b)} \frac{1}{|I(a)|} \cdot R_k(i, j) \cdot \frac{1}{|I(b)|} \\
&= C \sum_{p \in I(a)} \sum_{q \in I(b)} T_{ap} \cdot R_k(p, q) \cdot T_{bq} \\
&= (C \cdot T R_k T')_{ab}
\end{aligned}$$

which proves Eq. (2.3). Similar to Eq. (2.2), when $k \to \infty$, $R_k$ converges to $S$. □

Prop. 2.1 provides a matrix description of the SimRank iteration equation, from which the similarity matrix $S$ can be obtained conveniently. In most real applications the transition matrix $T$ is sparse, so sparse matrix multiplication can greatly reduce the cost of Eq. (2.3).

2.2 SimRank by First-Meeting Probabilities
In the following discussion, for simplicity, we mainly discuss SimRank for $a \neq b$. From the viewpoint of the random surfer model [4], in an unweighted graph a random surfer standing at node $a$ moves to each out-neighbor of $a$ with probability $1/|O(a)|$ on the next step. As discussed in [9], the similarity $S(a, b)$ measures how soon two random surfers are expected to meet if they start at nodes $a$ and $b$ and randomly walk backwards over the graph. In other words, $S(a, b)$ can be computed by summing up the first-meeting probabilities of two random surfers. However, this way of computing $S(a, b)$ is not provided by [9] or the other existing works.

Below, we give a different way to compute the first-meeting probabilities of two random surfers. Let $Surf(v)$ represent the random surfer starting from node $v$, and let $M_k(a, b)$ denote the first-meeting probability of two random surfers on the $k$-th step, starting from nodes $a$ and $b$ respectively and walking through reversed edges. We can compute $R_k(a, b)$ by summing up all first-meeting probabilities of $Surf(a)$ and $Surf(b)$ within $k$ steps, as shown in the following proposition.

Proposition 2.2: $R_k(a, b)$, the similarity of node-pair $(a, b)$ on the $k$-th iteration, can be computed by

$$R_k(a, b) = \sum_{x=1}^{k} M_x(a, b) \tag{2.4}$$

□ We prove it in the Appendix.
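As a concrete check of the matrix form in Prop. 2.1, the following minimal Python sketch iterates Eq. (2.3) on the transition matrix of Fig. 2(a). It is an illustration under stated assumptions, not the authors' implementation: the function names, the dense list-of-lists representation, and the fixed count of 30 iterations are all choices made here for brevity.

```python
# Transition matrix T of the reversed example graph, copied from Fig. 2(a).
# Row and column order: a, b, c, d, e.
T = [
    [0.0, 0.0, 0.5, 0.5, 0.0],  # a
    [0.5, 0.0, 0.0, 0.5, 0.0],  # b
    [0.0, 0.0, 0.0, 1.0, 0.0],  # c
    [0.0, 0.0, 0.5, 0.0, 0.5],  # d
    [0.0, 0.0, 1.0, 0.0, 0.0],  # e
]


def matmul(X, Y):
    """Dense matrix product of two list-of-lists matrices."""
    inner = len(Y)
    return [[sum(X[i][k] * Y[k][j] for k in range(inner))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def simrank_all_pairs(T, C=0.5, iterations=30):
    """Iterate R_{k+1} = C * T R_k T' + Theta, starting from R_0 = identity.

    The correction matrix Theta is applied implicitly by resetting every
    diagonal element to 1 after each multiplication.
    """
    n = len(T)
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    Tt = transpose(T)
    for _ in range(iterations):
        prod = matmul(matmul(T, R), Tt)
        R = [[1.0 if i == j else C * prod[i][j] for j in range(n)]
             for i in range(n)]
    return R


S = simrank_all_pairs(T)
print(round(S[0][1], 2))  # S(a, b), about 0.19 as in Fig. 2(b)
```

Note that every entry of $R_k$ is updated on every iteration even though only $S(a, b)$ is printed, which is exactly the all-pair cost that the single-pair approach avoids.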
3 Fast Computing $S(a, b)$: A Single-Pair SimRank Approach
In SimRank [9] and its current optimizations [14, 5, 12], similarities are updated following a strategy similar to Eq. (2.1). That is, to obtain the similarity of $(a, b)$, the similarity scores of $(i, j)$ must be computed beforehand for every $i \in I(a)$ and $j \in I(b)$, where $I(a)$ and $I(b)$ are the sets of nodes with edges to $a$ and $b$. We call this strategy an All-Pair SimRank approach. In contrast, we propose a Single-Pair SimRank approach in this paper. To obtain $S(a, b)$, the similarity scores of other node-pairs do not need to be computed. It is important to note that in our single-pair method the computational cost does not increase as the underlying graph becomes large. Even on a very large graph, a query for $S(a, b)$ can be answered quickly, without accuracy loss compared to the all-pair SimRank scores. It is worth noting that Fogaras et al. [5] also gave a naive method to query $S(a, b)$ in a single-pair style (see Algorithm 1 in [5]), but the value returned differs from the all-pair SimRank score.

Our Single-Pair SimRank follows the strategy described in Prop. 2.2. The main idea is that if the first-meeting probability $M_k(a, b)$ can be computed by an effective and efficient function, the SimRank score $R_k(a, b)$ can be returned easily, and so can $S(a, b)$. In this section, we present algorithms that compute $M_k(a, b)$ accurately, based on the random surfer model and with a theoretical foundation. Below, for simplicity, we suppose that all edges in graph $G$ have been reversed for the random walk, that is, every directed edge $(u, v)$ in $G$ becomes $(v, u)$. Since $S(a, a) = 1$, we assume $a \neq b$ for the given node-pair $(a, b)$ in the following discussions. We first discuss some parameter settings and theoretical foundations that are used in presenting our Single-Pair SimRank algorithms.
Decay Factor and Maximum Steps: Unlike measures such as co-citation [18] or the Jaccard coefficient, which only use direct links, SimRank exploits indirect information over multiple steps through the decay factor $C$. Although $C$ is useful for capturing the influence of indirect links, it should be set carefully to avoid overweighting that influence. As discussed in [14], a higher decay factor also results in more, and unnecessary, iterations. It can be set to $C = 0.5$ in practice. To judge the convergence of SimRank scores, a threshold $\Delta$ is introduced to bound the difference between $S(a, b)$ and $R_K(a, b)$. We predict the maximum number of steps $K$ according to the accuracy estimate introduced in [14]. Formally, if it converges, we get

$$S(a, b) - R_K(a, b) \le C^{K+1} \le \Delta \;\Rightarrow\; K \ge \log_C \Delta - 1 \;\Rightarrow\; K = \lfloor \log_C \Delta \rfloor \ \text{(rounding)} \tag{3.5}$$

[Figure 3: An illustration of two path-trees within three steps on the reversed graph $G$, rooted at nodes $a$ and $b$.]

The Probability of a Random Path: Based on random walk theory, supposing a random surfer stands at node $v_i$ in an unweighted graph, the probability of a one-step path $\langle v_i, v_j \rangle$ is defined as

$$P(v_i, v_j) = \begin{cases} \dfrac{1}{|O(v_i)|} & \text{if } v_j \in O(v_i) \\ 0 & \text{otherwise} \end{cases}$$

Here, $O(v)$ is the set of nodes reached by an outgoing edge from node $v$. A random walk path $Path = \langle v_0, v_1, \ldots, v_x \rangle$ can be viewed as a sequence of one-step paths. Thus, the probability of this random path is computed by

$$P(Path) = \prod_{i=0}^{x-1} P(v_i, v_{i+1}) \tag{3.6}$$
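The two formulas above are easy to check numerically. The sketch below is illustrative only: the function names and the adjacency-list dictionary `REVERSED` are assumptions introduced here, with `REVERSED` encoding the reversed example graph whose transition matrix appears in Fig. 2(a).

```python
import math

# Reversed example graph: each node maps to the nodes a surfer can step to
# (the non-zero columns of the corresponding row of T in Fig. 2(a)).
REVERSED = {
    "a": ["c", "d"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["c", "e"],
    "e": ["c"],
}


def max_steps(C, delta):
    """K = floor(log_C(delta)) from Eq. (3.5), for 0 < C < 1 and 0 < delta < 1."""
    return math.floor(math.log(delta) / math.log(C))


def path_probability(out_neighbors, path):
    """Eq. (3.6): multiply the one-step probabilities 1/|O(v_i)| along the path."""
    prob = 1.0
    for u, v in zip(path, path[1:]):
        successors = out_neighbors.get(u, [])
        if v not in successors:
            return 0.0  # the step <u, v> is not an edge of the graph
        prob *= 1.0 / len(successors)
    return prob


K = max_steps(0.5, 0.01)
print(K)                   # 6
print(0.5 ** (K + 1))      # 0.0078125, which is below delta = 0.01
print(path_probability(REVERSED, ["a", "c", "d", "e"]))  # 0.25
```

With $C = 0.5$ and $\Delta = 0.01$ this gives $K = 6$, and $C^{K+1} = 0.0078125 \le \Delta$ confirms that the bound in Eq. (3.5) is met.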
3.1 A Naive Path-Tree Matching Method
In this section, we give a naive method to compute $M_k(a, b)$, the first-meeting probability of two random surfers on the $k$-th step. We use a path-tree to describe the nodes visited by a random surfer, where the root of the path-tree corresponds to the starting node. One path-tree is generated for each random surfer. Since $M_k(a, b)$ measures the first meeting of two random surfers, two path-trees are generated. Finally, paths of length $k$ in the two trees are compared and matched, and a first-meeting probability is returned. As an example, for two random surfers starting from nodes $a$ and $b$ in graph $G$ (Fig. 1(a)), the two path-trees within three steps are illustrated in Fig. 3. Notice that the edges of $G$ have been reversed.

For path-pair matching, assume that two paths of length $k$ in different trees are $\langle a, a_1, a_2, \ldots, a_k \rangle$ and $\langle b, b_1, b_2, \ldots, b_k \rangle$, where $a$, $b$, $a_i$ and $b_i$ ($1 \le i \le k$) are nodes. By the definition of $M_k(a, b)$, these two paths first meet on the $k$-th step if and only if

$$\forall i \in \{1, 2, \ldots, k - 1\},\ a_i \neq b_i \quad \text{and} \quad a_k = b_k$$

Considering all pairs of paths of length $k$, we can compute $M_k(a, b)$ as shown in Algorithm 1. Algorithm 1 compares path-pairs from the two path-trees, and if a path-pair is a first meeting, its probability is added to the sum. Taking the path-trees in Fig. 3 as an example, we get values that agree with the scores shown in Fig. 1(b):

$$M_1(a, b) = \frac{1}{2} \cdot \frac{1}{2} \cdot C \quad \text{(meet at } d\text{)}$$
$$M_2(a, b) = \left(\frac{1}{2} \cdot \frac{1}{4} + \frac{1}{4} \cdot \frac{1}{4}\right) \cdot C^2 \quad \text{(meet at } d \text{ and } c\text{)}$$
$$M_3(a, b) = \left(\frac{1}{4} \cdot \frac{1}{4} + \frac{1}{4} \cdot \frac{1}{8}\right) \cdot C^3 \quad \text{(meet at } c\text{)}$$

Algorithm 1 Path-Tree Matching
Input: $G(V, E)$, $C$, $(a, b)$, $k$
Output: $M_k(a, b)$
 1: reverse all edges in $G(V, E)$;
 2: generate path-tree $PT_a$ for $Surf(a)$ with length $k$;
 3: generate path-tree $PT_b$ for $Surf(b)$ with length $k$;
 4: $M_k(a, b) = 0$;
 5: for each path $Path_a = \langle a, a_1, a_2, \ldots, a_k \rangle$ in $PT_a$ do
 6:   for each path $Path_b = \langle b, b_1, b_2, \ldots, b_k \rangle$ in $PT_b$ do
 7:     if $a_k = b_k$ then
 8:       if $\forall i \in \{1, 2, \ldots, (k - 1)\}$, $a_i \neq b_i$ then
 9:         compute $P(Path_a)$ by Eq. (3.6);
10:         compute $P(Path_b)$ by Eq. (3.6);
11:         add $C^k \cdot P(Path_a) \cdot P(Path_b)$ to $M_k(a, b)$;
12:       end if
13:     end if
14:   end for
15: end for

The computational cost of Algorithm 1 is independent of the graph size, and depends only on the average outgoing degree of the reversed graph (namely, the average incoming degree $d$ of the original graph). Empirically, it takes $O(d^{2k})$ time and $O(2d^k)$ space to perform a path-tree matching. Considering Prop. 2.2 and supposing there are $k$ iterations before convergence, it takes $O(kd^{2k})$ time to obtain a SimRank score $S(a, b)$.

3.2 A New Single-Pair SimRank Method
Although the idea of “path-tree matching” is simple to understand and its computational cost remains unchanged as the graph size increases, there are opportunities to reduce the time cost further. For example, in line 8 of Algorithm 1, a large amount of time is spent on comparing nodes, whereas only a tiny number of path-pairs turn out to be first meetings. In this section, we first introduce the position matrix and then focus on our new Single-Pair SimRank algorithm. Two optimization techniques are also developed to accelerate the iterative computation.

Position Matrix: We use a position matrix to describe the probabilities of two random surfers standing at a specific node-pair at a specific moment.

Definition 3.1: Let $P_k^{ab}$ denote the position matrix of $Surf(a)$ and $Surf(b)$ on the $k$-th step. The element in the $i$-th row and $j$-th column of $P_k^{ab}$ represents the probability of $Surf(a)$ and $Surf(b)$ standing at nodes $i$ and $j$ respectively, on the $k$-th step. □

Given a graph $G(V, E)$, $P_k^{ab}$ is a $|V|$ by $|V|$ matrix. For the example shown in Fig. 3, assuming $k = 3$, the probability of $Surf(a)$ reaching node $e$ is $\frac{1}{2} \cdot 1 \cdot \frac{1}{2} = \frac{1}{4}$, and the probability of $Surf(b)$ reaching $e$ is $\frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8}$. Thus, we obtain $(P_3^{ab})_{ee} = \frac{1}{4} \cdot \frac{1}{8} \cdot C^3$, which is the element in the fifth row and fifth column of $P_3^{ab}$.

We observe that not all probabilities at a specific position contribute to the first-meeting probability. For example, for the probability $(P_3^{ab})_{ee}$ in Fig. 3, $Surf(a)$ follows path $\langle a, c, d, e \rangle$, and $Surf(b)$ follows path $\langle b, a, d, e \rangle$. Since they have already met at node $d$ on the 2nd step, $(P_3^{ab})_{ee}$ is not a first-meeting probability. In fact, only path-pairs that have not met before can become first meetings in future steps. Hence, we give a revised definition of the position matrix.

Definition 3.2: Let $\tilde{P}_k^{ab}$ denote the position matrix $P_k^{ab}$ in which only probabilities of two random surfers having not met before are considered. □

Only the probabilities in $\tilde{P}_k^{ab}$ can lead to first meetings. Following Definition 3.2, we give a proposition on the relationship between $\tilde{P}_k^{ab}$ and $M_k(a, b)$.

Proposition 3.1: Let $DiaSum(M)$ denote the sum of the diagonal elements of matrix $M$. The relationship between $\tilde{P}_k^{ab}$ and $M_k(a, b)$ is

$$M_k(a, b) = DiaSum(\tilde{P}_k^{ab}) \tag{3.7}$$

□

Proof Sketch: Since $\tilde{P}_k^{ab}$ is a $|V| \times |V|$ matrix that describes the position probabilities of two random surfers that have not met before the $k$-th step, for every $i \in V$, $(\tilde{P}_k^{ab})_{ii}$ is the probability that the two random surfers first meet at node $i$. Thus, the sum of the diagonal elements equals $M_k(a, b)$. □
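For reference, the path-tree matching of Algorithm 1 can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the names `walks` and `first_meeting_naive` and the dictionary `REVERSED` (the reversed example graph) are assumptions made here, and the path-trees are represented simply as lists of root-to-leaf paths.

```python
from itertools import product

# Reversed example graph: each node maps to the nodes a surfer can step to.
REVERSED = {
    "a": ["c", "d"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["c", "e"],
    "e": ["c"],
}


def walks(out_neighbors, start, k):
    """Enumerate all k-step paths from start as (path, probability) pairs.

    This plays the role of a path-tree: every returned path is one
    root-to-leaf branch, and its probability follows Eq. (3.6).
    """
    frontier = [((start,), 1.0)]
    for _ in range(k):
        extended = []
        for path, prob in frontier:
            successors = out_neighbors.get(path[-1], [])
            for v in successors:
                extended.append((path + (v,), prob / len(successors)))
        frontier = extended
    return frontier


def first_meeting_naive(out_neighbors, a, b, k, C=0.5):
    """M_k(a, b) for a != b by comparing every pair of length-k paths."""
    total = 0.0
    for (pa, prob_a), (pb, prob_b) in product(walks(out_neighbors, a, k),
                                              walks(out_neighbors, b, k)):
        meet_on_step_k = pa[-1] == pb[-1]
        # Positions 1 .. k-1 of both paths must differ (line 8 of Algorithm 1).
        never_met_before = all(x != y for x, y in zip(pa[1:-1], pb[1:-1]))
        if meet_on_step_k and never_met_before:
            total += (C ** k) * prob_a * prob_b
    return total


for step in (1, 2, 3):
    print(step, first_meeting_naive(REVERSED, "a", "b", step))
```

For the example graph this yields $M_1 = 0.125$, $M_2 = 0.046875$ and $M_3 = 0.01171875$, the values obtained by evaluating the three expressions of Section 3.1 with $C = 0.5$. The nested enumeration also makes the $O(d^{2k})$ growth visible: both path lists grow by a factor of about $d$ per step.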
Iterative Computation of Position Matrix: By Prop. 2.2 and Prop. 3.1, SimRank scores $R_k(a, b)$ can be computed accurately. The key issue is how to obtain $\tilde{P}_k^{ab}$ effectively and efficiently. In this section, we propose an iterative method to compute $\tilde{P}_k^{ab}$.

First, consider the initial stage $k = 0$. By the definition of $\tilde{P}_k^{ab}$, since the two random surfers stand at $a$ and $b$ respectively before any step has been taken, only the element in the $a$-th row and $b$-th column of $\tilde{P}_0^{ab}$ equals 1, and all other elements equal zero. For $k \ge 1$, an iterative process computes the position matrix.

Proposition 3.2: $\tilde{P}_k^{ab}$ can be iteratively computed by

$$\tilde{P}_k^{ab} = C \cdot \sum_{i=1}^{|V|} \sum_{j=1, j \neq i}^{|V|} \left(\tilde{P}_{k-1}^{ab}\right)_{ij} \cdot T_i' T_j \quad \text{for } k \ge 1 \tag{3.8}$$

where $T_i'$ is the transpose of the $i$-th row of the transition matrix $T$, and $T_j$ is the $j$-th row of $T$. □

[Figure 4: Inductive Step of Position Matrix. Surfers starting at $a$ and $b$ reach nodes $i$ and $j$ after $k$ steps, from which $\tilde{P}_{k+1}^{ab}$ is obtained.]

Proof Sketch: We prove it by mathematical induction on the step number $k$. As the induction basis, since only the element in the $a$-th row and $b$-th column of $\tilde{P}_0^{ab}$ equals 1 and $a \neq b$, for $k = 1$ $Surf(a)$ ($Surf(b)$) moves to other nodes with distribution $T_a$ ($T_b$). Consequently, the position probability of $Surf(a)$ and $Surf(b)$ going to another node-pair can be computed by

$$\tilde{P}_1^{ab} = C \cdot T_a' T_b = C \cdot \sum_{i=1}^{|V|} \sum_{j=1, j \neq i}^{|V|} \left(\tilde{P}_0^{ab}\right)_{ij} \cdot T_i' T_j \tag{3.9}$$

In the inductive step, we compute the position matrix for $(k + 1)$, provided the proposition holds for $k$. For any probability $(\tilde{P}_k^{ab})_{ij}$ in $\tilde{P}_k^{ab}$, there are two possible cases. First, consider $i = j$. In this case $(\tilde{P}_k^{ab})_{ij}$ lies on the diagonal of $\tilde{P}_k^{ab}$: $Surf(a)$ and $Surf(b)$ meet exactly here, so it makes no contribution to the first-meeting probability on step $(k + 1)$. Second, consider $i \neq j$. Here $(\tilde{P}_k^{ab})_{ij}$ represents the probability that $Surf(a)$ and $Surf(b)$ reach nodes $i$ and $j$ respectively on step $k$ without having met before. As illustrated by Fig. 4, on step $(k + 1)$, $Surf(a)$ reaches another node in the graph with probability distribution $T_i$, and similarly $Surf(b)$ with distribution $T_j$. Thus, the distribution matrix for $Surf(a)$ and $Surf(b)$ reaching another node-pair is $(\tilde{P}_k^{ab})_{ij} \cdot T_i' T_j$. Taking into account the decay factor $C$ and all node-pairs $(i, j)$ in the graph, we obtain the position matrix for step $(k + 1)$ as follows.

$$\tilde{P}_{k+1}^{ab} = C \cdot \sum_{i=1}^{|V|} \sum_{j=1, j \neq i}^{|V|} \left(\tilde{P}_k^{ab}\right)_{ij} \cdot T_i' T_j \tag{3.10}$$

Therefore, Prop. 3.2 holds. □

By integrating Prop. 2.2, Prop. 3.1, and Prop. 3.2, we obtain the core computation of our Single-Pair SimRank method:

$$M_k(a, b) = DiaSum(\tilde{P}_k^{ab}) \tag{3.11}$$

where

$$\tilde{P}_k^{ab} = C \cdot \sum_{i=1}^{|V|} \sum_{j=1, j \neq i}^{|V|} \left(\tilde{P}_{k-1}^{ab}\right)_{ij} \cdot T_i' T_j \tag{3.12}$$

It is worth noting that the computation of $\tilde{P}_k^{ab}$ can be optimized. Since $C$ is a common factor of $\tilde{P}_k^{ab}$, it can be factored out to avoid multiplying every matrix element by $C$ on each iteration. Formally, supposing $\tilde{Q}_0^{ab} = \tilde{P}_0^{ab}$, the core computation of the Single-Pair SimRank method can be rewritten as

$$M_k(a, b) = C^k \cdot DiaSum(\tilde{Q}_k^{ab}) \tag{3.13}$$

where

$$\tilde{Q}_k^{ab} = \sum_{i=1}^{|V|} \sum_{j=1, j \neq i}^{|V|} \left(\tilde{Q}_{k-1}^{ab}\right)_{ij} \cdot T_i' T_j \tag{3.14}$$

The New Single-Pair SimRank Algorithm: Eq. (3.13) and Eq. (3.14) give us an alternative way to obtain the SimRank score of $(a, b)$ without computing the similarity of other node-pairs. We show the new Single-Pair SimRank in Algorithm 2, based on the propositions presented in the previous sections. The input parameter $\Delta$ is a threshold used to judge the convergence of SimRank scores, and $K$ denotes the total number of steps before convergence.

For complexity analysis, the time cost of Single-Pair SimRank depends only on the average incoming degree $d$. The main computational cost is spent in line 7. Although Eq. (3.14) is written in terms of matrices, these matrices are usually very sparse. The proportion of non-zero elements in $\tilde{P}_k^{ab}$ directly influences the speed of the iterative computation. As the step number $k$ increases, the non-zero elements of the position matrix also expand. Examples on the tiny graph $G$ (Fig. 1(a)) are shown in Fig. 5. More explicitly, since diagonal elements are omitted during the iterations, the number of non-zero elements in the position matrix on the $k$-th step is below $d^k$, and is obviously also less than $|V|^2$. Thus the time complexity of Single-Pair SimRank is

$$O(\text{Single-Pair}) \le k d^2 \cdot \min\{d^k, |V|^2\} \tag{3.15}$$
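The iteration of Eq. (3.13) and Eq. (3.14) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the sparse dictionary keyed by node-pairs stands in for the position matrix $\tilde{Q}_k^{ab}$, and the function names and the `REVERSED` adjacency lists (the reversed example graph) are introduced here for the example.

```python
import math

# Reversed example graph: each node maps to the nodes a surfer can step to.
REVERSED = {
    "a": ["c", "d"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["c", "e"],
    "e": ["c"],
}


def first_meeting_probabilities(out_neighbors, a, b, C, K):
    """Return [M_1(a, b), ..., M_K(a, b)] following Eq. (3.13) and Eq. (3.14)."""
    Q = {(a, b): 1.0}  # sparse Q_0: only the (a, b) entry is non-zero
    result = []
    for k in range(1, K + 1):
        nxt = {}
        for (i, j), q in Q.items():
            if i == j:
                continue  # already met: excluded by j != i in Eq. (3.14)
            succ_i = out_neighbors.get(i, [])
            succ_j = out_neighbors.get(j, [])
            if not succ_i or not succ_j:
                continue  # a surfer without outgoing edges cannot move on
            # Every entry of the outer product T_i' T_j has the same weight.
            share = q / (len(succ_i) * len(succ_j))
            for u in succ_i:
                for v in succ_j:
                    nxt[(u, v)] = nxt.get((u, v), 0.0) + share
        Q = nxt
        dia_sum = sum(q for (i, j), q in Q.items() if i == j)
        result.append(C ** k * dia_sum)
    return result


def single_pair_simrank(out_neighbors, a, b, C=0.5, delta=0.01):
    """R_K(a, b) as the sum of first-meeting probabilities (Prop. 2.2)."""
    if a == b:
        return 1.0
    K = math.floor(math.log(delta) / math.log(C))
    return sum(first_meeting_probabilities(out_neighbors, a, b, C, K))


print(first_meeting_probabilities(REVERSED, "a", "b", 0.5, 3))
print(round(single_pair_simrank(REVERSED, "a", "b"), 2))  # about 0.19
```

Only node-pairs that the two surfers can actually occupy are ever stored, so the work per step is bounded by the number of non-zero entries times the product of the two out-degrees, in line with the bound in Eq. (3.15).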
Algorithm 2 Single-Pair SimRank
Input: $G(V, E)$, $C$, $(a, b)$, $\Delta$
Output: $S(a, b)$
 1: reverse all edges in $G(V, E)$;
 2: generate transition matrix $T$ of the reversed $G$;
 3: initialize $S(a, b) = 0$;
 4: initialize position matrix $\tilde{P}_0^{ab}$ with $(\tilde{P}_0^{ab})_{ab} = 1$, and let $\tilde{Q}_0^{ab} = \tilde{P}_0^{ab}$;
 5: $K = \lfloor \log_C \Delta \rfloor$;
 6: for $k = 1$ to $K$ do
 7:   compute $\tilde{Q}_k^{ab}$ and $M_k(a, b)$ according to Eq. (3.13) and Eq. (3.14);
 8:   $S(a, b) = S(a, b) + M_k(a, b)$;
 9: end for

[Figure 5: Non-zero elements (*) in $\tilde{P}_0^{ab}$, $\tilde{P}_1^{ab}$, $\tilde{P}_2^{ab}$, $\tilde{P}_3^{ab}$. Each panel is a 5 × 5 grid over nodes $a$ to $e$.]

Recall that the time complexity of the naive method is $O(kd^{2k})$. In most real datasets, $d$ is a constant below 10, which makes it possible to answer link-based similarity queries fast using Single-Pair SimRank. Finally, we give a proposition that compares the time cost of our new Single-Pair SimRank with that of the original SimRank.

Proposition 3.3: Given a graph $G(V, E)$ and a node-pair $(a, b)$, the time cost to compute $S(a, b)$ by Single-Pair SimRank is always less than the time cost by the original All-Pair SimRank. □

Proof Sketch: Based on Eq. (3.15), supposing $|V| = n$, $O(\text{Single-Pair}) < kd^2|V|^2 = kd^2n^2$. Based on Prop. 2.1, $O(\text{SimRank}) = kd^2n^2$. □

In the worst case, when $G(V, E)$ is a complete graph and the in-degree $d = |V|$, the time complexity of Single-Pair SimRank approaches that of All-Pair SimRank. However, in most real-world applications $G(V, E)$ is sparse, and Single-Pair SimRank is far more efficient than All-Pair SimRank at answering the similarity query of a node-pair. Besides, when the graph is changing, it is very difficult to update the similarity matrix because of the mutual reinforcement in All-Pair SimRank. For Single-Pair SimRank, the addition or removal of an edge is viewed as the addition or removal of a subtree in a path-tree, and thus can be computed incrementally.

3.3 Refining Performance of Single-Pair SimRank
We have proposed Single-Pair SimRank, which performs better than All-Pair SimRank when assessing the similarity of a given node-pair. Below, we further propose two optimizations to accelerate the iterative computation, and discuss techniques for applying Single-Pair SimRank to very large graphs.

Access Reduction: For the core computation shown in Eq. (3.13) and Eq. (3.14), we can reduce the access operations on the matrix $T_i' T_j$. Since $T_i' T_j$ is fixed during the iterative process, we can memorize or pre-compute it, so that it is not recalculated the next time it is required. When $T$ is so large that it cannot be held in memory, a daemon thread can be employed to compute $T_i' T_j$ when necessary. In other words, supposing

$$V^{ij} = T_i' T_j$$

Eq. (3.14) can be updated as

$$\tilde{Q}_k^{ab} = \sum_{i=1}^{|V|} \sum_{j=1, j \neq i}^{|V|} \left(\tilde{Q}_{k-1}^{ab}\right)_{ij} \cdot V^{ij} \tag{3.16}$$

without accuracy loss.

Threshold Filtering: Many graphs appear to be scale-free [11], with a degree distribution that follows a power law, including the WWW, citation graphs, and social networks. Position probabilities on these graphs, although non-zero, are often very small and thus have little influence on the final similarity. A threshold filter can be used to exclude these small but non-zero probabilities from the position matrix, which improves performance remarkably. We use a parameter $h$, and simply omit the non-zero probabilities that do not exceed $h$. Formally,

$$(\tilde{P}_k^{ab})_{ij} = 0 \quad \text{if } (\tilde{P}_k^{ab})_{ij} \le h$$

This threshold filtering method may lose some accuracy. An explicit evaluation of the accuracy loss is shown in our empirical studies.

Scaling to large graphs: From the storage perspective, the transition matrix $T$ is usually sparse and static, so it can be stored in a sparse matrix representation such as the 3-tuple form to accommodate large graphs. However, the position matrix in Single-Pair SimRank and the similarity matrix in All-Pair SimRank change from iteration to iteration, and their number of non-zero values increases rapidly. Since many real-world datasets, such as the web, are modeled as very large graphs, it is necessary to store the position matrix in external memory. The same challenge exists for the similarity matrix of All-Pair SimRank. In [14], Lizorkin et al. implemented matrix storage on top of Oracle Berkeley DB¹, and kept just several matrix rows in main memory. As another feasible approach, a position matrix can be represented as a growing 3-tuple table and stored in a relational database. Both approaches are easy to implement and extend Single-Pair SimRank to very large graphs. Large graphs are usually very sparse, which makes the efficiency advantage of Single-Pair SimRank over All-Pair SimRank even more significant.

¹ http://www.oracle.com/technology/products/berkeley-db/index.html
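The two optimizations of Section 3.3 can be combined in a short Python sketch. It is illustrative only and rests on several assumptions made here: `functools.lru_cache` stands in for memorizing $V^{ij} = T_i' T_j$, the position matrix is a sparse dictionary, the filter is applied to the whole matrix (including diagonal entries) before the diagonal sum is taken, and the names `single_pair_filtered` and `REVERSED` are hypothetical.

```python
import math
from functools import lru_cache

# Reversed example graph: each node maps to the nodes a surfer can step to.
REVERSED = {
    "a": ["c", "d"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["c", "e"],
    "e": ["c"],
}


def single_pair_filtered(out_neighbors, a, b, C=0.5, delta=0.01, h=1e-4):
    """R_K(a, b) with access reduction and threshold filtering.

    Entries of the position matrix whose probability C^k * q is <= h
    are dropped, so h = 0 reproduces the unfiltered result.
    """
    if a == b:
        return 1.0

    @lru_cache(maxsize=None)
    def pair_transitions(i, j):
        # Sparse form of V^{ij} = T_i' T_j, computed once per node-pair.
        succ_i = out_neighbors.get(i, ())
        succ_j = out_neighbors.get(j, ())
        if not succ_i or not succ_j:
            return ()
        weight = 1.0 / (len(succ_i) * len(succ_j))
        return tuple(((u, v), weight) for u in succ_i for v in succ_j)

    K = math.floor(math.log(delta) / math.log(C))
    Q = {(a, b): 1.0}
    score = 0.0
    for k in range(1, K + 1):
        nxt = {}
        for (i, j), q in Q.items():
            if i == j:
                continue  # surfers that already met do not propagate
            for cell, weight in pair_transitions(i, j):
                nxt[cell] = nxt.get(cell, 0.0) + q * weight
        scale = C ** k
        # Threshold filtering on the position probabilities C^k * q.
        Q = {cell: q for cell, q in nxt.items() if scale * q > h}
        score += scale * sum(q for (i, j), q in Q.items() if i == j)
    return score


exact = single_pair_filtered(REVERSED, "a", "b", h=0.0)
filtered = single_pair_filtered(REVERSED, "a", "b", h=1e-4)
print(exact, filtered, 1 - filtered / exact)  # last value: relative accuracy loss
```

Because filtering only removes probability mass, the filtered score can never exceed the exact one, so the relative loss printed at the end is non-negative. Caching per node-pair trades memory for repeated outer-product work, which is the same trade-off the text describes for large $T$.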
7 6 5 4 3 2 1 0
Scale-free Graph Random Graph
1
2
3
4
5
6
Iteration
Figure 6: Accuracy Loss of Threshold Filtering 4 Empirical Study This section reports experimental results to illustrate the practical effect of Single-Pair SimRank proposed in this paper. We implement the naive method, iterative method, and optimizations for Single-Pair SimRank, and then compare them to All-Pair SimRank over both synthetic graphs and real co-authorship network.
Parameters: For the reasons stated in Section 3, we set the decay factor $C = 0.5$ in experiments. Notice that the maximum number of steps $K$ is decided by the decay factor $C$ and the convergence threshold $\Delta$. As an input parameter, we set $\Delta = 0.01$, which satisfies the accuracy demands of most real applications. As a result,

$$K = \lfloor \log_C \Delta \rfloor = \lfloor \log_{0.5} 0.01 \rfloor = 6$$

For the second optimization using threshold filtering, a parameter $h$ determines whether a non-zero but small probability is omitted or not. We set $h = 0.0001$, which is suitable in most cases.

Datasets: To evaluate Single-Pair SimRank in different situations, we employ both synthetic graphs and a real co-authorship network. For synthetic graphs, we choose scale-free graphs [11] and random graphs [3] to simulate real-world datasets. All graphs here are produced by Barabasi Graph Generator². We use $G(N, M)$ to denote a directed graph $G$ that has $N$ nodes and $M$ edges. For example, $G(100, 500)$ represents a graph containing 100 nodes and 500 edges. We collected the co-author information of all papers in DBLP³ from 5 main data mining conferences (KDD, SDM, ICDM, PKDD and PAKDD), and built a co-authorship network called "DM-Author" with 7,677 nodes and 19,609 edges.

² Derek Dreier. Barabasi Graph Generator v1.4. University of California Riverside, Department of Computer Science.
³ http://dblp.uni-trier.de/xml/, last modified in September 2009.

Running Environment: All experiments are conducted on a computer with a 1.86 GHz Intel Core 2 processor, 2 GB RAM, and 32-bit Windows Server 2008. All algorithms are implemented in Java.

4.1 Accuracy
Our Single-Pair SimRank returns similarity scores without accuracy loss, compared to the original SimRank in [9]. The quality of SimRank scores has been evaluated in [9] and other related works, so it is not the focus of our experiments. We have proposed two optimization techniques in Section 3.3. While the first technique preserves accuracy, the second one, threshold filtering, may lose some accuracy. We evaluate this accuracy loss in this section. We set the threshold $h = 0.0001$ for filtering. When threshold filtering is used, supposing the resulting similarity of $(a, b)$ is $S_{TF}(a, b)$, the accuracy loss is computed by

$$Loss = 1 - \frac{S_{TF}(a, b)}{S(a, b)} \tag{4.17}$$

To avoid particularity, we repeat the experiment 100 times with different node-pairs $(a, b)$ and obtain the average accuracy loss on each iteration, shown in Fig. 6. Experiments are performed on both a scale-free graph $G_1(2K, 20K)$ and a random graph $G_2(2K, 20K)$. We can see that the accuracy loss stays within an acceptable range while performance is optimized. Since random graphs have a relatively large diameter, a node-pair in a random graph is more likely to have a non-zero but small position probability, making the accuracy loss higher than on the scale-free graph.

4.2 Performance
Comparative Experiments: The motivation of our Single-Pair SimRank is assessing the similarity between two objects without performing all-pair iterations. While SimRank returns a similarity matrix for the whole graph, Single-Pair SimRank only returns the similarity of a given node-pair. Comparisons between (All-Pair) SimRank, naive Single-Pair SimRank (NSP), iterative Single-Pair SimRank (ISP) and the optimization techniques (Access Reduction (AR), Threshold Filtering (TF)) are conducted on the same random graph $G_3(4K, 20K)$. Again, for Single-Pair SimRank we repeat the experiment 100 times with different node-pairs $(a, b)$ and obtain the average time cost per node-pair. We set the total iteration/step number $K = 6$. In this test, the transition matrix $T$ is represented as a sparse matrix (3-tuple form), whereas the similarity matrix $S$ and the position matrix $\tilde{P}^{k}_{ab}$ are stored in two-dimensional arrays.
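The parameter choice $K = \lfloor \log_C \Delta \rfloor$ from the Parameters paragraph and the loss measure of Eq. (4.17) can be reproduced with a short Python sketch. The function names and the similarity values passed to `accuracy_loss` are hypothetical and serve only as an illustration; they are not measurements from the experiments.

```python
import math


def max_steps(decay: float, delta: float) -> int:
    """K = floor(log_C(delta)), computed via the change-of-base formula."""
    return math.floor(math.log(delta) / math.log(decay))


def accuracy_loss(s_tf: float, s_exact: float) -> float:
    """Eq. (4.17): relative loss of the filtered score against the exact score."""
    return 1.0 - s_tf / s_exact


k = max_steps(decay=0.5, delta=0.01)
print(k)

# Hypothetical scores: exact similarity 0.0200, filtered similarity 0.0194.
loss = accuracy_loss(s_tf=0.0194, s_exact=0.0200)
print(f"{100 * loss:.1f}%")
```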
Table 2: Runtime of all 6 iterations for the compared methods (unit: second)

Method        SimRank   NSP-TF   ISP    ISP-AR   ISP-TF
Runtime (s)   135.8     22.1     3.70   2.88     1.51
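Table 2 allows a rough break-even estimate: how many single-pair queries can be answered before one all-pair run becomes cheaper. The sketch below uses only the runtimes in Table 2. It assumes that each query costs the reported average and that queries share no work, which is a simplification made here rather than a claim of the paper; the function name is hypothetical.

```python
import math

# Runtimes in seconds, copied from Table 2.
RUNTIME_S = {"SimRank": 135.8, "NSP-TF": 22.1, "ISP": 3.70, "ISP-AR": 2.88, "ISP-TF": 1.51}


def break_even_queries(all_pair_s: float, per_pair_s: float) -> int:
    """Largest number of single-pair queries whose total time does not exceed one all-pair run."""
    return math.floor(all_pair_s / per_pair_s)


for method in ("NSP-TF", "ISP", "ISP-AR", "ISP-TF"):
    n = break_even_queries(RUNTIME_S["SimRank"], RUNTIME_S[method])
    print(f"{method}: up to {n} queries")
```

Under these assumptions, the single-pair methods pay off only when a few dozen node-pairs are queried, against roughly 16 million ordered pairs on a 4K-node graph.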
[Figure 7(a): runtime (s) versus average in-degree $d$, from 2 to 20.]

Table 3: A comparison of the number of non-zero values in position matrices as iterations increase, with threshold filtering turned off or on respectively.

Position matrix          without threshold   with threshold
$\tilde{P}^{1}$          33                  33
$\tilde{P}^{2}$          671                 671
$\tilde{P}^{3}$          15402               2410
$\tilde{P}^{4}$          338733              108
$\tilde{P}^{5}$          4961711             1
...                      ...                 ...
$\tilde{P}^{\infty}$     $(4K)^2$            0
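The growth in the unfiltered column of Table 3 can be quantified by the ratio between consecutive steps. The Python sketch below uses only the counts in Table 3 and reads $(4K)^2$ as $4000^2$; the variable names are illustrative.

```python
# Non-zero counts per step, copied from Table 3 (steps 1 to 5).
WITHOUT_THRESHOLD = [33, 671, 15402, 338733, 4961711]
WITH_THRESHOLD = [33, 671, 2410, 108, 1]

# Step-to-step growth factor of the unfiltered position matrix.
growth = [b / a for a, b in zip(WITHOUT_THRESHOLD, WITHOUT_THRESHOLD[1:])]
print([round(g, 1) for g in growth])

# Share of all |V|^2 node-pairs that are non-zero at step 5, with |V| = 4000.
fill = WITHOUT_THRESHOLD[-1] / 4000**2
print(f"{100 * fill:.0f}% of all node-pairs")

# Fraction of step-4 entries that survive the filter.
kept = WITH_THRESHOLD[3] / WITHOUT_THRESHOLD[3]
print(f"{100 * kept:.3f}% of step-4 entries kept")
```

The unfiltered count grows by a factor of roughly 15 to 23 per step, while only a small fraction of the step-4 entries exceed the threshold.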
[Figure 7(b): runtime (s) of ISP and ISP-TF versus number of nodes, from 1K to 8K.]

Runtime of All-Pair SimRank on different node sizes (s):

Number of nodes   1K     2K     3K     4K     5K     6K      7K      8K
Runtime (s)       2.43   12.2   30.2   57.8   98.9   154.7   222.5   316.8
Figure 7: Performance test of iterative Single-Pair SimRank. In (a), the node size is fixed at $|V| = 4000$ and the in-degree $d$ increases. In (b), $d = 5$ is kept and the node size increases. Runtime includes all 6 iterations.

As Table 2 shows, SimRank takes more than 2 minutes to complete all 6 iterations. It is unfair to compare the time cost of SimRank and Single-Pair SimRank directly, because the output of SimRank is a similarity matrix, whereas Single-Pair SimRank only returns the similarity of the given node-pair. In practice, the choice depends on the number of queried node-pairs: we use All-Pair SimRank if a majority of node-pairs are queried, and Single-Pair SimRank if only a few node-pairs are queried. Recall that the time complexity of naive Single-Pair SimRank (NSP) is $O(kd^{2k})$. In this experiment, $d = 5$ on average. Nevertheless, in some cases the in-degree $d$ is significantly larger than 5, which makes the time and space cost of NSP hardly acceptable. So we run naive Single-Pair SimRank with Threshold Filtering (NSP-TF) instead. Among the optimizations in this experiment, Access Reduction (ISP-AR) achieves a performance improvement as high as 22% at the cost of additional space. We recommend Threshold Filtering (ISP-TF) in most cases, due to its efficiency and acceptable accuracy.

Expanding of Non-Zero Position Probabilities: According to the core computation shown in Eq. (3.13) and Eq. (3.14), the number of non-zero elements in the position matrix plays a significant role in the computational cost of the iterative process. From a theoretical aspect, as steps continue, the number of node-pairs at which the two surfers may be located increases rapidly, with respect to $d^k$ but bounded by $|V|^2$. We run this experiment on a scale-free graph $G_4(4K, 20K)$ and collect the number of non-zero position probabilities on each step with Threshold Filtering turned off and on. As Table 3 shows, if we turn off Threshold Filtering, the number of non-zero probabilities climbs rapidly, by a factor of roughly 15 to 23 per step. However, most of these non-zero probabilities are very small. The evidence is that, if we turn on Threshold Filtering and only keep probabilities higher than the threshold $h = 0.0001$, the number of remaining probabilities decreases greatly after 3 steps, as shown in Table 3. So Threshold Filtering is recommended to reduce the computational cost effectively, provided that the threshold $h$ is chosen reasonably.

Performances of Single-Pair SimRank: In this experiment we evaluate the performance of iterative Single-Pair SimRank (ISP) as the in-degree $d$ and the graph size $|V|$ increase. Fig. 7(a) illustrates the time cost of ISP with respect to the average in-degree $d$. We generate a series of scale-free graphs with node size $|V| = 4000$ and edge size $|E| = |V| \cdot d$. We can see that the time cost of ISP rises as $d$ increases, which accords with Eq. (3.15). In Fig. 7(b), we hold $d = 5$ and increase $|V|$; the time cost of Single-Pair SimRank increases nearly linearly and is clearly lower than that of All-Pair SimRank.
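For contrast with the near-linear growth of Single-Pair SimRank, the All-Pair SimRank runtimes reported with Fig. 7(b) can be fitted on a log-log scale. The sketch below is a plain least-squares fit in Python on those eight reported values. The helper name is hypothetical, and the fitted exponent is only a rough empirical indicator, not a complexity result.

```python
import math

# All-Pair SimRank runtimes reported with Fig. 7(b), at d = 5.
NODES = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
ALL_PAIR_S = [2.43, 12.2, 30.2, 57.8, 98.9, 154.7, 222.5, 316.8]


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. p in y ~ x**p."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


slope = loglog_slope(NODES, ALL_PAIR_S)
print(f"empirical exponent: {slope:.2f}")
```

An exponent a little above 2 is in line with the $O(kn^2d^2)$ cost of All-Pair SimRank at fixed $k$ and $d$.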
Convergence Rate: According to Prop. 2.2 and Prop. 3.1, we have proved that Single-Pair SimRank produces exactly the same similarity score as All-Pair SimRank does. Thus, Single-Pair SimRank and All-Pair SimRank have the same convergence rate. We perform this experiment on $G_3(4K, 20K)$ with maximum iteration number $K = 6$. As shown in Fig. 8(a), we test the convergence of SimRank scores as the number of iterations and the average in-degree $d$ increase. On each plot, the relative accuracy rate is defined as the average accuracy over all node-pairs, where

$$Accuracy(a, b) = \frac{R_k(a, b)}{S(a, b)} \tag{4.18}$$

for $S(a, b) \ne 0$ and $Accuracy(a, b) = 1$ for $S(a, b) = 0$. Note that this definition is stricter than the absolute accuracy rate that uses $|S(a, b) - R_k(a, b)|$. Fig. 8(a) shows that the accuracy rate is above 95% after 6 iterations, and that a larger in-degree $d$ results in a larger increase of the accuracy rate between two iterations, because a node-pair receives more similarity from its neighbors as $d$ increases. Then we show the time cost of iterative Single-Pair SimRank (ISP) and ISP with threshold filtering (ISP-TF) on each iteration in Fig. 8(b). Since the performance of Single-Pair SimRank is decided by the number of non-zero values in the position matrix, threshold filtering is effective at eliminating the "long tail" and greatly improves performance, with an acceptable accuracy loss.

Figure 8: (a) Accuracy rate with different in-degree $d$ and iteration number. (b) Time cost on each iteration.
[Plot (a): accuracy rate (0.0 to 1.0) versus average in-degree $d$ (1 to 10) and iterations (1 to 6). Plot (b): time cost (s, 0.0 to 3.0) versus iteration (1 to 6) for ISP and ISP-TF.]

4.3 Experiments on Real Dataset
Single-Pair SimRank can be applied to many real-world domains, including the WWW and social networks. In this section, experiments are performed over a real dataset to test its effectiveness and efficiency. We collected the co-author information of all papers in DBLP from 5 main data mining conferences, namely KDD, SDM, ICDM, PKDD and PAKDD. There are 5494 papers in total, from which we build a weighted and undirected co-authorship network called "DM-Author" with 7677 nodes and 19609 edges. Thus, an author has 2.55 co-authors on average, and the maximum number of co-authors of a single author is 102 (Philip S. Yu). Further statistics show that the top 30% of nodes are incident to 69% of the edges, making this co-authorship graph a scale-free graph indeed. SimRank in [9] computes the similarity between every two authors based on link information. If the requirement is querying the similarity between two given authors, Single-Pair SimRank is useful to avoid unnecessary computation. In this experiment, we query 100 times for different pairs of authors, and the average response time is 6.18 seconds. In contrast, it takes 157.82 s to answer such a query with All-Pair SimRank.

Table 4: Some similarity querying examples in DM-Author dataset and their query time.

Author-Pair                   Similarity   Query Time
Bing Liu, Jiawei Han          0.0026       6.20 s
Bing Liu, Philip S. Yu        0.0022       6.19 s
Jiawei Han, Philip S. Yu      0.0038       6.17 s
Query by All-Pair SimRank: 157.82 s

In Table 4 we show some similarity querying examples and their query time on the DM-Author dataset. It is worth noting that Bing Liu has not co-authored with Jiawei Han, but the two have a non-zero similarity propagated through common co-authors such as Philip S. Yu. Table 4 also shows that All-Pair SimRank takes significantly longer to answer the similarity query for a given node-pair. Besides, when the graph structure changes over time, the mutual reinforcement inherent in All-Pair SimRank makes it hard to update the similarity matrix quickly.

5 Related Work
The measure of similarity between objects has clear significance and has been studied extensively in different disciplines. Traditional similarity measures mainly focus on assessing the similarity between documents, by employing or extending the vector space model [16]. However, in many cases, linkages among objects can be the only or the most explicit information available [8].
Naively applying traditional measures to link structure may lead to invalid conclusions. On the other hand, the fast growth of data requires machine aid to help classify and cluster information. Unlike similarity measures such as [15], which rely on human-generated hierarchies, SimRank [9] follows human intuition and measures structural-context similarity automatically. Some approaches like co-citation [18] and the Jaccard coefficient also measure similarity based on structural context. However, they only utilize one-step neighbors, whereas SimRank exploits multi-step neighbors with a decay factor.

The computational cost of SimRank is an obstacle to its applicability on large datasets. Given a graph $G(V, E)$ with $|V| = n$, the time required for $k$ iterations is $O(kn^2d^2)$, and $O(kn^4)$ in the worst case. In the literature, there exist four research works focusing on SimRank optimization [14, 5, 12, 2]. Lizorkin et al. [14] presented an accuracy estimate and optimization techniques for SimRank that improve the computational complexity from $O(kn^4)$ to $O(kn^3)$ in the worst case, without accuracy loss. Fogaras et al. [5] proposed a new SimRank variant, PSimRank, based on the Monte Carlo method, which estimates $S(a, b)$ by calculating the first meeting time $\tau_{a,b}$ of two random surfers starting from nodes $a$ and $b$ respectively. Li et al. [12] suggested a method that splits the graph into sub-graphs, computes similarity scores in each sub-graph and then estimates the similarity between nodes in different sub-graphs, which achieves a performance improvement from $O(kn^2d^2)$ to $O(kn^{4/3}d^2)$ on average. Finally, [2] presents two enhanced versions of SimRank, one that exploits the weights of edges in the click graph and another that exploits "evidence" supporting the similarity between queries. Nevertheless, these optimization works all adopt an all-pair style of computation.

There are also some other link-based similarity measures. Xi et al. [19] proposed a similarity computation method called SimFusion, whose main purpose is "integrating relationships from heterogeneous sources". There are some resemblances between SimFusion and SimRank, as the authors of [14] point out in their related work. It is worth noting that, since SimFusion does not contain a decay factor $C$, SimFusion scores may not converge to a steady value as SimRank scores do. Besides, Lin et al. [13] proposed PageSim to measure link-based similarity between web pages, by propagating PageRank scores to neighbors along out-links. Yin et al. [20] proposed a hierarchical structure called SimTree to describe similarities between objects in a compact way. SimTree only computes and stores the similarities of sibling nodes and the ratio of each node compared with its parent. Blondel et al. [21] introduced a measure of similarity between the vertices of two graphs $G_A$ and $G_B$, by iteratively updating the similarity matrix $S_{k+1} = B S_k A^T + B^T S_k A$, where $A$ and $B$ are the adjacency matrices of $G_A$ and $G_B$. Fouss et al. [6] also present a new perspective on characterizing the similarity between nodes of a graph without human-predefined hierarchies. Based on random walks, [6] computes the average commute time and the Euclidean Commute Time Distance (ECTD) between nodes. These quantities are derived from the pseudoinverse of the Laplacian matrix of the graph, which in turn provides similarities for each node-pair.

We propose Single-Pair SimRank in this paper. Single-Pair SimRank is based on random walks on graphs, which are a special case of Markov chains [1]. A node in the graph represents a state in the Markov chain, and a transition probability is assigned to each link between nodes. Besides, techniques of sparse matrix computation [7] could be used to accelerate the computation of Single-Pair SimRank, but this is not the focus of this paper.

6 Conclusion
In this paper, we propose Single-Pair SimRank to compute the similarity of a given node-pair without accuracy loss. Unlike the original SimRank, our approach avoids the similarity computations of other node-pairs. Thus, if the requirement is querying the similarities of a few node-pairs, our method is highly efficient and responds quickly. There are some avenues for future work. Techniques of sparse matrix computation could accelerate the iterative process further. For very dense graphs, the advantage of Single-Pair SimRank over All-Pair SimRank is less decisive, which deserves further study. Finally, a "top-k similar objects" query on the link graph may be another interesting topic to study.

7 Appendix
Proof of Prop. 2.2: The proof is by mathematical induction. Let $I$ be the set of common in-neighbors of $a$ and $b$, that is, $I = I(a) \cap I(b)$. As the induction basis, let $k = 1$. According to Eq. (2.1), we have

$$\begin{aligned}
R_1(a, b) &= \frac{C}{|I(a)||I(b)|} \sum_{i \in I(a)} \sum_{j \in I(b)} R_0(i, j) \\
&= \frac{C}{|I(a)||I(b)|} \sum_{i \in I} R_0(i, i) \\
&= \frac{C \cdot |I|}{|I(a)||I(b)|} = M_1(a, b)
\end{aligned}$$
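The induction basis can be checked numerically on a small example. The Python sketch below assumes that Eq. (2.1) has the usual SimRank form, with $R_0(i, j) = 1$ if $i = j$ and $0$ otherwise. The toy graph and all function names are hypothetical.

```python
from itertools import product

C = 0.5  # decay factor, as in the experiments

# Hypothetical toy graph, given by its in-neighbor sets I(v).
IN = {"a": {"u", "v", "w"}, "b": {"v", "w"}, "u": set(), "v": set(), "w": set()}


def r0(i, j):
    """R_0: a node is maximally similar to itself and to nothing else."""
    return 1.0 if i == j else 0.0


def r1(a, b):
    """One SimRank update applied to R_0 (assumed form of Eq. (2.1))."""
    if a == b:
        return 1.0
    if not IN[a] or not IN[b]:
        return 0.0
    total = sum(r0(i, j) for i, j in product(IN[a], IN[b]))
    return C * total / (len(IN[a]) * len(IN[b]))


def m1(a, b):
    """Closed form of the basis: C * |I| / (|I(a)| * |I(b)|), with I the common in-neighbors."""
    common = IN[a] & IN[b]
    return C * len(common) / (len(IN[a]) * len(IN[b]))


print(r1("a", "b"), m1("a", "b"))
```

Here $a$ and $b$ share two of their in-neighbors, so both expressions evaluate to $0.5 \cdot 2 / (3 \cdot 2) = 1/6$.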
Below, in the inductive step, we prove the case of $(k + 1)$, provided that Eq. (2.4) holds for a given $k$ for all node-pairs. There are two distinct cases.
Figure 9: The illustration for Prop. 2.2
[Diagram with node labels $X$, $a$, $I_i(a)$, $I_j(b)$ and $b$.]

First, consider $i \ne j$ for $i \in I(a)$ and $j \in I(b)$. By the induction hypothesis, $R_k(i, j) = \sum_{x=1}^{k} M_x(i, j)$ already holds. Without loss of generality, for any $x$ such that $1 \le x \le k$, suppose that the two random walk paths are as follows: $\langle i, a_1, a_2, \cdots, a_{x-1}, a \rangle$ for the first random surfer starting at $i$, and $\langle j, b_1, b_2, \cdots, b_{x-1}, a \rangle$ for the second random surfer starting at $j$. These two paths meet at node $a$ on step $x$. In a similar fashion, for the case of step $(x + 1)$ starting from $a$ and $b$, we can elongate the above two random paths to $\langle a, i, a_1, a_2, \cdots, a_{x-1}, a \rangle$ and $\langle b, j, b_1, b_2, \cdots, b_{x-1}, a \rangle$ respectively. Notice that every first-meeting path-pair for $1 \le x \le k$ can be elongated in this way. An illustration is provided in Fig. 9; note that a random walk path in SimRank follows the reversed edges. By elongating all possible first-meeting path-pairs starting from in-neighbors of $(a, b)$, we obtain all pairs of first-meeting paths starting from $(a, b)$ with $(x + 1)$ steps. That is,

$$M_{x+1}(a, b) = \frac{C}{|I(a)||I(b)|} \sum_{i \in I(a)} \sum_{j \in I(b)} M_x(i, j) \tag{7.19}$$

Considering all $x \in \{1, 2, \cdots, k\}$, for $i \ne j$, the sum of all first-meeting probabilities within $k + 1$ steps starting from nodes $a$ and $b$ is

$$R'_{k+1}(a, b) = \sum_{x=2}^{k+1} M_x(a, b) \tag{7.20}$$

Second, consider $i = j$ for $i \in I(a)$ and $j \in I(b)$. In this case, the random surfers starting from $a$ and $b$ meet on the first step. That is,

$$R''_{k+1}(a, b) = M_1(a, b) \tag{7.21}$$
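As a sanity check on the recursion in Eq. (7.19) and the split into Eq. (7.20) and Eq. (7.21), the identity $R_k(a, b) = \sum_{x=1}^{k} M_x(a, b)$ can be tested numerically. The Python sketch below rests on several assumptions that are not stated in this appendix: Eq. (2.1) is taken in its usual SimRank form with $R_k(v, v) = 1$, $M_0$ is the indicator of $a = b$, and $M_x(v, v) = 0$ for $x \ge 1$ because two surfers at the same node have already met. The toy graph and the helper `step` are hypothetical.

```python
from itertools import product

C = 0.5  # decay factor
K = 6    # number of iterations

# Hypothetical directed toy graph as (source, target) edges.
EDGES = [(0, 2), (0, 3), (1, 2), (1, 3), (1, 4), (2, 4), (3, 0), (4, 1), (4, 3)]
N = 5
IN = {v: [s for s, t in EDGES if t == v] for v in range(N)}


def step(prev, diagonal):
    """Average prev over in-neighbor pairs and scale by C; pairs (v, v) are set to `diagonal`."""
    nxt = {}
    for a, b in product(range(N), repeat=2):
        if a == b:
            nxt[a, b] = diagonal
        elif IN[a] and IN[b]:
            total = sum(prev[i, j] for i in IN[a] for j in IN[b])
            nxt[a, b] = C * total / (len(IN[a]) * len(IN[b]))
        else:
            nxt[a, b] = 0.0
    return nxt


identity = {(a, b): 1.0 if a == b else 0.0 for a, b in product(range(N), repeat=2)}

r = dict(identity)                          # R_0
m = dict(identity)                          # M_0: surfers at the same node have met
m_sum = {pair: 0.0 for pair in identity}    # running sum of M_1 .. M_k
for _ in range(K):
    r = step(r, diagonal=1.0)   # SimRank iteration, R_k(v, v) = 1
    m = step(m, diagonal=0.0)   # Eq. (7.19), with M_x(v, v) = 0 for x >= 1
    for pair in m_sum:
        m_sum[pair] += m[pair]

max_gap = max(abs(r[a, b] - m_sum[a, b]) for (a, b) in identity if a != b)
print(max_gap)
```

On this toy graph the two quantities agree up to floating-point rounding for every pair of distinct nodes.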
By combining Eq. (7.20) and Eq. (7.21), we obtain Eq. (2.4), which completes the proof. □

Acknowledgment: The work was also supported in part by the National Natural Science Foundation of China under Grant No. 70871068, 70890083, 70621061 and 60873017, and the Research Grants Council of the Hong Kong SAR, China No. 419008 and 419109.

References
[1] D. Aldous and J. Fill. Reversible Markov chains and random walks on graphs. Book in preparation.
[2] I. Antonellis, H. Garcia-Molina, and C. Chang. Simrank++: query rewriting through link analysis of the click graph. PVLDB, 1(1):408–421, 2008.
[3] B. Bollobás. Random Graphs. Cambridge University Press, 2nd edition, 2001.
[4] P. Chebolu and P. Melsted. Pagerank and the random surfer model. In SODA, pages 1010–1018, 2008.
[5] D. Fogaras and B. Rácz. Scaling link-based similarity search. In WWW, pages 641–650, 2005.
[6] F. Fouss, A. Pirotte, J. Renders, and M. Saerens. Random walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE TKDE, 19(3):355–369, 2007.
[7] A. George, J. Gilbert, and J. Liu. Graph Theory and Sparse Matrix Computation. Springer-Verlag, 1993.
[8] L. Getoor and C. Diehl. Link mining: A survey. SIGKDD Explorations, 7(2):3–12, 2005.
[9] G. Jeh and J. Widom. Simrank: A measure of structural-context similarity. In KDD, pages 538–543, 2002.
[10] J. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.
[11] L. Li et al. Towards a theory of scale-free graphs: Definition, properties, and implications (extended version). CoRR, abs/cond-mat/0501169, 2005.
[12] P. Li, Y. Cai, H. Liu, J. He, and X. Du. Exploiting the block structure of link graph for efficient similarity computation. In PAKDD, pages 389–400, 2009.
[13] Z. Lin, M. Lyu, and I. King. Pagesim: A novel link-based measure of web page similarity. In WWW, pages 1019–1020, 2006.
[14] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov. Accuracy estimate and optimization techniques for simrank computation. PVLDB, 1(1):422–433, 2008.
[15] A. G. Maguitman et al. Algorithmic computation and approximation of semantic similarity. World Wide Web, 9(4):431–456, 2006.
[16] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[17] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford Univ., 1998.
[18] H. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4):265–269, 1973.
[19] W. Xi et al. Simfusion: measuring similarity using unified relationship matrix. In SIGIR, pages 130–137, 2005.
[20] X. Yin, J. Han, and P. Yu. Linkclus: Efficient clustering via heterogeneous semantic links. In VLDB, pages 427–438, 2006.
[21] V. D. Blondel et al. A measure of similarity between graph vertices: Applications to synonym extraction and web searching. SIAM Review, 46(4):647–666, 2004.