Aug 10, 2016 - further subdivided into smaller communities. For instance, in social media an automotive group may be analyzed to an. F1 group, a car group, ...
On converting community detection algorithms for fuzzy graphs in Neo4j
arXiv:1608.02235v1 [cs.SI] 7 Aug 2016
Georgios Drakopoulos, Andreas Kanavos, Christos Makris, and Vasileios Megalooikonomou Computer Engineering and Informatics Department University of Patras Patras, Hellas 26500 {drakop, kanavos, makri, vasilis}@ceid.upatras.gr
Abstract—An essential feature of large scale free graphs, such as the Web, protein-to-protein interaction, brain connectivity, and social media graphs, is that they tend to form recursive communities. The latter are densely connected vertex clusters exhibiting quick local information dissemination and processing. Under the fuzzy graph model vertices are fixed while each edge exists with a given probability according to a membership function. This paper presents Fuzzy Walktrap and Fuzzy Newman-Girvan, fuzzy versions of two established community discovery algorithms. The proposed algorithms have been applied to a synthetic graph generated by the Kronecker model with different termination criteria and the results are discussed. Keywords-Fuzzy graphs; Membership function; Community detection; Termination criteria; Walktrap algorithm; NewmanGirvan algorithm; Edge density; Kronecker model; Large graph analytics; Higher order data
I. I NTRODUCTION Vast amount of empirical evidence suggests that large scale graphs such as brain connectivity graphs, protein-toprotein interaction graphs, transportation networks, and the Web graph, strongly tend to exhibit modularity. In other words, they are composed of recursively built communities, a crucial factor for scaling property [6][2]. Communities are highly connected vertex subsets which communicate with each other with few long distance edges. The communities account for the quick local information diffusion and processing, whereas the long distance edges serve as exchange points. Often large communities can be further subdivided into smaller communities. For instance, in social media an automotive group may be analyzed to an F1 group, a car group, and a motorcycle group depending on the particular interests of its users. As a result of the importance of community discovery in developing large graph analytics, various algorithms have been developed. Two of the most prominent ones are the Newman-Girvan and the Walktrap algorithms. The former is deterministic and is based on local edge density, whereas the latter is heuristic and relies on the concept of an edge traversing random walker. These radically different approaches indicate the flexibility inherent in graph analytics as well as the multitude of the ways a graph can be interpreted. The primary contribution of this paper is the development of the fuzzy versions of Walktrap and Newman-Girvan
algorithms. The proposed versions have been based on the fuzzy graph model introduced among others in [17] and its properties. They have been applied to a synthetic graph obtained by the Kronecker model with various termination criteria. The remaining of this work is structured as follows. Scientific literature regarding community discovery is reviewed in section II. The fuzzy graph model and its fundamental properties are outlined along with two remarks about higher order data and partitioning algorithm termination criteria in section III. Fuzzy Walktrap and Fuzzy Newman-Girvan algorithms are presented in detail in sections IV and V respectively. The Kronecker model is briefly reviewed in section VI, where the results of the application of Fuzzy Walktrap and Fuzzy Newman-Girvan to a number of Kronecker synthetic graphs are reported. Finally, the main findings of this paper as well as future research directions are discussed in section VII. Table I summarizes the symbols used in this paper. Notice that vk and ek are shorthand notations for the k-th vertex and the k-th edge respectively. The graph they refer to should be clear or implied by the context. Symbol 4 = ∼ ⊗ N µ0 , σ02 {s1 , s2 , . . . , sn } |S| (e1 , . . . , em ) ζ deg (vk ) A[i, j] H (s1 , . . . , sn )
Meaning Definition or equality by definition Distribution according to a density function Kronecker tensor product Gaussian distribution with µ0 and σ02 Set with elements s1 , s2 , . . . , sn Cardinality of set S Path comprised of edges e1 , . . . em Graph diameter Degree of vertex vk Matrix entry in row i and column j Harmonic mean of s1 , . . . , sn
Table I S YMBOLS USED IN THIS PAPER .
Definition 1: A (deterministic) partition of a set S = {s1 , . . . , sn } is an assignment of elements sk to p sets Sj such that ∪pj=1 Sk = S and Si ∩ Sj = ∅ for each distinct
index pair i and j. Therefore p X
|Sj | = |S|
have been introduced among others in [17], where a fuzzy extension of Cypher termed FUDGE is presented. (1)
III. F UZZY G RAPHS
j=1
A partition is denoted as S = {Sj }. Definition 2: A fuzzy partition of a set S = {s1 , . . . , sn } is an assignment of elements sk to p sets Sj where the fuzzy membership operator ∈F quantifies the degree of participation of sk to Sj . sk ∈F Sj = µk,j ∈ [0, 1]
(2)
A fuzzy partition is denoted as S = {Sj }F . Notice that p n X X
µk,j = |S|
Within the context of this paper fuzzy graphs are combinatorial objects with a fixed set of vertices and a fuzzy set of edges. Its formal definition is the following. Definition 3: A fuzzy graph is the ordered triplet G = (V, E, h)
(4)
where V = {vk } is the set of vertices, E = {ek } ⊆ V × V is the set of edges, and h(·) is the edge membership function h : E → (0, 1]
(3)
k=1 j=1
Notice that, contrary to deterministic set partition, fuzzy partitioning creates subsets which are pairwise overlapping. The actual overlap magnitude depends on the number of subsets and on the selected membership function. II. R ELATED W ORK Large graph community detection or community identification is significant for building big data analytics. This problem is algorithmically reduced to either graph partitioning or data clustering in general [4][5][13][20]. Graph partitioning can be performed either structurally or spectrally. In the former case the graph is interpreted as an algebraic object and the partitioning is based on the properties of the graph adjacency matrix [8][21], whereas in the latter the graph is treated as a combinatorial object and the partitioning exploits features such as edge density and community coherence [2][20][13]. A research problem related to community detection is authority estimation. In [1] several graph features, such as the degree distribution, and hub and authority scores, are used to model the relative importance of a given user. Alternatively, in the expertise ranking model [7] authorities are derived by performing link analysis to the graph induced from interactions between users. Moreover, in [23] authors employ Latent Dirichlet Allocation and a PageRank variant to cluster the graph according to topics and then the authorities for each topic are identified. This was extended in [16] with additional features, advanced clustering, and real-time applicability. Analytics are an integral part of large graph processing systems such as massive distributed graph computing systems like Google Pregel [11] and graph based machine learning frameworks like Graphlab [10]. In these systems graphs play a dual role as the computational flow model as well as the learning model. Interest in the graph processing field [14][19] has been invigorated with the advent of open source graph databases such as BrightStar [3], Neo4j [12], Sparksee [22], and GraphDB [15]. Finally, fuzzy graphs
(5)
which measures the degree of participation of ek to G. Notice that for brevity non-existent edges are not included in the fuzzy graph. Notice that a more concise definition of a fuzzy graph would need only V and h with the latter being zero for each vertex pair in the set (V × V ) \ E. Definition 4: Under the fuzzy graph model the cost δ(ek ) of traversing ek is 1 ∈ [1, +∞) h(ek )
4
δ(ek ) =
(6)
Definition 5: The cost ∆p of a path p = (e1 , . . . , em ) of a fuzzy graph is the sum of the cost of its edges 4
∆p =
X
δ(ek ) =
ek ∈p
m X k=1
1 h(ek )
1 = m H (h(e1 ), . . . , h(em ))
(7)
where H (h(e1 ), . . . , h(en )) is the harmonic mean of h(ek ). Observe that ∆p is dominated by the minimum of h(ek ), indicating that low cost paths contain exclusively edges that are highly likely to belong to the graph. Definition 6: The strength Σp of a path p = (e1 , . . . , em ) of a fuzzy graph is defined as the minimum value the membership function takes in p 4
Σp = min {h(ek )} ek ∈p
(8)
Definition 7: The distance d(vs , vt ) between vs and vt is defined as the minimum cost over all paths connecting them 4
d(vs , vt ) = min {∆p } , p
p = (vs , v1 , . . . , vn , vt )
(9)
In [17] it is established that d(·, ·) is a distance function, namely it satisfies the following properties • d(vi , vj ) = 0 ⇔ vj = vi , ∀vi , vj ∈ V • d(vi , vj ) = d(vj , vi ), ∀vi , vj ∈ V • d(vi , vj ) ≤ d(vi , vk ) + d(vk , vj ), ∀vi , vj , vk ∈ V Although h(·) can be arbitrary, the fuzzy graphs which have been used in the implementation were chosen such as
that h(ek ) would be distributed according to the noncentral χ22 distribution with two degrees of freedom, namely r g12 + g22 1 1 ht (ek ) ∼ , g1,2 ∼ N , (10) 2 2 6 where the gk are identical and independently distributed. This distribution generates only positive values, a strict requirement in order to build a fuzzy graph, which are relatively concentrated around the mean value, less strongly than the Gaussian case though. The mean and the variance have been chosen as the interval [µ0 − 3σ0 , µ0 + 3σ0 ]
(11)
to coincide with [0, 1]. The class of fuzzy graphs defined in this section are a typical example of higher order big data. In the community identification case the order of the data is expressed in terms of the number of vertices involved or the number of edges that need to be traversed. This can be attributed to the linked nature of graphs which balances the distribution between local and global information. An indication of the connection between community detection and the higher order analytics is that the smallest community is closed triangle. In terms of both vertices and edges this is clearly a third order metric. This stems from the fact that isolated edges do not qualify by themselves as communities. In other words, in a given vertex group in order to qualify as a community there has to be at least one vertex connecting the remaining ones. IV. F UZZY WALKTRAP The Walktrap algorithm is based on the concept of an edge traversing random walker. The crucial observation is that although the walker starts from an arbitrary vertex, she will eventually spend more time in densely interconnected graph segments. This is based on the fact that it is more probable for a randomly picked edge to lead to another vertex inside the community the walker is currently in than to a vertex of another community [18]. The probability that the walker moves from vi to vj is pi,j =
A[i, j] deg (vi )
(12)
where A denotes the adjacency matrix of the graph where ( 1, (vi , vj ) ∈ E 4 |V |×|V | A[i, j] = ∈ {0, 1} (13) 0, (vi , vj ) 6∈ E If the probability that the random walker reaches vj from vi through a path of length ` is denoted by p`i,j , then the following should hold: ` • If vi and vj belong to the same community, then pi,j should be large for at least large values of `. Note that the converse is not always true. In other words,
•
•
depending on graph topology p`i,j may be high even if vi and vj belong to different communities. If vi and vj belong to the same community, then the random walker tends to treat them in a very similar manner. Then for each ` the condition p`i,j = p`j,i should hold. It should be also noted that p`i,j depends on deg (vj ), as the walker is more probable to select a higer degree vertex.
The above conditions lay the groundwork for defining the distance di,j between vi and vj as v u u |V | p` − p` 2 X u i,k j,k 4 (14) di,j = t deg (vk ) k=1
In a similar manner, the transition probability p`C,k from any vertex belonging to a community C to vk in ` steps is defined as 1 X ` p`C,k = pi,k (15) |C| i∈C
Generalizing (14), the distance rCi ,Cj between the communities Ci and Cj is defined as v 2 u u |V | p` uX Ci ,k − p`Cj ,k (16) rCi ,Cj = t deg (vk ) k=1
Algorithm 1 Deterministic Walktrap Require: Graph G(V, E), termination criterion τ0 Ensure: G is partitioned into communities; V = {Vk } 1: for all vk ∈ V do 2: Vk ← {vk } 3: end for 4: repeat 5: for all distinct pairs (Vi , Vj ) do 6: ρi,j ← rVi ,Vj and ρj,i ← ρi,j 7: end for 8: (i∗ , j ∗ ) ← argmini,j {ρi,j } 9: V ← V \ Vi∗ and V ← V \ Vj ∗ 10: p ← |V | and Vp+1 ← Vi∗ ∪Vj ∗ and V ← V ∪Vp+1 11: until G satisfies τ0 12: return G The Walktrap algorithm eventually terminates as in each step two communities are merged into a single one until the graph becomes one single partition. However, this is a trivial case which is rarely useful. Therefore, termination criteria τ0 should yield at least two communities. Notice that termination criteria are not trivially selected. These include: •
A fixed number of communities k0 . As this cannot be known in advance, it is usually selected empirically.
•
The Tanimoto similarity measure. Observe this cannot be applied to deterministic community detection algorithms, since they create non-overlapping communities. Recall that for two sets S1 and S2 the Tanimoto coefficient γ0 as 4
γ0 =
•
|S1 ∩ S2 | |S1 ∩ S2 | = |S1 ∪ S2 | |S1 | + |S2 | − |S1 ∩ S2 |
When the number of communities remain unchanged and the pairwise Tanimoto coefficient is small, then the community identification algorithm terminates. The ratio of average intracommunity diameter to the overall graph diameter. If V = {Vk }F with 1 ≤ k ≤ p is a fuzzy partition of V , ζj is the diameter of Vj , and ζ is the overall graph diameter then 4
γ1 =
•
(17)
p 1 X ζj pζ j=1
(18)
When this ratio becomes small enough, then the community detection algorithm terminates. The average ratio of edges inside each community to the number of edges of the fully connected graph of the size size p X |ek ∈ Vj | 4 1 γ2 = (19) |Vj | p j=1 2
Conversely, the community detection algorithm terminates when this ratio becomes large enough. The Fuzzy Walktrap begins by assigning each vertex to its own community like in the deterministic case. Notice that in the fuzzy case the entries of AF now are ( h((vi , vj )), (vi , vj ) ∈ E 4 |V |×|V | F A [i, j] = ∈ [0, 1] 0, (vi , vj ) 6∈ E (20) Therefore, they are continuous and belong to [0, 1] instead of being binary. The algorithm then proceeds as the deterministic Walktrap in order to locate communities with the distance metrics unaltered. V. F UZZY N EWMAN -G IRVAN The Newman-Girvan or edge betweeness algorithm [6] is based on betweeness centrality, an edge centrality metric which counts the fraction of the number of the shortest paths connecting two vertices vi and vj a given edge ek is part k of, denoted by σi,j , to the total number of shortest paths connecting vi and vj , denoted by σi,j . Then the betweeness centrality for ek , denoted by Bk , is computed by averaging over each vertex pair ! k X σi,j 1 4 Bk = |V | , vi 6= vj (21) σi,j 2
Algorithm 2 Fuzzy Walktrap Require: Fuzzy graph G(V, E, h), termination criterion τ0 Ensure: G is partitioned into communities; V = {Vk }F 1: for all vk ∈ V do 2: Vk ← {vk } 3: end for 4: repeat 5: for all distinct pairs (Vi , Vj ) do F F F 6: ρF i,j ← rVi ,Vj and ρj,i ← ρi,j 7: end for 8: (i∗ , j ∗ ) ← argmini,j ρF i,j 9: V ← V \ Vi∗ and V ← V \ Vj ∗ 10: p ← |V | and Vp+1 ← Vi∗ ∪Vj ∗ and V ← V ∪Vp+1 11: until G satisfies τ0 12: return G
(vi ,vj )∈E
In [6] a process for computing Bk for each ek in a manner
resembling breadth-first search is described. The rationale is that vertices belonging to different communities should rely on edges connecting communities for information exchange. However, note that the converse need not be true. Moreover, depending on graph topology, some of the community connecting edges may not rank high in terms betweeness centrality as other edges may be more preferable. Therefore, the edge e∗ with the highest betweeness centrality should be removed and the process should be applied again to the new graph. Eventually, all edges connecting communities will be identified. Intuitively, the edge sequence he∗ i should contain the graph bridges as well, which are a subset of the community connecting edges. In case the graph becomes disconnected, then the process is repeated to each of the connected components. Algorithm 3 Deterministic Newman-Girvan Require: Graph G(V, E), termination criteria τ0 Ensure: G is partitioned into communities; V = {Vk } 1: while E 6= ∅ and τ0 not satisfied do 2: Compute shortest paths in G 3: for all ek ∈ E do 4: Compute Bk 5: end for 6: e∗ ← argmaxk {Bk } 7: E ← E \ {e∗ } 8: end while 9: return G
The structure of Fuzzy Newman-Girvan is identical to that of its deterministic counterpart. This happens because the fuzzy edge costs affect only the computation of BkF , the fuzzy edge betweeness centrality, and not the edge selection. It should be noted that for a computational point of view, any algorithm for weighted graphs can be employed.
Algorithm 4 Fuzzy Newman-Girvan Require: Fuzzy graph G(V, E, h), termination criteria τ0 Ensure: G is partitioned into communities; V = {Vk }F 1: while E 6= ∅ and τ0 not satisfied do 2: Compute shortest fuzzy paths in G 3: for all ek ∈ E do 4: Compute BkF 5: end for 6: e∗ ← argmaxk BkF 7: E ← E \ {e∗ } 8: end while 9: return G
Figure 1.
VI. R ESULTS The Kronecker synthetic graph generation model [9] relies on the adjacency matrix of a graph and the recursive nature of scale free graphs in order to progressively build a large graph from smaller similar ones. Given a preselected original adjacency matrix A, the following graph sequence is generated A0 = A An+1 = An ⊗A,
n ≥ 1
(22)
For the experimental evaluation of the proposed fuzzy algorithms an undirected generator was chosen with 9 vertices and 33 edges. Since A0 is moderately dense, then A4 would be at least moderately dense as well because Kronecker graphs density relatively quickly. Therefore, only a small number of communities is expected to be discovered by any well performing algorithm. Once the graph has been created, ht has been applied to its edges in order to make it fuzzy. Table II shows the time required for each algorithm. It is evident that Fuzzy Newman-Girvan is approximately three times more expensive than Fuzzy Walktrap. Also, in both cases the diameter ratio algorithm termination criterion caused a slight algorithm slowdown. Although more experiments could shed light into this phenomenon, it should probably be attributed to the global nature of diameter computation. Edge density on the contrary is based exclusively on local properties. Algorithm Fuzzy Walktrap Fuzzy Walktrap Fuzzy Newman-Girvan Fuzzy Newman-Girvan
Citerion Edge density Diameter ratio Edge density Diameter ratio
Time (sec) 2.001 2.127 5.901 6.582
Table II T IME FOR F UZZY WALKTRAP AND F UZZY N EWMAN -G IRVAN .
In figure 1 the number and size of each community detected by Fuzzy Walktrap and Fuzzy Newman-Girvan in
Community sizes.
each of the two graphs depicted This information is also described in table III for clarity. FW
FN-G
9.6% 7.2% 4% 10.4% 12.4% 5.5%
14.3% 6.10% 1.1% 2% 1.1% 0.8%
6.8% 5.3% 1.3% 8.9% 11.6% 7.5%
6.9% 10.5% 1.5% 3.1% 1.3% 0.6%
10.6% 2%
9.7% 3.1%
18.8% 10.5%
4% 1.5%
Table III S IZE AND NUMBER OF COMMUNITIES .
It should be noted that Fuzzy Walktrap is preferable as the input graph grows larger. The reason is that Fuzzy NewmanGirvan computes all the shortest paths among every pair of source and destination vertices, effectively meaning that large graph portions need to be visited, while Fuzzy Walktrap spends a considerable amount of time crossing adjacent edges before eventually moving on to other communities, exploiting thus local information. Moreover, from an implementation viewpoint Fuzzy Walktrap is preferable as it relies on local graph segments, meaning that fewer graph segments need to be loaded to the main memory, whereas the systematic graph traversal required to find shortest paths translates to more read and synchronization operations. Therefore, Fuzzy Walktrap may be the algorithm of choice in distributed graph processing systems. VII. C ONCLUSIONS AND F UTURE W ORK The primary contribution of this paper is the development of Fuzzy Walktrap and Fuzzy Newman-Girvan, two fuzzy graph community detection algorithms based on their deterministic namesakes. The underlying fuzzy graph model was the one used in [17] among other papers. Its main characteristic is that vertices remain fixed while edges may or may not be part of the graph, a fact reflected on the
values of a membership function. The latter is a special function whose domain is the set of graph edges and its return values belong to (0, 1], asserting thus the degree each edge actually participates to the graph. The higher the value, the more the edge belongs to the graph and the easier is to traverse it. Experimental results conducted on synthetic fuzzy Kronecker graphs indicate that Fuzzy Newman-Girvan is considerably more expensive and it returns more communities which are more compact, while Fuzzy Walktrap is quicker and it makes efficient use of local information at the expense of community coherence. Two community compactness criteria have been used, the ratio of average diameter of the communities to the overall graph diameter and the average ratio of the number of edges in each community to the number of edges of a fully connected graph of size equal to that of the community. In possible future directions should be included the development of more sophisticated fuzzy community identification algorithms. Such algorithms could possibly take advantage of the specific membership function of a given graph or at least of some partial knowledge regarding a particular membership function in order to build heuristics for computation acceleration. For instance, knowledge about the first non-central moment of the membership function would help such an algorithm decide whether the cost of an edge is unusually high. Notice that both Fuzzy Walktrap and Fuzzy Newman-Girvan are in fact membership function oblivious, so in essence they treat a fuzzy graph more like an ordinary weighted graph. Another direction is the extension of the fuzzy model such that vertices would also participate to the graph according to a vertex membership function. Therefore, edge participation would be affected not only by the edge membership function but also by the vertex membership values of the two adjacent vertices. R EFERENCES [1] E. Agichtein, C. Castillo, D. Donato, A. Gionis, and D. Mishne. Finding high-quality content in social media. In Web Search and Data Mining conference, WSDM, pages 183–194. ACM, 2008. [2] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of community hierarchies in large networks. Journal of Statistical Mechanics: Theory and Experiment, P1000, 2008. [3] BrightStar. www.brightstardb.com, June 2015. [4] P. J. Carrington, J. Scott, and S. Wasserman. Models and Methods in Social Network Analysis. Cambridge University Press, 2005. [5] S. Fortunato. Community detection in graphs. Reports, 486:75–174, 2010.
Physics
[6] M. Girvan and M. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(2):7821–7826, 2002. [7] P. Jurczyk and E. Agichtein. Discovering authorities in question answer communities by using link analysis. In Conference of Information and Knowledge Management, CIKM, pages 919–922, 2007. [8] B. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(1):291–307, 1970. [9] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. J. Mach. Learn. Res., 11:985–1042, Mar. 2010. [10] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010. [11] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, SIGMOD ’10, pages 135–146, New York, NY, USA, 2010. ACM. [12] Neo Technologies. www.neo4j.com, June 2015. [13] M. Newman. Networks: An Introduction. Oxford University Press, 2010. [14] Onofrio Panzarino. Learning Cypher. PACKT publishing, May 2014. [15] Ontotext. www.ontotext.com, June 2015. [16] A. Pal and S. Counts. Identifying topical authorities in microblogs. In Web Search and Data Mining, WSDM, pages 45–54, 2011. [17] O. Pivert, V. Thion, H. Jaudoin, and G. Smits. On a fuzzy algebra for querying graph databases. In Proceedings of the 26th International Conference on Tools with Artificial Intelligence, ICTAI 2014, pages 748–755. IEEE, November 2014. [18] P. Pons and M. Latapy. Computing communities in large networks using random walks. In arXiv:physics/0512106. www.arxiv.org, December 2005. [19] I. Robinson, J. Webber, and E. Eifrem. Graph Databases. O’Reilly, June 2013. [20] J. Scott. Social Network Analysis: A Handbook. Publications Ltd., 2000.
SAGE
[21] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000. [22] Sparksee. www.sparsity-technologies.com, September 2015. [23] J. Weng, E.-P. Lim, J. Lim, and Q. H. Jiang. Twitterrank: Finding topic-sensitive influential twitterers. In Web Search and Data Mining, WSDM, pages 261–270, 2010.