www.ietdl.org Published in IET Systems Biology Received on 4th November 2007 Revised on 26th November 2008 doi: 10.1049/iet-syb.2007.0061
ISSN 1751-8849
Detection of local community structures in complex dynamic networks with random walks
G.S. Thakur1 R. Tiwari1 M.T. Thai1 S.-S. Chen1 A.W.M. Dress2
1 CISE, University of Florida, Gainesville, FL, USA
2 CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People's Republic of China
E-mail: [email protected]fl.edu
Abstract: Identification of interaction patterns in complex networks via community structures has gathered a lot of attention in recent research studies. Local community structures provide a better measure to understand and visualise the nature of interaction when the global knowledge of networks is unknown. Recent research on local community structures, however, lacks the feature to adjust itself in the dynamic networks and heavily depends on the source vertex position. In this study the authors propose a novel approach to identify local communities based on iterative agglomeration and local optimisation. The proposed solution has two significant improvements: (i) in each iteration, agglomeration strengthens the local community measure by selecting the best possible set of vertices, and (ii) the proposed vertex and community rank criterion are suitable for the dynamic networks where the interactions among vertices may change over time. In order to evaluate the proposed algorithm, extensive experiments and benchmarking on computer generated networks as well as real-world social and biological networks have been conducted. The experiment results reflect that the proposed algorithm can identify local communities, irrespective of the source vertex position, with more than 92% accuracy in the synthetic as well as in the real-world networks.
1 Introduction
Finding community structures in complex networks based on patterns of interaction has attracted considerable recent attention [1–3]. Networks play an important role in many fields, such as the Internet, social research, life sciences and biology [4–6]. All these networks can be modelled as a graph G = (V, E), where V is a set of vertices and E is a set of edges that represent the interactions between the vertices. A community in a network is defined as a group of vertices that have more edges among themselves than to vertices outside the group. Generally, a community is a functional grouping of entities that exhibit certain generic characteristics. For instance, a community in a citation network can represent literature on the protein folding problem, and a community in a food web can represent feeding relationships between species within an ecosystem. This widespread applicability has made community structures a mainstay research topic.

Researchers from various disciplines have approached the community structure problem through eigenvectors and sparse matrix formulations [7], algebraic connectivity [8], partitioning techniques [9], the small world effect [10], parameterised linear programming [11] and many other methods. These approaches provide better results, but require global information about the graph and hence are computationally expensive. A recent work [12] introduced the concept of local community structure, which detects the community given a source vertex and local information. Subsequently, several methods have been proposed to reduce the complexity of this task; however, each has limitations. For instance, the methods proposed in [13, 14] are specific to particular cases of portal browsing, or they require a minimal connected initial topology to promote the growth of a local community structure. The methods illustrated in [15–17] expect some degree of global information to ascertain partitions that are not easily distinguishable. The shell-spreading method in [18] performs well only if the source vertex is in the middle of the enclosed community. The measure of modularity defined in [12] considers those vertices that reside on the boundary of a subgraph. An important issue in these approaches is that they lack the ability to self-adjust the community structure when vertices are added to it. Furthermore, they cannot capture changes that occur in highly dynamic networks where patterns of interaction change constantly.

IET Syst. Biol., 2009, Vol. 3, Iss. 4, pp. 266–278, doi: 10.1049/iet-syb.2007.0061. © The Institution of Engineering and Technology 2009
In this paper, we propose a novel algorithm to detect local community structure using iterative agglomeration and local optimisation. The algorithm is self-adjusting and performs well in static as well as dynamic networks. It optimises the local community structure by selecting the vertices that are close enough to form a clique. As described in later sections, the measure of 'cliqueness' is calculated using the available local information of vertex and community rankings. We evaluate the efficiency of the proposed algorithm on computer generated dynamic and sparse networks with different source vertex positions. The analysis of the results shows that the proposed algorithm is very effective in addressing a wide variety of social [19] and biological network problems [20, 21]. We also evaluate the performance of our algorithm on two real-world networks: the Zachary Karate Club and NRC/MASC receptor complexes. In the Zachary Karate Club, 94% of club members are correctly identified within their original communities, regardless of source vertex position (including border vertices). These results are superior to the previously published results in [3, 18], which require the source vertex to be in the middle of an enclosed community. Another experiment, mapping protein interactions in NRC/MASC receptor complexes, yields two big clusters enclosing local communities of motifs, with a 92% accuracy level in the functional elements of the core MASC proteins. The rest of the paper is organised as follows: Section 2 presents the proposed algorithm. The evaluation and benchmarking of the algorithm on computer generated and real-world networks is illustrated in Section 3. Section 4 discusses some suggestions to further optimise the generated communities. Finally, Section 5 concludes the paper.

2 The proposed algorithm

2.1 Overview
The local community structure can be formally defined as follows: given an undirected graph G = (V, E) and a source vertex s ∈ V, in the absence of global knowledge, the graph G is explored one vertex at a time to form a community G′ = (V′, E′), where G′ is a subgraph of G and s is more highly connected within V′ than to any vertex in V \ V′. The quantitative measure of local community structure must be independent of the global topology. Based on the available local information, we crawl G one vertex at a time starting from the source vertex s. The method attempts to quantify the patterns and interactions that result in the formation of a clique. The optimisation function ranks a subset of vertices, drawn from a large group of unknown size, that are more closely related to each other than to the remaining vertices of the group. The proposed algorithm has three main steps:

1. Initially, a series of random walks is performed to generate an initial set of communities C.
2. The set of communities and their vertices obtained in step 1 are ranked using local optimisation functions, which are defined later in this section.
3. Based on these ranks, vertices are exchanged between the communities to optimise the formation of the local community structure.

The algorithm stops when the local optimisation function cannot be improved any further or when the size of the highest ranked community reaches a user-defined value k. Finally, the algorithm returns this highest ranked community as the solution.
2.2 Algorithm description
The algorithm begins with a series of random walks from the source vertex s and explores unknown portions of the graph G to generate an initial set of communities. These communities are formed as subsets of the vertices and edges of a graph G of unknown size. Each vertex of a community is assigned a value γ, which is defined as its degree divided by the number of its immediate neighbours present in the same community. Within a community, the enclosed vertices are positioned with respect to γ. This implies that a vertex with a larger γ has a higher probability of discovering immediate neighbours in other communities; such vertices are called active vertices. It should be noted that, because of the inherent nature of the random walk algorithm, a vertex can be included in more than one community, but with different γ values. Similarly, each community is assigned a value Γ, which is calculated as the average of the γ values of its enclosed vertices. The communities are positioned with respect to their Γ values. This reflects two important aspects: (i) communities with a large Γ value are more active in forming a local community structure; (ii) communities with a small Γ value are either proclaimed local community structures or insufficient in terms of the basic requirement. Thus, a community C should either siphon off its active vertices or acquire the immediate neighbours of its active vertices to yield a local community structure.

In order to accomplish this, we perform iterative agglomeration on these communities in decreasing order of their Γ values. The agglomeration takes place through the exchange of active vertices among the higher ranked communities. The vertices that belong to these communities have a higher probability of discovering their immediate neighbours in other communities. An exchange changes a vertex's neighbourhood, and consequently its γ value. This in turn changes the ranking of the communities. Over successive iterations, vertices come closer to satisfying their neighbour requirements. In doing so, their respective γ values decrease, which further contributes to the creation of a local community structure. This process continues until the algorithm has agglomerated k vertices (k is given as input) or has discovered an entire local community structure. Fundamentally, the algorithm builds a community structure by clustering together a pattern of highly interactive vertices.

In the following subsections, we first discuss the random walk algorithm that generates an initial set of communities. Then we discuss a method to find vertex and community positions and give formulae to calculate their respective rankings. We also discuss the underlying rules for exchanging active vertices among the higher ranked communities. Finally, the dynamic optimisation to achieve community formation in dynamic networks is presented.

2.3 Random walk
There exists a variety of random walk algorithms based on different parameters such as Euclidean commute time distance [22], reversible Markov chains [23], timing parameters [24] and available global information [25]. However, these methods do not meet the requirements for computing a local community structure, for two main reasons: (i) the source vertex position used to initialise the local community structure and (ii) the minimum convergence time to form a community based on the available local information. Hence, we propose a random walk based on vertex degree distribution, which meets all the basic requirements to form a local community. The basic assumption of the proposed random walk is that the probability of exploring a vertex is proportional to its degree distribution. The random walk on the graph G is performed in order to generate an initial set of communities. At first, the given source vertex s is labelled and added to a community C_i. Then, one of its immediate neighbouring vertices v is selected with probability P_s = 1/d(s), where d(s) is the degree of the source vertex s. The selected vertex v is labelled and added to the community C_i. Continuing in this fashion, subsequent vertices are added to the same community until a previously labelled vertex is encountered (hence, a backward walk is not possible).

Such random walks are performed iteratively to generate a set C of initial communities. Let C = {C_1, C_2, ..., C_h} be the set of initial communities generated through a series of random walks, where |C_i| ≤ l for all i ∈ {1, ..., h}. There are two different ways of deciding the total number of communities to generate: (i) a user-provided value h and (ii) termination when the same random walk is generated twice. The first option is useful when the user wants to restrict the number of generated communities. With the second option, the algorithm maintains unique hashes (e.g. md5 hashes) of the random walks performed. A hash is generated from the unique set of nodes and edges encountered in a walk. If the same hash is generated twice, the random walk algorithm is terminated and the number of communities generated up to that iteration is termed h. Fig. 1 shows a number of communities generated with the user input option for the source vertex s = 34. Furthermore, this finite Markov chain random walk is also capable of reaching significantly obscure locations of a graph. Vertices that are not part of any community are considered to be leftover vertices. Later on, these leftover vertices form a community, which is checked against a classified local community for any possible exchange of vertices. Pseudo-code for the user input option is given in Algorithm 1 (Fig. 2).

Algorithm 1 can be modified to implement the hash-based generation of initial communities. Instead of taking a user-provided value h, the algorithm maintains hashes of the random walks generated. After every walk, the new hash is compared with the existing hashes, and if a match is found the algorithm terminates.
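The walk described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the adjacency-dictionary input, the function names (`random_walk_community`, `walk_hash`, `initial_communities`), the `max_len` argument standing in for the size bound l, and the seeded generator are all hypothetical choices made for the example.

```python
import hashlib
import random


def random_walk_community(adj, source, max_len, rng):
    """One walk from `source`; returns the labelled vertices in visit order."""
    community = [source]
    labelled = {source}
    current = source
    while len(community) < max_len:
        neighbours = adj[current]
        if not neighbours:
            break
        # Each neighbour is chosen with probability 1/d(current).
        nxt = rng.choice(neighbours)
        if nxt in labelled:
            # A previously labelled vertex was encountered: the walk ends.
            break
        labelled.add(nxt)
        community.append(nxt)
        current = nxt
    return community


def walk_hash(community):
    """md5 digest of the vertex set and undirected edges seen in one walk."""
    vertices = sorted(set(community))
    edges = sorted(tuple(sorted(e)) for e in zip(community, community[1:]))
    return hashlib.md5(repr((vertices, edges)).encode()).hexdigest()


def initial_communities(adj, source, max_len, h=None, seed=0):
    """Generate h walks, or (h=None) stop when a walk hash repeats."""
    rng = random.Random(seed)
    seen, communities = set(), []
    while h is None or len(communities) < h:
        walk = random_walk_community(adj, source, max_len, rng)
        if h is None:
            digest = walk_hash(walk)
            if digest in seen:
                break  # same walk generated twice: terminate
            seen.add(digest)
        communities.append(walk)
    return communities
```

With `h` given, the function mirrors the user input option; with `h=None`, it follows the hash-based stopping rule, so the returned list length plays the role of h.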
Figure 1 Four communities generated from random walks. Vertex 34 is the source vertex s, as used in the Zachary Karate Club experiment

Figure 2 Random walk algorithm based on vertex degree distribution

2.4 Local optimisation
The classification that partitions a graph into local community structures can be interpreted as optimising a quantitative notion of a community structure [12]. In the case of local community structures, the biggest challenge is the lack of global knowledge of the network. Therefore any optimisation must rely only on the available local information. Here, the local information of a community is the edge connectivity of its enclosed vertices and the knowledge of their immediate neighbours. Thus, from the edge connectivity (adjacency matrix) it can be determined how many immediate neighbours of a vertex are within its own community and how many are in some different community. Furthermore, if two or more such communities share their local information, it can be inferred that a local community structure can be discovered. During this process, a vertex looks out for its immediate neighbours in other communities; it then either joins the community of one of its immediate neighbours or asks its immediate neighbours to join its own community. This results in the formation of a local community structure. Thus, local optimisation is defined as a process to discover community structure based on the exchange of available local information between communities.

To further understand local optimisation, we define two formulae: (i) vertex ranking and (ii) community ranking. In vertex ranking, the vertices of a community are ranked on the basis of their available local information related to their neighbour degree requirements. In community ranking, a community is ranked based on its state and its capability to transform itself into a local community structure. Let us define these formulae.

2.4.1 Vertex ranking: From the discussion it is clear that the vertex exchange between communities takes place on a one-to-one basis. We therefore define criteria to optimise the processing time and reduce the number of vertex-to-vertex comparisons. Note that the purpose of maximising the local community structure is to identify a set of vertices that are close enough to form a clique. However, identifying a clique is an NP-complete problem [26], even when the network's global information is known. To solve this problem, we first define a vertex rank function to rank all the vertices explored during the random walk. The rank of a vertex decides its position within its community and is the basis on which vertices are agglomerated with a high probabilistic coefficient to form a clique. Let C = {C_1, C_2, ..., C_h} be the set of communities generated from the random walk. Each community C_i contains a set of vertices V_{C_i} = {v_1, v_2, ..., v_l}. The vertex rank is defined as

\[ \gamma_{ji} = \frac{2 d_{ji}}{k_{ji}\,(k_{ji} - 1)} \]

where j is a vertex in a community C_i, d_ji is the degree of vertex j and k_ji is the number of its immediate neighbours that are present in the same community. An active vertex within a community is a vertex that has the greatest γ value. This implies that such an active vertex has the highest probability of finding more neighbours if it is transferred to some other community. This further explains the exchange of vertices from one community to another as a way to mutually decrease the value of γ, which in turn brings all the neighbouring vertices together to form a local community structure. Pseudo-code for this process is given in Algorithms 2 and 3 (Figs. 3 and 4). The algorithm performs multiple iterations to agglomerate a set of vertices enclosed within a local community structure. Over these iterations, the algorithm self-adjusts the community by adding neighbouring vertices from several other communities and by siphoning off the vertices that may belong to some other community. Finally, the algorithm outputs an assorted set of vertices that are close enough to form a clique.

Figure 3 Pseudo-code to find local community structures

Figure 4 Pseudo-code to compute possible exchange of vertices
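As a small worked illustration of the vertex rank, the sketch below computes the rank of each vertex of one community from an adjacency dictionary and picks the active vertex. It assumes the denominator reads k(k − 1); the function names and the choice to return infinity when a vertex has fewer than two neighbours inside the community (where the formula is undefined) are assumptions made for this example, not part of the paper.

```python
import math


def vertex_rank(adj, community, v):
    """Vertex rank 2*d / (k*(k-1)) of v with respect to `community`."""
    d = len(adj[v])                               # degree of v in the graph
    k = sum(1 for u in adj[v] if u in community)  # neighbours inside community
    if k < 2:
        # Assumption: the formula is undefined for k < 2, so treat the
        # vertex as maximally active.
        return math.inf
    return 2.0 * d / (k * (k - 1))


def active_vertex(adj, community):
    """The vertex with the greatest rank in the community."""
    return max(community, key=lambda v: vertex_rank(adj, community, v))


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
community = {0, 1, 2}
ranks = {v: vertex_rank(adj, community, v) for v in sorted(community)}
```

Here vertices 0 and 1 have d = 2 and k = 2, giving rank 2.0, while vertex 2 has d = 3 and k = 2, giving rank 3.0, so vertex 2 is the active vertex: it is the only one with a neighbour outside the community.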
2.4.2 Community ranking: A vertex's neighbour requirement is local to its community. For a vertex to look out for its potential neighbours, it must have knowledge of other communities. From an optimisation point of view, however, exploring all the generated communities is infeasible. Hence, we derive a quantitative notion of a community that allows neighbour lookup to be performed selectively for its enclosed vertices. The rank of a community is defined as the average γ value of its enclosed vertices. Thus the community rank Γ_i of a community C_i is defined as

\[ \Gamma_i = \frac{\sum_{v_{ji} \in C_i} \gamma_{ji}}{|C_i|} \]
From the above equation it is clear that the Γ value of a community is directly proportional to the sum of the γ values of all the vertices enclosed in that community. From the discussion on vertex rank, we can further conclude that a community is relatively more active if it encloses more active vertices. Such active communities are preferred for neighbour lookup and possible exchange of their vertices.
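The community rank is simply the mean of the vertex ranks, so ordering communities by activity takes only a few lines. The sketch below assumes the vertex ranks have already been computed and are passed in as lists; the names `community_rank` and `order_by_activity` and the example values are hypothetical.

```python
def community_rank(gammas):
    """Community rank: the mean of the vertex ranks of one community."""
    return sum(gammas) / len(gammas)


def order_by_activity(gamma_by_community):
    """Return community names sorted from most to least active, plus ranks."""
    ranks = {name: community_rank(g) for name, g in gamma_by_community.items()}
    order = sorted(ranks, key=ranks.get, reverse=True)
    return order, ranks


# Illustrative vertex ranks for two communities (made-up values).
order, ranks = order_by_activity({"C1": [2.0, 2.0, 3.0], "C2": [1.0, 1.0]})
```

In this example C1 has rank 7/3 and C2 has rank 1.0, so C1 is the more active community and would be considered first for neighbour lookup.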
2.4.3 Mutual exchange: The mutual exchange of vertices among communities is an act to mutually decrease the Γ values of these communities. At any point in time, the community with the lowest Γ value is the fittest community and is considered a candidate to become the local community structure. The mutual exchange of vertices between a pair of active communities is performed in decreasing order of their Γ values. When a vertex is added to a new community, it has more neighbours than it had in its previous community. Thus this vertex is closer to satisfying its immediate neighbour requirement. It is very likely that this vertex was originally a part of this new community. During the exchange, if common edges match, both vertex and community rankings decrease with the movement of vertices from one community to another. The vertices are then grouped together and their new rankings are recalculated. Similarly, the ranks of the communities are recalculated with respect to their new sets of vertices.

The mutual exchange criterion for a pair of communities is defined as follows. Consider the two top ranked communities C_A and C_B such that Γ_A ≥ Γ_B, and a vertex v_a ∈ C_A with value γ_a.

1. An exchange of v_a exists if

\[ \gamma'_a < \gamma_a \]

where γ′_a is the new γ value calculated for v_a considering its neighbouring vertices enclosed in community C_B.

2. For C_A and C_B, a vertex exchange occurs if

\[ \mathrm{Avg}(C'_A, C'_B) < \mathrm{Avg}(C_A, C_B) \]

where

\[ \mathrm{Avg}(C_A, C_B) = \frac{\Gamma_A + \Gamma_B}{2} \]

2.4.4 Dynamic optimisation: Previously published work on local community structure does not allow any change in the already formed intermediate communities [12–17, 27]. Although this approach is computationally inexpensive, it might not produce an optimised community. In the current approach, using dynamic optimisation, whenever a vertex is exchanged between two communities, its ranking is recalculated with respect to its immediate neighbours (if any) in the new community. Thus the iteration always concludes with the most recent value of γ for the vertices belonging to that community. As illustrated in Algorithm 2, in each iteration the algorithm calculates the maximum average reduction for the individual communities C_A = {v_a1, v_a2, ..., v_al_a} and C_B = {v_b1, v_b2, ..., v_bl_b} and the vertex to be exchanged. The algorithm compares their respective averages before the transfer of their vertices, based on the conditions of the mutual exchange scheme. When a vertex v ∈ C_A is transferred to C_B, its γ value changes, which in turn changes the value of Γ for both C_A and C_B. The changed values are calculated as follows:

1. For vertex v, the changed ranking value is calculated as

\[ \gamma'_v = \frac{2 d_v}{k'_v\,(k'_v - 1)} \]

where k′_v is the number of neighbours enclosed in the new community. The same holds for all other vertices.

2. The changed ranking value of C_A is calculated as

\[ \Gamma'_A = \frac{\sum_{v_{ai} \in C_A} \gamma_{ai}}{l'_A} \]

where l′_A is the number of vertices after v_a is removed.

3. The changed ranking value of C_B is calculated as

\[ \Gamma'_B = \frac{\sum_{v_{bi} \in C_B} \gamma_{bi}}{l'_B} \]

where l′_B is the number of vertices after v_a is added.
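The two exchange conditions and the recalculated values can be put together in a short sketch. This is an illustration under several assumptions rather than the authors' Algorithms 2 and 3: the vertex rank uses the denominator k(k − 1), a vertex with fewer than two neighbours inside its community is given an infinite rank, the ranks of all remaining vertices are recomputed after a tentative move, and the names `gamma`, `big_gamma` and `try_exchange` are hypothetical.

```python
import math


def gamma(adj, community, v):
    """Vertex rank of v with respect to `community` (inf if k < 2, assumed)."""
    k = sum(1 for u in adj[v] if u in community)
    if k < 2:
        return math.inf
    return 2.0 * len(adj[v]) / (k * (k - 1))


def big_gamma(adj, community):
    """Community rank: mean vertex rank over the community."""
    return sum(gamma(adj, community, v) for v in community) / len(community)


def try_exchange(adj, c_a, c_b, v):
    """Move v from c_a to c_b if both exchange conditions hold.

    Returns (new_c_a, new_c_b, moved).
    """
    new_a, new_b = c_a - {v}, c_b | {v}
    if not new_a:
        return c_a, c_b, False  # assumption: never empty a community
    # Condition 1: the vertex rank of v decreases in the new community.
    cond1 = gamma(adj, new_b, v) < gamma(adj, c_a, v)
    # Condition 2: the average of the two community ranks decreases.
    avg_before = (big_gamma(adj, c_a) + big_gamma(adj, c_b)) / 2
    avg_after = (big_gamma(adj, new_a) + big_gamma(adj, new_b)) / 2
    if cond1 and avg_after < avg_before:
        return new_a, new_b, True
    return c_a, c_b, False


# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
a, b, moved = try_exchange(adj, {0, 1, 2, 3}, {4, 5}, 3)
```

Starting from a walk that wrongly placed vertex 3 with the first triangle, the exchange moves it to the second one, after which both communities have rank 7/3. A further attempt to move vertex 2 across is rejected because its rank would increase.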
3 Experiments
3.1 Benchmarking
In this section, we use the benchmarking criteria illustrated in [28] to provide an objective comparison with existing methods of finding local community structures. Initially, a Barabási–Albert graph G = (V, E) with |V| = 512 and e_0 = 8 is created, which is then randomly partitioned into four reference communities containing a nearly equal number of vertices. Every vertex in these communities has an average degree z = z_in + z_out = 16. Clearly, a small z_out indicates a strong community structure. Finally, vertices are rewired through a pair of edges within the same communities without changing their degree distribution.
3.1.1 Evaluation: The algorithm generates a community C and a non-community V \ C of vertices. For comparison purposes, the reference partition is termed P_R = {C_R, C′_R} and the found partition P_F = {C_F, C′_F}. The evaluation criterion uses the normalised mutual information (NMI) [29–31] to measure the similarity between P_R and P_F. In NMI, a confusion matrix N is created, with its columns corresponding to the found communities and its rows to the reference communities. The similarity measure is defined as follows:

\[ I(P_R, P_F) = \frac{-2 \sum_{i=1}^{C_R} \sum_{j=1}^{C_F} N_{ij} \log\!\left( \frac{N_{ij} N}{N_{i\cdot} N_{\cdot j}} \right)}{\sum_{i=1}^{C_R} N_{i\cdot} \log\!\left( \frac{N_{i\cdot}}{N} \right) + \sum_{j=1}^{C_F} N_{\cdot j} \log\!\left( \frac{N_{\cdot j}}{N} \right)} \]

where N_ij is the number of nodes in the reference community i that appear in the found community j, C_R is the number of reference communities and C_F is the number of found communities. N_i· is the sum over row i, N_·j is the sum over column j and N is the total number of nodes. Thus, I(P_R, P_F) gives a quantitative measure of how similar the found community is to the reference community. A score of 1 shows that both communities are identical and a score of 0 that they are totally independent.
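The NMI score can be computed directly from a confusion matrix. The sketch below is a plain-Python illustration of the standard formula; the function name `nmi`, the list-of-lists input and the convention of returning 1.0 when both partitions consist of a single community (where the denominator vanishes) are assumptions made for the example.

```python
import math


def nmi(confusion):
    """Normalised mutual information from a confusion matrix.

    confusion[i][j] is the number of nodes of reference community i
    that appear in found community j.
    """
    n = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]

    numerator = 0.0
    for i, row in enumerate(confusion):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # terms with N_ij = 0 contribute nothing
                numerator += n_ij * math.log(
                    n_ij * n / (row_sums[i] * col_sums[j]))

    denominator = (
        sum(r * math.log(r / n) for r in row_sums if r > 0)
        + sum(c * math.log(c / n) for c in col_sums if c > 0))
    if denominator == 0:
        return 1.0  # assumed convention: both partitions are trivial
    return -2.0 * numerator / denominator


identical = nmi([[5, 0], [0, 5]])      # found partition equals the reference
independent = nmi([[5, 5], [5, 5]])    # found partition carries no information
```

The two example matrices reproduce the end points described above: a diagonal confusion matrix scores 1 and a uniform one scores 0.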
As shown in Figs. 5 and 6, the algorithm performs very well, with results comparable to those reported in [28]. It correctly identifies the communities for small values of z_out, with a decrease in accuracy as the values of z_in and z_out become equal. In Fig. 6, the algorithm performs very well with a large number of rewirings, and its accuracy decreases as fewer edges are rewired, which indicates that the communities are not sharply separated at a lower rewiring count.
3.2 Computer generated experiments
The following section illustrates the experimental setup and the application of the proposed algorithm to problems related to socio-physics and computational biology. To verify that the algorithm can identify and recognise the boundaries of a local community, we first apply the proposed algorithm to computer generated graphs. The experimental random graphs, which have well-known community structures, are generated using VGJ (Visualizing Graphs with Java, April 1998), a tool for graph drawing and graph layout, and the Matlab Bioinformatics toolbox. As previously mentioned, the proposed algorithm can also be used in dynamic networks. To verify this scenario, we artificially add and remove vertices on the fly during the execution of the algorithm. A graph is constructed with n = 120 vertices, divided into four known local community structures. Edges are placed independently at random between vertex pairs such that the expected average degree of each vertex is equal to 10. We can tune the sparseness of the graph by varying the average degree of the vertices. Fig. 7 shows a dendrogram generated by this algorithm. It contains a community of n = 30 vertices. As can be seen, all the vertices are identified, with a major split at the top of the tree. Fig. 8 shows the corresponding value of Γ against the number of iterations required to achieve it. Here we cannot put a bound on the lower limiting value of Γ, as it is related to the average degree of all vertices. However, our experiments have shown that any value 0 < Γ ≤ 1 is an indication of a good community.
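A test graph of this kind can be generated without any graph library. The sketch below is not the VGJ or Matlab setup used in the experiments; it is a plain planted-partition generator in which the split of the expected degree 10 into 8 intra-community and 2 inter-community edges (`z_in`, `z_out`), the equal community sizes and the function name `planted_partition` are all assumptions chosen for illustration.

```python
import random


def planted_partition(n=120, groups=4, z_in=8.0, z_out=2.0, seed=0):
    """Random graph with `groups` equal communities.

    Each vertex has expected degree z_in + z_out. Assumes n is divisible
    by `groups`. Returns (adjacency dict of sets, community label list).
    """
    rng = random.Random(seed)
    size = n // groups
    group_of = [v // size for v in range(n)]
    p_in = z_in / (size - 1)    # edge probability inside a community
    p_out = z_out / (n - size)  # edge probability between communities
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if group_of[u] == group_of[v] else p_out
            if rng.random() < p:  # edges are placed independently
                adj[u].add(v)
                adj[v].add(u)
    return adj, group_of


adj, group_of = planted_partition()
average_degree = sum(len(nb) for nb in adj.values()) / len(adj)
```

Lowering `z_in + z_out` makes the graph sparser, and adding or deleting entries of `adj` between iterations is one simple way to emulate vertices arriving or leaving on the fly.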
Figure 5 Performance analysis of the algorithm for a 128-node network, averaged over 1000 realisations. The algorithm performs well for low z_out, indicating strongly separated communities, and decreases in accuracy for higher values, where community separation blurs. The error bars show the deviation from the mean value

Figure 6 Performance of the algorithm on a 512-node rewired network, averaged over 1000 realisations. For a large number of rewirings the algorithm performs well, but its accuracy gradually decreases as fewer edges are exchanged. The error bars show the deviation from the mean value
Fig. 9 shows the result of an experiment on graph sparseness. A bar chart illustrates the percentage of vertices of a community identified correctly against a varying average vertex degree. The algorithm performed quite well for an average degree of d = 6 and above, identifying almost 92% of vertices correctly. For a low average degree (d ≤ 4), the results deteriorate to a level of 75%. This result clarifies an important aspect of the random walk: the initial solution generated from the random walk contains a significant number of vertices that contribute to other communities. As the sparseness level of a graph increases, the random walk traverses these non-localised parts of the community.

The dynamic property of our algorithm was also evaluated, as shown in Fig. 10. For a known community of n = 43 vertices, we feed the algorithm with vertices on the fly during its execution. These externally added vertices are not part of any initial solution generated from the random walk. The spikes (squares) show a temporary increase in the value of Γ when a set of vertices is fed to the network. A comparison is made against a static community (circles) with the same set of vertices. In the static community, all the vertices were part of the initial solution generated from the random walk. At the end of execution, the value of Γ converges to a common point for both communities. This is expected, given that the set of vertices is eventually the same in both communities.

Fig. 11 presents a comparative study of local community convergence when varying the position of the source vertex. Several runs of the proposed algorithm on a constant set of n = 120 vertices are carried out for a known community of n = 43 vertices. In each run, a different source vertex s is chosen to carry out the community formation. For clarity, only five different source vertices are illustrated, with their respective values of Γ against the number of iterations before they converge to a local community. As revealed in Fig. 11, for all source vertices the final communities converge to an average value of Γ = 0.82, signifying a good quality of local structures, assuming all communities contain the same set of vertices. Fig. 12 presents an execution time comparison of the various local and global community-based algorithms that are referenced in this paper. As can be seen, local information-based algorithms have better execution times than global information-based algorithms. Also, among the local community structure methods, the proposed method outperforms the others. From these results, we conclude that the proposed algorithm effectively identifies the local community structure in static and in dynamic networks.

Figure 7 Dendrogram for a computer generated local community structure. A community of 30 vertices is identified correctly for a mean value of Γ = 0.86. Numbers at the bottom denote the identifiers of the vertices that belong to that community. The single line at the top corresponds to the formation of a complete community

Figure 8 Average Γ value against number of iterations. When the number of iterations of the algorithm reaches 30, a value of Γ = 0.86 signifies that a local community structure has formed with all vertices classified correctly

Figure 9 Bar graph showing the efficiency of the algorithm at various levels of graph sparseness. The algorithm performed quite well when the average degree of the community is d = 6

Figure 10 Relative comparison of static and dynamic community Γ values for the same set of vertices. The two curves show the average value of Γ (y-axis) against the number of vertices in the communities (x-axis). As expected, the dynamic community (squares) shows spikes in its Γ value when vertices are added on the fly. The static community (circles), which is based on initial random solutions, shows a steady decrease in its Γ value
3.3 Application to real networks In this section, we evaluate the proposed algorithm on the realworld social and biological networks. The first example is taken from Zachary karate club in which a factional division led to a formal separation of the club into two organisations. It was earlier studied by Bagrow and Bellt [18] and [32] with different classes of hierarchical clustering methods. Next, we apply the proposed algorithm on a biological example to map protein interaction to create a network representation of neurotransmitter receptor complexes. 273
Figure 11 Effect on the value G (y-axis) against the number of iterations (x-axis) when s is chosen differently. The merging of the curves at the bottom right denotes convergence to a local community structure with nearly the same G value. Almost 95% of vertices are correctly identified despite varying source vertices
Figure 12 Comparison of the execution time with a varying number of nodes in the network. The algorithms used are in the following order: [13–16], the proposed algorithm and [17]. As can be seen, local information-based community structure algorithms perform considerably better than global information-based algorithms. Also, among the local algorithms, the proposed method performs better than the other referenced local community structure algorithms
3.3.1 Zachary Karate Club: This example is built on the Zachary Karate Club [33], which presents data from a university-based karate club with n = 34 vertices, in which a factional division led to a formal separation of the club into two organisations. The input consists of 34 vertices representing members of one big community. A disagreement developed between the administrator and the instructor of the club, which resulted in the instructor leaving and starting a new club, taking about half of the original club members. We apply the proposed algorithm in an attempt to identify the factions and the members involved in the split. Fig. 13 shows the original karate club, where circles represent members associated with the administrator’s faction and squares represent the instructor’s faction. To justify the correctness of the proposed algorithm, we run the experiment for a distinct pair of source vertices, one from each local faction. Considering s = 22 as a border vertex and s = 34 as a centre vertex with h = 25, we fed this network to the proposed algorithm. Fig. 14 shows the resulting communities. Except for vertices nine and ten, the algorithm correctly identifies the communities; the two exceptions can be justified, as both vertices are equally linked to both factions, with the same number of edges outward to both communities. The graph in Fig. 15 shows a steady convergence in the G values of the formed communities. It can also be interpreted that border vertices bring a sharp decrease in the value G of the community structure. With an average G = 0.51 for s = 22 and G = 0.65 for s = 34, an accuracy of 94% is achieved, which is superior to the previous results of [13, 18].
Figure 13 Actual breakdown of Zachary Karate Club [32]
Figure 14 Two local factions in Zachary Karate Club generated using the proposed algorithm. Circles represent the community with s = 34 and squares represent the community with s = 22
Figure 15 Zachary Karate Club convergence graph for the identification of communities. The graph shows the convergence of communities taking s = 22 and 34, for h = 25. The x-axis shows the number of times the iterative agglomeration is performed
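The reported 94% accuracy for the karate club is consistent with 32 of the 34 members being placed in their actual faction, that is, with vertices nine and ten counted as the only errors. The short Python sketch below reproduces that arithmetic. The function name `accuracy` and the use of a per-vertex fraction as the accuracy measure are assumptions made for illustration, since the exact computation is not spelled out in this section.

```python
def accuracy(total_vertices, misplaced):
    """Fraction of vertices assigned to their actual faction (assumed measure)."""
    return (total_vertices - len(misplaced)) / total_vertices


# Zachary Karate Club: 34 members; vertices 9 and 10 are the reported exceptions.
karate_accuracy = accuracy(34, {9, 10})
print(f"{karate_accuracy:.1%}")  # prints 94.1%
```

Under this reading, each additional misplaced member would lower the accuracy by roughly three percentage points, which is why two ambiguous border vertices account for the whole gap to 100%.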
3.3.2 Neurotransmitter receptor complexes: Mapping protein interactions to create a network representation of the complexes helps reveal many important biological functions. Here we explore the organisation and function of the N-methyl-D-aspartate receptor complex/MAGUK-associated signalling complex (NRC/MASC) [27] using the local community approach. Using information on the function, interaction patterns and phylogeny of each protein, we are able to draw certain inferences on the structural and functional aspects of synapse complexes. The earlier annotation study only considers the list of components and does not take into consideration their interaction patterns and organisation. When represented as an undirected graph, the input consists of 101 protein vertices connected by 246 interaction edges. This constitutes the core functional elements of MASC proteins. It links together all glutamate receptors and a high proportion of the signal transduction machinery responsible for the reception and integration of calcium and G-protein-coupled synaptic signalling. Given 13 different sparse and dense collections of network communities, we focused our work on the identification of ionotropic glutamate receptor proteins, which also contain a number of PDZ/DHR/GLGF scaffolding molecules and the Ser/Thr kinase PKA, known as an integrator of signals in synaptic plasticity, along with some traces of tyrosine kinases and SH2 motif proteins. Fig. 16 shows the results of the application, forming two big clusters of local communities of motifs. The algorithm is able to identify more than 92% of them correctly, for an average G = 0.81, as compared to the original community structure mentioned in [33]. The two clusters in Fig. 16 show that the proteins perform fast information sharing and response coordination, first within themselves and then with other module members. As inferred, the clusters of proteins around ionotropic glutamate receptors and tyrosine protein kinase form the primary sites for signal reception, regulating effector mechanisms and vesicular trafficking [33]. They are similar in nature and influence a particular set of functional processes specific to their structure and individual behaviour. Thus, the formed grouping shows communities with significant overlap in function and phenotypic annotations. This conveys an important result: irrespective of the type of network, the proposed algorithm has performed consistently well across a wide range of applications.
Figure 16 NRC/MASC analysis. Communities generated for neurotransmitter receptor complexes
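For a sense of how sparse the NRC/MASC input is, the average degree and edge density follow directly from the 101 vertices and 246 edges stated above. The Python sketch below applies the standard formulas for an undirected graph. The variable names are illustrative, and treating the network as simple (no self-loops or parallel edges) is an assumption.

```python
n_vertices = 101  # proteins
n_edges = 246     # interactions

# Each undirected edge contributes to the degree of two vertices.
avg_degree = 2 * n_edges / n_vertices

# Density: observed edges divided by the number of possible vertex pairs.
density = n_edges / (n_vertices * (n_vertices - 1) / 2)

print(f"average degree = {avg_degree:.2f}, density = {density:.3f}")
```

This gives roughly 4.9 interaction partners per protein and about 5% of all possible pairs connected, so any local community found in this graph is dense only relative to a sparse background.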
4 Further improvements
Once the agglomeration process is complete, additional steps can be performed to further optimise the discovered community. First, successive merging of communities groups small communities to form one big enclosed community (when l < k). By performing this process, we are also able to avoid possible local minima that are formed during community generation. Second, when agglomeration does not optimise the top-ranked communities, an exchange operation between complementary ranked communities can be performed instead.
• Community merging: A set of small enclosed communities (with l < k) can represent a big community, although a single vertex exchange among them does not optimise the solution. On the basis of vertex and community rank, the algorithm successively merges these communities together to form a bigger community.
• Complementary exchange: As discussed before, the top-ranked communities are considered for vertex exchange. Sometimes communities with a high value G do not necessarily generate an optimised solution. They remain divided among themselves and look for vertices that exist in comparably stable communities. During the course of the iterations, if an exchange does not generate a better solution, the algorithm marks such communities so that they take no part in future iterations, which avoids unproductive iterations. To further optimise the solution, the algorithm performs an exchange operation among complementary ranked communities with respect to their adjacency and value G.
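The community merging step can be illustrated with a minimal Python sketch. It is not the ranking procedure of the proposed algorithm: the quality measure `internal_fraction` (internal edges divided by internal plus boundary edges) is a stand-in for G, the greedy pairwise strategy replaces the vertex and community rank criteria, and reading l as the community size and k as a size bound is an assumption. All function and variable names are hypothetical.

```python
from itertools import combinations


def internal_fraction(community, adj):
    """Stand-in quality: internal edges / (internal + boundary edges). Not the paper's G."""
    internal = boundary = 0
    for u in community:
        for v in adj[u]:
            if v in community:
                internal += 1  # every internal edge is counted from both ends
            else:
                boundary += 1
    internal //= 2
    total = internal + boundary
    return internal / total if total else 0.0


def merge_small_communities(communities, adj, k):
    """Greedily merge pairs of communities smaller than k while quality improves."""
    communities = [set(c) for c in communities]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(communities)), 2):
            ca, cb = communities[a], communities[b]
            if len(ca) >= k or len(cb) >= k:
                continue  # only small communities are merge candidates
            best_alone = max(internal_fraction(ca, adj), internal_fraction(cb, adj))
            if internal_fraction(ca | cb, adj) > best_alone:
                communities[a] = ca | cb
                del communities[b]  # a < b, so index a stays valid
                merged = True
                break  # restart the scan on the updated list
    return communities


# Toy graph: a 4-clique {0, 1, 2, 3} joined to a triangle {4, 5, 6} by edge 3-4.
adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
result = merge_small_communities([{0, 1}, {2, 3}, {4, 5, 6}], adj, k=3)
print(result)
```

In this toy input the two halves of the clique each score poorly on their own (most of their edges cross the boundary), but their union keeps six of seven edges internal, so they are merged; the triangle is left alone because it is not smaller than k.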
5 Conclusion
In this paper, we introduced a novel approach to finding the local community structure when the complete information of a network is unknown. Using a random walk, we explored one vertex at a time and added it to the initial set of communities. We then introduced the concept of vertex and community rankings, which helped the iterative agglomeration technique to improve the initial set of communities. During the course of iterative agglomeration, vertices are transferred from one community to another on the basis of their rankings, in search of immediate neighbours. This process brings the high-ranked communities closer to forming a clique. Each iteration optimises the current set of communities until k vertices are explored or a local community structure is discovered. The algorithm can be extended to identify the global community structure using different source vertex positions. Computer-simulated experiments demonstrated the effectiveness of the algorithm in both static and dynamic networks. Using normalised mutual information, we benchmarked the proposed algorithm to quantify its accuracy in discovering local communities. Experiments performed on the Zachary Karate Club and neurotransmitter receptor complexes show the applicability of the proposed algorithm in a wide range of applications.
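As an illustration of the benchmarking measure mentioned above, the following Python sketch computes normalised mutual information between a known partition and a discovered one, using the confusion-matrix form commonly used for comparing community structures (as in [29]). Whether the experiments used exactly this normalisation is an assumption, and the function name and the convention of returning 1.0 when both partitions are a single group are illustrative choices.

```python
from collections import Counter
from math import log


def normalised_mutual_information(labels_a, labels_b):
    """NMI between two labelings of the same vertices (1.0 for identical partitions)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same vertices")
    n = len(labels_a)
    count_a = Counter(labels_a)               # row sums of the confusion matrix
    count_b = Counter(labels_b)               # column sums
    joint = Counter(zip(labels_a, labels_b))  # non-zero confusion matrix entries

    numerator = -2.0 * sum(
        n_ij * log(n_ij * n / (count_a[a] * count_b[b]))
        for (a, b), n_ij in joint.items()
    )
    denominator = (
        sum(c * log(c / n) for c in count_a.values())
        + sum(c * log(c / n) for c in count_b.values())
    )
    if denominator == 0:  # both partitions place every vertex in one group
        return 1.0
    return numerator / denominator


# Same grouping under different label names, and two unrelated groupings.
same = normalised_mutual_information([0, 0, 1, 1], ["x", "x", "y", "y"])
unrelated = normalised_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(same, unrelated)
```

The measure depends only on how vertices are grouped, not on the label names, so a discovered community can be compared with the known one without first matching labels.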
6 References
[1] GLOOR P., LAUBACHER R., OYNES S., ZHAO Y.: ‘Visualization of interaction patterns in collaborative knowledge networks for medical applications’. Proc. HCII, Crete, Greece, June 2003
[2] FORTUNATO S., LATORA V., MARCHIORI M.: ‘Method to find community structures based on information centrality’, Phys. Rev. E, 2004, 70, (5), p. 056104
[3] GIRVAN M., NEWMAN M.E.J.: ‘Community structure in social and biological networks’, Proc. Natl. Acad. Sci. USA, 2002, 99, p. 7821
[4] CHO S., PARK S.G., LEE D.O.H., PARK B.C.: ‘Protein-protein interaction networks: from interactions to networks’, J. Biochem. Mol. Biol., 2004, 37, (1), pp. 45–52
[5] DUNNE J.A., WILLIAMS R.J., MARTINEZ N.D.: ‘Food-web structure and network theory: the role of connectance and size’, Proc. Natl. Acad. Sci., 2002, 99, (20), pp. 12917–12922
[6] WASSERMAN S., FAUST K.: ‘Social network analysis’ (Cambridge University Press, Cambridge, 1994)
[7] POTHEN A., SIMON H.D., LIOU K.-P.: ‘Partitioning sparse matrices with eigenvectors of graphs’, SIAM J. Matrix Anal. Appl., 1990, 11, (3), pp. 430–452
[8] FIEDLER M.: ‘Algebraic connectivity of graphs’, Czech. Math J., 1973, 23, pp. 298–305
[9] KERNIGHAN B.W., LIN S.: ‘An efficient heuristic procedure for partitioning graphs’, Bell Syst. Tech. J., 1970, 49, pp. 291–307
[10] TRAVERS J., MILGRAM S.: ‘An experimental study of the small world problem’, Sociometry, 1969, 32, (4), pp. 425–443
[11] CHEN W.Y.C., DRESS A.W.M., YU W.Q.: ‘Checking the reliability of a new approach towards detecting community structures in networks using linear programming’, IET Syst. Biol., 2007, 1, (5), pp. 286–291
[12] CLAUSET A.: ‘Finding local community structure in networks’, Phys. Rev. E, 2005, 72, p. 026132
[13] ALMEIDA R.B., ALMEIDA V.A.F.: ‘Local community identification through user access patterns’, CoRR, 2002, cs.IR/0212045
[14] BARBOSA V.C., DONANGELO R., SOUZA S.R.: ‘Emergence of scale-free networks from local connectivity and communication trade-offs’, Phys. Rev. E, 2006, 74, p. 016113
[15] FARUTIN V., ROBISON K., LIGHTCAP E., ET AL.: ‘Edge-count probabilities for the identification of local protein communities and their organization’, Proteins, 2006, 62, (3), pp. 800–818
[16] PALLA G., DERENYI I., FARKAS I., VICSEK T.: ‘Uncovering the overlapping community structure of complex networks in nature and society’, Nature, 2005, 435, p. 814
[17] SPIRIN V., MIRNY L.A.: ‘Protein complexes and functional modules in molecular networks’, Proc. Natl. Acad. Sci. USA, 2003, 100, (21), pp. 12123–12128
[18] BAGROW J.P., BOLLT E.M.: ‘Local method for detecting communities’, Phys. Rev. E (Stat. Nonlinear Soft Matter Phys.), 2005, 72, (4), p. 046108
[19] DE SOLLA PRICE D.J.: ‘Networks of Scientific Papers’, Science, 1965, 149, pp. 510–515
[20] ITO T., CHIBA T., OZAWA R., YOSHIDA M., HATTORI M., SAKAKI Y.: ‘A comprehensive two-hybrid analysis to explore the yeast protein interactome’, Proc. Natl. Acad. Sci. USA, 2001, 98, (8), pp. 4569–4574
[21] KLEINBERG J.M., KUMAR R., RAGHAVAN P., RAJAGOPALAN S., TOMKINS A.S.: ‘The Web as a graph: Measurements, models and methods’, Lect. Notes Comput. Sci., 1999, 1627, pp. 1–17
[22] FOUSS F., PIROTTE A., SAERENS M.: ‘A novel way of computing similarities between nodes of a graph, with application to collaborative recommendation’. Proc. 2005 IEEE/WIC/ACM Int. Conf. Web Intelligence, WI ’05, 2005, pp. 550–556
[23] ALDOUS D., FILL J.A.: ‘Reversible Markov chains and random walks on graphs – chapter 9: a second look at general Markov chains’, 2002
[24] LOVÁSZ L.: ‘Combinatorics, Paul Erdős is eighty’ (János Bolyai Mathematical Society, Hungary, 1993)
[25] LATAPY M., PONS P.: ‘Computing communities in large networks using random walks’, ArXiv Condensed Matter e-prints, 2004
[26] GAREY M.R., JOHNSON D.S.: ‘Computers and intractability: a guide to the theory of NP-completeness (series of books in the mathematical sciences)’ (W.H. Freeman, January 1997)
[27] NEWMAN M.E.J., GIRVAN M.: ‘Finding and evaluating community structure in networks’, Phys. Rev. E, 2004, 69, p. 026113
[28] BAGROW J.P.: ‘Evaluating local community methods in networks’, J. Stat. Mech.: Theory Exp., 2008, 2008, (05), P. 05001 (16p.)
[29] DANON L., DÍAZ-GUILERA A., DUCH J., ARENAS A.: ‘Comparing community structure identification’, J. Stat. Mech.: Theory Exp., 2005, 2005, (09), P. 09008
[30] FRED A.L., JAIN A.K.: ‘Robust data clustering’, Comp. Vis. Pattern Recognit., IEEE Comp. Soc., 2003, 02, p. 128
[31] STREHL A., GHOSH J.: ‘Cluster ensembles – a knowledge reuse framework for combining multiple partitions’, J. Mach. Learn. Res., 2002, 3, pp. 583–617
[32] POCKLINGTON A.J., ARMSTRONG J.D., GRANT S.G.N.: ‘Organization of brain complexity – synapse proteome form and function’, Brief Funct. Genomic. Proteomic., 2006, 5, (1), pp. 66–73
[33] ZACHARY W.W.: ‘An information flow model for conflict and fission in small groups’, J. Anthropol. Res., 1977, 33, pp. 452–473
IET Syst. Biol., 2009, Vol. 3, Iss. 4, pp. 266– 278 doi: 10.1049/iet-syb.2007.0061