Mining and Visualizing the Evolution of Subgroups in Social Networks
Tanja Falkowski
Jörg Bartelheimer
Myra Spiliopoulou
University Magdeburg, Germany
[email protected]
University Magdeburg, Germany
[email protected]
University Magdeburg, Germany
[email protected]
Abstract
A social network consists of people who interact in some way, such as members of online communities sharing information via the WWW. To learn more about how to facilitate community building, e.g. in organizations, it is important to analyze the interaction behavior of their members over time. So far, many tools have been provided that allow for the analysis of static networks and some for the temporal analysis of networks, however only on the vertex and edge level. In this paper we propose two approaches to analyze the evolution of two different types of online communities on the level of subgroups: The first method consists of statistical analyses and visualizations that allow for an interactive analysis of subgroup evolutions in communities that exhibit a rather stable membership structure. The second method is designed for the detection of communities in an environment with highly fluctuating members. For both methods, we discuss results of experiments with real data from an online student community.
1. Introduction
Effective knowledge sharing within an organization is a serious concern as it directly influences the organization's performance. Since social networks can enhance knowledge sharing, organizations are interested in fostering successful networks such as online communities. It is therefore necessary to learn more about the organizational as well as technological infrastructure that may positively affect a community. For this, the temporal development of online communities needs to be analyzed to determine the factors that cause the dynamics and drive a certain development.

An online community is at each point in time defined by its current members. Thus, the evolution of communities over time is typically analyzed by observing changes in the interaction behavior of their members. This works well for rather stable communities with a considerable number of members who participate over a long time (core members) and a small number of fluctuating members. However, since communities can be highly dynamic social networks whose structure changes over time, we often observe another type of community in which members leave gradually while new ones join. In that case, the community is still there even if all of its original members have left.

To analyze the evolution of these two different types of communities, we propose different community detection and visualization methods. In the first approach we partition the time axis and build a graph of interactions. On this graph we apply a hierarchical divisive edge betweenness clustering algorithm to find subgroups of densely connected individuals. In a next step statistical measures are used to analyze the temporal development of these subgroups.

The second approach tackles the problem of analyzing the evolution of communities in environments with a high membership fluctuation. For this, we use the same clustering as in the first approach to detect subgroups in graphs (community instances). Then we build a graph of similar community instances and cluster this graph to detect groups of similar instances. These clusters are visualized and the temporal development is analyzed. In this way we detect persistent structures and their transitions in a graph of interactions among fluctuating members. The transitions in stable groups as well as structural changes in groups of fluctuating members might have been triggered by internal or external events. We present results of experiments with real data from a community of international students which, due to semester breaks, exhibits a large number of fluctuating members.

The remainder of the paper is organized as follows: In Section 2 a survey of the related work is given and in Section 3 the data set is briefly presented. Our proposed approach to find subgroups in social networks and to analyze their development over time is discussed in Section 4. In Section 5 a method to detect persistent structures in networks with fluctuating members is presented. Furthermore, it is discussed how events that lead to structural changes in communities can be detected. The outcome of the analysis with data from an online student community is presented in different visualizations. A conclusion of our findings is presented in Section 6.
2. Related Work
The formation of groups of densely connected members (subgroups, clusters) can be observed in many networks. Finding these clusters is therefore of interest in many research fields. Examples come from the social sciences (citation networks [5]), biology (genetic networks [12] and food webs [1]) and computer science (the WWW [8] and email log files [11]). In all these research areas we find discussions about the definition of online communities. Mainly in the field of sociological research but also in information sciences, the need to understand and support online communities has driven the discussion. Several algorithms to detect online communities have been developed. They are mainly based on the assumption that a community consists of a group of actors/participants who interact with each other more closely than with others. Thus, communities are considered as groups of densely connected actors that are only loosely connected to the rest of the network. For our understanding of communities, these methods are relevant to find dense groups in graphs.

Leskovec et al. [9] have studied the properties of the time evolution of graphs. The results give insights into the evolution of graph properties over time and allow statements about trends. The authors consider properties on the graph level such as the average vertex degree and the distances between pairs of nodes. Properties on the level of subgroups that could be used to study the evolution of communities over time are not considered.

Several software tools for the analysis and/or visualization of social networks have been developed, such as StOCNET, UCINET or Pajek. These tools provide an analysis of static networks but do not incorporate methods for a temporal analysis of social networks. Lately, some visualization methods have been proposed to observe graphs over time. Tools such as SoNIA [10] and TeCFlow [4] have been developed that allow for temporal social network visualization. Both tools visualize the temporal development of graphs on the vertex and edge level. They create movies and thus visualize changing behavior between single actors. It is not possible to explore the dynamics of subgroups. In contrast, we propose to observe the temporal changes on the subgroup level to allow for an exploration of subgroup dynamics. Thus, we do not consider the changing behavior of single actors, as single actors do not necessarily represent a community.

Online or Web communities in the WWW have furthermore been inferred by studying the link topology of the Web. To our knowledge all approaches are based on the seminal work of Kleinberg [3], which assumes that communities contain a core of “authoritative pages” which are linked by “hub pages”. Since we do not share this assumption, this approach is not applicable for us.

3. The Data Set
The data that we use in the following for the experiments is taken from an online student community. The aim of the online student platform, which started in summer of 2004, is to bring together international students at the University of Magdeburg. The community currently has about 1,000 members from more than 50 countries. The platform provides information for (prospective) students from abroad, for example about the housing situation or insurance and finance issues. Besides this, the platform provides a Web space for each student. This page can be used to provide personal information, to link with friends or to search for language tandems. Each member has a personal guestbook that can be used by other members to post public messages or comments. Most of the interactions between the students take place in these guestbooks, which work similarly to email except that messages are not private but can be read by everybody.

The data set contains 250,000 guestbook entries over a period of 18 months (June 2004 to November 2005, 75 weeks). Each member is represented as a vertex and two vertices are connected with an edge if at least one bilateral message exchange took place (both members have written at least one message in the other member’s guestbook). Members are made anonymous so that it is not possible to trace back the results of the experiments to a certain individual.
4. Mining and Visualizing the Evolution of Subgroups in Social Networks
A subgroup in a social network is a set of densely connected members that are rather loosely connected to other members. A subgroup is a circle in which “one knows each other”. Especially in online communities, subgroups gain in importance as they provide familiarity between the members. This is especially important for fast growing communities, as established users still have a tight bond to other members with whom they are comfortable. Furthermore, established subgroups are important for new users too, as it is easier for them to find a group of people with similar interests (see, e.g., [7]). Thus, facilitating the emergence of subgroups in social networks as well as observing and analyzing them are of particular interest for organizations and community providers. In the following we present our approach to detect, visualize and analyze subgroups and their dynamics.
4.1. Mining for Subgroups in Social Networks
To detect subgroups in social networks we apply a three-step approach, which is briefly described in the following. In the first step the time axis is partitioned into slots (time windows t) to allow for a temporal analysis of the network (overlapping sliding window approach; in our experiments we used a sliding window with a length of 14 days and a step width of half the window length). In the second step a weighted graph Gt of interactions between individuals is built for each time window (the weight corresponds to the number of bilateral message exchanges). In the third step a hierarchical edge betweenness clustering of the graph is applied to find subgroups of highly connected individuals in each time window. The process is described in more detail in [2].

The results of the clustering process can be visualized in a dendrogram as shown in Figure 1. This tree diagram illustrates the resulting partitioning of the network into subgroups. By marking one row in the table at the bottom of Figure 1 the user can choose the time window to investigate. The table furthermore provides information such as the number of subgroups detected, the number of participants in the subgroups and the number of edges to be deleted to obtain the highest modularity. In the window above the table the dendrogram corresponding to the chosen period is shown. By default, the yellow slider in the dendrogram shows the network partition with the highest modularity. However, the user can move the slider to experiment with other partitions. At the left side of the dendrogram, we see the detected communities. The matrix window on the right shows the communication within one selected community from the dendrogram.
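The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation described in [2]: the input format `(author, guestbook_owner, day)`, all function names, and the reading of the edge weight as the smaller of the two message counts are hypothetical, and the clustering step ignores edge weights for brevity. The partition returned is the one with the highest modularity, which corresponds to the default slider position in the dendrogram.

```python
from collections import defaultdict, deque
from datetime import timedelta


def sliding_windows(start, end, length_days=14, step_days=7):
    """Yield overlapping (window_start, window_end) pairs; the end is exclusive."""
    current = start
    while current < end:
        yield current, current + timedelta(days=length_days)
        current += timedelta(days=step_days)


def build_graph(entries, window_start, window_end):
    """Weighted graph {v: {w: weight}} from (author, owner, day) tuples.
    Assumed weight: min(messages a->b, messages b->a) inside the window."""
    sent = defaultdict(int)
    for author, owner, day in entries:
        if window_start <= day < window_end and author != owner:
            sent[(author, owner)] += 1
    graph = defaultdict(dict)
    for (a, b), count in sent.items():
        back = sent.get((b, a), 0)
        if back:  # keep only bilateral exchanges
            graph[a][b] = graph[b][a] = min(count, back)
    return graph


def edge_betweenness(adj):
    """Brandes-style edge betweenness on an unweighted graph {v: set(neighbours)}."""
    score = defaultdict(float)
    for s in adj:
        order, preds, dist = [], defaultdict(list), {s: 0}
        sigma = defaultdict(float, {s: 1.0})
        queue = deque([s])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):  # accumulate dependencies back to the source
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return score


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def modularity(adj, parts):
    """Newman-Girvan modularity of a partition, ignoring edge weights."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    if m == 0:
        return 0.0
    q = 0.0
    for part in parts:
        inside = sum(1 for v in part for w in adj[v] if w in part) / 2
        degree = sum(len(adj[v]) for v in part)
        q += inside / m - (degree / (2 * m)) ** 2
    return q


def best_partition(adj):
    """Divisive clustering: drop the top-betweenness edge, keep the best-modularity split."""
    work = {v: set(nbrs) for v, nbrs in adj.items()}
    best = components(work)
    best_q = modularity(adj, best)
    while any(work.values()):
        scores = edge_betweenness(work)
        v, w = max(scores, key=scores.get)
        work[v].discard(w)
        work[w].discard(v)
        parts = components(work)
        q = modularity(adj, parts)
        if q > best_q:
            best, best_q = parts, q
    return best, best_q
```

For one window, `build_graph` yields the weighted graph, and `best_partition({v: set(nbrs) for v, nbrs in graph.items()})` returns the subgroups of that window. Recomputing the betweenness after every removal follows the description of the divisive algorithm but is expensive on large graphs.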
Figure 1. Clustering results in a dendrogram

4.2. Analyzing Subgroup Dynamics
Subgroups in social networks are not static but evolve over time. Typical transition types are growing, declining, merging and splitting. To observe these transitions we track a detected subgroup over time by measuring the structural equivalence. The development of one subgroup can be described and assessed by interpreting the following measures: stability, density and cohesion, Euclidean distance, correlation coefficient and group activity. The measures are computed for each time window and the results are compared in two ways: (i) a chosen time window is compared with all other time windows, called fixed, to determine how the measure changed compared to the initially chosen time window, and (ii) each time window is compared to the previous time window, called periodic, to assess the development in consecutive periods. Figure 2 shows a screenshot of the statistical analysis. On the left side, the user can mark a subgroup that has been detected in a certain time window and the resulting comparisons are shown on the right side in several curves. The implemented measures and their interpretation are briefly described in the following.

Stability: The stability shows how stable the composition of the group is. The fixed stability indicates how many of the original members in the chosen time window are active in each of the other time windows. If the value is one, all members of the chosen subgroup are active. For the periodic stability the set of members is compared to the previous time window. A low curve indicates strong changes in the membership structure of the subgroup.

Density and Cohesion: The density indicates the connectivity inside the group. It is the ratio of the number of edges inside the group to the number of possible edges. The higher the density, the more connected the group members are. The cohesion indicates how connected the group is to non-subgroup-members. It is obtained by dividing the average strength of the interaction inside the subgroup (using the edge weights) by the average interaction strength to actors outside the subgroup. If the density increases more strongly than the cohesion, the connectivity to members outside the subgroup grows faster, resulting in a less stable subgroup.

Euclidean Distance: The structural equivalence of two subgroups is measured by calculating the Euclidean distance of their vector representations. The distance is zero if the subgroups are structurally equivalent and large if they are not.
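The stability, density and cohesion measures described above can be written down in a few lines. The following is a minimal sketch under assumptions: the graph is a dictionary `{member: {neighbour: weight}}`, a subgroup is a Python set of member names, all function names are hypothetical, and the periodic stability is read as the share of the previous window's members that are still present.

```python
import math


def stability(reference_members, active_members):
    """Share of the reference subgroup's members that are active in another window."""
    if not reference_members:
        return 0.0
    return len(reference_members & active_members) / len(reference_members)


def fixed_and_periodic(member_sets, chosen=0):
    """Fixed: compare every window to the chosen one. Periodic: to its predecessor."""
    fixed = [stability(member_sets[chosen], s) for s in member_sets]
    periodic = [stability(prev, cur) for prev, cur in zip(member_sets, member_sets[1:])]
    return fixed, periodic


def density(graph, members):
    """Edges inside the subgroup divided by the number of possible edges."""
    n = len(members)
    if n < 2:
        return 0.0
    inside = sum(1 for v in members for w in graph.get(v, {}) if w in members) / 2
    return inside / (n * (n - 1) / 2)


def cohesion(graph, members):
    """Average internal edge weight divided by the average weight of edges leaving the group."""
    internal, external = [], []
    for v in members:
        for w, weight in graph.get(v, {}).items():
            (internal if w in members else external).append(weight)
    if not internal:
        return 0.0
    if not external:
        return math.inf  # no ties to outsiders at all
    # Internal edges are listed twice (once per endpoint); the mean is unaffected.
    return (sum(internal) / len(internal)) / (sum(external) / len(external))
```

Returning infinity for a group without any external ties is a design choice of this sketch; the text does not say how that case is handled.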
Figure 2. Temporal development of subgroups
Correlation Coefficient: The correlation coefficient is obtained by dividing the covariance of the vector representations of the graphs by the product of their standard deviations. The correlation coefficient takes a value between -1 and +1. If two subgroups are structurally equivalent, the correlation is +1.

Group Activity: The lowermost curve in Figure 2 shows the actual activity of the group members, measured as the number of interactions in each time window. Shown are, for example, the Internal Group Activity and the External Group Activity. The Min Internal Group Activity illustrates the number of reciprocal interactions inside the group. Thus, by comparing the Internal Group Activity with the Min Internal Group Activity the reciprocity inside the group can be determined. The Min External Group Activity in comparison with the External Group Activity can be interpreted analogously.

In our experiment with the online student community, we obtained 1181 clusters with more than three members. Ordered by the number of group members we obtain a Zipf distribution. The largest group has 45 members and the average group size is 6.4. Thus, the hierarchical edge betweenness clustering algorithm detects many small community instances in our data set. In Figure 2 we chose a subgroup with 23 members that was detected in April 2004. The curves of the Fixed Correlation Coefficient, the Fixed Euclidean Distance, the Cohesion and the Density indicate that the subgroup shows up only once in this formation. The Group Activity and the Periodic Correlation show that some of the members are interacting with each other in earlier and later time windows. However, in this data set, no long-term stable subgroups could be found; therefore tracking a subgroup over time was not possible. The second approach, in which similar subgroups are clustered, solves this problem by merging similar instances into trackable communities.
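The two structural-equivalence measures of this section, the Euclidean distance and the correlation coefficient, can be illustrated as follows. The text does not specify the vector representation, so this sketch assumes one: the edge weights of all vertex pairs over a shared, ordered vertex list (zero where no edge exists). The function names are hypothetical.

```python
import math
from itertools import combinations


def edge_vector(graph, vertices):
    """Edge weights for every vertex pair, in a fixed order (assumed representation)."""
    return [graph.get(v, {}).get(w, 0) for v, w in combinations(vertices, 2)]


def euclidean_distance(x, y):
    """Zero for identical vectors, growing with structural difference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    if n == 0:
        return None
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one vector is constant
    return cov / math.sqrt(var_x * var_y)
```

Both vectors must be built over the same vertex list, for example the members of the chosen subgroup, so that corresponding positions refer to the same pair of members in both time windows.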
5. Mining and Visualizing the Dynamics of Communities with Fluctuating Members
As shown in the last section, some online communities do not have a considerable number of members that participate over a longer period. This is typical for communities where members leave gradually while new ones join. Thus, in some communities members and interactions fluctuate over time even though the community is still there. To be able to analyze the development of such a community, which is characterized by a high fluctuation of members, we propose a method that detects persistent structures of similar community instances.

5.1. Mining Communities
From now on, we denote a “subgroup” detected in the first clustering step as a “community instance”; a community is then a superordinate entity encompassing several “community instances”. We first specify the notion of similarity among community instances that belong to the same community. Then, we establish a graph of community instances, where edges denote similarity among instances that were discovered at different times. Finally, we present our graph clustering algorithm that groups instances of the same community and the visualization of the clustering results.

5.1.1. Similarity of Community Instances. We define similarity among community instances that have been discovered at different times as the overlap of members between the two community instances. Thus, two instances are similar if their overlap exceeds a given threshold. In particular, let $t_i$, $t_j$ be two distinct time periods, let $G_i$, $G_j$ denote the corresponding graphs of interactions and let $C_{G_i}$, $C_{G_j}$ be the corresponding clusterings. Further, let $x \in C_{G_i}$ and $y \in C_{G_j}$ be two community instances. We define the overlap between two community instances as:

$$\mathrm{overlap}(x, y) = \frac{|x \cap y|}{\min(|x|, |y|)} \tag{1}$$

where $|x|$ is the number of vertices in a community instance or intersection. Using this notion of overlap, we define a function sim(x,y) as follows:

$$\mathrm{sim}(x, y) = \begin{cases} 1 & \text{if } \mathrm{overlap}(x, y) \geq \tau_{\mathrm{overlap}} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

The overlap threshold τoverlap captures the tolerance to member fluctuation: In some communities, the number of interactions of any member is low and the number of time periods in which this member is active is also small; nonetheless, the community is there for many more time slots than any of its members. To identify community instances that constitute such a community, we need a low overlap threshold.

The definitions of overlap and similarity allow for comparisons between community instances in arbitrarily remote time periods. It is of course possible to compute the similarity of a community instance to any past community instance. However, this is computationally expensive. Moreover, it may lead to noisy results in environments where individuals participate in several communities but are active in different communities at different times. Therefore, we introduce an upper boundary τperiods on the number of periods that may separate two potentially similar community instances; instances that are more than τperiods apart have zero similarity by default. Thus, we extend the similarity definition of Eq. (2) into the following:

DEFINITION 2. Let $t_i$, $t_j$ be two time periods such that $t_i < t_j$, let $G_i$, $G_j$ be the graphs of interactions in those periods and let $x^{G_i} \in C_{G_i}$, $y^{G_j} \in C_{G_j}$ be two community instances in the corresponding clusterings. The similarity of the community instances is defined as:

$$\mathrm{similarity}(x^{G_i}, y^{G_j}) = \begin{cases} 1 & \text{if } t_j - t_i < \tau_{\mathrm{periods}} \wedge \mathrm{overlap}(x, y) \geq \tau_{\mathrm{overlap}} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

5.1.2. Graph of Similar Community Instances. We use the notion of similarity in Eq. (3) to create a graph $G(V, E)$ of community instances: The set of vertices consists of all community instances detected during the periods of observation, i.e. $V = \bigcup_i C_{G_i}$. An edge between two vertices x, y exists if and only if similarity(x,y) = 1. According to Eq. (3), the community instances that may be compared and found similar need not belong to adjacent time periods. This means that there may be edges connecting community instances that are two, three or as many as τperiods periods apart from each other. Those edges bring temporally remote community instances close to each other. This ensures (a) that similar community instances are connected to each other independently of their temporal proximity, subject only to τperiods, and (b) that communities that experience little change over time correspond to highly connected groups of community instances, i.e. to tight clusters. The mining algorithm presented in the next subsection discovers such clusters, which correspond to stable, persistent communities.

5.1.3. Clustering the Graph of Similar Community Instances. We perform clustering upon the graph to detect groups of similar community instances. As for the discovery of community instances in each time period, we weight the edges of the graph by their edge betweenness and then progressively remove edges as part of a hierarchical divisive clustering algorithm. The connected subgraphs retained after k iterations correspond to the communities found thus far. The algorithm takes as input the number of iterations k to be performed, or equivalently the maximum number of edges to be removed from the graph. k is determined empirically and thus chosen by the user. The value of k depends on the size and the structure of the network. In each round, the edge betweenness of the remaining edges is (re-)computed. A connected subgraph returned at the end of the iterative process is a community, comprised of similar community instances.
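The overlap measure, the period-bounded similarity and the construction of the graph of community instances can be sketched as follows. The sketch is illustrative only: instances are assumed to be given as `(period_index, set_of_members)` pairs, all function names are hypothetical, and the default thresholds are the values used later in the experiments. The last function returns the connected components of the graph, i.e. the communities before any edge has been removed (k = 0); the divisive step would then remove the edge with the highest betweenness k times.

```python
from itertools import combinations


def overlap(x, y):
    """Shared members relative to the smaller of the two instances."""
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


def similarity(x, period_x, y, period_y, tau_overlap=0.5, tau_periods=6):
    """1 if the instances lie in distinct periods less than tau_periods apart
    and overlap strongly enough, else 0."""
    gap = abs(period_y - period_x)
    close_in_time = 0 < gap < tau_periods
    return 1 if close_in_time and overlap(x, y) >= tau_overlap else 0


def instance_graph(instances, tau_overlap=0.5, tau_periods=6):
    """Graph {index: set(indices)} with an edge wherever similarity is 1."""
    adj = {i: set() for i in range(len(instances))}
    for i, j in combinations(range(len(instances)), 2):
        (t_i, x), (t_j, y) = instances[i], instances[j]
        if similarity(x, t_i, y, t_j, tau_overlap, tau_periods):
            adj[i].add(j)
            adj[j].add(i)
    return adj


def communities_before_clustering(adj):
    """Connected components of the instance graph (state with no edges removed)."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts
```

Comparing all pairs of instances is quadratic in their number; since pairs more than τperiods apart can never be similar, an implementation could restrict the comparison to instances from nearby periods.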
5.2. Visualizing the Evolution of Subgroups
The proposed method to find communities in a changing environment and to detect structural changes in evolving communities has been implemented in a software tool. The settings for the cluster analysis are specified in a control panel, which is displayed in the lower part of Figure 3. Sliders are used to define the observation period, the similarity threshold (τoverlap), the maximal number of periods between similar community instances (τperiods) and the number of clustering iterations (k). Optional settings are the minimal number of periods between similar community instances and the maximum size of a community instance. The outcome of the analysis process is represented in different visualizations as shown in Figures 3 and 4. Both are described in more detail in the following.

To visualize the similarity between community instances we use a graphical representation that positions more similar vertices closer to each other than less similar ones. For this we implemented a Kamada-Kawai [6] graph layout, which positions the vertices such that the Euclidean distance between them is as close as possible to their graph-theoretic (path) distance. This layout is displayed in the upper part of Figure 3. Each community instance is represented by a circle and the diameter of the circle represents the size of the community instance (= number of participants). Instances that are detected as similar according to τoverlap are connected with an edge. The whole graph can be rotated and single community instances can be moved to obtain a different view. After moving or rotating, the layout is renewed and all nodes are repositioned. Note that in this visualization the temporal development cannot be observed, as the communities are only displayed according to their similarity.

The clustering result is visualized by different colors of the nodes. That is, at first all community instances have the same color. Groups of similar community instances that are detected during the clustering process are visualized by a different color. The example in Figure 3 shows three communities that have been detected after 38 clustering iterations. One community is shown in red, another in blue and one in green.

Figure 3. Visualization of similar community instances and control panel for the analysis.

In a next step, the filtered and clustered data is copied to a community history view (see Figure 4, left side). For this, the y-coordinates for each community instance are taken from the graph representation in Figure 3. This is important to note, because a rotation or repositioning of vertices in the graph results in a different layout in the history view, which displays the temporal development. The x-coordinate of the vertex is determined according to the period it appears in. Thus, community instances displayed on the left side are detected in earlier periods. To the right, more recent community instances are displayed. Each community instance is now represented as a rectangle and the height of the rectangle corresponds to the size of the community instance. All community instances that are considered similar according to the current setting of τoverlap are connected by edges. Clusters of similar community instances are displayed in the same color. By moving a rectangle (see red rectangle on the left side in Figure 4) a cutout view of the graph (zoom in) is displayed, which shows the structure of the community instances in more detail. The edges between rectangles connect similar community instances. If, for example, a community instance has edges to more than one rectangle in a later period, this indicates that its members have separated.

The horizontal box in Figure 4 on the right provides further information about a marked community instance. It shows the time period when the instance was detected and the list of participating members. This information is shown when the user marks a community instance.

The transformation of the graph in Figure 3 to the temporal view in Figure 4 allows observing the development along the time axis because each community instance is displayed according to the period when it was observed. Now, the colors of the community instances indicate the borders of the clusters. The break between two differently colored clusters may show a period when a change in the community structure occurred. In our example we see two breaks that separate three communities. The changes in the structure can have different reasons, e.g. the set of participants strongly fluctuated, the interaction behavior of the participants changed, or both. This visualization reveals periods that exhibit structural changes and thus offers a starting point for users to analyze the triggers for such developments.

Figure 4. Graph Visualization (left) and cutout view (right).
5.3. Changes in the Evolution of Subgroups
In the following we show that the community monitor can track evolving communities and detect breaks in this evolution. In the experiments, the similarity threshold τoverlap is fixed at 0.5 and τperiods is set to 6. Thus, we obtain a graph of 1025 similar community instances in which only those community instances are included that are separated by at most 6 periods. The given graph of similar community instances is iteratively clustered. The first change in the evolution is detected after 27 clustering iterations, the second after 38 iterations and the third after 48 iterations.

The results of the graph clustering are shown in Figure 5. The figure consists of four screenshots that display the temporal graph after a number of clustering iterations. The leftmost screenshot (a) shows the respective graph of similar community instances without clustering. The second screenshot (b) shows the graph when the first break in the community evolution was detected. The third screenshot (c) shows the second break and the fourth screenshot (d) the third. The revealed communities are displayed in different colors. To improve the readability, vertical bars are added that indicate the period when the break occurred.

The obtained clustering results can be verified qualitatively in that “global events” can be related to the breaks between community clusters. The student community is especially oriented towards the integration of international students. Many international students stay at the University for one or two semesters only and this is usually the time when they participate in the community. Thus, the community members highly fluctuate over time and the highest fluctuation can be observed after the end of a semester and at the beginning of a new one. This observation about fluctuating members corresponds to the results of our experiments. One break in the community structure can be observed during the summer break in 2005.
This change can be attributed to the fact that at the end of the summer term many students gradually leave the community because their semester abroad ends. Just a few students stay in touch for a longer time after leaving Germany. Thus, the set of participants as well as the interaction behavior change at this time. A second structural change can be observed at the beginning of the winter term 2005/06. The start of a new semester is characterized by many new members and this is especially true for the winter term as most studies start in winter. We can observe that those new members form many smaller communities very fast. These new groups sometimes grow but many students break up and others join existing or other new communities. This period thus exhibits a high fluctuation of members and interactions. The third break can be attributed to the changing interaction behavior during the Christmas break in 2004. It could be observed that people who were active on the community platform during the Christmas break contacted students that were online too, even though they had not been interacting before. This resulted in a rapidly changing community structure, as many new edges appeared in a very short time.

Figure 5. Structural breaks in community evolution for τperiods = 6 and τoverlap = 0.5 indicated by vertical lines. Vertices mapped as described in Sect. 4. The individual graphs represent: a) no edges removed (red community), b) after 27 clustering iterations (red and blue communities), c) after 38 iterations (red, green and blue communities), d) after 48 iterations (red, green, brown and blue communities)
6. Conclusion We presented two approaches to analyze the dynamics of two different types of online communities. One method consists of statistical analyses and visualizations that can be used to analyze the dynamics of subgroups in communities with a rather stable membership structure. Since many communities do not exhibit such a stable membership we presented a second method to detect and analyze the evolution of subgroups in communities with a high fluctuation of members. Both methods have been implemented in an interactive software tool and we presented results from experiments with real data of an online student community. Since the community consists mainly of international students who stay only for one or two semesters the data set is characterized by a high fluctuation of members. However, we could show that our method reveals changes in the community structure that are caused by these fluctuations during semester breaks.
7. References
[1] J. A. Dunne, R. J. Williams and N. D. Martinez, Food-web structure and network theory: The role of connectance and size, PNAS, 99(20), pp. 12917-12922, 2002.
[2] T. Falkowski, J. Bartelheimer and M. Spiliopoulou, Community Dynamics Mining, In: Proc. of 14th European Conference on Information Systems, (on CD-ROM), 2006.
[3] D. Gibson, J. Kleinberg and P. Raghavan, Inferring Web Communities from Link Topology, In: Proc. of the 9th ACM Conference on Hypertext and Hypermedia, 1998.
[4] P. A. Gloor and Y. Zhao, TeCFlow - A Temporal Communication Flow Visualizer for Social Networks Analysis, In: CSCW'04 Workshop on Social Networks, ACM, 2004.
[5] H. Jeong, Z. Néda and A.-L. Barabási, Measuring preferential attachment in evolving networks, Europhysics Letters, 61(4), pp. 567-572, 2003.
[6] T. Kamada and S. Kawai, An algorithm for drawing general undirected graphs, Information Processing Letters, 31(1), pp. 7-15, 1989.
[7] A. J. Kim, Community Building on the Web, Addison-Wesley, 2000.
[8] J. Kleinberg and S. Lawrence, The Structure of the Web, Science, 294, pp. 1849-1850, 2001.
[9] J. Leskovec, J. Kleinberg and C. Faloutsos, Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations, In: Proc. of KDD'05, 2005.
[10] J. Moody, D. McFarland and S. Bender-deMoll, Dynamic Network Visualization, American Journal of Sociology, 110(4), pp. 1206-1241, 2005.
[11] J. R. Tyler, D. M. Wilkinson and B. A. Huberman, Email as Spectroscopy: Automated Discovery of Community Structure within Organizations, In: M. Huysman, E. Wenger and V. Wulf (eds.), Communities and Technologies, Kluwer Academic Publishers, Dordrecht, 2003.
[12] D. M. Wilkinson and B. A. Huberman, A method for finding communities of related genes, Proc. National Academy of Sciences U.S.A., 10(1073), 2004.