Scalable Learning of Collective Behavior - Semantic Scholar

1

Scalable Learning of Collective Behavior Lei Tang, Xufei Wang, and Huan Liu Abstract—This study of collective behavior is to understand how individuals behave in a social networking environment. Oceans of data generated by social media like Facebook, Twitter, Flickr, and YouTube present opportunities and challenges to study collective behavior on a large scale. In this work, we aim to learn to predict collective behavior in social media. In particular, given information about some individuals, how can we infer the behavior of unobserved individuals in the same network? A social-dimension-based approach has been shown effective in addressing the heterogeneity of connections presented in social media. However, the networks in social media are normally of colossal size, involving hundreds of thousands of actors. The scale of these networks entails scalable learning of models for collective behavior prediction. To address the scalability issue, we propose an edge-centric clustering scheme to extract sparse social dimensions. With sparse social dimensions, the proposed approach can efficiently handle networks of millions of actors while demonstrating a comparable prediction performance to other non-scalable methods. Index Terms—Classification with Network Data, Collective Behavior, Community Detection, Social Dimensions

F

1

I NTRODUCTION

T

HE

advancement in computing and communication technologies enables people to get together and share information in innovative ways. Social networking sites (a recent phenomenon) empower people of different ages and backgrounds with new forms of collaboration, communication, and collective intelligence. Prodigious numbers of online volunteers collaboratively write encyclopedia articles of unprecedented scope and scale; online marketplaces recommend products by investigating user shopping behavior and interactions; and political movements also exploit new forms of engagement and collective action. In the same process, social media provides ample opportunities to study human interactions and collective behavior on an unprecedented scale. In this work, we study how networks in social media can help predict some human behaviors and individual preferences. In particular, given the behavior of some individuals in a network, how can we infer the behavior of other individuals in the same social network [1]? This study can help better understand behavioral patterns of users in social media for applications like social advertising and recommendation. Typically, the connections in social media networks are not homogeneous. Different connections are associated with distinctive relations. For example, one user might maintain connections simultaneously to his friends, family, college classmates, and colleagues. This relationship information, however, is not always fully available in reality. Mostly, we have access to the connectivity information between users, but we have no idea why they are connected to each other. This Lei Tang, Yahoo! Labs, Santa Clara, CA 95054, USA, [email protected]. Xufei Wang and Huan Liu, Computer Science & Engineering, Arizona State University, Tempe, AZ 85287, USA, {xufei.wang, huan.liu}@asu.edu

heterogeneity of connections limits the effectiveness of a commonly used technique — collective inference for network classification. A recent framework based on social dimensions [2] is shown to be effective in addressing this heterogeneity. The framework suggests a novel way of network classification: first, capture the latent affiliations of actors by extracting social dimensions based on network connectivity, and next, apply extant data mining techniques to classification based on the extracted dimensions. In the initial study, modularity maximization [3] was employed to extract social dimensions. The superiority of this framework over other representative relational learning methods has been verified with social media data in [2]. The original framework, however, is not scalable to handle networks of colossal sizes because the extracted social dimensions are rather dense. In social media, a network of millions of actors is very common. With a huge number of actors, extracted dense social dimensions cannot even be held in memory, causing a serious computational problem. Sparsifying social dimensions can be effective in eliminating the scalability bottleneck. In this work, we propose an effective edge-centric approach to extract sparse social dimensions [4]. We prove that with our proposed approach, sparsity of social dimensions is guaranteed. Extensive experiments are then conducted with social media data. The framework based on sparse social dimensions, without sacrificing the prediction performance, is capable of efficiently handling real-world networks of millions of actors.

2

C OLLECTIVE B EHAVIOR

Collective behavior refers to the behaviors of individuals in a social networking environment, but it is not simply the aggregation of individual behaviors. In a

2

connected environment, individuals’ behaviors tend to be interdependent, influenced by the behavior of friends. This naturally leads to behavior correlation between connected users [5]. Take marketing as an example: if our friends buy something, there is a better-than-average chance that we will buy it, too. This behavior correlation can also be explained by homophily [6]. Homophily is a term coined in the 1950s to explain our tendency to link with one another in ways that confirm, rather than test, our core beliefs. Essentially, we are more likely to connect to others who share certain similarities with us. This phenomenon has been observed not only in the many processes of a physical world, but also in online systems [7], [8]. Homophily results in behavior correlations between connected friends. In other words, friends in a social network tend to behave similarly. The recent boom of social media enables us to study collective behavior on a large scale. Here, behaviors include a broad range of actions: joining a group, connecting to a person, clicking on an ad, becoming interested in certain topics, dating people of a certain type, etc. In this work, we attempt to leverage the behavior correlation presented in a social network in order to predict collective behavior in social media. Given a network with the behavioral information of some actors, how can we infer the behavioral outcome of the remaining actors within the same network? Here, we assume the studied behavior of one actor can be described with K class labels {c1 , · · · , cK }. Each label, ci , can be 0 or 1. For instance, one user might join multiple groups of interest, so ci = 1 denotes that the user subscribes to group i, and ci = 0 otherwise. Likewise, a user can be interested in several topics simultaneously, or click on multiple types of ads. One special case is K = 1, indicating that the studied behavior can be described by a single label with 1 and 0. For example, if the event is the presidential election, 1 or 0 indicates whether or not a voter voted for Barack Obama. The problem we study can be described formally as follows: Suppose there are K class labels Y = {c1 , · · · , cK }. Given network G = (V, E, Y ) where V is the vertex set, E is the edge set and Yi ⊆ Y are the class labels of a vertex vi ∈ V , and known values of Yi for some subsets of vertices V L , how can we infer the values of Yi (or an estimated probability over each label) for the remaining vertices V U = V − V L? It should be noted that this problem shares the same spirit as within-network classification [9]. It can also be considered as a special case of semi-supervised learning [10] or relational learning [11] where objects are connected within a network. Some of these methods, if applied directly to social media, yield only limited success [2]. This is because connections

TABLE 1 Social Dimension Representation Actors 1 2 .. .

Affiliation-1 0 0.5 .. .

Affiliation-2 1 0.3 .. .

··· ··· ··· .. .

Affiliation-k 0.8 0 .. .

in social media are rather noisy and heterogeneous. In the next section, we will discuss the connection heterogeneity in social media, review the concept of social dimension, and anatomize the scalability limitations of the earlier model proposed in [2], which provides a compelling motivation for this work.

3

S OCIAL D IMENSIONS

Connections in social media are not homogeneous. People can connect to their family, colleagues, college classmates, or buddies met online. Some relations are helpful in determining a targeted behavior (category) while others are not. This relation-type information, however, is often not readily available in social media. A direct application of collective inference [9] or label propagation [12] would treat connections in a social network as if they were homogeneous. To address the heterogeneity present in connections, a framework (SocioDim) [2] has been proposed for collective behavior learning. The framework SocioDim is composed of two steps: 1) social dimension extraction, and 2) discriminative learning. In the first step, latent social dimensions are extracted based on network topology to capture the potential affiliations of actors. These extracted social dimensions represent how each actor is involved in diverse affiliations. One example of the social dimension representation is shown in Table 1. The entries in this table denote the degree of one user involving in an affiliation. These social dimensions can be treated as features of actors for subsequent discriminative learning. Since a network is converted into features, typical classifiers such as support vector machine and logistic regression can be employed. The discriminative learning procedure will determine which social dimension correlates with the targeted behavior and then assign proper weights. A key observation is that actors of the same affiliation tend to connect with each other. For instance, it is reasonable to expect people of the same department to interact with each other more frequently. Hence, to infer actors’ latent affiliations, we need to find out a group of people who interact with each other more frequently than at random. This boils down to a classic community detection problem. Since each actor can get involved in more than one affiliation, a soft clustering scheme is preferred. In the initial instantiation of the framework SocioDim, a spectral variant of modularity maximiza-

3

tion [3] is adopted to extract social dimensions. The social dimensions correspond to the top eigenvectors of a modularity matrix. It has been empirically shown that this framework outperforms other representative relational learning methods on social media data. However, there are several concerns about the scalability of SocioDim with modularity maximization: • Social dimensions extracted according to soft clustering, such as modularity maximization and probabilistic methods, are dense. Suppose there are 1 million actors in a network and 1, 000 dimensions are extracted. If standard double precision numbers are used, holding the full matrix alone requires 1M × 1K × 8 = 8G memory. This large-size dense matrix poses thorny challenges for the extraction of social dimensions as well as subsequent discriminative learning. • Modularity maximization requires us to compute the top eigenvectors of a modularity matrix, which is the same size as a given network. In order to extract k communities, typically k − 1 eigenvectors are computed. For a sparse or structured matrix, the eigenvector computation costs O(h(mk + nk 2 + k 3 )) time [13], where h, m, and n are the number of iterations, the number of edges in the network, and the number of nodes, respectively. Though computing the top single eigenvector (i.e., k=1), such as PageRank scores, can be done very efficiently, computing thousands of eigenvectors or even more for a mega-scale network becomes a daunting task. • Networks in social media tend to evolve, with new members joining and new connections occurring between existing members each day. This dynamic nature of networks entails an efficient update of the model for collective behavior prediction. Efficient online updates of eigenvectors with expanding matrices remain a challenge. Consequently, it is imperative to develop scalable methods that can handle large-scale networks efficiently without extensive memory requirements. Next, we elucidate on an edge-centric clustering scheme to extract sparse social dimensions. With such a scheme, we can also update the social dimensions efficiently when new nodes or new edges arrive.

4

S PARSE S OCIAL D IMENSIONS

In this section, we first show one toy example to illustrate the intuition of communities in an “edge” view and then present potential solutions to extract sparse social dimensions. 4.1

Communities in an Edge-Centric View

Though SocioDim with soft clustering for social dimension extraction demonstrated promising results, its scalability is limited. A network may be sparse

Fig. 1. A Toy Example

Fig. 2. Edge Clusters

TABLE 2 Social Dimension(s) of the Toy Example Actors 1 2 3 4 5 6 7 8 9

Modularity Maximization -0.1185 -0.4043 -0.4473 -0.4473 0.3093 0.2628 0.1690 0.3241 0.3522

Edge Partition 1 1 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1

(i.e., the density of connectivity is very low), whereas the extracted social dimensions are not sparse. Let’s look at the toy network with two communities in Figure 1. Its social dimensions following modularity maximization are shown in Table 2. Clearly, none of the entries is zero. When a network expands into millions of actors, a reasonably large number of social dimensions need to be extracted. The corresponding memory requirement hinders both the extraction of social dimensions and the subsequent discriminative learning. Hence, it is imperative to develop some other approach so that the extracted social dimensions are sparse. It seems reasonable to state that the number of affiliations one user can participate in is upper bounded by his connections. Consider one extreme case that an actor has only one connection. It is expected that he is probably active in only one affiliation. It is not necessary to assign a non-zero score for each of the many other affiliations. Assuming each connection represents one involved affiliation, we can expect the number of affiliations an actor has is no more than that of his connections. Rather than defining a community as a set of nodes, we redefine it as a set of edges. Thus, communities can be identified by partitioning edges of a network into disjoint sets. An actor is considered associated with one affiliation if one of his connections is assigned to that affiliation. For instance, the two communities in Figure 1 can be represented by two edge sets in Figure 2, where the dashed edges represent one affiliation, and the remaining edges denote the second affiliation. The disjoint edge clusters in Figure 2 can be converted into the representation of social dimensions as shown in the last two columns in Table 2, where an entry is 1 (0) if an actor is (not) involved in that corresponding social dimension. Node 1 is affiliated with both communities because it has edges in both sets. By contrast,

4

3 −1.5 2.9 2.8 −2 2.7 2.6

α

−2.5

2.5

Density in log scale −2 (−2 denotes 10 )

2.4 −3 2.3 2.2 2.1 2 10

−3.5 3

10

4

10

K

Fig. 3. Density Upper Bound of Social Dimensions node 3 is assigned to only one community, as all its connections are in the dashed edge set. To extract sparse social dimensions, we partition edges rather than nodes into disjoint sets. The edges of those actors with multiple affiliations (e.g., actor 1 in the toy network) are separated into different clusters. Though the partition in the edge view is disjoint, the affiliations in the node-centric view can overlap. Each node can engage in multiple affiliations. In addition, the extracted social dimensions following edge partition are guaranteed to be sparse. This is because the number of one’s affiliations is no more than that of her connections. Given a network with m edges and n nodes, if k social dimensions are extracted, then each node vi has no more than min(di , k) non-zero entries in her social dimensions, where di is the degree of node vi . We have the following theorem about the density of extracted social dimensions. Theorem 1: Suppose k social dimensions are extracted from a network with m edges and n nodes. The density (proportion of nonzero entries) of the social dimensions based on edge partition is bounded by the following: Pn i=1 min(di , k) density ≤ nk P P {i|di ≤k} di + {i|di >k} k = (1) nk Moreover, for many real-world networks whose node degree follows a power law distribution, the upper bound in Eq. (1) can be approximated as follows: α−11 α−1 − − 1 k −α+1 (2) α−2k α−2 where α > 2 is the exponent of the power law distribution. The proof is given in the appendix. To give a concrete example, we examine a YouTube network1 with 1+ million actors and verify the upper bound of the density. The YouTube network has 1, 128, 499 nodes and 2, 990, 443 edges. Suppose we want to extract 1, 000 dimensions from the network. Since 232 1. More details are in Section 6.

nodes in the network have a degree larger than 1000, the density is upper bounded by (5, 472, 909 + 232 × 1, 000)/(1, 128, 499 × 1, 000) = 0.51% following Eq. (1). The node distribution in the network follows a power law with exponent α = 2.14 based on maximum likelihood estimation [14]. Thus, the approximate upper bound in Eq. (2) for this specific case is 0.54%. Note that the upper bound in Eq. (1) is network specific, whereas Eq. (2) gives an approximate upper bound for a family of networks. It is observed that most power law distributions occurring in nature have 2 ≤ α ≤ 3 [14]. Hence, the bound in Eq. (2) is valid most of the time. Figure 3 shows the function in terms of α and k. Note that when k is huge (close to 10,000), the social dimensions become extremely sparse (< 10−3 ), alleviating the memory demand and thus enabling scalable discriminative learning. Now, the remaining question is how to partition edges efficiently. Below, we will discuss two different approaches to accomplish this task: partitioning a line graph or clustering edge instances. 4.2

Edge Partition via Line Graph Partition

In order to partition edges into disjoint sets, one way is to look at the “dual” view of a network, i.e., the line graph [15]. We will show that this is not a practical solution. In a line graph L(G), each node corresponds to an edge in the original network G, and edges in the line graph represent the adjacency between two edges in the original graph. The line graph of the toy example is shown in Figure 4. For instance, e(1, 3) and e(2, 3) are connected in the line graph as they share one terminal node 3. Given a network, graph partition algorithms can be applied to its corresponding line graph. The set of communities in the line graph corresponds to a disjoint edge partition in the original graph. Recently, such a scheme has been used to detect overlapping communities [16], [17]. It is, however, prohibitive to construct a line graph for a mega-scale network. We notice that edges connecting to the same node in the original network form a clique in the corresponding line graph. For example, edges e(1, 3), e(2, 3), and e(3, 4) are all neighboring edges of node 3 in Figure 1. Hence, they are adjacent to each other in the line graph in Figure 4, forming a clique. This property leads to many more edges in a line graph than in the original network. Theorem 2: Let n and m denote the numbers of nodes and connections in a network, respectively, and N and M the numbers of nodes and connections in its line graph. It follows that 2m −1 (3) N = m, M ≥m n Equality is achieved if and only if the original graph is regular, i.e., all nodes have the same degree.

5

TABLE 3 Edge Instances of the Toy Network in Figure 1

e(1,3)

e(1,4)

e(1,5)

e(2,3)

e(3,4)

e(1,6)

e(5,6)

e(5,8)

e(2,4)

e(7,8)

e(6,9)

e(8,9)

Fig. 4. The line graph of the toy example in Figure 1. Each node in the line graph corresponds to an edge in the original graph.

Many real-world large-scale networks obey a power law degree distribution [14]. Consequently, the lower bound in Eq. (3) is never achievable. The number of connections in a line graph grows without a constant bound with respect to the size of a given network. Theorem 3: Let α denote the exponent of a power law degree distribution for a given network, n the size of the network, and M the number of connections in its corresponding line graph. It follows that diverges if 2 < α 3 The divergence of the expectation tells us that as we go to larger and larger data sets, our estimate of number of connections with respect to the original network size will increase without bound. As we have mentioned, the majority of large-scale networks follow a power law with α lying between 2 and 3. Consequently, the number of connections in a line graph can be extremely huge. Take the aforementioned YouTube network as an example again. It contains approximately 1 million nodes and 3 million links. Its resultant line graph will contain 4, 436, 252, 282 connections, too large to be loaded into memory. Recall that the original motivation to use sparse social dimensions is to address the scalability concern. Now we end up dealing with a much larger sized line graph. Hence, the line graph is not suitable for practical use. Next, we will present an alternative approach to edge partition, which is much more efficient and scalable. 4.3

e(1, 3) e(1, 4) e(2, 3) .. .

1 1 1 0

2 0 0 1

3 1 0 1

Features 4 5 6 0 0 0 1 0 0 0 0 0

7 0 0 0

8 0 0 0

9 0 0 0

·········

e(1,7)

e(6,7)

e(5,9)

Edge

Edge Partition via Clustering Edge Instances

In order to partition edges into disjoint sets, we treat edges as data instances with their terminal nodes as

features. For instance, we can treat each edge in the toy network in Figure 1 as one instance, and the nodes that define edges as features. This results in a typical feature-based data format as in Table 3. Then, a typical clustering algorithm like k-means clustering can be applied to find disjoint partitions. One concern with this scheme is that the total number of edges might be too huge. Owing to the power law distribution of node degrees presented in social networks, the total number of edges is normally linear, rather than square, with respect to the number of nodes in the network. That is, m = O(n) as stated in the following theorem. Theorem 4: The total number of edges is usually linear, rather than quadratic, with respect to the number of nodes in the network with a power law distribution. In particular, the expected number of edges is given as nα−1 E[m] = , (5) 2 α−2 where α is the exponent of the power law distribution. Please see the appendix for the proof. Still, millions of edges are the norm in a large-scale network. Direct application of some existing k-means implementation cannot handle the problem. E.g., the k-means code provided in the Matlab package requires the computation of the similarity matrix between all pairs of data instances, which would exhaust the memory of normal PCs in seconds. Therefore, an implementation with online computation is preferred. On the other hand, the data of edge instances is quite sparse and structured. As each edge connects two nodes, the corresponding edge instance has exactly only two non-zero features as shown in Table 3. This sparsity can help accelerate the clustering process if exploited wisely. We conjecture that the centroids of k-means should also be feature-sparse. Often, only a small portion of the data instances share features with the centroid. Hence, we just need to compute the similarity of the centroids with their relevant instances. In order to efficiently identify instances relevant to one centroid, we build a mapping from features (nodes) to instances (edges) beforehand. Once we have the mapping, we can easily identify the relevant instances by checking the non-zero features of the centroid. By taking into account the two concerns above, we devise a k-means variant as shown in Figure 5. Similar to k-means, this algorithm also maximizes

6

Input: data instances {xi |1 ≤ i ≤ m} number of clusters k Output: {idxi } 1. construct a mapping from features to instances 2. initialize the centroid of cluster {Cj |1 ≤ j ≤ k} 3. repeat 4. Reset {M axSimi }, {idxi } 5. for j=1:k 6. identify relevant instances Sj to centroid Cj 7. for i in Sj 8. compute sim(i, Cj ) of instance i and Cj 9. if sim(i, Cj ) > M axSimi 10. M axSimi = sim(i, Cj ) 11. idxi = j; 12. for i=1:m 13. update centroid Cidxi 14. until change of objective value <

Input: network data, labels of some nodes, number of social dimensions; Output: labels of unlabeled nodes. 1. convert network into edge-centric view. 2. perform edge clustering as in Figure 5. 3. construct social dimensions based on edge partition. A node belongs to one community as long as any of its neighboring edges is in that community. 4. apply regularization to social dimensions. 5. construct classifier based on social dimensions of labeled nodes. 6. use the classifier to predict labels of unlabeled ones based on their social dimensions.

Fig. 6. Algorithm for Learning of Collective Behavior

Fig. 5. Algorithm of Scalable k-means Variant within cluster similarity as shown in Eq. (6) arg max S

k X X i=1 xj ∈Si

xj · µi kxj kkµi k

(6)

where k is the number of clusters, S = {S1 , S2 , . . . , Sk } is the set of clusters, and µi is the centroid of cluster Si . In Figure 5, we keep only a vector of M axSim to represent the maximum similarity between one data instance and a centroid. In each iteration, we first identify the instances relevant to a centroid, and then compute similarities of these instances with the centroid. This avoids the iteration over each instance and each centroid, which will cost O(mk) otherwise. Note that the centroid contains one feature (node), if and only if any edge of that node is assigned to the cluster. In effect, most data instances (edges) are associated with few (much less than k) centroids. By taking advantage of the feature-instance mapping, the cluster assignment for all instances (lines 5-11 in Figure 5) can be fulfilled in O(m) time. Computing the new centroid (lines 12-13) costs O(m) time as well. Hence, each iteration costs O(m) time only. Moreover, the algorithm requires only the feature-instance mapping and network data to reside in main memory, which costs O(m + n) space. Thus, as long as the network data can be held in memory, this clustering algorithm is able to partition its edges into disjoint sets. As a simple k-means is adopted to extract social dimensions, it is easy to update social dimensions if a given network changes. If a new member joins the network and a new connection emerges, we can simply assign the new edge to the corresponding clusters. The update of centroids with the new arrival of connections is also straightforward. This k-means scheme is especially applicable for dynamic largescale networks. 4.4

Regularization on Communities

The extracted social dimensions are treated as features of nodes. Conventional supervised learning can be

conducted. In order to handle large-scale data with high dimensionality and vast numbers of instances, we adopt a linear SVM, which can be finished in linear time [18]. Generally, the larger a community is, the weaker the connections within the community are. Hence, we would like to build an SVM relying more on communities of smaller sizes by modifying the typical SVM objective function as follows: min λ

n X

1 |1 − yi (xTi w + b)|+ + wT Σw 2 i=1

(7)

where λ is a regularization parameter2 , |z|+ = max(0, z) the hinge loss function, and Σ a diagonal matrix to regularize the weights assigned to different communities. In particular, Σjj = h(|Cj |) is the penalty coefficient associated with community Cj . The penalty function h should be monotonically increasing with respect to the community size. In other words, a larger weight assigned to a large community results in a higher cost. Interestingly, with a little manipulation, the formula in Eq. (7) can be solved using a standard SVM with modified input. Let X represent the input data with each row being an instance (e.g., Table 2), and w ˜ = Σ1/2 w. Then, the formula can be rewritten as: X 1 1 1 min λ |1 − yi (xTi w + b)|+ + wT Σ 2 Σ 2 w 2 i X 1 1 1 = min λ |1 − yi (xTi Σ− 2 Σ 2 w + b)|+ + kwk ˜ 22 2 i X 1 = min λ |1 − yi (˜ xTi w ˜ + b)|+ + kwk ˜ 22 2 i 1

where x ˜Ti = xTi Σ− 2 . Hence, given an input data 1 matrix X, we only need to left multiply X by Σ− 2 . It is observed that community sizes of a network tend to follow a power law distribution. Hence, we recommend h(|Cj |) = log |Cj | or (log |Cj |)2 . Meanwhile, one node is likely to engage in multiple communities. Intuitively, a node who is devoted to a few select communities has more influence on 2. We use a different notation λ from standard SVM formulation to avoid the confusion with community C.

7

TABLE 4 Statistics of Social Media Data Data Categories Nodes (n) Links (m) Network Density Maximum Degree Average Degree

BlogCatalog 39 10, 312 333, 983 6.3 × 10−3 3, 992 65

Flickr 195 80, 513 5, 899, 882 1.8 × 10−3 5, 706 146

YouTube 47 1, 138, 499 2, 990, 443 4.6 × 10−6 28, 754 5

other group members than those who participate in many communities. For effective classification learning, we regularize node affiliations by normalizing each node’s social dimensions to sum to one. Later, we will study which regularization approach has a greater effect on classification. In summary, we conduct the following steps to learn a model for collective behavior. Given a network (say, Figure 1), we take the edge-centric view of the network data (Table 3) and partition the edges into disjoint sets (Figure 2). Based on the edge clustering, social dimensions can be constructed (Table 2). Certain regularizations can be applied. Then, discriminative learning and prediction can be accomplished by considering these social dimensions as features. The detailed algorithm is summarized in Figure 6.

5

E XPERIMENT S ETUP

In this section, we present the data collected from social media for evaluation and the baseline methods for comparison. 5.1

Social Media Data

Two data sets reported in [2] are used to examine our proposed model for collective behavior learning. The first data set is acquired from BlogCatalog3 , the second from a popular photo sharing site Flickr4 . Concerning behavior, following [2], we study whether or not a user joins a group of interest. Since the BlogCatalog data does not have this group information, we use blogger interests as the behavior labels. Both data sets are publicly available at the first author’s homepage. To examine scalability, we also include a mega-scale network5 crawled from YouTube6 . We remove those nodes without connections and select the interest groups with 500+ subscribers. Some statistics of the three data sets can be found in Table 4. 5.2

Baseline Methods

As we discussed in Section 4.2, constructing a line graph is prohibitive for large-scale networks. Hence, 3. 4. 5. 6.

http://www.blogcatalog.com/ http://www.flickr.com/ http://socialnetworks.mpi-sws.org/data-imc2007.html http://www.youtube.com/

the line-graph approach is not included for comparison. Alternatively, the edge-centric clustering (or EdgeCluster) in Section 4.2 is used to extract social dimensions on all data sets. We adopt cosine similarity while performing the clustering. Based on cross validation, the dimensionality is set to 5000, 10000, and 1000 for BlogCatalog, Flickr, and YouTube, respectively. A linear SVM classifier [18] is then exploited for discriminative learning. Another related approach to finding edge partitions is bi-connected components [19]. Bi-connected components of a graph are the maximal subsets of vertices such that the removal of a vertex from a particular component will not disconnect the component. Essentially, any two nodes in a bi-connected component are connected by at least two paths. It is highly related to cut vertices (a.k.a. articulation points) in a graph, whose removal will result in an increase in the number of connected components. Those cut vertices are the bridges connecting different bi-connected components. Thus, searching for bi-connected components boils down to searching for articulation points in the graph, which can be solved efficiently in O(n + m) time. Here n and m represent the number of vertices and edges in a graph, respectively. Each bi-connected component is considered a community, and converted into one social dimension for learning. We also compare our proposed sparse social dimension approach with classification performance based on dense representations. In particular, we extract social dimensions according to modularity maximization (denoted as M odM ax) [2]. M odM ax has been shown to outperform other representative relational learning methods based on collective inference. We study how the sparsity in social dimensions affects the prediction performance as well as the scalability. Note that social dimensions allow one actor to be involved in multiple affiliations. As a proof of concept, we also examine the case when each actor is associated with only one affiliation. Essentially, we construct social dimensions based on node partition. A similar idea has been adopted in a latent group model [20] for efficient inference. To be fair, we adopt k-means clustering to partition nodes of a network into disjoint sets, and convert the node clustering result into a set of social dimensions. Then, SVM is utilized for discriminative learning. For convenience, we denote this method as N odeCluster. Note that our prediction problem is essentially multi-label. It is empirically shown that thresholding can affect the final prediction performance drastically [21], [22]. For evaluation purposes, we assume the number of labels of unobserved nodes is already known, and check whether the top-ranking predicted labels match with the actual labels. Such a scheme has been adopted for other multi-label evaluation works [23]. We randomly sample a portion of nodes as labeled and report the average performance of 10

8

TABLE 5 Performance on BlogCatalog Network Proportion of Labeled Nodes EdgeCluster Micro-F1(%) BiComponents M odM ax N odeCluster EdgeCluster Macro-F1(%) BiComponents M odM ax N odeCluster

10% 27.94 16.54 27.35 18.29 16.16 2.77 17.36 7.38

20% 30.76 16.59 30.74 19.14 19.16 2.80 20.00 7.02

30% 31.85 16.67 31.77 20.01 20.48 2.82 20.80 7.27

40% 32.99 16.83 32.97 19.80 22.00 3.01 21.85 6.85

50% 34.12 17.21 34.09 20.81 23.00 3.13 22.65 7.57

60% 35.00 17.26 36.13 20.86 23.64 3.29 23.41 7.27

70% 34.63 17.04 36.08 20.53 23.82 3.25 23.89 6.88

80% 35.99 17.76 37.23 20.74 24.61 3.16 24.20 7.04

90% 36.29 17.61 38.18 20.78 24.92 3.37 24.97 6.83

TABLE 6 Performance on Flickr Network Proportion of Labeled Nodes EdgeCluster Micro-F1(%) BiComponents M odM ax N odeCluster EdgeCluster Macro-F1(%) BiComponents M odM ax N odeCluster

1% 25.75 16.45 22.75 22.94 10.52 0.45 10.21 7.90

2% 28.53 16.46 25.29 24.09 14.10 0.46 13.37 9.99

3% 29.14 16.45 27.30 25.42 15.91 0.45 15.24 11.42

4% 30.31 16.49 27.60 26.43 16.72 0.46 15.11 11.10

5% 30.85 16.49 28.05 27.53 18.01 0.46 16.14 12.33

6% 31.53 16.49 29.33 28.18 18.54 0.46 16.64 12.29

7% 31.75 16.49 29.43 28.32 19.54 0.46 17.02 12.58

8% 31.76 16.48 28.89 28.58 20.18 0.46 17.10 13.26

9% 32.19 16.55 29.17 28.70 20.78 0.47 17.14 12.79

10% 32.84 16.55 29.20 28.93 20.85 0.47 17.12 12.77

7% 38.94 25.24 — 31.15 30.75 7.61 — 25.50

8% 39.46 24.44 — 31.85 31.23 7.63 — 26.02

9% 39.92 25.62 — 32.29 31.45 7.76 — 26.44

10% 40.07 25.53 — 32.67 31.54 7.91 — 26.68

TABLE 7 Performance on YouTube Network Proportion of Labeled Nodes EdgeCluster Micro-F1(%) BiComponents M odM ax N odeCluster EdgeCluster Macro-F1(%) BiComponents M odM ax N odeCluster

1% 23.90 23.90 — 20.89 19.48 6.80 — 17.91

2% 31.68 24.51 — 24.57 25.01 7.05 — 21.11

3% 35.53 24.80 — 26.91 28.15 7.19 — 22.38

runs in terms of Micro-F1 and Macro-F1 [22].

6

E XPERIMENT R ESULTS

In this section, we first examine how prediction performances vary with social dimensions extracted following different approaches. Then we verify the sparsity of social dimensions and its implication for scalability. We also study how the performance varies with dimensionality. Finally, concrete examples of extracted social dimensions are given. 6.1

Prediction Performance

The prediction performance on all data is shown in Tables 5-7. The entries in bold face denote the best performance in each column. Obviously, EdgeCluster is the winner most of the time. Edge-centric clustering shows comparable performance to modularity maximization on BlogCatalog network, yet it outperforms M odM ax on Flickr. M odM ax on YouTube is not applicable due to the scalability constraint. Clearly, with sparse social dimensions, we are able to achieve comparable performance as that of dense social dimensions. But the benefit in terms of scalability will be tremendous as discussed in the next subsection.

4% 36.76 25.39 — 28.65 29.17 7.44 — 23.91

5% 37.81 25.20 — 29.56 29.82 7.48 — 24.47

6% 38.63 25.42 — 30.72 30.65 7.58 — 25.26

The N odeCluster scheme forces each actor to be involved in only one affiliation, yielding inferior performance compared with EdgeCluster. BiComponents, similar to EdgeCluster, also separates edges into disjoint sets, which in turn deliver a sparse representation of social dimensions. However, BiComponents yields a poor performance. This is because BiComponents outputs highly imbalanced communities. For example, BiComponents extracts 271 bi-connected components in the BlogCatalog network. Among these 271 components, a dominant one contains 10, 042 nodes, while all others are of size 2. Note that BlogCatalog contains in total 10, 312 nodes. As a network’s connectivity increases, BiComponents performs even worse. For instance, only 10 bi-connected components are found in the Flickr data, and thus its Macro-F1 is close to 0. In short, BiComponents is very efficient and scalable. However, it fails to extract informative social dimensions for classification. We note that the prediction performance on the studied social media data is around 20-30% for F1 measure. This is partly due to the large number of distinctive labels in the data. Another reason is that only the network information is exploited here. Since

9

TABLE 8 Sparsity Comparison on BlogCatalog data with 10, 312 Nodes. ModMax-500 corresponds to modularity maximization to select 500 social dimensions and EdgeCluster-x denotes edge-centric clustering to construct x dimensions. Time denotes the total time (seconds) to extract the social dimensions; Space represent the memory footprint (mega-byte) of the extracted social dimensions; Density is the proportion of non-zeros entries in the dimensions; Upper bound is the density upper bound computed following Eq. (1); Max-Aff and Ave-Aff denote the maximum and average number of affiliations one user is involved in. Methods M odM ax − 500 EdgeCluster − 100 EdgeCluster − 500 EdgeCluster − 1000 EdgeCluster − 2000 EdgeCluster − 5000 EdgeCluster − 10000

Time 194.4 300.8 357.8 307.2 294.6 230.3 195.6

Space 41.2M 3.8M 4.9M 5.2M 5.3M 5.5M 5.6M

Density 1 1.1 × 10−1 6.0 × 10−2 3.2 × 10−2 1.6 × 10−2 6 × 10−3 3 × 10−3

Upper Bound — 2.2 × 10−1 1.1 × 10−1 6.0 × 10−2 3.1 × 10−2 1.3 × 10−2 7 × 10−3

Max-Aff 500 187 344 408 598 682 882

Ave-Aff 500 23.5 30.0 31.8 32.4 32.4 33.3

TABLE 9 Sparsity Comparison on Flickr Data with 80, 513 Nodes Methods M odM ax − 500 EdgeCluster − 200 EdgeCluster − 500 EdgeCluster − 1000 EdgeCluster − 2000 EdgeCluster − 5000 EdgeCluster − 10000

Time 2.2 × 103 1.2 × 104 1.3 × 104 1.6 × 104 2.2 × 104 2.6 × 104 1.9 × 104

Space 322.1M 31.0M 44.8M 57.3M 70.1M 84.7M 91.4M

Density 1 1.2 × 10−1 7.0 × 10−2 4.5 × 10−2 2.7 × 10−2 1.3 × 10−2 7 × 10−3

Upper Bound — 3.9 × 10−1 2.2 × 10−1 1.3 × 10−1 7.2 × 10−2 2.9 × 10−2 1.5 × 10−2

Max-Aff 500 156 352 619 986 1405 1673

Ave-Aff 500.0 24.1 34.8 44.5 54.4 65.7 70.9

TABLE 10 Sparsity Comparison on YouTube Data with 1, 138, 499 Nodes Methods M odM ax − 500 EdgeCluster − 200 EdgeCluster − 500 EdgeCluster − 1000 EdgeCluster − 2000 EdgeCluster − 5000 EdgeCluster − 10000 EdgeCluster − 20000 EdgeCluster − 50000

Time N/A 574.7 606.6 779.2 558.9 554.9 561.2 507.5 597.4

Space 4.6G 36.2M 39.9M 42.3M 44.2M 45.6M 46.4M 47.0M 48.2M

Density 1 9.9 × 10−3 4.4 × 10−3 2.3 × 10−3 1.2 × 10−3 5.0 × 10−4 2.5 × 10−4 1.3 × 10−4 5.2 × 10−5

SocioDim converts a network into attributes, other behavioral features (if available) can be combined with social dimensions for behavior learning. 6.2

Scalability Study

As we have introduced in Theorem 1, the social dimensions constructed according to edge-centric clustering are guaranteed to be sparse because the density is upper bounded by a small value. Here, we examine how sparse the social dimensions are in practice. We also study how the computation time (with a Core2Duo E8400 CPU and 4GB memory) varies with the number of edge clusters. The computation time, the memory footprint of social dimensions, their density and other related statistics on all three data sets are reported in Tables 8-10. Concerning the time complexity, it is interesting that computing the top eigenvectors of a modularity matrix is actually quite efficient as long as there is no memory concern. This is observed on the Flickr

Upper Bound — 2.3 × 10−2 9.7 × 10−3 5.0 × 10−3 2.6 × 10−3 1.0 × 10−3 5.1 × 10−4 2.6 × 10−4 1.1 × 10−4

Max-Aff 500 121 255 325 375 253 356 305 297

Ave-Aff 500.00 1.99 2.19 2.32 2.43 2.50 2.54 2.58 2.62

data. However, when the network scales to millions of nodes (YouTube), modularity maximization becomes difficult (though an iterative method or distributed computation can be used) due to its excessive memory requirement. On the contrary, the EdgeCluster method can still work efficiently as shown in Table 10. The computation time of EdgeCluster for YouTube is much smaller than for Flickr, because the YouTube network is extremely sparse. The number of edges and the average degree in YouTube are smaller than those in Flickr, as shown in Table 4. Another observation is that the computation time of EdgeCluster does not change much with varying numbers of clusters. No matter how many clusters exist, the computation time of EdgeCluster is of the same order. This is due to the efficacy of the proposed k-means variant in Figure 5. In the algorithm, we do not iterate over each cluster and each centroid to do the cluster assignment, but exploit the sparsity of edge-centric data to compute only the similarity

10 Normalization Effect on Flickr

Normalization Effect on Flickr 0.34

0.22

0.34

None Node Affiliation Community Size Both

0.32 0.2

Macro−F1

0.26 0.24 0.22 0.2

1%

0.18

0.16

0.14

0.12

5% 0.18

0.32

Micro F1−Measure

Macro F1−Measure

0.3 0.28

None Node Affiliation Community Size Both

0.3

0.28

0.26

0.24

10%

0.16 2 10

0.1 3

4

10 10 Number of Edge Clusters

5

10

1

2

3

4

5

6

7

8

9

10

Percentage of Training set(%)

0.22

1

2

3

4

Fig. 7. Sensitivity to Dimensionality

Fig. 8. Regularization Effect on Flickr

of a centroid and those relevant instances. This, in effect, makes the computational cost independent of the number of edge clusters. As for the memory footprint reduction, sparse social dimension does an excellent job. On Flickr, with only 500 dimensions, M odM ax requires 322.1M, whereas EdgeCluster requires less than 100M. This effect is stronger on the mega-scale YouTube network, where ModMax becomes impractical to compute directly. It is expected that the social dimensions of M odM ax would occupy 4.6G memory. On the contrary, the sparse social dimensions based on EdgeCluster requires only 30-50M. The steep reduction of memory footprint can be explained by the density of the extracted dimensions. For instance, in Table 10, when we have 50,000 dimensions, the density is only 5.2 × 10−5 . Consequently, even if the network has more than 1 million nodes, the extracted social dimensions still occupy only a tiny memory space. The upper bound of the density is not tight when the number of clusters k is small. As k increases, the bound becomes tight. In general, the true density is roughly half of the estimated bound.

6.4

6.3

Sensitivity Study

Our proposed EdgeCluster model requires users to specify the number of social dimensions (edge clusters). One question which remains to be answered is how sensitive the performance is with respect to the parameter. We examine all three data sets, but find no strong pattern to determine optimal dimensionality. Due to the space limit, we include only one case here. Figure 7 shows the Macro-F1 performance change on YouTube data. The performance, unfortunately, is sensitive to the number of edge clusters. It thus remains a challenge to determine the parameter automatically. However, a general trend across all three data sets is observed. The optimal dimensionality increases as the portion of labeled nodes expands. For instance, when there is 1% of labeled nodes in the network, 500 dimensions seem optimal. But when the labeled nodes increase to 10%, 2000-5000 dimensions becomes a better choice. In other words, when label information is scarce, coarse extraction of latent affiliations is better for behavior prediction. But when the label information multiplies, the affiliations should be zoomed to a more granular level.

5

6

7

8

9

10

Percentage of Training set(%)

Regularization Effect

In this section, we study how EdgeCluster performance varies with different regularization schemes. Three different regularizations are implemented: regularization based on community size, node affiliation, and both (we first apply community size regularization, then node affiliation). The results without any regularization are also included as a baseline for comparison. In our experiments, the community size regularization penalty function is set to (log(|Ci |))2 , and node affiliation regularization normalizes the summation of each node’s community membership to 1. As shown in Figure 8, regularization on both community size and node affiliation consistently boosts the performance on Flickr data. We also observe that node affiliation regularization significantly improves performance. It seems like community-size regularization is not as important as the node-affiliation regularization. Similar trends are observed in the other two data sets. We must point out that, when both community-size regularization and node-affiliation are applied, it is indeed quite similar to the tf-idf weighting scheme for representation of documents [24]. Such a weighting scheme actually can lead to a quite different performance in dealing with social dimensions extracted from a network. 6.5

Visualization of Extracted Social Dimensions

As shown in previous subsections, EdgeCluster yields an outstanding performance. But are the extracted social dimensions really sensible? To understand what the extracted social dimensions are, we investigate tag clouds associated with different dimensions. Tag clouds, with font size denoting tags’ relative frequency, are widely used in social media websites to summarize the most popular ongoing topics. In particular, we aggregate tags of individuals in one social dimension as the tags of that dimension. However, it is impossible to showcase all the extracted dimensions. Hence, we pick the dimension with the maximum SVM weight given a category. Due to the space limit, we show only two examples from BlogCatalog. To make the figure legible, we include only those tags whose frequency is greater than 2. The dimension in Figure 9 is about cars, autos, automobile, pontiac, ferrari, etc. It is highly relevant to the category Autos. Similarly, the dimension in

11

Fig. 9. Dimension selected by Autos

Figure 10 is about baseball, mlb, basketball, and football. It is informative for classification of the category Sports. There are a few less frequent tags, such as movies and ipod, associated with selected dimensions as well, suggesting the diverse interests of each person. EdgeCluster, by analyzing connections in a network, is able to extract social dimensions that are meaningful for classification.

7

R ELATED W ORK

Classification with networked instances are known as within-network classification [9], or a special case of relational learning [11]. The data instances in a network are not independently identically distributed (i.i.d.) as in conventional data mining. To capture the correlation between labels of neighboring data objects, typically a Markov dependency assumption is assumed. That is, the label of one node depends on the labels (or attributes) of its neighbors. Normally, a relational classifier is constructed based on the relational features of labeled data, and then an iterative process is required to determine the class labels for the unlabeled data. The class label or the class membership is updated for each node while the labels of its neighbors are fixed. This process is repeated until the label inconsistency between neighboring nodes is minimized. It is shown that [9] a simple weighted vote relational neighborhood classifier [25] works reasonably well on some benchmark relational data and is recommended as a baseline for comparison. However, a network tends to present heterogeneous relations, and the Markov assumption can only capture the local dependency. Hence, researchers propose to model network connections or class labels based on latent groups [20], [26]. A similar idea is also adopted in [2] to differentiate heterogeneous relations in a network by extracting social dimensions to represent the potential affiliations of actors in a network. The authors suggest using the community membership of a soft clustering scheme as social dimensions. The extracted social dimensions are treated as features, and a support vector machine based on that can be constructed for classification. It has been shown that the proposed social dimension approach significantly outperforms representative methods based on collective inference. There are various approaches to conduct soft clustering for a graph. Some are based on matrix factor-

Fig. 10. Dimension selected by Sports

ization, like spectral clustering [27] and modularity maximization [3]. Probabilistic methods are also developed [28], [29]. Please refer to [30] for a comprehensive survey. A disadvantage with soft clustering is that the resultant social dimensions are dense, posing thorny computational challenges. Another line of research closely related to the method proposed in this work is finding overlapping communities. Palla et al. propose a clique percolation method to discover overlapping dense communities [31]. It consists of two steps: first find out all the cliques of size k in a graph. Two k-cliques are connected if they share k − 1 nodes. Based on the connections between cliques, we can find the connected components with respect to k-cliques. Each component then corresponds to one community. Since a node can be involved in multiple different k-cliques, the resultant community structure allows one node to be associated with multiple different communities. A similar idea is presented in [32], in which the authors propose to find out all the maximal cliques of a network and then perform hierarchical clustering. Gregory [33] extends the Newman-Girvan method [34] to handle overlapping communities. The original Newman-Girvan method recursively removes edges with highest betweenness until a network is separated into a pre-specified number of disconnected components. It outputs non-overlapping communities only. Therefore, Gregory proposes to add one more action (node splitting) besides edge removal. The algorithm recursively splits nodes that are likely to reside in multiple communities into two, or removes edges that seem to bridge two different communities. This process is repeated until the network is disconnected into the desired number of communities. The aforementioned methods enumerate all the possible cliques or shortest paths within a network, whose computational cost is daunting for large-scale networks. Recently, a simple scheme proposed to detect overlapping communities is to construct a line graph and then apply graph partition algorithms [16], [17]. However, the construction of the line graph alone, as we discussed, is prohibitive for a network of a reasonable size. In order to detect overlapping communities, scalable approaches have to be developed. In this work, the k-means clustering algorithm is used to partition the edges of a network into disjoint

12

sets. We also propose a k-means variant to take advantage of its special sparsity structure, which can handle the clustering of millions of edges efficiently. More complicated data structures such as kd-tree [35], [36] can be exploited to accelerate the process. If a network might be too huge to reside in memory, other kmeans variants can be considered to handle extremely large data sets like online k-means [37], scalable kmeans [38], and distributed k-means [39].

8

C ONCLUSIONS

AND

F UTURE W ORK

It is well known that actors in a network demonstrate correlated behaviors. In this work, we aim to predict the outcome of collective behavior given a social network and the behavioral information of some actors. In particular, we explore scalable learning of collective behavior when millions of actors are involved in the network. Our approach follows a social-dimensionbased learning framework. Social dimensions are extracted to represent the potential affiliations of actors before discriminative learning occurs. As existing approaches to extract social dimensions suffer from scalability, it is imperative to address the scalability issue. We propose an edge-centric clustering scheme to extract social dimensions and a scalable k-means variant to handle edge clustering. Essentially, each edge is treated as one data instance, and the connected nodes are the corresponding features. Then, the proposed k-means clustering algorithm can be applied to partition the edges into disjoint sets, with each set representing one possible affiliation. With this edge-centric view, we show that the extracted social dimensions are guaranteed to be sparse. This model, based on the sparse social dimensions, shows comparable prediction performance with earlier social dimension approaches. An incomparable advantage of our model is that it easily scales to handle networks with millions of actors while the earlier models fail. This scalable approach offers a viable solution to effective learning of online collective behavior on a large scale. In social media, multiple modes of actors can be involved in the same network, resulting in a multimode network [40]. For instance, in YouTube, users, videos, tags, and comments are intertwined with each other in co-existence. Extending the edge-centric clustering scheme to address this object heterogeneity can be a promising future direction. Since the proposed EdgeCluster model is sensitive to the number of social dimensions as shown in the experiment, further research is needed to determine a suitable dimensionality automatically. It is also interesting to mine other behavioral features (e.g., user activities and temporalspatial information) from social media, and integrate them with social networking information to improve prediction performance.

ACKNOWLEDGMENTS This research is, in part, sponsored by the Air Force Office of Scientific Research Grant FA95500810132. R EFERENCES [1] [2]

[3]

[4]

[5]

[6] [7]

[8] [9] [10] [11] [12] [13] [14] [15] [16] [17]

[18] [19] [20]

[21] [22]

L. Tang and H. Liu, “Toward predicting collective behavior via social dimension extraction,” IEEE Intelligent Systems, vol. 25, pp. 19–25, 2010. ——, “Relational learning via latent social dimensions,” in KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2009, pp. 817–826. M. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), vol. 74, no. 3, 2006. [Online]. Available: http://dx.doi.org/10.1103/PhysRevE.74.036104 L. Tang and H. Liu, “Scalable learning of collective behavior based on sparse social dimensions,” in CIKM ’09: Proceeding of the 18th ACM conference on Information and knowledge management. New York, NY, USA: ACM, 2009, pp. 1107–1116. P. Singla and M. Richardson, “Yes, there is a correlation: - from social networks to personal behavior on the web,” in WWW ’08: Proceeding of the 17th international conference on World Wide Web. New York, NY, USA: ACM, 2008, pp. 655–664. M. McPherson, L. Smith-Lovin, and J. M. Cook, “Birds of a feather: Homophily in social networks,” Annual Review of Sociology, vol. 27, pp. 415–444, 2001. A. T. Fiore and J. S. Donath, “Homophily in online dating: when do you like someone like yourself?” in CHI ’05: CHI ’05 extended abstracts on Human factors in computing systems. New York, NY, USA: ACM, 2005, pp. 1371–1374. H. W. Lauw, J. C. Shafer, R. Agrawal, and A. Ntoulas, “Homophily in the digital world: A LiveJournal case study,” IEEE Internet Computing, vol. 14, pp. 15–23, 2010. S. A. Macskassy and F. Provost, “Classification in networked data: A toolkit and a univariate case study,” J. Mach. Learn. Res., vol. 8, pp. 935–983, 2007. X. Zhu, “Semi-supervised learning literature survey,” 2006. [Online]. Available: http://pages.cs.wisc.edu/∼jerryzhu/ pub/ssl survey 12 9 2006.pdf L. Getoor and B. Taskar, Eds., Introduction to Statistical Relational Learning. The MIT Press, 2007. X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in ICML, 2003. S. White and P. Smyth, “A spectral clustering approach to finding communities in graphs,” in SDM, 2005. M. Newman, “Power laws, Pareto distributions and Zipf’s law,” Contemporary physics, vol. 46, no. 5, pp. 323–352, 2005. F. Harary and R. Norman, “Some properties of line digraphs,” Rendiconti del Circolo Matematico di Palermo, vol. 9, no. 2, pp. 161–168, 1960. T. Evans and R. Lambiotte, “Line graphs, link partitions, and overlapping communities,” Physical Review E, vol. 80, no. 1, p. 16105, 2009. Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, “Link communities reveal multi-scale complexity in networks,” 2009. [Online]. Available: http://www.citebase.org/abstract?id=oai: arXiv.org:0903.3178 R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008. J. Hopcroft and R. Tarjan, “Algorithm 447: efficient algorithms for graph manipulation,” Commun. ACM, vol. 16, no. 6, pp. 372–378, 1973. J. Neville and D. Jensen, “Leveraging relational autocorrelation with latent group models,” in MRDM ’05: Proceedings of the 4th international workshop on Multi-relational mining. New York, NY, USA: ACM, 2005, pp. 49–55. R.-E. Fan and C.-J. Lin, “A study on threshold selection for multi-label classication,” 2007. L. Tang, S. Rajan, and V. K. Narayanan, “Large scale multilabel classification via metalabeler,” in WWW ’09: Proceedings of the 18th international conference on World wide web. New York, NY, USA: ACM, 2009, pp. 211–220.

13

[23] Y. Liu, R. Jin, and L. Yang, “Semi-supervised multi-label learning by constrained non-negative matrix factorization,” in AAAI, 2006. [24] F. Sebastiani, “Machine learning in automated text categorization,” ACM Comput. Surv., vol. 34, no. 1, pp. 1–47, 2002. [25] S. A. Macskassy and F. Provost, “A simple relational classifier,” in Proceedings of the Multi-Relational Data Mining Workshop (MRDM) at the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003. [26] Z. Xu, V. Tresp, S. Yu, and K. Yu, “Nonparametric relational learning for social network analysis,” in KDD’2008 Workshop on Social Network Mining and Analysis, 2008. [27] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007. [28] K. Yu, S. Yu, and V. Tresp, “Soft clustering on graphs,” in NIPS, 2005. [29] E. Airodi, D. Blei, S. Fienberg, and E. P. Xing, “Mixed membership stochastic blockmodels,” J. Mach. Learn. Res., vol. 9, pp. 1981–2014, 2008. [30] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5, pp. 75–174, 2010. [31] G. Palla, I. Der´enyi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, pp. 814–818, 2005. [32] H. Shen, X. Cheng, K. Cai, and M. Hu, “Detect overlapping and hierarchical community structure in networks,” Physica A: Statistical Mechanics and its Applications, vol. 388, no. 8, pp. 1706–1712, 2009. [33] S. Gregory, “An algorithm to find overlapping community structure in networks,” in PKDD, 2007, pp. 91–102. [Online]. Available: http://www.cs.bris.ac.uk/Publications/ pub master.jsp?id=2000712 [34] M. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, p. 026113, 2004. [Online]. Available: http://www.citebase. org/abstract?id=oai:arXiv.org:cond-mat/0308217 [35] J. Bentley, “Multidimensional binary search trees used for associative searching,” Comm. ACM, vol. 18, pp. 509–175, 1975. [36] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, “An efficient k-means clustering algorithm: Analysis and implementation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 881–892, 2002. [37] M. Sato and S. Ishii, “On-line EM algorithm for the normalized gaussian network,” Neural Computation, 1999. [38] P. Bradley, U. Fayyad, and C. Reina, “Scaling clustering algorithms to large databases,” in ACM KDD Conference, 1998. [39] R. Jin, A. Goswami, and G. Agrawal, “Fast and exact outof-core and distributed k-means clustering,” Knowl. Inf. Syst., vol. 10, no. 1, pp. 17–40, 2006. [40] L. Tang, H. Liu, J. Zhang, and Z. Nazeri, “Community evolution in dynamic multi-mode networks,” in KDD ’08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2008, pp. 677–685. [41] M. Hazewinkel, Ed., Encyclopaedia of Mathematics. Springer, 2001.

A PPENDIX A P ROOF OF T HEOREM 1 Proof: It is observed that the node degree in a social network follows power law [14]. For simplicity, we use a continuous power-law distribution to approximate the node degrees. Suppose

Hence,

p(x) = (α − 1)x−α ,

x ≥ 1.

It follows that the probability that a node with degree larger than k is Z +∞ P (x > k) = p(x)dx = k −α+1 . k

Meanwhile, we have Z

k

xp(x)dx Z k (α − 1) x · x−α dx 1

=

1

α − 1 −α+2 k x |1 −α + 2 α−1 1 − k −α+2 α−2

= = Therefore, P density

≤ = ≈ = =

{i|di ≤k}

di +

P

{i|di >k}

k

nk k X 1 |{i|di ≥ k}| |{i|di = d}| ·d+ k n n d=1 Z k 1 xp(x)dx + P (x > k) k 1 1α−1 1 − k −α+2 + k −α+1 kα−2 α−11 α−1 − − 1 k −α+1 α−2k α−2

The proof is completed.

A PPENDIX B P ROOF OF T HEOREM 2 Proof: Let G and L(G) denote a graph and its line graph respectively, di the degree of node i in G. Suppose G has n nodes and m edges, L(G) has N nodes and M edges. As each edge in G becomes a node in L(G), it is straightforward that (8)

N =m

Because the connections of one node in G form a clique in L(G), we have 1X M = di (di − 1) 2 i 1X 2 1X = d − di 2 i i 2 i =

1 2 (d + d22 + · · · + d2n )(1 + 1 + · · · + 1) − m | {z } 2n 1 n

p(x) = Cx−α ,

x≥1

where x is a variable denoting the node degree and C is a normalization constant. It is not difficult to verify that C = α − 1.

1 ≥ (d1 · 1 + d2 · 1 + · · · + dn · 1)2 − m 2n 1 = 4m2 − m 2n 2m = m −1 n

(9)

(10)

14

The inequality (9) is derived following the Cauchy– Schwarz inequality [41]. The equation is achieved if and only if d1 = d2 = · · · = dn .

A PPENDIX C P ROOF OF T HEOREM 3 Proof: "

n X

1 = E n

E[2M/n]

d2i

−

i=1

n X

Lei Tang is a scientist at Yahoo! Labs. He received his Ph.D. in computer science and engineering from Arizona State University, and B.S. from Fudan University, China. His research interests include social computing, data mining and computational advertising. His book on Community Detection and Mining in Social Media is published by Morgan and Claypool Publishers in 2010. He is a professional member of AAAI, ACM and IEEE.

!# di

i=1

= E[X 2 ] − E[X] where X denotes the random variable (node degree). For a power law distribution p(x) = (α − 1)x−α , it follows that [14], α−1 , if α > 2; α−2 α−1 , if α > 3. E[X 2 ] = α−3 E[X] =

Xufei Wang is a PhD student in computer science and engineering at Arizona State University. He received his Masters from Tsinghua University, and Bachelor of science from Zhejiang University, China. His research interests are in social computing and data mining. Specifically, he is interested in collective wisdom, familiar stranger, tag network, crowdsourcing, and data crawling. He is an IEEE student member.

Hence, E[2M/n]

diverges =

α−1 (α−3)(α−2)

if 2 < α 3

(11)

The divergence of the expectation tells us that as we go to larger and larger data sets, our estimate of number of connections in the line graph, with respect to the original network size, will increase without bound.

A PPENDIX D P ROOF OF T HEOREM 4 Proof: Suppose a network with n nodes follows a power law distribution as p(x) = Cx−α ,

x ≥ xmin > 0

where α is the exponent and C is a normalization constant. Then the expected number of degree for each node is [14]: E[x] =

α−1 xmin α−2

where xmin is the minimum nodal degree in a network. This can be verified via the properties of power law distribution. In reality, we normally deal with nodes with at least one connection, so xmin ≥ 1. Hence, the expected number of connections in a network following a power law distribution is E[m] =

1α−1 n. 2α−2

Huan Liu is a professor of Computer Science and Engineering at Arizona State University. He received his Ph.D. from University of Southern California and Bachelor of Engineering from Shanghai Jiao Tong University. He has worked at Telecom Research Labs in Australia, and taught at National University of Singapore before joining ASU in Year 2000. Huan Liu has been recognized for excellence in teaching and research in Computer Science and Engineering at Arizona State University. His research interests are in data/web mining, machine learning, social computing, and artificial intelligence, investigating problems that arise in many real-world applications with high-dimensional data of disparate forms such as social media, modeling group interaction, text categorization, bioinformatics, and text/web mining. His professional memberships include AAAI, ACM, ASEE, SIAM, and IEEE. He can be contacted via http://www.public.asu.edu/∼huanliu.

Scalable Learning of Collective Behavior - Semantic Scholar

Scalable Learning of Collective Behavior - Semantic Scholar

Suggest Documents

Scalable Learning of Collective Behavior Based ... - Semantic Scholar

Collective Animal Behavior from Bayesian ... - Semantic Scholar

Duetting as a Collective Behavior - Semantic Scholar

Scalable Analysis of Collective Behaviour in Smart ... - Semantic Scholar

Collective learning and semiotic dynamics - Semantic Scholar

Scalable Bayesian Reinforcement Learning for ... - Semantic Scholar

Learning and Generalization of Behavior ... - Semantic Scholar

designing collective behavior in a group of ... - Semantic Scholar

Shop 'Til You Drop II: Collective Adaptive Behavior ... - Semantic Scholar

Toward Collective Behavior Prediction via Social ... - Semantic Scholar

Quantifying collective behavior in mammalian cells - Semantic Scholar

Towards Understanding Learning Behavior ... - Semantic Scholar

Collective Programming - Semantic Scholar

Collective wisdom - Semantic Scholar

Collective Robotics - Semantic Scholar

Collective Annotation - Semantic Scholar

Collective Communications for Scalable Programming

Collective Communications for Scalable Programming

Murmurations - collective behavior [PDF]

Collective Behavior - Alexandre Bergel

Learning from the Post-ItÂ®: Building collective ... - Semantic Scholar

Toward Scalable Learning with Non-uniform ... - Semantic Scholar

Toward Scalable Learning with Non-uniform ... - Semantic Scholar

Collective tree exploration - Semantic Scholar