Knowledge-Based Systems 26 (2012) 30–39
Community detection based on a semantic network
ZhengYou Xia ⇑, Zhan Bu
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
Article info
Article history: Received 14 January 2011; received in revised form 19 June 2011; accepted 20 June 2011; available online 25 June 2011
Keywords: Online social network; Comment content; Semantic network; Community detection; Giant component
Abstract

As information technology has advanced, people are turning more frequently to electronic media for communication, and social relationships are increasingly found in online channels. Massive amounts of the real data collected from online social networks (e.g., Internet newsgroups, BBS, and chat rooms) are network structured. Discovering the latent communities therein is a useful way to better understand the properties of a virtual social network. However, community-detection tasks were infeasible in previous studies of online social networks, especially with large-scale or weighted networks. In this paper, we constructed a semantic network using the semantic information extracted from comment content. In our modeling, we considered the impact of the weight on every edge and focused on the “giant component” of the online social network to reduce computational complexity; thus, our method can handle large-scale networks. In the experimental work, we evaluated our method using real datasets and compared our approach with several previous methods based on comment interactions; the results show that our method is much faster, more effective and more robust. © 2011 Elsevier B.V. All rights reserved.
⇑ Corresponding author. E-mail address: [email protected] (Z. Xia). doi:10.1016/j.knosys.2011.06.014

1. Introduction

Online social networks (e.g., Internet newsgroups, BBS, and chatrooms) are an appealing way for members of such groups to communicate because they are easily accessed from almost anywhere in the world and offer the potential for camouflaging organized crime among the numerous background communications. Members of organized crime networks are often camouflaged as legal Internet users. Recent applications of data mining to online social networks have shown that increasing amounts of real data are network structured [1]: the users in these networks (people, IP addresses, etc.) are usually modeled by nodes of graphs; the connection relations (trust or dependent relations) between members are represented by the graph edges [2]. Though such data often involve massive relational information among objects, we attempt to divide all the latent criminals within the networks into several latent communities.

Community detection is significant in understanding network structure. The methodology for finding these latent communities within networks comes from the physics community and is based on the use of deterministic algorithms. These algorithms focus on optimizing an energy-based cost function that is always defined with fixed parameters over possible community assignments of nodes [3,4]. A notable work proposed by Newman and Girvan [4] introduced modularity as a posterior measure of network structure. Modularity weighs the connectivity within communities against the connectivity between them; this metric has been influential in the community-detection literature and has found success in many applications [5,6].

However, much previous work on online social networks has reported the infeasibility of community-detection tasks, especially in large-scale or weighted networks, mainly because:

(a) Previous work [7–9] has been mainly based on comment interactions between network users, whereas the comment content has largely been ignored in these studies. These comments are typically very noisy, replete with nonstandard grammar and spelling, usually unedited, and often cryptic and uninformative, so extracting and processing their content is somewhat complex. However, comments constitute a substantial part of online social networks and can be seen as an indicator of the popularity of networks themselves [10–15]. With this consideration, we hoped to utilize the semantic information within semantic networks to improve community-detection performance.

(b) Information loss: the weight of relations reflects the importance of the interactions between objects. This important information is lost in those models that only consider binary relations [16]. These models only utilize information on the presence or absence of relations and ignore their importance, an approach that has proven infeasible in many real-world applications [17,18]. Therefore, here we considered the impact of the weight on every edge.

(c) Computational complexity: in most social networks, the vast majority of nodes form a “giant component”, leaving only a small proportion of nodes disconnected from that component. These isolated nodes are grouped mainly in pairs or, at
most, in small clusters of four. In this work, we argue that, if we focus on detecting the former communities, it is unnecessary to consider those isolated nodes. This can reduce the computational complexity while yielding a comparable performance.

Motivated by the three problems above, we extracted the comment content from initial HTML source files and used it to construct semantic networks. To utilize the weight information associated with the networks, the strategies of the previously described sentiment-analysis method [14] were used to improve the identification of comment orientation and scoring. As comments can be considered implicit links between people, we associate a score with each link, which is calculated by averaging over all the scores of the comments between two user IDs. To reduce the computational complexity, we introduced a threshold n, such that an undirected edge exists if and only if the score on this edge is within the range specified by n. The classical community-detection algorithm is then implemented on the “giant component” of the above semantic network. We evaluated our method on real datasets and compared our approach with several previous methods. The results show that our method is much faster, more effective and more robust.

The remainder of this paper is organized as follows. In Section 2, we provide an introduction to the research motivation and a description of the study data. In Section 3, we apply sentiment analysis to every comment and construct the corresponding semantic network. In Section 4, several characteristics of the semantic network are identified through statistical analysis. In Section 5, the community structure is studied in detail, and some interesting observations are made. A discussion of the results and conclusions follow in Section 6.

2. Motivation and dataset

In this section, the research motivation and the data we used are introduced.
2.1. Motivation

Online social networks (e.g., Internet newsgroups, BBS, and chatrooms) are an appealing way for members of such groups to communicate because they are easily accessed from almost anywhere in the world and offer the potential for camouflaging organized crime among the numerous background communications. Members of organized crime networks are often camouflaged as legal Internet users. The users in these criminal networks (people, IP addresses, etc.) are usually modeled by nodes of graphs. The connection relations (trust or dependent relations) between members are represented by the graph edges [2]. With the help of preexisting community-detection algorithms, we hope to divide all the latent criminals within the networks into several latent communities. This task has proven infeasible in previous studies, especially for large-scale or weighted networks. Here, we present a fast, effective and robust method for community detection based on an online social network.

2.2. Data and preprocessing

The data used in this paper were downloaded from Tianya.com, a popular bulletin-board service in China. It includes more than 300 boards, and the total number of registered user identifications (IDs) is more than 32 million. Since its introduction in 1999, it has become the leading social-networking site in China due to its openness and freedom. Each article on Tianya contains the author ID (the user ID posting the current article), title, board information, date and time, and contents; if the post is a reply article, the replier ID (the user ID who posted the article that the current article comments on) and replied contents are also included. All the above information is regularly distributed in an HTML source file; here, we implemented a relatively simple analysis tool to extract the data using regular expressions. Fig. 1 shows a typical HTML source file on Tianya.com, in which all the necessary components are marked.
Fig. 1. A typical HTML source file with necessary components marked.
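A minimal sketch of regular-expression extraction of this kind is given below. The actual Tianya markup is not reproduced in the text, so the prefix and suffix strings, the sample HTML and the function name `extract_comments` are all hypothetical stand-ins chosen only to illustrate the idea of capturing whatever lies between two fixed affixes.

```python
import re
from typing import List

# Hypothetical affixes: the real Tianya markup is not given in the text.
COMMENT_PREFIX = '<div class="reply">'
COMMENT_SUFFIX = '</div>'


def extract_comments(html: str) -> List[str]:
    """Return the text found between each prefix/suffix pair."""
    pattern = re.compile(
        re.escape(COMMENT_PREFIX) + r"(.*?)" + re.escape(COMMENT_SUFFIX),
        re.DOTALL,  # let a comment span several lines
    )
    return [match.strip() for match in pattern.findall(html)]


# Made-up page fragment used only for illustration.
sample_html = (
    '<html><body>'
    '<div class="reply"> first comment </div>'
    '<p>navigation</p>'
    '<div class="reply">second\ncomment</div>'
    '</body></html>'
)
comments = extract_comments(sample_html)
```

The non-greedy group `(.*?)` stops at the first closing affix, which keeps adjacent comments from being merged into one match.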
In this study, we selected the worldview board on Tianya.com from which to collect statistics for the BBS networks defined below. The networks were created from the articles posted between July 2003 and January 2010, including 324,666 users, 99,735 threads and 4,712,859 replies.

3. Semantic network

Traditional online social network studies are based on the comment interaction between their users, while the comment content is largely ignored. In this section, we first introduce the traditional network-construction method using comment interaction, and then we propose a new analytic method for the extraction of comment content, with which we construct the corresponding semantic network.

3.1. Previous studies

In traditional studies of online social networks, every registered ID is a node $i \in V$ in a graph $G = \langle V, E \rangle$. The edges of the graph $(i, j) \in E$ indicate social relations between two individuals, which are a consequence of their comment activity. Let $n_{ij}$ be the number of times that user i writes a comment to user j. Links between two users can be defined using $n_{ij}$ in several ways. In the work of Gomez et al. [8], three network types were discussed systematically:

Undirected dense network: an undirected edge exists between users i and j if either $n_{ij} > 0$ or $n_{ji} > 0$. The weight of that edge equals the sum $w_{ij} = n_{ij} + n_{ji}$.

Undirected sparse network: an undirected edge exists between users i and j if $n_{ij} > 0$ and $n_{ji} > 0$. The weight of that edge is defined as $w_{ij} = \min(n_{ij}, n_{ji})$.

Directed network: a directed edge exists from user i to user j if $n_{ij} > 0$, regardless of the value of $n_{ji}$. The weight is $w_{ij} = n_{ij}$.

In our previous research, we proposed the definition of an interest network, which is also based on comment interaction. Adopting the formalism above, it is defined as follows:

Interest network: an undirected edge exists between users i and j if $n_{ip} > 0$ and $n_{jp} > 0$, where user p began a thread. The weight of an edge is defined as $w_{ij} = \sum_{p \in P} \min(n_{ip}, n_{jp})$, where P is the set of users who have begun a thread. The average number of follow-up postings on Tianya is approximately 50 per thread; thus, the total number of network edges is $O(n_p \cdot 50^2)$, where $n_p$ is the number of BBS threads. Due to the enormity of this number, we introduce a threshold k, such that an undirected edge exists if and only if $w_{ij} > k$.

3.2. Comment extraction

Traditional online social network studies have largely ignored the comment content; comments are typically very noisy, use nonstandard grammar and spelling, are usually unedited, and are often cryptic and uninformative; thus, extracting and processing their content is somewhat complex. A previous study by Mishne and Glance [10] showed that less than 2% of comment content is currently available in syndication form. However, as forum platforms have developed and new standards allowing comment syndication have been adopted, comment extraction from forum HTML content has become much easier than before. As noted in Fig. 1, here, every comment is sandwiched between two regular affixes (e.g., the prefix is ‘
’). Using regular expressions, we can extract every comment from the forum HTML source file. To test the accuracy of our method, we manually evaluated its output on a set of 10 randomly selected forum posts from our
Table 1 Comment-extraction evaluation.

Post    m       a       φ (%)
1       25      25      100
2       4033    4027    99.85
3       260     262     99.23
4       156     155     99.36
5       73      76      95.89
6       321     321     100
7       115     120     95.65
8       1354    1370    98.82
9       598     600     99.67
10      354     353     99.72
Φ       –       –       98.81
HTML source file; in this set, every post had a different number of comments. For a given post i, the accuracy $\phi_i$ was tested by comparing the manual comment extraction (the actual number of comments $m_i$) and the automated one (the number obtained by automated machine, $a_i$), namely, $\phi_i = 1 - \frac{|m_i - a_i|}{m_i}$. The total accuracy of our method was then measured as $\Phi = \left( \prod_{i=1}^{10} \phi_i \right)^{1/10}$. The results
of this evaluation are given in Table 1. The figure in the last line shows that the accuracy of our method is very high.

3.3. A new network type based on semanteme

Forum comments serve as a simple and effective way for users to interact with their readership. They are among the defining set of weblog characteristics, and most posters identify comment feedback as an important motivation for their writing. Moreover, on examining every comment, we find that most of them have an implicit orientation that is mostly conveyed by several emotional words. Those emotional words basically include two types: supportive and opposing. For example, phrases such as ‘‘ ’’, ‘‘ ’’ or ‘‘ ’’ are supportive, whereas words such as “NND”, “TM” or ‘‘ ’’ are opposing. We counted the emotional terms/phrases in the comment data, including both supportive and opposing ones, and selected the 50 most frequent items of each type; every term/phrase was then manually assigned a value between 0 and 1 according to its tone. A higher value corresponds to a greater degree of support; if the phrase is neutral, we assigned it a value of 0.5. Thus, every phrase has an associated numerical “trust”. In Table 2 we roughly identified several terms or phrases, with English versions in parentheses, from the public discussions on Tianya.com as either supportive or opposing. Accordingly, the semantic network is constructed as follows:

(1) The undirected dense network is initialized, with every edge given a “trust” value of 0.5.
(2) Every comment between two users is analyzed and the “trust” value of every edge is updated (if there are several emotional words in one comment, we take the average).
(3) The edges with “trust” values between 0.5 − n and 0.5 + n are discarded.

Table 2 Emotional phrases, with English versions in parentheses, and their associated scores.

No.   Keywords                     Score   Orientation
1     /ding (Top)                  1.0     Supportive
2     (Classic)                    0.8     Supportive
3     (Sofa)                       0.7     Supportive
4     (Fantastic)                  0.7     Supportive
5     (Love)                       0.7     Supportive
...
46    NND/nnd (TNND)               0.25    Opposing
47    SB/sb (Shithead)             0.1     Opposing
48    TM/tm (Fuck)                 0.25    Opposing
49    YY/yy (Psychosexuality)      0       Opposing
50    (Idiot)                      0.2     Opposing

4. Statistical analysis of the semantic network

In this section, the statistical properties of the semantic network are analyzed, and we compare the results with three other mainstream networks based on comment interactions to characterize how they differ or resemble one another.

Table 3 Statistics of the Tianya social networks.

               Semantic (n = 0)    Undirected dense     Undirected sparse   Interest (k = 4)
N              162747 (161543)     323745 (323509)      12047 (10289)       20170 (19943)
M              678189 (677548)     2987953 (2987827)    17680 (16672)       1142204 (1142049)
Connectivity   0.0512%             0.0057%              0.0244%             0.5615%
Maxclust       99.26% (99.91%)     99.93% (99.99%)      85.41% (94.30%)     98.87% (99.99%)
⟨k⟩            8.33 (32.58)        18.46 (91.66)        2.94 (7.62)         113.26 (205.35)
r              0.0760              0.0899               0.1285              0.1129
C              0.0086              0.0712               0.0287              0.4259
Crand          0.0134              0.00364              0.0133              0.1028
l              4.2119              3.7781               5.29                2.97
lrand          5.6607              4.3517               8.7134              2.0957
D              11                  10                   17                  9

4.1. Global properties
Here, we discuss network characteristics from a global perspective. The detailed statistics of the semantic network are listed in Table 3 along with those of the other three networks. The connectivity of the network (row 3) is the ratio of actual links $M$ to the potential number of links $O(N^2)$. As shown in Table 3, the semantic network is highly sparse compared to the others. In the semantic network, the “giant component” comprises more than 99% of the users. These statistics indicate that the semantic network is characterized by a compact community and a small proportion of isolated users, in agreement with typical social networks. In the semantic network, $\langle k \rangle$ and its standard deviation are both low, meaning that users in this network have a relatively small circle of friends and resemble each other in their interaction behaviors. The average shortest path length $l$ is small for the semantic network, suggesting that it is a “small-world” network. Moreover, this statistic is roughly the same as that for a random graph, $l_{rand}$ [19–21]. The diameter $D$ of this social network is also very small. This has also been seen in other traditional social networks.

The clustering coefficient of a node $i$ is defined as $c_i = \frac{2E_i}{k_i(k_i - 1)}$, where $E_i$ is the number of edges among the $k_i$ neighbors of node $i$, and the clustering coefficient of the whole network, $C$, is the average of the individual $c_i$ [12]. We observed that, for the semantic network, $C$ is much higher than the randomized counterpart $C_{rand}$ [19–21]. Large $C$ values indicate that discussions can easily be initiated among groups of users. Small $l$ values indicate that ideas and opinions can propagate rapidly from one person to another. Hence, the small-world topologies of BBS networks ensure the propagation of discussions among users. This is consistent with other real-world networks, which exhibit similar deviations from the random graph, and adds to the collection of networks having the “small-world” property [22].
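The clustering coefficient and the connectivity ratio can be sketched in a few lines. This is only an illustration on a made-up four-node graph, not the tool used to produce Table 3; the function names are hypothetical, nodes with fewer than two neighbors are given a coefficient of 0 by assumption, and the potential number of links is taken to be $N(N-1)/2$, which is one concrete reading of the $O(N^2)$ in the text.

```python
from itertools import combinations


def clustering_coefficients(adj):
    """adj maps each node to the set of its neighbours (undirected graph)."""
    coeffs = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0  # assumption: undefined cases count as 0
            continue
        # E_i: number of edges that exist among the neighbours of `node`
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs


def connectivity(adj):
    """Ratio of actual links M to the N(N-1)/2 possible links (assumed form)."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return m / (n * (n - 1) / 2)


# Toy graph: a triangle a-b-c with one pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
coeffs = clustering_coefficients(toy)
C = sum(coeffs.values()) / len(coeffs)  # network-level average of c_i
conn = connectivity(toy)
```

In the toy graph, node c has three neighbors of which only one pair is linked, so its coefficient is 1/3, while a and b sit in a closed triangle and score 1.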
Fig. 2. Degree and strength distributions of the four networks.

Another statistic of social networks is the degree correlation, or mixing coefficient, that indicates whether highly connected users are preferentially linked to other highly connected users. Such preferential linking is known as assortative mixing by degree and is present in many social networks. Table 3 shows the correlation coefficient r [23,24] (also called the Pearson correlation coefficient) for our four networks. Interestingly, unlike traditional social networks, which exhibit significant assortative mixing, the semantic network is characterized by disassortative mixing. This aspect is analyzed in more detail below.
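As an illustration of the mixing coefficient, the sketch below computes the Pearson correlation between the degrees found at the two ends of every edge, counting each undirected edge in both directions. The function name and the star-shaped example are assumptions made for this sketch; a star is the extreme disassortative case, since its hub connects only to degree-1 nodes.

```python
def degree_assortativity(edges):
    """Pearson correlation of the degrees at the two ends of each edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Each undirected edge is counted once per direction, so xs and ys
    # contain the same multiset of values and share one mean and variance.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return float("nan")  # every node has the same degree
    return cov / var


# Star graph: hub 0 linked to three leaves.
star = [(0, 1), (0, 2), (0, 3)]
r_star = degree_assortativity(star)
```

A negative value, as for the star, corresponds to the disassortative mixing described above; a positive value would indicate that hubs tend to link to other hubs.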
Table 4 Descriptive coefficients.

             Semantic (n = 0)   Undirected dense   Undirected sparse   Interest (k = 4)
a_P(k)       0.47016            0.41206            0.61708             0.064799
b_P(k)       1.6                1.5                2.1                 0.65
a_P(s)       0.095183           0.38402            0.4968              0.064799
b_P(s)       0.7                1.5                1.8                 0.75
a_c(k)       5.1813             0.0166             0.0139              0.5033
b_c(k)       1.0093             0.082              0.1221              0.0007
a_knn(k)     19.082             4.399              62.589              82.332
b_knn(k)     203.48             30.846             778.05              27.048
a_c(s)       0.4658             3.1206             0.3213              2.8881
b_c(s)       0.7874             0.76995            0.7854              0.8681
a_knn(s)     165.17             11.299             179.13              539.11
b_knn(s)     1752.7             93.518             1947.9              207.52
a_s(k)       5.1813             15.334             0.9413              5.6924
b_s(k)       1.0093             1.0981             1.0838              1.0637
a_l(k)       0.1997             0.3407             0.1531              0.1975
b_l(k)       4.2989             5.2983             3.837               3.6266

4.2. Degree distribution

The degree $k_i$ of a user i, which is the number of users with whom he/she is connected, is distributed according to a power law followed by an exponential cutoff, namely, $p(k) \sim a k^{b}$, as shown in Fig. 2(a). The cumulative distribution function (cdf) of the degrees is shown in Fig. 2(b). The strength $s_i$ of user i is the sum of the weights of the edges attached to i [19]. Therefore, $s_i = \sum_{j=1}^{N} a_{ij} w_{ij}$, where $a_{ij}$ is the corresponding component of the adjacency matrix; its value is 1 if an edge connects vertices i and j and 0 otherwise, and $w_{ij}$ is the weight of the edge between i and j. As shown in Fig. 2(c), the strength distribution is also followed by an exponential cutoff, but the cutoff value b is larger. As expected, these distributions are all heavy-tailed, indicating a high heterogeneity between the users.

Fig. 3. c(k), knn(k), c(s), knn(s), s(k), l(k) of the four networks.

4.3. Other distributions

The clustering function $C(k)$ is defined as the average of $c_i$ over all vertices with a given degree k. For the semantic network, $C(k)$ decays as $a \log(k) + b$, with $a < 0$, which is consistent with previous reports [25]. As shown in Fig. 3(a), the trends of the semantic network and the undirected dense network are nearly identical. The weighted version is shown in Fig. 3(d), where $C(s)$ decays as $a e^{bs}$, with $b < 0$, for all four networks. The average nearest-neighbor degree function $k_{nn}(k)$ [26,27], which is defined as the average degree of the neighbors of vertices of degree k, also follows a logarithmic distribution, $k_{nn}(k) \sim a \log(k) + b$, for all these networks. As shown in Fig. 3(b), $k_{nn}(k)$ exhibits a slight downward curvature for the semantic network, with $a > 0$. The weighted version obeys the same tendency, as shown in Fig. 3(e). The user-strength distribution shows a scaling behavior with degree, $s(k) \sim a k^{b}$, as seen in Fig. 2(c), where the b values of the four networks are roughly the same. The nonlinear relationship between s and k implies that hub members tend to post messages considerably more frequently than other people. The average shortest-path degree function $l(k)$, which is defined by the average shortest path from vertices of degree k to other vertices in the “giant component”, obeys a logarithmic distribution, $l(k) \sim a \log(k) + b$, with $a < 0$, meaning that hub members are more likely to be acquainted with other people. The detailed parameters of the semantic network are listed in Table 4 along with those of the other three networks.
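Two of the quantities above, the strength $s_i$ and the average nearest-neighbor degree $k_{nn}(k)$, can be sketched as follows. The edge weights are invented for illustration, the function names are hypothetical, and the curve fitting that yields the a and b coefficients is deliberately left out.

```python
from collections import defaultdict


def strengths(weights):
    """weights maps an undirected edge (i, j) to w_ij; returns s_i per node."""
    s = defaultdict(float)
    for (i, j), w in weights.items():
        s[i] += w
        s[j] += w
    return dict(s)


def average_neighbor_degree_by_k(weights):
    """Return k_nn(k): mean neighbour degree, averaged over nodes of degree k."""
    nbrs = defaultdict(set)
    for i, j in weights:
        nbrs[i].add(j)
        nbrs[j].add(i)
    degree = {v: len(ns) for v, ns in nbrs.items()}

    per_k = defaultdict(list)
    for v, ns in nbrs.items():
        per_k[degree[v]].append(sum(degree[u] for u in ns) / degree[v])
    return {k: sum(vals) / len(vals) for k, vals in per_k.items()}


# Made-up weighted edges used only for illustration.
toy_weights = {
    ("a", "b"): 2.0,
    ("a", "c"): 1.0,
    ("a", "d"): 3.0,
    ("b", "c"): 4.0,
}
s = strengths(toy_weights)
knn = average_neighbor_degree_by_k(toy_weights)
```

Here the degree-1 node d is attached to the hub a, so $k_{nn}(1) = 3$, whereas the hub's own neighbors have an average degree of only 5/3.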
5. Community structure

As with other social networks, the semantic network is generally globally sparse yet locally dense. Its vertices form a group structure, where the vertices within a group have a higher edge density, and the vertices between groups have a lower edge density. This kind of structure is called a community, which is an important network property and can reveal many hidden features of a given network. Users belonging to the same community are likely to have properties in common. Monitoring the aggregate trends and opinions revealed by these communities provides valuable insight into a number of social applications, such as criminal investigation and rumor-spreading investigations. Hence, community identification is a fundamental step not only for discovering what causes entities to form but also for understanding the overall structural and functional properties of a large network. In this section, we first analyze the community structure of the semantic network, and then the community properties of the other three networks are compared; finally, we propose an evaluation criterion to quantify the veracity of our communities.

5.1. Community structure of the semantic network
We randomly selected 1500 pairs of nodes with “trust” values satisfying 0 < “trust” < 0.3 or 0.7 < “trust” < 1 from the dataset used to construct our semantic network. Figs. 4 and 5 show the primitive semantic network and its “giant component”, respectively. The blue nodes constitute backbone networks. We can easily discern two communities from Fig. 4: community A, in the central area, and community B, surrounding it. Because we discarded the edges with “trust” values between 0.5 − n and 0.5 + n, community A mainly includes edges with “trust” values greater than 0.5 + n, and community B includes those with values less than 0.5 − n. This is also consistent with the actual community structure, for Tianya.com is populated by two kinds of people: the so-called elites and the angry youths; the former pay more attention to their words on the forum, whereas the latter are often outspoken. Naturally, these two types of people prefer to post in their own circles, which is why we can find two communities by visual inspection.
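The construction rule recalled here (start every edge at 0.5, update it from the scored comments, then discard edges whose “trust” lies between 0.5 − n and 0.5 + n) can be sketched as below. This is a toy under stated assumptions rather than the authors' implementation: the three-entry lexicon reuses ASCII keywords and scores from Table 2, phrases are matched by plain substring search, comments without any lexicon phrase are skipped so that an edge with no scored comment keeps the initial 0.5, and the discard interval is treated as inclusive. All function names are hypothetical.

```python
from collections import defaultdict

NEUTRAL = 0.5  # initial "trust" of every edge


def comment_score(text, lexicon):
    """Average score of the lexicon phrases present in one comment, or None."""
    hits = [score for phrase, score in lexicon.items() if phrase in text]
    return sum(hits) / len(hits) if hits else None


def build_semantic_network(comments, lexicon, n):
    """comments: iterable of (user_i, user_j, text). Returns {edge: trust}."""
    scores = defaultdict(list)
    for i, j, text in comments:
        edge = tuple(sorted((i, j)))  # undirected edge
        score = comment_score(text, lexicon)
        scores[edge]  # step 1: make sure the edge exists
        if score is not None:
            scores[edge].append(score)  # step 2: collect comment scores

    trust = {
        edge: (sum(vals) / len(vals) if vals else NEUTRAL)
        for edge, vals in scores.items()
    }
    # Step 3: discard edges inside [0.5 - n, 0.5 + n] (inclusive by assumption).
    return {
        edge: t
        for edge, t in trust.items()
        if not (NEUTRAL - n <= t <= NEUTRAL + n)
    }


toy_lexicon = {"ding": 1.0, "nnd": 0.25, "sb": 0.1}
toy_comments = [
    ("u1", "u2", "ding ding"),
    ("u2", "u1", "nnd"),
    ("u1", "u3", "hello"),
    ("u3", "u4", "nnd sb"),
]
net_n0 = build_semantic_network(toy_comments, toy_lexicon, 0.0)
net_n02 = build_semantic_network(toy_comments, toy_lexicon, 0.2)
```

With n = 0 only the exactly neutral edge (u1, u3) disappears; raising n to 0.2 also removes (u1, u2), whose mixed comments average to 0.625, and leaves only the strongly opposing edge (u3, u4).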
Fig. 4. The semantic network with n = 0.3 (1932 nodes and 1500 edges).
Fig. 5. The “giant component” of the above semantic network (643 nodes and 650 edges).
Fig. 6. Nineteen communities in the “giant component” of the semantic network.
When we further extract the “giant component” from the semantic network, we find that the community structure becomes elaborate. Using the classical community-detection algorithm [29], we obtained 19 communities, as shown in Fig. 6; here, the modularity Q is 0.853, which is a very large figure. This means that community character is obvious in the “giant component”. It should be added that the above modularity Q is defined as $Q = \sum_i (e_{ii} - a_i^2) = \mathrm{Tr}\,E - \lVert E^2 \rVert$ [4], in which $E$ is an $n \times n$ symmetric matrix whose element $e_{ij}$ is the fraction of all edges in the network that link vertices in community i to vertices in community j, and $\lVert X \rVert$ denotes the sum of the elements of the matrix $X$. The trace of this matrix, $\mathrm{Tr}\,E = \sum_i e_{ii}$, is the fraction of edges in the network that connect vertices in the same community, while the row (or column) sums $a_i = \sum_j e_{ij}$ give the fraction of edges that connect to vertices in community i. If the network is such that the probability of having an edge between two sites is the same regardless of whether they belong to the same community, one would have $e_{ij} = a_i a_j$. The modularity measures the degree of correlation between the probability of having an edge joining two sites and the fact that the sites belong to the same community.

5.2. Variation with the threshold n

The community structure of the semantic network changes with its topology, and the topology of the semantic network changes with the threshold n. Therefore, we would expect that the community structure of the semantic network also changes with the threshold n. The implication of the threshold n has been discussed above; namely, a larger threshold yields a stronger semantic correlation. In this section, we discuss the relationship between the community structure and the threshold n in detail. We randomly chose 1500 pairs of interactive nodes from the dataset with “trust” values satisfying 0 < “trust” < 0.5 − n or 0.5 + n < “trust” < 1. The threshold n in the above condition was varied from 0 to 0.5 in intervals of 0.01. We then constructed the corresponding semantic networks and performed community detection on their “giant components”. As shown in Fig. 7, as the threshold n was increased, both the number of nodes and edges in the “giant component” declined, as did the number of communities, whereas the modularity Q grew with the threshold n. This means that a stronger semantic correlation leads to a more obvious community structure in the “giant component”.

Fig. 7. Variations with the threshold n.
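The effect of the threshold on the “giant component” can be illustrated with a short sweep. The sketch below is a toy under stated assumptions: the trust values are invented, the giant component is found with a plain breadth-first search, only its node count is tracked (community detection itself is not reproduced), and the function names are hypothetical.

```python
from collections import defaultdict, deque


def giant_component(edges):
    """Return the node set of the largest connected component."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for nxt in adj[node] - comp:
                comp.add(nxt)
                queue.append(nxt)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def sweep(trust, thresholds):
    """For each n, keep edges outside [0.5 - n, 0.5 + n] and size the giant component."""
    sizes = []
    for n in thresholds:
        kept = [e for e, t in trust.items() if t < 0.5 - n or t > 0.5 + n]
        sizes.append((n, len(giant_component(kept))))
    return sizes


# Made-up trust values used only for illustration.
toy_trust = {
    ("a", "b"): 0.9,
    ("b", "c"): 0.8,
    ("c", "d"): 0.65,
    ("e", "f"): 0.1,
}
sizes = sweep(toy_trust, [0.0, 0.2, 0.35])
```

On this toy input the giant component shrinks from four nodes to three and then to two as n grows, mirroring the decline in nodes and edges reported for Fig. 7.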
5.3. Modularity comparison

With the classical community-detection algorithm, we obtained the community structures of the three other networks. Figs. 8–10 show the results for the undirected dense network, the undirected sparse network and the interest network, respectively. A comparison of the modularities for the network divisions found by the classical methods for the four networks of varying sizes is listed in Table 5. The figure in row 1 shows that community structure in the semantic network was more significant than in the other three networks.
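For a given partition, the modularity $Q = \sum_i (e_{ii} - a_i^2)$ from Section 5.1 can be evaluated directly. The sketch below assumes the usual convention that each edge contributes half of its share to $e_{ij}$ and half to $e_{ji}$, so that $E$ is symmetric and its entries sum to 1; the function name and the two-triangle example are illustrative only, and no community-detection algorithm is implemented here.

```python
def modularity(edges, community):
    """edges: list of undirected (u, v) pairs; community: dict node -> label."""
    m = len(edges)
    labels = sorted(set(community.values()))
    index = {label: pos for pos, label in enumerate(labels)}

    # e[i][j]: fraction of edges linking community i to community j.
    e = [[0.0] * len(labels) for _ in labels]
    for u, v in edges:
        i, j = index[community[u]], index[community[v]]
        # Split each edge between e[i][j] and e[j][i] (assumed convention);
        # an intra-community edge therefore adds 1/m to e[i][i].
        e[i][j] += 0.5 / m
        e[j][i] += 0.5 / m

    a = [sum(row) for row in e]  # a_i: row sums of E
    return sum(e[i][i] - a[i] ** 2 for i in range(len(labels)))


# Two triangles joined by a single bridge edge (3, 4).
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
toy_partition = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
q = modularity(toy_edges, toy_partition)
```

For this graph each community holds 3/7 of the edges internally and half of all edge ends, giving Q = 2(3/7 − 1/4) = 5/14; placing every node in one community gives Q = 0.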
5.4. Time complexity

The method we proposed to detect underlying communities is based on the classical community-detection algorithms. The main influence on the time complexity comes from the network-scale parameters (e.g., the number of nodes N and the number of edges M). Thus, to compare the time complexity of different methods, we only need to consider their network sizes. The time complexity of three classical community-detection algorithms is listed in Table 6. The parameter d in GNM [4] is the depth of the dendrogram describing the community structure. The four networks are all hierarchical, as shown in Fig. 6, with $d \sim \log N$, in which case the GNM algorithm runs in essentially linear time, $O(M \log^2 N)$. The first two rows of Table 3 show that none of the four networks is sparse, so we compare them under the arbitrary-graph case. Although the time complexity is influenced by the network scale, a smaller network scale does not imply a better community character: the undirected sparse network includes 12047 nodes, only 3.72% of the total, which means that a majority of nodes are discarded according to the rule $w_{ij} = \min(n_{ij}, n_{ji})$ and massive relational information is also lost; a discussion of the time complexity based on it therefore seems to be meaningless. The semantic network has balanced N and M compared with the other two networks, which have either a larger N (e.g., Und dense) or a larger M (e.g., Interest). So the method based on the semantic network appears to work well in real-world situations, and is faster than previous methods.

Fig. 8. Twenty-nine communities in the “giant component” of the undirected dense network.
Fig. 9. Eighteen communities in the “giant component” of the undirected sparse network.
Fig. 10. Nine communities in the “giant component” of the interest network.

Table 5 Comparison of modularities for the network division found by classical methods for the four networks of varying sizes. GN, Girvan and Newman [28]; CNM, Newman et al. [4]; DA, Duch and Arenas [29].

Network       Size N   Edges M   Modularity Q
                                 GN       CNM      DA
Semantic      643      650       0.816    0.853    0.853
Und dense     935      1369      0.611    0.630    0.634
Und sparse    194      229       0.603    0.634    0.688
Interest      420      8989      0.325    0.326    0.326

Table 6 Comparison of the time complexity for the network division found by classical methods. GN, Girvan and Newman [28]; GNM, Newman et al. [4]; DA, Duch and Arenas [29].

                  GN           GNM              DA
Arbitrary graph   O(M^2 N)     O(M d log N)     O(N^2 log N)
Sparse graph      O(N^3)       O(N log^2 N)     O(N^2 log N)
Semantic          Fast         Fast             Fast
Und dense         Common       Common           Common
Und sparse        –            –                –
Interest          Slow         Slow             Fast with big k, slow with small k

5.5. Evaluation criterion

To evaluate the accuracy of our method, a senior sociology expert from Nanjing University who has studied social-networking sites for more than 20 years was invited to our laboratory. After spending a large amount of time performing extensive database searches and analyzing comments between 643 of the above users, this expert manually constructed a semantic network. Although this is very time-consuming and labor-intensive work, the network constructed with this method is of high research value. Based on the comment content and his research experience, our expert divided this network into 19 different communities. To measure the effectiveness of our method, we used a precision rate defined as follows:

$$\text{precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{\max_{1 \le j \le n} N^{CM}_{j}}{N^{C}_{i}},$$

where $N^{C}_{i}$ is the number of IDs in the ith community obtained by the expert, and $N^{CM}_{j}$ is the number of these IDs found in the jth community obtained by the classical community-detection algorithm. Thus, $\max_{1 \le j \le n} \{N^{CM}_{j}\} / N^{C}_{i}$ is the partial precision rate for the ith community. The overall precision rate of our modified greedy algorithm is the sum of the partial precision rates divided by n. Here, the precision rate of our algorithm was 80.14%, which is relatively high for finding potential organizations. The same procedure was applied to the undirected dense network, the undirected sparse network and the interest network, and the resulting precision rates were 53.1%, 65.1%, and 71.2%, respectively. This indicates that our method performs better than the former models.

6. Discussion and conclusions
Massive amounts of real data collected from online social networks are network-structured, and discovering the latent communities is a useful way to better understand the properties of a virtual social network. In this study, we constructed a semantic network from semantic information extracted from user-comment content. We considered the impact of the weight on every edge and focused on the “giant component” of the online social network to reduce the computational complexity. In experiments, we evaluated our method on real datasets and compared our approach with several previous methods based on comment interactions. The results show that our method is faster, more effective, and more robust. Furthermore, we proposed a criterion for evaluating the accuracy of our method; although applying the criterion is very time-consuming and labor-intensive, the resulting precision rate of 80.14% indicates that our method was effective at finding potential organizations. Several limitations of our model are discussed below.
(a) In our model, we searched for emotional words in every comment and took a rough average of these as the “trust” value of every edge. As some comments may contain sarcastic remarks, the “trust” value may not be accurately evaluated using the above method.
(b) Taking the rough average of the “trust” values on every edge may cause additional ambiguity. For example, consider the case where there is only one comment between user A and user B, and the “trust” value of this comment is relatively high; simultaneously, there are dozens of comments between user A and user C, among which both supportive and opposing words can be found. We cannot automatically conclude that the semantic relationship between A and B is stronger than that between A and C, because the degree of comment interaction is also very important in measuring the “trust” value of every edge. This will be addressed in our future work.
(c) In the process of statistical analysis, we found that many network features of the semantic network and the undirected dense network were surprisingly similar, whereas the community structures of the two networks were significantly different. This is a very interesting finding and should be studied further.
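The ambiguity described in limitation (b) can be made concrete with a toy calculation. The sketch below is illustrative only: the score scale of [-1, 1], the sample scores, the function names, and the shrinkage constant `k` are all hypothetical, and the interaction-aware weighting is one possible adjustment chosen for illustration, not the remedy the authors plan for future work.

```python
from statistics import mean
from typing import List


def mean_trust(scores: List[float]) -> float:
    """Rough average of per-comment trust scores on one edge."""
    return mean(scores)


def interaction_aware_trust(scores: List[float], k: float = 5.0) -> float:
    """HYPOTHETICAL adjustment: shrink the average toward 0 when an edge
    carries few comments, so that sparse evidence counts for less.
    The constant k is an arbitrary illustrative choice.
    """
    n = len(scores)
    return mean(scores) * n / (n + k)


# Hypothetical scores in [-1, 1]: one very positive comment between A and B,
# and a dozen mixed (supportive and opposing) comments between A and C.
a_b = [0.9]
a_c = [0.8, -0.2, 0.6, 0.7, -0.1, 0.9, 0.5, 0.8, 0.6, 0.7, 0.4, 0.9]

plain = {"A-B": mean_trust(a_b), "A-C": mean_trust(a_c)}
adjusted = {"A-B": interaction_aware_trust(a_b),
            "A-C": interaction_aware_trust(a_c)}
```

With the plain average, the single A-B comment outranks the much richer A-C exchange; once the number of comments is taken into account, the ordering reverses. This is the reason the text gives for treating the degree of comment interaction as part of the edge weight.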
Acknowledgment
This work was supported by the Ministry of Science and Technology of the People’s Republic of China, No. 2009GJE00035.
References
[1] Jennifer J. Xu, Hsinchun Chen, Fighting organized crimes: using shortest-path algorithms to identify associations in criminal networks, Decis. Support Syst. 38 (2004) 473–487.
[2] ZhengYou Xia, Fighting criminals: adaptive inferring and choosing the next investigative objects in the criminal network, Knowl. Based Syst. 21 (5) (2008) 434–442.
[3] J. Reichardt, S. Bornholdt, Statistical mechanics of community detection, Phys. Rev. E 74 (2006) 016110.
[4] M. Newman, M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69 (2004) 066111.
[5] M. Newman, Detecting community structure in networks, Eur. Phys. J. B 38 (2004) 321–330.
[6] L. Danon, J. Duch, A. Diaz-Guilera, A. Arenas, Comparing community structure identification, J. Stat. Mech. Theor. Exp. 29 (09) (2005).
[7] Kou Zhongbao, Zhang Changshui, Reply networks on a bulletin board system, Phys. Rev. E 67 (2003) 036117.
[8] Vicenc Gomez, Andreas Kaltenbrunner, Vicente Lopez, Statistical analysis of the social network and discussion threads in Slashdot, in: WWW 2008, April 21–25, 2008, Beijing, China.
[9] N. Matsumura, D. Goldberg, X. Llora, Mining directed social network from message board, in: Proceedings of 14th WWW’05, ACM Press, New York, USA, 2005, pp. 1092–1093.
[10] Gilad Mishne, Natalie Glance, Leave a reply: an analysis of weblog comments, in: WWW2006, May 22–26, 2006.
[11] Chuntao Jiang, Frans Coenen, Robert Sanderson, Michele Zito, Text classification using graph mining-based feature extraction, Knowl. Based Syst. 23 (4) (2010) 302–308.
[12] E.M. Trevino, Blogger motivations: power, pull, and positive feedback, in: Internet Research 6.0, 2005.
[13] Yu Ichifuji, Susumu Konno, Hideaki Sone, An advisory method for BBS users and evaluation of BBS comments, Procedia Social and Behavioral Sciences 2 (2010) 218–224.
[14] K. Nigam, M. Hurst, Towards a robust metric of opinion, in: The AAAI Symposium on Exploring Attitude and Affect in Text (AAAI–EAAT), 2004.
[15] Lu Zhen, Zuhua Jiang, Hy-SN: hyper-graph based semantic network, Knowl. Based Syst. 23 (8) (2010) 809–816.
[16] E. Airoldi, D. Blei, S. Fienberg, E. Xing, Mixed membership stochastic blockmodels, J. Mach. Learn. Res. 9 (2008) 1981–2014.
[17] H. Shan, A. Banerjee, Bayesian co-clustering, Technical Report, 2008.
[18] H. Zhang, B. Qiu, C. Lee Giles, H.C. Foley, J. Yen, An LDA-based community structure discovery approach for large-scale social networks, in: IEEE International Conference on Intelligence and Security Informatics, 2007.
[19] A. Barrat, M. Weigt, On the properties of small-world network models, Eur. Phys. J. B 13 (2000) 547.
[20] B. Bollobas, Random Graphs, Academic Press, London, 1985.
[21] D.J. Watts, Small Worlds, Princeton University Press, Princeton, NJ, 1999.
[22] Duncan J. Watts, Steven H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature 393 (1998) 440–442.
[23] M.E.J. Newman, Assortative mixing in networks, Phys. Rev. Lett. 89 (2002) 208701.
[24] M.E.J. Newman, Mixing patterns in networks, Phys. Rev. E 67 (2003) 026126.
[25] Sara Nadiv Soffer, Alexei Vazquez, Network clustering coefficient without degree-correlation biases, Phys. Rev. E 71 (2005) 051701.
[26] M. Boguna, R. Pastor-Satorras, Epidemic spreading in correlated complex networks, Phys. Rev. E 66 (2002) 047104.
[27] M. Boguna, R. Pastor-Satorras, A. Vespignani, Absence of epidemic threshold in scale-free networks with degree correlations, Physics 625 (2003) 127.
[28] M. Girvan, M.E.J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99 (2002) 7821–7826.
[29] J. Duch, A. Arenas, Community detection in complex networks using extremal optimization, Phys. Rev. E 72 (2005) 027104.