The Structure of Comment Networks

Vahed Qazvinian
Department of EECS, University of Michigan, Ann Arbor, MI
[email protected]

Jafar Adibi
Center for Advanced Research, PricewaterhouseCoopers LLP, San Jose, CA
[email protected]

Abtin Rasoulian
Department of Informatics, Technischen Universität München, Munich, Germany
[email protected]

Abstract

Blogs form an important on-line social network, as they are maintained periodically and are easily accessible. An important aspect of any blogspace is the relationship of bloggers based on comments. This aspect is, however, largely ignored in previous works. In this work, we study the evolution of this specific type of social network, the blogosphere comment graph. We observe how the comment network evolves as different bloggers place comments on each other, and study its local patterns. We investigate the densification of this network and observe a high correlation between the number of comments placed and received. Finally, we propose a growth model that best describes the behavior of users who place comments.

1 Introduction

Social networks are not just about the crowds. Adar et al. [7] indicate that a typical science journal might have 500 readers a day, a typical science blog has the same number of readers, and a popular one has as many as 6,000 daily viewers. In contrast, Serendip (which is a "kind of group blog") gets 15,000 visitors a day. Blogs connect people and groups to each other through links and comments. Therefore, analyzing blogging behavior is crucial to understanding this social interaction. Several researchers have discussed different aspects of the blogspace in the past few years. The formation of large social networks in blog communities, the interaction among local clusters in the blogosphere, the ranking mechanism for social media [4], information propagation, information epidemics [1], and political blogspace dynamics [2] have been studied in detail. We review some of these works in Section 2.

The blogspace provides readers with the opportunity to express their opinions as comments. Comments are important in blog analysis and play an essential role in the blogosphere, as shown in [27, 10]. In this work, we look at the evolution of comment networks, in which nodes represent bloggers and links represent comments. The structure of various networks has been previously studied [5, 8, 9, 23, 13] and modelled [18, 12, 15, 16, 17, 19]. However, no existing model captures the growth of comment networks. Here, we study a large comment network and try to explain its growth behavior.

A comment network differs from the web graph, paper citation networks, and even other types of blog networks for two main reasons. First, a higher frequency of comments between two bloggers might indicate a stronger tie between them. Unweighted networks such as the traditional web, paper citation networks, and blogrolling links lack this important feature. Second, a comment is usually followed by an identity and a hyperlink to the person who leaves it. This lets a blogger place links to her own blog on other pages, which can help attract a larger audience. These two features imply that comments are more than mere links, which makes it crucial to take a deeper look at the comment network.

The rest of the paper is organized as follows. We review some of the previous works on the blogspace in Section 2. Section 3 describes our dataset and how we turn the comment data into a time-evolving network. We make a number of observations on this network, analyze it, and list our findings in Section 4. Finally, in Section 5 we propose a model of growth that best describes the comment network.
2 Related Work

A number of studies have focused on the blogspace. To name a few: Adamic et al. [2] studied the network structure of political blogs, and Thelwall [26] gave a descriptive analysis of blog postings around the London attacks. The dynamics of the blogosphere are studied in [4, 14]. Adar et al. [4] described information epidemics in blogs and introduced a new ranking algorithm for blog pages. Lin et al. [20] defined a mutual awareness relationship to discover communities in blog social networks based on two factors. First, communities are formed according to the actions of individuals in the network. Second, the semantics of the hyperlink structure in blogs differ from those of the traditional web.

While many previous works on weblogs have focused on post data, few researchers have studied the network of comments. Trevino et al. [27] and Gumbrecht et al. [10] showed the importance of comments in blogosphere analysis, and concluded that comments play an essential role in the interactive nature of blogs. Herrig et al. [11] studied a small comment dataset of 203 weblogs. A larger-scale study on comments investigates the relations of comments and posts, and extracts commenting patterns based on blog popularity [22]. Mishne and Glance [22] observed that 28% of the nearly 36,000 blogs in their corpus contained comments. They also showed that utilizing blog comments helps improve search recall. However, the corpus used in [22] does not cover a long period of time and is thus not useful in a study of comments over time. Despite the wide range of previous works on blogs, there is no significant work, to the knowledge of the authors, that models the growth of the comment network.

3 Preliminaries

3.1 Data

For this study we use the Persian Blog dataset introduced by Qazvinian et al. [25]. The authors crawled a Persian blog host¹ and preprocessed the collected pages into XML files. This dataset contains monthly archives of more than 22,000 weblogs over a 15-month period. The number of posts exceeds 347,800, and these posts contain about 1,258,000 comments, an average of 3.6 comments per post. This average is much higher than that of the blog corpora in [11, 22], for which the average number of comments per post is 0.3 and 0.9, respectively. Such a high ratio makes this dataset a better candidate for the study of comment networks. Table 1 shows the basic statistics of this dataset.

    Weblogs              22,306
    Posts                348,700
    Comments             1,257,561
    Commented Posts      339,884 (97.5%)
    Uncommented Posts    8,816 (2.5%)

Table 1. Basic analysis on corpus size and accuracy

Similar to other blog datasets, links in this collection are categorized into four different classes:
• Blog Roll Link. Blogrolls are hyperlinks placed in the sidebar of a blog page; they usually point to blogs or pages read regularly by the blog maintainer.
• Post Link. Post links are hyperlinks placed in the content (body) of an entry and pointing to another page.
• Comment outlink. Comment outlinks are hyperlinks placed in the content (body) of a comment.
• Comment inlink. A comment inlink is a hyperlink left in the footer of a comment that points to the blog, homepage, or email address of the person who leaves it.

This dataset has two main properties that distinguish it from other corpora and make it appropriate for this work. First, it includes the comments posted by readers for each entry. Second, it covers 15 months of blog archives. This long period enables us to look at the evolution of the comment network over time.

    Nodes                                     21,124
    Edges                                     69,583
    Number of Comments                        226,133
    # single nodes                            11,679
    # Strongly Connected Components           4,342
    # Weakly Connected Components             201
    Strongly LCC size                         5,080
    Weakly LCC size                           9,169
    W/S Clustering Coefficient                0.056
    Undirected W/S Clustering Coefficient     0.079
    Diameter                                  11
    Undirected Diameter                       10
    Average Distance                          4.066
    Undirected Average Distance               3.734

Table 2. Basic statistics of the comment network G

3.2 Comment Network
We study the structure of the evolving comment network by observing its properties at equally spaced points in time. To this end, we introduce a definition of a timed graph, similar to the definition of this concept in [15]. We define a timed graph, G, to be an ensemble of weighted graph snapshots taken at different time slots. We treat these snapshots as an ordered set,

    $$ G \equiv \{\, G(V, E, t, w) \,;\; t = 1, 2, \cdots, n \,\} $$

where V is the set of nodes, E ⊆ V × V is the set of edges, and w : E → R+ is a weight function that assigns to every edge e(i, j) ∈ E a positive weight at time t. For simplicity, we denote the graph observed at time t, G(V, E, t, w), by G_t and its corresponding weight function by w_t. In other words, w_t(e(i, j)) is the weight of the directed edge from i to j in G_t, and w_t(e(i, j)) = 0 indicates that there are no comments on this edge up to time t.

Under this framework, we can build a network of comments, in which nodes are bloggers and directed edges represent comments. To build this network, we use the "comment inlink" data. A comment inlink, as mentioned before, is the hyperlink pointing to the page of the person who leaves the comment. This is important since a blogger can deliberately put a link to her own blog by leaving a comment on another's post. A directed edge from i to j represents the comments left for i by j. In this setting, w_t(e(i, j)) is the number of all comments that j has left for i up to and including time t.

We extracted all comments that were left by bloggers within the network. To build the timed graph, we extracted all comments left in a period of 52 weeks, which is one year's worth of blogging out of the 15-month period covered by the corpus. To ensure the high quality of the comment set, we ignored the first and the last 45 days of the 15-month corpus, since it takes a while until a post receives all of its comments. For an even simpler representation, we denote the final comment network, G_52, by G. Table 2 summarizes the basic statistics of G. The number of nodes in the comment network is smaller than the number of blogs in the corpus. This indicates that some bloggers have neither left nor received any comments, so we ignore such blogs in this study. By construction, the number of comments in the network is equal to the sum of the weights of all edges.

¹ www.persianblog.com
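As a minimal sketch of the timed graph defined above, the following Python snippet accumulates comment records into cumulative weekly weight functions w_t. The record layout (week, receiver, commenter) and the function name `build_timed_graph` are assumptions made for illustration; they do not describe the actual format of the dataset.

```python
from collections import defaultdict


def build_timed_graph(comments, n_weeks):
    """Return a list of snapshots; snapshots[t - 1] maps (i, j) -> w_t(e(i, j)).

    `comments` is an iterable of (week, receiver, commenter) tuples with
    1 <= week <= n_weeks (hypothetical layout). Following the convention in
    the text, the edge (i, j) counts the comments left for i by j.
    """
    per_week = defaultdict(list)
    for week, receiver, commenter in comments:
        per_week[week].append((receiver, commenter))

    snapshots = []
    weights = defaultdict(int)           # cumulative weights up to week t
    for t in range(1, n_weeks + 1):
        for receiver, commenter in per_week.get(t, []):
            weights[(receiver, commenter)] += 1
        snapshots.append(dict(weights))  # freeze a copy as G_t
    return snapshots


toy = [(1, "a", "b"), (1, "a", "b"), (2, "b", "a"), (3, "a", "c")]
G = build_timed_graph(toy, n_weeks=3)
total_comments = sum(G[-1].values())     # sum of edge weights in the last snapshot
```

Each snapshot stores cumulative counts, so the last element plays the role of the final network, and the sum of its weights equals the number of comment records.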
4 Observations

We look at the comment network through different lenses and analyze its structure from different perspectives. In this section we describe the structure of this network and its evolution.

4.1 Growth

For the comment graph, G, we study the number of nodes n(t), the number of edges e(t), and the number of comments c(t), at each point in time t. Leskovec et al. [18] observed densification power-laws in citation networks. In a growing network with an underlying densification power-law, the number of edges is proportional to a power of the number of nodes. We observe the densification power-law in the comment network with the following properties:

    $$ n(t) \propto t \qquad (1) $$
    $$ e(t) \propto n(t)^{a_1} \qquad (2) $$
    $$ c(t) \propto e(t)^{a_2} \qquad (3) $$
    $$ c(t) \propto n(t)^{a_3}, \quad a_3 = a_1 a_2 \qquad (4) $$

Equation 4 follows from substituting (2) into (3). The fitted slopes agree with it: 1.65 × 1.22 ≈ 2.01, against a measured slope of 1.97.

Figure 1 shows the number of nodes per week, the number of edges versus nodes, the number of comments versus nodes, and the number of comments versus edges. The last three are plotted on a log-log scale and have slopes greater than 1, which confirms a non-linear growth in the number of edges and comments. Figure 1 (a) indicates that nodes are added to the network at a constant rate of approximately 94 nodes a week (the fitted slope is 94.01). Because the data has a missing past, that is, we do not have the network data all the way back to its birth, the number of nodes increases suddenly in the first few weeks. Therefore, we fit the line starting from the 10th week, assuming that the pre-existing nodes have had at least one activity during the first ten weeks of the observations.

Figure 1. Comment network growth. (a) Number of nodes per week (fit: y = 94.01 x + 4843.84). (b) Number of edges vs. number of nodes (fit: y = 0.021 x^1.65). (c) Number of comments vs. number of nodes (fit: y = 0.003 x^1.97). (d) Number of comments vs. number of edges (fit: y = 0.294 x^1.22). Panels (b), (c), and (d) are plotted on a log-log scale. Slopes are 94.01, 1.65, 1.97, and 1.22, respectively; each fit reports R = 1.00.

Such a densification power-law in comments and edges should result in the emergence of shrinking diameters. Shrinking diameters have been observed before in citation networks [18] and in the Yahoo! 360 and Flickr [15] social networks. Here, we look at the average directed distance in the strongly connected component of the comment network. Figure 2 (a) shows this value as the graph grows over time. The observed trend might suggest that senior bloggers (i.e., bloggers who joined earlier) place comments on other bloggers, which decreases the average distance to their own blogs. The network diameter (i.e., the longest shortest path), as shown in Figure 2 (b), is not decreasing in the strongly connected component, for the following reason: while densification tends to shrink the diameter, each new node can increase the diameter by at most 1.

Figure 2. (a) Average directed distance and (b) diameter in the strongly connected component of the comment graph vs. the number of nodes

4.2 Distributions

We look at four different distributions in the comment network: indegree, outdegree, and two weight distributions. Each of these features has a different interpretation. The indegree of a node v_i in G_t is the number of people for whom the blogger v_i has left comments at time ≤ t. Similarly, the outdegree of v_i in G_t is the number of people from whom the blogger v_i has received comments at time ≤ t. We also look at two other features of a node v_i: the sum of the weights on the incoming edges, which is the number of comments left by v_i, and the sum of the weights on the outgoing edges, which is the number of comments received by v_i. We define the following four random variables at time t: the number of people who place comments on v_i (OutD_t), the number of people who receive comments from v_i (InD_t), the number of comments received by v_i (OutW_t), and the number of comments placed by v_i (InW_t). In this network the degree of a node is equal to OutD_t + InD_t, and the sum of the weights is the number of comments, which is equal to OutW_t + InW_t. Table 3 summarizes these random variables.

    R.V.      Description
    InD_t     # people who receive comments from a node
    OutD_t    # people who place comments for a node
    InW_t     # comments a node places
    OutW_t    # comments a node receives

Table 3. Description of four basic random variables

The power-law degree distribution in different networks has been studied before [24]. Our network, however, involves four different distributions corresponding to the four random variables introduced above. Since placing comments is time consuming, we believe bloggers usually have a limited number of neighbors. Based on a similar argument, Jin et al. [12] indicate that a growing social network does not exhibit a power-law degree distribution. Instead, it is strongly peaked around a certain mean degree and is not remarkably right-skewed. Even though the comment network might be somewhat similar to the friendship network in [12], our observations show that the distributions of the four random variables are consistent with a model in which x is drawn from a power-law of the form p(x) ∝ x^(−α). Table 4 shows the exponent (α) and the correlation coefficient (R) of the best-fit power-laws in G (i.e., at t = 52). Figure 3 illustrates these four distributions. The power-law exponent in these four distributions is just slightly above 1. This small exponent might suggest that although the growth of the comment network has some flavor of preferential attachment, it is not an immediate result of that model.

Figure 3. Distribution of (a) OutD_t, (b) OutW_t, (c) InD_t, (d) InW_t
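The exponents reported in this section (the densification slopes of Figure 1 and the distribution exponents α) are slopes of straight-line fits in log-log space. As a minimal sketch of such a fit, assuming ordinary least squares on log-transformed values, the snippet below recovers the exponent and prefactor of y = k x^a from synthetic points. The function name `fit_power_law` and the toy data are illustrative assumptions; the paper does not state which fitting procedure was used.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(k) + a * log(x); returns (k, a, r)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    syy = sum((v - my) ** 2 for v in ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    a = sxy / sxx                     # slope in log-log space = exponent
    k = math.exp(my - a * mx)         # intercept gives the prefactor
    r = sxy / math.sqrt(sxx * syy)    # correlation coefficient of the fit
    return k, a, r


# Synthetic edge counts following e = 0.021 * n^1.65, the form of Figure 1 (b).
nodes = [4000, 5000, 6000, 7000, 8000, 9000, 10000]
edges = [0.021 * n ** 1.65 for n in nodes]
k, a, r = fit_power_law(nodes, edges)
```

For a distribution of the form p(x) ∝ x^(−α), the same fit applied to (x, p(x)) pairs returns a slope of −α, so the exponent is the negated slope.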
    R.V.      α        R
    OutD_t    1.133    0.996
    OutW_t    1.102    0.998
    InD_t     1.096    0.995
    InW_t     1.034    0.997

Table 4. Best power-law fit for the four basic random variables, at t = 52

4.2.1 Correlation

The correlation coefficient (ρ) of two random variables measures the strength and direction of a linear relationship between them. Table 5 shows the pairwise correlation coefficients of the four random variables. Because this measure is symmetric, only the upper triangle of the matrix is shown. Clearly, there is a high correlation between the number of people one interacts with and the number of comments she receives or leaves. This conclusion is based on two high correlations, ρ(OutD_t, OutW_t) = 0.84 and ρ(InD_t, InW_t) = 0.80. Furthermore, the number of comments one leaves exhibits a high correlation with the number of comments she receives, ρ(InW_t, OutW_t) = 0.79.

    R.V.      OutD_t    OutW_t    InD_t    InW_t
    OutD_t    1.00      0.84      0.74     0.72
    OutW_t              1.00      0.59     0.79
    InD_t                         1.00     0.80
    InW_t                                  1.00

Table 5. Correlation coefficient (ρ) of the four basic random variables, at t = 52

4.3 Local Patterns

Local patterns (e.g., triangles) in the comment network are good indicators of bloggers' interactions. In particular, looking at motifs of size 2 and 3 is quite useful for describing the structure of this network. Some previous works also used local patterns and motifs to describe the structure of expertise social networks [30] and knowledge sharing networks [3]. We compare the occurrence frequencies of subgraphs of size 2 and 3 in the comment network with those in randomized networks [21]. We use FANMOD [29] to extract the motifs from our comment network. Table 6 shows the frequency of each motif structure in the comment network, and its expected frequency in a sample of 1000 randomized versions. We used two randomization processes. The first one, R1, switches the edges between nodes while keeping the number of bidirectional edges globally constant. The second one, R2, randomizes the graph regardless of the number of bidirectional edges.

Previous works have observed the existence of reciprocal relationships in a variety of social networks. Kumar et al. [15] observe that a large number of edges are bidirectional in the Yahoo! 360 and Flickr friendship networks. In our work, we do not look at the friendship relationship (expressed by blogroll links). Rather, we look at the patterns of comments. Table 6 shows that the share of mutual (bidirectional) comment relationships between bloggers is 43.43%. This number is far lower in a randomized network with the same sparsity (R2). In other words, if John places a comment on Mary's post, he is quite likely to receive a comment from Mary in return. This probability could be as high as 0.43 in the comment network, while it is negligible in a randomized one.

We call the directed edge from i to j (i.e., a comment from j to i) created at time t a loop-closing edge (comment) if it satisfies the following three conditions:

1. w_t(e(i, j)) = 1
2. w_t'(e(i, j)) = 0, ∀ t' < t
3. w_t−(e(j, i)) > 0, where t− denotes the time just before t

The first two conditions ensure that an edge exists from i to j at time t but not before. The third condition indicates that at least one edge from j to i existed before the edge from i to j was created. In other words, a loop-closing edge from i to j is the first comment from j on another blog, i, whose owner has commented on j before. We would like to find the fraction of edges that are loop-closing. To this end, we plot the number of such edges versus the total number of edges as the graph grows. Figure 4 shows this plot, both on a linear scale and on a log-log scale. The regression line on the log-log scale has a slope of a = 0.96. This value, which is very close to 1, suggests a linear relation between the two quantities. Therefore, the total number of edges in the network can be written as

    $$ e(t) = e^{(1)}(t) + e^{(2)}(t) \qquad (5) $$

where the component e^(1)(t) = α e(t) is the number of loop-closing edges, and the component e^(2)(t) = (1 − α) e(t) represents the bloggers' attempts to initiate new relationships. In our data, according to the regression line in Figure 4 (a), we have α ≈ 0.44.

Figure 4. The number of loop-closing edges versus the number of edges on (a) a linear scale (fit: y = 0.44 x, R = 1.00) and (b) a log-log scale (fit: y = 0.68 x^0.96, R = 1.00)

Figure 5. Watts-Strogatz clustering coefficient (W/S CC) in the undirected G (a) over time (weeks) and (b) versus the number of nodes

Subgraph structures of size 3 are also illustrated in Table 6. The most frequent subgraphs are of type 36, in which a blogger, b_i, has placed comments on two other bloggers who have never placed comments on an entry of b_i. An interesting pattern is motif 46, where b_i places comments on two other bloggers who comment on each other, but not on b_i. The frequency of this motif is significantly higher in the comment network than in both randomized versions. It is also interesting to see that a one-directional loop, as in motif 140, rarely occurs in a comment network. This nonreciprocal transitive relationship accounts for only 0.01 percent of the triads in this network.
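The three loop-closing conditions of this section can be checked directly on a time-ordered list of comments. The sketch below is one possible reading of the definition: an edge e(i, j) is created by the first comment that j leaves for i, and it is loop-closing if e(j, i) already exists at that moment. The event layout (receiver, commenter) and the function name `count_loop_closing` are assumptions for illustration.

```python
def count_loop_closing(events):
    """Count new edges and loop-closing edges in a time-ordered comment stream.

    `events` is a sequence of (receiver, commenter) pairs (hypothetical
    layout). A comment left for i by j adds weight to the edge e(i, j).
    """
    weight = {}                  # edge (i, j) -> number of comments so far
    new_edges = 0
    loop_closing = 0
    for i, j in events:
        if (i, j) not in weight:             # conditions 1 and 2: first comment on e(i, j)
            new_edges += 1
            if weight.get((j, i), 0) > 0:    # condition 3: e(j, i) existed before
                loop_closing += 1
        weight[(i, j)] = weight.get((i, j), 0) + 1
    return new_edges, loop_closing


# b comments on a, a replies (loop-closing), c comments on a, b comments on a again.
stream = [("a", "b"), ("b", "a"), ("a", "c"), ("a", "b")]
new_edges, loop_closing = count_loop_closing(stream)
alpha_estimate = loop_closing / new_edges
```

Tracking the two counters as the stream is consumed gives the pairs of values that a plot such as Figure 4 compares; the ratio of the two plays the role of α in equation 5.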
In addition to the above observations, we can see in Table 6 that triangles contribute a significantly smaller portion of the triads in the comment network, which causes a small clustering coefficient. Table 2 shows that the clustering coefficient of the undirected G is 0.079. Figure 5 illustrates the Watts-Strogatz clustering coefficient [28] of the same network versus weeks and versus the number of nodes. Although the comment network follows a densification power-law [18], in that the number of distinct edges grows as a power of the number of nodes, the clustering coefficient decreases over time and takes small values. This shows that bloggers do not necessarily place comments on neighbors of their neighbors. This is in contrast with several other well-known social networks, as shown in [17], in which 30%-60% of edges close triangles (i.e., the tail is only two hops from the head).

    Motif                      G         R1        R2        S
    Size 2, one-directional    56.57%    56.57%    0.99%     56.36%
    Size 2, bidirectional      43.43%    43.43%    0.01%     43.64%
    ID 36                      28.07%    24.70%    35.43%    6.54%
    ID 164                     22.44%    27.57%    22.44%    15.02%
    ID 12                      14.82%    10.20%    14.82%    10.30%
    ID 14                      13.09%    12.38%    13.09%    3.08%
    ID 6                       11.51%    12.38%    20.12%    4.84%
    ID 78                      8.52%     12.15%    0.03%     55.56%
    ID 38                      0.40%     0.13%     0.71%     0.31%
    ID 174                     0.37%     0.17%     0.01%     2.67%
    ID 166                     0.26%     0.14%     0.03%     0.17%
    ID 46                      0.24%     0.06%     0.02%     0.05%
    ID 238                     0.16%     0.07%     0.01%     0.95%
    ID 102                     0.11%     0.04%     0.04%     0.18%
    ID 140                     0.01%     0.01%     0.14%     0.09%

Table 6. Distribution of motifs in the comment network (G), in randomized networks that maintain the number of bidirectional edges (R1) or do not maintain it (R2), and in the synthetic network S

5 Model

Based on the above discussions and observations, we propose a growth model for the comment network. One of the major differences between the comment network and previously modelled networks is its dynamic nature. The set of interactions of a blogger is not limited to the arrival time. Rather, we look at a set of comments over time. Previous work has not captured this behavior. In the following, we briefly describe the shortcomings of some of these models.

5.1 Existing Models

In the past decade, several evolution models have been proposed to explain the degree distribution, average shortest path, clustering coefficient, and other properties of online social networks. Classical models such as preferential attachment [6] and the copying model [16] are not comprehensive enough to capture all properties of complex networks. More advanced models [12, 18, 15] have recently been proposed as a result.

Community guided attachment and the forest fire model [18] require nodes to make links only upon arrival. These models describe citation networks, in which an article cites previously published articles when it is published. Unlike comment networks, citation networks lack bidirectional edges. Moreover, in the forest fire model, nodes perform breadth-first citations with probabilities that decrease with the breadth level. This naturally forms triangles in citation networks, which might cause a high clustering coefficient.

The social network growth model proposed by Jin et al. [12] does not exhibit a power-law degree distribution. This model assumes a constant number of nodes in the network with a certain mean degree, which means a person can maintain only a certain number of friends. In this model, nodes connect to each other with a probability proportional to the number of mutual neighbors or friends. This implies a significantly higher clustering coefficient if the friendship decay factor is small, in contrast with the small clustering coefficient observed in the comment network.

The model described by Kumar et al. [15] captures the behavior of the growing friendship networks of Yahoo! 360 and Flickr, where friendship is determined by the appearance of one node in the other's contact list. This model can successfully describe the growth of a friendship network based on blogrolling links, but it does not have an underlying dynamic structure, which is the main characteristic of comment networks. The frequency of interactions between two nodes in a friendship network or a blogroll network is limited to one. However, a blogger can place or receive comments several times in the comment network.
5.2 Proposed Model

A model that successfully captures the growth of the comment network should have the following properties.

• Nodes are added to the network at a constant rate over time.
• The number of distinct edges should grow non-linearly with the number of nodes. Moreover, the number of comments should also grow non-linearly with the number of links, and therefore with the number of nodes.
• The model should exhibit a power-law distribution in indegree, outdegree, the number of comments received, and the number of comments placed, as well as a high correlation between these four basic random variables, InD_t, OutD_t, InW_t, and OutW_t. Moreover, it should produce a small slope in the power-law distribution, which indicates that the model is not a direct result of preferential attachment behavior.
• The growth model should capture the high fraction of loop-closing edges and the low, non-increasing clustering coefficient.

We propose a simple model for the evolution of the comment network. In our model, a new node joins the network at each time t. This addition does not necessarily imply the creation of a new blog: a blog may exist long before it receives or leaves its first comment. By the addition of a blog to the comment network, we mean that it takes part in its first comment. At each time step t, a new blog places one comment on an existing blog chosen uniformly at random. This will cause others to learn about the existence of her blog. During the inter-arrival time of two new nodes, t − 1 and t, a number c_t of comments are added to the network,

    $$ c_t = c(t) - c(t-1) \propto n(t)^{a_3} - n(t-1)^{a_3} \qquad (6) $$

where a_3 is the exponent in equation 4. Our model aims to describe how these comments are distributed in the network. We assume that newly added comments are made up of two components,

    $$ c_t = c_t^{(1)} + c_t^{(2)} \qquad (7) $$

where c_t^(1) accounts for all comments that are left to build new relationships (i.e., non loop-closing new links), and c_t^(2) accounts for all other comments, including loop-closing links. In fact, c_t^(1) is equal to e_t^(2) = e^(2)(t) − e^(2)(t − 1) from equation 5. Thus equation 7 can be written as

    $$ c_t = e_t^{(2)} + c_t^{(2)} \qquad (8) $$

In our model, at each time step t, c_t^(2) = c_t − e_t^(2) comments are left by bloggers for those from whom they have received comments. The probability that j leaves a comment for i at t is proportional to the number of comments j has received from i before t. More formally, the probability that an edge appears from i to j, or that its weight is increased (in case it already exists), at time t is proportional to w_t(e(j, i)). This component ensures that the probability of receiving a comment increases with the number of comments left.

At the same time, e_t^(2) new edges are distributed in the network by the following procedure. These edges are comments placed to attract more readers. The probability that j is the tail of the edge (the comment placer) decreases linearly with j's outdegree. Thus, those who have received fewer comments by time t are more likely to leave comments at t, contributing to the e_t^(2) component. We assume the head of the edge (the blog that receives the comment) is chosen uniformly at random, as is also the case for node arrivals.
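To make equations 6 to 8 concrete, the following sketch computes the per-step comment budget and its split for a few time steps. It assumes proportionality constants of 1, so that c(t) = n(t)^a3 and e(t) = n(t)^a2 (the forms used in the simulation of Section 5.3), one node per step so that n(t) = t, and the parameter values a2 = 1.4, a3 = 1.7, and α = 0.5 reported there. The function name `step_budget` is hypothetical.

```python
def step_budget(t, a2=1.4, a3=1.7, alpha=0.5):
    """Split the comments added between node arrivals t - 1 and t.

    Assumes n(t) = t, e(t) = n(t)**a2 and c(t) = n(t)**a3 with unit constants.
    Returns (c_t, new_link_comments, other_comments).
    """
    c_t = t ** a3 - (t - 1) ** a3      # equation 6
    e_t = t ** a2 - (t - 1) ** a2      # distinct edges added in this step
    new_link = (1 - alpha) * e_t       # e_t^(2): non loop-closing new links
    other = c_t - new_link             # c_t^(2) = c_t - e_t^(2), equation 8
    return c_t, new_link, other


for t in (10, 100, 1000):
    c_t, new_link, other = step_budget(t)
    print(f"t={t}: c_t={c_t:.2f}, new links={new_link:.2f}, other={other:.2f}")
```

Because a3 is larger than a2, the share of comments that go to existing relationships grows with t under these assumptions.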
The intuition behind this model is clear. We assume that the time and cost available for placing comments in the blogosphere are limited. Therefore, bloggers who receive a lot of comments spend most of their time replying, while those who receive fewer comments or have a smaller set of friends try to place comments on strangers.

5.3 Simulation Results

In this section we describe the simulation of our model. The procedure to generate the synthetic network is as follows. In each iteration t,

1. Node t joins the network by adding an edge e(j, t) with weight 1, where j < t is chosen uniformly at random.
2. (1 − α) e_t new links are added to the network, where e_t = e(t) − e(t − 1) and e(t) = n(t)^a2. Each of these edges is from i to j, where j is selected with a probability decreasing with j's outdegree, and i is chosen uniformly at random.
3. c_t − (1 − α) e_t comments are distributed in the network. The probability of forming the edge e(i, j) (or of incrementing its weight, if the edge already exists) is proportional to the weight of e(j, i). Here c_t = c(t) − c(t − 1) and c(t) = n(t)^a3.

In our simulation, we set α = 1/2, a2 = 1.4, and a3 = 1.7 and create a 2000-node network. We denote the final synthetic network by S. Table 7 shows the basic statistics of S. This table confirms that the model creates a low clustering coefficient, as well as a small average distance.

    Nodes                         2,000
    Edges                         80,523
    Number of Comments            411,372
    W/S Clustering Coefficient    0.27
    Diameter                      6
    Average Distance              2.10

Table 7. Statistics of the synthetic comment network S

The distributions for the simulated network exhibit a power-law with small slopes. The model has some flavor of preferential attachment, in that receiving a comment has a probability proportional to OutW. However, new links are generated based on a uniform distribution, which yields small slopes in the log-log plots. The slopes for the real network are slightly above 1, and those for the synthetic data with the above parameter setting are below 1. Figure 6 illustrates the distribution of the four random variables, OutD, OutW, InD, and InW, on a log-log scale. We also calculated the correlation coefficients of these four variables. Table 8 shows a high correlation between these variables in the network created by the model, which confirms the earlier observation on the real data.

    R.V.      OutD_t    OutW_t    InD_t    InW_t
    OutD_t    1.00      0.86      0.99     0.93
    OutW_t              1.00      0.85     0.82
    InD_t                         1.00     0.93
    InW_t                                  1.00

Table 8. Correlation coefficient (ρ) of the four basic random variables, in synthetic network

Since the network growth model encourages mutual comments (i.e., comments in both directions), we expect a high number of loop-closing and reciprocal edges. The fractions of reciprocal edges in S and G are 43.64% and 43.43%, respectively. We extracted the network motifs for the synthetic network as well. The entries for S in Table 6 list the distribution of motifs of size 2 and 3 in the synthetic data. The frequency of different subgraphs of size 3 indicates that the synthetic network exhibits the same microscopic properties as the comment network. The frequency of triangles is still a significantly smaller portion of the triads in the network. This causes the low clustering coefficient of the synthetic network, approximately 0.27.

The only significant differences in motif frequencies between G and S are those of motifs 36, 6, and 78. Motif 36 is the case where a blogger i has placed comments on j and k but has received no comments from them. Motif 6 is slightly different: a blogger i has received comments from j and k, but has never replied. However, if bloggers were to leave loop-closing comments with higher probabilities, both motifs would change to motif 78. In motif 78, a blogger i has received comments from j and k, and has placed comments on both of them as well. The probability of forming a non loop-closing new edge is indirectly controlled by one parameter, α, which in our simulation is set to 0.5. If we decrease α, fewer loop-closing edges will appear, and the frequency of motif 78 will decrease. This will immediately increase the occurrence frequency of both motifs 36 and 6.
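The three-step procedure of this section can be sketched in a few dozen lines. The version below is a simplified reading, not the authors' implementation: it rounds the per-step budgets to integers, picks the commenter in step 2 with weight 1 / (1 + outdegree) as one possible form of "decreasing with outdegree", lets a new link fall on an existing edge, samples step 3 from the weights as they stand at the start of that step, and uses a far smaller network than the 2000 nodes reported. All function and variable names are hypothetical.

```python
import random
from collections import defaultdict


def simulate(n_nodes, alpha=0.5, a2=1.4, a3=1.7, seed=0):
    """Grow a synthetic comment network; returns {(i, j): weight}.

    Edge (i, j) counts the comments left for i by j, as in the text.
    """
    rng = random.Random(seed)
    w = defaultdict(int)        # edge weights
    outdeg = defaultdict(int)   # number of distinct commenters per blog

    def add_comment(i, j):
        if w[(i, j)] == 0:
            outdeg[i] += 1
        w[(i, j)] += 1

    add_comment(0, 1)           # seed the network with two blogs
    for t in range(2, n_nodes):
        # Step 1: node t joins via an edge e(j, t), with j < t chosen uniformly.
        add_comment(rng.randrange(t), t)
        nodes = list(range(t + 1))
        e_t = (t + 1) ** a2 - t ** a2
        c_t = (t + 1) ** a3 - t ** a3
        n_new = round((1 - alpha) * e_t)
        n_reply = max(0, round(c_t) - n_new)
        # Step 2: new links; commenter j is favoured when its outdegree is low
        # (assumed 1 / (1 + outdegree) weighting, fixed for the whole step).
        bias = [1.0 / (1 + outdeg[v]) for v in nodes]
        for _ in range(n_new):
            j = rng.choices(nodes, weights=bias)[0]
            i = rng.choice(nodes)
            if i != j:
                add_comment(i, j)
        # Step 3: replies; e(i, j) is chosen with probability ~ weight of e(j, i).
        pairs = list(w.items())
        wts = [wt for _, wt in pairs]
        for (j, i), _ in rng.choices(pairs, weights=wts, k=n_reply):
            add_comment(i, j)
    return dict(w)


def reciprocity(w):
    """Fraction of connected pairs with comments in both directions."""
    pairs = {tuple(sorted(e)) for e in w}
    mutual = sum(1 for a, b in pairs if (a, b) in w and (b, a) in w)
    return mutual / len(pairs)


S = simulate(300)
print(len(S), sum(S.values()), round(reciprocity(S), 2))
```

The printed values are the number of distinct edges, the number of comments, and the share of bidirectional pairs. Because of the simplifications listed above, they are not expected to reproduce Table 7 or the motif frequencies of Table 6.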
Figure 6. Distributions of (a) OutD, (b) OutW, (c) InD, and (d) InW in S, with best power-law fit exponents α = 0.61 (R = 0.73), α = 0.66 (R = 0.76), α = 0.73 (R = 0.91), and α = 0.79 (R = 0.95), respectively.

5.4 Discussion

The model introduced in this paper describes the growth of the comment network in a specific blogosphere. Our method is based on empirical analysis of data and observations on a single large dataset. The model is formalized in terms of three parameters, $a_3$, $a_2$, and α. As shown, it successfully describes the growth of the giant component in the comment network. However, given that each node connects to the giant component upon arrival, the model will create a single weakly connected component. In contrast, Table 2 shows that the giant component in the comment network hardly constitutes half of the network, while there are 201 weakly connected components.

One way to create a network with several disconnected components is to introduce a new parameter: each new node, upon arrival, makes a link only with a certain probability. Nodes that do not connect to the giant component may then receive comments later from future new nodes and form smaller components. This will produce the growth of different components, each of which follows the model described in this paper. An empirical or exact analytical solution to the growth of the comment network with disconnected components might be a good future direction.

6 Conclusion and Future Work

In this paper we introduced, illustrated, and analyzed our understanding of a rich and dense comment network extracted from the PersianBlog dataset. Our goal was to measure the characteristics of this network and to provide a rather simple model to explain the behavior of the comment network. Our observations indicate a high correlation between the number of comments placed and the number of comments received by bloggers. This suggests that the more comments a blogger places, the more comments she receives from other bloggers.

We also examined the distributions of the major network parameters: indegree, outdegree, and the numbers of comments placed and received by bloggers. The power-law exponents for the distributions of all of these parameters are slightly above 1, which suggests that the comment network has a very weak flavor of preferential attachment. We believe this is due to the lack of global knowledge of the blogosphere: either bloggers do not know who is well connected and famous, or they do not place comments on those blogs. In addition, our study illustrated that the commenting behavior of bloggers does not tend to form triangles, which means that they do not place comments on the blogs of friends of friends. We showed that this behavior causes the low clustering coefficient of the network.

The comment network has a special characteristic: it provides readers with the opportunity to place their opinions as comments. This feature makes the blog environment a suitable place for the spread of link spam and of spam comments intended to increase popularity.

Our future work is in three directions. First, we are interested in analyzing the comment network from a game-theoretic point of view, to find the best strategy for placing comments on other blogs in order to maximize one's score (e.g., PageRank). As mentioned earlier, this problem is entirely different from other well-known PageRank applications. Second, we showed that bloggers' strategies for placing comments on other blogs do not follow a preferential attachment model; hence, detecting spam in this setting requires new approaches, in contrast with well-known spam detection techniques for the Internet and email. We illustrated that the analysis of motifs can give us valuable information about the microscopic behavior of bloggers, and we plan to continue this work on spam detection using motif-based features. Third, it is worthwhile to compare the network of comments with those of post links and blogrolls. A major fraction of comments fall between friends, whose relationship could be determined by studying blogrolling links.

It is also important to look at comment networks of different languages or genres. One good future direction might be to see whether the comment network differs from one category to another (e.g., Science vs. Sports), and whether cultural issues affect the network properties.