Their warmest cares were the best support for me during the long, cold winters. Finally .... Proportion of blogs at 4 large blog hosting sites over the two datasets, demonstrat- ing that TREC is less ..... Several online services, such as BlogPulse ...
THE STRUCTURE AND DYNAMICS OF INFORMATION SHARING NETWORKS
by Xiaolin Shi
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Phyilosophy (Computer Science and Engineering) in The University of Michigan 2009
Doctoral Committee: Assistant Professor Lada A Adamic, Co-Chair Associate Professor Martin J Strauss, Co-Chair Professor Hosagrahar V Jagadish Associate Professor Kevin J Compton Associate Professor Anna C Gilbert Associate Professor Dragomir R Radev
c
Xiaolin Shi 2009 All Rights Reserved
To Mom and Dad
ii
ACKNOWLEDGEMENTS
First of all, I would like to thank my research advisor, Prof. Lada Adamic. This thesis would not be possible without her guidance, advice and encouragement. She sets a great model as a successful researcher for me to follow. I also would like to thank my academic advisor, Martin Strauss, for his support and advice, especially for those milestones during my PhD years, such as the prelim examination and the job search. This thesis is the result of the joint endeavor with my collaborators, each of whom deserves my gratitude. Chapter II is joint work with Belle Tseng and Lada Adamic. Part of this work was done while I was a research intern at NEC Laboratories America in summer 2006. Chapter III is joint work with Matthew Borner, Lada Adamic and Anna Gilbert. Chapter IV is joint work with Lada Adamic and Martin Strauss. Chapter V is joint work with Lada Adamic, Belle Tseng and Gavin Clarkson, and part of this work was done while I was a research intern at NEC in summer 2007. Chapter VI is joint work with my colleagues, Jun Zhu, Rui Cai and Lei Zhang, when I was a research intern at Microsoft Research Asia in summer 2008. All the collaborating experiences were wonderful and fruitful. I would especially like to thank Dr. Belle Tseng, who was my mentor at NEC Laboratories America during the summers of 2006 and 2007. She provided me with great freedom in pursuing research topics of my interests. I would thank Belle and other members in her group at NEC: Xiaodan Song, Yun Chi and Koji Hino, for their countless help, research
iii
insights and interesting discussions. The summer I spent at Microsoft Research was another terrific experience. Jun Zhu was a great collaborator and supportive friend. Rui Cai and Lei Zhang all nicely provided me with lots of help and conveniences for my work. I would like to acknowledge my thesis committee members, Prof. Kevin Compton, Prof. Anna Gilbert, Prof. H V Jagadish and Prof. Dragomir Radev. Their insightful comments and kind support have improved this thesis a lot. Another professor I would like to acknowledge is Prof. Yaoyun Shi. He served as my advisor in my first year, and it was him who led me into the graduate life at the University of Michigan. My graduate life would have been much tougher without my wonderful friends at Michigan: Bin Liu, Joseph Xu, Ying Zhang, Yunyao Li, Hailing Cheng, and many others. Their warmest cares were the best support for me during the long, cold winters. Finally, I would thank my parents. Their eternal and selfless love is the best motivation for me to pursue my dream bravely and consistently.
iv
TABLE OF CONTENTS
DEDICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
ii
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
iii
LIST OF FIGURES
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
viii
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
xi
CHAPTER I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1
II. Sampled Data of Blogosphere . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9
2.1 2.2
. . . . . . . . . . . . . . . .
9 10 12 14 14 15 19 20 22 23 24 24 26 26 27 29
III. Important Vertices in Networks . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
2.3
2.4
2.5 2.6
3.1 3.2
3.3
3.4
Introduction . . . . . . . . . . . . . . . . . . . Description of data sets . . . . . . . . . . . . . 2.2.1 Dataset overlap . . . . . . . . . . . . 2.2.2 Other network datasets . . . . . . . . Topological features and network comparisons 2.3.1 Degree distributions . . . . . . . . . 2.3.2 Small-world effect . . . . . . . . . . . 2.3.3 Connectivity . . . . . . . . . . . . . . 2.3.4 Clustering coefficient and reciprocity Temporal features . . . . . . . . . . . . . . . . 2.4.1 Degree distributions . . . . . . . . . 2.4.2 Connectivity . . . . . . . . . . . . . . 2.4.3 Clustering coefficient and reciprocity 2.4.4 Densification law . . . . . . . . . . . Blogs in blog hosting sites . . . . . . . . . . . Conclusions . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .
Introduction . . . . . . . . . . . . . . . . . . . . . . Preliminaries . . . . . . . . . . . . . . . . . . . . . . 3.2.1 Importance measures . . . . . . . . . . . . 3.2.2 Description of network datasets . . . . . . Important vertices . . . . . . . . . . . . . . . . . . . 3.3.1 Network properties and important vertices 3.3.2 Important vertices in their subgraphs . . . 3.3.3 Original vs. subgraph properties . . . . . 3.3.4 Summary . . . . . . . . . . . . . . . . . . Compression with guarantees . . . . . . . . . . . . .
v
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
31 33 33 35 36 37 40 45 47 49
. . . . . . . .
49 50 51 52 53 54 56 57
IV. Strong Ties in Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
59
3.5
3.6 3.7
4.1 4.2 4.3
3.4.1 Hardness of compression with guarantees . 3.4.2 Heuristic algorithms . . . . . . . . . . . . 3.4.3 Empirical evaluation and trade-offs . . . . Analytical discussions . . . . . . . . . . . . . . . . . 3.5.1 Erd¨ os-Renyi graphs . . . . . . . . . . . . . 3.5.2 Power law graphs . . . . . . . . . . . . . . Related work . . . . . . . . . . . . . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . . . .
. . . . . . . .
. . . . . .
. . . . . . . .
. . . . . . . .
. . . . . .
. . . . . . . .
. . . . . . . .
. . . . . .
. . . . . . . .
. . . . . . . .
. . . . . .
. . . . . . . .
. . . . . . . .
86
. . . . . .
. . . . . . . .
. . . . . . . .
V. Information Diffusion in Citation Networks . . . . . . . . . . . . . . . . . . . Introduction . . . . . . . . . . . . . . . . . . . . . . . . Description of data sets . . . . . . . . . . . . . . . . . . Discipline proximity . . . . . . . . . . . . . . . . . . . . Impact of information flows . . . . . . . . . . . . . . . . Alternate definitions of proximity between communities Conclusions . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . .
. . . . . . . .
59 63 70 71 75 78 83 84
5.1 5.2 5.3 5.4 5.5 5.6
. . . . . . . .
. . . . . . . .
. . . . . . . .
4.4 4.5
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . Online social networks without weak ties . . . . . . . . . . Random graphs composed of strong ties . . . . . . . . . . 4.3.1 Degree distribution . . . . . . . . . . . . . . . . . 4.3.2 Accidental triangles and the clustering coefficient 4.3.3 Phase transition and the giant component . . . . Average shortest paths of networks of strong ties . . . . . . Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . .
. . . . . .
. . . . . . . .
. . . . . .
. 86 . 88 . 89 . 94 . 101 . 103
VI. Information Diffusion in Online Forums . . . . . . . . . . . . . . . . . . . . . . 104 6.1 6.2 6.3
6.4
6.5
6.6
Introduction . . . . . . . . . . . . . . . . . . Related work . . . . . . . . . . . . . . . . . . Overview of the networks . . . . . . . . . . . 6.3.1 Datasets description . . . . . . . . 6.3.2 User-community bipartite network Community membership . . . . . . . . . . . 6.4.1 Friends of reply relationship . . . . 6.4.2 Community sizes . . . . . . . . . . 6.4.3 Average ratings of top posts . . . . 6.4.4 Similarities of users . . . . . . . . . 6.4.5 Summary . . . . . . . . . . . . . . Statistical user grouping model . . . . . . . . 6.5.1 Bipartite markov random fields . . 6.5.2 Feature function definition . . . . . 6.5.3 Model fitting and testing . . . . . . 6.5.4 Observations . . . . . . . . . . . . Conclusions . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
104 107 109 109 110 111 113 114 115 118 119 121 121 123 124 128 131
VII. Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 7.1 7.2
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 Work in Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
vi
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
vii
139
LIST OF FIGURES
Figure 2.1
Proportion of blogs at 4 large blog hosting sites over the two datasets, demonstrating that TREC is less concentrated at large hosting sites than BlogPulse . . . . . .
13
Overlap in coverage between TREC and BlogPulse: (a) overlap in crawled blogs with around 50% of the blogs covered in TREC being covered in the BlogPulse sample (b) overlap in between-blog links in the TREC and BlogPulse datasets restricted to blogs that occur in both . . . . . . . . . . . . . . . . . . . . . . . . . .
14
The degree distributions of the BlogPulse data with splogs (grey curves) and without splogs (black curves) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15
In-degree distributions of Web, BlogPulse and TREC data, exponentially binned, all showing power-law structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18
2.5
Out-degree distributions of BlogPulse, TREC and Web, exponentially binned . . .
18
2.6
Temporal changes in the in-degree and out-degree distributions in TREC . . . . .
25
2.7
The number of edges versus number of nodes in log-log scale for blogs crawled over different time durations, which obeys the densification power law . . . . . . . . . .
27
The degree distributions of online networks of BuddyZoo data, TREC blog data and Web data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
38
3.2
The slopes of the distributions of hkineigh show the assortativities. . . . . . . . . . .
39
3.3
In the top row are subgraphs induced by the top 100 important vertices of BuddyZoo for all four importance measures, while in the bottom row are subgraphs induced by the 100 highest degree vertices in the other three networks. . . . . . . . . . . . .
41
The sizes of largest connected component of the sub-networks of important vertices in Erd¨ os-Renyi random graph and three real online networks. . . . . . . . . . . . .
42
The growth of numbers of edges between important vertices. The slope of the black dash line in each plot is the ratio of the number of edges v.s. the number of vertices in the entire network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
44
The ASP of: all vertices in the entire networks (black dashed line), important vertices in the the subgraphs (solid points), important vertices in the entire networks (hollow points). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
46
2.2
2.3
2.4
3.1
3.4
3.5
3.6
viii
3.7
3.8
3.9
4.1
4.2
4.3
4.4
4.5
4.6
4.7
5.1
5.2
Pearson correlations of importance values of vertices in subgraphs and original graphs. The black dashed lines are the base lines starting from 0 when the number of vertices is 0; and ending at 1 when all the vertices in the networks are included.
48
The distance of important vertices a and b in the original graph is 3 and n − 3 in the compressed graph obtained by KeepOne. The ratio of distances can be made arbitrarily large as limn→∞ n−3 3 = ∞. . . . . . . . . . . . . . . . . . . . . . . . . .
51
The number of edges between important vertices, where importance is measured by degree, in three networks: 1) power law network with α = 2.2, n = 1000, 2) Erd¨ os-Renyi graph with the same average degree, and 3) power-law graph with the same exponent but a cutoff at k = 100. Two dotted lines show what the number of edges would be if the average degree in the subgraph were equal to the average degree in the original network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
54
The distribution of the strength of ties, measured as the number of triads each tie participates in. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
66
The largest component of the reduction of the BuddyZoo network where each tie participates in at least 47 triads. The triads themselves are not all shown — only the ties that share a threshold number of them. . . . . . . . . . . . . . . . . . . . .
67
The size of the giant component as only ties of a minimum strength (measured in the number of triads it is a part of) are kept in the network. The inset shows the growth of the average shortest path between connected pairs. . . . . . . . . . . . .
69
The ratio of the number of accidentally formed triangles to the number randomly chosen by the model. For fixed average degree and increasing number of nodes, the ratio of accidentally formed triangles drops as 1/N . . . . . . . . . . . . . . . . . . .
76
Examples of triangle graphs with 1000 nodes with varying numbers of triangles M . Accidental triangles are marked with bold lines. . . . . . . . . . . . . . . . . . . . .
77
Comparison of numerical simulations with analytical solutions for the fraction of the network occupied by the giant component of a 10,000 node triangle graph and the corresponding Erd¨ os-Renyi graph . . . . . . . . . . . . . . . . . . . . . . . . . .
81
Numerical comparison of the average shortest path in triangle graphs and Erd¨osRenyi graphs with the same number of nodes and edges. The inset shows the average shortest path as a function of the size of the giant component rather than the total number of nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
83
Information flow matrix for journals in the JSTOR database. The direction of information flow is from the column discipline to the row discipline, with Zij , the Z-score, corresponding to the ith row and j th column. Each entry is shaded according to a normalized Z-score representing whether the number of citations between disciplines is higher or lower than expected at random. Darker shading represents higher Z-scores. The diagonal represents citations within the same discipline. . . .
91
Information flow matrix for patents, with several related areas labeled. . . . . . . .
93
ix
5.3
5.4
5.5
Correlations between proximity Z and impact γ, partitioned by percentile of impact. For example, at the 20% percentile, we show ρ(Z, γ) for the bottom 20% of publications by their impact γ, and for the top 20% by γ. No correlations are shown for the bottom 10-20% of publications because they received no citations. .
96
Average community proximity of citations by impact of citing article in JSTOR. The inset shows the average trend for patents . . . . . . . . . . . . . . . . . . . . .
97
Correlations between citation proximity and impact, for patents published between 2000 and 2006, separated by whether the citation was added by an inventor or patent examiner. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
98
5.6
Average community proximity between communities over time. . . . . . . . . . . . 101
5.7
Average pij between communities over time. . . . . . . . . . . . . . . . . . . . . . . 102
6.1
The growth of edges versus the growth of users in the bipartite networks. . . . . . 112
6.2
The probability of a user joining a community in the forum as a function of the number of reply friend k who are active in that community at the previous time snapshot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.3
The probability of a user joining a community in the forum as a function of the normalized community size at the previous time snapshot. The insets show the probability before normalization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.4
The probability of a user joining a community as a function of the average rating of the top 10% high rating posts in the community at the previous time snapshot.
117
6.5
The user similarities versus the community overlaps. The main plots use the communication frequency between users as the user similarity, and the insets use the number of common friends. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.6
A bipartite MRF model with N communities and M users at time t. {et } is an instance of the connections between users and communities at time t. The dashed edges are observed evidence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
x
LIST OF TABLES
Table 2.1
Connectivity comparison between the Web graph and blogosphere samples . . . . .
21
2.2
Temporal changes in the connectivity in TREC . . . . . . . . . . . . . . . . . . . .
25
2.3
Blogs in hosting sites in the BlogPulse dataset . . . . . . . . . . . . . . . . . . . . .
28
2.4
Links among blog hosting sites in the BlogPulse dataset . . . . . . . . . . . . . . .
29
3.1
The average shortest path (ASP) and other characteristics of the largest components of the graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
35
Spearman correlations between importance measures of vertices. All the p-values of the correlations are < 0.0001. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
38
Comparison of the properties of subgraphs generated by different methods with important vertices in Erd¨ os-Renyi random graph, BuddyZoo and TREC. Sub-Importance Measure100 is the subgraph induced by top 100 important vertices only; KO- is the subgraph generated by KeepOne; KA- is the subgraph generated by KeepAll. LC is the fraction of important vertices in the large component of the subgraph. Avg PSP is the average pairwise shortest path length in the subgraph. . . . . . . .
52
4.1
Distribution of connected components in online communities. . . . . . . . . . . . .
65
4.2
Distribution of connected components in the BuddyZoo AOL instant messenger community. A tie is considered weak if two users who list each other on their buddy lists do not list a third person in common. . . . . . . . . . . . . . . . . . . .
68
3.2
3.3
5.1
Citing behavior and subsequent citations earned. . . . . . . . . . . . . . . . . . . . 100
6.1
Statistics about the bipartite networks. . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2
The two representative feature functions in BiMRF. cs denotes the features of normalized community-size and us denotes the two types of user similarity who are the same in defining feature functions. . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.3
Distributions of the number of related users on different datasets for frequencyuser-similarity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.4
Evaluation results of different BiMRF models on the four datasets. . . . . . . . . . 127
6.5
Evaluation results of the top-post-rating, and user-similarity on Digg and Google Earth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
xi
CHAPTER I
Introduction
The importance of information sharing networks is gaining increasing attention from research scientists. These networks play a crucial role in how we acquire information, how we convey information to one another, and how we interact with other people. Many information sharing networks can help normal users with various daily activities, such as reading and recommending news, making or contacting friends and online purchasing. All of those activities are actual electronic transactions and can be recorded. Nowadays, such data are continuously growing and evolving, and are an indispensable source of information for researchers to study the underlying human dynamics that are reflected by its various patterns. Most of such data has either explicit or implicit link patterns, and these links are potential paths for information to spread over that entire online social environment. Due to the wide usage of online social environments in daily lives and the availability of the data, this type of data helps researchers investigate the behavioral patterns of people foraging for information and interacting with each other in two ways [53]. First, as such networks are of unprecedented size and are evolving rapidly, they are increasingly closer to the human activity dynamics of the real world. Second, information sharing networks are prevalent and important in people’s daily activities so
1
2
that human behavior is greatly affected by them. For example, networks of emails or online social networks are two of the major ways that people communicate and socialize with others. As these two ways converge, the research that analyzes and studies information sharing networks reveals more truths about real-world human behavior and their dynamic systems. Most research work that has been done on information sharing networks can be mainly classified into three categories. The first type of research is done by defining networks from real online data and studying what properties the networks have, either static or with temporal changes [81]. It is well known that, in spite of the random dynamic changes of information systems, there are a number of strong regularities both in the structural and temporal features of those networks, such as the power-law of the degree distribution and the small world phenomenon of the Web graph [2, 23]. The second is understanding why networks have those properties. Many features of the networks are explained and simulated by some relatively simple dynamical mechanisms or modeled by some simple rules in random graphs, such as the preferential attachment mechanism [15] and the forest fire models [68]. These models help the understanding of online networks and thus to predict their future behavior. The third is utilizing the properties of real-world networks to achieve some tasks, such as searching for information [86], summarizing the data [104] and finding communities [87]. Such research helps the development of more powerful and efficient algorithms and software for collecting and retrieving information, and it also provides valuable insights for designing better systems for users. The research work in this thesis, which is about the structural features of information networks and their underlying relationship with information dynamics, is mainly in the first two categories. However, many results also imply potential applications in information
3
management, information retrieval, etc. As this thesis focuses on uncovering the features of structure and dynamics of information sharing networks, it studies the topological structures of these networks, the relationship of the structures and information diffusion, and the influences of information flows on network evolution. The first three pieces of work (Chapters II, III and IV) aim to understand the structures of networks upon which information flows, and the last two pieces (Chapters V and VI) are trying to answer the question of how the network structure influences information diffusion and the relationship between communities and information diffusion. Before looking at the structural features of information sharing networks in detail, we should be aware that such networks are massive and rapidly evolving, so that their properties are difficult to trace. Thus, the first question we have is whether the structure resulting from information sharing can be reliably measured in networks that are continually evolving. In Chapter II, we answer this question by studying sampled data sets of blogosphere. The blogosphere serves as a medium for self-expression, community formation and communication, and information diffusion and aggregation. The rich structure of the blogosphere has proven to be fertile ground for exploring research questions from a variety of fields. Some have focused on the motivations behind blogging [21], the relationship between a person’s tendency to keep blogging and their embeddedness in the online social network [63], and explored the possibility of extending blog’s interactive nature for research and commercial collaboration [20]. A few studies have specifically focused on the LiveJournal blog network and found patterns of link distribution across geography [70], factors contributing to link formation such as common interests and age [59], and even the likelihood of a blogger joining a new LiveJournal interest group if many of
4
their blogging friends have [13]. Others, closer to our current goals, have pursued a systematic approach of analyzing the large scale network structure of the blogosphere. Kumar et al. examined the structure of the blogosphere, both in terms of the bursty nature of linking activity, the uneven distribution in the concentration of such links, and the effect of time windowing on the appearance of that distribution [58, 59, 60]. Information diffusion studies have aimed to use the link structure and other blog properties to infer the path of information flow [45, 4]. Moreover, other studies have used the link structure to solve problems such as splog detection [56] and community identification [117]. Splogs (also known as spam blogs) are blogs whose sole purpose is to direct traffic and increase the search engine rankings of particular websites. In this work, by comparing two large blog datasets, we demonstrate that samples from the blogosphere might differ significantly in their coverage but still show consistency in their aggregate network properties. The results of the work also show that properties such as degree distributions and clustering coefficients depend on the time frame over which the network is aggregated [101]. While Chapter II shows that it is possible to get reliable metrics of real-world networks based on comprehensive samples, we may still face challenges in obtaining such samples, when, for example, the network is simply too massive. To address this problem, in Chapter III, we observe that some vertices in many large information sharing networks can play an important role in graph representation and information diffusion. The numbers of such special vertices can be very small compared to the size of the original networks. In this chapter, which deals with important vertices and their graph synopses, we examine the properties of subgraphs of the most prestigious vertices, i.e., vertices of the highest values using some well-established importance measures, in several online networks, including those of blogs, websites in general, and
5
instant messaging users. There are previous studies about compressing web graphs for space-efficient data storage and transfer [116, 5], using a subgraph to represent the original large graph (the graph sampling problem) [67, 62], mining a subgraph for visualization of the original graph [39, 123], placing sensors to detect information flow [65], constructing a synopsis by projecting queries [66], and quantifying the extent to which important vertices hold online social networks together [76]. However, this work has focused on keeping or representing the properties of the original networks; i.e., studying the entire networks. We study the more fundamental properties of the subgraphs induced by important vertices. Our principled and rigorous study of the properties, construction and utilization of subsets of special vertices in large online networks showed that vertex-importance graph synopses provide small, relatively accurate portraits, independent of the importance measure [104]. In addition to revealing interesting characteristics of the vertices in information sharing networks and their roles in information diffusion, further investigation about a set of special edges, the strong ties, and their relationship with information diffusion are studied in Chapter IV. Simple connectivity through arbitrary ties is sometimes insufficient to transmit information, because ties may need to be of a given strength in many real-world scenarios. For example, sensitive information or information that may confer an advantage to those who have it, may only be shared between individuals who know one another well [46]. This chapter analyzes the connectivity and information transmission of strong ties. From some online social networks, we show that strong ties occupy a large portion of the network and that removing all other ties does not change the majority of the giant connected component and the average shortest path of the online friendship networks. What is more, the cost of forming transitive ties (which we take as the definition of strong ties) by modeling
6
a random graph composed entirely of closed triads is evaluated. Both the empirical study and random model point to the robustness of strong ties with respect to the connectivity and small world properties of social networks. Thus, this work shows that it is still possible, under the restriction of tie strength, for information to be transmitted widely and rapidly in empirically observed social networks [102]. After examining the structural features of the networks in which the information is possibly transmitted, this thesis further investigates how the structure influences information diffusion and the relationship between network communities and information diffusion. In Chapter V, we examine information diffusion between communities and its subsequent impact in information sharing networks by studying the citation networks. Published scholarly work is a traditional social medium for the exchange of scientific ideas and knowledge. The structure and growth of such citation networks have been studied extensively to measure the impact of individual articles and the evolution of entire fields [29, 31, 93, 17]. Applying bibliometrics to citation networks to study the impact of fields, individuals, and particular papers has been the purview of the field of scientometrics [31]. As early as in the 1960s, de Solla Price first developed models to explain the heavy tailed distribution in the citations an individual paper receives [29]. Recently, the availability of large scale citation data and computational power has enabled the visualization and quantification of the amount of information flow between different areas in science [19, 16], in effect mapping human scientific knowledge. These visual maps leave open the question, however, of the size, speed and impact of information flows across community boundaries. Prior work has shown these flows to be relatively insignificant; omitting information flow between communities when one models citation networks still provides realistic citation distributions and clustering coefficients [17, 97]. Not only are information flows
7
across scholarly communities infrequent, they are also delayed: on average more time elapses between the citing and cited articles for citations across disciplines than ones within a discipline [93]. In this chapter, we view citation networks from the perspective of information diffusion. We study the structural features of the information paths through the networks and analyze the impact of various citation choices on the subsequent impact of the publication. The analysis shows that a publication’s citing across disciplines is tied to its subsequent impact. In the case of patents and natural science publications, those that are cited at least once are cited slightly more when they draw on research outside of their area. In contrast, in the social sciences, citing within one’s own field tends to be positively correlated with impact [100, 103]. In Chapter VI, we study the factors that could influence information diffusion among online communities. We examine the diffusion curves and the likelihood that a user will join a group based on the pattern of her interaction with other users and the features of groups in online forums. As new ideas and controversial discussions are always emerging and propagating among online forums, it is interesting to study the process of information diffusion in this social medium. The human behavior of gathering together and forming groups has been an important theme in studying information diffusion, because people taking the same actions as their neighbors is strong evidence that information flow has occurred [13]. Characterizing user grouping behavior in online social environments does not only help researchers to understand many of the sociological problems of human behavior, but also facilitates them to improve various applications in the online environment, such as the recommendation systems [110]. In this chapter, patterns in user behavior in joining groups and the feature factors associated with users or groups that influence such behavior are studied. We show the diffusion patterns associated with features of
8
users and groups, and we use Markov random graph models to help understand the relationships of these features, as well as the differences in their impact in different types of online forums [105]. At last, we will conclude the work in this thesis, and discuss future work in perspective at Chapter VII.
CHAPTER II
Sampled Data of Blogosphere
2.1
Introduction
In this chapter we first address the question of how information sharing networks, which are constantly evolving, can be captured and understood. Blogs are especially well suited for this study, since they form a vast dynamic and growing network, with new blogs continuously emerging, millions of existing blogs creating new content, while some lay abandoned as their authors start other blogs or activities. Of particular interest are the direct citation patterns between the blogs, because they indicate interaction and information diffusion—blogs linking to posts they read on another blog while possibly writing additional content of their own. Tracking information diffusion in the blogosphere is not just an intriguing research problem, but is of interest to those tracking trends and sentiments. Several online services, such as BlogPulse and Technorati, report the most actively discussed topics in the blogosphere. A heavily blogged topic, even if it originates in the blogosphere, is likely to make its way into the mainstream media. In fact, many mainstream media sources now host blogs as an integral part of their websites, while some of the most popular blogs rival most mainstream media online outlets in popularity [10]. In this chapter, we have two objectives. The first is to examine how robust the
9
10
features of the blogosphere are when examined through the lenses of two different samples. The second is to compare these features with previously studied Web and social network datasets, in order to understand the blogosphere network structure in the wider context of other social and technological networks. Our blog datasets stem from two sources, BlogPulse and TREC (described in more detail below), both intended for use by the research community to study different aspects of the blogosphere. We compare these two sets of blogs directly, first in terms of their coverage and overlap, then in terms of their network properties. We find that although the datasets differ widely in size, cover different time durations, and are set months apart, their properties show remarkable consistencies. Unfortunately, a fair fraction of blogs are in fact spam blogs, automatically generated blogs created with the intention of altering search engine results and directing traffic to specific websites. These splogs account for a large fraction of the links in the datasets. Consequently, we also study the effect of splog removal on the properties of the networks. Furthermore, we examine the effect of aggregating the network over time, similar to previous work by Kumar et al. [60], and find that the degree distribution and other properties converge when the network is aggregated. Finally, we contrast the linking patterns within and between different blog hosting sites, finding that most large blog hosting sites tend to be “exporters” of links—with many of those links going to blogs with their own domain names. 2.2
Description of data sets
We use two datasets in our study of the blogosphere. One is the WWW2006 Weblog Workshop dataset from BlogPulse, which has 1,426,954 blog URLs in total, and 1,176,663 distinct blog-to-blog hyperlinks. This dataset covers 3 weeks of blogging
11
activity, from July 4, 2005 to July 24, 2005. It contains hyperlinks that occur only in the blog entries themselves, and exclude blogrolls or comments. Consequently, the network is quite sparse—among the over 1.4 million blogs, only around 141,046 (10%) of them have links to other blogs in the dataset. If we only consider the blogs having at least one in-link (receiving a citation from another blog) or out-link (giving a citation to another blog) in the dataset, the average degree of this network is hki = 4.924. We omit from our analysis the additional 160,670 URLs, that were identified by BlogPulse to be blogs with at least 1 citation, but whose entries were not included in the dataset. The TREC (Text REtrieval Conference) Blog-Track 2006 dataset is a crawl of 100,649 RSS and Atom feeds collected over 11 weeks, from December 6, 2005 to February 21, 2006. In our experiments, we removed duplicate feeds and feeds without a homepage or permalinks. We also removed over 300 Technorati tags (e.g., Technorati.com/tag/war on terror), which appear to be blogs, but are in fact automatically generated from tagged posts. Different from the BlogPulse data, the TREC dataset contains hyperlinks of various forms, including blogrolls, comments, trackbacks, etc. There are 198,141 blog-to-blog hyperlinks in total, and 33,385 blogs having at least one such link. However, in order to do a fair comparison of the two different blog datasets, we restrict the TREC data to only the 61,716 hyperlinks occurring within entries. There are 16,432 making or receiving at least one such link, giving us an average degree of hki = 3.8. The work of [73] describes the creation of the TREC data in more details, and reports some statistics about this dataset, such as the degree distribution. Aside from the differences in the sizes and time spans of the two blog datasets, the nature of the two corpora and the way they are constructed are also different.
12
The BlogPulse dataset is more like a complete snapshot, while the TREC dataset is more biased and artificially sampled. Considering all these factors, one of the main purposes of our work is to explore how these factors would affect the observations of blogosphere. Some previous work has identified a certain fraction of splogs in these two datasets. In BlogPulse, according to the splog detection methodology presented in [56], the percentage of splogs is 7.48%. And in TREC, the percentage of splogs is about 18% [73], while after restricting the blogs to those have homepages, the percentage of splogs detected was around 7% [71]. 2.2.1
Dataset overlap
The BlogPulse and TREC datasets are two samples of the same blogosphere, albeit of vastly different sizes, covering different time durations and about 5 months apart. We are interested in comparing them in two respects in order to assess the difficulty in obtaining a comprehensive sample of the blogosphere. First, we compare the coverage of the two sets according to different blog hosting sites, as shown in Figure 2.1. In the TREC dataset, a smaller fraction of the blogs is hosted by the major blog hosting sites. The largest subset at 28% is hosted by LiveJournal, followed by 6% hosted at TypePad. In contrast in the BlogPulse dataset , a full 48% is hosted at LiveJournal, followed by 20% hosted at Xanga. Second, we directly compare the overlap between the two blog datasets in terms of the blogs commonly crawled by both. Figure 2.2(a) shows that of the 16,432 blogs whose entries were included in the TREC dataset, 7,225 (or 44%) are also in the much larger BlogPulse dataset. Finally, we take this common set of blogs and compare the overlap in the undirected edges in the two datasets. Specifically, if blog A cites blog B (or vice versa) during the 3 week period covered by the BlogPulse
13
data, we examine what percentage of the time we also observe blog A citing blog B or vice versa in the 11 week period of the TREC crawl 5 months later. Somewhat surprisingly, we find very little overlap. There were only 2823 pairs with edges in both datasets, compared to 56,387 pairs with edges in the BlogPulse data and 57,091 in the TREC data. This means that the same blogs that might be mentioning one another during one short period have only a 5% chance of doing so about half a year later. The above shows us that two relatively large datasets representing “samples” of the blogosphere actually have dramatically different coverage of blogs. Even where the two network samples overlap in nodes, we find that the connectivity, namely the links between the blogs, are likely to change substantially over time.
0.0
0.1
0.2
0.3
0.4
BlogPulse TREC
LiveJournal
Xanga
MSN
Blogspot
Figure 2.1: Proportion of blogs at 4 large blog hosting sites over the two datasets, demonstrating that TREC is less concentrated at large hosting sites than BlogPulse
14
21,372 TREC-only blogs
22,310 blogs
TREC-only 57,091 links 2,823 links
BlogPulse-only 1,426,954 blogs
(a)
BlogPulse-only 56,387 links
(b)
Figure 2.2: Overlap in coverage between TREC and BlogPulse: (a) overlap in crawled blogs with around 50% of the blogs covered in TREC being covered in the BlogPulse sample (b) overlap in between-blog links in the TREC and BlogPulse datasets restricted to blogs that occur in both
2.2.2
Other network datasets
To understand the properties of the blogosphere and how they differ from other networks, we study similar features in the Web graph, which was presented in [8] and [23]. The former dataset contains 325,729 documents and 1,469,680 links taken from a 1999 crawl of the nd.edu domain. The latter crawl from 2000 contains 200 million web pages and 1.5 billion links. 2.3
Topological features and network comparisons
In this section, we study the properties and topological features of the blogosphere by analyzing the two networks constructed from the BlogPulse and TREC datasets. We first restrict our analysis to those links that are located within crawled entries and cite blogs within the data set. We then include any additional hyperlinks, such as blogrolls, comments, and trackbacks, that were included in the TREC dataset. The networks are treated as directed but unweighted graphs where we are simply
15
taking into account whether a blog cites another blog, and not how many times it does so. 2.3.1
Degree distributions
In a directed graph, for a vertex v, we denote the in-degree din (v) as the number of arcs to v and the out-degree dout (v) as the number of arcs from it. The distribution of in-degree pkin is the fraction of vertices in the graph having in-degree k and pkout is the fraction of vertices having out-degree k. If both the in-degree and out-degree of a vertex are 0, then the vertex is isolated. The average in-degree is:
hkiin =
(2.1)
1 X din (v) |V | v∈V
which is a global quantity but measured locally. The average out-degree hkiout is defined similarly.
!
!
! !
5
10
50 k_in
100
500
1 e−01 Prob(D>=d)
1 e−03
! !
!
! !! !
!
!
1
! ! ! ! ! ! !! ! !! ! !! ! ! ! !!!!! !! !!!! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !!!! !! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !! !
1 e−05
1 e−03
! ! ! ! ! !! !! !! !! !! !!!! !!!!!!!! ! ! ! !! ! !! ! ! ! ! ! ! ! ! ! ! ! !!! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! !! !! ! ! ! ! ! !! !! ! ! !! ! ! ! !! ! !
1 e−05
Prob(D>=d)
1 e−01
! !
! !
1
5
10
50
100
500
k_out
Figure 2.3: The degree distributions of the BlogPulse data with splogs (grey curves) and without splogs (black curves)
16
First, we observe that the degree distributions are greatly affected by the existence of splogs. Considering all the blogs in the BlogPulse data, both in-degree and out-degree distributions have an unusually high number of blogs with degrees ranging from 10 to 500. This results in irregular shapes for the cumulative degree distributions, which represent the proportion of blogs having at least k in-links or out-links. However, after removing splogs identified by Pranam et al. [56] for the BlogPulse dataset, we replicate the result that the cumulative in-degree and outdegree distributions show smoother curves, as shown in Figure 2.3. After excluding splogs from the BlogPulse data, we compare the degree distributions of the blogosphere and the Web, using the Web degree distributions measured by Broder et al. [23] for a 1999 Alta Vista crawl of 200M pages. This previous study, along with a study of the nd.edu domain [8, 7], and a crawl in 2001 of 200M pages by the WebBase project at Stanford [32] found that the indegree distribution of the Web is scale free with a power law exponent α of 2.1. From Figure 2.4, we can see that the in-degrees of the BlogPulse and TREC datasets show similar power-law distributions to the Web graph. TREC exhibits a slightly shallower slope, while the BlogPulse data presents a slightly steeper one. This is consistent with the previous finding that sampling a power-law network can produce networks with steeper power laws [115]. Since the BlogPulse data is of a shorter time duration than the TREC data it may be more likely to resemble a subsample of the full network. Of course it is difficult to directly compare a web page crawl which contains a single static snapshot of a page, with an aggregation over 11 weeks of an RSS feed for a blog. Although a single download of a blog would usually contain a limited number of entries (with previous ones usually moved to an archive), the RSS feed would correspond to a single long page where content is added over time, but not deleted. It is possible
17
that this aggregation over a longer time period accounts for the similarity of slope for the TREC data compared to the Web. The Web outdegree distribution has been found either not to follow a power law distribution at all, or to exhibit a steeper power law only in the tail. Broder et al. measured the tail to have a power-law exponent of 2.7, Albert and Barabasi measured it to be 2.45 [8, 7], while Donato et al. found it not to follow a power-law at all [32]. The out-degree distributions of TREC and BlogPulse, shown in Figure 2.5, drop off much more rapidly than the Web graph. On the one hand, this may again be due to sampling. For example, Pennock et al. [90] showed that when certain subcategories of pages are sampled, what starts out as a power-law degree distribution can exhibit sharp drop-offs. Certainly, blogs are only a subcategory of all web pages, and we are furthermore only considering links among a sampled set of blogs. But, more likely, the number of hyperlinks a blog can generate in a limited time period is bounded. This constraint is also observed in many social networks, e.g., co-authorship networks [79]. So while it is possible for one blog to gather much attention (inlinks) in a short time period, it appears less likely for a single blog to lavish as much attention on as many different blogs in the same time period. The same tends to hold true on the web, where some webpages are linked to by thousands of others, but it is much less likely for a single page to contain thousands of hyperlinks. Our results also concur with previous measurements of the blogosphere, which have revealed power-law distributions of in-degrees based on blogrolls and in-post citations [60, 70, 106]. Here we were interested in whether we would still observe the power-laws when considering only the in-post citations.
18
WWW !=2.15 BlogPulse !=2.38 TREC ! = 2.12
5
frequency
10
0
10
−5
10
0
2
10
4
10
6
10
10
indegree Figure 2.4: In-degree distributions of Web, BlogPulse and TREC data, exponentially binned, all showing power-law structure
WWW BlogPulse TREC
8
10
6
frequency
10
4
10
2
10
0
10
−2
10
−4
10
0
10
1
10
2
10
3
10
4
10
outdegree Figure 2.5: Out-degree distributions of BlogPulse, TREC and Web, exponentially binned
19
2.3.2
Small-world effect
The small world effect states that in the network, the average shortest path between every pair of reachable nodes is short compared to the total number of nodes in the network[52]. The studies of Web graph have shown that the WWW has the small-world property. Even as the number of web pages has grown exponentially, the average number of hyperlinks that need to be traversed to get from one page to any other (provided such a path exists) has remained relatively small. Albert et al. [8] give a formula to compute the average shortest paths in Web graph if the number of web pages N is known:
(2.2)
hli = 0.35 + 2.06 log(N )
The estimate for the average shortest path using this formula for a graph of 200 million nodes is hli = 17.45, which is quite close to the measured value (hli = 16) found by Broder et al. [23] for a data set of 200 million web pages. Quite similar to the Web graph, our experiments show that even when considering only entry-to-blog links, the blogosphere has the small-world property. Our two datasets, although of different time durations and only partial overlapping in blogs and links, have very consistent shortest paths considering their network sizes. For the TREC dataset (16,432 blogs), it is hli = 7.12, and for the BlogPulse dataset, which is of 143,736 blogs, it is hli = 9.27. If we let N = 16, 432 or N = 143, 736 in Formula 2.2, then the hlis calculated for TREC and BlogPulse are 9.03 and 10.97 respectively, which are larger than what they have been in our experiments. However, this does not necessarily mean that it is easy for information to diffuse
20
widely in the blogosphere. This is because information diffusion is not only related to the average shortest path, but also the connectivity of the graph. Since the average shortest path is only computed between all reachable pairs, it doesn’t take into account what proportion of pairs of blogs could not be reached one from the other simply by following hyperlinks. Our experiments show that only 12.37% of the pairs of blogs in TREC are reachable. For the BlogPulse data, only 6.13% of the pairs of blogs are reachable. Even when we consider the network of TREC data with all forms of hyperlinks contributing to its edges, the percentage of reachable pairs is still only 22.11%. The low percentage of reachable pairs of nodes is also true in the Web: over 75% of time there is no directed path from a random start node to a random destination node [23]. If the connectivity is low in the network, as it is for the TREC and BlogPulse data, it will yield a small average shortest path, but at the same time produce many infinite paths that are not counted. In the following section we examine the important question of connectivity in more detail. 2.3.3
Connectivity
For a directed graph, there are two types of connected components: weakly connected components (WCCs) and strongly connected components (SCCs). A strongly (weakly) connected component is the maximal subgraph of a directed graph such that for every pair of vertices in the subgraph, there is an directed (undirected) path from vx to vy . Thus a weakly connected component is a larger subgraph than a strongly connected component. In the two blog datasets, within a weakly connected component one can follow links within posts to reach either blog A from blog B or vice versa (but not necessarily in both directions), for each pair of blogs A and B. In practice these paths
21
Network Web [8] Web [23] BlogPulse TREC
# of nodes 325,729 203,549,046 143,736 16,432
Max WCC 325,729 (100%) 186,771,290 (91.76%) 107,916 (75.08%) 15,321 (93.24%)
Max SCC 53,968 (16.57%) 56,463,993 (27.74%) 13,393 (9.32%) 2,327 (14.16%)
Fraction of SCC in WCC 16.57% 30.23% 12.41% 15.19%
Table 2.1: Connectivity comparison between the Web graph and blogosphere samples
may be hard to find because the link leading to the path to the second blog could be in any one of the posts made over the 3 or 11 week period. Nevertheless, the connected components give us a sense of the connectedness of the datasets. Our experiments show that, in the TREC data, the largest weakly connected component includes 15,321 nodes, and the largest strongly connected component is of size 2,327. The sizes of largest weakly connected component and strongly connected component of BlogPulse data are 107,916 and 13,393 respectively. We have a comprehensive comparison of the connectivities of blogosphere and the Web in table 2.1. Similar to the Web [33], the discrepancies in the size of connected components are most likely due to the different ways the datasets are crawled, and the time periods in which the networks form. In section 4, we will further discuss the temporal features of the connectivity of blogosphere. On the other hand, if we also consider other forms of hyperlinks in TREC, including 33,385 blogs and 198,141 blog-to-blog hyperlinks, then the resulting network has much better connectivity. The size of the largest weakly connected component is 88.93%, and the size of largest strongly connected component is 44.36%. This shows that the blogosphere is glued together by blogrolls, even if over a limited time period there is relatively little active citation. Another interesting observation about the connectivity of the blogosphere is the following: before cleaning the TREC data, there are 363 technorati.com tag URLs, with 47,521 links either from or to these URLs. Our experiments show that the
22
existence of such extremely high in-degree or out-degree nodes does not affect the overall connectivity of the blogosphere. Before removing the Technorati tag URLs and their links, the size of largest weakly connected component is 30,180 (90.40% of the whole network) and the size of largest strongly connected component is 15,176 (45.46%) - only slightly larger than the components with the tag URLs removed. This observation is similar to the one made for the Web graph by Broder et al. [23], showing that high degree nodes do not play the function of “junctions” in the connectivity of the Web. 2.3.4
Clustering coefficient and reciprocity
The Clustering coefficient is a measurement of the percentage of closed triads in a network. For every vertex vi , its clustering is defined as:
(2.3)
Ci =
number of closed triads connected to vi number of triples of vertices centered on vi
Then the clustering coefficient for the whole graph is averaged over all vertices i. In an Erd¨os-Renyi random graph (a random graph in which every pair of vertices are connected by probability p) [35] with n nodes and a constant average degree, the clustering coefficient is O(n−1 ). However, in most real-world networks, the clustering coefficient, O(1), is much higher, reflects the prevalence of closed triads [81]; i.e., if vertex vx is connected to vertices vy and vz , then the probability for vy and vz to be connected is higher than expected at random. For measuring the clustering coefficient in a directed graph, we ignore the directions of arcs. The clustering coefficients of TREC 0.0617 and BlogPulse 0.0632 (including splogs in both datasets) are large compared with what they are in the corresponding Erd¨osRenyi random graphs. We see that for these values of clustering coefficients, the two
23
datasets are showing nice consistency in spite of the differences in crawling and time duration. These high values are also similar to measurements of the clustering for the Web graph (C = 0.29 [81]) and co-authorship networks (C = 0.19 [79]). Reciprocity is another measurement that shows a significant difference between real-world networks and the Erd¨os-Renyi graph. The reciprocity values (how often, when A links to B, B links to A) is another measure of cohesion, reflecting mutual awareness at a minimum, and potentially online interaction and dialogue. In the datasets, we actually observe very little reciprocity: in TREC, 4.98% edges are reciprocal, and in BlogPulse 3.29% edges are reciprocal. However, if we also consider other types of links in TREC, making the network significantly denser, then the clustering coefficient of this graph is 0.13, and the reciprocity is 20.06%, both of them are significantly larger than they are in the blogosphere merely with entry-to-blog links as its edges. A possible explanation is that people often create entry-to-blog links to cite information. Other types of links, such as comments and trackbacks are by their nature interactive (and trackbacks are by definition reciprocal). Even blog rolls may exhibit higher reciprocity, because bloggers tend to list their friends’ blogs as well as other blogs they tend to read, and friendship is often, though not always, reciprocal. Therefore the low reciprocity we observe could be due to the nature of entry-to-blog links themselves and the short time window of the samples, where we simply haven’t waited long enough to observe a reciprocal link. 2.4
Temporal features
As we have described before, the time ranges of the two datasets are of different lengths: the BlogPulse sample covers 3 weeks, while TREC is crawled over 11 weeks.
24
In order to explore the effects of the crawling periods on the observations of the blog datasets, we take the longer-period TREC dataset and study the properties of the subgraphs in the TREC network over 4 different time windows. We assign a timestamp for each entry-to-blog link as the time the entry is created, where 74.72% of the entries have timestamp. Four overlapping time periods are chosen corresponding to the first 10, 20, 30 and 40 days of the TREC crawl. The 10 days capture 5,793 blogs and 8,818 entry-to-blog links with an average degree of hki = 1.5. The first 20 days capture 8,054 blogs, 16,206 links, bringing the average degree up to hki = 2.0. The first 30 days capture 9,085 blogs with 20,411 links and hki = 2.2. The last subset of 40 days contains 10,433 blogs and 27,724 links with hki = 2.657. This illustrates that as the time duration increases, the average degree also increases. 2.4.1
Degree distributions
We plot the degree distributions of the four time-overlapping subnetworks (10 days, 20 days, 30 days, 40 days), as well as the entire network of 11 weeks with link time stamps (denoted by “Links with TS”), and the entire network with or without link time stamps (denoted by “All links”). From the in-degree and out-degree distributions in Figures 2.6, it is apparent that different time windows yield very similarly shaped curves for both the indegree and outdegree distributions. However, as the time periods get shorter, the curves for both in-degrees and out-degrees are steeper. 2.4.2
Connectivity
In section 3.3, we found that both the BlogPulse and TREC samples have large weakly connected components, but relatively small strongly connected components,
All links
! ! ! ! ! ! ! ! !! !! ! ! ! ! !! !!! ! !! !!! !! ! ! ! ! !! ! !! !! !! !! ! ! !! !! !!! !!!!!!! !!!! ! ! !! ! !!!! ! ! !! ! !! ! ! ! ! ! !! !! !!!!!!! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !!! !!!! ! ! ! ! ! ! ! ! ! ! ! !!!!!! ! ! ! ! ! ! ! ! ! !! ! ! ! !! !! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! !! ! ! !! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! !! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !!! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! !! ! ! !! ! ! ! ! !!! ! ! ! ! ! ! !! !!! ! !! ! ! !! ! ! !!! !!! ! !! ! !! ! ! ! !! !! !! ! ! ! ! !! ! !!! ! !!!! ! !! ! !! ! ! ! ! !! ! !! ! ! !! ! ! ! ! ! ! ! ! ! !! !! ! ! !! !! ! ! ! ! !!!! ! ! ! !!! ! !! ! ! ! ! ! ! ! ! ! !
1 e−04 1
5
10
50
100
! ! ! ! !
! ! ! ! ! !
!
500
!
! ! ! ! ! ! ! ! !
All links Links with TS Links of 40 days Links of 30 days Links of 20 days Links of 10 days
! ! ! ! ! ! ! ! !! ! ! ! ! !! ! ! ! ! !! ! ! !! ! !! ! !!!!! !! !! ! !! ! ! ! !! ! ! !!! !! !! !! ! ! ! ! ! !! ! ! ! ! ! !!! ! !!!!!! ! ! ! ! ! ! ! !!!!!! ! ! ! ! ! ! ! ! ! ! !! !! ! ! ! ! ! ! !!! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !!! ! ! ! ! ! ! !! !!! ! ! ! ! ! ! ! ! ! !! ! !! ! ! ! ! ! ! ! !! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! !! ! !! !! ! !! ! ! ! !! !! ! ! ! ! ! !! ! ! ! ! ! ! ! !
1 e−03
1 e−02
Links of 20 days Links of 10 days
! ! ! !
1 e−01
Links with TS Links of 40 days Links of 30 days
!
1 e−02
!
1 e−04
! ! ! ! ! !
(D>=d)
! ! ! ! !
1 e−03
Prob(D>=d)
1 e+00
!
1 e−01
1 e+00
25
!
1
2
k_in
5
10
20
50
100
200
k_out
(a) In-degree distributions
(b) Out-degree distributions
Figure 2.6: Temporal changes in the in-degree and out-degree distributions in TREC Subset First 10 days First 20 days First 30 days First 40 days All blogs
# of nodes 5,793 8,054 9,085 10,433 16,432
Max WCC 4,719 (81.46%) 7,162 (88.92%) 8,249 (90.80%) 9,662 (92.61%) 15,321 (93.24%)
Max SCC None (0%) 349 (4.33%) 471 (5.18%) 730 (7.00%) 2,327 (14.16%)
Fraction of SCC in WCC 0% 4.87% 5.71% 7.55% 15.19%
Table 2.2: Temporal changes in the connectivity in TREC
with the TREC sample showing better connectivity than BlogPulse, in spite of BlogPulse having larger average degree. The dynamics in the connectivity of blog subgraphs over different time windows is shown in Table 2.2. As time goes on, both the size of the largest weakly connected component and the size of the largest strongly connected component grow larger, and thus the connectivity is increasing. It can also be observed that the weakly connected component is formed earlier and grows more rapidly. In contrast, it takes a much longer period for the strongly connected component to form; however, after a certain period of time, the growth of the largest weakly connected component is relatively stable near 100% of the network, while the largest strongly connected component continues to grow.
26
2.4.3
Clustering coefficient and reciprocity
Next we examine the temporal changes in reciprocity and the clustering coefficient. Our experiments show that the values of reciprocity of the links from the first 10 days to the first 40 days are 2.88%, 3.85%, 3.84% and 4.12%. We can see that except for the shortest time period, all the other values are bigger than in the 3-weeks of BlogPulse (reciprocity of 3.29%) but smaller than in the entire 11-weeks of TREC (reciprocity of 4.98%) . This indicates that reciprocity grows with time, because blogs have a longer opportunity to reciprocate. It also demonstrates that reciprocal links are still extremely sparse. The clustering coefficients from the first 10 days to the first 40 days are 0.034, 0.043, 0.046 and 0.052. All of them are smaller than the clustering coefficient in both BlogPulse (0.0632) and TREC over the full time period (0.0617). So we know that although longer time would increase the clustering coefficient, it may depend more on the density of the sample. 2.4.4
Densification law
Leskovec et al. [68] described the densification law prevalent in many networks: the number of edges grows superlinearly in the number of nodes over time: e(t) ∝ n(t)α . For example, in the Internet, there are new routers appearing and at the same time the number of connections per router is increasing, and the densification exponent is α = 1.18. In patent networks, all the links are added from a patent at the time it is inserted into the network. The densification exponent is α = 1.66. But in our network, probably most of the blogs already existed before the beginning of the crawl (it would be interesting to repeat the analysis with new blogs appearing). During the crawl, as more and more links are added into the network,
27
originally isolated blogs start connecting to each other. The number of edges, shown in Figure 2.7 is increasing nearly quadratically with the number of nodes (α = 1.928). This relatively large value tells us that the densification of a crawled blogosphere with a static set of blogs is a faster process than some other networks such as the Internet
alpha = 1.928
15000 10000
# of links
20000
25000
and patent networks.
6000
7000
8000
9000
10000
# of blogs
Figure 2.7: The number of edges versus number of nodes in log-log scale for blogs crawled over different time durations, which obeys the densification power law
2.5
Blogs in blog hosting sites
Another way of understanding the blogosphere is by analyzing it through different blog hosting sites. Currently, the four largest blog hosting sites are LiveJournal, BlogSpot, Xanga and MSN. They are also the largest four in the BlogPulse dataset, as shown in Table 2.3. In the table, “all links” either originate from or terminate at a blog at the specific blog hosting site; “in links” originate outside of the blog hosting site, but terminate within it; “out links” point from within the hosting site
28
to an outside blog; “internal links” lie between blogs within the hosting site. All these links occur only within entries, and no links of other forms, such as blogrolls, comments, etc. are included. The table also lists in italic the corresponding numbers of blogs and links after removing splogs according to [56]. An immediate observation we can make is that although splogs by number constitute only a small fraction of the total blogosphere, they account for a substantial proportion of the links. # Blogs LJ 678,676 676,719 Xanga 284,693 283,952 MSN 170,108 162,147 BlogSpot 112,184 62,256
All links
Inlinks
Outlinks
Internal links
155,665 95,161
4,561 2,718
15,735 8,731
135,369(87.0%) 83,712(88.0%)
58,741 8,454
2,354 437
3,067 1,534
53,320(90.8%) 6,483(76.7%)
58,811 1,271
44,180 90
1,528 699
13,103(22.3%) 482(37.9%)
845,093 42,830
34,979 12,540
73,730 18,519
736,384(87.1%) 11,771(27.5%)
Table 2.3: Blogs in hosting sites in the BlogPulse dataset
From Table 2.3, we also notice that, for most of these large blog hosting sites, no matter whether or not splogs are removed, the internal links usually occupy the greatest portion of the all links. The percentage of internal links of MSN is relatively small (22.3%). This is because in the BlogPulse dataset, there is a large portion of links from blogs of BlogSpot to blogs of MSN, which are mostly splogs (over 43,000 links). Due to this fact, it lowers the percentage of internal links of MSN. This is another aspect that tells us how splogs would affect our observations of blogosphere. This suggests that blogs within one hosting site are more likely to form densely connected communities, while it is less likely for blogs in different blog hosting sites to be in a community. This pattern may be a result of bloggers preferring to use the same hosting site as their friends, and different hosting sites being prevalent in
29
different countries. In the Table 2.4, we can see that links connecting two different blog hosting sites are very sparse, both with and without splog links. Src & dst Blogs LJ Xanga MSN BlogSpot
LiveJournal 135,369 83,712 1,208 612 61 36 1,109 707
Xanga 873 159 53,320 6,483 179 23 832 151
MSN 160 10 124 11 13,103 482 43,113 17
BlogSpot 4,215 1,714 659 236 309 66 736,384 11,771
Table 2.4: Links among blog hosting sites in the BlogPulse dataset
Another thing one observes from the Table 2.3 is that the numbers of out links always exceed the numbers of in links for a blog hosting site, and most of those out links point to blogs with their own domain names. Since it is easier and often free to create blogs in the blog hosting sites, these kinds of blogs are more casual and personal; in contrast, blogs with their own domain names are more likely to be maintained in a more formal and professional way. And in this sense, it is natural for the self-hosting blogs to have more in links from other blogs. 2.6
Conclusions
For analyzing the topology of a large network such as the blogosphere, it is impossible for researchers to get all the data about it. Rather, one uses various sampling methods to gather some data, typically a small fraction of the whole network, to analyze. Thus, it is very important to examine how robust the topological features of the blogosphere are when incorporating different time durations and ways in crawling the data. This work shows that, for the two different samples of blogosphere, BlogPulse and TREC, in spite of the low overlap in their coverage and time durations of collecting, some topological features, such as the degree distributions, average shortest paths, connectivity, clustering coefficient and reciprocity show great consistency.
30
This work also shows that as the time duration of a crawl is extended, the features start to converge. This tells us that by obtaining some fairly comprehensive samples of the blogosphere, one can start to obtain good estimates of the topological features of the whole space. We also examined the effects of the existence of splogs in the blogosphere, and found that splogs contribute a fair fraction of the total links volume in the blogosphere, and consequently affect the degree distributions greatly. Moreover, by looking at blogs in some large hosing sites, we find that blogs within the same hosting site are more likely to be connected than blogs in different hosting sites. However, this does not mean that there are few links outside of blog hosting sites. Rather many of the links originating at large hosting cites point to blogs with their own domain names. For understanding the topological structure of the blogosphere, we further compared the features with some other large networks, such as the Web graph and some specific social networks. We find that they share some similarities, such as in-degree distributions, the small-world effect, and overall connectivity. However, they differ in other aspects, such as the out-degree distributions and level of clustering. This chapter already shows that some vertices play a much more important role in the network than others. A natural question that arises from this observation is whether it is possible for us to get valuable information from a small subset of ‘important vertices’ in the network without knowing the entire of the network. We examine this problem in the next chapter.
CHAPTER III
Important Vertices in Networks
3.1
Introduction
In the previous chapter, we raised the question of whether some vertices play a more important role than others. In this chapter we examine whether one can create “graph synopses” using subgraphs of important vertices. To study the flow of information, to optimize engineering systems, to design efficient algorithms [22, 54, 55], and to investigate social structure and interaction, we study the statistical and graph properties of entire networks, including such features as degree distributions, connectivity, diameter, clustering properties, and evolution of such networks [6, 23]. For a variety of online networks, small subsets of vertices are relevant for efficient algorithms and dominate various graph and statistical properties. Frequently, these smaller subsets or graph synopses are easier to study and to understand. One might be interested in whether relationships among web pages can be described without crawling the whole web graph and might be inferred from a small set of vertices. We might also study the “communication” among the most influential political blogs [3] and determine whether information flows directly among them or through intermediate blogs. Despite these examples, there is little principled study of the properties, the construction, and the utilization of subsets of special vertices or edges in large
31
32
real networks. Such a study is challenging because it is hard to define precisely what is meant by a small version of the graph. Also, it is difficult to evaluate the quality of a compressed graph. We would like a simple, principled approach to graph synopsis for a number of reasons. First, there are a number of online networks in which a synopsis of the graph is sufficient to capture the relevant information we seek. For example, rather than continuously tracking millions of blogs, one may use occasional snapshots of the blogosphere to construct a subgraph of the most “important blogs” according to a desired measure, and crawl, query, and analyze this smaller synopsis. The synopsis will allow us to capture predominant features of the much larger underlying graph, but, due to its small size, can be stored much more efficiently and even distributed and replicated amongst a number of resource-constrained computers which themselves can execute queries on the content and links. To build a principled approach to graph synopsis, we start with the definition of predominant vertices and define a precise construction of a graph synopsis from these. Typically, the subset of vertices which capture the graph features are those which are “important.” Furthermore, the importance of these vertices is highly skewed—only few of them are of great importance and the majority are less important. These vertices and subgraphs have been studied extensively in online networks [127, 27], but not with the idea of using them for graph synopses. Following much of this work, we choose four standard definitions of importance: degree, betweenness, closeness and PageRank. We demonstrate empirically for a number of representative online networks that these subsets of vertices do not depend highly on the choice of importance measure. Next, we show that it is possible to glean accurate information about the communication, relationship, and flow of information on the original graph
33
and among the top vertices simply from a subgraph constructed from the important vertices. Furthermore, these properties are consistent, regardless of the importance measure we use, and are appropriate for efficient algorithm design and information management. We give a clear, precise definition of the algorithmic problem of vertex-importance graph synopsis in Section 3.2 and discuss the computational hardness of this problem in Section 3.4. We show in Sections 3.3 and 3.4 that most online networks are far from the worst-case graph; they exhibit features (e.g., power-law degree distribution, short average diameter, and high clustering) that allow us to efficiently compute a graph synopsis. Moreover, we tie properties of the subgraphs to measures, such as assortativity, of the original networks. Finally, in Section 3.5, we match the empirical observations to analytical results. 3.2 3.2.1
Preliminaries Importance measures
The definitions of importance or prominence on vertices vary significantly depending on the specific network and application. Most such measures describe the topological location of the vertices. We choose four of the most commonly used measures in various applications as our objects of study: degree, betweenness, closeness, PageRank. Let the graph G(V, E) have |V | = n vertices, the four importance values defined on vertices vi are listed below: 1. Degree D(vi ): previously defined in Chapter II, is a measure of how many vertices in G are connected to vi directly. If G is a undirected graph, then D(vi ) is the number of undirected edges incident to vi ; if G is a directed graph, then
34
D(vi ) is the sum of indegree and outdegree of vi , where indegree is a count of the number of directed edges to the vertex, and outdegree is the number of directed edges from that vertex to others. Degree reflects a local property of the vertices in the graph. 2. Betweenness B(vi ): a measure of how many pairs of vertices go through vi in order to connect through shortest paths in G: B(vi ) =
X
gjk (vi )/gjk
j