Università degli Studi di Pisa
Dipartimento di Informatica

Dottorato di Ricerca in Informatica

Ph.D. Thesis

A distributed Dunbar-based Framework for Online Social Network

Barbara Guidi

Supervisors:
Laura Ricci
Marco Conti

June 2015
It always seems impossible until it’s done. Nelson Mandela
Abstract

Online Social Networks (OSNs) are becoming more and more popular on the Web. Distributed Online Social Networks (DOSNs) are OSNs which do not exploit a central server for storing users' data; they enable users to have more control over their profile content, ensuring a higher level of privacy. In this thesis we propose DiDuSoNet, a novel P2P Distributed Dunbar-based Online Social Network where users can exercise full access control over their data. Our system exploits trust relationships over a novel Dunbar-based Social Overlay to provide a set of important social services, such as information diffusion and data availability. In particular, our system addresses the problem of data availability by proposing two P2P dynamic trusted storage approaches. Following the Dunbar concept, our system stores the data of a user only on friend nodes which have regular contact with him/her. Unlike other approaches, the nodes chosen to keep data replicas are not statically defined but change dynamically according to users' churn. Furthermore, the system provides a new epidemic protocol able to spread social updates in DOSN overlays, where the links between nodes are defined by considering the social interactions between users. Our approach is based on the notion of Weighted Ego Betweenness Centrality (WEBC), an ego-centric social measure approximating the Betweenness Centrality. The weights considered in the computation of the WEBC correspond to the tie strength between friends, so that nodes having a higher number of interactions are characterized by a higher value of the WEBC.

The lack of real datasets containing structural and temporal information about users is the main limitation to research in this field. To fill the gap, we have developed a Facebook application and crawled data about more than 300 Facebook users. We have studied the dataset in depth to obtain useful information characterizing OSN users, not only from a structural point of view, but also in terms of user behaviour, such as session length. Finally, we present a set of experimental results, conducted by using our Facebook dataset and an older Facebook Regional Network dataset, which prove the effectiveness of our system.
Contents

Acknowledgement

1 Introduction
   1.1 Outline of the Thesis

I  Basic Concepts

2 Background and Related Works
   2.1 P2P Systems
       2.1.1 The P2P look-up service: DHT
   2.2 Distributed Online Social Networks
   2.3 Data management in DOSNs
       2.3.1 Data Availability and Persistence
       2.3.2 Information diffusion in DOSNs
   2.4 Security and Privacy
   2.5 Privacy breaches
       2.5.1 Privacy breaches from centralized service providers
       2.5.2 Privacy breaches from other users
   2.6 Existing Approaches
   2.7 Analysis of existing DOSNs

3 Complex Networks Analysis for DOSNs
   3.1 Complex Networks
   3.2 Small World
   3.3 Clustering
   3.4 Degree distribution, scale-free network model and network resilience
   3.5 Centrality Indexes
   3.6 Social Networks: the Dunbar's property
       3.6.1 Ego Networks
       3.6.2 The Dunbar circles
       3.6.3 Tie Strength in Online Social Networks

II  A P2P Dunbar-based Distributed Online Social Network

4 System Model
   4.1 Profile-Based Communication
   4.2 The Social Graph
   4.3 DiDuSoNet: the general architecture
       4.3.1 Mapping the social graph onto the social overlay
       4.3.2 The Dunbar-based Social Overlay
   4.4 The DHT lookup service: Pastry
   4.5 Analysis of the Dunbar Connections

5 Friendly Routing over Dunbar-based Overlays
   5.1 Friendly Routing
   5.2 Social Pastry: the DHT structure
   5.3 Friendly Routing by using goLLuM
       5.3.1 The Routing Algorithm
       5.3.2 Privacy
       5.3.3 Independence
   5.4 Experimental Results

6 Data Availability: a trusted storage approach
   6.1 Trusted Social Storage
   6.2 Basic Trust: k-trusted-replicas approach
       6.2.1 Social Score and Selection Strategies
       6.2.2 PoS Election Algorithms
       6.2.3 Dynamism and Points of Storage (PoS)
       6.2.4 Data Consistence
   6.3 Full Trust: Network Coverage
       6.3.1 Social Score Definition
       6.3.2 Computing Social Coverage
   6.4 Experimental Results

7 Information Diffusion on Dunbar-based Overlays
   7.1 Ego Betweenness Centrality
   7.2 Ego Betweenness Centrality Computation
       7.2.1 Ego Betweenness Centrality Computation in directed graphs
   7.3 Weighted Ego Betweenness Centrality
   7.4 WEBC-based information diffusion
       7.4.1 The Information Diffusion algorithm
   7.5 Distributed protocols for WEBC computation
       7.5.1 EBC Broadcast Protocol
       7.5.2 EBC Gossip Protocol
   7.6 Experimental Results

III  System Evaluation

8 Experimental Results: The Datasets
   8.1 The Zhao's Dataset: general characteristics
   8.2 SocialCircles! dataset
       8.2.1 Topology
       8.2.2 Analysis of the users interactions and ego network properties
       8.2.3 Analysis of temporal characteristics related to data availability

9 System Evaluation
   9.1 PeerFactSim.KOM
   9.2 Social DHT: Experimental Results
       9.2.1 Experimental Setup
       9.2.2 Experimental Results
   9.3 Data Availability: Experimental Results
       9.3.1 Experimental Setup
       9.3.2 Experimental Results
   9.4 Information Diffusion: Experimental Results
       9.4.1 Experimental Setup
       9.4.2 Experimental Results

IV  Conclusion

10 Conclusion and Future Work
   10.1 Thesis Contributions
   10.2 Future Work

A The Facebook application: SocialCircles!
   A.1 Introduction
   A.2 What data can we obtain?
   A.3 The SocialCircles! functionalities
       A.3.1 The Graph page
       A.3.2 The Statistics page
       A.3.3 The Interactions page
       A.3.4 Friends Map page
       A.3.5 About and Privacy statements
   A.4 The application structure
   A.5 The development technologies and tools
       A.5.1 Front-End technologies
       A.5.2 Server technologies
       A.5.3 Database technologies
   A.6 The server side
       A.6.1 The filter level
       A.6.2 The servlet level
       A.6.3 The background demons level
       A.6.4 Other demons and maintenance script
   A.7 The database
       A.7.1 Service tables
       A.7.2 Topology and profiles tables
       A.7.3 Interaction Information tables
       A.7.4 Online status table
   A.8 The hosting and the deployment
   A.9 The application publishing

Bibliography
List of Figures

2.1  An Abstract P2P Overlay Network Architecture
2.2  Comparison of central server, flooding search, and distributed indexing
2.3  Interface of a DHT with a simple put/get interface
2.4  State of a hypothetical Pastry node with NodeID 10233102, b = 2. The top row is row zero. The NodeID is coloured with three different colours to indicate the common prefix with 10233102 (blue), next digit (green), and rest of the NodeID (red)
2.5  General DOSN's Architecture
2.6  Distributed Online Social Network architecture
2.7  Data Management Services
2.8  Availability in DOSNs
2.9  Information diffusion in structured and unstructured DOSNs
2.10 Privacy violations through contents' metadata
2.11 SafeBook Architecture
2.12 Control access in LifeSocial.KOM
2.13 Example Cachet objects
2.14 Gemstone: system architecture
2.15 Current proposed P2P-based DOSNs
3.1  Graphical representation of an undirected (a), a directed (b), and a weighted undirected (c) graph with N = 7 nodes and K = 14 links
3.2  Ego network of the red node
3.3  Dunbar's circles
4.1  The Dunbar-based ego network of nodes x and y
4.2  DiDuSoNet Architecture
4.3  The SocialTable of node N (the Additional Table is identical)
4.4  The Ego Network of node N and its representation in a node
4.5  Searching friend inside the DHT
5.1  Routing loops are possible in naive approaches for Social DHTs
5.2  Basic idea of the routing algorithm is to route around blocked links
6.1  (a) X's ego network EN(X), (b) nodes' CGX when no PoS(X) is still elected, (c) nodes' CGX after D's election as PoS(X)
6.2  Dunbar Ego Network of Node n
6.3  PoS re-election for an offline node
6.4  Case 1: Ego network of the node n after the disconnection of A
6.5  Case 2
6.6  Case 3
6.7  Ego Network of Node n after the reconnection of n. D is not a PoS and the updated data are stored on A and n
6.8  PoS dynamism and Data Consistence
6.9  Network Coverage through PoSs: an example
6.10 Coverage of an edge (A,B)
6.11 Local degree computation
6.12 Local Degree vs. Global degree
6.13 Network Coverage
6.14 A SS node goes offline
6.15 When an offline node joins the network
7.1  Messaging Patterns in an OSN
7.2  The social update dissemination problem over the ego network of node u
7.3  EBC example
7.4  An example of network graph
7.5  Ego Network of the node E
7.6  Example of the Dijkstra's algorithm
7.7  Weighted Ego Betweenness on Undirected Graphs
7.8  A weighted and directed graph
7.9  Community Discovery
8.1  Interaction Graph: four time windows
8.2  Distribution of the contact frequencies
8.3  Distribution of the contact frequencies in the interval [0, 0.1]
8.4  Indegree distribution
8.5  Outdegree distribution
8.6  Joint Degree Distribution
8.7  Distribution of Facebook friends among ego networks
8.8  CDF of the number of Facebook friendships
8.9  Distribution of average clustering coefficient
8.10 Distribution of normalized local degree in the typical ego network
8.11 Distribution of Normalized Ego Betweenness Centrality
8.12 Distribution of Modularity value
8.13 Distribution of discovered communities
8.14 Distribution of total cliques among ego networks
8.15 Distribution of the median clique size for each network
8.16 Distribution of k-cliques in the typical network
8.17 Distribution of overlapping cliques (any size) per node for each network
8.18 Distribution of full and approximate covering set size for each network
8.19 Analysis of the users interactions and ego network structure
8.20 Tie Strength and clusters analysis
8.21 Temporal availability of all users
8.22 Analysis of the temporal properties
8.23 Analysis of the Dunbar circles temporal features
8.24 Analysis of the temporal matching on Dunbar circles
8.25 Percentage of matching
8.26 Conditional probability on active network
9.1  PeerFactSim.KOM: Architecture
9.2  Distribution of incoming and outgoing links for the Zhao's dataset model
9.3  Distribution of incoming links for the Barabási-Albert model
9.4  Zhao's Dataset: Finished Lookups
9.5  Barabási-Albert Model: Finished Lookups
9.6  Small World Graph: Finished Lookups
9.7  Experiment A: Quantity of online nodes and points of storage
9.8  Experiment A: Ratio of available profiles to total number of profiles
9.9  Experiment A: Quantity of ping-pong messages for different values of alpha
9.10 Experiment B: Simulation results, all nodes leave the network without selecting another point of storage (failure)
9.11 Experiment C: Simulation results, half of the leaving nodes select another PoS whereas the other half fails. The simulation time is set to 48 hours
9.12 Pure Availability and Friend Availability
9.13 Average PoS tie strength and Dunbar ego network
9.14 Simulation results
9.15 Frequency of the total amount of replicas needed to have a network coverage
9.16 Normalised values for BC and EBC for a random network extracted from the dataset and composed of 1595 nodes
9.17 Normalised values for BC and EBC for a random network extracted from the dataset and composed of 5000 nodes
9.18 Evaluation of how often the EBC computation is required in a network of 1000 nodes and 150 time instants
9.19 Variation between EBC and WEBC for random networks obtained from the dataset
9.20 Replica Distribution in FoF
9.21 Number of replicas according to the community dimension in a network of 4000 nodes
9.22 Number of replicas according to the community dimension in a network of 10000 nodes
A.1  The index page
A.2  The graph page
A.3  The statistics page
A.4  The interactions page
A.5  The map page
A.6  High level application design: a three-tier structure
A.7  The server side design
A.8  Database schema - Topology and profiles
A.9  Database schema - Interaction information
A.10 Database schema - Online status table
List of Tables

6.1  Table of Symbols used for the Basic Trust Approach
6.2  Table of Symbols used in this section
7.1  Centrality Metrics for the graph in Figure 7.4: betweenness centrality (BC), closeness centrality (CC) and ego-betweenness centrality (EBC)
8.1  Social Graph characteristics
8.2  Refined dataset: characteristics
8.3  Pearson Correlation
8.4  Some properties of different groups of networks according to total clique size
8.5  Dunbar circles analysis for networks with k-opt = 4. 95% confidence intervals are reported in square brackets
9.1  Simulator Setup
9.2  Link distribution for the Barabási-Albert model and the Small World graph
9.3  Average hop counts for different routing methods, friendship models and levels of trust
9.4  Simulator Setup
9.5  Network properties
9.6  Number of Messages of both protocols by varying the number of nodes
List of Publications

Published Papers

1. DiDuSoNet: A P2P architecture for distributed Dunbar-based social networks. Guidi, B., Amft, T., De Salve, A., Graffi, K., and Ricci, L. Peer-to-Peer Networking and Applications, 1-18 (2015).

2. Trusted Dynamic Storage for Dunbar-Based P2P Online Social Networks. Conti, M., De Salve, A., Guidi, B., Pitto, F., and Ricci, L. In On the Move to Meaningful Internet Systems: OTM 2014 Conferences (pp. 400-417). Springer Berlin Heidelberg.

3. Epidemic Diffusion of Social Updates in Dunbar-Based DOSN. Conti, M., De Salve, A., Guidi, B., and Ricci, L. In Euro-Par 2014: Parallel Processing Workshops (pp. 311-322). Springer International Publishing.

4. Distributed protocols for ego betweenness centrality computation in DOSNs. Guidi, B., Conti, M., Passarella, A., and Ricci, L. In Pervasive Computing and Communications Workshops (PERCOM Workshops), 2014 IEEE International Conference on (pp. 539-544). IEEE.

5. HyVVE: A Voronoi-based Hybrid Architecture for Massively Multiplayer Online Games. Ricci, L., Genovali, L., and Guidi, B. In DCNET 2013, 4th International Conference on Data Communication Networking, Reykjavik, Iceland (pp. 15-23). BEST PAPER AWARD.

6. Managing Virtual Entities in MMOGs: A Voronoi-Based Approach. Ricci, L., Genovali, L., and Guidi, B. In E-Business and Telecommunications (pp. 58-73). Springer Berlin Heidelberg.

7. P2P architectures for distributed online social networks. Guidi, B. In High Performance Computing and Simulation (HPCS), 2013 International Conference on (pp. 678-681). IEEE.

8. GoDel: Delaunay overlays in P2P networks via Gossip. Baraglia, R., Dazzi, P., Guidi, B., and Ricci, L. In Peer-to-Peer Computing (P2P), 2012 IEEE 12th International Conference on (pp. 1-12). IEEE.

9. The impact of user's availability on On-line Ego Networks: a Facebook analysis. De Salve, A., Dondio, M., Guidi, B., and Ricci, L. Computer Communications, Special Issue on Online Social Networks. Elsevier (2015).

10. FRoDO: Friendly Routing over Dunbar-based Overlays. Amft, T., Guidi, B., Graffi, K., and Ricci, L. 40th IEEE Conference on Local Computer Networks (LCN), 2015, Clearwater Beach, Florida, USA.
Acknowledgement

It is hard to find the words to thank all the people who supported me during my PhD. A PhD is a beautiful experience, a sort of journey in which you can enrich yourself. First of all, I would like to thank my supervisors: Prof. Laura Ricci, who has been really important for me and without whose guidance and persistent help this dissertation would not have been possible; and Marco Conti. Both of them have given me the opportunity to realize this dream. I would like to thank Andrea De Salve, who worked with me on some aspects of this thesis. Furthermore, I would like to thank my parents, my boyfriend, and my friends and colleagues, who encouraged and advised me during my PhD. Without their continuous support the success of my PhD would not have been possible. Finally, I would like to thank myself, because I never thought I would be able to finish my PhD after the ordeal I had to face during this experience. Instead, as we know, the impossible always seems impossible until it's done!
Chapter 1

Introduction

An Online Social Network (OSN) is defined in [1] as an online platform that provides services for a user to build a public profile and to explicitly declare the connections between his/her profile and those of other users. An OSN enables a user to share information and content with selected users or to make it public; furthermore, it supports the development and usage of social applications enabling the interaction of the user with both friends and strangers.

The currently popular OSNs are centralized, which means they are based on centralized servers storing all the information of their users. This centralized structure has several drawbacks, including scalability, privacy, and dependence on a provider [2]. In particular, in recent years the rise and quick development of social networks has led to two important phenomena: the rapid spread of information and the disclosure of user privacy. Social networks have become the epicentre and main channel of individual privacy disclosure. These problems have moved researchers to investigate alternative solutions with respect to the centralized one.

Facebook is one of the most well-known OSNs and the best scenario for understanding the privacy problems arising in OSNs. Over the last years, several concerns have been expressed regarding Facebook and the usage of its users' private data. Two students from the Massachusetts Institute of Technology (MIT) used a script to download information from over 70,000 Facebook profiles from four schools (MIT, NYU, the University of Oklahoma, and Harvard University) as part of a research project on Facebook privacy published in 2005 [3].

In August 2007, a small fraction of the code used to generate Facebook web pages was made public¹. As explained in [4], a configuration problem on a Facebook server caused the PHP code to be displayed instead of the web page the code should have created. This problem raised concerns about how secure private data on the site was. A visitor copied, published, and later removed the code from his web forum, claiming he had been served with and threatened by a legal notice from Facebook.

In February 2008, a New York Times article² reported that Facebook did not provide any mechanism to close user accounts, and raised the concern that, as a consequence, private user data would remain stored on Facebook's servers. More recently, Facebook has given users the possibility to deactivate or delete their accounts. Deactivating an account does not remove user data, because the account can be restored later, while deleting it removes the account "permanently"; however, some data submitted by that account (posts or messages) will remain. Facebook's Privacy Policy now states: "When you delete an account, it is permanently deleted from Facebook." [4].

In May 2015, a report³ commissioned by the Belgian Data Protection Authority reviewed Facebook's policies and terms of use and concluded that Facebook gives users a false sense of control over their data privacy. Facebook has been tracking, on a long-term basis, users who visit any page belonging to the Facebook.com domain. Facebook has the power to link internet users' browsing habits to their real identity, social network interactions, and sensitive data, including private information such as medical conditions and religious, sexual, and political preferences. It is in a unique position compared to most other cases of so-called "third-party tracking". In particular, the report affirms that Facebook has the ability to track users' activity outside Facebook, mainly through the spread of social plugins and "Like" buttons, and through new forms of mobile tracking, and that Facebook now gathers information through these plugins regardless of whether the buttons are actually used.

¹ www.foxnews.com/story/2008/06/25/facebook-source-code-leaked-onto-internet
The report also said that Facebook's acquisition of Instagram and WhatsApp has allowed Facebook to collect more kinds of user data, which enables more detailed profiling; that it is impossible to add information on Facebook that may not later be used for targeting advertisements; and that Facebook's privacy settings were unclear about the collection and use of data by Facebook itself or by third parties such as application developers. We can also illustrate other scenarios which concern not the protection of users' data privacy, as the scenarios presented above do, but Internet freedom as a human right. Facebook has been banned in some countries, such as China. In Tunisia, the government hacked into and stole passwords from citizens'
² http://www.nytimes.com/2008/02/11/technology/11facebook.html?_r=0
³ https://www.huntonprivacyblog.com/tag/belgium/
Facebook accounts. In May 2011, Syrian activists noticed that the telecommunications ministry was tapping into Facebook activity. One could think that the problem is specific to the Facebook site; instead, other social networks, such as Twitter, have been blocked as well. Twitter was inaccessible in Egypt on 25 January 2011 during the Egyptian protests. In 2009, during the Iranian presidential election, the Iranian government blocked Twitter for fear of protests being organised. On 20 March 2014, access to Twitter was blocked in Turkey when a court ordered that "protection measures" be applied to the service.

All the scenarios presented above represent the main motivation which has led researchers to evaluate distributed platforms for implementing an Online Social Network. A Distributed Online Social Network (DOSN) [2] is an online social network implemented on a distributed information management platform, such as a network of trusted servers, a P2P system, or an opportunistic network. In recent years, DOSNs have been the subject of several works by both academic researchers and open source communities. By decentralizing OSNs, the concept of a service provider changes: there is no single provider, but a set of peers that take on and share the tasks needed to run the system. This has several consequences: in terms of privacy and operation, no central entity that decides or changes the terms of service exists. Moving from a centralized web service to a decentralized system also means that different system models become possible: using one's own storage or cloud storage, exploiting delay-tolerant social networks, or P2P networks, to name some of them. Since social data is stored on the peers, its availability depends on the online behaviour of the peers. The first big project in this area was Diaspora [5].
Diaspora was founded in 2010 by four students, who started to work on the project after being motivated by a speech given by Columbia University law professor Eben Moglen. In his speech, Moglen described centralized social networks as "spying for free", and in a New York Times interview, Salzberg, the co-founder of Diaspora, said "When you give up that data, you're giving it up forever ... The value they give us is negligible in the scale of what they are doing, and what we are giving up is all of our privacy". The group decided to address this problem by creating a distributed social network. Diaspora is the first decentralized approach based on a federation of servers. In recent years, fully decentralized solutions based on the P2P model [6] have been proposed for Online Social Networks. Decentralizing the existing functionalities of online social networks requires finding ways for distributing the storage of data, propagating updates, defining an overlay topology and a protocol enabling searching and addressing, robustness against churn, etc. We discuss these problems in more detail below. • Dynamism. In a DOSN there are two types of dynamism: a social and an
infrastructure dynamism. The social dynamism concerns social relationships, which can change due to the variation of the relations between users and of the number of users of the DOSN. This kind of dynamism is present also in centralized OSNs, but in a DOSN it has an impact on the structure of the underlying overlay. The infrastructure dynamism is related to the underlying overlay network. Nodes may join/leave the underlying network, so the corresponding user has a state that indicates its availability (online/offline). In different snapshots of the overlay, the number of available connections may change in terms of the number of active links. • Data availability/persistence. When the social network is decentralized, data should be stored at some nodes so as to be always available. The main questions are which mechanism should be used (for example, replication) and where data should be stored, for instance exclusively at nodes run by friends, or at random nodes. The amount of redundancy required to provide data availability depends to a large extent on the duration and distribution of nodes' online availability. • Scalability. Mapping a social graph onto a distributed network can be very expensive due to the number of social links of each node, so the cost of mirroring the social network links onto distributed network links can be high. It can also be very inefficient because, as discussed previously, most social links are inactive (two friends who rarely interact online). • Topology. Nodes should be connected according to their social connections in order to cluster friends in the overlay network. This should facilitate operations such as information diffusion and data storage. As a downside, it would limit the availability and robustness of data access. • Updates. This issue is related to how to deal with updates, i.e. status updates of friends.
In DOSNs, with distributed storage and replication and a potential need for scalability, the mechanisms for the diffusion of data are fundamental, and each update from a user must be diffused to all its social connections. • Privacy. Data can be stored not only on the node of the profile owner, but also on other nodes. The classical security requirements should be refined to the context of DOSNs. Most of the proposed P2P Distributed Online Social Networks focus on the privacy problem. Some approaches are peer-to-peer based, for example Safebook [7], LifeSocial [8, 9], PeerSoN [10], Cachet [11] and My3 [12]. Safebook is an architectural approach that aims to solve privacy issues by focusing on communication anonymisation. LifeSocial provides a plugin-based architecture in which user information is
stored in a DHT and is accessible from various plugin-based applications; PeerSoN exploits a two-tier system architecture for content organization and for communications, where users store their information on their local devices; Cachet relies on structured architectures where users' data can be stored and replicated on different peers. Finally, My3 uses an unstructured approach where users' data are hosted on a set of self-chosen trusted peers among their friends. When we consider a distributed structure, such as a peer-to-peer solution, it is necessary to consider other problems; for example, nodes have a high fluctuation, which can lead to data becoming unavailable or even lost. Data availability, as well as information diffusion, are important problems that have to be addressed in a DOSN. Most of the proposed systems use classic encryption techniques, such as symmetric or asymmetric encryption, to guarantee privacy and anonymity. The usage of encryption is driven mainly by the DHT, which stores user data on untrusted nodes. These techniques are useful but, in such a dynamic and large environment, they can be potentially expensive. For example, consider a system in which data are encrypted using a combination of symmetric and asymmetric cryptography: a user encrypts its data with a symmetric key, and the symmetric key then has to be encrypted with the public key of each friend. This means that each user has to generate a huge number of keys and has to manage all of them. The attribute-based technique [13] is more efficient than the others, but it is not completely suitable for DOSNs. Another important point concerns the nature of the data. In a DOSN, a profile can be split into two categories: private and public data. Public data are those that can be accessed by everyone (e.g. name and surname), whereas private data have to be accessible only by friends.
For these reasons, public data need not be encrypted, but the same cannot be said of private data. As noted above, the problem with these systems is the usage of DHTs, where private data are stored anywhere in the network and have to be encrypted to guarantee privacy. The goal of this thesis is to present a new distributed framework for Online Social Networks which is able to prevent the violation of user data by relying on trust instead of encryption. By using a friend-to-friend network for storing private data, the above encryption problem is partially resolved. Trust is widely accepted as a major component of human social relationships. In general, trust is a measure of confidence that an entity will behave in an expected manner, despite the lack of ability to monitor or control the environment in which it operates [14]. In OSNs, trust is a critical determinant of sharing information and developing new relationships ([15], [16]). Trust is also important for successful online interactions. In online systems such as eBay and Amazon, trust is based on the feedback on past interactions between members [17, 18]. In this sense, trust is relational. As two
members interact with each other frequently, their relationship strengthens, and trust evolves through positive or negative increments based on their experience. We propose DiDuSoNet (Distributed Dunbar-based Social Network), a system based on an important social concept which permits us to guarantee trust: the Dunbar circles. Dunbar [19] explains that the human brain has a cognitive limit to the number of people with whom one can maintain stable social relationships. This concept is used to build a novel Dunbar-based Social Overlay, where each node maintains connections only to trusted nodes, according to the strength of the relation. Furthermore, each node maintains only a limited number of connections, according to Dunbar's number (which is about 150). Our system is logically organized in two tiers: the Dunbar-based Social Overlay and a P2P lookup service, implemented through a DHT. Each user directly connects to its Dunbar friends, i.e. the friends with whom it has a stronger relation. This limits the number of outgoing connections of a user to at most 150. Furthermore, since most Dunbar relationships are reciprocal, this also limits the number of incoming connections. The DHT is exploited to allow the bootstrap of nodes into the system, the search for new friends, and the storage of information useful for higher-level services. The usage of a DHT reduces the level of trust provided by our system. For this reason, we propose a novel DHT called Social Pastry, which is based on Pastry and guarantees a friendly-routing functionality. Due to the constraints introduced in the routing table of Pastry to obtain Social Pastry, the efficiency of the routing is reduced. This is the main motivation that has led us to introduce a general friendly-routing algorithm, which is independent of the network overlay and can be used on top of structured and unstructured overlays.
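The pruning of a friend list down to the Dunbar friends kept in the Social Overlay can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's actual definition: the tie-strength function and all names are hypothetical, whereas the real system derives tie strength from the interactions between users.

```python
# Hypothetical sketch: keep only the Dunbar friends of a user, ranked by a
# toy tie-strength score. The scoring formula is an assumption, not the
# tie-strength definition used by DiDuSoNet.

DUNBAR_NUMBER = 150  # cognitive limit on stable social relationships

def tie_strength(interactions):
    """Toy score: simply the number of recorded interactions with a friend."""
    return len(interactions)

def dunbar_friends(friends):
    """Return at most DUNBAR_NUMBER friend ids, strongest ties first.

    `friends` maps a friend id to the list of interactions with that friend.
    """
    ranked = sorted(friends, key=lambda f: tie_strength(friends[f]), reverse=True)
    return ranked[:DUNBAR_NUMBER]
```

Capping the list at 150 entries is what bounds both the outgoing connections of a node and, since strong ties are mostly reciprocal, its incoming connections.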
DiDuSoNet provides a set of social services needed for data management in a DOSN. These services are needed to manage data availability and information diffusion. We provide two replication-based approaches for the data availability problem, which use the trust of nodes to elect storage points among a user's friends. The trust of a node is evaluated by considering its structural and temporal properties. In particular, we evaluate the user behaviour and the degree of a node to understand which nodes are suitable to be elected as storage nodes. The first approach guarantees high availability of data by keeping 2 replicas of each user's profile on its Dunbar friends, and it provides a functionality to manage the consistency of the replicas. The second approach is based on the concept of network coverage we propose, and it guarantees that a node can retrieve the profile of a friend through trusted connections, i.e. connections with common friends. Furthermore, we provide an information diffusion service by using an epidemic diffusion algorithm. We propose a new metric, called Weighted Ego Betweenness
Centrality, and we use this metric to guide the diffusion over the Dunbar-based Social Overlay. Our main goal is to guarantee the availability of data in our system. Both our approaches exploit structural and temporal properties of each node. The lack of real datasets containing temporal information about users' behaviour represents the main limitation to the research in this field. To fill this important gap, we decided to implement a Facebook application to collect useful information, such as the social graph of each user, the interactions between users, which are needed to build the Dunbar-based Social Overlay, and information about the online sessions of users. We have analysed our dataset in depth to obtain information about the structure of the social graph and the users' behaviour. Furthermore, we have obtained information about the evolution of particular characteristics of Facebook users by comparing our dataset analysis with others proposed in the literature. We have used the real datasets to evaluate our system, in particular the data availability service. The system evaluation has been conducted by testing the three major novelties proposed in this thesis. We have evaluated our friendly-routing algorithm and demonstrated that our approach provides a more efficient solution than friendly routing through a Social DHT. Furthermore, we have evaluated our data availability solution with a real dataset and demonstrated that we are able to guarantee high availability with only 2 replicas of each profile. Finally, we have evaluated our information diffusion service and the Weighted Ego Betweenness Centrality metric used in the algorithm, demonstrating the efficiency of our epidemic algorithm. To recap, the main research contributions of this thesis are: • the definition of a novel Social Overlay based on an important social concept, Dunbar's circles.
The Dunbar-based Social Overlay represents the baseline of all the social services of our system. • the proposal of a friendly routing exploiting the Dunbar-based Social Overlay. We have proposed a general algorithm which is independent of the underlying overlay and which guarantees a trust-preserving routing. • the proposal of two data availability approaches based on structural and temporal properties of the users. The first approach guarantees high availability with a fixed number of replicas and provides a mechanism to guarantee the consistency of the replicas. The second approach provides an original replication scheme based on network coverage. • the definition of a new centrality metric, the Weighted Ego Betweenness Centrality (WEBC), which is used to guide the diffusion over the Dunbar-based Social Overlay. We propose two distributed protocols for the WEBC computation and an epidemic protocol for the information diffusion. • the evaluation of the system with an up-to-date real dataset. We have implemented a Facebook application to retrieve real data and we have studied the dataset to validate our assumptions.
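The trust-based election of storage points among a user's online Dunbar friends can be made concrete with a small sketch. The scoring function, its weights, and all names below are assumptions for illustration only: the thesis combines structural (degree) and temporal (online behaviour) properties of nodes, but not necessarily in this form.

```python
# Illustrative sketch of electing two storage points among online Dunbar
# friends. The score is a hypothetical weighted combination of availability
# (temporal property) and degree (structural property); the real election
# criteria are defined in Chapter 6.

def storage_score(availability, degree, w_avail=0.7, w_degree=0.3):
    """availability in [0, 1]; degree is the friend's number of connections.

    The degree contribution is bounded by the Dunbar number (150) so that
    availability, not raw popularity, dominates the election.
    """
    return w_avail * availability + w_degree * min(degree, 150) / 150

def elect_storage_points(online_friends, n_replicas=2):
    """Pick the n_replicas highest-scoring online Dunbar friends.

    `online_friends` maps a friend id to an (availability, degree) pair.
    """
    ranked = sorted(
        online_friends,
        key=lambda f: storage_score(*online_friends[f]),
        reverse=True,
    )
    return ranked[:n_replicas]
```

With this toy scoring, a friend that is online 80% of the time outranks a very high-degree friend that is rarely online, mirroring the intuition that replicas are useful only when their holders are reachable.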
1.1 Outline of the Thesis
The thesis is conceptually organized in four separate parts:
Part I: Basic Concepts, in which we introduce the reader to the world of Distributed Online Social Networks and Complex Network Analysis, laying the basis for understanding the following parts. In particular: • Chapter 2 reviews the current state of Distributed Online Social Networks. We describe the current proposals and the most important challenges, such as data availability and information diffusion. • Chapter 3 describes the properties of Complex Networks and the important metrics used in our work.
Part II: A P2P Dunbar-based Distributed Online Social Network, in which we introduce our Dunbar-based P2P Distributed Online Social Network and the social services provided by the system. In particular: • Chapter 4 introduces DiDuSoNet by showing the system model and, in particular, our novel Social Overlay based on Dunbar's concept. • Chapter 5 describes an improvement proposed for the Pastry DHT, which we use in our system. We propose Social Pastry, in which the routing table is built according to social links. We introduce a general algorithm which can be used over any kind of overlay (structured or unstructured) and which provides a friendly routing. • Chapter 6 describes the data availability management service provided by our system. We propose two approaches based on the concept of trust. • Chapter 7 describes the information diffusion management service provided by our system. We propose a diffusion algorithm based on the Ego Betweenness Centrality.
Part III: System Evaluation, in which we show our experimental results and a detailed analysis of the two real Facebook datasets used. In particular: • Chapter 8 provides a detailed analysis of a dataset related to a Facebook Regional Network. Furthermore, we show and analyse the structural and temporal properties of our Facebook dataset obtained through SocialCircles!, our Facebook application. • Chapter 9 describes the evaluation of all our novel approaches, in detail: the evaluation of Social Pastry and our goLLuM algorithm, and the evaluation of our data management services, data availability and information diffusion.
Part IV: Conclusions, which concludes the thesis. In particular: • Chapter 10 presents the conclusions of the thesis and the possible improvements planned as future work. • Appendix A presents an overview of SocialCircles!, the Facebook application.
Part I: Basic Concepts
Chapter 2
Background and Related Works
In this chapter we review the main background notions related to the research field of Distributed Online Social Networks. We give an overview of P2P systems, of current DOSNs, and of the main challenges for the development of a DOSN, such as data availability and information diffusion.
2.1 P2P Systems
A P2P network is a distributed network composed of a large number of distributed, heterogeneous, autonomous, and highly dynamic peers, in which participants share a part of their own resources, such as processing power, storage capacity, software, and file contents [20]. The participants of a P2P network can act as server and client at the same time. They are accessible by other nodes directly via the logical overlay links, without passing through intermediary entities, and this represents one of the most important advantages of using a P2P system at the application layer. P2P systems build and maintain overlay networks at the application layer, assuming the presence of an underlying network layer which assures connectivity among any pair of nodes; communication takes place through a self-organizing and fault-tolerant network topology. A P2P system has particular properties [21], such as: • High Degree of Decentralization. Based on the degree of decentralization, P2P systems are classified into two categories: hybrid systems and purely decentralized systems. In a purely decentralized P2P system there is no central entity: peers implement both client and server functionality, and most of the system's state and tasks are dynamically allocated among the peers.
• Self-organization. Self-organization is required for reliability and to manage the continuous joining and leaving of peers, which is handled in a distributed manner. When a node joins the network, it builds connections with other nodes to become an active participant. When a node leaves, it is removed from the network and the overlay is updated to keep all the other nodes connected. Furthermore, when a node fails without warning, the connected nodes need to update their routing and neighbouring connections. • Scalability. This concept refers to the ability of a system to continuously evolve in order to support a growing number of peers. This means that P2P systems are able to maintain the system's performance independently of the number of nodes or documents in the network. • Reliability. This concept denotes the ability of the network to deliver its services even when one or several nodes fail. • Churn. Churn represents the situation where a number of nodes join, leave, and fail in a continuous and rapid way.
Figure 2.1: An Abstract P2P Overlay Network Architecture
Figure 2.1 shows an abstract P2P architecture as presented in [22]. The Network Communications layer describes the network characteristics of the devices (sensor-based or computers) which are connected over the Internet (or wirelessly). The Overlay Nodes Management layer is responsible for the management of peers in the P2P network; it includes the discovery of peers and routing algorithms. The Features Management layer covers the security, reliability, fault-resiliency, and aggregated resource availability aspects of maintaining the robustness of P2P systems. The Services Specific layer supports the underlying P2P infrastructure and the application components through scheduling of parallel and
computation-intensive tasks, and content and file management. The Application-level layer is concerned with tools, applications, and services that are implemented on top of the underlying P2P overlay infrastructure. As said before, P2P systems can be classified as centralized or purely decentralized. In centralized systems, such as Napster [23], the index which supports the retrieval of information is centralized, but data are exchanged directly between peers. A basic problem of this approach is that, in case of a failure of the node that stores all the information about the network, the whole network fails as well. Moreover, the system has poor performance when the number of nodes increases. Not fully centralized P2P systems have a dedicated controller node which is responsible for maintaining the set of participating nodes and controls the system. Instead, in a purely decentralized P2P system, there are no nodes with particular responsibilities that represent a critical point for the operation of the system. Such systems have no inherent bottleneck and can potentially be resilient to failures, attacks, and legal challenges. In some purely decentralized P2P systems, nodes with plenty of resources and high availability act as super-nodes. These super-nodes have additional responsibilities, such as acting as a rendezvous point for nodes behind firewalls, storing state, or keeping an index of available content. This kind of P2P system is generally referred to as a hybrid P2P system. P2P systems maintain an overlay network, a layer on top of the physical network connections, which provides transparent services to handle the virtual topology of the network. Overlay networks offer services for the communication and connection handling between the peers in the network, including peers that are not directly connected. Furthermore, there is an additional distinction of P2P systems between structured and unstructured ones.
In an unstructured P2P system [22], there are no constraints on the links between different nodes, and therefore the overlay graph does not have any particular structure, so nodes can occupy any position within the overlay network. In a typical unstructured system, a new node joins the network and establishes its initial connections in a random way, starting from a bootstrap node. As explained in [24], in a structured overlay each node has a unique identifier in a large numeric key space. Identifiers are chosen in a way that makes them uniformly distributed in that space. The overlay graph has a specific structure, and the identifier determines the position of a peer within the structure and constrains its set of overlay links.
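As a concrete illustration of how structured overlays obtain uniformly distributed identifiers, the following sketch derives a node ID from its network address with a collision-resistant hash, here SHA-1 over a 160-bit key space. The function name and address format are illustrative assumptions.

```python
import hashlib

# Sketch: deriving a node's overlay identifier from its address via SHA-1.
# A good hash spreads identifiers nearly uniformly over the key space, which
# is what structured overlays rely on for load balancing.

def node_id(address: str, bits: int = 160) -> int:
    digest = hashlib.sha1(address.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)
```

Two distinct addresses, such as "10.0.0.1:4000" and "10.0.0.2:4000", map to far-apart, unpredictable positions in the identifier space, while the same address always maps to the same position.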
2.1.1 The P2P look-up service: DHT
A Distributed Hash Table (DHT) provides a global view of the data distributed among the nodes of a network, independent of its actual location. As stated in [25], a Distributed Hash Table manages data by distributing it across a number of nodes and implementing a routing scheme which allows efficient lookup of the node on which a specific data item is located. In contrast to flooding-based searches in unstructured systems, each node in a DHT becomes responsible for a particular range of data items. Also, each node stores a partial view of the whole distributed system, which effectively distributes the routing information. Based on this information, the routing procedure typically traverses several nodes, getting closer to the destination with each hop, until the destination node is reached. As reported in [25], DHTs have the following characteristics: • each node has a partial view of the whole system, managing only a small set of links to other nodes. Generally, these are O(log N) references, where N denotes the number of nodes in the system. • by mapping nodes and data items into a common address space, routing to a node leads to the data items for which that node is responsible. • queries are routed via a small number of nodes to the target node. Because of the small set of references each node manages, a data item can be located by routing via O(log N) hops. • by distributing the identifiers of nodes and data items nearly equally throughout the system, the load of retrieving items should be balanced equally among all nodes. • because no node plays a distinct role within the system, the formation of hot spots or bottlenecks can be avoided. • the departure or deliberate removal of a node should have no considerable effect on the functionality of the DHT. Therefore, DHTs are considered to be very robust against random failures and attacks. • if a data item is stored in the system, the DHT guarantees that the data is found. The following table (figure 2.2), proposed in [26], compares the main characteristics of three well-known approaches in terms of complexity, vulnerability, and query ability.
With respect to communication overhead, per-node state maintenance, and resilience, the DHT shows the best performance, provided that complex queries are not vital. DHTs follow a proactive strategy for data retrieval. In comparison, routing in unstructured systems is not related to the location of specific data items but only reflects the connections between nodes.
Figure 2.2: Comparison of central server, flooding search, and distributed indexing.
In a DHT system, each data item has an identifier (ID), a unique value from the address space. This value can be chosen freely by the application, but it is often derived from the data itself via a collision-resistant hash function, such as SHA-1. Each document index is expressed as a (K, V) pair: the key K can be the hash of the file name (or of a description of the document), and the value V is the IP address of the node storing the actual file (or a description of other nodes). The generic operations are: the put function, which accepts an identifier and arbitrary data and is used to store the data (on the node responsible for the ID); and, symmetrically, the get function, which retrieves the data associated with a specified identifier (figure 2.3).
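The put/get interface of figure 2.3 can be sketched in a few lines. This toy version keeps the whole table in one process, so it illustrates only the interface, not the distributed routing; the class and method layout are illustrative.

```python
import hashlib

# Minimal in-process sketch of a DHT's put/get interface. In a real DHT the
# key would be routed to the responsible node; here a single dict stands in
# for the whole network.

class ToyDHT:
    def __init__(self):
        self._store = {}

    @staticmethod
    def key(name: str) -> int:
        # IDs are commonly derived with a collision-resistant hash (e.g. SHA-1).
        return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")

    def put(self, name: str, value):
        """Store `value` under the ID derived from `name`."""
        self._store[self.key(name)] = value

    def get(self, name: str):
        """Retrieve the value stored under the ID derived from `name`."""
        return self._store.get(self.key(name))
```

For instance, putting the pair ("song.mp3", "10.0.0.7") and then getting "song.mp3" returns the stored address, regardless of which node (in a real deployment) ended up responsible for the hashed key.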
Figure 2.3: Interface of a DHT with a simple put/get interface.
Over the years, several DHT variants have been proposed. In the following, we introduce Pastry, which is used in our system.
Pastry
Pastry [24] is a structured P2P overlay network which uses prefix routing to build a self-organizing decentralized overlay network. Each node in the Pastry network has a unique numeric identifier (NodeID). The NodeID gives the node a position in a circular NodeID space which ranges from 0 to 2^128 − 1, and the set of NodeIDs is assumed to be uniformly distributed in this 128-bit space. The NodeID is assigned at random when a peer joins the system. For a network of
N peers, Pastry routes to the peer numerically closest to a given key in less than log_{2^b} N steps under normal operation (where b is a configuration parameter with a typical value of 4). NodeIDs and keys are considered as sequences of digits with base 2^b. In each routing step, a node normally forwards the message to a peer whose NodeID shares with the key a prefix that is at least one digit longer than the prefix the key shares with the current peer's NodeID. If no such node is known, the message is forwarded to a node whose NodeID shares a prefix with the key as long as the current node's, but is numerically closer to the key than the current node. Each Pastry peer maintains a routing table, a Neighbourhood Set, and a Leaf Set. The routing table is organized in log_{2^b} N rows with 2^b − 1 entries each. The 2^b − 1 entries at row n of the routing table refer to peers whose NodeIDs share the current peer's NodeID in the first n digits, but whose (n + 1)-th digit has one of the 2^b − 1 possible values other than the (n + 1)-th digit of the current peer's NodeID. The choice of N and b determines the size of the routing table: a larger b increases the routing table size but reduces the number of hops. Each entry in the routing table contains the IP address of a peer whose NodeID has the appropriate prefix, chosen according to the proximity metric. If no node is known with a suitable NodeID, the routing table entry is left empty. The choice of b involves a trade-off between the size of the populated portion of the routing table (approximately log_{2^b} N × (2^b − 1) entries) and the maximum number of hops required to route between any pair of peers (log_{2^b} N). Figure 2.4 shows the routing table of a hypothetical Pastry node whose ID is 10233102. The entries in row j refer to nodes whose IDs share the NodeID 10233102 only in the first j digits.
Figure 2.4: State of a hypothetical Pastry node with NodeID 10233102, b = 2. The top row is row zero. The NodeID is coloured with three different colours to indicate the common prefix with 10233102 (blue colour), next digit (green colour), and rest of the NodeID (red colour)
The Neighbourhood Set maintains information about nodes that are close together in terms of network locality. It is not normally used in routing messages, but it is useful for maintaining locality properties. Finally, the Leaf Set L is the set of nodes containing the |L|/2 numerically closest larger NodeIDs and the |L|/2 numerically closest smaller NodeIDs. It is used during message routing, as explained below. The primary goal of the routing algorithm is to quickly locate the node responsible for a particular key. Pastry routing works as follows: 1. Given a message with a particular key K, the node first checks whether K falls within the range of NodeIDs covered by its Leaf Set. If so, the message is forwarded directly to the destination node, namely the node in the leaf set whose NodeID is closest to the key. 2. If the key is not covered by the leaf set, then the node checks the routing table and the message is forwarded to a node that shares a common prefix with the key longer by at least one more digit. 3. If the appropriate entry in the routing table is empty or the associated node is unreachable, then the message is forwarded to a node that shares a prefix with the key at least as long as the current node's, and is numerically closer to the key than the current node's ID. As explained in [24], this routing procedure always converges, because each step takes the message to a node that either (1) shares a longer prefix with the key than the current node, or (2) shares a prefix as long as the current node's, but is numerically closer to the key.
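The three routing steps above can be sketched as follows, assuming b = 2 (so NodeIDs and keys are equal-length base-4 digit strings) and taking the leaf-set and routing-table contents as given. Helper names and the data layout are illustrative assumptions, not Pastry's actual implementation.

```python
# Illustrative sketch of Pastry's per-hop forwarding decision (steps 1-3
# above), with b = 2, i.e. base-4 digit strings of equal length.

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits shared by two IDs."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def numeric(nid: str) -> int:
    return int(nid, 4)  # interpret a base-4 digit string as a number

def next_hop(key: str, node_id: str, leaf_set: set, routing_table: dict):
    """Return the next node to forward to, or None if no progress is possible.

    `routing_table` maps (row, digit) -> NodeID: the entry for nodes sharing
    `row` digits with us whose next digit is `digit`.
    """
    # 1. Key within the leaf set's range: deliver to the numerically closest
    #    leaf (for equal-length digit strings, lexicographic == numeric order).
    if leaf_set and min(leaf_set) <= key <= max(leaf_set):
        return min(leaf_set, key=lambda n: abs(numeric(n) - numeric(key)))
    # 2. Routing-table entry that extends the shared prefix by one digit.
    p = shared_prefix_len(key, node_id)
    if p < len(key):
        entry = routing_table.get((p, key[p]))
        if entry is not None:
            return entry
    # 3. Fallback: any known node sharing a prefix at least as long that is
    #    numerically closer to the key than the current node.
    known = leaf_set | set(routing_table.values())
    candidates = [n for n in known
                  if shared_prefix_len(key, n) >= p
                  and abs(numeric(n) - numeric(key)) < abs(numeric(node_id) - numeric(key))]
    return min(candidates, key=lambda n: abs(numeric(n) - numeric(key))) if candidates else None
```

For example, at node 10233102 a key such as 10323102 shares only the prefix "10", so step 2 selects the routing-table entry for row 2 and digit "3", whereas a key inside the leaf set's range is handed straight to the closest leaf.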
Node Departure and Failure
Nodes in Pastry may fail or depart without warning. The routing state is maintained by periodically exchanging keep-alive messages among neighbouring nodes. The failure of a node is detected when its neighbours in the NodeID space can no longer communicate with it. When this occurs, it is necessary to repair the leaf sets of all its neighbours: a neighbour contacts the live node with the largest index on the side of the failed node and asks that node for its leaf table. Another important aspect of Pastry's routing is locality. Pastry's notion of network proximity is based on a scalar proximity metric, such as the number of IP routing hops or the geographic distance. It is assumed that each node can determine the distance between itself and a node with a given IP address; a node with a lower distance value is assumed to be more desirable.
2.2 Distributed Online Social Networks
A Distributed Online Social Network (DOSN) [2] is an Online Social Network implemented on a distributed information management platform. The authors of [2] propose a reference architecture for a DOSN, which consists of six layers (Figure 2.5) and provides an architectural abstraction of a variety of current approaches to decentralized social networking in the research literature. The lowest layer of this architecture is the physical communication network, which can be the Internet or another physical communication network. The distributed P2P overlay management layer provides core functionalities to manage resources in the infrastructure of the system, which can be a distributed network of trusted servers or a P2P overlay. Specifically, this layer provides services for looking up resources, routing, and retrieving information reliably and effectively among the nodes of the overlay. On top of this overlay is the decentralized data management layer, which implements distributed functionalities such as querying, inserting, and updating the various persistent objects of the system. The social networking layer implements all the basic functionalities and features provided by centralized social networking services. The most important of these functionalities are given in Fig. 2.5, namely the capability to search the system for relevant information (Distributed search), the management of users and shared space (User account and shared space management), the management of security and access control issues (Trust management, Access control and security), and the coordination and management of social applications developed by third parties (Application management). The top layer of the architecture includes the user interface to the system and the various applications built on top of the development platform provided by the DOSN. The decentralization of an OSN may be done at different granularities.
Proposed approaches can be divided into the following categories: • Federation of servers, which requires that social network providers agree upon standards of operation in a collective fashion. Federated social networks are not real P2P systems, but they enable users to share their social contents with friends from other OSNs. An example of this kind of DOSN is Diaspora [5]. • OSNs over unstructured P2P overlays, which have users' personal data distributed among multiple peers [27], [28], [29], and [30]. • OSNs over structured P2P overlays, which utilize a DHT approach or a social overlay [7, 10, 12, 31, 32].
Figure 2.5: General DOSN's Architecture
Decentralizing the existing functionalities of Online Social Networks requires finding ways of distributing the storage of data, propagating updates, and defining a topology and a protocol that enable search and addressing, robustness against churn, etc. In the following, we explain in detail the challenges that we have faced in this thesis: the data availability problem, the information diffusion problem, and the privacy problem.
2.3 Data management in DOSNs
Data management can be seen as a layer over the P2P infrastructure, which includes part of the Social Network Support Layer, in particular services such as information and update diffusion and the privacy and security mechanisms, and the distributed or P2P storage system, which can be referred to as the data availability service. The three most important services provided by the data management layer are shown in Figure 2.7. In this section we present the current approaches proposed to manage these issues. In order to explain the data management issue, we can classify the current proposals according to two solutions, which emerge as dominant for DOSNs in terms of topology, as shown in Figure 2.6. In the first solution, we consider all systems in which the DOSN topology is integrated into the P2P overlay. The second solution, shown in Figure 2.6(b), is a two-tier structure in which data are stored in a separate layer. The system has two levels: the first level is the P2P infrastructure, which is used for routing and search functions; the second level is the Social Topology, or Social Overlay, as it is called in [33].
Figure 2.6: Distributed Online Social Network architecture: (a) Single Tier; (b) Two Tiers
Figure 2.7: Data Management Services
A Social Overlay is a logical overlay in which peers are connected to known peers, as explained in [33]. An edge between a pair of nodes indicates that a tie exists between the two adjacent nodes.
2.3.1 Data Availability and Persistence
One of the main challenges introduced by decentralization is guaranteeing the availability of data when the owner of the data is not online. Without a central point of storage, data are distributed among all the nodes in the network; when a node leaves the network, its data should remain available. As a matter of fact, the data of a user become unavailable as soon as the user disconnects from the social network, even if its host is not shut down. It is important to notice that, in the following, we will use the terms user availability and host availability interchangeably. In the context of DOSNs, we consider the availability of a user's data, called content, which can be seen as the digital representation of a user: it is stored on a computing device and can be transmitted from one device to another. Two types of data availability are generally considered: Pure Availability and Friend Availability [34] (Figure 2.8). The Pure Availability measures the fraction of time a piece of a user's data is available for other users. However, in a DOSN, the people interested in the data of a given user are primarily its friends. So the Friend Availability, which measures the fraction of time a user's data is
available when its friends are online, is a good metric for this scenario.
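As a toy illustration of the two metrics, the following sketch computes both from session traces. The session representation (half-open hour intervals) and the hourly granularity are assumptions for illustration, not taken from [34].

```python
# Toy computation of Pure vs Friend Availability from session traces.
# Sessions are half-open (start, end) hour intervals; names are illustrative.
def covered(slots, sessions):
    """The set of time slots covered by at least one session."""
    return {t for t in slots for (s, e) in sessions if s <= t < e}

def pure_availability(data_sessions, horizon=24):
    """Fraction of the observation horizon during which the data is reachable at all."""
    return len(covered(range(horizon), data_sessions)) / horizon

def friend_availability(data_sessions, friend_sessions, horizon=24):
    """Fraction of the time some friend is online during which the data is reachable."""
    friend_online = covered(range(horizon), [s for f in friend_sessions for s in f])
    if not friend_online:
        return 1.0
    data_online = covered(range(horizon), data_sessions)
    return len(friend_online & data_online) / len(friend_online)
```

For example, data online from hour 0 to hour 6 gives a pure availability of 0.25 over a day, but a friend availability of 0.5 if the friends are only online between hours 0 and 12, illustrating why friend availability is the more meaningful metric here.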
Figure 2.8: Availability in DOSNs
The main technique used to solve the problem of content availability in DOSNs is replication [35]. Replication is a well-known technique in distributed systems, based on storing the same data on different storage devices. Several proposals use a DHT for storing data, as explained in Section 2.6. By using a DHT, data are always available, but a generic user's profile is stored on unknown nodes. Instead, by using a social overlay, data can be stored on specific peers chosen by the user itself, so users retain more control over their data. When a social overlay is exploited, storage nodes are chosen by exploiting different characteristics, such as temporal information. In particular, one of the well-known solutions is to store data only on friend nodes. [36] investigates the limitations that storing data only on friends imposes on data availability: the authors show that the problem of obtaining maximal availability while minimizing redundancy is NP-complete, and they propose greedy data placement heuristics to improve the data availability. In [37], Erasure Codes (ECs) are proposed to manage data availability; erasure coding has been proven to be more efficient than replication in terms of redundancy.
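The redundancy trade-off between full replication and erasure coding can be illustrated with a toy model assuming independent node availability with probability p. This is a standard back-of-the-envelope comparison, not the analysis of [37].

```python
from math import comb

def replication_availability(p, n):
    """n full replicas, each on a node independently online with probability p."""
    return 1 - (1 - p) ** n

def ec_availability(p, n, k):
    """(n, k) erasure code: any k of the n fragments suffice to reconstruct the data,
    so the data is available iff at least k fragment holders are online."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Both configurations below use the same 2x storage overhead:
# 2 full replicas vs an (8, 4) erasure code spread over 8 nodes.
repl = replication_availability(0.5, 2)   # 0.75
ec = ec_availability(0.5, 8, 4)
```

With highly available nodes, the erasure code overtakes plain replication at equal storage cost, which is the efficiency argument mentioned above; with very unreliable nodes the picture can reverse.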
Replica selection policies
One of the most important goals in a replication-based data management system is deciding where data should be stored. To manage this challenge, most of the current data availability management proposals provide replica selection policies. In [38], three different replication policies are studied. The first one increases the availability by choosing, as replica locations, the user's friends which maximize the availability of the user profile. The second approach prioritizes the most active friends
for placing the replicas, where active means that the user is mostly online; in the third approach, the storage nodes are randomly chosen. In [34], three further policies are proposed, evaluated and compared: • No replication: only the user provides its own data. • Direct replication only: the data is made available by the user and its friends, which receive a copy of the data directly from the user itself. • Indirect replication: friends of a user can collaborate with other friends. Indirect replication increases availability but reduces trustworthiness. The work presented in [39] introduces My3, which mainly focuses on the distributed storage layer. In this work, several replication strategies are introduced: • Minimizing the number of replicas: this approach aims to minimize the storage and replica management overhead. • Minimizing the update propagation delay: the algorithm minimizes the update propagation delay, i.e. the time between the instant an update occurs on a user profile at one of the replicas and the instant the update reaches all the other replicas; • Minimizing the access cost: the approach minimizes the cost incurred in accessing a user's profile. It assigns the nearest trusted node connected to the user in the online time graph; • Maximizing the replication gain: the approach quantifies the replication gain of a subset of trusted nodes and explores the entire solution space to pick the set with the minimum effective cost. In [40], an efficient replica selection policy is proposed to select the set of storage nodes. It considers three aspects: • online time, which represents the average online probability; • social relation, which is currently an absolute measure of either being a friend or not; • user experience, which is computed with regard to the suitability of past storage nodes to act as storage nodes again.
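A selection policy combining the three aspects of [40] could be sketched as below. The linear scoring, the weights, and the dictionary-based candidate representation are illustrative assumptions of ours, not the actual formula of [40].

```python
# Hedged sketch of a replica-selection score in the spirit of [40]: combine
# online time, social relation and past "user experience" as a storage node.
# Weights and the linear combination are illustrative assumptions.
def storage_score(candidate, w_online=0.5, w_social=0.3, w_exp=0.2):
    return (w_online * candidate["online_prob"]          # average online probability
            + w_social * (1.0 if candidate["is_friend"] else 0.0)
            + w_exp * candidate["past_success"])         # past performance as a storage node

def select_replicas(candidates, k):
    """Pick the k best-scoring friend nodes as storage nodes
    (only friends are eligible, reflecting trusted storage)."""
    friends = [c for c in candidates if c["is_friend"]]
    return sorted(friends, key=storage_score, reverse=True)[:k]
```

Restricting the candidate set to friends reflects the trusted-storage assumption shared by the social-overlay approaches discussed in this section.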
User Behaviour Analysis and Availability Prediction
Studies concerning the prediction of user availability in P2P systems based on past behaviour are much rarer than those on generic availability. Several distributed storage systems have been proposed for classical file-sharing applications, but OSNs are different in terms of user behaviour. In [41], the time spent online by users of Bebo, MySpace, Netlog and Tagged has been studied through statistical analysis, and the authors discovered that a Weibull distribution accurately models the behaviour of 80% of users; session lengths and the number of sessions follow power-law distributions. In [42], the Orkut, MySpace, LinkedIn and Hi5 OSNs have been studied. The authors concluded that session lengths follow a heavy-tailed distribution, while inter-arrival times follow a Log-normal distribution. In [43], user availability in MySpace has been studied. The authors show that user availability is not only dependent on the time of day, but also correlated with the presence of their friends on the platform. This last work is an interesting study for the data persistence problem, but the lack of a dataset with both temporal and structural information is the actual limitation in this research field. As a matter of fact, the online patterns of OSN users are more discontinuous than in traditional decentralized applications [41]. Most DOSN proposals do not address the problem of data availability, either assuming a stable system or relying on external storage. However, in recent years the data availability problem in DOSNs has become more popular and the replication approach is widely used. In [43], the purpose is to find peers who behave according to a given availability pattern.
In particular, nodes seek partners for two different types of problems, known as disconnection matching (a peer looking for a partner who will disconnect at about the same time) and presence matching (a peer looking for a partner who will be online with it in the future). For both of these purposes it is necessary to predict the future behaviour of users based on their past behaviour. The authors propose to use a simple binary predictor that predicts the online presence of a peer in a time window of 10 minutes: the prediction is based on the number of days in the past week in which the peer was online in the same time window. If the peer was online in the same time slot on at least 5 days out of 7 of the last week, then it is estimated that the peer will be online again. The authors use a real dataset collected from eDonkey to verify the effectiveness of the predictors. The dataset contains about 12 million peers, whose predictability is not very high. However, the authors extract a reduced set of nodes with predictable behaviour, filtered according to a set of parameters. This set contains only 19600 peers, but knowing the online traces of even a limited number of peers makes it possible to improve the performance of different applications. In [38] an empirical study of the various system properties of DOSNs and of the parameters that influence them
is presented. The authors explain that one of the most important parameters for providing availability is the online time of a user. Considering the replication of the profile of a user u on a set of trusted friends Ru, and letting OTu be the online time period of a user u, the profile of u is accessible by an arbitrary user v only if ∃j ∈ Ru such that OTv ∩ OTj ≠ ∅. They model the online times based on user activities in three different ways: • Sporadic, which assumes that a user is online several times a day, sporadically (20-minute session length); • Continuous-Fixed Length, where all the users in the network are assumed to be online, each day of the week, during a continuous time window of a fixed length; • Continuous-Random Length, where each user randomly chooses the length of its own online time window from the range [2, 8] hours. The Sporadic model is considered the most realistic. The authors explain that another important parameter is the replication degree: the higher the replication degree, the higher the level of potential exposure of personal information to others.
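The binary presence predictor of [43], described earlier in this subsection, can be sketched as follows. The history representation (a mapping from day to the set of 10-minute slot indices in which the peer was online) is an assumption of ours.

```python
# Sketch of the binary presence predictor described above: a peer is predicted
# online in a given 10-minute slot if it was online in that same slot on at
# least 5 of the last 7 days. `history` maps day -> set of slot indices.
def predict_online(history, slot, threshold=5, days=7):
    hits = sum(1 for day in range(days) if slot in history.get(day, set()))
    return hits >= threshold
```

The appeal of this predictor is that it needs only a week of per-slot presence bits per peer, which is cheap to maintain in a decentralized setting.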
2.3.2 Information diffusion in DOSNs
Social information systems have seen a rapid growth in recent years, reaching a huge amount of contents and interactions. The task of disseminating information plays a key role in these systems because they are seen as an infrastructure for sharing information. Given the huge amount of content produced, such systems must enable the dissemination of information between users and improve it through personalized content dissemination on specific or popular topics. In a profile-based system, requests for a user's profile are made directly by the other nodes in the network. Instead, when we consider profile update propagation (i.e. the news feed of Facebook), it is important to send updates by considering the topology and, possibly, the privacy policies. Data dissemination techniques have been developed in many networking fields, such as sensor networks, mobile networks, and distributed networks. The existing methods include gossip protocols, epidemic routing, probabilistic routing based on prediction, distributed caching, and Content Distribution Networks. OSN users produce a huge amount of information, and the propagation of this information to its destination has to be well coordinated so as to reduce information overload, duplication and latency, and to ensure quality. Dissemination of updates can be seen as a publish/subscribe (pub/sub) problem wherein users are publishers and their friends subscribers.
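The pub/sub view of update dissemination can be sketched minimally as below. This is purely illustrative: class and attribute names are ours, and a real DOSN would route updates through the overlay (e.g. via rendezvous nodes or gossip) rather than by direct delivery.

```python
# Minimal sketch of the pub/sub view of a news feed: a user publishes an
# update and its friends (the subscribers) receive it. Illustrative only.
class PubSubUser:
    def __init__(self, name):
        self.name = name
        self.subscribers = []      # in a DOSN, the user's friends
        self.feed = []             # received updates (the "news feed")

    def befriend(self, other):
        # friendship is mutual: each subscribes to the other's updates
        self.subscribers.append(other)
        other.subscribers.append(self)

    def publish(self, update):
        for friend in self.subscribers:
            friend.feed.append((self.name, update))
```

The rest of this section is concerned precisely with how this direct delivery step is realized over structured (rendezvous-based) and unstructured (gossip-based) overlays.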
Usually, information dissemination in structured distributed systems takes place through pub/sub methods, which rely on rendezvous nodes to bring information from a publisher to its subscribers; however, rendezvous nodes may become hotspots of the system, acting as bottlenecks for application performance and raising scalability issues. Pub/sub methods on structured distributed systems are more sensitive to the dynamism of the network because the structure increases both the complexity and the overhead needed for joining the system. Gossip-based methods are used for spreading information in both unstructured and structured distributed systems. These methods are well suited to unstructured systems because they are less susceptible to user churn and more robust to failures. Gossip-based methods for information diffusion ensure, with high probability, that a node interested in a content eventually gets that content. The reliability of these methods depends on the parameters of the epidemic algorithms; usually it is possible to tune these parameters to achieve different levels of reliability despite failures and a dynamic network topology. Furthermore, unstructured distributed systems mitigate the overhead needed for joining the system because the absence of structure reduces both complexity and susceptibility to dynamism. Figure 2.9 summarizes this comparison between structured, unstructured, and hybrid systems in terms of diffusion mechanism (pub/sub vs. gossip), hotspots, robustness, scalability, susceptibility to churn and congestion, and reliability.
Figure 2.9: Information diffusion in structured and unstructured DOSNs.
The authors of [44] define an overlay-independent pub/sub system for social networks. Nodes use attenuated Bloom filters to create an area of attraction (black-hole) for messages moving around them. These black-holes allow both to attract
the messages addressed to the same group membership and to filter messages. Probabilistic message routing to group members is performed using parallel random walks. Negative information attached to each message is used to avoid self-loops on nearby nodes. In [33], an efficient gossip-based update dissemination protocol for social networks is proposed. This protocol rests on three key principles: the use of message histories, an anti-centrality selection heuristic, and fragmentation awareness. To propagate the updates, two gossip protocols are used: a rumour mongering protocol and an anti-entropy push-pull protocol. The rumour mongering protocol is based on a push exchange strategy and is used to disseminate updates quickly. This fast but unreliable dissemination is therefore complemented by an anti-entropy push-pull protocol. This second protocol runs in the background, at a slower pace than rumour mongering, and guarantees that all nodes that become and remain online for long enough eventually receive all updates. In each gossip round, a node u selects a node v from its neighbourhood uniformly at random. Then it sends all, or part, of its list of hot rumours (updates) to v and collects a response vector from v, which tells about the rumours that v already knew and the rumours it did not. For each item in the response vector, if the rumour was not known to v, then nothing is done; otherwise, the rumour is removed from the hot rumour list with probability p. Node v, in turn, adds to its hot rumour list the new rumours received from u. GoDisco [45] disseminates information in communities of online social networks, using exclusively social links and exploiting semantic context. Nodes inform their neighbours about their interests and keep track of the behaviour of their neighbours. Nodes can communicate directly only with other nodes with whom they have a social relation. As an extension of this work, in [46] the authors propose GoDisco++.
The novelty of this extension is the exploration of a multi-dimensional social network, whose semantics can be exploited to achieve better dissemination characteristics. Social triads are exploited to avoid duplication among nodes: the essential idea is to avoid forwarding a message to the common neighbours of the node. Furthermore, the approach uses feedback derived locally by nodes to drive a new dissemination based on the experience of previous disseminations. In [47], a selective propagation approach for social data in decentralized online social networks is proposed. It takes into account specific interactions between users by considering an area of a specified interest. For the propagation of social data belonging to a specific interest, the strength of the relationship between users is computed, and it is increased or decreased according to the interactions between the users. In [39], an update dissemination approach is proposed. This approach guarantees that, after every update, the concerned replica pushes the update to the other
members during its online period. Updates on a profile are pushed immediately by a replica to all other replicas. When a replica comes online, it announces itself to all other online replicas and pulls any buffered updates. When concurrent events are detected on an object, the two replicas have to decide on the ordering of the events. In [48] and [49], a particular caching approach called Social Caching is presented. The selection of social caches is social-relationship driven: social caches are selected to cache updates only for friends, to ensure security requirements, and to allow only one-hop communication. Social Butterfly [48] selects certain nodes as Social Caches for bridging social update delivery between producers and consumers, thereby reducing the cost of the update diffusion. Social caches are special nodes which act as "local servers", while the remaining nodes are assigned as members of one or more social caches to form Social Clusters. A member node only contacts the social caches it is associated with, while a social cache can be a member of multiple social clusters. Every node in the social network can be both a social consumer and a producer. Producers push social updates to the social caches they are associated with, and consumers fetch data from the corresponding social caches when they want to learn the updates from their friends. This solution is not a fully distributed technique because it requires knowledge of the entire social graph topology. SocialCDN [49] is the latest version of Social Butterfly, in which the distributed social cache selection mechanism does not require global knowledge of the entire social graph. Within the context of SocialCDN, four distributed cache selection algorithms are proposed and analysed: the Randomized algorithm, the Triads Elimination algorithm, the Span Elimination algorithm, and the Social Score algorithm. Dissemination of the social updates among friends needs O(n²) network connections, where n is the number of friends.
Like Social Butterfly, SocialCDN uses social caches to reduce the total number of connections necessary for social update dissemination. The goal is to minimize the number of social caches using fully distributed algorithms, and the problem is close to the Neighbour-Dominating problem. In [50], a random walk approach is proposed for update management in unstructured DOSNs. A random walk is performed by choosing a single neighbour at random to forward the queries until the desired item is found or the search is cancelled. Performing multiple simultaneous random walks decreases the time required to find existing data while maintaining a limit on the overhead due to network traffic. The authors use checking to control the duration of the random walks: peers receiving a query send an acknowledgement message to the peer that initiated it. When a query finds a result using a random walk, that query has been forwarded by one or more peers, including the source of the query. Using path replication, the result is forwarded back through the chain of peers to the originating peer, and
each peer in the chain caches the data before forwarding it. These caches form a path on the network overlay graph. The main concept of this algorithm is the replication of data along the path followed by a successful random walk and, in addition, the replication of paths to improve the performance of random walks. In fact, when a node visits a peer with relevant information in its path cache, the next peer to visit is taken from the cache instead of being chosen at random. Each data item is associated with a master peer that manages updates for that item. It determines the order of updates and records a strictly increasing version number on successive updates. While other peers may initiate an update, it is always applied at the master copy, and the updated version is propagated to the other peers. New versions are pushed along the edges of the directed graph produced by following the child links in the caches. Peers that contain a copy of the data in their data cache update their copy and forward the update message to their children.
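The rumour-mongering push round of [33], described earlier in this section, can be sketched as follows. The data structures (per-node sets of hot and known rumours) are illustrative assumptions, not the protocol's actual implementation.

```python
import random

# Sketch of one rumour-mongering round in the spirit of [33]: push hot rumours
# to a uniformly random neighbour; drop a rumour with probability p when the
# neighbour already knew it. Data structures are illustrative assumptions.
def gossip_round(node, neighbours, hot, known, p=0.5, rng=random):
    v = rng.choice(neighbours)
    response = {r: (r in known[v]) for r in hot[node]}   # v reports what it already knew
    known[v].update(hot[node])
    hot[v].update(r for r, knew in response.items() if not knew)
    for r, knew in response.items():
        if knew and rng.random() < p:
            hot[node].discard(r)                          # lose interest in stale rumours
```

Removing rumours probabilistically once neighbours already know them is what keeps the push phase fast but unreliable, which is why [33] complements it with the slower anti-entropy push-pull protocol.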
2.4 Security and Privacy
The proposed DOSNs effectively address the main privacy concern over users' data that exists in centralized OSNs, because the personal information of the users is no longer under the full control of the OSN providers. Hence, users do not depend on an external OSN service provider to maintain their data. While the absence of a single control point cuts out the most powerful privacy breaches, the decentralization of the OSN service raises new privacy and security challenges that were previously addressed by the central authority. We focus our attention on security and privacy issues in DOSNs and organize them into the following categories: • Privacy breaches: in this category fall all the attacks that try to strike at users' privacy by exploiting the huge amount of users' data available in the system. • Viral Marketing: this category includes spamming and phishing attacks that use the OSN services to disseminate unsolicited messages or malicious software trying to obtain users' confidential information. • Network attacks: this category includes attacks exploiting the network structure that hosts the OSN. In particular, we focus our attention on privacy breaches, given the critical task of data management in a DOSN.
2.5 Privacy breaches
Users of OSNs produce a huge amount of contents and personal information that they would like to share only with their direct friends. However, centralized OSN architectures require that users trust the service providers to store and protect all their personal information.
2.5.1 Privacy breaches from centralized service providers
Users of centralized OSNs do not have control of their data, and the service providers can obviously benefit from examining or sharing this information (for advertising or for spying purposes [51]). While the absence of a single control point effectively addresses the main privacy concern over users' data that exists in centralized OSNs, the decentralization of the OSN also removes some of the privacy protection provided by the central authority. As a matter of fact, centralized OSNs allow a user's friends to access the personal information of the user while ensuring its integrity and authenticity. However, the friendship relation in OSNs is merely a social link that two users have agreed to establish, regardless of the actual offline relationship.
2.5.2 Privacy breaches from other users
There are several security requirements that a DOSN should ensure to its users:
Confidentiality The primary security requirement in a DOSN is confidentiality. Users' data must be protected from unauthorized access: only those users who are explicitly authorized by the content owner can read it. Confidentiality is not limited to data but extends to communications between users, because no party other than the directly addressed ones may have the possibility to trace which parties are communicating.
Anonymity The identity of users and the type of relationship between them should not be inferable from DOSNs. Hence, users' identities must be unique and anonymous, and their actions must remain hidden from untrusted third parties. For instance, Safebook [52] provides a Trusted Identification Service (TIS) that assures each user at most one unambiguous identifier at every level of the DOSN.
Integrity Users' identities and their data must be protected against unauthorized modification and tampering. Decentralized OSNs must ensure that the contents posted by users' friends are uncorrupted.
Authentication Authentication has to assure the existence of real persons behind the registered OSN members.
Currently existing DOSNs try to address these privacy concerns by coupling a distributed approach with encryption techniques. Typically, the data of OSNs, such as status updates and photos, include small contents belonging to several friends (such as comments, likes and tags). To achieve fine-grained access control, each data item should be encrypted separately for different sets of recipients. As a consequence, the design of current DOSNs may suffer from performance issues that arise due to the fragmented structure of the data. Fetching and decrypting the objects belonging to friends are required in order to view the entire content of a data item but, with a large number of objects, these operations might be quite time consuming. The choice of the cryptographic mechanisms does not only involve performance and privacy, but also the structural design of the supported social contents. Since asymmetric cryptography is much more computationally intensive than symmetric cryptography, encryption systems based solely on asymmetric cryptography are not used in most of the described solutions. The authors of [53] analysed the existing security mechanisms for DOSNs by comparing them in terms of efficiency, functionality and privacy. Most of the current DOSNs are based on the conjunction of symmetric and asymmetric cryptography, which is much more efficient since the object itself is encrypted using a symmetric cipher, and then this symmetric key is encrypted multiple times with the public key of each of the receivers. However, if the number of users who can see the content is very large, the overhead connected with encrypting the same key multiple times is quite significant, both in terms of time and space, because encryption for a group means that data is first encrypted with a symmetric key and then this symmetric key is encrypted with the public key of each member [53].
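The structure of this hybrid scheme — one symmetric encryption of the content plus one key-wrapping per recipient — can be sketched as below. The XOR "cipher" is a deliberately insecure stand-in used only to show the structure; a real system would use, e.g., AES for the content and RSA or ABE for the key wrapping. All names are illustrative.

```python
import secrets, hashlib

# Structural sketch of the hybrid scheme described above. The toy XOR stream
# "cipher" below is NOT secure; it only illustrates the cost structure:
# one encryption of the content, plus one key-wrapping per recipient.
def _keystream(key, n):
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def toy_encrypt(key, data):
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))

toy_decrypt = toy_encrypt   # an XOR stream cipher is its own inverse

def publish(content, recipient_keys):
    k = secrets.token_bytes(32)                 # one fresh symmetric content key
    ciphertext = toy_encrypt(k, content)        # content encrypted once
    wrapped = {r: toy_encrypt(rk, k)            # k wrapped once per recipient
               for r, rk in recipient_keys.items()}
    return ciphertext, wrapped

def read(ciphertext, wrapped, me, my_key):
    k = toy_decrypt(my_key, wrapped[me])        # unwrap the content key
    return toy_decrypt(k, ciphertext)           # then decrypt the content
```

Note how the per-recipient cost lives entirely in `wrapped`: this is exactly the overhead [53] points out when the recipient group is very large.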
Beyond the security mechanisms offered by the most popular DOSNs, there is also a large collection of works that seek to improve the existing security mechanisms in a specific direction. Gunnar Kreitz et al. [54] focus on the problem of cryptographic primitives that hide the user's data but reveal the access policies, and they introduce predicate encryption (PE) [55], like ABE, in order to hide the user's data without revealing the access policies. A user's profile is defined as a set of multiple objects encrypted for different users. Additionally, a Bloom filter is used to store the users who can decipher the objects. Jain et al. [56] analyse Safebook's security features and suggest some improvements to the system. The authors propose to use the Onion Routing [57] technique in order to provide anonymity to the users. A simple onion routing technique involves encrypting the data with the public keys of the traversed routers in reverse order: as the data passes through a router, the router decrypts the outermost layer of the data. Moreover, threshold cryptography is used in order to protect the user's data. A. Datta et al. [58] use a threshold-based scheme to address the problem of backup and recovery of the user's private key in a network of untrusted servers. To improve the security of the secret sharing protocol, they propose a mechanism to select the most trustworthy delegates based on the social relationships among users. The authors in [59] focus on the problem of the re-identification of users from a social network, even if the victim's identity is preserved using anonymization techniques: the adversary knows the exact 1-neighbourhood of the target node, and the anonymization algorithm attempts to make this 1-neighbourhood isomorphic to k-1 other 1-neighbourhoods via edge addition. Fu Y. and Wang Y. [60] propose a privacy-preserving common friend estimation scheme that estimates the set of common friends without the need for cryptographic techniques. The authors assume that each user is assigned a unique, variable-length identifier. Bloom filters are used to represent the set of a user's friends. The estimation of the common friends of two users is computed using the intersection of their Bloom filters, which is computed by one of their common friends. Finally, encrypting the content is not enough to hide all sensitive information from attackers, because they could infer sensitive personal information from the properties of the contents (such as size, structure, packet headers, etc.). The authors in [61] identify the privacy problems that arise when the metadata of the contents are exposed (see Figure 2.10).
They classify the problems arising from inference on metadata into three categories: (i) inferences from the stored data can reveal the size, structure and modification history of the related content; (ii) inferences from access control mechanisms (such as the replacement of an encryption key) might allow conclusions about a user's social events; (iii) inferences from the metadata's communication flows (such as direct connections between users or requests for content sharing) allow inferring information about usage patterns, interests, or the IP addresses of the users. Finally, the authors list countermeasures to mitigate the described metadata attacks. However, no comprehensive solution covers all the problems highlighted.
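The Bloom-filter common-friend estimation of [60], described above, can be sketched as follows. The filter size, the number of hash functions, and the SHA-256-based hashing are illustrative choices of ours, not the parameters of [60].

```python
import hashlib

# Sketch of Bloom-filter common-friend estimation: each user encodes its friend
# list in a Bloom filter; a common friend intersects the two filters (bitwise
# AND) to estimate the overlap without any cryptography. False positives are
# possible, as in any Bloom filter; false negatives are not.
M, K = 256, 3   # filter size in bits, number of hash functions (assumed values)

def _positions(item):
    for i in range(K):
        h = int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:4], "big")
        yield h % M

def bloom(friends):
    bits = 0
    for f in friends:
        for pos in _positions(f):
            bits |= 1 << pos
    return bits

def might_contain(bits, item):
    return all((bits >> pos) & 1 for pos in _positions(item))

def estimate_common(bits_a, bits_b, candidates):
    common = bits_a & bits_b          # intersection of the two filters
    return [c for c in candidates if might_contain(common, c)]
```

The estimate can over-count (a candidate whose bits all collide with set bits appears as a false positive), which is the price paid for avoiding cryptographic set intersection.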
2.6 Existing Approaches
In the current research field of DOSNs, a common choice is a structured P2P overlay based on a DHT. The DHT approach is useful for providing privacy and security services but, at the same time, it has several drawbacks, such as its management in case of high network dynamicity.
CHAPTER 2. BACKGROUND AND RELATED WORKS
Figure 2.10: Privacy violations through contents' metadata.

Diaspora

The first commercial DOSN is Diaspora [5], a social network which is currently used by about 200,000 users. The system exploits a network of independent, federated Diaspora servers that are administrated by individual users, who allow other Diaspora users' profiles to be hosted on their servers. Some users choose to run a Diaspora server in order to keep more control over their data, while others choose to use an existing server. A typical Diaspora server gives its administrator read and write access to the unencrypted information of the hosted users. A user organizes its contacts into groups, named aspects (similar to the circles of Google+). A user's social content (such as a post or a comment) is sent to all the aspect's members. A user has three different ways to join Diaspora: it can join a closed Diaspora server after receiving an invitation, it can join an open Diaspora server, or it can create its own Diaspora server that stores its data. A user's data in Diaspora may be protected at three levels of security: unencrypted, encrypted by the server for some intended receivers, or encrypted by the owner itself for some intended receivers.

SafeBook

Safebook [7] is a DOSN based on two design principles: decentralization and the exploitation of real-life trust. The system integrates several privacy and security mechanisms in order to provide data storage and data management functions that preserve data integrity, privacy and availability. Safebook consists of a three-tier architecture mapped on three different logical levels (Fig. 2.11): a Social Network Layer based on a user-centred structure called Matryoshka; the Social Networking Service (SNS), which is based on Kademlia [62] and provides the infrastructure, managed by the provider, to implement data storage and retrieval, indexing of the content, access permissions on the data, and node join and leave; and the Communication and Transport (CT) level, which is provided by the Internet infrastructure. The SNS provides each member with a set of social network features such as accessing profiles, commenting, liking or finding a friend. It is implemented using a user-centred Social Network Layer which consists of Matryoshkas: concentric rings of nodes connected to their center, the core node, through radial paths based on real-life trust relationships. The innermost shell of a Matryoshka is composed of direct contacts of the core (mirrors), which provide both encrypted storage and retrieval of the core's data. The nodes in the outermost shell act as gateways (entry points) for the core's data. Every request to the core's data can be routed, recursively, from the outermost shell to the core and vice versa, providing communication obfuscation.
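The recursive routing through the Matryoshka shells can be sketched as follows. This is a toy illustration under simplifying assumptions, not Safebook's actual protocol: node names, the `paths` map and the function are hypothetical, and encryption is omitted.

```python
# Toy Matryoshka radial path: "entry1" sits on the outermost shell
# (entry point), "mirror1" on the innermost shell (a direct contact of
# the core), and routing proceeds hop by hop towards the core.

def route_request(core, paths, entry_point, request):
    """Forward a request along the radial path starting at entry_point.

    `paths` maps each node to its trusted next hop towards the core;
    since each node only knows its own next hop, outsiders contacting
    the entry point cannot tell which node is the core.
    """
    hops = [entry_point]
    node = entry_point
    while node != core:
        node = paths[node]   # each node forwards to its trusted next hop
        hops.append(node)
    return hops, f"{core} answers: {request}"

# one radial path of a 3-shell Matryoshka around the core "alice"
paths = {"entry1": "mid1", "mid1": "mirror1", "mirror1": "alice"}
hops, reply = route_request("alice", paths, "entry1", "get profile")
print(hops)   # ['entry1', 'mid1', 'mirror1', 'alice']
```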
Figure 2.11: SafeBook Architecture
PeerSoN

Buchegger et al. [10] propose a two-tier P2P system to preserve user privacy, through encryption of user data, and to allow exchanges between users even when there is no Internet connection. The first tier, implemented with OpenDHT, provides a look-up service that stores all the information needed to find the users and their data (IP address, files, etc.). When a friend is unreachable, all messages and notifications are stored in the DHT and retrieved when the node is reachable again. The second tier consists of peers and contains the user data. In this tier, users connect directly to each other in order to exchange messages and notifications. The system provides confidentiality, integrity and authentication by assuming the availability of a public-key infrastructure (PKI) with the possibility of key revocation. User data are encrypted with the public keys of the users who have
access to it. Encryption is the main mechanism to ensure both privacy and access control on user data. The direct exchange of social data is implemented using the real-life part of the social network: users can carry data for each other and spread information through the physical social network, or delay uploading data until someone has online connectivity.

SuperNova

In [63] a super-peer architecture for DOSNs is proposed. The architecture of SuperNova is based on two principal entities: storekeepers and super-peers. The former are users who have agreed to keep a replica of another user's data and provide data access control based on different policies defined by the data's owner (Public, Private or Protected). The latter are nodes providing different types of services (storage, recommendation of storekeepers, recommendation of friends, etc.). Initially, when a new user joins, it may not have enough friends for storing its data, so it relies on super-peers to keep its data available. Super-peers may delegate the store-keeping task to other nodes. After this initial phase, the new node tries to find suitable storekeepers either among its friends or among strangers suggested by super-peers, until it has reached a good availability score. The user's data at the super-peer and at the storekeepers are encrypted so that nobody else can read them. A reputation system allows the evaluation of a super-peer based on feedback from the users that use it. Finally, SuperNova enables the management of communities, whether based on common interests or not.
LifeSocial.KOM

Graffi et al. [8], [9] propose a completely distributed P2P-based Online Social Network, which provides a security infrastructure enabling secure communication and access control on the stored data. In terms of social networking functionalities, LifeSocial provides, as a basis, the same functionality as a centralized online social network. Furthermore, the system is an extensible plugin-based platform. The core network layer is a structured P2P overlay called FreePastry, which is based on Pastry and provides the functionality for ID-based routing among peers. FreePastry [64] provides a reliable storage component with integrated replication of the data, called PAST [65]. With FreePastry and PAST, objects can be stored and retrieved from the network based on their ID. In LifeSocial.KOM any data item is first encrypted with a new symmetric key; then the symmetric key is encrypted with the public key of each user able to decrypt the data and appended to the user's content. The list of encrypted symmetric keys, as well as the object itself, is signed by the author and stored in the overlay. Data
objects in LifeSocial.KOM contain either final information (e.g. a photo, a profile or a status entry) or additional links to further objects (e.g. a photo album of the user). Final objects can be directly retrieved and instantly presented. Objects with additional links, instead, result in a distributed linked list which can be traversed recursively. Users interested in the content may retrieve it from the network, validate the signature using the public key of the author and, if allowed, decrypt the symmetric key and thus the content of the object. Figure 2.12 shows an example of the security mechanism described above.
Figure 2.12: Control access in LifeSocial.KOM.
Vis-à-Vis

Vis-à-Vis [31] is a scheme for DOSNs that targets high content availability. It introduces the concept of a Virtual Individual Server (VIS), where each user's data are stored on a personal virtual machine. The Vis-à-Vis architecture is hierarchical: it is based on a two-tier DHT structure composed of a set of highly available VISs. The top-tier DHT, called the Meta Group, is used to advertise and search for public OSN groups. The lower tier corresponds to OSN groups and is maintained by the VISs of the group members. Users access location-based groups through clients such as stand-alone mobile applications and web browsers. Each group supports a membership service responsible for implementing admission policies and for maintaining pointers to the group's location tree. Each group acts as a leaf node in the location tree. There is a structured overlay network per group, and each overlay represents a group. The same VIS can belong to multiple overlay networks, just as one person can belong to multiple social groups.
My3

My3 [12] is a privacy-friendly DOSN which exploits well-known properties of online social networks, for instance the locality of users. The system allows users to exercise fine-grained access control on their content, and exploits the trust relationships among friends to improve the availability of data in the network. A user's profile is hosted only on a set of self-chosen trusted nodes (TPS). This set is populated with respect to availability and performance goals, in particular low access and consistency costs and high data availability. Members of My3 leverage their mutual trust relationships to enforce access control on access requests, in place of encryption-based access control, which typically involves encrypting many data objects. The system uses a DHT for storing the privacy-preserving index of the profile content and other meta-information. A user u and its set of trusted nodes (TPS) are stored in the DHT as a (key, value) pair, the key being UId_u and the value being the members of the TPS. This mapping is used for contacting the nodes where the profile of a particular user is stored. The user trusts these nodes both for storing its profile content and for enforcing access control on access requests. The selection of the trusted nodes (TPS) and their churn dynamics are handled according to the geographical location of nodes.
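My3's index and trust-based access control can be sketched as follows. This is a minimal illustration under simplifying assumptions: a plain dict stands in for the DHT, and all names are hypothetical.

```python
# The DHT maps a user id to the members of its trusted node set (TPS);
# a trusted node enforces the owner's access policy in place of
# encryption, since the profile is stored in clear on trusted peers.

dht = {}                                   # stand-in for the DHT

def publish_tps(uid, tps):
    dht[uid] = list(tps)                   # DHT put(UId_u, TPS members)

def lookup_hosts(uid):
    return dht.get(uid, [])                # DHT get: who hosts u's profile?

class TrustedNode:
    def __init__(self, profile, allowed):
        self.profile = profile             # stored in clear: node is trusted
        self.allowed = set(allowed)        # owner-defined access policy

    def serve(self, requester):
        # access control enforced by the trusted node, not by encryption
        if requester not in self.allowed:
            raise PermissionError(requester)
        return self.profile

publish_tps("alice", ["bob", "carol"])     # alice's self-chosen trusted nodes
host = TrustedNode({"name": "Alice"}, allowed=["bob", "dave"])
print(lookup_hosts("alice"))               # ['bob', 'carol']
print(host.serve("bob"))                   # {'name': 'Alice'}
```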
Cachet

Cachet [11] is an architecture that provides security and privacy guarantees, protecting the confidentiality, integrity and availability of the user content. Cachet provides a distributed pool of nodes to store user personal data, and these nodes are untrusted. The system uses a distributed hash table as the base storage layer, together with a gossip-based social caching algorithm. Data is stored as an object in the DHT using objID as the DHT key. In addition to the standard get and put operations, the DHT also supports an append operation. In Cachet, user data are protected by DECENT [66]: a cryptographic hybrid structure, based on EASiER [67], that ensures confidentiality and integrity without revealing policies. Data in Cachet are stored in container objects that include content, such as status updates and photos, as well as references to other containers; authorized contacts can add comments or other annotations to containers. The structure of the data includes two components: cryptographic capabilities used by the storage nodes to authenticate update requests, and attribute-based encryption (ABE) [67, 68] used to provide flexible and fine-grained access policies. The system allows two kinds of policies to be defined: identity-based policies, which define user-specific access, and attribute-based (AB) policies, which define access for a
group of social contacts sharing some content. Each content item is protected by three permissions (read, write and append) defined by the owner using policies. Each content item is encrypted with a randomly chosen symmetric encryption key K, which is, in turn, encrypted under an attribute-based encryption policy (ABE(K, P)). Only users that meet the policy P can recover the key K used to encrypt the content. Authentication of a write request is achieved with a write-policy signature key (SPK) and a random public and private key pair (WAPK and WASK) used to sign and authenticate requests. Figure 2.13 shows an example object structure.
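The structure of a Cachet container object can be sketched as follows. The field names are illustrative, not Cachet's wire format: the point is that the content is encrypted with a random symmetric key K, K is protected with ABE under a policy P, and annotations are appended by reference.

```python
from dataclasses import dataclass, field

@dataclass
class Container:
    obj_id: str            # DHT key
    enc_content: bytes     # E_K(content): symmetric encryption of the content
    abe_wrapped_key: bytes # ABE(K, P): K recoverable only if attributes meet P
    policy: str            # attribute-based policy, e.g. "friend AND colleague"
    write_sig: bytes       # signature authenticating write/update requests
    references: list = field(default_factory=list)   # links to other containers
    annotations: list = field(default_factory=list)  # appended comment obj_ids

wall_post = Container("obj42", b"...", b"...", "friend", b"...")
comment = Container("obj43", b"...", b"...", "friend", b"...")
wall_post.annotations.append(comment.obj_id)  # the DHT append operation
print(wall_post.annotations)   # ['obj43']
```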
Figure 2.13: Example Cachet objects.
LotusNet

LotusNet [69] is a P2P-based DOSN which uses strong authentication of peers at the overlay level to provide security and stability to social applications, and fine-grained access control to private resources. The system does not impose a very high, fixed privacy level, but allows users to tune the trade-off between privacy and services. The architecture of the system is based on a DHT; in particular it uses Likir [70], a customized version of Kademlia. Likir enhances the Kademlia protocol with strong identity management at the overlay level, and requires users to fulfil a preliminary registration procedure in order to provide a certified identifier for each DHT node.

GemStone

GemStone [40] is a P2P social network which guarantees data availability and security. Private data are stored among a set of other nodes called Data Holding Agents (DHAs). An important characteristic of Gemstone is the synchronization
of social data among all participating social applications and other devices of a user. The overlay is formed by using a DHT, as shown in Figure 2.14. Gemstone encrypts all users' data using ABE. Data is stored in the form of encrypted user profiles and messages to users, which can contain arbitrary contents.
Figure 2.14: Gemstone: system architecture.

Confidentiality is a key element in Gemstone. The system allows each user to grant fine-grained access to its confidential data: data cannot be accessed by any entity other than those holding the corresponding decryption key. Data management is strongly dependent on the choice of topology. The table shown in Figure 2.15 provides a schematic review of the currently proposed P2P-based DOSNs.
2.7
Analysis of existing DOSNs
Existing DOSNs are mainly focused on privacy and security issues. Most of them use encryption to preserve privacy: data are encrypted and then decrypted by the receivers. However, encryption may be computationally expensive if we consider that the receivers, which are friend nodes, can always access the private data; encryption adds overhead in both the time and space domains. Furthermore, several problems arise from the distributed architecture and need to be considered. For example, data availability is essential to the management of the social network services: without a good storage point, data could be unavailable for a long time, damaging the framework's functionality. Due to the dynamics of the information and of the relationships (and/or interactions) that characterize OSNs, most of the proposed DOSNs rely on persistent logical nodes such as external storage servers. Even when data availability is managed by the proposed DOSNs, as in My3 and Cachet, the lack of real data about users' temporal behaviour represents the major limitation for verifying the correctness of these approaches.
Figure 2.15: Current proposed P2P-based DOSNs.

The main challenge of this thesis is to define a Distributed Dunbar-based Online Social Network based on the notion of trust among nodes. By using Dunbar's approach (explained in Chapter 3), we are able to build a P2P Social Overlay which connects only trusted nodes. We exploit the Dunbar-based Social Overlay principally to manage the problem of data availability. Furthermore, we introduce an approach to manage information diffusion.
Chapter 3

Complex Networks Analysis for DOSNs

Complex networks arise in a myriad of fields, ranging from web pages and their links to protein-protein interaction networks and social networks. The modelling and mining of these large-scale, self-organizing systems is a hard task which involves many disciplines. Studies have observed a number of common properties in complex networks, such as power-law degree distributions and the small world property. With the current popularity of OSNs such as Facebook, LinkedIn and Twitter, and of P2P systems, there is an increasing interest in their measurement and modelling. Social networks are a typical example of complex networks. Unlike other complex networks, models for OSNs are relatively new and less well known. Models may help detect and classify communities, and better clarify how news and gossip are spread in social networks. In this chapter, we provide an overview of Complex Network Analysis. We explain in detail what a complex network is, along with important properties such as degree distribution, the small world property, clustering and centrality measures.
3.1
Complex Networks
The study of complex networks plays an increasingly important role in science. The structure of such networks affects their performance, and one main feature of complex networks is that they are large. Formally, a complex network can be represented as a graph. Let us introduce some basic notions. Definition 3.1. An undirected (directed) graph G = (N, L) consists of two sets N and L, such that N ≠ ∅ and L is a set of unordered (ordered) pairs of elements of
N. The elements of N = {n_1, n_2, ..., n_N} are the nodes (or vertices, or points) of the graph G, while the elements of L = {l_1, l_2, ..., l_K} are its links (or edges, or lines). A node is usually referred to by its order i in the set N. In an undirected graph, each link is defined by a couple of nodes i and j and is denoted as (i, j) or l_{i,j}. The link is said to be incident in nodes i and j, or to join the two nodes. Two nodes joined by a link are referred to as adjacent or neighbours. In a directed graph, the order of the two nodes is important: l_{i,j} stands for a link from i to j, and l_{i,j} ≠ l_{j,i}. A weighted (or valued) graph G_W = (N, L, W) consists of a set N = {n_1, n_2, ..., n_N} of nodes (or vertices, or points), a set L = {l_1, l_2, ..., l_K} of links (or edges, or lines), and a set of values (weights) W = {w_1, w_2, ..., w_K}, real numbers attached to the links. Examples of an undirected graph, a directed graph and a weighted graph are shown in Figure 3.1.
Figure 3.1: Graphical representation of an undirected (a), a directed (b), and a weighted undirected (c) graph with N = 7 nodes and K = 14 links.

OSNs are a typical example of a complex network: millions of nodes are present, with billions of interconnections running among them. In this section we introduce some of the main concepts and measurements involved in complex network analysis.
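The three graph flavours of Definition 3.1 can be represented with plain adjacency lists, for example:

```python
from collections import defaultdict

def add_undirected(adj, i, j):
    adj[i].add(j)
    adj[j].add(i)      # (i, j) is unordered: stored in both directions

def add_directed(adj, i, j):
    adj[i].add(j)      # l_{i,j} != l_{j,i}: only i -> j is stored

def add_weighted(adj, i, j, w):
    adj[i][j] = w      # the weight w is attached to the link
    adj[j][i] = w

g = defaultdict(set)
add_undirected(g, 1, 2)
add_undirected(g, 2, 3)
print(sorted(g[2]))    # [1, 3]: the nodes adjacent to node 2
```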
3.2
Small World
The small world concept describes, in simple terms, the fact that despite their often large size, in most networks there is a relatively short path between any two nodes. The distance between two nodes is defined as the number of edges along the shortest path connecting them [71]. A very famous experiment performed in the 1960s by the American sociologist Stanley Milgram [72] aimed at investigating social network properties. In this experiment around 200 people were randomly selected in the states of Nebraska and Kansas and asked to deliver a letter to a stock market broker living in Boston, knowing only his name. People could forward the letter
only to one of their directly known contacts, choosing the one with the highest probability of knowing the recipient of the letter. This experiment is also known as "six degrees of separation", because the result showed that at most 6 hops (letter passings) were needed to reach the destination. The small world property represents the fact that even a big network typically exposes a very low network diameter, where the diameter is defined as the maximum distance between two nodes in the network. As a consequence, the average shortest path between two random vertices is very small, hence the name small world effect. Formally, if we consider an undirected graph (meaning that edges have no associated direction), the average shortest path can be defined by the following formula:

\ell = \frac{1}{\frac{1}{2} n(n-1)} \sum_{i>j} d_{ij} \qquad (3.1)
where n is the number of nodes and d_{ij} is the minimum path length between vertices i and j. A network presents the small world effect if ℓ grows logarithmically (or sub-logarithmically) with the network size. The study of this property is of particular interest, because it is closely related to the speed of information diffusion or search in a network. It is known that social networks present small world properties [73]. As a particular example, we cite a study that demonstrates how the Facebook diameter keeps shrinking: in 2008 Facebook exposed an average node separation of 5.28 hops, a value which became 4.74 in 2011 [74].
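Equation 3.1 can be computed directly with one breadth-first search per node, for example:

```python
from collections import deque

def bfs_distances(adj, src):
    # single-source shortest-path distances on an unweighted graph
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def average_shortest_path(adj):
    # equation (3.1): mean of d_ij over all unordered pairs i > j;
    # summing over all ordered pairs counts each pair twice, which
    # cancels the 1/2 in the normalization
    nodes = list(adj)
    n = len(nodes)
    total = sum(d for u in nodes for d in bfs_distances(adj, u).values())
    return total / (n * (n - 1))

# small ring of 6 nodes: pair distances are 1, 2 or 3
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(average_shortest_path(ring))   # 1.8
```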
3.3
Clustering
The clustering coefficient is used to estimate how the neighbours of a node v are connected to each other [75]. A very common trait of networks is to expose groups of nodes forming communities (clusters), meaning that these nodes are highly connected intra-community and weakly connected inter-community. The clustering coefficient assesses the quality of the community partitioning. Although many different definitions of the clustering coefficient exist, we refer to the one by Watts and Strogatz in [76]. From a single-vertex perspective, the clustering coefficient gives a local aggregation evaluation: it measures how many edges exist between v's neighbours with respect to the maximum theoretical number of edges between them. Formally, if we define N(v) as the set of v's neighbours and n_v = |N(v)| as its cardinality, the maximum number of edges between v's neighbours is given by:

\binom{n_v}{2} = \frac{1}{2} n_v (n_v - 1) \qquad (3.2)
Given an undirected graph G = (V, E), v ∈ V and m_v = |E(G(N(v)))| (the number of edges in the subgraph induced by the set of nodes N(v)), we can define the local clustering coefficient cc(v) of node v as:

cc(v) = \begin{cases} \frac{m_v}{\binom{n_v}{2}} = \frac{2 m_v}{n_v (n_v - 1)} & \text{if } \delta(v) > 1 \\ \text{undefined} & \text{otherwise} \end{cases} \qquad (3.3)

where δ(v) is the degree (number of incident edges) of node v. Similarly, given a directed graph, the local clustering coefficient of node v is:

cc(v) = \begin{cases} \frac{m_v}{2 \binom{n_v}{2}} = \frac{m_v}{n_v (n_v - 1)} & \text{if } \delta(v) > 1 \\ \text{undefined} & \text{otherwise} \end{cases} \qquad (3.4)

where δ(v) = δ_{in}(v) + δ_{out}(v). The global clustering coefficient CC(G) of a network can be defined as the average of all local clustering coefficients, considering only the nodes where cc(v) is defined. Formally:

CC(G) = \frac{1}{|V^*|} \sum_{v \in V^*} cc(v) \qquad (3.5)
where V^* = \{v ∈ V \mid δ(v) > 1\}. An alternative definition of the global clustering coefficient is given by [75] and called network transitivity:

\tau(G) = \frac{n_\Delta(G)}{n_\Lambda(G)} \qquad (3.6)
which is the ratio between the total number of distinct triangles of the graph, n_Δ(G), and the total number of triples, n_Λ(G). In the case of OSNs, the global clustering coefficient is also expressed through the network density:

\rho(G) = \frac{|E|}{\binom{n}{2}} \qquad (3.7)
The global clustering coefficient is a value in the range [0, 1]: a value close to 0 describes a weakly clustered network, with many edges connecting far nodes, whereas a value close to 1 denotes a network with many edges between close nodes.
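Equations 3.3 and 3.5 can be computed directly on an adjacency-list graph, for example:

```python
def local_cc(adj, v):
    # equation (3.3): fraction of existing links among v's neighbours
    nbrs = adj[v]
    nv = len(nbrs)
    if nv <= 1:
        return None                      # undefined for degree <= 1
    # each edge among neighbours is seen from both endpoints, hence // 2
    mv = sum(1 for u in nbrs for w in adj[u] if w in nbrs) // 2
    return 2 * mv / (nv * (nv - 1))

def global_cc(adj):
    # equation (3.5): average over the nodes where cc(v) is defined
    vals = [c for v in adj if (c := local_cc(adj, v)) is not None]
    return sum(vals) / len(vals)

# triangle 0-1-2 plus a pendant node 3 attached to node 2
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_cc(adj, 0))   # 1.0: both neighbours of 0 are linked
print(local_cc(adj, 2))   # 1/3: one of the three possible links exists
```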
3.4
Degree distribution, scale-free network model and network resilience
A very interesting measure is the vertex degree, the number of edges incident to a node. This information helps us understand the structure of the network: it
can be composed of nodes of similar degree, creating a regular structure, or some nodes with a degree higher than the average value may be present. In the latter case, these nodes are called hubs: they are key nodes in the network, being connections towards many others and thus of strategic importance. The removal of hub nodes from the network may severely "damage" the structure, splitting the original graph into separate components. Many studies on the node degree distribution of real networks have been performed. Remarkable is the work of Barabási and Albert [77], who showed that the distribution is a power law for many complex networks, meaning that there are few high-degree and many low-degree nodes. This fact shows that links in a network are not distributed uniformly, making the classical random graph model proposed by Erdős and Rényi [78] not suitable to model complex networks in general, and social networks in particular. Networks which present a power-law distribution of the node degrees are known as scale-free networks: the distribution of node degrees is independent of the network size. A confirmation that social networks present a power-law distribution is found in [79] and [73]. Furthermore, it can be interesting to know how nodes connect to each other: this phenomenon is known as assortative mixing [75], and two important alternatives have been observed [80]:
• Nodes with high degree are connected to high degree nodes
• Nodes with high degree are connected to low degree nodes
To measure this, the Pearson's correlation index is often used. This index measures the linear correlation (dependence) between two variables x and y. With respect to a sample of size n, the Pearson's correlation index r is defined as:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (3.8)

where \bar{x} and \bar{y} represent the mean values of the variables x_i and y_i.
The r index ranges from -1 to 1: a positive value (r > 0) means that an increment of the x variable is accompanied by an increment of the y variable. Conversely, for r < 0 an increment of x corresponds to a decrement of y. A value of r close to 0 means that no significant linear dependence between x and y is observed. In the context of network analysis, the Pearson's index becomes a measure of degree correlation [81]. In particular, the degree correlation shows important properties [82]: a high positive correlation value characterizes networks where high-degree nodes are connected to each other, making the structure robust and fault tolerant. By contrast, a
negative correlation indicates networks where high-degree nodes are connected to low-degree nodes, which are sensitive to structural faults and attacks. This property is therefore used to measure network resilience, which is the ability of a network to maintain a good service even after some node failures or removals [75]. To assess this property, some experiments have been performed by choosing different removal policies:
• Random vertex removal
• High degree vertex removal
• Removal of a class of vertices
Some studies (see [83], [84]) demonstrated a strong resilience in the case of random vertex removal, a resilience which becomes lower when particular nodes are chosen.
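Degree assortativity can be measured by applying Equation 3.8 to the degrees at the two endpoints of every edge, for example:

```python
import math

def pearson(xs, ys):
    # equation (3.8)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - mx) ** 2 for x in xs))
           * math.sqrt(sum((y - my) ** 2 for y in ys)))
    return num / den

def degree_assortativity(edges):
    # correlate the degrees at the two endpoints of every edge,
    # counting each undirected edge in both directions
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    xs = [deg[u] for u, v in edges] + [deg[v] for u, v in edges]
    ys = [deg[v] for u, v in edges] + [deg[u] for u, v in edges]
    return pearson(xs, ys)

# a star graph is maximally disassortative: the hub (high degree)
# links only to leaves (degree 1)
star = [(0, i) for i in range(1, 6)]
print(degree_assortativity(star))   # -1.0
```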
3.5
Centrality Indexes
Centrality indexes represent another very important family of metrics to describe a network. These measures aim to study the importance of a node in a network. Several indexes have been proposed; we now present the most common ones. The Degree Centrality measures the connectivity of a node to the network in terms of incident edges. It is defined as:

DC(v) = \frac{\delta(v)}{|V| - 1} \qquad (3.9)
The Closeness Centrality [85] [86] measures the average distance of a node to the other nodes of the network: it is a measure of the speed of information propagation in the network from a node towards all the others. It is defined by Equation 3.10, where d(v, t) is the minimum path distance from node v to node t:

CC(v) = \frac{\sum_{t \in V} d(v, t)}{|V| - 1} \qquad (3.10)

A third centrality measure is the Betweenness Centrality [87] [88]: it models the importance of a node in terms of the information flow in a network. The BC of node v measures the fraction of the shortest paths σ_{st}(v) from node s to node t that pass through v with respect to all the shortest paths σ_{st} between s and t. Formally:

BC(v) = \sum_{v \neq s,t} \frac{\sigma_{st}(v)}{\sigma_{st}} \qquad (3.11)
This measure is a global property of a node, and a high value suggests that a lot of information between other vertices transits through this node. Despite its importance, calculating this parameter can be computationally very expensive, since it requires finding all the shortest paths between each pair of nodes: the complexity is O(nm), where n is the number of nodes and m the number of edges of the graph. However, a centrality measure called Ego Betweenness Centrality has recently been proposed, for which empirical results showed a strong correlation with the Betweenness Centrality metric. The great advantage of this measure is that only ego network information is needed to compute it. We present this measure in detail in Section 3.6.1.
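Equations 3.9-3.11 can be computed directly on small graphs, for example. This is a brute-force sketch: it counts shortest paths through v via the identity σ_st(v) = σ_s(v)·σ_v(t) when v lies on a shortest s-t path; on large graphs Brandes' algorithm is the practical choice for betweenness.

```python
from collections import deque

def bfs(adj, s):
    # unweighted single-source shortest-path distances
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def num_shortest_paths(adj, s):
    # sigma[u] = number of distinct shortest paths from s to u
    dist = bfs(adj, s)
    sigma = {u: 0 for u in dist}
    sigma[s] = 1
    for u in sorted(dist, key=dist.get):       # process by increasing distance
        for w in adj[u]:
            if dist.get(w) == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def degree_centrality(adj, v):
    return len(adj[v]) / (len(adj) - 1)        # equation (3.9)

def closeness(adj, v):
    return sum(bfs(adj, v).values()) / (len(adj) - 1)   # equation (3.10)

def betweenness(adj, v):
    # equation (3.11): sum over s, t of sigma_st(v) / sigma_st
    dist_v, sigma_v = num_shortest_paths(adj, v)
    bc = 0.0
    others = [u for u in adj if u != v]
    for s in others:
        dist_s, sigma_s = num_shortest_paths(adj, s)
        for t in others:
            if t != s and dist_s[v] + dist_v[t] == dist_s[t]:
                bc += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return bc / 2        # undirected: each {s, t} pair was counted twice

# path graph 0-1-2-3-4: all traffic between the two halves crosses node 2
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(degree_centrality(path, 2))   # 0.5
print(closeness(path, 2))           # 1.5
print(betweenness(path, 2))         # 4.0
```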
3.6
Social Networks: the Dunbar’s property
Most of the work done to describe the properties of social networks has been obtained by analysing offline environments, since OSNs appeared only recently and research in the field of OSN analysis is relatively new. Collecting data concerning offline social networks is a complex task, since reconstructing the history of face-to-face communications between people requires recollecting facts about past social contacts. Collecting data about entire social networks is often infeasible, and research on offline social networks has thus focused on the personal social networks of single individuals, called ego networks.
3.6.1
Ego Networks
An interesting concept is the Ego Network (EN) model [89], which represents a graph constituted by a user (the ego), its direct friends (the alters) and the social ties occurring between them. An ego network is a simple social network model, useful to study the properties of human social behaviour at a personal level. An example of an ego network for the red node is shown in Figure 3.2. From a graph-theoretic point of view, such a graph can be called centred. Definition 3.2. Consider a graph, G, consisting of a set, S, of k points and a set, E, of e symmetrical edges linking pairs of points. Now if k > 2 and there are k − 1 edges such that some one point, p*, is directly connected or adjacent to all of the others, G is a k-star. A centred graph is any graph of k points that contains a k-star. Clearly, any ego network is, structurally, a centred graph. An ego network can be seen as a local view of the graph relative to the user (ego). An ego network can also be defined through an oriented graph, when we want to model interactions; the resulting graph is called an Interaction Graph.
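Extracting the ego network of a node from the full graph can be sketched as follows:

```python
def ego_network(adj, ego):
    # keep the ego, its alters, and only the ties among these nodes
    nodes = set(adj[ego]) | {ego}
    return {u: {v for v in adj[u] if v in nodes} for u in nodes}

# full graph: ego 0 with alters 1-3; node 4 is not a friend of 0
g = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 4}, 3: {0}, 4: {2}}
en = ego_network(g, 0)
print(sorted(en))        # [0, 1, 2, 3]: node 4 is outside the ego network
print(sorted(en[2]))     # [0, 1]: the tie 2-4 is dropped
```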
Figure 3.2: Ego network of the red node.
3.6.2
The Dunbar circles
The most important result found on ego networks is that the cognitive constraints of the human brain, and the limited time that a person can devote to socialising, bound the number of social relationships that he or she can actively maintain. Many sociological and anthropological studies have shown that maintaining social relationships is costly in terms of cognitive capabilities [90]; in particular, it has been shown that the upper limit on the number of relationships an ego can maintain is about 150, the so-called Dunbar's number [91] [19]. Some studies aimed at analysing offline social networks depicted these networks as composed of several concentric circles around the ego [91]: these circles reflect different social tie strengths, and the strength fades as the distance from the ego increases (Figure 3.3). Each of these circles has a typical size and frequency of contact between the ego and the alters contained in it. The first circle, called the support clique, contains alters with very strong social relationships with the ego, informally identified in the literature as best friends. The size of this circle is limited, on average, to 5 members, usually contacted by the ego at least once a week. The second circle, called the sympathy group, contains alters who can be identified as close friends. This circle contains on average 15 members, contacted by the ego at least once a month. The next circle is the affinity group, which contains 50 alters usually representing casual friends or extended family members [92]. Although some studies have tried to identify the typical frequency of contact of this circle, there are no accurate results in the literature about its properties, due to the difficulties related to the manual collection of data about the alters contained in it through interviews or surveys. The last circle in the ego network model is the active network, which includes all the other circles, for a total of 150 members.
This circle contains people for whom the ego actively invests a non-negligible amount of resources to maintain the related social relationships over time. People in the active network are contacted, by definition, at least once
a year. Alters beyond the active network are considered inactive, since they are not contacted regularly by the ego. One of the most important properties of the ego network's circular structure is that the ratio between the sizes of adjacent circles appears to be a constant of about 3.
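The circle sizes and the scaling ratio mentioned above can be checked with a trivial sketch (the constant names are ours):

```python
# Illustrative sketch: the canonical Dunbar circle sizes and the ~3
# scaling ratio between adjacent circles.
CIRCLES = {
    "support clique": 5,
    "sympathy group": 15,
    "affinity group": 50,
    "active network": 150,
}

def scaling_ratios(circles):
    """Ratio between the sizes of adjacent circles, inner to outer."""
    sizes = list(circles.values())
    return [round(outer / inner, 2) for inner, outer in zip(sizes, sizes[1:])]
```

The resulting ratios (3.0, 3.33, 3.0) cluster around the constant reported in the literature.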
Figure 3.3: Dunbar’s circles.
3.6.3
Tie Strength in Online Social Networks
Tie strength is probably the network concept that has attracted the most research attention. The first definition of tie strength, provided by Granovetter in [93], is that the strength of a tie is a (probably linear) combination of the amount of time, the emotional intensity, the intimacy (mutual confiding), and the reciprocal services which characterize the tie. Social ties can be broadly divided into two categories: strong and weak ties. The former correspond to a small set of intimate friends and are useful to consolidate a core group of trusted people on whom an individual can count in case of trouble. Weak ties, on the other hand, are acquaintances, socially far from the ego and usually belonging to different social milieus. Granovetter's findings indicate that tie strength must be taken into account to fully understand the social aspects of a social network. In social network analysis, many attempts have been made to find valid indicators and predictors of tie strength. The simplest approach assumes that close friends have strong ties while acquaintances or distant friends are connected by weak ties [94]. Multiplexity has also been used as a strength indicator [93], and frequency of contact has been proposed as a tie-strength measure by [95] and [96]. The indicators and predictors summarised above have been extracted from data collected in offline social groups; as such, they may or may not be valid in virtual communities, such as an Online Social Network. Obviously, the computation
of the tie strength in a virtual community depends on the intrinsic characteristics of the community itself. Considering OSNs, the possibility of deducing social tie strength from OSN data has been proved in [97]: the authors used a Facebook data set together with explicit evaluations of tie strength provided by the users. In [98] a study aimed at predicting tie strength from online interactions is presented. The authors asked a set of participants to indicate the names of their close friends and used the collected evaluations to train a classifier to distinguish between strong and weak ties. Since the proposed model is based on evaluations of close friendships only, it is less accurate in the prediction of weak ties. In [99], the authors use the "frequency of contact" to estimate the strength of a tie, where a contact is an interaction between two users (a post or a comment). The same authors refine the concept of tie strength in [100], observing that tie strength can be estimated as a linear combination of different variables, such as posts, comments, and tags. In [101] the authors model tie strength as an index of the cognitive resources that a user spends on a social relation. Outgoing communications are given a higher weight than incoming communications, because they require more cognitive resources. The tie strength is defined as:

$$ TieStrength_{jk} = f_{jk} + \frac{f_{jk} \cdot f_{kj}}{f_{jk} + f_{kj}} \qquad (3.12) $$

where $f_{jk}$ is the frequency of outgoing communications from an ego $j$ to an ego $k$.
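Equation (3.12) can be sketched directly (a minimal rendering; the zero-frequency guard is our addition):

```python
def tie_strength(f_jk: float, f_kj: float) -> float:
    """Eq. (3.12): outgoing frequency plus a reciprocity term.
    Outgoing communications dominate, as required by [101]."""
    if f_jk + f_kj == 0:
        return 0.0  # no interactions at all (guard not in the original formula)
    return f_jk + (f_jk * f_kj) / (f_jk + f_kj)
```

Note the asymmetry: with frequencies 4 outgoing and 2 incoming, the tie is stronger than with 2 outgoing and 4 incoming, reflecting the higher weight of outgoing communications.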
Part II A P2P Dunbar-based Distributed Online Social Network
Chapter 4

System Model

In this Chapter we present the system's model by explaining all its components in detail. As shown in Chapter 1, decentralization creates important challenges that have to be faced. We propose DiDuSoNet (Distributed Dunbar-based Social Network), a distributed framework which addresses some of these challenges, such as data persistence and diffusion in a distributed system. Our system exploits the knowledge of the Dunbar-based ego network of a user, introduced in Chapter 3, both for the definition of the overlay and for information persistence and diffusion. The rest of the Chapter is organised as follows:

• in section 4.1, we introduce the profile-based communication which is the basis of our system;
• in section 4.2, we provide an overview of the possible definitions of a Social Graph;
• in section 4.3, we introduce our novel Social Overlay and we explain how a social graph is mapped onto our Dunbar-based Social Overlay;
• in section 4.4, we explain why and how we use the Pastry DHT as lookup service.

The content of this Chapter is mainly based on material that appeared in the following publication: DiDuSoNet: A P2P architecture for distributed Dunbar-based social networks. Guidi, B., Amft, T., De Salve, A., Graffi, K., and Ricci, L. (2015). Peer-to-Peer Networking and Applications, 1-18.
4.1
Profile-Based Communication
In a DOSN each user is associated with a profile, a personal web page where the user freely posts content − e.g. text, snippets, pictures, videos and music. In response to these postings other users, usually friends, post comments and other content. In modern OSNs communication is said to be profile-based, since around 90% of server requests are related to profile page content [42]. Profiles generally have a small size, since pretty much everything in a profile page − comments, links, thumbnails, small pictures − is a small object. The only exceptions are movies and large collections of high-quality pictures; however, these are usually not part of profiles anyway, and are linked from services such as YouTube and Flickr [33]. We consider that each node in the network has its profile and that the profile contains public and private data. Which data are public is a user's choice.

Example 4.1. Consider Facebook, which is the model we mainly refer to in building our system. Most of the information given when a user builds its profile is public, for example age, language, and country. Furthermore, Facebook uses the public profile of a user to support the search for friends. The public profile includes name, sex, username, user ID, and profile image. Private data, instead, are chosen by the user and can contain, for example, the email address and photos.
4.2
The Social Graph
In this section, we review the possible definitions of social graph and we introduce the social graph we use for the definition of the DiDuSoNet social overlay. Formally, a social graph models interconnections (relationships) among people, groups and organizations in a social network. Individuals and/or organizations are nodes of the graph. Interdependencies between nodes, called ties, can be multiple and diverse, including characteristics or concepts like age, gender, or other ties. In the present day context, these graphs define our personal, family, or business communities on social networks. Different notions of Social Graphs have been proposed, which differ according to the kind of relationship between nodes the graph describes. We briefly summarize the main definitions of social graph. A social graph may model: • a friend relationship between two users, like that defined between two Facebook users, where the relationship is created when both users have accepted to establish it. In this case, the social graph is undirected.
Figure 4.1: The Dunbar-based ego network of nodes x and y.

• an asymmetric social relationship, like the "follow relationship" of Twitter. According to the Twitter model, a user can be a follower of another one which does not follow it back. In this case the social graph is directed.
• an interaction between two users of a social network. In this case, the social graph is called Interaction Graph [102], a model representing social relationships based on interactions between users. An interaction graph contains all the nodes of its social graph counterpart, but only a subset of the links: a link exists if and only if the connected users have interacted directly through communication or an application.
• a Dunbar relation between two users. In this case an edge of the graph between node x and node y represents the relation "y is a Dunbar's friend of x", i.e. y belongs to the Dunbar's circles of x. This graph can be defined by taking the graph of the previous point and connecting each node x with all its Dunbar's friends, which are identified by considering the tie strength of the relationships. The tie strength describes the intimacy between two users. This graph is directed and weighted. Note that it is directed because a user u may belong to the Dunbar's circles of a user w, but not the other way round, as explained in Example 4.2.

Example 4.2. Consider nodes x and y in Figure 4.1. Node x has weak ties with most of its friends, while it has a strong tie with y, so that y belongs to the Dunbar's circles of x. Instead, y has strong relations with all its friends, and most of these relationships are stronger than that with x. For this reason, x is not in the Dunbar's circles of y. Even if the Dunbar graph is directed in the general case, we will show in section 4.5 that most Dunbar relationships are symmetric.
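The asymmetry of Example 4.2 can be reproduced with a small sketch (function name and the n-strongest-alters criterion are our simplification of the Dunbar circle membership):

```python
from typing import Dict

def dunbar_edges(ties: Dict[str, Dict[str, float]], n: int):
    """Directed Dunbar edges: (ego, alter) is an edge iff alter is
    among the n strongest ties of ego."""
    edges = set()
    for ego, alters in ties.items():
        strongest = sorted(alters, key=alters.get, reverse=True)[:n]
        edges.update((ego, a) for a in strongest)
    return edges

# x has a strong tie with y, but y's other ties are stronger than its tie to x
ties = {"x": {"y": 0.9, "a": 0.1}, "y": {"x": 0.3, "b": 0.8, "c": 0.7}}
edges = dunbar_edges(ties, n=2)
```

Here (x, y) is an edge while (y, x) is not, exactly the asymmetric situation of Example 4.2.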
Most current DOSNs use some kind of social graph to build the social overlay: a connection between a user x and a user y is created in the social overlay if there is an edge between x and y in the social graph. We refer to the concept of Social Overlay introduced in [33], where nodes are connected to each other only if one of the social relationships previously described holds between them. This kind of overlay can be considered a friend-to-friend (F2F) overlay, a type of P2P overlay in which users only make direct connections with people they know: nodes are connected to known nodes, and an edge between a pair of nodes indicates that a tie exists between them. Social overlays are expected to improve:

• privacy, since non-friends do not see the information sent between friends;
• locality, due to network homophily;
• cooperation, due to friendship.

A social graph can be a weighted graph where the weight of an edge is the tie strength. In real life, people maintain a large number of relationships with varying tie strength: close friends, family, work colleagues, casual acquaintances, and so on. Weak ties, too, are extremely important in real-life social networks. When we consider an undirected weighted social graph, the weight of an edge between x and y is a combination of the tie strength from x to y and from y to x. In OSNs, the tie strength can be calculated by taking into account different factors such as the contact frequency between the two users and the number of likes, posts, comments, private messages, tags, etc. [100].
4.3
DiDuSoNet: the general architecture
DiDuSoNet is built on a Dunbar-based P2P Social Overlay where the connections between nodes correspond to the social relations of the Dunbar-based ego networks of the users, i.e. ego networks built by applying the Dunbar concept, as we will explain later. Users communicate with each other directly via point-to-point connections through the Dunbar-based P2P Social Overlay. In DiDuSoNet, each peer knows its social relationships and maintains a view containing the descriptors of all the nodes directly connected to it. The Social Overlay is used to support a set of Social Services, such as information diffusion and data availability. Let us briefly state some assumptions we make in the definition of DiDuSoNet. We assume that every user in the social network corresponds to exactly one node in the social overlay. Furthermore, we assume that users can connect to the social
network with only one device at a time; the management of multiple devices is out of the scope of this thesis. Note, however, that the same peer may connect to the social overlay through different devices, as long as it does so at different times. Since the mapping is one-to-one, we refer to users, peers, and nodes interchangeably. To support the bootstrap of a node in the social overlay and the search for new friends, DiDuSoNet defines another level, which exploits a DHT and which is also used to support the data management services. In particular, the DHT is used in the following scenarios:

• when a node enters the network, it has no contact information which could be used to establish an initial connection to any friend (bootstrapping). Even if a joining node knows the contact information of one friend, it might not be possible to determine the IP addresses of all friends, for example if the first contacted node does not share these friends with the joining node;
• users cannot search for other users in the network by exploiting only their ego contacts, which is a limit of the ego network.

The overall architecture of DiDuSoNet is shown in Figure 4.2.
4.3.1
Mapping the social graph onto the social overlay
In a DOSN, each user is mapped to a node of the distributed system and a social overlay connecting the nodes is defined, where nodes are connected according to their social links. A direct mapping of the graph describing the friendship relationships, like the Facebook friendship graph, onto the social overlay is feasible, but this solution presents several drawbacks. As of February 2012 the average number of Facebook friends reached 318.5 among adults aged 18-34, while it is about half of that for older adults. In our dataset, described in Chapter 8, the maximum size of an ego network is about 3000 friends. This implies that each node of the distributed system would have to maintain a large number of connections with the nodes paired with its friends, while a large subset of them may be underused, because few interactions occur among the corresponding friends. Note that a node may not have enough resources to maintain a huge number of relationships, or to send a large number of social updates to all its friends, for instance if the node is a mobile device. Furthermore, nodes can establish connections with other users, but not all of these are really active (commonly used). Our proposal is to exploit the Dunbar approach to define the social overlay. As discussed in the previous Chapter, in section 3.6, Dunbar explains that people have limited cognitive resources and, for this reason, each person can maintain a limited
number of active social relationships. This limit is 150 and is called the Dunbar's number. We use the Dunbar concept to limit the number of relationships each user has to manage to the 150 relationships with the highest tie strength, and this is the basis for the definition of our social overlay.

Figure 4.2: DiDuSoNet Architecture (Social Services such as Data Availability and Information Diffusion, built on the Social Overlay, which relies on the Pastry DHT).
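Selecting the 150 relationships with the highest tie strength, as described above, amounts to a simple ranking step (a minimal sketch; names are ours):

```python
DUNBAR_NUMBER = 150

def dunbar_friends(tie_strengths: dict, limit: int = DUNBAR_NUMBER) -> dict:
    """Keep only the `limit` relationships with the highest tie strength."""
    ranked = sorted(tie_strengths.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:limit])
```

Friends that fall outside the limit are not discarded by the system: as explained below, they are kept in a separate view (the Additional Table).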
4.3.2
The Dunbar-based Social Overlay
Users can access the private data of their Dunbar's friends only through the Dunbar-based overlay. Therefore, private data are only stored, in a controlled manner, on well-known nodes. In the DHT, on the other hand, we store public information as well as the information needed to support the retrieval of profile replicas; a specific mechanism therefore has to be applied to control access to this information. In Figure 4.2, the SocialID identifies a node in the Social Overlay, while the DHT ID is the PastryID of the node. Each user is identified by two different IDs: the first one is the SocialID, used to identify a node in the system; the second
one is the ID related to the underlying DHT, which is used to navigate within the DHT. We can assume that two persons who are friends in our system know each other's SocialIDs. We use the Dunbar's number to create a dynamic social overlay where each user is connected only to a limited set of friends, arranged in a hierarchically inclusive sequence ordered by increasing level of intimacy [90]. The stability of a connection may be defined as a function of the tie strength which characterizes the relation, a numeric value quantifying the strength of the relationship between two users. This permits us to reduce both the total amount of social information each peer has to keep in memory and the number of connections of the overlay. We consider a basic system where the weight of an edge (A, B) is the average of the tie strengths of the edges (A, B) and (B, A). The resulting Social Overlay is a structured P2P Overlay Network composed of the Dunbar-based Ego Networks of the nodes in the network. Each node, when it joins the network, builds its Dunbar-based Ego Network by retrieving information about its friends from the DHT, as we will see in section 4.4. Each node maintains two views:

• the Social Table: the Social Table of a node n contains all the Dunbar's friends of n. Each row of the table describes the social relationship between n and a generic Dunbar's friend m. In particular, each entry contains: the SocialID of the friend m, the timestamp defining when the relationship between n and m was established, called InitRelationTime, the tie strength from n to m, and a field "Frequency" which stores the number of interactions between n and m. According to the Dunbar concept, the table can contain at most 150 nodes (Figure 4.3);
• the Additional Table: it contains all the friends of n which are not its Dunbar's friends.
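The table entries described above could be rendered, for instance, as the following record (field names are our own rendering of the fields listed in the text):

```python
from dataclasses import dataclass

@dataclass
class RelationEntry:
    """One row of the Social Table (and of the Additional Table)."""
    social_id: str
    init_relation_time: float  # InitRelationTime: when the relationship began
    tie_strength: float        # tie strength from n to m
    frequency: int             # "Frequency": number of interactions

MAX_SOCIAL_TABLE_SIZE = 150    # the Dunbar number
```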
Each row of the Additional Table contains the same information as the Social Table. Figure 4.4 shows the ego network of a generic node N and its representation inside the Social Overlay. Ego networks are almost stable, but the interactions between users can vary depending on the real life of each user. To manage the evolution of the social relationships of the users, we have implemented a mechanism which periodically checks the tie strength variation. The service periodically recomputes, in the background, the tie strength of each friend by using the following formula:

$$ TieStrength_{jk} = \frac{(f_{jk} + f_{kj})/2}{durationRelation_{jk}} \qquad (4.1) $$
Figure 4.3: The SocialTable of node N (the Additional Table is identical).
(a) Ego Network of a generic node N; (b) Dunbar-based Ego Network of a generic node N.
Figure 4.4: The Ego Network of node N and its representation in a node.

where $f_{jk}$ is the frequency of outgoing communications from an ego $j$ to an ego $k$. The service then reorders the Social Table and the Additional Table by tie strength, so that every entry of the Social Table has a tie strength at least as high as that of any entry of the Additional Table. Entries can be moved from the Social Table to the Additional Table (and vice versa) according to the variation of the tie strength values. The computation of the tie strength is shown in Algorithm 1, where the functions days(timestamp) and minutes(timestamp) return, respectively, the days and the minutes of timestamp, which in our case is the time at which the computation is done. The function performTieStrength() computes the tie strength; the resulting value then has to be normalized. The normalization is performed by the Normalize() function (Algorithm 2), whose refreshTieStrength() helper updates the tie strength by calling the
performTieStrength() function.

Algorithm 1 Tie Strength Computation after an interaction between n and m

function performTieStrength(actualTime)
    threshold ← numberOfMinutesInSixMonths
    interval ← minutes(actualTime) − minutes(initRelationTime)
    numberOfInteractions ← numberOfInteractions + 1
    days ← days(actualTime) − days(initRelationTime)
    if interval < threshold then
        weight ← 1 − 1/days
    else
        weight ← 1
    end if
    tieStrength ← weight ∗ (numberOfInteractions/interval)
end function

In Algorithm 1, we assign a different weight to a relationship if it is too young. Dunbar explains that a relationship can be considered stable only after a certain amount of time (six months)1. Interactions between a pair of users decrease over time and are most frequent just after the creation of the relation; without the weight, the tie strength of young relationships would be overestimated.
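Algorithm 1 can be rendered as executable code as follows (a sketch in which the per-relationship state is passed explicitly; the division-by-zero guard is our addition, not part of the pseudocode):

```python
MINUTES_IN_SIX_MONTHS = 6 * 30 * 24 * 60  # the threshold of Algorithm 1

def perform_tie_strength(minutes_now, minutes_init, days_now, days_init,
                         number_of_interactions):
    """Executable rendering of Algorithm 1.
    Returns (updated interaction count, new tie strength)."""
    interval = minutes_now - minutes_init
    number_of_interactions += 1
    days = max(days_now - days_init, 1)  # guard against a zero-day-old relation
    if interval < MINUTES_IN_SIX_MONTHS:
        weight = 1 - 1 / days   # damp relationships younger than six months
    else:
        weight = 1
    return number_of_interactions, weight * (number_of_interactions / interval)
```

For a relationship older than six months the weight is 1 and the tie strength reduces to the plain interaction rate; for a young relationship the damping factor grows towards 1 as the relationship ages.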
4.4
The DHT lookup service: Pastry
In the DiDuSoNet DHT, we store the public data of the users and a single piece of private information: the references to the nodes which store the replicas of a user's data (the Points of Storage of a node). DHTs have several benefits but, at the same time, they present several security problems: it is hard to verify data integrity, and secure routing is still an open problem. A further point is critical for DOSNs: for systems which require that connections are established only between nodes that share a tie, or in which private data should be stored on trusted nodes, a DHT does not seem to be the right solution. For these reasons, we have designed a two-tier system in which private data are stored in, and visible only to, the Dunbar Social Overlay, while the DHT is used as a support for the Social Overlay. By considering the purposes, advantages and disadvantages of the existing structured P2P overlays
1 https://socialcapital.wordpress.com/tag/getting-connected/
Algorithm 2 Normalization of the Tie Strength after an interaction between n and m

function normalize(actualTime)
    maxTie ← 0
    tiesGreaterThanZero ← 0
    for all friend ∈ SocialTable do
        refreshTieStrength(friend, actualTime)
        if getTieStrength(friend) > 0 then
            if maxTie == 0 OR maxTie < getTieStrength(friend) then
                maxTie ← getTieStrength(friend)
            end if
            tiesGreaterThanZero ← tiesGreaterThanZero + 1
        end if
    end for
    for all friend ∈ SocialTable do
        if tiesGreaterThanZero ≤ 1 then
            if getTieStrength(friend) > 0 AND olderThanSixMonths(friend) then
                setTieStrength(friend, 1)
            else
                setTieStrength(friend, 0)
            end if
        else
            newTie ← getTieStrength(friend)/maxTie
            setTieStrength(friend, newTie)
        end if
    end for
end function
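Algorithm 2 could be rendered compactly as follows (a sketch over plain dictionaries; the refresh step is assumed to have already happened):

```python
def normalize(tie_strengths, older_than_six_months):
    """Executable rendering of Algorithm 2. `tie_strengths` maps a friend
    to its (already refreshed) tie strength; `older_than_six_months` maps
    a friend to a boolean. Returns the normalized table."""
    positive = [t for t in tie_strengths.values() if t > 0]
    max_tie = max(positive, default=0)
    if len(positive) <= 1:
        # degenerate case: at most one positive tie, so normalization by the
        # maximum is meaningless; keep only stable (older) relationships
        return {f: (1 if t > 0 and older_than_six_months[f] else 0)
                for f, t in tie_strengths.items()}
    return {f: t / max_tie for f, t in tie_strengths.items()}
```

In the common case each tie strength is simply divided by the maximum, yielding values in [0, 1].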
(CAN, Kademlia, Chord, Pastry), we have chosen Pastry as the structured P2P overlay for our system. In our system, Pastry provides the functionality to find the node with a given SocialID and to store, search, and retrieve data objects. Further, we assume that a node n joining the network knows at least the SocialIDs of its friends. To obtain the logical identifier of a node in the DHT, we simply hash its SocialID. The procedure for searching friends inside the DHT is the following:

• the hash of the SocialID of the friend to be found is computed through a SHA-1 function;
• given the hash value, the node responsible for this value is searched;
• the node found to be responsible for the hash value is the friend we are looking for. If no node with that PastryID exists in the DHT, the friend is not online and the lookup returns a reference to the node which stores the data of the offline friend.

Note that, additionally, an access control procedure can be applied to verify the identity of a node in the network.

Example 4.3. Consider Figure 4.5, and suppose that a node x is going online and needs to search information about its friends.
Figure 4.5: Searching a friend inside the DHT.

Node x knows the SocialIDs of its friends: 500, 700, and 100. It computes the hash of each SocialID, for example h(500) = 30, obtaining the PastryIDs 30, 80, and 200, respectively. Given the hash values, x is able to search for these nodes through the DHT.
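The SocialID-to-PastryID mapping used in the example could be sketched as below (the small values in the example, such as h(500) = 30, are purely illustrative; truncating the SHA-1 digest to the identifier-space width is our assumption):

```python
import hashlib

def pastry_id(social_id: str, bits: int = 128) -> int:
    """Map a SocialID to a DHT identifier by hashing it with SHA-1,
    as in the search procedure above."""
    digest = hashlib.sha1(social_id.encode()).digest()
    # keep the `bits` most significant bits of the 160-bit digest
    return int.from_bytes(digest, "big") >> (160 - bits)
```

The mapping is deterministic, so any node can recompute a friend's PastryID from its SocialID alone.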
Another friend of x is the node with SocialID equal to 800, which the hash function maps to the PastryID 20. Since 800 is not online, it is not present in the DHT and its data are managed by the node which follows 20 on the DHT, i.e. the node with PastryID equal to 30. When a node registers itself to the system, it joins the DHT and builds its Social Overlay, which is initially empty. Through the DHT, a user is able to retrieve information about other nodes in the network by using the search function. When a new user n finds a new friend m inside the system, it inserts m into its Social Table. The tie strength is initially equal to 0 and is updated after an interaction between n and m (in our system the computation is executed periodically). If the number of friends is less than 150, the Additional Table remains empty.

The Bootstrap problem

In the bootstrap phase, each node n performs a lookup on the DHT to retrieve its Dunbar friends. For each online friend, the current physical address is returned, so that n may open a direct connection to that friend in the Dunbar-based Social Overlay. If a friend is not online, the lookup retrieves the node responsible for the friend's data. Pastry has a one-dimensional identifier space and routing is possible in O(log N); each node is uniquely responsible for a well-defined part of the identifier space. In Kademlia, by contrast, a request does not deterministically lead to the same node (because of the α parallel lookups), depending on the node in the network from which it starts.

Social DHT

The usage of the DHT is the main obstacle to preserving trustworthiness in the system, because data are stored on nodes which cannot be trusted. A simple way to manage this problem is the introduction of social links, which permit a certain level of trust during the routing over the DHT.
As explained in Chapter 5, we provide a Social Pastry solution and a general algorithm which can be used over different overlays.
4.5
Analysis of the Dunbar Connections
The Dunbar-based Social Overlay limits the number of connections users have to open. This limitation affects the number of outgoing connections, which are at most 150, but no assumption is possible on the number of ingoing
connections. If we consider the total number of friends of a node x, all of them could open ingoing connections to x. Considering, for example, Facebook, where a user can have at most 5000 friends, x could have to manage 5000 ingoing connections. This scenario may be critical, in particular when we consider that a user can connect to our system with different kinds of devices. To evaluate this scenario, we have measured the percentage of relationships which are symmetric. We have used our Facebook dataset, explained in detail in Chapter 8, and we have considered only registered users, because for them we have complete knowledge of their ego networks and of the tie strength of each of their friendships. Among the 328 registered users, we have considered the 253 registered users which have relations between them. The total number of analysed relationships is 937. Among them, 743 (about 79%) are symmetric and only 184 (about 21%) are asymmetric. This means that users, in most cases, have a number of ingoing connections that is almost equal to the number of outgoing connections. Furthermore, not all ingoing connections are opened at the same time: this depends on the users' behaviour. Nevertheless, a scenario in which the number of ingoing connections is very large may occur. We propose a simple approach to manage this problem by using the knowledge of the Dunbar Ego Network. We introduce a parameter k which indicates the maximum number of ingoing connections a user can support; this parameter is set by the user itself. Furthermore, we associate with each ingoing connection a timestamp indicating the time of the latest received request. Suppose that a user x, which is online, already has k open ingoing connections with a subset of its online friends, and that a node y, which is a friend of x and considers x a Dunbar's friend, is going online.
When y joins the network, it finds all its online Dunbar's friends and tries to establish a connection with them. When y connects to x, x checks the parameter k and realizes that it cannot open further ingoing connections. In this case, x sends a message to y requesting information about the Dunbar's circle in which x is contained. If the circle of y in which x is located is the innermost one (Support Clique) or the second one (Sympathy Group), y has a strong relation with x and x could be a possible Point of Storage for y. In this case x accepts to open a new connection with y, closing the oldest inactive connection (identified by analysing the timestamps associated with the connections).
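The connection-admission policy just described could be sketched as follows (class and method names are ours; logical timestamps stand in for the per-connection timestamps of the text):

```python
import itertools

class IngoingConnections:
    """At most k open ingoing connections; a friend from the two innermost
    circles may evict the connection with the oldest last-request time."""
    INNER_CIRCLES = {"support clique", "sympathy group"}

    def __init__(self, k):
        self.k = k
        self.clock = itertools.count()      # logical timestamps
        self.last_request = {}              # friend -> latest request time

    def accept(self, friend, circle):
        if len(self.last_request) < self.k:
            self.last_request[friend] = next(self.clock)
            return True
        if circle in self.INNER_CIRCLES:
            oldest = min(self.last_request, key=self.last_request.get)
            del self.last_request[oldest]   # close the most inactive connection
            self.last_request[friend] = next(self.clock)
            return True
        return False

# usage: with k = 1, a close friend can evict an older connection
conn = IngoingConnections(k=1)
accepted_far = conn.accept("a", "affinity group")  # first slot is free
```

A request from a casual friend is refused when all k slots are taken, while a request from the Support Clique or Sympathy Group replaces the most inactive connection.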
Chapter 5

Friendly Routing over Dunbar-based Overlays

DiDuSoNet guarantees a certain level of trust to all users by using the Dunbar overlay, without introducing any further privacy mechanism. On the other hand, to guarantee the correct functionality and to permit the management of our system, we need a DHT, as shown in Chapter 4. The DHT is the weak point of our system because of its intrinsic lack of trust and privacy: data stored in a DHT can be accessible to everyone, and stronger mechanisms should be used to guarantee trust. There are two different levels of trust which can be introduced in a DHT:

• store information only on friend nodes, although this constraint changes the DHT mechanism;
• maintain the classic functionality of a DHT, but with an overlay containing only connections between friend nodes.

To obtain a completely trusted system, we want to improve our system with a friend-to-friend routing functionality completely embedded in the overlay defined by the DHT we are using. Our intuition is to modify the routing of our Pastry DHT to guarantee a certain level of trust. By integrating the Dunbar approach into Pastry's routing, we are able to direct the lookup through Dunbar friend nodes. On the other hand, some basic properties of Pastry routing are no longer guaranteed because of the modifications. To overcome these drawbacks, we finally provide a more general approach which can be used in structured and unstructured overlay networks: it does not modify the DHT's routing, but can be layered on top of the routing service. We evaluate how far lookups are possible in Pastry if messages are only forwarded to Dunbar friend nodes, and we show the drawbacks of this approach in terms of how the number of
failed lookups increases as the number of possible contacts in an overlay's routing table decreases. We also show that the number of nodes a message has to traverse increases as the number of friends is reduced. The rest of the Chapter is organised as follows:

• in section 5.1, we introduce the problem of friendly routing over P2P networks;
• in section 5.2, we introduce Social Pastry, a novel approach in which we change the classic Pastry functionalities to integrate trustworthiness into the DHT. Social Pastry presents several drawbacks that guided us to the definition of a general algorithm;
• in section 5.3, we introduce goLLuM, an efficient algorithm proposed to find existing paths between two nodes;
• in section 5.4, we briefly present our evaluations, which are described in detail in Chapter 10.

The content of this Chapter is mainly based on material that appeared in the following publication: FRoDO: Friendly Routing over Dunbar-based Overlays. Amft, T., Guidi, B., Graffi, K., and Ricci, L. Accepted to the 40th IEEE Conference on Local Computer Networks (LCN), 2015, Clearwater Beach, Florida, USA.
5.1 Friendly Routing
Existing friendly routing approaches in P2P networks have been proposed to manage misrouting attacks, i.e., any failure by a peer to forward a message to the appropriate peer according to the correct routing algorithm. Using a priori relationship knowledge may be key to mitigating the effects of misrouting. Several solutions have been proposed for structured P2P networks. In [103] the SPROUT algorithm is proposed: SPROUT adds to Chord additional links to any friends that are online. FreeNet [104], since version 0.7, includes a so-called darknet mode of operation that restricts transfers to social connections only. Nodes have a fixed set of connections based on social relationships, and routing is possible only through these links. Freenet assumes that the darknet (a subset of the global social network) is a small-world network, and nodes constantly attempt to swap locations in order to minimize their distance to their neighbours. If the network actually is a small-world network, Freenet should find data reasonably quickly; however, it does not guarantee that data will be found at all. Turtle [105] is a friend-to-friend (F2F) network which provides a P2P overlay on top of pre-existing
trust relationships among Turtle users. Turtle does not allow arbitrary nodes to connect and exchange information; instead, each user establishes secure and authenticated channels with a limited number of other nodes controlled by trusted users (friends). In the Turtle overlay, both queries and results move hop by hop, and information is only exchanged between people that trust each other and is always encrypted. However, misrouting is far from the only application of social networks to peer-to-peer systems, and our main problem is not to manage this scenario, but rather to guarantee, over a structured P2P network like Pastry, the same level of trust provided by our Social Overlay. The first solution, which we show in section 5.2, is similar to SPROUT, but its novelty lies in the concept of trust we use. We define a concept of trust which is more restrictive than in SPROUT and FreeNet, because we consider that friend nodes do not all have the same level of trust, as suggested by Dunbar. We will show that friendly routing over Social Pastry is not possible if we consider only Dunbar friends. To manage this drawback, in section 5.3 we introduce goLLuM, a general algorithm which can be used on top of different overlays. The proposed algorithm is similar to FreeNet, but also in this case we consider a very restrictive setting in which trust is based on Dunbar-based social relationships and the number of links is limited to the Dunbar number.
5.2 Social Pastry: the DHT structure
To define a Social Dunbar-based Pastry DHT, the overlay has to be extended with a friend-routing table, which comprises the Dunbar friends of a node. The list of friends of a node is also sorted by level, where a level corresponds to a Dunbar circle. We call this protocol SocialPastry. To obtain SocialPastry, the routing table is modified in such a way that the original routing behaviour is preserved, while at the same time contact information about friendly nodes is stored. Instead of storing one contact node per entry in the Pastry routing table, we extend the overlay with an additional routing table, the friend-routing table. Additionally, a closeness level is associated with each node in the table, referring to the strength of the personal relation between the owner n of the routing table and that node. The closeness level corresponds to the Dunbar circle where the friend is located. According to their closeness level, the additional nodes are stored inside the friend-routing table in friend-buckets. Given a target identifier IDt for a lookup request, the SocialPastry protocol acts as follows:
• Node n, which has to find the next hop of the lookup request, examines if the given target identifier IDt is in the range of its leaf set. That is, it has to decide
whether IDt ∈ [Lccw, Lcw] or not, where Lccw is the node in the leaf set farthest away from n in counter-clockwise direction in the id space and Lcw is the one farthest away from n in clockwise direction. If so, node n determines the next hop of the lookup request as the node in the leaf set numerically closest to IDt whose closeness level is, at the same time, below or equal to a given threshold λ.
• If IDt is not in the range of n's leaf set, or all closeness levels are greater than λ, the next hop is searched for in Pastry's routing table. As in original Pastry, the contact node sharing the longest common prefix with IDt will be the next hop of the lookup request. Instead of storing only one Pastry contact per routing table entry, we store for each entry the original contact node as well as those friendly nodes which share the same common prefix with the original node. The lookup request is forwarded to a node in the entry set whose closeness level is not greater than λ.
• In case no contact node has been selected as next hop so far (e.g. if the closeness levels did not satisfy n's decision), n picks the first fitting node among the next numerically closest nodes which fulfils the closeness threshold.
• If no contact node is found (e.g. if node n does not have any friend stored in the routing table), the lookup request is stopped.
It is important to notice that the properties guaranteed by the Pastry DHT are no longer valid in SocialPastry: first of all, SocialPastry does not guarantee logarithmic routing bounds; furthermore, routing loops are possible, as shown in the next example.

Example 5.1. Consider the topology shown in Figure 5.1(a): node 2 starts a lookup to target node 58. Node 2 considers node 28 to be the next hop, 28 selects node 48 as next hop, and so on; the message is forwarded until the target node is found. In Figure 5.1(b), node 34 does not consider node 58 as a friend.
Instead, it forwards the lookup message to node 48. Node 48 still considers node 14 to be the best-suited next hop. As a result, the message is endlessly forwarded in a routing loop among nodes 14, 34 and 48.

SocialPastry fails if any node n, responsible for forwarding a certain lookup request, has no friends in its routing sets, or if the maximum number of hops for a lookup request message is exceeded. Therefore, the number of failed lookups should increase with a decreasing threshold value λ, since more hops have to be traversed. Due to these drawbacks, we have implemented a general protocol, independent of the P2P overlay, to overcome the routing issue and, on the other hand, to apply friendly (social) routing on top of any structured or unstructured P2P overlay.

Figure 5.1: Routing loops are possible in naive approaches for Social DHTs. (a) Successful path of a lookup; (b) lookup fails due to routing loop.
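The SocialPastry next-hop decision described above can be sketched as follows. This is a minimal illustration of the threshold rule only, not the thesis implementation: the ring size, the candidate set and the closeness levels are assumed for the example.

```python
# Illustrative sketch (not the thesis implementation) of SocialPastry's
# next-hop choice: prefer the numerically closest candidate whose
# Dunbar closeness level is within the threshold lambda.
# Identifiers live on a modulo ring; levels 1..4 mimic the Dunbar circles.

RING = 64  # tiny id space, as in Figure 5.1

def ring_dist(a: int, b: int) -> int:
    """Shortest distance between two ids on the modulo ring."""
    d = abs(a - b) % RING
    return min(d, RING - d)

def next_hop(candidates: dict[int, int], target: int, lam: int):
    """candidates maps node id -> closeness level (1 = innermost circle).
    Returns the closest-to-target candidate with level <= lam, or None."""
    trusted = [n for n, level in candidates.items() if level <= lam]
    if not trusted:
        return None  # lookup stops: no friendly contact available
    return min(trusted, key=lambda n: ring_dist(n, target))

# Node 2 routes towards target 58 (cf. Example 5.1); levels are hypothetical:
table = {28: 1, 48: 2, 14: 4}
print(next_hop(table, 58, lam=2))  # -> 48: node 14 is nearer but untrusted
print(next_hop(table, 58, lam=0))  # -> None: no candidate within threshold
```

Lowering λ shrinks the trusted candidate set, which is exactly why failed lookups increase with a decreasing threshold.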
5.3 Friendly Routing by using goLLuM
In this section we introduce goLLuM (going over Local Links by using Dunbar's Method), a distributed depth-first search combined with routing preferences, which is used to efficiently find existing paths between two nodes, or to inform the sender if no such path exists. goLLuM allows us to forward messages only via friends and friends of friends which fulfil a certain closeness level. The main goal of this algorithm is to forward messages from any starting point to any given target node (note that the two nodes do not necessarily have to be friends) whenever a path between them, consisting of a sequence of nodes which are friends, exists. A message is sent back to the starting point if no such route exists, in order to inform the initial sender about the inaccessibility of the desired receiver. Normally, the principle of structured overlays is to carry a message closer to a given receiver with every selected hop. If we limit routing by forwarding messages only to friendly nodes with a certain trust level, it might happen that the direct path from starting point to target node is blocked due to the incompatible closeness level of the next, closer hop. Our proposed algorithm detects obstacles in the direct routing path and circumvents them. As can be seen in Figure 5.2, node B would be node A's best choice as next hop on the way to target T. Nevertheless, A is able to forward the message to C, which is the second closest node to T in A's
routing table and is considered a friend of A. C in turn might decide to forward the message to D, which has no friends other than C. D will thereupon send the message back to C, which has enough alternatives to forward the message.
5.3.1 The Routing Algorithm
In order to realize the distributed search, messages are extended with a list of already visited nodes LV and with a list of nodes LP which represents the path currently traversed from the starting point towards the target node. Additionally, a flag within the message indicates whether it is being forwarded or sent back, as we saw in Figure 5.2 when D had no friends other than C.

Figure 5.2: Basic idea of the routing algorithm is to route around blocked links.

Algorithm 3 describes the goLLuM routing algorithm in detail. Lookups are started by sending a LookupMsg to the friendly node which is closest to a given target identifier. Upon receiving a LookupMsg, a node first decides whether it is itself responsible for the given target identifier (line 6). If so, the message has reached the desired receiver. Otherwise, the current node is added to the list of visited nodes LV (line 12). In order to determine the node with the closest identifier to the target, all friends with a matching closeness level are stored in a new list LF (line 10) and sorted according to their distances to the target (line 11 in Algorithm 3). In the next step, the current node has to decide whether the previous node forwarded the message in the normal way (the previous node considered the current node closer to the target id and therefore selected it as next hop), or whether the previous node sent the message back because it had no further friends (meaning the current node had previously forwarded the message to a node which seemed to be a good next hop, but which had no other unvisited friends). In the latter case, i.e. the message is received a second time (it has been sent back), the position in list LF of the last sender (who sent the message back) is determined and increased by one, so as to select the next possible hop among all friends (line 14). In the former case, the closest friend is simply selected as possible next hop (line 16). Starting at the calculated position, the sorted friend list LF is iterated to find the next closest node which has not been visited previously (lines 18-20). If such a node
is found, it is added to the temporarily calculated path represented by list LP and the message is forwarded to it (lines 21-23). Only if all friends of a node have been visited before is the message sent back to the most recently visited node on list LP (line 31), i.e. the node which previously sent the message to the current node. In addition, the last sender is removed from list LP, which is used to remember the sequence of all previous senders of the message (line 30). The depth-first search is executed until the message reaches its receiver, or reaches the initial sender again (line 33).
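The behaviour just described can be simulated sequentially on a toy friend graph. The sketch below is our own simplified illustration (not the thesis code): the position bookkeeping of Algorithm 3 is replaced by marking nodes as visited when they are pushed, and the routing preference is reduced to a lexicographic order; the network mirrors Figure 5.2, with D as a dead end.

```python
# A minimal, self-contained sketch of goLLuM's distributed depth-first
# search, simulated sequentially: each hop forwards only to friends, skips
# nodes already in LV, and sends the message back (backtracks) when a node
# has no unvisited friends left.

def gollum_lookup(friends: dict[str, set[str]], start: str, target: str):
    """Returns the traversed path start..target, or None if no friendly
    path exists. `friends` is an adjacency map of trusted links."""
    LV = {start}          # visited nodes (carried inside the message)
    LP = [start]          # current path from the starting point
    while LP:
        current = LP[-1]
        if current == target:
            return LP
        # order friends by a routing preference; here simply lexicographic
        candidates = sorted(friends.get(current, set()))
        next_hop = next((f for f in candidates if f not in LV), None)
        if next_hop is None:
            LP.pop()      # no unvisited friends: send the message back
        else:
            LV.add(next_hop)
            LP.append(next_hop)
    return None           # message returned to the initiator: lookup fails

# Toy network mirroring Figure 5.2: A detours via C, D is a dead end.
net = {"A": {"C"}, "C": {"A", "D", "E"}, "D": {"C"}, "E": {"C", "T"}, "T": {"E"}}
print(gollum_lookup(net, "A", "T"))   # -> ['A', 'C', 'E', 'T']
print(gollum_lookup(net, "A", "X"))   # -> None (no friendly path)
```

Note how the message enters the dead end D and is sent back to C, which then has enough alternatives to reach T, exactly the circumvention behaviour of Figure 5.2.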
5.3.2 Privacy
We are aware that both lists, LV and LP, raise privacy issues, as information about all the participants forming the path between sender and receiver is exposed to non-friendly, untrusted nodes. Therefore, we describe next how to overcome this lack of privacy while retaining the functionality. First, the routing path represented by LP is not included within the message. Instead, each node caches the last senders of all forwarded messages for a limited amount of time. The cached information is released early if the node sends the message back to the last sender. The list of previously visited nodes LV is privatized by storing encrypted information about the visited nodes. Each node only needs to know whether its own friends appear on the list (line 20 in Algorithm 3), regardless of whether the information was put into list LV by the current node itself or by another participant sharing the same friend. In order to encrypt the nodes stored in LV, each node creates four individual symmetric keys, one for each closeness level. For each friend in a node's routing table, the keys whose priority is greater than or equal to the level of the respective friend are stored. For example, for level-1 friends the keys for levels 1, 2, 3 and 4 are stored, whereas for level-3 friends only keys 3 and 4 are stored. In this way, every node maintains the contact information, the associated closeness level, and the affiliated symmetric keys for each friend. Furthermore, information about a visited node is added to list LV encrypted by using the key with the lowest priority that is still above the defined maximum level. Therefore, information about visited nodes is readable only to participants that share a key fulfilling the desired closeness level.
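The per-level key distribution can be sketched as follows. This is an assumption-laden simplification of the scheme, not the thesis code: instead of full symmetric encryption we use keyed HMAC tags, which already suffice for the membership test of line 20 (a holder of the right level key can recompute the tag and check whether a friend appears in LV).

```python
# Sketch: each LV entry is a keyed tag HMAC(level_key, node_id); only nodes
# holding the corresponding level key can test membership of a friend in LV.
import hmac, hashlib, secrets

# one symmetric key per Dunbar closeness level (1 = innermost circle)
level_keys = {level: secrets.token_bytes(16) for level in (1, 2, 3, 4)}

def keys_for_friend(level: int) -> dict[int, bytes]:
    """A friend at closeness level l receives the keys for levels l..4."""
    return {l: k for l, k in level_keys.items() if l >= level}

def tag(node_id: str, key: bytes) -> str:
    return hmac.new(key, node_id.encode(), hashlib.sha256).hexdigest()

# A forwarding node adds itself to LV under the level-2 key:
LV = {tag("node-34", level_keys[2])}

alice = keys_for_friend(1)   # level-1 friend: holds keys 1, 2, 3, 4
carol = keys_for_friend(3)   # level-3 friend: holds keys 3 and 4 only

print(any(tag("node-34", k) in LV for k in alice.values()))  # True
print(any(tag("node-34", k) in LV for k in carol.values()))  # False
```

The node names and the choice of HMAC-SHA256 are ours; the point is only that an entry written under a given level key is invisible to friends whose circle does not grant them that key.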
5.3.3 Independence
The goLLuM algorithm is completely independent of the underlying P2P overlay. As an extension that allows routing over friends only, it can run on top of every existing approach, since the distributed depth-first search it uses is designed to always find an existing route between two nodes. If no underlying structure
Algorithm 3 Receipt of LookupMsg(LP, LV, IDt, back) from m at n

 1: // Receive LookupMsg, which contains:
 2: //   LP: list in which the previously traversed path is stored
 3: //   LV: list in which previously visited nodes are stored
 4: //   IDt: given target identifier
 5: //   back: specifies if LookupMsg has been sent back (true) or forward (false)
 6: if n.isResponsibleFor(IDt) then
 7:     deliver(LookupMsg);
 8: else
 9:     // get friends from routing table with level ≤ λ
10:     LF = getFriends(λ);
11:     LF.sort(IDt);
12:     LV.add(n);
13:     if back == true then
14:         pos = LF.getPos(m) + 1;
15:     else
16:         pos = 0;
17:     end if
18:     while pos < LF.size() do
19:         nextHop = LF.get(pos);
20:         if ¬LV.contains(nextHop) then
21:             LP.push(n);
22:             send to nextHop: LookupMsg(LP, LV, IDt, false);
23:             return
24:         else
25:             pos = pos + 1;
26:         end if
27:     end while
28:     // all friends have been visited before
29:     if LP.size() > 0 then
30:         nextHop = LP.pop();
31:         send to nextHop: LookupMsg(LP, LV, IDt, true);
32:     else
33:         stop(); // msg is at initiator again (fail)
34:     end if
35: end if
exists, then, similarly to flooding in unstructured overlays, the depth-first search will still find an appropriate path to a given target node; as a drawback, many nodes have to be visited to find it. One characteristic of structured overlays is to keep the network scalable by introducing links to far-away nodes. Another feature all structured peer-to-peer overlays have in common is the need for a distance metric describing the difference between participants. In Chord and Pastry, for example, distances are derived from positions in a one-dimensional identifier space mapped onto a modulo ring; in Kademlia, the XOR operation is used to define distances in the overlay. Furthermore, in every unstructured or structured overlay in which data objects are searched, it has to be clear who is responsible for the data objects. In unstructured overlays like Gnutella [1], participants are simply responsible for content if they hold a given data item. In most structured overlays, responsibility for a given data object is distributed equally among all overlay nodes and depends on the position of a node in the identifier space. In Kademlia, lookups are not deterministic, so many nodes can be responsible for a desired data item. Our routing approach benefits from a distance metric to order links and select preferred links first. Furthermore, it has to be known where to stop a lookup. Although in DiDuSoNet we build the algorithm over Pastry for evaluation purposes, in principle it can be applied to any other structured or unstructured overlay, which only has to define an interface for the distance metric used and an interface for announcing responsibility for a given data item.
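The two interfaces just mentioned can be sketched as follows. The names (`OverlayAdapter`, `KademliaAdapter`) and the toy node set are our own illustration, not an API defined in the thesis; the Kademlia case uses the XOR metric described above.

```python
# Hypothetical sketch of the two overlay hooks goLLuM needs: a distance
# metric to order links, and a responsibility test to stop a lookup.
from abc import ABC, abstractmethod

class OverlayAdapter(ABC):
    @abstractmethod
    def distance(self, node_id: int, target_id: int) -> int:
        """Overlay-specific metric used to sort friend links."""

    @abstractmethod
    def is_responsible(self, node_id: int, target_id: int) -> bool:
        """True if node_id should answer lookups for target_id."""

class KademliaAdapter(OverlayAdapter):
    """Kademlia-style adapter: XOR distance, XOR-closest node is responsible."""
    def __init__(self, all_nodes: list[int]):
        self.all_nodes = all_nodes

    def distance(self, node_id: int, target_id: int) -> int:
        return node_id ^ target_id

    def is_responsible(self, node_id: int, target_id: int) -> bool:
        return node_id == min(self.all_nodes,
                              key=lambda n: self.distance(n, target_id))

adapter = KademliaAdapter([0b0010, 0b0111, 0b1100])
print(adapter.distance(0b0010, 0b0110))        # 4 (XOR metric)
print(adapter.is_responsible(0b0111, 0b0110))  # True: 0b0111 is XOR-closest
```

A Pastry or Chord adapter would instead implement `distance` as ring distance in the one-dimensional identifier space, leaving the routing algorithm untouched.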
5.4 Experimental Results
The evaluation of our system is presented in Chapter 10. In particular, we will evaluate SocialPastry and the goLLuM algorithm by using different friendship models for social networks. We will show that routing via friends and friends-of-friends is possible if a path between the sender and the receiver of a message exists. Furthermore, we will show that the number of visited nodes is reasonable. We will observe that the probability of failure increases with every step a lookup message is forwarded; in other words, the probability of finding a proper next hop among all friends decreases. Formally, let pi be the probability at node i of having a good next hop in the friend list. Then the probability of a lookup being successful is approximately ∏_{i=1}^{k} p_i, where k is the number of hops traversed.
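For a uniform per-hop probability the estimate decays geometrically with the path length; the numbers below are purely illustrative, not measurements from Chapter 10.

```python
# Illustration of the success estimate prod_{i=1}^{k} p_i for uniform p_i = p:
# even a high per-hop probability shrinks quickly as the hop count k grows.
from math import prod

def lookup_success(p_per_hop: list[float]) -> float:
    """Probability that every hop on the path finds a good next friend."""
    return prod(p_per_hop)

print(round(lookup_success([0.9] * 4), 4))   # 0.6561
print(round(lookup_success([0.9] * 10), 4))  # 0.3487
```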
[1] http://rfc-gnutella.sourceforge.net/developer/stable/index.html
Chapter 6 Data Availability: a trusted storage approach

Although DOSNs can help to protect data privacy, maintaining data availability becomes a big challenge: the profile of a user should be highly available regardless of the user's own connectivity to the system. Replication is one of the most popular approaches to manage this problem. In this approach, k copies are created of each user's profile, or of each data item published by a user. A replica of the user's profile may be stored on nodes of the distributed system which have no social relationship with the profile owner, but this requires the use of cryptographic protocols. These techniques require heavy computation, and they are not useful when considering the diffusion of information in a DOSN, since friends in general, and in particular the Dunbar friends of the owner of the profile u, are in any case able to see u's private data. Although data replication helps to improve data availability, it introduces the problems of consistency and of minimizing the number of replicas. Furthermore, an increasing amount of data is generated on OSNs nowadays, and users often access OSN services through mobile devices, such as smartphones, which have a limited storage capacity with respect to a personal computer. In this chapter, we present our two replication-based approaches to manage the problem of data availability, characterized by different levels of trust. In the rest of the chapter we use friends and Dunbar's friends interchangeably. In detail, the rest of the chapter is organised as follows:
• in section 6.1 we introduce our Trusted Social Storage approaches by providing the basic concepts;
• in section 6.2 we propose our trusted k-replicas approach, in which we use a fixed number of Points of Storage to store the replicas of the user's data;
• in section 6.3 we propose our trusted social storage approach based on Network Coverage. The content of section 6.2 is based on material that appeared in the following publications: Trusted Dynamic Storage for Dunbar-Based P2P Online Social Networks. Conti, M., De Salve, A., Guidi, B., Pitto, F., and Ricci, L. In On the Move to Meaningful Internet Systems: OTM 2014 Conferences (pp. 400-417). Springer Berlin Heidelberg. DiDuSoNet: A P2P architecture for distributed Dunbar-based social networks. Guidi, B., Amft, T., De Salve, A., Graffi, K., and Ricci, L. (2015). Peer-to-Peer Networking and Applications, 1-18.
6.1 Trusted Social Storage
Distributed Online Social Networks have different characteristics with respect to other distributed networks, and the selection of the nodes on which to store profile replicas should be driven by social relationships. As explained in detail in Chapter 4, we use a DHT to provide the lookup service and to support the evolution of the ego networks. In our system a user's profile is composed of public and private data. Public data are stored inside the DHT and are available to all nodes participating in the network. By contrast, data stored inside the Dunbar-based Social Overlay are private and available only to friends. Private data are stored only on the owner of the data and are replicated on elected nodes chosen from the set of friendly nodes. In our approach, a number of profile replicas are stored on other nodes in the network. Each user dynamically elects a subset of his Dunbar friends, which we will refer to as Points of Storage (PoS), to store a replica of its social data. In case a generic node n leaves the network, its friends have to find out which nodes are PoS for n and therefore hold a copy of n's data. Since the PoSs may change over time (due to churn), it is important that all friends of n are able to find n's PoSs. We use the DHT to maintain the knowledge about the PoSs: for every node n we store the list of its PoSs in the DHT, and whenever one of n's PoSs changes, the list is updated. We consider pure availability as the main problem to solve, because a user profile should always be available regardless of the behaviour of friends. We use our Dunbar-based Social Overlay to propose two independent data availability solutions, which manage data availability with two different levels of trust:
• Basic Trust: a node m has the possibility to get the profile of a friend node n through one of the PoS nodes of n, which does not necessarily belong to the friends of m. We decide to elect only 2 PoSs per node because, as we show in detail in section 6.2, 2 PoSs guarantee a high availability.
• Full Trust: we guarantee that, given a node n, all its friends can easily find the profile of n through trusted connections defined through common friends. The number of Points of Storage (PoSs), in this case, depends on the topology of each ego network.
6.2 Basic Trust: k-trusted-replicas approach
The first approach we propose uses a fixed number of PoSs for each user, guaranteeing a limited number of replicas in the network and an easy data-consistency solution. This approach is based on the basic assumption that the private data of a user are always stored on one of its Dunbar friends, but we allow two Dunbar friends of a node u, which may not be friends of each other, to open an ad hoc connection to transmit u's profile when u is not online. In the next section we will show an alternative solution guaranteeing a higher level of trust. When a node n is online, it is a PoS for itself, and it needs to elect k PoSs to guarantee that k online copies of its profile exist. The node has to send a copy of all its data objects to the PoSs via point-to-point connections over the Dunbar-based Social Overlay, as well as the list of all friends belonging to n's ego network. Additionally, both nodes frequently exchange ping-pong messages to detect node failures (sudden departures). To assist the reader, Table 6.1 summarizes all the symbols used in this section.

Definition 6.1. We define Prof(u) as the profile of a node u, PoS(u) as the set of Points of Storage of node u, and PoS⁻¹(u) = {n | u ∈ PoS(n)} as the set of nodes for which u is a Point of Storage.

We assume that user u's knowledge of the social network corresponds to its Dunbar-based ego network, i.e., the set of its Dunbar friends df(u), the state (online/offline) of each user in df(u), and the tie strength of each link in {(x, y) | x, y ∈ {u} ∪ df(u)}. A user u also knows the set of nodes in PoS⁻¹(u). Friend nodes can download data objects (e.g. the profile) directly from one of the PoS nodes, but the procedure requires that, if the owner is online, all requests are managed by the owner itself. As we will see in Chapter 9, the value k = 2 guarantees a good profile availability, representing a good trade-off between high availability and a reduced number of replicas.
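The ping-pong failure detection mentioned above can be sketched as follows; this is our own minimal illustration, and the three-period timeout is an assumed parameter, not a value fixed by the thesis.

```python
# Sketch of ping-pong failure detection between a node and its PoS
# (the 3-period timeout is an assumed parameter for illustration).
class PingMonitor:
    TIMEOUT = 3  # missed-pong periods after which the peer is presumed failed

    def __init__(self) -> None:
        self.last_pong: dict[str, int] = {}

    def pong_received(self, peer: str, now: int) -> None:
        self.last_pong[peer] = now

    def suspected_failures(self, now: int) -> list[str]:
        """Peers whose last pong is older than TIMEOUT periods; suspecting
        a PoS triggers the election of a replacement replica."""
        return [p for p, t in self.last_pong.items() if now - t > self.TIMEOUT]

mon = PingMonitor()
mon.pong_received("pos-d", now=10)
mon.pong_received("pos-i", now=13)
print(mon.suspected_failures(now=14))  # ['pos-d']: no pong for 4 periods
```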
When a user u disconnects from the system with a notification, it executes the following two steps:
• it elects a second PoS for itself among its Dunbar friends;
PoS(u)               The set of Points of Storage of node u
PoS⁻¹(u)             The set of nodes that have node u as a Point of Storage
SocScore_ego_n(u)    The social score assigned to a node u, which is a neighbour of node n
EN(u)                The ego network of node u
V_u                  The nodes contained in the ego network of u, together with u itself
E_u                  The edges contained in the ego network of u
TieStrength_n^u      The tie strength from node n to node u
CG_n(u)              The gain, in terms of trusted connections, obtained by the election of u as PoS(n)
MSL(u)               The medium session length of node u
Prof(u)              The profile of node u
df(u)                The set of nodes included in the Dunbar-based ego network of u
ef(u)                The set of nodes included in the ego network of u

Table 6.1: Symbols used for the Basic Trust approach
• it exploits PoS⁻¹(u) to elect a new PoS for each node n in PoS⁻¹(u) that is offline at that moment (if a node is online, it is able to elect another PoS by itself).
To implement the election of a new PoS for n, u must also know the list of all friends of n and all the social information about them required to implement the PoS selection strategy that we will describe in Section 6.2.1, for instance the tie strength and the average session length of these nodes. Each node has a PoS Table, which contains the list of its PoSs and is stored on the DHT. When a node is elected as PoS for some user u, its SocialID is added to user u's entry list in the PoS Table. When it leaves the network, it deletes its entry in the PoS Table of u, but it keeps a copy of Prof(u) until its reconnection. When it reconnects to the system, it has to check whether there is at least one online PoS for node u: if it finds one, it discards its local copy of Prof(u); otherwise, it elects itself as PoS of node u, and an old version of Prof(u) will be available. Through the PoS Table, each user always knows the current PoSs of its friends and can access the newest version of their profiles. We assume there is an authentication mechanism that allows access to user u's PoS information only to u's friends. Note that friends of u might exchange u's data, but if they are not mutual friends themselves, they cannot host each other's data. Another challenge for our approach is robustness against single-node failures: a PoS may disconnect suddenly, voluntarily or because of some failure, without being able to pass a replica of its data to another node. To address this, we keep two online copies of each user's profile at any time: an online node n must have one online PoS, so that a copy is still present in the system even in case of a failure of n or of PoS(n). For the same reason, an offline node with more than one online friend must have two online PoSs.
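The reconnection rule just described can be sketched as follows; function and node names are our own illustration, with a dictionary standing in for the DHT-hosted PoS Table.

```python
# Sketch (assumed names) of the reconnection rule for a returning PoS:
# drop the stale replica if some other PoS of u is online, otherwise
# re-elect yourself so that at least an old version of Prof(u) is served.
def on_reconnect(me: str, user: str, pos_table: dict[str, list[str]],
                 online: set[str], local_replicas: dict[str, bytes]) -> str:
    others_online = [p for p in pos_table.get(user, []) if p in online and p != me]
    if others_online:
        local_replicas.pop(user, None)      # a fresher copy exists elsewhere
        return "discarded"
    pos_table.setdefault(user, []).append(me)  # serve the (possibly old) copy
    return "re-elected"

table = {"u": ["d"]}
replicas = {"u": b"profile-bytes"}
print(on_reconnect("i", "u", table, online={"d", "m"}, local_replicas=replicas))
# -> 'discarded': PoS d is online, so i's stale local copy is dropped
print(on_reconnect("i", "u", table, online={"m"}, local_replicas=replicas))
# -> 're-elected': no online PoS, so i registers itself in u's PoS Table
```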
6.2.1 Social Score and Selection Strategies
In this section we introduce the criteria exploited for the selection of the PoSs of a node n. To this end, we introduce the social score SocScore_ego_n(x) of a user x, Dunbar friend of a user n, as follows:

SocScore_ego_n(x) = (α · TieStrength_n^x) · (β · CG_n(x)) · (γ · MSL(x))

where the terms TieStrength_n^x, CG_n(x) and MSL(x) are normalized, and α, β and γ are weights autonomously chosen by the users according to the relevance given to each criterion. The tie strength between two nodes may be computed as a combination of many factors, such as the contact frequency between the two users, the number of likes,
posts, comments, private messages, tags, etc. We exploit the approach of [101] to compute the tie strength; the computation is described by an algorithm in Chapter 4. The ConnectionGain CG_n(x) of an online node x in the ego network of a user n is the gain, in terms of trusted connections, obtained by the election of x as PoS(n), and it is proportional to the number of common neighbours of x and n. If CG_n(x) has a high value, then x may transfer the profile of n to many neighbours of n through trusted connections. MSL(x) is the Medium Session Length and represents the average duration of a user session. We approximate this measure so that it is able to distinguish at least between users almost always connected to the social network (e.g. through mobile devices), those that connect only for short periods of time, and those that use the social network as a means of social interaction and therefore have quite long sessions on average. We propose a set of selection strategies which show how the different values influence the goodness of a PoS. We call Fair Strategy the strategy that assigns the value 1 to all the weights.

1) Maximizing the trust (MaxTrust). In this approach the social score maximizes trust (β and γ are equal to 0). A user selects as its PoS the friends with whom it has the strongest relationships, in order to store data on the most trusted friends. Hence, the online friend with the highest value of tie strength with the user is selected as its new PoS.

2) Maximizing the trusted connections (MaxTrustConnect). In this strategy, the social score is obtained with α and γ equal to 0 and β equal to 1. We select as PoS for a user u the node y in df(u) with the highest value of the ConnectionGain. The rationale for this strategy is that the users interested in accessing the profile of a user u are its friends, and the elected PoS should send the profile of u to these nodes when they require it and u is offline.
If these nodes are, in turn, friends of the PoS, the number of connections established between nodes that are not Dunbar ego friends of each other is reduced. We call such connections untrusted, whereas we call trusted all the connections established between a user u and one of its Dunbar ego friends.

Definition 6.2. The set of online nodes whose connection to y to access Prof(n) would be trusted is the Trusted Connection Set, defined as TCS(y, n) = {y} ∪ {x | x ∈ df(y) ∩ df(n) ∧ x.isOnline()}.

Since we elect two PoSs for an offline user u, only the online nodes in TCS that have not already got a trusted connection to the other PoS of n contribute to the ConnectionGain.
Definition 6.3. The Uncovered Trusted Connection Set (UTCS) is defined as UTCS(y, n) = TCS(y, n) − {x | PoS*(n) ∈ df(x)}, where PoS*(n) is the online PoS of the user n.

In the following, df(p) denotes the set of the Dunbar friends of a node p, while ef(p) denotes the set of all ego friends of a node p. We select as PoS for a user u the node y in df(u) with the highest value of the ConnectionGain, which is formally defined as:

CG_n(y) = Σ_{j ∈ UTCS(y,n)} f_j^n

In order to maximize the number of trusted connections between ego friends, we approximate the probability that a node j contacts PoS(n) with its contact frequency f_j^n. Note that the probability that a user belonging to ef(u) − df(u) contacts PoS(u) is negligible. Consider a node p ∈ df(u) that has been selected as one of u's PoSs: the common ego friends of u and p can access Prof(u) through their trusted link to p.

Example 6.4. The values of CG_X calculated for the nodes of the graph in Fig. 6.1(a), when no PoS(X) has been elected yet, are shown in Fig. 6.1(b). In this scenario node D will be elected as PoS(X). The values of CG_X calculated after D's election as PoS(X) are shown in Fig. 6.1(c): node I would be elected as the second PoS for node X at the moment of its disconnection from the system. In this example we would obtain PoS(X) = {D, I}, and 8 out of the 11 nodes in EN(X) would be able to access Prof(X) through a trusted connection.
Minimizing the number of profile transfers (MinTransf) This strategy minimizes the number of times a profile has to be transferred to another node (α and β equal to 0). We consider the user's behaviour, in particular the session length, as the basic property of this strategy. Differently from [12], we do not assume that users' online time is known a priori; instead, we need an estimate of the time an online user will remain online before its disconnection. A first estimate of the availability of a node is given by the simple formula:

NodeAvailability = AvgOnlineTime / AvgOfflineTime
However, this is not a precise enough measure for our goal. Consider, for example, a node x which is online on average 3 hours per day. This value does not describe how those 3 hours are distributed during the day. The node could be online for 3 consecutive hours, in which case it would be a good candidate for being elected PoS. But it could also accumulate them over a large number of short sessions, in which case it is not a good candidate, because its continuous disconnections imply dynamic data
CHAPTER 6. DATA AVAILABILITY: A TRUSTED STORAGE APPROACH
Figure 6.1: (a) X's ego network EN(X), (b) nodes' CG_X when no PoS(X) has been elected yet, (c) nodes' CG_X after D's election as PoS(X)

transfers. A measure based on the session length is more precise than NodeAvailability. For this reason we have introduced the MSL value, which estimates the remaining session length. Since session lengths present a certain level of homogeneity, we are able to estimate the remaining online time of a user from the session length. The MSL is defined as follows:

MSL = max(MediumSessionLength − CurrentUptime, 1)

We define a greedy strategy by choosing as PoS of a node the Dunbar neighbour which is online and most likely to remain connected to the system for the longest time interval.
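A minimal sketch of the MSL estimate, assuming the node tracks its past session lengths and its current uptime (all names are ours):

```python
def msl(session_lengths, current_uptime):
    """MSL = max(MediumSessionLength - CurrentUptime, 1).

    session_lengths: past session durations of the node (same time unit
    as current_uptime); the medium session length is their mean.
    """
    medium = sum(session_lengths) / len(session_lengths)
    return max(medium - current_uptime, 1)
```

A node with a high MSL is expected to stay online long enough to avoid further profile transfers, which is exactly what MinTransf optimizes for.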
6.2.2
PoS Election Algorithms
In this section we describe the algorithms for the election of PoSs, which must guarantee that each online node has one PoS among its online friends and that each offline node with more than one online friend has two PoSs among them. The three possible scenarios in which the algorithm is executed are:
• when a user u is going to disconnect from the system;
• when a user u is going to disconnect from the system and it is a PoS for at least one node in the system;
• when a user u disconnects from the system without a notification (e.g., a crash; in this case the algorithm is obviously not executed by u, but by its friends).
Algorithm 4 Best PoS Selection
function SelectBestPoS(p, Set)
    if ∃ x ∈ Set | x.isOnline() then
        get SocScore_ego_p(v) ∀ v ∈ Set | v.isOnline();
        select n | SocScore_ego_p(n) = max_v {SocScore_ego_p(v)};
        return n;
    else
        return null;
    end if
end function
First, we define an auxiliary procedure SelectBestPoS(p, Set) that, given a user p and a set of nodes Set, selects the best PoS for p among them. Algorithm 4 gets SocScore_ego_p(v) for all online nodes in the set and elects as the new PoS(p) the node with the highest Social Score. If all nodes in the set are offline the algorithm returns null, otherwise the selected node is returned. Algorithm 5 is executed when an online node p receives a disconnection notification from the node that is its only current PoS, and when a node is going to disconnect from the system and needs to elect its second PoS.

Algorithm 5 PoS election for a node p
function Election(p)
    Set = df(p) − PoS(p);
    n = SelectBestPoS(p, Set);
    if n ≠ null then
        send ⟨Prof(p), ef(p), TieStrength_p*⟩ to n;
        update PoSTable(p);
    else
        Set = ef(p) − PoS(p);
        n = SelectBestPoS(p, Set);
        if n ≠ null then
            send ⟨Prof(p), ef(p), TieStrength_p*⟩ to n;
            update PoSTable(p);
        end if
    end if
end function
Algorithm 5 defines the procedure to elect a new PoS for a node p. The best PoS is chosen among df(p) first, and among ef(p) if all nodes in df(p) are offline. The selected PoS receives from p the profile of p, Prof(p), and the list of Dunbar
Algorithm 6 PoS election for an offline node, executed by a disconnecting node
function OnDisconnect(p)
    for all u ∈ PoS⁻¹(p) | ¬u.isOnline() do
        Election(u);
    end for
end function
friends of p, df(p), together with all the tie strength values between p and its friends (TieStrength_p*). Afterwards, p's entry in the PoS Table is updated with the information about p's new PoS. Let us now consider the procedure executed by a node n which is going to disconnect from the system with notification and which is the PoS for at least one user. n notifies its disconnection to the online nodes that belong to PoS⁻¹(n), and these nodes will elect a new PoS on their own. As regards the offline nodes in PoS⁻¹(n), n must select a new PoS for each of them among their neighbours. Notice that n received the list ef(p) at the moment of its election as PoS(p). As in the election procedure, the best PoS is the node with the highest Social Score. Algorithm 6 shows the procedure just described.
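Algorithms 4 and 5 can be sketched in Python as follows (a simplified single-process model; `social_score`, `send_profile` and the in-memory PoS table `pos` are illustrative assumptions, not the actual message-passing implementation):

```python
def select_best_pos(p, candidates, is_online, social_score):
    """Algorithm 4: pick the online candidate with the highest Social Score."""
    online = [v for v in candidates if is_online[v]]
    if not online:
        return None
    return max(online, key=lambda v: social_score[p][v])

def election(p, df, ef, pos, is_online, social_score, send_profile):
    """Algorithm 5: try Dunbar friends first, then the full ego network."""
    for candidates in (df[p] - pos[p], ef[p] - pos[p]):
        n = select_best_pos(p, candidates, is_online, social_score)
        if n is not None:
            send_profile(p, n)   # transfer Prof(p), ef(p), tie strengths
            pos[p].add(n)        # update the PoS table entry for p
            return n
    return None                  # no online friend: no PoS can be elected
```

If all Dunbar friends are offline, the fallback to ef(p) mirrors the else branch of Algorithm 5.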
6.2.3
Dynamism and Points of Storage (PoS)
One of the main problems that has to be faced in DOSNs is their dynamism. In this section we explain how our data availability service protocol manages the problem of dynamism (due to churn) from the point of view of a PoS. As long as there is at least one online friend for each node, availability can always be satisfied; but if no friend is online, the data stored in the system will not be accessible by any means. As long as no friend of the user is online this is a minor issue, since no one is interested in accessing the user's profile. But if a friend of the user reconnects to the system and does not keep a copy of the user's profile in memory, it cannot access the user's data until one PoS of the user, or the user itself, reconnects to the system. We also have to consider that in most popular OSNs users usually do not disconnect from the system through a log-out phase, and this represents a hard challenge for the problem of data availability. Consider the ego network of the node n shown in Figure 6.2, in which the two PoSs for n are selected: n itself and its Dunbar friend A. We now explain the three scenarios that may occur and how they are managed by our system:

Case 1: Voluntary disconnection of a PoS When a node A, which is a PoS, is going to disconnect from the system, it sends a notification to the other
PoS, in this case the node n. When n receives the disconnection message from A, it knows that it is responsible for the election of the second PoS. Furthermore, A exploits PoS⁻¹(A) to elect a new PoS for each node in PoS⁻¹(A) that is offline at that moment (if a node is online, it is able to elect another PoS by itself). In Example 6.5 the new PoS is the node B; nodes n and B then start to exchange ping-pong messages. As a last operation, the changes are stored in the PoS Table.

Figure 6.2: Dunbar Ego Network of Node n.
Figure 6.3: PoS re-election for an offline node
Example 6.5. Consider the graph in Fig. 6.3, where all nodes except F are online; cylinders next to a node represent the data stored by it. Each online node has a PoS in the network, and F, which is offline, has its profile replicated on two different nodes. Suppose node A is going to disconnect from the system. Before its disconnection it checks all nodes that belong to PoS⁻¹(A) and detects that C and H are online
and will be able to elect a new PoS by themselves. But since F is offline, A must elect a new PoS for F, which will be chosen among F's Dunbar ego friends, namely {E, G, H, I}.
Figure 6.4: Case 1: Ego network of the node n after the disconnection of A.

We have the same behaviour even if n is the PoS that wants to leave the system. The difference is that the disconnecting PoS is the owner of the profile (n), so it elects the second PoS for itself.

Case 2: Involuntary disconnection of a PoS This happens when a PoS leaves the network suddenly, without a notification about its departure. In Figure 6.5(a) node n leaves the network suddenly. B is able to detect this failure since n no longer responds to incoming ping messages. Nodes which request data from n will be informed about its absence and can still ask node B for the required data.
(a) Ego Network of Node n after its departure.
(b) Ego Network of Node n and its two PoSs after its departure.
Figure 6.5: Case 2

Furthermore, node B is able to select a second PoS (node C, as shown in Figure 6.5(b)) in the ego network of n. As in Case 1, node B sends a copy of n's data
to node C, which is now the second PoS. Keep-alive messages are exchanged between nodes B and C.
Case 3: Involuntary disconnection of both PoSs This happens when both PoSs leave the network involuntarily at the same time (e.g. due to failures in the network), as shown in Figure 6.6(a). Neither of the two PoSs is able to inform the network about its departure. Nodes which request n's data will find the PoS Table, but will also detect that both PoSs are unavailable. In this case, no data is available in the network.
(a) Ego Network of Node n after a case of involuntary disconnection of its PoSs B and C.
(b) Ego Network of Node n: reconnection of its old PoS A after an involuntary disconnection of both B and C.
Figure 6.6: Case 3
If during this time node A, which has been a PoS before and still has a copy of n's data, joins the network again, it searches for the current PoSs (which are still listed in the PoS Table) to check whether there is an online copy of n's profile, in which case it would delete its old copy. A will detect the absence of any PoS, so it is the only node which still has a copy of n's data (there might be other friends holding a copy, but they do not know that both PoSs are down). A can now elect itself as PoS of n and update the PoS Table. As a next step, node A selects a second PoS and updates the PoS Table again (Fig. 6.6(b)). D, which is the new PoS for n, receives a copy of n's data and a list of n's friends. If node n joins the network again, it learns about the current PoSs by accessing the PoS Table. According to our model, when n is online it has to be a PoS for itself, so n will decide which node between A and D will be the second PoS.
6.2.4
Data Consistency
In this section we discuss the process of updating data objects to manage data consistency. Following the situation explained in Case 3, if node n creates a new data object or changes existing data, it informs the second PoS about this (the node A, as shown in Figure 6.7). In Figure 6.7 it can be seen that node A sends the updated data to n, which is online. Node D should delete its copy of n's profile because other online copies of this profile already exist.
Figure 6.7: Ego Network of Node n after the reconnection of n. D is not a PoS and the updated data are stored on A and n.
(a) Ego Network of Node n after both PoSs fail.
(b) Reconnection of the old PoS C.
(c) C checks if at least one PoS is online. It detects that both PoSs are offline, elects itself as PoS, and elects D as second PoS.
Figure 6.8: PoS dynamism and Data Consistency.

Now consider Case 3 again, in which both PoSs fail (Fig. 6.8(a)). If a node which has been a PoS before detects the absence of both PoSs (Fig. 6.8(b)), it will
select a second PoS immediately (Fig. 6.8(c)), choosing it among the online friend nodes of n. When node n joins the network again and detects old data objects on both PoSs, it sends copies of the new data objects to the PoSs. Afterwards, n can decide which of C and D will be the second PoS, and it updates the PoS Table. When the owner of a profile (the node n in the previous case) joins the network again and there are two online PoSs in the network, it has to decide which one is the most suitable to remain the second PoS. This choice is made by considering the Social Score value: n computes and evaluates which of the two nodes has the highest SS. It then notifies the unselected node that it is no longer a PoS, and it updates the PoS Table.
6.3
Full Trust: Network Coverage
The approach described in this section is based on the concept of network coverage and is inspired by the solution proposed in [106], where particular nodes, called Social Caches, are used to reduce the cost of update diffusion over a network. Social Caches act as local bridges for their friends in order to reduce the number of connections necessary to collect the social updates in DOSNs. To guarantee network coverage, each pair of nodes that are not Social Caches must have a common friend which is a Social Cache. Taking inspiration from this solution, we propose an approach where the Points of Storage of a node take the role of the social caches and act as bridges for an offline node, providing its private data to their mutual friends. In this solution we distinguish two different kinds of Point of Storage for a node x. Each node x has a Local Point of Storage (LPS), which stores a backup copy of its data as long as x remains online. This allows the system to face abrupt crashes which could lead to data loss. When the node is online, it is able to provide its data to all its friends through trusted connections. When the node goes offline, the data must be replicated on a set of nodes which guarantee that all the friends of x can receive the data through trusted connections. These nodes are called the Social Storages (SS) of x. The Social Storages are selected by applying the notion of network coverage. Hence, the solution we propose in this section satisfies the following requirements:

• each peer stores its private data only on nodes which are contained in its Dunbar-based Ego Network;
Table of Symbols

PoS(u)              A node which is a Point of Storage for the node u. A node is a PoS when it has one of the following roles: Social Storage (SS) or Local Point of Storage (LPS)
LPS(u)              A node which is a Local Point of Storage for the node u
SocScore_ego_n(u)   The Social Score assigned to a node u, computed by considering the ego network of the node n
N_u                 The online nodes contained in the ego network of u
span(u)             The number of uncovered edges, according to the given network coverage definition
SS⁻¹(u)             The set of nodes for which u is a Social Storage
LPS⁻¹(u)            The set of nodes for which u is the Local Point of Storage

Table 6.2: Table of symbols used in this section
• each online peer p requests the private data of each of its offline friends f through a friend common to both p and f.

Note that while the first requirement is also satisfied by the solution proposed in section 6.2, the second is a new requirement we introduce here.

Example 6.6. Consider Figure 6.9: our aim is to guarantee that the data of C and D, which are offline, are accessible to their online friends. Suppose that E and G are Social Storage nodes: E has a copy of C's profile and G has a copy of D's profile. Note that the online friends of C, namely A and F, are able to retrieve C's profile from E, which is a friend they have in common with C, while the online friends of D, namely B and F, are able to retrieve D's profile from G.
Figure 6.9: Network Coverage through PoSs: an example
To assist the reader, Table 6.2 summarizes all the symbols used in this section. Let us first introduce the notion of Neighbour-Dominating Set, which we will then exploit for our notion of coverage.

Definition 6.7. Let N(u) ⊆ V be the set of neighbours of node u, that is, v ∈ N(u) iff (u, v) ∈ E. A Neighbour-Dominating Set of a graph G = (V, E) is a set S ⊆ V of vertices such that for each edge (u, v) ∈ E, there exists a w ∈ S satisfying w ∈ (N(u) ∩ N(v)) ∪ {u, v}. Intuitively, for each edge (u, v), either one of its endpoints or one of the common neighbours of u and v must be in the neighbour-dominating set.

Definition 6.8. Given a graph G and a set of vertices V* ⊆ V(G), the subgraph induced by V* includes the set of vertices V* and the set of edges E* = {e ∈ E(G) | e = <u, v>, u, v ∈ V*}.
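A membership check for Definition 6.7 can be sketched as follows (the graph is given as an edge list plus adjacency sets; all names are ours):

```python
def is_neighbour_dominating(edges, adj, S):
    """True iff S is a Neighbour-Dominating Set: for every edge (u, v),
    some w in S lies in (N(u) ∩ N(v)) ∪ {u, v}."""
    for u, v in edges:
        witnesses = (adj[u] & adj[v]) | {u, v}
        if not (witnesses & S):
            return False
    return True
```

For instance, in a triangle a-b-c with a pendant edge c-d, the single node c dominates every edge, while {a} leaves the edge (c, d) without a witness.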
96
CHAPTER 6. DATA AVAILABILITY: A TRUSTED STORAGE APPROACH
Definition 6.9. We define the dynamic Dunbar-based Ego Network of a node u as the graph including u and all the Dunbar friends of u, each one tagged with its online/offline status.

Definition 6.10. Given a dynamic Dunbar-based Ego Network, an edge (A, B) of the network is covered if and only if:
• A and B are both online;
• A and B are both offline;
• A is online and B is offline (or vice versa), and either one of the endpoints is a Social Storage or there exists a common friend which is a Social Storage.

These different scenarios are shown in Figure 6.10.
Figure 6.10: Coverage of an edge (A,B)

Note that the social storages can be considered a dynamic overlay which guarantees the coverage of the network and hence the persistence of the data in the network.

Definition 6.11. A dynamic Dunbar-based Ego Network is covered if all its edges are covered.

Definition 6.12. The span of a node u, span(u), is the number of edges not covered in the dynamic Dunbar-based Ego Network of u, where u is considered offline.

Concerning the network coverage, we have to face several related problems:
6.3. FULL TRUST: NETWORK COVERAGE
97
• define a PoS selection strategy which uses the node's structural and temporal properties;
• manage the dynamic behaviour of the distributed network through a re-election algorithm, used when PoSs go offline, to guarantee trust;
• manage storage and consistency.
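The edge-coverage test of Definition 6.10 and the span of Definition 6.12 can be sketched as follows (`adj`, `online` and `ss` are illustrative stand-ins for the ego network adjacency, the node statuses and the current Social Storage set):

```python
def edge_covered(a, b, adj, online, ss):
    """Definition 6.10: an edge is covered if both endpoints share the
    same status, or (mixed edge) an endpoint or a common friend is a SS."""
    if online[a] == online[b]:          # both online or both offline
        return True
    return a in ss or b in ss or bool((adj[a] & adj[b]) & ss)

def span(u, adj, online, ss, edges):
    """Definition 6.12: uncovered edges of u's ego network, u treated offline."""
    status = dict(online, **{u: False})
    return sum(1 for a, b in edges if not edge_covered(a, b, adj, status, ss))
```

Electing a Social Storage can only decrease the span, which is what the greedy coverage algorithm of the next subsection relies on.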
6.3.1
Social Score Definition
The choice of the friends where a profile replica has to be stored when the owner of the profile is offline is a fundamental issue in our system. We consider social and structural properties of the social graph, and the user's behaviour, to define our approach. We assign to each node a score, called Social Score, which measures a user's suitability as PoS of a specific node. The Social Score (SocScore) of a user x, which is a Dunbar neighbour of a user n, is computed by considering both structural and temporal properties. In the Full Trust approach we favour nodes with a high degree: we use information about the centrality of nodes to reduce the number of Points of Storage. The other important piece of information concerns the temporal behaviour of users. The SocScore is defined as follows:

SocScore_ego_n(x) = degloc_ego_n(x) · MSL

where degloc_ego_n(x) is the local degree of x in the ego network of n and MSL is the Medium Session Length as defined in section 6.2. SocScore_ego_n(x) is used in the first phase of our SSS algorithm, shown in Algorithm 7. For each ego node u, this phase selects an LPS in its ego network EN(u) by considering the maximum SocScore over all nodes. The first phase terminates with the selection of an LPS.

Example 6.13. Consider the ego network of the node X shown in figure 6.11. The local degrees are: degloc_X(J) = 0, degloc_X(H) = degloc_X(I) = 1, degloc_X(A) = degloc_X(E) = degloc_X(G) = 2, degloc_X(B) = degloc_X(C) = degloc_X(F) = 3, degloc_X(D) = 4. Node D has the highest local degree and is the best choice as Local Point of Storage (LPS) for X. Indeed, LPS_X = D provides X's profile to C, F, E and G. Consider now the ego network of X in Figure 6.12. Node J is the node with the highest global Social Score when the whole network of X is considered, but it is not a good choice to elect node J as LPS.
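The Social Score and the LPS election of the first phase can be sketched as follows (`ego_adj` maps each alter to its neighbours inside the ego network, the ego itself excluded, so set sizes give local degrees; `msl` holds per-node MSL estimates; both names are ours):

```python
def soc_score(x, ego_adj, msl):
    """SocScore_ego_n(x) = degloc_ego_n(x) * MSL(x)."""
    return len(ego_adj[x]) * msl[x]

def elect_lps(ego_adj, msl):
    """First phase of SSS: pick the alter with the highest Social Score."""
    return max(ego_adj, key=lambda x: soc_score(x, ego_adj, msl))
```

Weighting the local degree by MSL prefers well-connected alters that are also expected to stay online, which is the trade-off the section describes.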
Figure 6.11: Local degree computation
Figure 6.12: Local Degree vs. Global degree
6.3.2
Computing Social Coverage
The Social Storage Selection (SSS) algorithm (Algorithm 7) is executed when a node joins the network and does not have a Local Point of Storage. Each node v ∈ V belongs to one of the following categories:

white  the node is not a Social Storage, and the set of edges of EN(v) may not be fully covered.
black  the node is a Social Storage (SS).
red    the node is a Local Point of Storage (LPS).

Algorithm 7 describes the election of an LPS node. Each node, when it goes online, gets the Social Score of all its Dunbar friends and elects as its LPS the Dunbar friend with the highest Social Score. In the algorithm, N_u is the set of online neighbours of u and LPS_u is the Local Point of Storage of u.
Algorithm 7 SSS
procedure LPSElection
    if N_u − {u} ≠ ∅ then
        get SocScore_ego_u(v), ∀ v ∈ N_u;
        select n | SocScore_ego_u(n) = max_{v∈N_u} {SocScore_ego_u(v)};
        LPS_u = n;
        mark n as RED;
    end if
end procedure
To explain in detail how our coverage approach works, we provide a list of all the possible scenarios which can occur in the system.

1) An ego node leaves the network and it is not a Point of Storage When a node x is going offline, it has to compute its ego network coverage, choosing the friend nodes which will store a replica of x's profile. Furthermore, it has to inform all its neighbours of its departure.

Algorithm 8 Disconnection from the system of an ego node x
procedure OnDisconnect
    set C = computeCoverage(x, DunbarEGO(x));
    for all n ∈ C do
        send profile(x);
    end for
    for all v ∈ N_x do
        send logout(x);
    end for
end procedure

The actions executed by x are shown in Algorithm 8. The node going offline computes the Social Coverage C with respect to its Dunbar-based ego network and sends its profile to all the nodes belonging to C; then it notifies all its neighbours that it is going offline. Algorithm 9 describes the coverage procedure, which consists in the computation of the uncovered edges through the value of span(x). Initially span(v) = |E_v|, because no social storage exists in the network. When span(x) > 0, x has not yet computed a complete coverage of the portion of its ego network composed of online nodes, and it has to elect Social Storages until its span becomes 0. The node n which is
Algorithm 9 Ego Network Coverage
procedure computeCoverage(x, G)
    set Coverage = ∅;
    while span(x) > 0 do
        get span(n) ∀ n ∈ G | ¬n.isSocialStorage();
        select m | span(m) = max_n {span(n)};
        Coverage = Coverage ∪ {m};
        set m as BLACK;    ▷ x sends a message to m with x's data
    end while
    return Coverage;
end procedure
elected as Social Storage is marked as BLACK. Furthermore, isSocialStorage() is used to know whether a node is a PoS or not (Social Storage or LPS).

Example 6.14. Consider Figure 6.13. Suppose that node A is going offline. The disconnection is notified to all its friends (B, C, D, E, F and G) and A has to compute the coverage of its ego network. Nodes B, C and D have their edge to A covered through C. Instead, nodes F, E and G have an uncovered edge to A, so A has to cover the uncovered portion of its ego network. The coverage is computed from the span of these nodes: span(E) = 2, span(F) = 3, span(G) = 2. A elects F as Social Storage. After this first step span(A) = 0, and the algorithm terminates.
Figure 6.13: Network Coverage
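The greedy coverage of Algorithm 9 can be sketched as follows; here span(n) is read as the number of currently uncovered edges that electing n would cover, which matches the values of Example 6.14 (all names are ours, and this is a single-process sketch, not the distributed protocol):

```python
def compute_coverage(x, adj, online, edges, ss):
    """Greedy Social Storage election over x's ego network, x treated offline."""
    status = dict(online, **{x: False})

    def covered(a, b, storages):
        if status[a] == status[b]:
            return True
        return a in storages or b in storages or bool((adj[a] & adj[b]) & storages)

    def uncovered(storages):
        return sum(1 for a, b in edges if not covered(a, b, storages))

    coverage = set()
    while uncovered(ss | coverage) > 0:
        # span(n): uncovered edges that electing n would additionally cover
        gains = {n: uncovered(ss | coverage) - uncovered(ss | coverage | {n})
                 for n in adj if n != x and n not in ss | coverage}
        m = max(gains, key=gains.get)
        if gains[m] == 0:
            break                       # remaining edges cannot be covered
        coverage.add(m)                 # mark m BLACK; x sends its data to m
    return coverage
```

On a small ego network where F is adjacent to E and G (as in the example's uncovered portion), F covers all three mixed edges at once and is elected alone.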
2) A node n leaves the network and it is a Point of Storage for an online node x In this scenario a node n, which is a Point of Storage (LPS or Social Storage) for a node x, is going offline. The node x is online and receives a disconnection notification from n (the logout message in Algorithm 8). The node x is able to elect another Point of Storage for itself: if n is the LPS of x, then x executes Algorithm 7; otherwise, if n is a Social Storage, then x executes the computeCoverage() procedure shown in Algorithm 9 to check whether its ego network is still covered.

Algorithm 10 Receiving a disconnection notification, executed by node x
procedure OnMessageReceive(disconnect(n))
    if x ∈ LPS⁻¹(n) then
        LPSElection();
    end if
    if x ∈ SS⁻¹(n) then
        computeCoverage();
    end if
end procedure
3) A node n leaves the network and it is a Point of Storage for an offline node x When a node n is going offline and it is a Point of Storage for at least one offline node, it has to elect a new Point of Storage for each such node. To maintain trust, the SS has to mimic the behaviour of every offline node x included in SS⁻¹(n) by executing, on x's behalf, the network coverage on the graph induced by x and the friends it has in common with x. This procedure is described by Algorithm 11, where the function CommonDunbarNeighbours(x, y) returns the set of Dunbar neighbours common to x and y.

Example 6.15. Consider the ego network of node x shown in figure 6.14(a). When x goes offline, it elects nodes N and V to cover its ego network. Suppose that node N, which is a SS of x, goes offline. N recomputes the coverage for x on the common neighbours of N and x (N mimics x), which are R, T and K, as shown in figure 6.14(b). The recomputation of the coverage on the subgraph induced by x, R, T and K may return, for instance, two new social storages for x, for example R and K (as shown in figure 6.14(c)).
Algorithm 11 Algorithm executed by a node n which is a SS for an offline node x
procedure OnDisconnect
    for all x | x ∈ SS⁻¹(n) do
        get CN = CommonDunbarNeighbours(n, x);
        G = subgraph induced from CN ∪ {x};
        C = computeCoverage(x, G);
    end for
end procedure
(a) Ego Network of Node x when it goes offline. N and V are elected SS.
(b) Ego Network of Node x after its disconnection.
(c) The induced graph composed of the online common neighbours of N and x. R and K are elected SS to maintain the coverage.
Figure 6.14: A SS node goes offline

4) An offline node joins the network When an offline node n comes online, it has to build its Dunbar-based ego network with its online friends. After that, and considering that a profile request is an on-demand action, it has to check whether its LPS is online; otherwise it needs to elect a new one. The latter can happen in two cases: the first is right after the registration of the node (the first time it logs into the system), and the second is when its old LPS, upon leaving the network, was not able to find a new online LPS.

Example 6.16. Consider the case in which n joins the network, as shown in Figure 6.15(b). It needs to find the Social Storage which stores the profile of w among the set of common friends. It may happen that w went offline while n was offline, as shown in figure 6.15(a), so that there is no Point of Storage for it (no coverage). To manage this scenario, we look for a Point of Storage of w through the common neighbours of n and w. n sends a request for w's profile to y, which is a common neighbour. y checks whether it has common friends with w and forwards the request to all common neighbours except n. In this case it finds that z is a common neighbour and forwards the request to z. Since z is a Point of Storage for w, it provides its profile replica to y, which sends the replica back to n. If no common neighbours are found, we cannot guarantee the availability of w's profile.
6.4
Experimental Results
The evaluation of our trusted storage approaches is presented in section 9.3. The goal of our simulations is to investigate the quality of our data availability service by using our real Facebook dataset, which will be introduced in Chapter 8. In particular, we have evaluated the k-trusted-replicas approach by focusing on the number of available profiles in the network under churn. Furthermore, we analyse the number of nodes which are responsible for storing the published data objects (PoSs) and the costs of our protocol to prevent data loss due to failing nodes. The experimental results indicate that our data availability service protocol, with only 2 PoSs, provides high availability of the data objects under realistic circumstances and realistic user behaviour. Furthermore, data consistency is guaranteed with a number of keep-alive messages proportional to the number of published profiles.
(a) Snapshot of the network when w is going to be offline.
(b) Snapshot of the network when n is going to be online.
Figure 6.15: When an offline node joins the network
Chapter 7

Information Diffusion on Dunbar-based Overlays

In this Chapter we exploit the Dunbar-based ego network to manage the information diffusion problem. We define information diffusion as the process by which a piece of information, called social content (e.g., a message), is spread among users. Information diffusion is a general problem, studied also in OSNs, under different aspects:
• which pieces of information or topics are popular and diffuse the most;
• how, why and through which paths information is diffusing, and will be diffused in the future;
• which members of the network play important roles in the spreading process.

In this thesis we focus our attention on a sub-problem which is particularly interesting for DOSNs: the social update dissemination problem. The social updates are spread over the social overlay presented in Chapter 4, hence over Dunbar-based ego networks. A social update is defined as any social content that users share with their friends (profile information, wall postings, pictures, videos, etc.). Users generate a huge amount of social content inside a social network, which should be disseminated to their direct friends or to a larger extent, depending on the kind of social update. Any update to personal data should be disseminated with a low cost in terms of time and messages. In a generic OSN, users can send emails and instant messages to their friends. Furthermore, users are able to disseminate updates from user to user as new content gets posted to their profile pages, or comments are made in return. As explained in [33], the typical communication patterns are: the unicast pattern, generally required by e-mail and chat, and the multicast pattern, required by profile page posts, as for social update dissemination over newsfeeds (Figure 7.1).
CHAPTER 7. INFORMATION DIFFUSION ON DUNBAR-BASED OVERLAYS
(a) Email and instant messaging pattern, (b) v posts content to own profile page - multicast pattern, (c) Friend posts new content to the profile page of v - multicast pattern
Figure 7.1: Messaging Patterns in an OSN Let Gu be the graph representing the ego network of some user u, and let v be a member of this ego network. The update dissemination problem consists in disseminating an update from any user v to all of the other users in Gu [33]. The two main cases are shown in Figure 7.2: in Figure 7.3(a), the owner of the profile page posts an update. In Figure 7.3(b), a friend of u posts a comment to a previous u’s post.
(a) u posts an update to own profile page
(b) v posts a comment to a previous u’s post
Figure 7.2: The social update dissemination problem over the ego network of node u In [33] are proposed two different gossip protocols to guarantee the diffusion of social updates. Gossip protocols are typically run in rounds, in which each node selects a gossip partner using a randomized selection heuristic and exchanges data with it, based on an exchange strategy. The first protocol is a rumor mongering protocol which provides a push phase to disseminate information quickly. The
second protocol is an anti-entropy push-pull protocol which guarantees that all nodes that become and remain online for long enough eventually receive all updates. Three different node selection heuristics are proposed:
• Random: the node is selected uniformly at random, as in [107];
• Anti-centrality: node selection considers the Degree Centrality property;
• Fragmentation: node selection considers the disconnected components the node permits to cover.

Our approach uses the underlying Dunbar-based Social Overlay and, by exploiting structural characteristics of the network, in particular the betweenness centrality, it provides an efficient protocol to diffuse social updates. We consider the problem of computing the Ego Betweenness Centrality in Distributed Online Social Networks, and we propose two distributed protocols for its computation. Furthermore, we evaluate the Ego Betweenness Centrality on undirected graphs by using the method proposed in [89], and we study the Ego Betweenness Centrality for directed graphs. The outline of this Chapter is the following:
• in section 7.1 we propose an overview of the Ego Betweenness Centrality metric and we explain why this metric is important for information diffusion;
• in section 7.2 we show the computation of the Ego Betweenness Centrality for both directed and undirected graphs;
• in section 7.3 we introduce the Weighted Ego Betweenness Centrality, showing its computation and its properties;
• in section 7.4 we introduce our epidemic diffusion algorithm based on the Weighted Ego Betweenness Centrality metric;
• finally, in section 7.5, we show our two distributed protocols for the Ego Betweenness Centrality computation.

The content of this Chapter is based on material that appeared in the following publications:

Epidemic Diffusion of Social Updates in Dunbar-Based DOSN. Conti, M., De Salve, A., Guidi, B., and Ricci, L. In Euro-Par 2014: Parallel Processing Workshops (pp. 311-322).
Springer International Publishing.
CHAPTER 7. INFORMATION DIFFUSION ON DUNBAR-BASED OVERLAYS
Distributed protocols for ego betweenness centrality computation in DOSNs. Guidi, B., Conti, M., Passarella, A., and Ricci, L. In Pervasive Computing and Communications Workshops (PERCOM Workshops), 2014 IEEE International Conference on (pp. 539-544). IEEE.
7.1 Ego Betweenness Centrality
Both for information diffusion and for network analysis, a centrality metric is needed. The centrality of a node in a network is a measure of the structural importance of the node: a central node typically has a stronger capability of connecting to other network nodes. There are several ways to measure centrality; the three most widely used centrality indexes are degree centrality, closeness centrality and betweenness centrality [87]. The Betweenness Centrality (BC) index is the most complex one and the measure best suited for describing network communication based on shortest paths and for predicting the congestion of a network. It models the importance of a node in terms of the information flow in the network. The BC of a node v measures the fraction of the shortest paths σst(v) from node s to node t that pass through v with respect to all the shortest paths σst between s and t. This measure is a global property of a node, and a high value suggests that a lot of the information exchanged between other vertices transits through this node. The computation of the Betweenness Centrality is highly time-consuming and, in a distributed network such as a P2P network, where nodes have only a local knowledge of the topology, it is not easily applicable. A more computationally efficient approach is to compute the betweenness on the ego network rather than on the global network topology. A betweenness centrality based on an ego-centric approach, known as Ego Betweenness Centrality, can be considered a good solution to evaluate centrality in a DOSN, since only local ego network information is needed. Indeed, due to the nature of ego networks, computing classic global centrality measures such as the Betweenness Centrality or the Closeness Centrality directly on them is meaningless.
Ego networks contain topological information regarding the ego and its alters, but no information about the whole network structure and connections. For this reason, a centrality measure called Ego Betweenness Centrality has been suggested by Everett and Borgatti in 2005 [89], whose empirical results showed its strong correlation with the Betweenness Centrality metric. The strong advantage of this measure is that only ego network information is needed, without a global view of the network. We present the measure as defined in [89]:
EBC(n) = Σ_{A_n(i,j) = 0, j > i} 1 / A_n²(i,j)        (7.1)

where n is the ego node and A_n its adjacency matrix. EBC(n) is therefore the sum of the reciprocals of the values A_n²(i,j) over the pairs such that A_n(i,j) = 0. This is because we do not consider geodesics (shortest paths) of length 1 among alters: we are only interested in paths of length 2 that pass through the ego. The condition j > i in the sum is justified by the fact that in an undirected graph the adjacency matrix is symmetric. The matrix A_n² contains in position A_n²(i,j) the number of paths of length 2 connecting i and j within the ego network of n. In conclusion, the ego betweenness centrality measure indicates how many shortest paths pass through the ego, out of all possible shortest paths between alters.
(a) Ego network of node E
(b) Links considered in EBC computation
Figure 7.3: EBC example.

As an example to better illustrate this important concept, consider the ego network of node E in figure 7.3(a). In this network, alter B can communicate directly with all the other alters {A, C, D} without going through the ego E (and the same applies to node D). Conversely, if we consider nodes A and C, no direct path exists between them: all the routes connecting them have length 2. These paths are highlighted as solid lines in figure 7.3(b): one of the three possible paths transits through the ego node (A-E-C), while two other routes are possible (A-B-C, A-D-C). The adjacency matrix representing this network is:

        E  A  B  C  D
    E [ 0  1  1  1  1 ]
    A [ 1  0  1  0  1 ]
    B [ 1  1  0  1  1 ]        (7.2)
    C [ 1  0  1  0  1 ]
    D [ 1  1  1  1  0 ]
and its square is:

        E  A  B  C  D
    E [ 4  2  3  2  3 ]
    A [ 2  3  2  3  2 ]
    B [ 3  2  4  2  3 ]        (7.3)
    C [ 2  3  2  3  2 ]
    D [ 3  2  3  2  4 ]

We can notice that the squared matrix is symmetric, as a consequence of the graph being undirected. The value A_E(A, C) = 0 indicates the missing link between nodes A and C, while A_E²(A, C) = 3 is the number of paths of length 2 connecting A and C. Since in the computation we only consider the pairs above the diagonal which are zero in A_E (meaning we only consider paths of length 2 between alters), the EBC for node E is EBC(E) = 1/3, indicating that one (path A-E-C) of the three possible shortest paths passes through the ego.
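The worked example above can be checked with a few lines of code. The following sketch is illustrative (the function name is ours, not part of the thesis): it computes Equation 7.1 directly from the adjacency matrix, with the ego stored at index 0.

```python
import numpy as np

def ebc(A):
    """Ego betweenness of the ego (row/col 0), given the adjacency matrix A
    of its ego network: sum of 1/A^2[i,j] over alter pairs i<j with A[i,j]=0."""
    A2 = A @ A
    n = A.shape[0]
    total = 0.0
    for i in range(1, n):            # alters only (index 0 is the ego)
        for j in range(i + 1, n):
            if A[i, j] == 0:
                total += 1.0 / A2[i, j]
    return total

# Adjacency matrix of E's ego network, rows/cols ordered E, A, B, C, D
A_E = np.array([[0, 1, 1, 1, 1],
                [1, 0, 1, 0, 1],
                [1, 1, 0, 1, 1],
                [1, 0, 1, 0, 1],
                [1, 1, 1, 1, 0]])
print(ebc(A_E))  # 1/3: one of the three length-2 A-C paths crosses E
```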
7.2 Ego Betweenness Centrality Computation
The Ego Betweenness Centrality is the value of the Betweenness Centrality computed using only the nodes and the links in the ego network of a node. The computation of the Ego Betweenness Centrality for an undirected graph has been proposed in [89]. The correlation between BC and EBC has no theoretical foundation; however, it has been verified on random networks and on real networks composed of a small number of nodes [89]. In Chapter 9 we will show this correlation through experimental results, conducted on networks including a larger number of nodes.
Figure 7.4: An example of network graph.

Figure 7.4 shows a simple graph where the nodes E and F have an important role in terms of centrality. They have high BC and EBC values because the link between them permits the information diffusion between two otherwise independent communities, composed respectively by {A,B,C,D,E} and {F,G,H,I,L,M,N}. Table 7.1 lists the EBC, BC and Closeness Centrality values for the graph shown in figure 7.4; all values in the table are not normalised. By analysing the BC and EBC values, it is possible to notice the strong correlation between the two measures.

    Node    BC      CC      EBC
    A       0       0.39    0
    B       0.33    0.41    0.33
    C       0       0.39    0
    D       0.33    0.41    0.33
    E       28.3    0.55    4.33
    F       30      0.61    3.0
    G       7.49    0.55    1.66
    H       4.25    0.52    1.0
    I       7.49    0.55    1.66
    L       0.25    0.41    0.33
    M       0.25    0.41    0.33
    N       0.25    0.41    0.33

Table 7.1: Centrality metrics for the graph in figure 7.4: betweenness centrality (BC), closeness centrality (CC) and ego betweenness centrality (EBC).
Figure 7.5 shows one of the two communities of figure 7.4: the ego network of the node E, shown without the edge (E, F) to simplify the example. The alter D is able to communicate directly with {A,B,C}, and the same holds for the node B. Nodes A and C are able to communicate only through E (path A-E-C), through B (path A-B-C), or through D (path A-D-C).
Figure 7.5: Ego network of the node E.

The BC can be computed in O(nm), as explained in section 3.5; [108] shows an approximation method to compute the BC, whose computational complexity is O(√nm). The EBC computation has complexity O(n³), due to the squaring of the n × n adjacency matrix, where n is the number of nodes contained in the ego network.
7.2.1 Ego Betweenness Centrality Computation in directed graphs
In some scenarios, communications are not bidirectional and the resulting graph is a directed graph: for example, the Twitter social graph is directed, while the Facebook social graph is undirected. The EBC is proposed and evaluated for undirected graphs in [89]. We have analysed the computation of the BC on a directed graph as proposed in [109], and we propose a method to compute the EBC on a directed graph with an ego-centric approach. In [109] it is explained that the computation of the BC for directed graphs is obtained by normalising the measure: given n nodes, the normalisation factor is (n − 1)(n − 2) for directed graphs and (n − 1)(n − 2)/2 for undirected graphs.

Definition 7.1 (Betweenness Centrality for directed graphs). For every vertex v ∈ V of a directed graph G(V, E), the betweenness centrality C_B(v) of v is defined by

C_B(v) = Σ_{s ≠ v} Σ_{t ≠ s,v} σ_st(v) / σ_st        (7.4)
According to Definition 7.1, the computation of the EBC on a directed graph requires some modifications to the basic algorithm. Let us consider a node i. In order to compute the EBC of i, we have to consider all the minimal oriented paths passing through i that connect a node j to a node k, where k is an alter of i, while j may not belong to the ego network of i. This implies that if we consider an alter k of i, we have to consider all the nodes j such that i is in the ego network of j, but k is not in the ego network of j. Since the adjacency matrix is not symmetric, the computation must consider not only the pairs i and j with A_E(i, j) = 0 above the diagonal, but also those below the diagonal.
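One way to read this description in code is the following sketch. It is our interpretation, simplified to consider only pairs inside the ego network (the thesis also allows the source node j to lie outside it), and the function name is ours:

```python
import numpy as np

def ebc_directed(A):
    """Directed EBC restricted to alter pairs inside the ego network
    (a simplification of the text above). A is the non-symmetric adjacency
    matrix with the ego at index 0; all ordered alter pairs (i, j) with
    A[i, j] == 0 are examined, i.e. entries both above and below the diagonal."""
    A2 = A @ A
    n = A.shape[0]
    total = 0.0
    for i in range(1, n):
        for j in range(1, n):
            if i != j and A[i, j] == 0 and A2[i, j] > 0:
                # fraction of the length-2 i -> j paths that cross the ego
                total += (A[i, 0] * A[0, j]) / A2[i, j]
    return total

# Toy ego network: edges 1 -> ego and ego -> 2; the only 1 -> 2 route crosses the ego
A = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 0, 0]])
print(ebc_directed(A))  # 1.0
```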
7.3 Weighted Ego Betweenness Centrality
BC and EBC are topology-based metrics, so their value for a node n depends on the position and on the connections of n in the graph.

Figure 7.6: Example of Dijkstra's algorithm: (a) a simple weighted network; (b) the same network with the inverted weights.

Consider, for instance, Fig. 7.7, representing the ego network of node E, and let us neglect, for the moment, the weights paired with the links of the graph. The value of the EBC for nodes E, D and B is exactly the same, 1/3, since a single shortest path out of 3 different shortest paths between A and C passes through each of them. Therefore, the value of the EBC does not permit us to decide, for instance, which node among them is the best choice for propagating a social update from A to C. This is due to the fact that the EBC considers only structural properties of the graph. Therefore, in an effort to generalize these measures to weighted networks, a first step is to generalize how shortest paths are identified and how their length is defined. If we are looking at the diffusion of information in a network, then the speed at which it travels, and the routes it takes, are clearly affected by the weights. Since the weights in most weighted networks represent tie strength rather than transmission cost, we need to understand how to compute the shortest path. Dijkstra [110] proposed an algorithm that sums the costs of the connections and finds the path of least resistance. For example, GPS devices use this algorithm by assigning a time cost to each leg of the road and then finding the route that costs least in terms of time. This algorithm can also be used in social network analysis: Newman [111] applied it to a collaboration network of scientists by inverting the tie weights (dividing 1 by the weight), so that a stronger tie gets a lower cost than a weaker tie. Figure 7.6(b) shows the network of figure 7.6(a) with the inverted weights. Consider figure 7.6(a) and look at the shortest path from node C to node B: the direct connection carries a weight of 1; however, the indirect connection through node A is composed of stronger ties.
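Newman's inverted-weight trick can be sketched as follows. The concrete tie strengths 2 and 4 on the indirect C-A-B route are assumptions chosen to match the inverted costs 1/2 and 1/4 of figure 7.6(b); the Dijkstra implementation is a standard textbook one, not code from the thesis.

```python
import heapq

def dijkstra(adj, src, dst):
    """Least-cost path in a graph given as {node: {neighbour: cost}}."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float('inf')):
            continue
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float('inf')

# Assumed tie strengths: C-B = 1 (weak direct tie), C-A = 2, A-B = 4
strengths = {'C': {'B': 1, 'A': 2}, 'A': {'C': 2, 'B': 4}, 'B': {'C': 1, 'A': 4}}
# Newman's trick: invert the weights so that stronger ties cost less
costs = {u: {v: 1.0 / w for v, w in nbrs.items()} for u, nbrs in strengths.items()}
print(dijkstra(costs, 'C', 'B'))  # 0.75: the route via A beats the direct edge (cost 1.0)
```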
By applying Dijkstra's algorithm, we find that the direct connection between node C and node B has a cost of 1, whereas the indirect connection via node A has a cost of 0.75 (1/2 + 1/4), as shown in figure 7.6(b). Therefore, according to this algorithm, the information will travel faster through the indirect connection. We now introduce a new algorithm for the Ego Betweenness Centrality which is applied to weighted networks and is completely equivalent to Dijkstra's algorithm. We first need to introduce the concept of the Weighted Ego Betweenness Centrality (WEBC), which discriminates the shortest paths crossing the ego according to the weights of the edges on those paths. The computation of the WEBC requires the adjacency matrix A^n_{i,j} associated with the ego network of node n, ego(n), and the weight matrix W^n_{i,j} which contains the weights paired with the edges of ego(n) (note that a weight may also be equal to 0). Given two nodes i and j belonging to ego(n) such that A^n_{i,j} = 0, let
Path_{i,j}(n) = Σ_{k ∈ ego(n), k ≠ i,j} A^n_{i,k} · A^n_{k,j} · W^n_{i,k} · W^n_{k,j}

be the sum of the weights of all the 2-hop paths between nodes i and j including only edges in ego(n). The WEBC of a node n for an undirected graph is defined as follows:

WEBC(n) = Σ_{i,j ∈ ego(n), A^n_{i,j} = 0, j > i} (W^n_{i,n} · W^n_{n,j}) / Path_{i,j}(n)        (7.5)

where the summation considers all the pairs of nodes i and j in the ego network of n without a direct link between them (A^n_{i,j} = 0). The path between the nodes i and j crossing n has weight W^n_{i,n} · W^n_{n,j}.
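Equation 7.5 can be transcribed almost literally; the sketch below is ours, with the ego stored at index 0. The numerical weights are the ones given for the A-C paths in the example of figure 7.7 discussed below; the remaining edge weights do not affect WEBC(E) and are left as placeholders.

```python
import numpy as np

def webc(A, W):
    """Weighted EBC of the ego (index 0) per Equation 7.5: for every alter
    pair (i, j) with no direct link, weigh the 2-hop path through the ego
    against the total weight of all 2-hop paths i-k-j in the ego network."""
    n = A.shape[0]
    total = 0.0
    for i in range(1, n):
        for j in range(i + 1, n):
            if A[i, j] == 0:
                path = sum(A[i, k] * A[k, j] * W[i, k] * W[k, j]
                           for k in range(n) if k not in (i, j))
                if path > 0:
                    total += (W[i, 0] * W[0, j]) / path
    return total

# E's ego network from figure 7.7, nodes ordered E, A, B, C, D
A = np.array([[0, 1, 1, 1, 1],
              [1, 0, 1, 0, 1],
              [1, 1, 0, 1, 1],
              [1, 0, 1, 0, 1],
              [1, 1, 1, 1, 0]], dtype=float)
W = np.ones((5, 5))  # placeholder weights; only the A-C path weights matter here
for i, j, w in [(1, 0, 0.044), (0, 3, 0.076), (1, 2, 0.22),
                (2, 3, 0.073), (1, 4, 0.001), (4, 3, 0.057)]:
    W[i, j] = W[j, i] = w
print(round(webc(A, W), 2))  # 0.17, the WEBC of E computed in the text
```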
Figure 7.7: Weighted Ego Betweenness on undirected graphs.

Consider the weighted graph in Figure 7.7, where only the weights of the edges on the paths between A and C are specified. We have previously seen that nodes E, D and B are indistinguishable in terms of EBC. The values of the WEBC for nodes E, D and B obtained by applying Equation 7.5 are the following:
WEBC_D = (0.001 · 0.057) / (0.001 · 0.057 + 0.044 · 0.076 + 0.22 · 0.073) = 0.0029

WEBC_E = (0.044 · 0.076) / (0.001 · 0.057 + 0.044 · 0.076 + 0.22 · 0.073) = 0.17

WEBC_B = (0.22 · 0.073) / (0.001 · 0.057 + 0.044 · 0.076 + 0.22 · 0.073) = 0.82
The WEBC highlights that the path connecting A and C crossing B has a higher weight with respect to the other ones. We show the equivalence between our approach and Dijkstra's algorithm by considering the inverted weights:

WBC_D = 1/0.001 + 1/0.057 = 1017.54

WBC_E = 1/0.044 + 1/0.076 = 35.88

WBC_B = 1/0.22 + 1/0.073 = 18.24
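The claimed equivalence between the two rankings can be verified numerically with plain arithmetic on the values above:

```python
# Path weights through D, E and B respectively (from figure 7.7)
paths = {'D': 0.001 * 0.057, 'E': 0.044 * 0.076, 'B': 0.22 * 0.073}
total = sum(paths.values())
webc_vals = {n: p / total for n, p in paths.items()}   # ~0.003, ~0.17, ~0.83
# Inverted-weight costs of the same three paths, as Dijkstra would sum them
costs = {'D': 1 / 0.001 + 1 / 0.057,
         'E': 1 / 0.044 + 1 / 0.076,
         'B': 1 / 0.22 + 1 / 0.073}
best_webc = max(webc_vals, key=webc_vals.get)
best_dijkstra = min(costs, key=costs.get)
print(best_webc, best_dijkstra)  # B B: both criteria select the path through B
```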
Following Dijkstra's algorithm, we would choose the path connecting A and C crossing B, because this path has the lowest cost; and it is the same path selected by the WEBC. We can therefore say that the computation of the betweenness centrality for weighted graphs via Dijkstra's algorithm is equivalent to the computation of the Weighted Ego Betweenness Centrality. In a Dunbar-based overlay where the weights paired with the links of the overlay represent the tie strength between two users (for instance, expressed as the number of interactions between them), we can exploit the WEBC to distinguish, among different paths, the most important one in terms of number of interactions between the nodes on that path. As a matter of fact, if the weights on the paths represent the contact frequency between the nodes, a larger value of the WEBC corresponds to higher values of the tie strength between the nodes involved in the path and to a higher level of mutual trust between these nodes. As we will see in the next Section, this characteristic may be exploited to detect important paths for the diffusion of information. In some scenarios, it is useful to evaluate WEBC(n, a), the WEBC of a node n with respect to a particular alter a, which we define as follows:

WEBC(n, a) = Σ_{j ∈ ego(n), A^n_{a,j} = 0} (W^n_{a,n} · W^n_{n,j}) / Path_{a,j}(n)        (7.6)
In the next section, we will see that WEBC(n, a) can be exploited to evaluate the capability of node n to connect the alter a to nodes in the ego network of n
which are not directly connected to a. An epidemic algorithm can exploit this information to define a heuristic for neighbour selection when propagating a social update. When computing the WEBC of a node n on a directed graph, a possible solution is to consider all the shortest oriented paths passing through n that connect a node j to a node k, where k is an alter of n, while j may not belong to the ego network of n. Furthermore, node k must not belong to the ego network of j, otherwise a direct path exists between j and k.

Figure 7.8: A weighted and directed graph.

In the graph shown in figure 7.8, the WEBC of A considers the weighted oriented path between B and C crossing A with respect to all the weighted oriented paths linking B to C:

WEBC_A = (1 · 1) / (1 · 1 + 1 · 0.5) ≈ 0.67
7.4 WEBC-based information diffusion
In an OSN like Facebook, when a user u produces a social update, for instance by publishing a post on its wall, this update has to be sent to the users which are k hops distant from u in the social graph. The extent of the diffusion depends on the kind of social update and on the privacy settings of the users. For instance, in Facebook, each post published by a user on its own wall should be sent to its 1-hop friends, i.e. to all the users in its ego network. In other scenarios, such as a comment on a post or a photo of a friend, the information may be transmitted to 2-hop distant users, the Friends-of-Friends (FoF). Some particular scenarios require transmitting the update even to 3-hop-away friends, the Friends-of-Friends-of-Friends (FoFoF). In a DOSN, we need a service which manages the social updates by transmitting them to the k-hop neighbours through the links of the social overlay. The definition of an efficient algorithm is fundamental to avoid a huge amount of duplicated updates sent to the same node and to optimize the overall traffic on the DOSN overlay. A simple solution is the flooding technique. However, this approach is inefficient
because it generates a lot of duplicates: when a node receives a social update, it sends the update to all its alters, and the procedure is recursively executed until the maximum number of hops is reached. An alternative solution is to spread the information through an epidemic algorithm and to exploit the WEBC to guide the choice of the neighbours for the propagation of the information. In particular, we exploit the definition of WEBC given in 7.6 to obtain a neighbour selection heuristic: when a node v receives a social update, it passes it to the neighbour n in its ego network maximizing WEBC(n, v). This heuristic is able both to choose the neighbours able to connect v to nodes not belonging to its ego network and to favour the spreading of the update on highly weighted paths, where the nodes are connected by strong ties. Note that this heuristic can be implemented by exploiting only local information. The differentiation of paths based on their weights implies that nodes belonging to highly weighted paths will receive the update earlier. This agrees with the design choices of current social networks, where the importance of a social update is classified according to the tie strength with the node generating it (for instance, each node in Facebook classifies the updates it receives according to this policy, in order to present them on different news feeds on the basis of their importance).
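The neighbour selection step can be sketched as follows. This is a minimal sketch with names of our own choosing; the scores WEBC(n, v) are assumed to have been precomputed and exchanged through the distributed protocols of section 7.5.

```python
def select_next_hop(webc_scores, already_updated):
    """Pick the neighbour n maximizing WEBC(n, v) among those not yet
    known to have received the update; return None to stop spreading."""
    candidates = {n: s for n, s in webc_scores.items()
                  if n not in already_updated}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# v's neighbours with their WEBC(n, v); D has already received the update
print(select_next_hop({'B': 0.82, 'E': 0.17, 'D': 0.003}, {'D'}))  # B
```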
7.4.1 The Information Diffusion algorithm
Let us suppose that a node A generates a social update o. We focus on the case where the update has to be spread two hops away from A, but our approach is also valid for the general case of k-hop diffusion. Consider Fig. 7.9: node A produces an update on its profile which involves a certain number of its neighbours (in the general case, all its neighbours), for example a post which contains tags of some of its friends. In this case, o has to be sent to all of A's friends and to all the friends of the friends tagged in the post.
Figure 7.9: Community discovery.

Our algorithm is organized into two phases:
• Community discovery.
• Diffusion of the social content into each community.

The community discovery phase partitions the i-hop neighbours of X, i ∈ {1, 2}, into a set of communities such that each pair of nodes within a community is connected by a path, while nodes belonging to different communities are not connected. The community discovery algorithm is based on [112]. Fig. 7.9 shows the node A together with its direct social contacts and FoFs. These nodes are shown in two concentric circles, according to their social distance from A. The different communities detected by the algorithm are shown with different colours. In the second phase, the WEBC-based epidemic algorithm is executed. Consider the general case in which all the direct friends of a producer node have to receive the update; let NHops1(X) be the set of nodes 1 hop from X and NHops2(X) the set of nodes 2 hops from X. The algorithm works as follows:

• the producer X of the update o starts the information diffusion by sending o to one node for each community. These nodes are selected on the basis of their WEBC computed with respect to X.
• if a node Y in NHops1(X) receives the update o from another node K in NHops1(X), it checks its ego network to see if it contains neighbours in NHops2(X) which are not also neighbours of K (recall that the ego network of Y also includes the links connecting neighbours of the ego). If this set of neighbours is not empty, Y sends the update to each node in the set. Then Y chooses one neighbour belonging to NHops1(X), selected according to the WEBC-based heuristic, and propagates the update to it.
• nodes in NHops2(X) do not propagate the update.
• each node n maintains the list of updated nodes, UpdatedNodes, which is initially empty. In this list, n records the nodes which are known to have received the update. Whenever n sends an update to one of its neighbours, it adds this neighbour to the list.
• if a node receives a social update but all the neighbours in its ego network belong to UpdatedNodes, it stops the diffusion of the information.

Note that the previous algorithm may be refined in several directions. For instance, in the second step, the check on the common neighbours in the ego networks of the sender and of the receiver does not guarantee that the message will be propagated only to nodes which have not received it previously. As a matter
of fact, each node has only a local and partial view of the nodes of the community, restricted to its ego network. To face this problem, it is possible to pair each update with a history recording the nodes which have already received it, as proposed in [33]. This reduces the number of duplicate updates at the expense of a larger usage of the network bandwidth.
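The UpdatedNodes bookkeeping, combined with the per-update history of [33], can be sketched as follows. This is a deliberately simplified sketch (class and method names are ours) that ignores the community structure and the NHops2 rules:

```python
class Node:
    """Per-node state for the diffusion algorithm: UpdatedNodes tracks the
    neighbours known to have received the update."""
    def __init__(self, name, neighbours):
        self.name = name
        self.neighbours = set(neighbours)
        self.updated_nodes = set()          # UpdatedNodes, initially empty

    def receive(self, update, history):
        """history: nodes recorded network-wide as having seen the update."""
        self.updated_nodes |= history & self.neighbours
        targets = self.neighbours - self.updated_nodes
        if not targets:
            return []                        # everyone has it: stop spreading
        self.updated_nodes |= targets        # record who we forward to
        return sorted(targets)               # forward the update to these nodes

y = Node('Y', {'A', 'B', 'C'})
print(y.receive('post', history={'A', 'X'}))  # ['B', 'C']: A already has it
```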
7.5 Distributed protocols for WEBC computation
The computation of both the EBC and the WEBC needs the knowledge of the ego network of a node, i.e. of all its alters and of the connections between the alters. In a dynamic environment like a DOSN, these connections change continuously. For this reason, a distributed protocol is required to notify the updates occurring in the ego network of a node to all the neighbours of that node. We propose two distributed protocols to compute the EBC through the Dunbar-based social overlay; to the best of our knowledge, no other distributed protocols have been proposed. The protocols we propose can be exploited for the diffusion of the information required for the computation of both the EBC and the WEBC; they do not depend on the kind of information sent on the ego network. The proposed protocols provide a simple mechanism to compute the EBC by building the adjacency matrix of each ego node, and they differ from each other in the update phase of the adjacency matrix. The first protocol, Ego-BC Broadcast (EBC Broadcast), maintains the adjacency matrix up-to-date through communications with all the nodes in the ego network. The second protocol, Ego-BC Gossip (EBC Gossip), maintains the adjacency matrix up-to-date through specific gossip techniques [113]. Both protocols can be used with both directed and undirected graphs.
7.5.1 EBC Broadcast Protocol
For each node v, we define N(v) as the set of nodes in its ego network, and E_v as the set of edges between v and the nodes of its ego network. In this protocol, each node, in a single step, sends to its neighbouring nodes information about its ego network. The protocol is structured according to the following steps:

• Connection: when a node v and a node n join the overlay, they exchange their neighbourhoods N(n) and N(v). Furthermore, the nodes v and n inform their neighbours in N(v) and N(n) about the updates in their ego networks.
This phase is described by Algorithm 12. In this phase, each ego node v exchanges O(|N(v)|) messages, which are the only messages needed to notify the changes in the ego network.

Algorithm 12 The node v connects to alter
    function connect(alter)
        send MessageUpdate(N(v), v) to alter;
        send MessageUpdate(alter, v) to N(v);
        N(v) = N(v) ∪ {alter};
    end function

• Disconnection: when a node v voluntarily leaves the system, it sends a notification message (disconnect) to its neighbours in N(v).

Each node v has a handler for receiving messages, which specifies the steps executed by v for each received message. When an ego node v receives a disconnect message from a node n, it updates its local data structures and communicates the removal of n from its ego network to its neighbouring nodes through a MessageDelete message. Each ego node v exchanges only O(|N(v)|) messages, where |N(v)| is the number of neighbouring nodes. When the node v receives messages containing updates about the ego networks of its neighbours, it updates its local data structures. When a node v logs into the system, it executes one or more connect procedure calls, taking into account the active nodes of its ego network. Each instance of the protocol communicates only the changes occurring in the ego network of a node, either immediately or at whatever frequency is considered appropriate.
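Algorithm 12 and the corresponding update handler might be prototyped as follows. This is a sketch: message passing is simulated with direct method calls, and all names are ours.

```python
class BroadcastNode:
    """Sketch of the EBC Broadcast connection phase."""
    def __init__(self, name):
        self.name = name
        self.neighbours = {}        # N(v): alter name -> BroadcastNode
        self.known_edges = set()    # edges of the ego network seen so far

    def connect(self, alter):
        # send MessageUpdate(N(v), v) to alter
        alter.on_update(self.name, set(self.neighbours))
        # send MessageUpdate(alter, v) to N(v)
        for n in self.neighbours.values():
            n.on_update(self.name, {alter.name})
        # N(v) = N(v) ∪ {alter}; the link is symmetric in this simulation
        self.neighbours[alter.name] = alter
        alter.neighbours[self.name] = self

    def on_update(self, sender, names):
        # record the announced edges sender-n of the sender's ego network
        for n in names:
            self.known_edges.add(frozenset((sender, n)))

a, b, c = BroadcastNode('a'), BroadcastNode('b'), BroadcastNode('c')
a.connect(b)
a.connect(c)
print(frozenset(('a', 'c')) in b.known_edges)  # True: b learned of the new a-c edge
```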
7.5.2 EBC Gossip Protocol
The EBC Gossip protocol is a notify-pull protocol based on the SIR (Susceptible, Infected, Removed) model [113]. The protocol exploits two types of messages:

• UpdateRequest: a request message sent from a node a to a node b to ask for information about the neighbouring nodes N(b) of b.
• UpdateReply: the response message sent from node b to node a containing information about the neighbouring nodes N(b) of b.

In the pull phase, nodes send update requests to the neighbouring nodes whose ego network they do not yet know. A node can finish the pull phase when it has received all the information about the ego networks of its neighbours.
Let Kv be the set of the alters whose ego network the ego node v wants to know. Algorithm 13 defines the pull phase, which is periodically executed (every ∆1 time units) by each node until Kv = ∅. The handler OnUpdateRequest manages the update requests received from the other nodes in the network: when the node v receives an update request (UpdateRequest) from a node n, it replies with an UpdateMessage informing n about its ego network. The handler OnUpdateReply manages the UpdateReply messages; its update function updates the local data structures as described in the received message. After a certain number of cycles, a node v knows all the information about the ego networks of its neighbours N(v).

Algorithm 13 Pull
    while Kv ≠ ∅ do
        select random alter n ∈ Kv;
        send UpdateRequest to n;
        Kv = Kv − {n};
        wait ∆1;
    end while

The notify phase, described by Algorithm 14, communicates all the changes occurring in the ego network of the node v to all the nodes n which have v as an alter. Let E_o^v be the set of nodes that need to receive the notification of the update o relative to the ego network of v. The notification phase is periodically executed, every ∆2 time units, and it terminates when all the updates have been notified. The selection of the node n to which the updates have to be sent can be random or based on the number of updates to notify. Furthermore, it is possible to reduce the number of exchanged messages by aggregating all the notifications o (the set {o | n ∈ E_o^v}) into a single message. The two phases of the protocol can be separated and executed with different frequencies, depending on the dynamics of the network. Let us now show how Algorithms 13 and 14 are exploited to implement the basic operations of a node participating in a social overlay:

• Join the network.
When a node joins the network for the first time, it executes one or more Pull phases (Algorithm 13).

• Add a new link. When a new link between two nodes a and b is created, the two nodes update their lists of nodes whose neighbourhood is unknown: Ka = Ka ∪ {b} and Kb = Kb ∪ {a}. A Pull phase (Algorithm 13) is then executed to request the update. Furthermore, the nodes a and b add a new entry, E_o^{ab} and E_o^{ba} respectively, to their lists of updates, to notify the presence of the new link to their neighbouring nodes (Algorithm 14).

• Remove a link. When a node v removes a link to its alter a ∈ N(v), it notifies the removal to its neighbouring nodes by inserting into its updates list the notification o_a relative to the removal of the node a, E_o^{v,a} (Algorithm 14).

• Update of link properties. When a node v updates the properties p of its links, it notifies the changes to its neighbouring nodes by inserting into the updates list E^v the entry relative to p (Algorithm 14).

Algorithm 14 Notify
    function notify
        while (∃ o : E_o^v ≠ ∅) do
            select n ∈ E_o^v;
            send UpdateMessage({o | n ∈ E_o^v}) to n;
            for each o such that n ∈ E_o^v do
                E_o^v = E_o^v − {n};
            end for
            wait ∆2;
        end while
    end function

The Pull phase for a generic node v (Algorithm 13) terminates after O(|N(v)|) gossip cycles and requires one request message (UpdateRequest) and one response message (UpdateMessage) per cycle. The Notify phase requires a single message per gossip cycle, and its termination depends on the frequency of the changes occurring in the ego network of a node.
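The pull phase of Algorithm 13 can be prototyped offline as follows. The function name is ours, `request_fn` stands in for the UpdateRequest/UpdateReply exchange, and the wait(∆1) between cycles is omitted:

```python
import random

def pull_phase(K, request_fn):
    """Sketch of Algorithm 13: query each alter in K, in random order,
    for its neighbourhood; terminates after |K| cycles."""
    K = set(K)
    learned = {}
    while K:
        n = random.choice(sorted(K))   # select a random alter n ∈ K
        learned[n] = request_fn(n)     # UpdateRequest -> UpdateReply(N(n))
        K.discard(n)                   # K = K - {n}
    return learned

# Toy overlay: the ego 'a' pulls the ego networks of its alters b and c
ego_networks = {'b': {'a', 'c'}, 'c': {'a'}}
print(pull_phase({'b', 'c'}, ego_networks.get))  # both neighbourhoods in |K| = 2 cycles
```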
7.6 Experimental Results
The evaluation of both the distributed protocols for the WEBC computation and of the epidemic diffusion algorithm is proposed in Chapter 10. First, we will show the evaluation of the distributed protocols by computing the EBC in two different scenarios: static and dynamic networks. We evaluate our protocols and the correlation between EBC and BC on directed and undirected graphs by extracting different networks randomly chosen from Zhao's dataset and by varying the number of
nodes. Afterwards, we will show the evaluation of the epidemic diffusion algorithm on a subset of nodes extracted from Zhao's dataset. Furthermore, we will show the evaluation of the WEBC heuristic by comparing it with a heuristic which selects the neighbours on the basis of the EBC.
Part III System Evaluation
Chapter 8 Experimental Results: The Datasets

The first big question we had to answer concerned the choice of the OSN to target. The current OSN scenario is very rich: many platforms are general purpose, whereas others are more service-oriented and targeted at particular users. Since our target was to address OSNs with a viral diffusion and ordinary users, the natural choice was to focus on the first category. The major OSNs which have world-wide diffusion and are general purpose are mainly Facebook, Google+ and Twitter. We decided to put our effort towards Facebook, for many reasons. First, Facebook is without doubt the most used and viral OSN in the world nowadays: as a consequence, the dynamics occurring in this OSN can be considered more "mature" and better reflect real-world interactions, since, on average, a larger number of a user's friends are on this OSN compared to other platforms. Other OSNs, although interesting, have a less pervasive diffusion. Second, if we compare Facebook to its most similar competitor, Google+, we notice a different structure: Google+ is built on the concept of friend circles, whereas Facebook exposes a theoretically much more "plain" and free friend organization. Facebook has a less strict architecture which does not force users to label or categorize friends "a priori". We believe that the circle structure may influence the interactions of users with their contacts, overestimating (or underestimating) the importance of some people. Moreover, we believe that analysing a plainer structure such as that of Facebook could allow us to better understand the relationships among users, making them emerge from our analysis without the bias introduced by an explicit categorization made by the user. Third, by comparing Facebook to Twitter, a major difference exists: Twitter is mostly used by people to follow news or updates, often written by influential or famous people. As a consequence, Facebook usage reflects more the interactions
among normal users, which better model real-world interactions, whereas Twitter relationships encode more the concept of "being a fan of someone". This difference can also be clearly seen by analysing the graph structure of Twitter, where a small number of high-degree nodes dominates over a large number of low-degree nodes, exposing a very heterogeneous degree distribution. Moreover, in Facebook links require mutual agreement (modelling people agreeing on being friends), whereas on Twitter the relationships between users are often not reciprocal. We believe therefore that Twitter is not the best candidate for our study. For these reasons, we are convinced that Facebook is currently the most interesting choice for OSN analysis: its structure and diffusion should better reflect offline social networks, and should allow us to better understand users' behaviour and interaction features. We decided to use two different Facebook datasets: Zhao's dataset [114] and SocialCircles!, a dataset we have logged through the application shown in Appendix A. The main advantage of the first one is that it contains more data than ours and makes it possible to obtain a big connected component. On the other hand, this dataset only has information about ego networks and interactions between users, and no temporal information, which is essential for our studies. This is the main reason that motivated us to develop a Facebook application. The second important reason is that Zhao's dataset is quite dated: in the last ten years the usage of Facebook has completely changed, and we therefore need up-to-date information.
The usage of an application gives us access to a large amount of information on the temporal behaviour of users' sessions, obtained by periodically monitoring and sampling their chat status: we emphasize that this is the only possible way to get temporal information about users on Facebook and, to the best of our knowledge, our dataset is the first which exploited this possibility to build online session traces with some direct measurement. Our dataset is a large collection of mostly independent ego networks rather than a fully interconnected social graph. We will refer to this dataset as the SocialCircles! dataset, after the name of our application. The content of this chapter is partially published in the following publication: "The impact of user's availability on On-line Ego Networks: a Facebook analysis". De Salve, A., Dondio, M., Guidi, B., and Ricci, L. Accepted to Computer Communications, Special Issue on Online Social Networks. Elsevier (2015).
8.1 The Zhao's Dataset: general characteristics
Zhao's dataset was obtained from a Facebook Regional Network¹ and is composed of:
• a Social Graph, which defines the whole social network and is represented by an undirected graph. In this graph, an edge between two people means that they are Facebook friends (Table 8.1).
• four Interaction Graphs, which define the interactions between users in different time windows and are represented by four undirected graphs. Counting from the start date of the crawling (April 2004), there are four temporal windows (as shown in Figure 8.1): the last month, the last 6 months, the last year, and the whole duration (four years). Each Interaction Graph contains an edge for each interaction (post or photo comment) performed during the corresponding temporal window.
Nodes       Edges        Avg. Degree   Avg. Clustering Coefficient   Assortativity
3,097,165   23,667,394   15.283        0.098                         0.048
Table 8.1: Social Graph characteristics.

¹ Available at http://current.cs.ucsb.edu/socialnets/, referred to as "Anonymous regional network A".

A complete analysis of the dataset is proposed in [114]. In the following, we analyse the subset of the complete Facebook social graph that we used. The dataset has been refined by deleting all the relationships which do not have any associated interactions. Since the dataset does not contain temporal information about the starting date of each friendship, it has been necessary to estimate the duration of each friendship by considering the temporal window in which the first interaction occurred [99]. During the refining process, we obtained a dataset where, for each friendship relation, two weighted directed edges are created; the weight is the contact frequency between the two users, obtained by considering all the interactions performed since the starting date of the friendship relation.

[Figure 8.1: Interaction Graph: the four time windows]

The characteristics of the refined dataset are shown in Table 8.2.
Nodes     Edges       Min. Contact Frequency   Max. Contact Frequency   Avg. Contact Frequency
876,071   4,619,221   0.00004839               1                        0.151306

Table 8.2: Refined dataset characteristics.
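The refinement step just described can be sketched in a few lines; the data layout and function names below are illustrative assumptions, not the code used to build the dataset.

```python
from collections import defaultdict

def contact_frequencies(interactions, friendship_duration):
    """For each friendship, create two weighted directed edges whose weight
    is the contact frequency: the number of interactions in one direction
    divided by the estimated friendship duration (here in months).
    `interactions` is a list of (source, target) pairs; durations are keyed
    by the unordered user pair. All names are illustrative."""
    counts = defaultdict(int)
    for src, dst in interactions:
        counts[(src, dst)] += 1
    weights = {}
    for (src, dst), n in counts.items():
        months = friendship_duration[frozenset((src, dst))]
        weights[(src, dst)] = n / months
    return weights
```

For instance, two a→b interactions and one b→a interaction over a 12-month friendship yield a weight of 2/12 on the a→b edge and 1/12 on the b→a edge.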
Figure 8.2 shows that more than 50% of edges, precisely 57%, have a contact frequency in the interval [0, 0.1). We can also see that the number of edges in the interval [0.9, 1] is greater than in the neighbouring intervals, and 83% of them have a contact frequency equal to 1. Figure 8.3 shows the distribution of the contact frequencies of the edges in the interval [0, 0.1]: the distribution is almost uniform and the majority of the users have a relatively small contact frequency. This seems to indicate that in this dataset users are mostly observers or occasional users (online only for short sessions). Figures 8.4 and 8.5 show the in-degree and out-degree distributions, respectively. Both follow a power-law distribution, but the out-degree distribution is cut off after 10% of the nodes. The majority of nodes are frequently contacted by fewer than 20 users; instead, a small set of users is frequently contacted by a considerable number of users. 50% of nodes have an in-degree equal to 1, i.e. they are contacted by only one user. The presence of nodes with in-degree equal to 1 and out-degree equal to 0 is mainly due to two factors:
[Figure 8.2: Distribution of the contact frequencies]

[Figure 8.3: Distribution of the contact frequencies in the interval [0, 0.1]]
• the dataset has been obtained from a Facebook Regional Network in an unknown way: we do not know the methodologies that were used to collect users. Furthermore, the dataset is a subset of the Facebook Regional Network (about 50%) and, as a consequence, for each user we have about 50% of its ego network;
• the refinement techniques cause an isolation of the ego networks. In fact, all the ego networks of users who registered themselves in the last six months of the crawling have been deleted. Furthermore, the ego networks which have, on average, less than 10 interactions per month have been deleted. This is why the dataset contains many egos with an out-degree equal to 0.

We have analysed the degree correlation, which allows us to understand whether similar nodes are connected to each other. We use the Pearson correlation index (presented in Chapter 3) r(α, β) ∈ [−1, 1], where α and β denote the two degree types (in or out).
[Figure 8.4: In-degree distribution]

A positive correlation r(α, β) > 0 means that nodes with a high α-degree are connected to nodes with a high (or similar) β-degree. Instead, a negative correlation r(α, β) < 0 indicates that nodes with a low α-degree are connected to nodes with a high β-degree. Table 8.3 shows the correlation index computed for each possible combination. The correlation r(in, in) is positive, which means that users contacted by a certain number of users tend to connect to users contacted by a similar number of users. The correlation r(in, out) is also positive, but lower than the first one. The correlation r(out, in) is practically absent. Finally,
[Figure 8.5: Out-degree distribution]
r(in, in)   r(in, out)   r(out, in)   r(out, out)
0.2207      0.1546       0.0327       −0.0156

Table 8.3: Pearson correlation indices.
r(out, out) is slightly negative, close to the absence of correlation: this means that central nodes seem inclined to communicate with non-central nodes. Since the correlation values are close to 0, it is hard to identify the real relation between in-degree and out-degree. A more detailed result is obtained with the Joint Degree Distribution (JDD) of the network: given a graph G, JDD[i, j] is the number of edges that connect a pair of nodes with degree equal to i and j, respectively. Figure 8.6(a) shows the positive correlation already seen with the Pearson correlation r(in, in): nodes with an in-degree between 0 and 50 are connected, on average, to nodes with a similar in-degree. This correlation seems even stronger considering that 90% of nodes have an in-degree in this interval; moreover, the correlation becomes scattered for in-degrees over 50. The correlation between in-degree and out-degree, shown in Figure 8.6(b), is similar to the previous one. Figure 8.6(c) shows the absence of correlation between out-degree and in-degree: central ego nodes, which communicate with more than 50 nodes, are connected to nodes with different characteristics. Finally, Figure 8.6(d) shows the negative correlation, which means that nodes with out-degree equal to 0, about
90% of nodes, are connected with nodes that have a very high out-degree. Further analyses of this dataset are proposed in [99].

[Figure 8.6: Joint Degree Distribution: (a) J(in,in), (b) J(in,out), (c) J(out,in), (d) J(out,out)]
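Both measures used above can be computed directly from a directed edge list. The following plain-Python sketch (an illustration, not the code used for the thesis) derives the degrees once and then builds either statistic.

```python
import math
from collections import Counter

def degrees(edges):
    """In- and out-degree of every node in a directed edge list."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg

def degree_correlation(edges, alpha="in", beta="in"):
    """Pearson correlation r(alpha, beta): over all edges (u, v),
    correlate the alpha-degree of u with the beta-degree of v."""
    indeg, outdeg = degrees(edges)
    pick = {"in": indeg, "out": outdeg}
    xs = [pick[alpha][u] for u, v in edges]
    ys = [pick[beta][v] for u, v in edges]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def joint_degree_distribution(edges, alpha="in", beta="in"):
    """JDD[(i, j)] = number of edges connecting a node of
    alpha-degree i to a node of beta-degree j."""
    indeg, outdeg = degrees(edges)
    pick = {"in": indeg, "out": outdeg}
    jdd = Counter()
    for u, v in edges:
        jdd[(pick[alpha][u], pick[beta][v])] += 1
    return jdd
```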
8.2 The SocialCircles! dataset
The dataset, obtained from the SocialCircles! application (presented in detail in Appendix A), contains 337 complete ego networks from Facebook, for a total of 144,481 users (egos and their alters). Since a few users deauthorized our application, we were able to retrieve complete interaction information for 328 users. In this chapter, we show a complete analysis of the dataset, aimed at understanding users' behaviour in this kind of social network. The ego networks we retrieved have the advantage of representing a very heterogeneous population: 213 males and 115 females, with ages ranging from 15 to 79 and with different education, background and geographic location. We believe this is an advantage compared to other works where the dataset comes from the analysis of groups of people with the same background and education (see [101]): our dataset represents a larger, real-world sample of users with different backgrounds.
We thus expect less biased results and more variety in the network structures, which should better reflect real OSN utilization. We managed to build a complete dataset containing the following information:
• Topology and profile information: the social graph of the ego networks, containing users with their profiles and friendship relations. We were able to process 337 complete ego networks.
• Interaction information: social interactions (with associated timestamps) occurring between users, useful to estimate tie strength and trust and to build a weighted interaction graph. We obtained data regarding the incoming and outgoing interactions of 328 egos with their alters.
• Online presence: data on users' session behaviour and online/offline temporal patterns, useful to understand typical user behaviour in OSNs by examining online traces. We were able to access 95,578 users overall (considering the egos and their alters).

We perform a deep analysis of these features, comparing them, where possible, with similar and recent studies and giving particular attention to the differences. We evaluate both the nature of these ego networks and their temporal evolution.
8.2.1 Topology
For each ego network, we collected measures such as the number of nodes, the number of edges and the average clustering coefficient. This analysis has been performed on the whole set of 337 ego networks.

Node and edge distribution

Figure 8.7 shows the distribution of friends for all the analysed ego networks (95% C.I. ± 38.75). We notice that the majority (80%) of ego networks have less than 600 friends, whereas only 7% of nodes exceed 1000 friends. Furthermore, only 20% of egos have less than 250 friends. The distribution of values is right (positively) skewed; for this reason the median is a better representative of the central tendency than the mean: we discovered that the median Facebook network has about 390 friends in total. Figure 8.8 shows the cumulative distribution of friendships (links) between ego and alters and between alters, i.e. the total number of edges in the ego network
[Figure 8.7: (a) histogram, bin width = 100 friends; (b) cumulative distribution function. Mean = 486.899, Median = 394, Std. Dev. = 361.609]
Figure 8.7: Distribution of Facebook friends among ego networks.

(95% C.I. ± 2380.72). To better show the results, a logarithmic scale has been used. The distribution of nodes is heavily right-skewed: the very high standard deviation (SD) confirms that the mean value is practically useless. More than 75% of networks have less than 10,000 ties, and the typical network exposes around
[Figure 8.8: CDF of the number of Facebook friendships (log scale). Mean = 10,930.142, Median = 4,586, Std. Dev. = 22,218.143]
4500 connections (median value). The high standard deviation and the wide range of values suggest a strong heterogeneity in the analysed sample.
Clustering coefficient distribution

The clustering coefficient quantifies how close the neighbourhood of a node is to being a clique (a complete graph, whose clustering coefficient is equal to 1). A high clustering coefficient therefore indicates the presence of a tightly connected graph structure. Figure 8.9 shows the distribution of the clustering coefficient for our dataset of 337 ego networks (95% C.I. ± 0.008). The distribution is almost normal: as a consequence, mean and median almost coincide, and we can consider the mean to represent the typical network. The mean value of 0.636 indicates that the average ego network has a high clustering coefficient, showing that many of an ego's friends are friends with each other: this, in conjunction with a low diameter, makes ego networks small-world networks. We notice that the analysed sample exposes a comparable but slightly higher
[Figure 8.9: CDF of the average clustering coefficient. Mean = 0.636, Median = 0.638, Std. Dev. = 0.078]
average clustering value with respect to past analyses (such as 0.6055²).

Figure 8.9: Distribution of the average clustering coefficient.

Local degree distribution

The local degree is a centrality measure of an alter m with respect to the ego network of an ego n: it represents the number of mutual friends that the alter m shares with the ego n. This property has been proved useful when selecting social storages to address the data persistence problem. To evaluate the typical ego network behaviour we adopted the following methodology:
1. For each network we calculated the local degree of each alter a, normalizing it by the number of alters minus 1 (we must exclude the alter a itself): this allows us to compare networks of different sizes.
2. We sorted the normalized degree values and extracted the ten percentiles.
3. We averaged all the percentiles over all ego networks and built an aggregated cumulative distribution function.
² http://snap.stanford.edu/data/egonets-Facebook.html
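Step 1 of the methodology can be sketched as follows, representing the ego network as a map from each node to the set of its friends (an illustrative layout, not the thesis code):

```python
def normalized_local_degrees(friends, ego):
    """Local degree of each alter = number of mutual friends it shares
    with the ego, normalized by (number of alters - 1) so that networks
    of different sizes can be compared. `friends` maps every node of the
    ego network to the set of its friends."""
    alters = friends[ego]
    norm = len(alters) - 1
    return {a: len(friends[a] & alters) / norm for a in alters}
```

For example, if the ego E has alters A, B and C, and only A and B are friends with each other, then A and B each obtain a normalized local degree of 0.5 and C obtains 0.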
The resulting plot is shown in figure 8.10.
[Figure 8.10: cumulative distribution function of the normalized local degree]
Figure 8.10: Distribution of the normalized local degree in the typical ego network.

From this graph we discover that in the typical ego network an alter shares on average 4% of its friends with the ego. Furthermore, we can clearly see a long tail, which confirms that, like the global degree distribution, the local degree distribution also has a power-law shape: only a small subset of alters presents sensibly higher values of local degree, in contrast to many low-degree nodes.

Ego betweenness centrality

Due to the nature of our dataset, computing classic centrality measures such as Betweenness Centrality or Closeness Centrality is meaningless: the ego networks contain topological information about the ego and its alters, but no information about the whole network structure and connections. However, a centrality measure called Ego Betweenness Centrality (EBC) was proposed by Everett and Borgatti in 2005 [89], and empirical results showed its correlation with the Betweenness Centrality metric. The strong advantage of this measure is that only ego network information is needed to compute it. To be able to compare networks of different sizes, we performed a normalization by dividing EBC(n) by the total possible number of edges between alters, obtaining a normalized value between 0 and 1:
NEBC(n) = 2 · EBC(n) / ((k_n − 1) · (k_n − 2))        (8.1)
where k_n is the number of nodes of the ego network of n; k_n − 1 is therefore the number of alters, excluding the ego. Using Equation (8.1), we present the distribution for our dataset in Figure 8.11 (95% C.I. ± 0.015).
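A minimal sketch of the two measures, using the Everett-Borgatti observation that in an ego network every shortest path between two non-adjacent alters passes through one of their common neighbours (the data layout is an illustrative assumption):

```python
def ego_betweenness(friends, ego):
    """Everett-Borgatti ego betweenness: for every pair of alters that
    are not directly connected, add 1/m, where m is the number of their
    common neighbours within the ego network (the ego itself is always
    one of them). `friends` maps each node of the ego network to the
    set of its friends."""
    alters = sorted(friends[ego])
    ebc = 0.0
    for i, a in enumerate(alters):
        for b in alters[i + 1:]:
            if b not in friends[a]:
                ebc += 1.0 / len(friends[a] & friends[b])
    return ebc

def nebc(friends, ego):
    """Normalized EBC as in Eq. (8.1): 2*EBC / ((k - 1)*(k - 2)),
    with k the number of nodes of the ego network."""
    k = len(friends[ego]) + 1
    return 2 * ego_betweenness(friends, ego) / ((k - 1) * (k - 2))
```

For a star ego network (no ties between the alters) the NEBC is 1, its maximum; for a complete ego network it is 0.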
[Figure 8.11: CDF of the normalized ego betweenness. Mean = 0.725, Median = 0.756, Std. Dev. = 0.139]
Figure 8.11: Distribution of the Normalized Ego Betweenness Centrality.

From this plot we can see that the distribution is left (negatively) skewed, thus the best central tendency indicator is the median value of 0.756. This value is significantly high, and seems to suggest that these ego networks are composed of several different communities.

Community analysis

To achieve a deeper understanding of these ego networks we performed some clustering analysis to discover topology-based communities. First we applied the Louvain community detection algorithm defined in [115], using the Gephi toolkit implementation³.
³ The Gephi Toolkit is available at https://gephi.org/toolkit
The algorithm aims at maximizing the modularity function Q by assigning nodes to different groups; the modularity value measures how much a graph is composed of several communities of nodes. The results of our analysis are shown in Figures 8.12 and 8.13.
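The quantity being maximized can be sketched as follows for an undirected edge list and a node-to-community map; this is only the evaluation of Newman's modularity, not the Louvain optimization itself, and not the Gephi implementation we used.

```python
def modularity(edges, community):
    """Newman modularity Q of an undirected graph for a given
    node -> community assignment (the objective maximized by the
    Louvain algorithm). Illustrative sketch."""
    m = len(edges)
    deg, intra = {}, {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
        if community[u] == community[v]:
            intra[community[u]] = intra.get(community[u], 0) + 1
    q = 0.0
    for c in set(community.values()):
        dc = sum(d for n, d in deg.items() if community[n] == c)
        # Fraction of intra-community edges minus the expected fraction
        # under a random rewiring preserving degrees.
        q += intra.get(c, 0) / m - (dc / (2 * m)) ** 2
    return q
```

For two disconnected triangles assigned to two different communities, Q = 0.5.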
[Figure 8.12: CDF of the modularity value. Mean = 0.46, Median = 0.484, Std. Dev. = 0.119]
Figure 8.12: Distribution of the modularity value.

Figure 8.12 indicates how the modularity value Q is distributed among all ego networks (95% C.I. ± 0.013), whereas Figure 8.13 shows the result of the partitioning, displaying the number of discovered communities (95% C.I. ± 0.17). This analysis confirmed our hypothesis that the ego networks are indeed composed of different communities, leading to high values of the Ego Betweenness Centrality measure. In particular, we found that the EBC value and the modularity value are strictly correlated: an analysis showed that the correlation between these two metrics is very high, with a Pearson's correlation coefficient of 0.806. However, we think that the obtained communities may be too limited to explain alone the complexity of the social ego network structure: the algorithm we exploited is an exclusive clustering algorithm, meaning that a node is assigned to exactly one partition. This may lead to limited expressive power for nodes which should logically belong to multiple communities. Furthermore, this family of algorithms is topology-based, meaning that communities are created just by analysing the ties between alters, without taking into account any information regarding user profiles. More recent approaches are based on more sophisticated algorithms which
[Figure 8.13: CDF of the number of discovered communities. Mean = 5.817, Median = 6, Std. Dev. = 1.621]
Figure 8.13: Distribution of discovered communities.

take advantage of profile information: we mention in particular the McAuley and Leskovec algorithm [116].

Clique analysis

We want to investigate how our networks are composed of cliques of different sizes. A clique of order k is a complete (fully interconnected) subgraph of the original network composed of k vertices. In our analysis, we were able to process only 291 of the 337 ego networks: due to limited processing power, we discarded the networks with a larger number of edges than the typical network. The following analysis has been performed using an approximate clique finder tool⁴ which partially implements the Clique Percolation Method (CPM). The first interesting property we noticed is that there is no linear correlation between the total number of cliques and the size (in nodes) of a network: the Pearson coefficient, whose value is 0.012, indicates that these two variables are essentially uncorrelated. This result justifies the fact that we can merge and compare results obtained on networks of different sizes without requiring a binning representation. We performed three studies: on the total number of cliques, on the typical network
⁴ https://sites.google.com/site/cliqueperccomp/
clique size, and on the distribution of the number of cliques for varying clique size k. The first analysis aimed at understanding how many cliques, of any size, are present in an ego network. This study showed an interesting fact: through a box plot analysis we noticed a significant amount of high extreme outliers. A high extreme outlier is a value which lies above Q3 + 3·IQR, where Q3 is the third quartile and the Inter-Quartile Range IQR is the distance between Q1 and Q3. In particular, we discovered that 40 out of 291 instances (13.75%) were outliers: instead of discarding them, we divided the instances into two separate groups and tried to understand their differences. We identified the threshold value to be around 52,000 cliques; the analysis of the two groups showed interesting differences (Table 8.4).

"Below" group: instances with fewer than 52,000 cliques (86.25%)
  Node number (median): 333    Avg. local degree (median): 16.674    Modularity (median): 0.505
"Above" group: instances with more than 52,000 cliques (13.75%)
  Node number (median): 492    Avg. local degree (median): 43.626    Modularity (median): 0.362
Table 8.4: Some properties of the two groups of networks according to the total number of cliques.

From these data we can clearly see some differences. The "above" networks are slightly larger, but the most significant information is the average local degree of their nodes: the nodes of the "above" group have almost three times the local degree of the nodes of the "below" group. This seems to indicate that in the former group all nodes tend to have many connections on average, while in the latter the power law may hold (with many low-degree nodes and a few high-degree nodes). Another difference we discovered is a notably lower modularity value: the "above" group exposes much less modularity, indicating a weaker community structure. To summarize, these facts together suggest the presence of a small, but still significant, group of egos whose ego network seems to consist mostly of one giant community where nodes are highly interconnected with each other, and where the presence of several communities is much less evident. We then performed the distribution analysis discarding the "above" group: although it represents a non-negligible portion of the networks, the vast majority of egos have less than 52,000 cliques, thus we focused our analysis on them. We believe that the nature of the "above" group should be investigated separately.
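The outlier rule used to split the two groups can be sketched as follows; the nearest-rank quartile convention is an illustrative assumption (quartile conventions differ slightly across tools).

```python
def high_extreme_outliers(values):
    """Return the values above Q3 + 3*IQR, the 'high extreme outlier'
    rule used to split the networks into the two groups. Quartiles use
    a simple nearest-rank convention."""
    s = sorted(values)
    q1 = s[int(0.25 * len(s))]
    q3 = s[int(0.75 * len(s))]
    threshold = q3 + 3 * (q3 - q1)
    return [v for v in values if v > threshold]
```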
The analysis of the 251 egos of the "below" group is shown in Figure 8.14 (95% C.I. ± 1072.13). The high SD of 9288.57 suggests a large heterogeneity in the total number of cliques in ego networks. The distribution is positively skewed; for this reason we use the median as a more robust central tendency indicator, discovering that the typical ego network contains an overall number of 1690 cliques of different sizes.

Average clique size

Next we focused on the average size of these cliques, once again considering only the 251 ego networks of the "below" group, as they better model the most common ego network. We computed for each network the median size of all its cliques, as shown in Figure 8.15 (95% C.I. ± 0.43). We discover that a large portion (80%) of networks has a median clique size of less than 10 nodes, and that nearly 50% of them expose a median clique size between 6 and 10. The distribution is lightly skewed, therefore we choose once again the median value, and we discover that the typical clique size in an ego network is of order 7. We also want to understand how, in a typical network, the cliques are distributed for different values of the clique size k. This study is similar to the previous one, but in this case we first compute the median number of cliques for each increasing value of k over all networks (Figure 8.16, 95% C.I. ± 0.35). The analysis confirms that the majority of cliques in the typical network lies between sizes k = 4 and k = 8. An interesting study is related to the presence of nodes in different cliques: we wanted to study, for each of the 251 networks of the "below" group, to how many cliques each node belongs.
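As a runnable stand-in for the approximate CPM-based clique finder we used, the classic Bron-Kerbosch algorithm enumerates the maximal cliques of an ego network:

```python
def maximal_cliques(friends):
    """Enumerate all maximal cliques with the Bron-Kerbosch algorithm
    (an illustrative stand-in for the approximate CPM tool used in the
    thesis). `friends` maps each node to the set of its neighbours in
    an undirected graph."""
    found = []
    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored nodes.
        if not p and not x:
            found.append(r)
        for v in list(p):
            expand(r | {v}, p & friends[v], x & friends[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(friends), set())
    return found
```

For a triangle {1, 2, 3} with a pendant node 4 attached to node 3, it returns the maximal cliques {1, 2, 3} and {3, 4}.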
We discover that 8% of nodes in the typical network are not members of any clique: these nodes are either isolated or placed in a star topology. By considering the median value of clique memberships for each network, we built the distribution plots in Figure 8.17 (95% C.I. ± 1.5). From these plots we see that in the typical network nodes are on average members of 8 cliques; the inter-quartile range tells us that in around 50% of networks nodes are on average members of 5 to 14 cliques. Finally, we assessed a property of particular importance because it is related to the social storage selection problem: we call the set of nodes elected as social caches to cover the entire network the covering set. In this analysis we will discover how many nodes are needed to be able to reach all other nodes in a
[Figure 8.14: (a) histogram, bin width = 1000 cliques; (b) cumulative distribution function. Mean = 5621.86, Median = 1690, Std. Dev. = 9288.57]
Figure 8.14: Distribution of total cliques among ego networks.
[Figure 8.15: (a) histogram, bin width = 1; (b) cumulative distribution function. Mean = 8.09, Median = 7, Std. Dev. = 3.477]
Figure 8.15: Distribution of the median clique size for each network
[Figure 8.16: histogram of the number of k-cliques in the typical network. Mean = 7.18, Median = 7, Std. Dev. = 2.78]
Figure 8.16: Distribution of k-cliques in the typical network
network, starting from the nodes belonging to the largest number of cliques: a node is reachable by another node if it shares at least one clique with it. To compute the covering set we created a simple CoveringSet algorithm, presented in Algorithm 15.
The idea of the algorithm is first to sort the nodes by the number of cliques they are members of, placing them in an ordered list. Then, processing this list in descending order, we pick a node, put it into the covering set and remove all the nodes it can reach from the list. When the list is empty, we have covered all nodes and obtained the full covering set.
[Figure 8.17: (a) histogram, bin width = 3; (b) cumulative distribution function. Mean = 11.92, Median = 8, Std. Dev. = 12.09]
Figure 8.17: Distribution of overlapping cliques (any size) per node for each network.

Algorithm 15 CoveringSet
Input: the nodes and the cliques set of a network
Output: the coveringSet of the network
  sortedNodes = SortNodesByCliquesMemberships(nodes, cliques)
  coveringSet = ∅
  while isNotEmpty(sortedNodes) do
    node = getNextNode(sortedNodes)
    sortedNodes = sortedNodes \ {node}
    coveringSet = coveringSet ∪ {node}
    ownCliques = GetOwnCliques(node, cliques)
    for all clique in ownCliques do
      for all reachableNode in clique do
        sortedNodes = sortedNodes \ {reachableNode}
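A runnable Python version of Algorithm 15 could look like the following sketch (the greedy strategy is the one described above; names are illustrative):

```python
def covering_set(nodes, cliques):
    """Greedy covering set: repeatedly pick the remaining node that
    belongs to the largest number of cliques, add it to the covering
    set, and drop every node reachable through one of its cliques
    (a node is reachable if it shares at least one clique)."""
    membership = {n: [c for c in cliques if n in c] for n in nodes}
    remaining = sorted(nodes, key=lambda n: len(membership[n]), reverse=True)
    covering = []
    while remaining:
        node = remaining.pop(0)
        covering.append(node)
        reachable = set().union(set(), *membership[node])
        remaining = [n for n in remaining if n not in reachable]
    return covering
```

Note that nodes belonging to no clique reach no one and so end up in the covering set by themselves.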
However, since a small portion of nodes (on average 8%, as calculated in the previous section) are not members of any clique, to be covered these nodes must be chosen as social storages themselves: this increases the number of storages needed just to cover a few more nodes, which makes little sense. For this reason we also computed the approximate covering set, i.e. the set of nodes needed to reach all the other nodes which belong to at least one clique. We computed both the full covering set size and the approximate covering set size with respect to the network size for each network of the "below" group. We present both distributions in Figure 8.18, where we show in red the full covering set size and in green the approximate covering set size (95% C.I. ± 0.86% for the full distribution and ± 0.24% for the approximate distribution).
[Figure 8.18: CDF of the covering set size as a percentage of the network size. Full covering set: Mean = 0.248, Median = 0.23, Std. Dev. = 0.07; approximate covering set: Mean = 0.143, Median = 0.14, Std. Dev. = 0.02]
Figure 8.18: Distribution of the full and approximate covering set size for each network.

The median size of the full covering set means that in the typical network, by choosing 23% of the nodes, we are able to cover the entire network; for the approximate covering set, the typical network needs only 14% of the nodes to be elected as social storages. The approximate covering set shows a very low SD, thus most values are concentrated close to the central value: in particular, almost 80% of networks expose an approximate covering set between 11% and 17% of their size. By comparing the two covering sets, we discover a very interesting result: the two plots indicate that the tradeoff of not covering 8% of the nodes is balanced by a significantly smaller covering set. The results we obtained seem very promising and may be worthy of further and more detailed studies regarding the social
CHAPTER 8. EXPERIMENTAL RESULTS: THE DATASETS
storage problem.
8.2.2 Analysis of the users' interactions and ego network properties
In this analysis, we focus our attention on the interactions occurring between users in our dataset. By analysing the flow of visible activities between ego and alters and vice versa, we were able to assign a weight (tie strength) to each friendship relation. It is important to notice that, due to the restricted access policies (adopted on purpose, to avoid asking for a large amount of personal data which could have limited the spread of the application), we were not able to access and retrieve users' private mailboxes. We remark that only a limited subset of the outgoing activity could be retrieved: due to Facebook privacy restrictions, we could access likes but not posts or comments written on friends' walls, content which would require the friends' explicit access authorization to our application. Fig. 8.19(a) shows that likes, photos and comments contribute over 50% of the total interactions. The second most important interaction type is the comment, which accounts for 22.7% of the overall interactions. It has been shown that in OSNs each ego is in direct communication with fewer people than the number of people who contact him [100]. Our analysis confirms this trend, showing that the average incoming active network is around 29% of the alters (vs. 26% for the outgoing one), although the difference is smaller than in the above-mentioned work. Furthermore, we discovered that 18% of the ties in the ego networks are symmetric, representing egos reciprocating the interaction with alters. To reflect different levels of importance of the relationships, friendships are associated with a tie strength, a numerical value representing the social distance between the ego and the alter involved in a relationship. It has been shown [99, 100] that the tie strength between egos and their alters is strongly related to their contact frequency, computed as the ratio between the number of direct interactions and the duration of the social relationship.
Since the Facebook API is not able to provide the overall duration of a friendship relation, and there is a strong correlation between the overall amount of interactions and the contact frequency, we estimate the tie strength as the number of direct interactions occurring from the ego to the alter. From our perspective, an active contact is a contact with an associated tie strength greater than 0. To evaluate the size of the active network, we compute for each ego network the ratio between the number of active alters and the total number of alters (active ratio), which measures the size of the active network with respect to the total network size. The graph in Figure 8.19(b) shows that the active ratio of a typical network
8.2. SOCIALCIRCLES! DATASET
is around 26%. This result is interesting when compared to the work in [100], where the ratio is higher (45.88%): we explain the difference first by the different methodologies (explicit tie strength evaluation asked of users versus real interaction activity), and second by considering that our average network is much bigger and more recent than the one in that study, which refers to a dataset created in 2011 (see [117]) and may expose different properties. However, if we consider the mean size of the active network in terms of number of alters, we obtain the value of 117.8, which is comparable to the values found by similar studies (e.g., 105 in [100] for OSNs, 124 in [91] and 135.2 in [118] for offline networks). It is interesting to notice that the size of the network in our dataset is almost uncorrelated with the active contacts ratio: intuitively, we would expect egos with more contacts to be likely to have a lower active contacts ratio. The Pearson correlation coefficient of -0.22 tells us that this phenomenon is not very significant. An expected result comes from the correlation between the active contacts ratio and the activity per alter index, which measures the overall activity of an ego on the OSN: the high correlation of 0.77 clearly indicates that people who are more active on Facebook are able to keep a higher percentage of active contacts. Finally, we decided to investigate the claim that females are able to keep more active contacts than males [119]. The analysis of the active network, dividing egos by gender, confirmed this difference: on average, women can maintain active connections with 30.4% of their alters, whereas men only with 24.2%. In addition, women seem to be overall more active on OSNs than men, with an average activity per alter of 1.597 compared to 1.198 for men.
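The tie strength estimate and the active ratio defined above can be sketched in a few lines; the interaction counts below are hypothetical stand-ins for the crawled data, not values from the dataset:

```python
# Hypothetical number of direct interactions from the ego to each alter;
# per the text, this count is used as the tie strength estimate.
interactions = {"alice": 12, "bob": 0, "carol": 3, "dave": 0}

# An active contact is a contact with tie strength greater than 0.
active = [alter for alter, ts in interactions.items() if ts > 0]

# Active ratio: number of active alters over the total number of alters.
active_ratio = len(active) / len(interactions)
print(active_ratio)  # 0.5
```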
We now try to better understand the nature of the tie strength by focusing on how it is distributed among the alters: for each ego network, we computed the alters' tie strength distribution and built an aggregated CDF, shown in Fig. 8.20(a), where min-max normalization has been applied. The elbow in the graph indicates that around 10% of alters can be considered at a high level of intimacy and trust: compared to the analysis in [100], which reported a value of 23.53% by considering the recency of contact as tie strength model, we obtain a lower value. We explain
[Figure: panel (a) Facebook interactions composition; panel (b) ego network active ratio, CDF with mean 0.263, median 0.257, std. dev. 0.132.]
Figure 8.19: Analysis of the users' interactions and ego network structure
this with the higher average size of the ego networks.

[Figure: panel (a) tie strength distribution (CDF); panel (b) optimal clusters number (histogram).]
Figure 8.20: Tie Strength and clusters analysis

We performed a mono-dimensional clustering analysis using the K-means algorithm [121], exploiting the tie strength values separately for each active network of our sample. To compute the best number of clusters, we adopted the well-known elbow method [121], adding a new cluster iteratively until the improvement of the clustering falls below a 0.1 threshold. Figure 8.20(b) shows the distribution of the optimal number of clusters for each ego network (95% C.I. ± 0.065). The majority of active networks (63.4%) have an optimal number of 4 clusters. As regards the ego networks with 3 clusters, we found that their structure depends both on the lower ratio of user activity and on the lower number of social links they expose. In fact, compared to the users with 4 Dunbar circles, they have a smaller average ego network size (362.5 vs. 416.5), fewer overall interactions (338 vs. 461), a smaller daily session number (4.5 vs. 4.9) and session length (4.84 vs. 5.50 time slots). These facts seem to suggest that ego networks with three clusters are composed of users who do not use Facebook as much as the other groups, confirming a claim stated
in [99]. Since ego networks with three Dunbar circles do not have a counterpart in real ego networks, we focus only on ego networks with a number of circles equal to 4. The detailed results about the obtained clusters (or circles) for these ego networks are shown in Table 8.5. For each circle (C1, C2, C3 and C4, ordered from the innermost to the outermost), we computed the average size of the Dunbar circle over the ego networks (size), the scaling factor between circles (scal. f.), the minimum/maximum tie strength (min/max TS), and the mean (mean TS) and median (median TS) tie strength of the circles. Finally, we added some properties which globally characterize these networks, such as their average size, their average active network and the average number of total interactions performed by the egos. These networks confirm the Dunbar circles hypothesis: in particular, we notice that the average scaling factor between the sizes of the concentric circles is about 3; similar values have been shown to hold in [99] (scaling factor of 3.12) and [120] (scaling factor of 3.45). Compared to [99, 120], the Dunbar circles seem to be slightly bigger; however, we highlight that our data comes from the complete ego network structure, while the networks in [99] have been subject to an estimation due to partial topology. We believe that this approximation may underestimate the real circle sizes. Finally, our results also indicate that the sizes of the Dunbar circles are very similar to other results about offline social networks [122].
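The mono-dimensional K-means plus elbow procedure described above can be sketched as follows. This is our own minimal illustration: the 0.1 threshold follows the text, while the deterministic initialization and the inertia-based improvement criterion are assumptions:

```python
def kmeans_1d(values, k, iters=50):
    """Plain 1-D K-means; returns the within-cluster squared error (inertia)."""
    vs = sorted(values)
    # deterministic init: spread the centroids over the sorted values
    centroids = [vs[(len(vs) - 1) * j // max(k - 1, 1)] for j in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[min(range(k), key=lambda j: abs(v - centroids[j]))].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum(min((v - c) ** 2 for c in centroids) for v in values)

def optimal_k(values, threshold=0.1, k_max=6):
    """Elbow method: add one cluster at a time until the relative inertia
    improvement drops below the threshold."""
    prev = kmeans_1d(values, 1)
    for k in range(2, k_max + 1):
        cur = kmeans_1d(values, k)
        if prev == 0 or (prev - cur) / prev < threshold:
            return k - 1
        prev = cur
    return k_max

# Two well-separated tie strength groups -> two clusters
print(optimal_k([1, 1, 1, 10, 10, 10]))  # 2
```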
           C1            C2            C3            C4
size       4.3 [.50]     17.7 [1.85]   50.8 [5.19]   132.8 [13]
scal. f.   -             3.64 [.46]    2.35 [.24]    2.69 [.35]
mean TS    34.09 [2.49]  12.69 [.43]   5.05 [.1]     1.57 [.017]
median TS  22            9             4             1
min TS     23            7             3             1
max TS     363           149           52            15

Properties of active network: network size 416.5; outgoing active network perc. 28%; ingoing active network perc. 30%; total ego interactions 461; sex of ego (m/f) 61% / 39%.

Table 8.5: Dunbar circles analysis for networks with k-opt = 4. 95% confidence intervals are reported in square brackets.

8.2.3 Analysis of temporal characteristics related to data availability

In terms of data availability, one main challenge is to consider the availability patterns of the nodes in the ego network when defining the data allocation, with the goal of avoiding continual data transfers which affect the overall performance of the social service. The study of the temporal behaviour of users in OSNs, in particular the
study of the relation between the online sessions of egos and those of their alters, is therefore of primary importance to support the decentralization of social services, by characterizing the typical OSN usage and understanding how users interact with these platforms. While a few recent studies have examined the availability patterns of OSN users in terms of session length or interaction frequency, they do not provide a global picture of the relation between the ego network structure of users and the availability patterns of the alters belonging to their ego network. We want to investigate the presence of temporal dependency in the ego network structure of OSNs, to understand whether temporal patterns can be exploited to manage important problems in a distributed scenario, such as data availability and information diffusion. The main result of our study is the identification of a similarity (or temporal homophily) between the availability patterns of the egos and their alters, which increases when considering alters belonging to inner Dunbar circles. Using the SocialCircles application, we aim to investigate Facebook data with the goal of studying the availability patterns which characterize the Dunbar ego networks in Facebook, and to analyse the extent to which the availability of an ego depends on the presence of the alters in each Dunbar circle. The lack of data concerning the online presence of users in OSNs is currently the main limitation for temporal pattern analysis. Furthermore, only a very small number of studies are based on complete datasets provided by the OSN operators, while others have collected a complete view of specific parts of OSNs. As a matter of fact, a complete dataset is typically unavailable to researchers, as most OSNs are unwilling to share their company's data even in an anonymized form, primarily due to privacy concerns. For all these reasons, it is common to work with small but representative samples of an OSN.
To the best of our knowledge, no existing up-to-date dataset is able to provide complete information, i.e. information regarding the social graph, the interactions among users and temporal information (online sessions), for a real OSN.

Analysis of the user behaviours

We sampled all the 337 registered egos and their friends every 8 minutes for 10 consecutive days (from Tuesday 3 June 2014 to Friday 13 June). Using this methodology, we were able to access the temporal status of 308 registered users and of their friends (for a total of 95,578 users). For the sake of clarity, we will use the term registered users to indicate these 308 users. In order to characterize OSN workloads at the session level, we consider the availability trace of each user to determine the start of a session (when a user switches from offline to online or idle) or the end of a session (when a user switches from online or idle to offline).
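The session-boundary rule just described can be sketched as follows; the slot length and the status labels follow the text, while the function and trace encoding are our own illustration:

```python
# Availability trace sampled every 8 minutes; a session starts when the user
# switches from "offline" to "online"/"idle" and ends on the reverse switch.
SLOT_MINUTES = 8

def sessions(trace):
    """Return (start_slot, end_slot) pairs from a list of status strings."""
    out, start = [], None
    for i, status in enumerate(trace):
        present = status in ("online", "idle")
        if present and start is None:
            start = i                      # offline -> online/idle: session starts
        elif not present and start is not None:
            out.append((start, i))         # online/idle -> offline: session ends
            start = None
    if start is not None:                  # session still open at end of trace
        out.append((start, len(trace)))
    return out

trace = ["offline", "online", "idle", "offline", "offline", "online"]
print(sessions(trace))                                       # [(1, 3), (5, 6)]
print([(e - s) * SLOT_MINUTES for s, e in sessions(trace)])  # [16, 8] minutes
```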
Utilizing the session information, we first examined the number of concurrent users that accessed the OSN site (see Fig. 8.21).

[Figure: number of idle, online and total users over time, 3-13 June.]
Figure 8.21: Temporal availability of all users.

Figure 8.21 clearly indicates the presence of a cyclic day/night pattern (confirming the results in [123]). Since the majority of the registered users live in Italy or in central Europe, time-zone differences are negligible. In Facebook, the online chat status can be:
• Offline: the user is not available on Facebook (not shown in the graph).
• Online: the user is currently available and performing some activity.
• Idle: the user was available but has not performed any activity on the OSN for at least 10 minutes.
The graph shows the presence of two peaks: on average, most users seem to be connected after lunch time, with a peak around 14:30. The other peak is usually in the evening, around 22:30, probably preceding sleeping time. It is interesting to notice that weekends seem to have little influence on users, except that Friday and Saturday nights do not expose the above-mentioned evening peak, reflecting the fact that many people may go out. It is important to notice that these patterns describe just a global tendency, and cannot be exploited to make any prediction or assumption about single user behaviour.
In order to estimate how often and for how long users connect to the OSN, we measured the frequency and duration of the sessions of each user. Fig. 8.22(a) shows the number of daily sessions per user (95% C.I. ± 0.23). We can notice that the majority of users (90%) expose on average less than 100 daily sessions, while the average number of sessions over all users is less than 4 sessions per day. Fig. 8.22(b) shows, for all users, the CDF of the session length (95% C.I. ± 0.65) and of the elapsed time (inter-arrival time) between two consecutive sessions of a user (95% C.I. ± 4.44). There is a large variation in OSN usage among users. However, almost half of the user sessions are shorter than 20 minutes (median value of 24 minutes), and a significant percentage (34%) last less than 10 minutes. Only a few user sessions (less than 13%) have a long duration, exceeding 2 hours. We can notice that almost 50% of users present an inter-arrival time shorter than 1 hour. These plots therefore confirm that in OSNs the typical session has a short duration. Small inter-arrival times correspond to users who constantly use the OSN service, while large inter-arrival times correspond to users who connect occasionally. It is important to notice that the size of the active network of each user is slightly correlated with the time spent online by the ego, namely with the average session length (0.36) and the number of daily sessions (0.20): intuitively, we would expect egos with more active contacts to be likely to spend more time on the OSN. While researchers have observed that strong and weak ties are characterized by different levels of homophily, i.e. the tendency of individuals with similar interests to join with each other, it has not been understood to what extent ties in circles show the existence of temporal homophily, the tendency of similar individuals to participate in similar uptime patterns.
Moreover, the actual impact of correlated availabilities on the Dunbar circles remains unexplored. In order to bridge this gap, we evaluate whether the online patterns of users are correlated with those of their alters in each Dunbar circle. We consider separately the alters in each circle and compute the availability correlation between egos and their alters using the similarity between their availability patterns. As done in [43], we evaluate this correlation using the cosine similarity metric [124], which is frequently adopted when trying to determine the similarity between binary data (such as documents or, in our case, availability patterns). The availability of each user is represented by an availability vector of fixed size. For each time slot of the monitoring period (eight-minute time slots for 10 consecutive days), the corresponding entry contains 1 if the user was online at that time and 0 otherwise. Formally, let A and B be the availability vectors of two users; the cosine similarity is computed as shown in Eq. (8.2):

CosineSimilarity(A, B) = (A · B) / (||A|| · ||B||)    (8.2)
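On binary availability vectors, Eq. (8.2) reduces to a very simple computation, since the dot product counts the (1,1) matches and each norm is the square root of the number of online slots. A sketch with hypothetical vectors:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity (Eq. 8.2) between two binary availability vectors."""
    dot = sum(x * y for x, y in zip(a, b))  # number of (1,1) matchings
    norm = sqrt(sum(a)) * sqrt(sum(b))      # ||A|| * ||B|| for 0/1 vectors
    return dot / norm if norm else 0.0

# One entry per eight-minute time slot: 1 = online, 0 = offline
ego   = [1, 1, 0, 0, 1, 0]
alter = [1, 0, 0, 1, 1, 0]
print(round(cosine_similarity(ego, alter), 3))  # 0.667
```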
[Figure: panel (a) number of daily sessions for all users (CDF); panel (b) session length and inter-arrival time for all users (CDF, minutes); panel (c) similarity of the online-offline patterns on temporal ranges (average cosine similarity per 4-hour range).]
Figure 8.22: Analysis of the temporal properties
The resulting similarity ranges from 1, meaning perfect correlation, to 0, usually indicating no correlation between the ego and the alter. We investigate to what extent the availability patterns of friends who appear to be online in the same time slot are similar, for different time windows of the day. We divide the day into 6 time windows of four hours each (from 0 to 24), and then compute the cosine similarity of the availability patterns among friends who appear to be online in the same time slot of the considered time window. Fig. 8.22(c) shows the average similarity for each time window. As we expected, users who happen to be online together during an unusual temporal window (from 0am to 8am) have a greater similarity than users connected during the classical time windows (from 8am to 0am, i.e. preceding sleeping time). Since a correlation implicitly exists at the dataset level, as users of the same country tend to connect during specific times of the day (see Fig. 8.21), we compare the availability correlation on each circle with the one obtained by considering external users. For this purpose, we also compute the average correlation between the ego and the set of their friends outside the Dunbar circles (referred to as Random). We
[Figure: panel (a) average cosine similarity for each registered user, per Dunbar circle (Circles 1-4 and Random, with 95% C.I.); panel (b) average cosine similarity distribution (CDF).]
Figure 8.23: Analysis of the Dunbar circles temporal features
have computed, for each registered user, the average cosine similarity between the ego and their alters in each of the Dunbar circles. Fig. 8.23(b) shows the CDF of the correlation values, while Fig. 8.23(a) shows the average correlation values for each circle. The similarity values are rather low for the Dunbar circles as well as for random friends. However, the graph clearly indicates that alters in the innermost circles (such as Circle 1 and Circle 2) have a higher average similarity with the availability pattern of the ego than alters in the outermost circles (such as Circle 3 and Circle 4). The similarities with alters in the Dunbar circles are higher than with random ones, thus highlighting the impact of the Dunbar circles on availability. The average similarity of each circle is equal, respectively, to: 0.23 for Circle 1 (95% C.I. ± 0.025), 0.19 for Circle 2 (95% C.I. ± 0.019), 0.18 for Circle 3 (95% C.I. ± 0.018), 0.15 for Circle 4 (95% C.I. ± 0.017) and 0.10 for random friends (95% C.I. ± 0.012). We have also measured the average number of times the ego and their alters are both online (1,1) or offline (0,0) in the same time slot, and the number of times that only the ego (1,0) or only the alter (0,1) is online, separately for each circle. Figs. 8.24(a), 8.24(b), 8.24(c) and 8.24(d) show the CDF of the (0,0), (0,1), (1,0) and (1,1) matchings, respectively. As we expected, the number of (1,1) matchings between availability vectors increases as we consider alters of the innermost circles. In contrast, the number of (0,0) matchings shows an opposite trend, since it decreases as we consider alters of the inner circles. This highlights the key role that positive matches of the form (1,1) have on the availability pattern of close alters, compared to matches of the form (0,0). As regards the (0,1) matching, the results show that, when the ego is offline, alters in the inner circles are much more often online than alters belonging to the outer circles.
An opposite trend occurs when we consider the (1,0) matching, since the number of offline alters when the ego is online appears to increase as we consider outer circles. In order to characterize the temporal structure of the active network, we compute the average percentage of matchings found in each circle (see Fig. 8.25).
[Figure: CDFs of matchings per circle (Circles 1-4 and Random); panels: (a) ego offline and alters offline, (b) ego offline and alters online, (c) ego online and alters offline, (d) ego online and alters online.]
Figure 8.24: Analysis of the temporal matching on Dunbar circles

The average percentage of (0,0) matchings exceeds the other cases in each circle, and is equal to: 61.4% for Circle 1 (95% C.I. ± 8.4), 62.4% for Circle 2 (95% C.I. ± 7.3), 63.5% for Circle 3 (95% C.I. ± 7.2), 65.4% for Circle 4 (95% C.I. ± 7.3) and 67.6% (95% C.I. ± 7.4) for alters who are not members of the Dunbar circles. Instead, matchings of the form (1,1) have the lowest percentage value for all circles, namely: 7.4% for Circle 1 (95% C.I. ± 3.9), 5.9% for Circle 2 (95% C.I. ± 2.6), 5.3% for Circle 3 (95% C.I. ± 2.2), 4.6% for Circle 4 (95% C.I. ± 2.0) and 3.6% (95% C.I. ± 1.6) for random friends. Circle 1 has approximately the same percentage of (0,1) and (1,0) matchings, i.e. 15.9% (95% C.I. ± 4.5) and 15.3% (95% C.I. ± 5.6), respectively. As alters are located in the outer circles, the percentage of (1,0)/(0,1) matchings increases/decreases by about 1%. As a further step, we characterized the impact of this similarity between users and their friends on the probability of the ego being online/offline, by taking into account the aggregated behaviour of the alters in the active network. In order to estimate this impact, we computed for each ego the probability of being online/offline depending on the available alters in each time slot. As done in other studies [43, 125], we model this dependence using conditional
Figure 8.25: Percentage of matching
[Figure: conditional probabilities P(OFF | #offline friends > k), P(OFF | #online friends > k), P(ON | #offline friends > k), P(ON | #online friends > k) as a function of k.]
Figure 8.26: Conditional probability on active network
probabilities. More formally, let e = {ON, OFF} be the events "ego is online/offline" and ak = {# online/offline friends > k} the events "at least k alters of
the active network are online/offline"; we calculated the conditional probability P(e|ak) = P(e ∩ ak)/P(ak) for k = 0, ..., 200. The value P(e ∩ ak) is the number of time slots in which the user is online/offline and at least k of her active contacts are online/offline, normalized by the total number of time slots. P(ak) is the number of time slots in which at least k of her active contacts are online/offline, normalized by the total number of time slots. Fig. 8.26 shows the conditional probabilities for the different combinations of events. The results clearly show that an ego is more likely to be online when at least 10 of the alters in her Dunbar circles are connected. After that, the conditional probability decreases as the number of online friends increases. The probability that the ego is offline decreases very quickly as the number of online neighbours increases. Instead, the conditional probability that the ego is offline/online remains roughly the same for any number of offline neighbours.
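The conditional probability above can be estimated directly from the availability vectors; a sketch with hypothetical data, where we read "at least k" as >= k (the text alternates between "> k" and "at least k"):

```python
def conditional_online(ego, alters, k):
    """Estimate P(ego online | at least k alters online) over the time slots.
    ego: 0/1 availability vector; alters: list of 0/1 vectors of equal length."""
    slots = len(ego)
    a_k = [sum(v[t] for v in alters) >= k for t in range(slots)]  # event a_k per slot
    p_ak = sum(a_k) / slots                                       # P(a_k)
    if p_ak == 0:
        return 0.0
    p_joint = sum(1 for t in range(slots) if ego[t] and a_k[t]) / slots  # P(e ∩ a_k)
    return p_joint / p_ak

# Four time slots, two alters
ego = [1, 1, 0, 0]
alters = [[1, 0, 1, 0], [1, 1, 0, 0]]
print(conditional_online(ego, alters, 1))  # P(ON | >= 1 alter online)
```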
Chapter 9

System Evaluation

The goal of our simulations is to evaluate our system and the social services we propose. To obtain realistic results, we use the event-based simulator PeerfactSim.KOM [126, 127], which already contains implementations of several DHTs, like Pastry [24] and Chord [128]. Each simulation uses GNP coordinates [129] to estimate delays in the underlying network. Furthermore, measurements from the PingEr project [130] are integrated into the simulator for reasonable approximations of jitter. The outline of this chapter is as follows:
• in Section 9.1 we provide an overview of the PeerfactSim.KOM simulator, which we use for our experiments;
• in Section 9.2 we provide an evaluation of Social Pastry and of the goLLuM algorithm proposed in Chapter 5;
• in Section 9.3 we provide an evaluation of the Trusted Social Storage approach based on 2 replicas, proposed in Chapter 6. Furthermore, we propose a study which explains how 2 replicas guarantee a high level of trust;
• in Section 9.4 we provide an evaluation of the Information Diffusion Service and of the two proposed protocols to manage the computation of the Weighted Ego Betweenness Centrality (and of the classic Ego Betweenness Centrality), proposed in Chapter 7.
9.1 PeerfactSim.KOM

PeerfactSim.KOM is a flexible event-based simulator, written in Java, for large-scale P2P systems. It offers a simulated environment to execute a variety of P2P scenarios. The simulator is organised as a layered architecture and its modular
design helps the implementation and integration of new components. Furthermore, a visualization service is integrated into the simulator to provide graphical visualizations of communication observed during simulations. This visualization can also be used for debugging purposes.
Figure 9.1: PeerFactSim.KOM: Architecture
9.2 Social DHT: Experimental Results
In this section, we show the simulation results obtained by testing the novel Social Pastry and the goLLuM algorithm. This evaluation aims to show that our goLLuM algorithm provides a good solution for our system, but also for a generic trusted system. Each simulation uses the GNP coordinates [129] to estimate delays in the underlying network. Furthermore, measurements from the PingEr project [130] are integrated into the simulator for reasonable approximations of jitter and packet loss. Each of our simulations has been run with 10 different random seeds, so the values we obtained represent the average of 10 different values.
General settings: PeerfactSim.KOM simulator, 10 seeds per setup; network model: GNP, jitter based on [130], churn, no packet loss.

B) goLLuM — Setup: Pastry as DHT, forwarding to successors with all closeness levels. Topology: Barabási, Watts and Strogatz, Zhao's dataset. Size: 1000. Level λ: 1, 2, 3, 4, 5.

C) SocialPastry — Setup: Pastry with the social routing table. Topology: Barabási, Watts and Strogatz, Zhao's dataset. Size: 1000, 2000, 4000, 8000. Level λ: 1, 2, 3, 4, 5.

Table 9.1: Simulator Setup

Table 9.1 shows an overview of the simulation settings. Primarily, we want to investigate the number of successful lookups, to prove the functionality of our routing algorithm. Secondly, we identify the average number of hops which have to be traversed to reach a given target identifier. In our simulation setup, approximately 800 lookups to random targets are started per simulated minute. In addition, lookups are only forwarded to nodes which satisfy the condition Lc ≤ λ, where λ is a given maximum level.
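The forwarding constraint Lc ≤ λ amounts to a simple filter over the candidate next hops. The sketch below uses hypothetical data structures, not the simulator's API:

```python
def eligible_next_hops(candidates, lam):
    """Keep only the candidate nodes whose closeness level Lc satisfies Lc <= lambda."""
    return [node for node, level in candidates if level <= lam]

# (node id, closeness level) pairs taken from a friend-routing table
candidates = [("n1", 1), ("n2", 3), ("n3", 5)]
print(eligible_next_hops(candidates, 3))  # ['n1', 'n2']
```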
9.2.1 Experimental Setup
One central part of our evaluation is the assignment of friendship relations to every node we simulate. As shown in detail in Table 9.1, the simulations run with 1000 nodes per simulation, and we generate friendship relations according to two synthetic distributions, as well as exploiting Zhao's dataset: we compare friendship topologies generated according to Barabási and Albert [77], a small-world graph introduced by Watts and Strogatz [76], and a topology observed in real friendship relations by Wilson et al. in [114]. Links between two different nodes are bidirectional, i.e. if node A is a friend of B, then B is a friend of A as well. As real dataset we use Zhao's one because we need a small connected component, and this dataset is suitable for this goal. The dataset consists of a list of edges (node A, node B), each paired with a tie strength which represents the (directed) communication intensity from node A to node B. By following the Dunbar approach, friends added to an overlay node are assigned a closeness level which depends on the tie strength. Closeness levels represent the Dunbar circles and are composed as follows:
• closeness level 1: 5 overlay nodes.
• closeness level 2 and below: 15 overlay nodes.
• closeness level 3 and below: 50 overlay nodes.
• closeness level 4 and below: 150 overlay nodes.
As the first step of our simulations, we generate graphs of 1000 nodes, which represent the relationships between all participating nodes in the network. The simulation can be considered as divided into two phases: in the first phase, the used overlay (Pastry) is established and stabilized; in the second phase, one of the previously created topologies is used to extend the overlay with an additional routing table, namely the friend-routing table. Each node in the friendship topology represents an overlay node during the simulation and, for each overlay node, all outgoing links of the corresponding friendship topology node are considered as friends. According to their level of closeness, the additional nodes are stored inside the friend-routing table in friend buckets. One bucket exists for each level of closeness. Starting with the first, the buckets are sequentially filled with nodes, such that the first bucket contains up to 5 friends, the second bucket no more than 10 nodes, up to 35 nodes are maintained in bucket 3, and at most 100 nodes are stored in bucket 4. All remaining nodes are added to the friend-routing table as level 5 contacts. In this way, the friend-routing table is filled with the most trusted nodes first, until the respective friendship bucket is full. By applying the Dunbar scheme to the routing tables of each node, a directed friend graph is obtained. Undirected friendship graphs are not compatible with the Dunbar scheme since, as shown in Section 4.5, the election "to be a Dunbar friend" is not symmetric.
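The sequential filling of the friend buckets can be sketched as follows. The bucket capacities 5/10/35/100 follow the text; the function and its input ordering (friends sorted by decreasing tie strength) are our own illustration:

```python
CAPACITIES = [5, 10, 35, 100]  # bucket sizes for closeness levels 1..4 (per the text)

def fill_friend_buckets(friends_by_tie_strength):
    """Sequentially fill the level buckets with friends sorted by decreasing
    tie strength; friends that do not fit become level-5 contacts."""
    buckets = {level: [] for level in range(1, 6)}
    it = iter(friends_by_tie_strength)
    for level, cap in enumerate(CAPACITIES, start=1):
        for _ in range(cap):
            friend = next(it, None)
            if friend is None:       # fewer friends than total capacity
                return buckets
            buckets[level].append(friend)
    buckets[5].extend(it)            # all remaining friends are level-5 contacts
    return buckets

b = fill_friend_buckets([f"friend{i}" for i in range(160)])
print([len(b[level]) for level in range(1, 6)])  # [5, 10, 35, 100, 10]
```

Note that the most trusted friends land in the innermost bucket first, matching the filling order described above.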
Since the different level buckets are limited in their sizes, and all nodes have different friends, it is not possible to assign bi-directionality to all links in the friendship network. The reason is that the closeness to a certain node has to be defined by each node itself (each node has to determine on its own whom to trust). For example, consider nodes that have only one friend in their routing table. If more than 5 of these nodes maintain level 1 closeness links to a specific node n, this node n will not be able to hold all of these nodes in its own level 1 bucket at the same time. Therefore, friendship might not exist mutually inside the friend graph which is induced by the Dunbar scheme. We assume in our simulations that each node has at least one friend it trusts. The resulting graph might be only weakly connected. This is the case if some nodes have only outgoing links but no incoming links, which means that no other node considers them trustworthy. In reality this might be the case if those nodes only request data from others but never offer data of their own.

Figure 9.2: Distribution of incoming and outgoing links for the Zhao's dataset model. (a) Histogram of incoming links. (b) Histogram of outgoing links.

To get a better feeling for the characteristics of the simulated graphs, Figures 9.2 and 9.3 and Table 9.2 show how incoming and outgoing links are distributed over the nodes in the different topology types with respect to the
considered friend levels. While the x-axis presents the number of incoming or outgoing links, the y-axis presents the absolute number of nodes which hold the given number of links. For the real-world dataset, Figure 9.2(a) shows that more than 350 out of 1000 nodes have no incoming links. Furthermore, Figure 9.2(b) reveals that each node has at least one outgoing link. Figure 9.3 shows the power-law distribution of incoming links in the Barabási topology. The distribution of outgoing links for the Barabási model, as well as the link distribution for the Watts and Strogatz Small World model, is presented in Table 9.2. The distribution of nodes for both models is presented for each closeness level and for selected numbers of incoming and outgoing links. All nodes which do not have 0, 5, 15, 50, or 150 links are listed as Others.

Figure 9.3: Distribution of incoming links for the Barabási-Albert model
9.2.2 Experimental Results

In the following, we focus on routing performance in Pastry in terms of successful lookups per closeness level and the number of hops each lookup request has to traverse. In Figures 9.4, 9.5, and 9.6 we present the number of successful lookups per maximum closeness level for the real-world Zhao's model, the Barabási-Albert model, and a Small World graph, respectively. As expected, Figures 9.4(a), 9.5(a), and 9.6(a) show that with SocialPastry, the more links each node uses for routing decisions, the more lookups are successful. Considering Figures 9.4(b), 9.5(b), and 9.6(b), on the contrary, reveals that with the goLLuM routing algorithm, lookups perform similarly regardless of which maximum closeness level is chosen.
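For intuition, a friend-restricted next-hop choice over the closeness buckets might look as follows. This is a deliberately simplified sketch: plain numeric distance stands in for Pastry's prefix-based metric, and the actual SocialPastry and goLLuM routing rules are more involved.

```python
# Illustrative sketch only: among the contacts in the friend buckets up to
# max_level, pick the one whose identifier is numerically closest to the
# key. The helper name next_hop is ours, not from the implementation.
def next_hop(key, friend_buckets, max_level):
    candidates = [n for level in range(1, max_level + 1)
                  for n in friend_buckets.get(level, [])]
    if not candidates:
        return None  # no trusted contact available at these levels
    return min(candidates, key=lambda node_id: abs(node_id - key))

buckets = {1: [12, 90], 2: [40, 77], 3: [5]}
print(next_hop(42, buckets, max_level=1))  # 12: only level-1 friends eligible
print(next_hop(42, buckets, max_level=2))  # 40: level-2 friends now eligible
```

Raising `max_level` enlarges the candidate set, which mirrors the observation above that SocialPastry succeeds more often when more (less trusted) links may be used.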
Figure 9.4: Zhao's Dataset: Finished Lookups. (a) SocialPastry. (b) goLLuM algorithm.
Figure 9.5: Barabási-Albert Model: Finished Lookups. (a) SocialPastry. (b) goLLuM algorithm.
Figure 9.6: Small World Graph: Finished Lookups. (a) SocialPastry. (b) goLLuM algorithm.
Barabási-Albert Model (outgoing links)

Links     Level = 1   Level ≤ 2   Level ≤ 3   Level ≤ 4
0             0           0           0           0
5           983           5           5           5
15            0         938           6           6
50            0           0         788           3
150           0           0           0         357
Others       17          57         201         629

Small World Graph

Links     Level = 1       Level ≤ 2       Level ≤ 3       Level ≤ 4
            In   Out        In   Out        In   Out        In   Out
0           71     0        61     0        26     0         0     0
5          919  1000         1     0         1     0         0     0
15           0     0       909  1000         1     0         0     0
50           0     0         0     0       874  1000         0     0
150          0     0         0     0         0     0      1000  1000
Others      10     0        29     0        98     0         0     0
Table 9.2: Link distribution for the Barabási-Albert model and the Small World graph

For each simulated friendship model, for both routing approaches, and for each selected level of closeness, Table 9.3 presents the average number of hops which have to be traversed to reach a given destination node successfully. Hop counts obtained from failed lookup attempts are not listed in the table. The actual hop count is presented in row Hops, whereas +σ and −σ denote the average standard deviation in the positive and negative direction. During the simulations, we observed that with the goLLuM routing algorithm, lookups fail only if no route between the starting point of a lookup and the target node exists. In this case, the lookup message is sent back to the initiator after some time. With SocialPastry, we further observed that lookups fail only due to routing loops in the friend topology. In our simulations we aborted lookup attempts after 50 hops. We have shown that routing via friends and friends-of-friends is possible if a path between the sender and the receiver of a message exists. Furthermore, it has been shown that the number of visited nodes is reasonable. We observe that the number of hops a lookup message has to traverse to reach the destination node lies around 2, regardless of the minimum closeness level that is used for routing. Lookup messages are either dropped after their hop count exceeds 50, or succeed after a few steps.
Zhao's Dataset
                   Level = 1   Level ≤ 2   Level ≤ 3   Level ≤ 4   Level ≤ 5
SocialPastry
  +σ                  7.47        8.88        8.92        9.04       12.77
  Hops                2.22        2.24        2.26        2.28        3.59
  -σ                  0.98        0.90        0.89        0.91        2.29
goLLuM Algorithm
  +σ                 16.60       14.00       11.30        9.21        9.41
  Hops               12.28        7.06        4.22        3.52        3.63
  -σ                  8.66        4.82        2.44        1.88        1.96

Barabási-Albert Model
                   Level = 1   Level ≤ 2   Level ≤ 3   Level ≤ 4   Level ≤ 5
SocialPastry
  +σ                  1.27        2.35        5.51        5.39       11.89
  Hops                1.26        1.70        2.39        2.09        3.17
  -σ                  0.29        0.70        1.12        0.93        1.93
goLLuM Algorithm
  +σ                  2.25        3.65       14.73        2.36        1.40
  Hops                2.65        2.99        5.44        1.07        1.04
  -σ                  1.31        1.70        4.16        0.09        0.06

Small World Graph
                   Level = 1   Level ≤ 2   Level ≤ 3   Level ≤ 4   Level ≤ 5
SocialPastry
  +σ                  3.18        3.05        2.80        3.87       12.03
  Hops                3.26        2.83        2.36        1.80        3.33
  -σ                  1.72        1.50        1.13        0.76        2.08
goLLuM Algorithm
  +σ                  5.69        3.80        9.23       11.49       11.13
  Hops                3.87        1.94        1.63        1.05        1.06
  -σ                  2.24        0.90        0.65        0.07        0.07

Table 9.3: Average hop counts for different routing methods, friendship models and levels of trust
9.3 Data Availability: Experimental Results
In this section we show the evaluation of our Trusted Social Storage approach, based on two replicas for each profile. The goal of our simulations is to investigate the quality of our data availability service in ego networks for a given real-world user behaviour. In order to judge the quality of our proposed protocol, we focus
on the number of available profiles in the network under churn. Additionally we observe the quantity of nodes which are responsible for storing the published data objects (PoSs) and we investigate the costs of our protocol to prevent data loss due to failing nodes.
9.3.1 Experimental Setup
The experiment comprises 1859 nodes, a small subset of the SocialCircles! Facebook dataset described in Chapter 8. From this set we extracted a subset of nodes which form a connected graph and represent real friendship relationships. From minute 1 to minute 10 of the simulation, each node starts the application and thereby joins the DHT overlay. During the simulation, each node observes its interactions with other nodes and thereupon computes an ordered list in which known contacts (friends) are sorted according to their Social Score. Friends with a high degree of interaction are placed at the top of the list and are more likely to be considered as PoS for the data availability protocol than nodes with a low contact frequency. At minute 14, those nodes which represent an ego node in the network start to publish private data (their profile) and start the data availability service protocol. Each of them selects an available good friend from the sorted interaction list to be the second PoS for its data. In our simulations, 28 nodes in total publish their profiles in this phase. In order to investigate the rate of available data in the network, we issue lookups for the published profiles every minute. From minute 30 to the end of the simulation, user churn is activated: nodes decide to join or leave the network according to measurements from SocialCircles!. Approximately half of the nodes which leave the network do so without informing the other PoS about their departure, whereas the other half informs the other PoS about its intention to leave. It is therefore possible that both PoSs for a given profile leave the network simultaneously without selecting another PoS for the profile.
Since in our simulations more nodes leave the network than join it, it is furthermore quite likely that certain profiles remain unavailable for the rest of the simulation as soon as both responsible PoSs leave the network simultaneously. We therefore expect lookup failures at the end of the simulation, when many nodes have left the network. Table 9.4 summarizes the setup of our simulations.
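The second-PoS election described above amounts to scanning the Social-Score-ordered friend list for the first online entry. A minimal sketch (names are illustrative, not from the implementation):

```python
# Sketch of the second-PoS election: scan the friend list, ordered by
# decreasing Social Score, and return the first friend that is online.
def elect_second_pos(friends_by_social_score, online_nodes):
    for friend in friends_by_social_score:
        if friend in online_nodes:
            return friend
    return None  # no friend online: the ego keeps the only copy for now

ranked = ["alice", "bob", "carol"]  # highest Social Score first
print(elect_second_pos(ranked, {"bob", "carol"}))  # bob
print(elect_second_pos(ranked, set()))             # None
```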
9.3.2 Experimental Results
As summarized in Table 9.4, our evaluation is divided into three different scenarios, to evaluate the reaction of our approach to the dynamics of the network:
Simulator Settings
  Simulator details:  PeerfactSim.KOM.
  Network model:      GNP, jitter based on [130], no packet loss.
  Churn model:        User churn according to the Facebook dataset from SocialCircles!.
  Analyser:           Focus on: number of online nodes, profile lookups, and message consumption.

Scenario Settings
  Experiment size:    1859 nodes, out of which 28 nodes are ego nodes.
  DHT Overlay:        Pastry.
  Parameter α:        Different values for the frequency of ping-pong messages.
  Experiments:
    A: One half of the nodes selects a new PoS when leaving, whereas the other half fails suddenly.
    B: All nodes fail upon a churn event; nodes leave without notification.
    C: Churn behaviour as in A; the simulation time is set to 48 hours.

Scenario Actions
  Minute 1-10:   Nodes join the network, especially the DHT.
  Minute 12:     Start periodic calculation of interactions among friends.
  Minute 14-16:  Ego nodes start the data availability service.
  Minute 30:     User churn is turned on. Most nodes go offline.
  Minute 120:    End of the simulation.
Table 9.4: Simulator Setup.

• Experiment A: one half of the nodes regularly execute the disconnection phase, in which they select a new PoS before leaving. The other half fails suddenly;
• Experiment B: all nodes fail upon a churn event. This is the worst scenario, in which all nodes leave without notification;
• Experiment C: the dynamics of the network are the same as in Experiment A, but the simulation time is set to 48 hours.

Experiment A

As can be seen in Figure 9.7(a), the number of participating nodes in the network decreases abruptly as soon as the simulation of user churn starts. Approximately 50 percent of the participating nodes leave between minute 30 and the end of the simulation: from 1859 online nodes at the beginning, only 800 to 900 are online at the end. In general, we observe that more nodes leave the network during our simulations than join it. This happens because the simulator has limitations in terms of performance and we have simulated a time window which represents a small interval of the whole crawled period. During the simulation, 28 ego nodes publish their profiles and select a second PoS to increase the availability of the profiles. Whenever a PoS leaves the network, voluntarily or abruptly without any proclamation of its departure, another node in the network is selected to take over the tasks of the leaving node. Figure 9.7(b) presents the number of online PoSs during the simulation. In Figure 9.7(a) it can be seen that most nodes leave the network abruptly in
minutes 38 and 39 of the simulation. Although many PoSs fail due to churn, the number of PoSs slightly recovers after the sudden reduction.

Figure 9.7: Experiment A: Quantity of online nodes and points of storage. (a) Number of nodes online. (b) Number of points of storage online.
Figure 9.8: Experiment A: Ratio of available profiles to total number of profiles.

We observe that some of the published profiles are unreachable throughout the whole simulation time. This circumstance is caused by the sudden failure of both PoSs of a certain profile. If neither of the current PoSs of the profile nor any other node which has been a PoS for the profile before rejoins the network, the published profile remains unavailable, and none of the lookups for this data item is successful. Figure 9.8 shows the ratio of available profiles to the total number of profiles disseminated in the network. In the fortieth minute, when the churn rate reaches its maximum, the number of reachable profiles decreases rapidly and thereafter increases again, so that after 25 minutes approximately 95% of all profiles are available in the network. The remaining profiles stay unavailable unless at least one of their current PoSs joins the network again.
We tested our data availability service with different values of α, where 1/α is the frequency with which ping messages are sent by the first PoS to the second PoS, and the frequency with which pong messages are sent back, respectively. The lower α is chosen, the higher the frequency of ping messages and the higher the data availability. Figure 9.8 reveals that higher values of α do not necessarily lead to a worse availability rate than lower values.
Figure 9.9: Experiment A: Quantity of ping-pong messages for different values of alpha. (a) Message consumption for different alpha. (b) Message consumption is proportional to the ping-pong frequency.

Furthermore, we focus on the costs of the data availability service in terms of message consumption. In order to react to failing nodes, the PoSs responsible for a certain profile exchange keep-alive messages. For every created ping message, a pong message is sent back to inform the supporting PoS about the presence of the other PoS. By this means, for every published profile two messages are periodically created to keep the availability of the data object alive. The total number of keep-alive messages sent in a given time interval T is therefore proportional to 2nT/α, where n is the total number of profiles published in the network and 1/α is the frequency with which ping messages are created. Figures 9.9(a) and 9.9(b) present the simulated amount of keep-alive messages per minute used to prevent PoS failures.

Experiment B

Nodes which leave the network due to user churn do not inform other nodes about their departure; instead, they leave without selecting a new PoS which could take over the responsibility for the published profile the PoS maintains. Figure 9.10(a) is equal to Figure 9.7(a): approximately half of the nodes fail due to churn. The number of available PoSs during the simulation of Experiment B is shown in Figure 9.10(b). Although failing nodes do not select another PoS, the data availability rate is similar to that of Experiment A. In Figure 9.10(c) it can be seen that the data availability is the higher, the faster keep-alive messages
are exchanged between two PoSs, and the faster node failures are detected. The message costs remain proportional to 2nT/α, where n is the total number of profiles published, T is the observed time interval, and 1/α is the frequency with which ping messages are sent, as can be seen in Figure 9.10(d).
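The cost relation can be made concrete with a small helper (a sketch; the numbers match the setup of Experiment A with 28 published profiles):

```python
# Each published profile causes one ping and one pong every alpha seconds,
# i.e. 2 * n * T / alpha keep-alive messages over an interval of T seconds.
def keepalive_messages(n_profiles, interval_s, alpha_s):
    return 2 * n_profiles * interval_s / alpha_s

# 28 published profiles, one hour of simulated time:
print(keepalive_messages(28, 3600, 60))   # alpha = 60 s  -> 3360.0
print(keepalive_messages(28, 3600, 180))  # alpha = 180 s -> 1120.0
```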
Figure 9.10: Experiment B: Simulation results, all nodes leave the network without selecting another point of storage (failure). (a) Number of nodes online. (b) Number of points of storage available in the network. (c) Ratio of available profiles to the total amount of published profiles. (d) Costs in terms of messages.
Experiment C

Figure 9.11 shows the results of Experiment C, in which we investigate our data availability service with the user behaviour given by the SocialCircles! Facebook dataset and with 48 hours of simulated time. During the whole simulation, the quantity of online nodes fluctuates around 800, which is less than 50 percent of all simulated nodes (Figure 9.11(a)). The results given by Figures 9.8 and 9.11(c) indicate that our proposed data availability service protocol with only 2 PoSs provides high availability of data objects under realistic circumstances and realistic user behaviour. Furthermore, the number of keep-alive messages in this simulation is proportional to the number of published profiles and to the inverse of α. As shown in Figure 9.11(d), for the lowest value of α the amount of messages
is higher than for higher values of α (lower frequency). For α = 180s, the highest value of α, the fewest messages are sent.
Figure 9.11: Experiment C: Simulation results, half of the leaving nodes select another PoS whereas the other half fails. The simulation time is set to 48 hours. (a) Number of nodes online. (b) Number of points of storage available in the network. (c) Ratio of available profiles to the total amount of published profiles. (d) Costs in terms of messages.
PoS Election Evaluation

To evaluate the PoS election strategy and how good it is in terms of data availability, we use our Facebook dataset (all 308 ego networks). The simulation replicates the actual online status of the users by considering the timestamp information. We evaluate the system through the following metrics:

1. Pure Availability [12]: the fraction of time in a day a user's profile is reachable through its PoSs, that is, the union of the times the user has at least one online PoS.
2. Friend Availability [12]: the fraction of time in a day a user's profile is reachable through its PoSs from its friends. In other words, since the profile of a user is accessible only by its friends, this metric focuses on the fraction of time the
profile of a user is available when its friends are online. In a social network, a higher Friend Availability (even with a lower Pure Availability) is desirable.

3. Average PoS Tie Strength: the average tie strength between the user and its online PoSs. A high value means that the PoSs have been chosen among more trusted friends.
4. Average Number of Elections: the average number of PoS elections made by a user.

We consider all possible pairs of the individual selection strategies, namely TieStrength and ConnectionGain (TieStrength & CG), TieStrength and Medium Session Length (TieStrength & MSL), and ConnectionGain and Medium Session Length (CG & MSL). Furthermore, we compare each paired heuristic with the social score (i.e. TieStrength & CG & MSL).

Pure Availability and Friend Availability

Pure Availability and Friend Availability in the Facebook dataset are shown in Figure 9.12. Since availability depends on users' traces and only marginally on the PoS selection strategy, the results are very similar for all the proposed strategies. Hence, we show only the graphs corresponding to the TieStrength & CG & MSL selection strategy. Pure Availability increases monotonically, and an availability of 90% is achieved for nodes with more than 40 friends. As expected, Friend Availability is always greater than or equal to Pure Availability. It is remarkable that even a user with a relatively small number of friends (e.g. 20) achieves more than 90% Friend Availability on average. For nodes with a very small number of friends (i.e. < 10), the difference between Pure Availability and Friend Availability is remarkable, since neither the node nor its friends are online for long periods of time.

Average Tie Strength

Since a user's PoSs are always chosen among its online friends, we expect the load to increase proportionally to the users' ego network size and to the percentage of time they spend online in the system.
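The paired strategies compared above can be illustrated with a small sketch: each online friend carries normalised tie-strength, connection-gain, and medium-session-length scores, and a strategy elects the friend maximising the product of the metrics it uses. The product-based scoring and all names here are assumptions made for illustration, not the thesis' exact Social Score.

```python
import math

# Hypothetical sketch of the paired PoS selection strategies: a strategy is
# the list of metric names it combines ("ts" = TieStrength, "cg" =
# ConnectionGain, "msl" = Medium Session Length).
def elect_pos(online_friends, metrics):
    """online_friends: dict name -> dict of normalised metric scores."""
    return max(online_friends,
               key=lambda f: math.prod(online_friends[f][m] for m in metrics))

friends = {
    "alice": {"ts": 0.9, "cg": 0.9, "msl": 0.1},
    "bob":   {"ts": 0.4, "cg": 0.5, "msl": 0.9},
}
print(elect_pos(friends, ["ts", "cg"]))         # TieStrength & CG -> alice
print(elect_pos(friends, ["cg", "msl"]))        # CG & MSL         -> bob
print(elect_pos(friends, ["ts", "cg", "msl"]))  # full social score -> bob
```

The toy example shows why the strategies can disagree: CG & MSL ignores the tie strength entirely, which is exactly why, as discussed below, it yields the smallest average PoS tie strength among the heuristics.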
The maximum load is about 70 profiles, and it is almost the same for all the considered strategies. Figure 9.13 plots the average tie strength between the nodes and their online PoSs as a function of the Dunbar ego network size for the different paired strategies. The average tie strength returned by the different strategies is almost the same for small Dunbar ego networks, since the choice of PoS is always restricted to the nodes
Figure 9.12: Pure Availability and Friend Availability

Figure 9.13: Average PoS tie strength and Dunbar ego network size

Figure 9.14: Simulation results. (a) Average tie strength. (b) Average Number of Elections.
that are online at the moment of the election. As the Dunbar ego network size grows, the difference between the average tie strength values obtained with the
different strategies increases and, as expected, the highest tie strength is obtained with the TieStrength & CG & MSL strategy. Since the CG & MSL strategy selects PoSs regardless of the tie strength with the ego, it presents the smallest tie strength among all heuristics. The histogram in Figure 9.14(a) shows the average tie strength between egos and their online PoSs. The average values of the different strategies are compared with the average tie strength between the same egos and all their ego friends (AvgTieEgo). As before, the average PoS tie strength is much higher for TieStrength & CG & MSL than for the other strategies, while the CG & MSL strategy yields values quite similar to the average tie strength of the Dunbar ego network.
Average Number of Elections

The average number of elections made by each user is shown in the histogram of Figure 9.14(b). The minimum average number of elections is obtained with the TieStrength & MSL and CG & MSL strategies, which exploit information about users' sessions for PoS elections, while for the remaining strategies these values are quite similar. Currently, the medium session length is computed during the simulation, by considering the previous sessions of the users during the considered period. We believe that the number of elections may be further reduced by using longer periods of time and/or by refining the methodology used to predict users' availability.
Evaluation of the number of replicas through network coverage

In our Trusted Social Storage approach we use 2 replicas for each profile. Here we evaluate what a suitable number of replicas is to guarantee a certain level of trust. To evaluate the number of replicas, we use an algorithm inspired by the network coverage approach proposed in [49]. We simulate network coverage by considering static ego networks with all ego nodes online. We use all the active users of our dataset (301 egos) and evaluate how many replicas are needed to cover the whole ego network of each node. Figure 9.15 shows the frequency of the number of Social Storages per ego needed to obtain network coverage. The maximum number of Social Storages that an ego node has is 18, the minimum is 1. The mean value is 2.8 (the standard deviation is 2.4). With this evaluation we learn that, on average, each node needs only 2 to 3 Points of Storage.
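The coverage evaluation can be conveyed with a plain greedy set cover (a sketch under the assumption that a Social Storage covers itself and its neighbours; the exact algorithm of [49] may differ):

```python
# Greedy sketch of the coverage evaluation: repeatedly pick the node that
# covers the most still-uncovered members of the ego network, where a
# chosen Social Storage is assumed to cover itself and its neighbours.
def min_social_storages(ego_network):
    """ego_network: dict node -> set of its neighbours in the ego network."""
    uncovered = set(ego_network)
    storages = []
    while uncovered:
        best = max(ego_network,
                   key=lambda n: len((ego_network[n] | {n}) & uncovered))
        storages.append(best)
        uncovered -= ego_network[best] | {best}
    return storages

# Toy ego network: "a" is adjacent to everyone, so one storage suffices.
net = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
print(min_social_storages(net))  # ['a']
```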
Figure 9.15: Frequency of the total number of replicas needed to achieve network coverage
9.4 Information Diffusion: Experimental Results
In this section, we present an evaluation of our information diffusion service. The evaluation is divided into two parts: first, we evaluate the distributed protocols proposed for the computation of the Weighted Ego Betweenness Centrality (and of the EBC); then, we evaluate the epidemic diffusion algorithm, which uses the Weighted Ego Betweenness Centrality protocols to compute the metric. For the evaluation of both the protocols and the epidemic diffusion algorithm, we use the Zhao's dataset, explained in detail in Chapter 8. Furthermore, we have evaluated the epidemic diffusion algorithm on subsets of nodes extracted from the original dataset, containing 4000 and 10000 nodes, respectively. Table 9.5 shows the structural properties of the two extracted networks.

Network Properties              4000 Nodes    10000 Nodes
Min. Node Degree                      1             1
Max. Node Degree                    150           150
Mean Node Degree                  6.692         7.674
StdDev Node Degree               10.788         11.84
Min. Dim. FriendOfFriend              1             1
Max. Dim. FriendOfFriend            861          2542
Mean Dim. FriendOfFriend        105.734        160.41
StdDev Dim. FriendOfFriend        74.77       139.318

Table 9.5: Network properties
9.4.1 Experimental Setup
The Social Graph has been used to evaluate the EBC computation on undirected graphs, and the Interaction Graph to evaluate the EBC on directed graphs. If a user j has performed an interaction, e.g. a post, towards a user i, a link from j to i is created in the Interaction Graph. We have evaluated the computation of the EBC in two different scenarios:

• static network: the computation analyses all the social relationships by considering every node to be online. In other words, each node is considered online and the static EBC (StaticEBC) evaluates the social importance of each node according to its social relationships, independently of the network configuration.
• dynamic network: the computation analyses the importance of a node with respect to its online/offline status. The dynamic EBC (DynamicEBC) does not consider offline nodes. We have exploited a network of 1000 peers and used a well-known churn model [131].
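As background for the results below, the EBC on an undirected ego network can be sketched in the common Everett-Borgatti formulation (assumed here as the reference formulation): every pair of non-adjacent nodes i, j in the ego network contributes 1/[A²]ᵢⱼ, i.e. the ego's share of the length-2 geodesics between them.

```python
# Pure-Python sketch of the ego betweenness centrality (EBC) for an
# undirected ego network given as a 0/1 adjacency matrix.
def ego_betweenness(adj):
    n = len(adj)
    # a2[i][j] = number of two-hop paths between i and j ([A^2]_ij)
    a2 = [[sum(adj[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    ebc = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i][j] == 0 and a2[i][j] > 0:
                ebc += 1.0 / a2[i][j]
    return ebc

# Star ego network: ego (index 0) plus three mutually unconnected alters;
# the ego mediates all three alter pairs.
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
print(ego_betweenness(star))  # 3.0
```

Because the sum only needs the ego's one-hop neighbourhood, each peer can recompute its EBC locally whenever its ego network changes, which is what the distributed protocols evaluated below exploit.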
9.4.2 Experimental Results
We have evaluated the correlation between the BC and the EBC on the social graph and on the interaction graphs by using the Pearson correlation. We have evaluated our protocols and the correlation between EBC and BC on directed and undirected graphs, by extracting different, randomly chosen networks from the dataset and by varying the number of nodes. Figure 9.16 shows the correlation between BC and EBC for an undirected graph containing 1595 nodes, and Figure 9.17 shows the correlation for a directed graph containing 5000 nodes. There is a strong correlation between the two metrics, and they coincide for nodes with centrality equal to 0. We have evaluated the average number of messages sent by each protocol by varying the number of nodes in the network, over 10 iterations (Table 9.6). The number of messages sent by the broadcast protocol is on average about three times the number of messages sent by the gossip protocol. The broadcast protocol has been introduced to provide fast message delivery between nodes, but this implies a larger number of messages. The gossip protocol provides message delivery with a small number of messages, and it can be used to prevent network congestion or in the case of devices with low capabilities.

EBC in a dynamic environment

In a distributed environment, peers can be online or offline. The status of the peers changes the underlying graph on which the EBC is calculated. Each time the
Figure 9.16: Normalised values for BC and EBC for a random network extracted from the dataset and composed of 1595 nodes

Figure 9.17: Normalised values for BC and EBC for a random network extracted from the dataset and composed of 5000 nodes

ego network of a peer changes, the EBC has to be computed again. We have evaluated how often this computation is required and the average number of time instants after which a peer has to recalculate the EBC. Figure 9.18 shows ten experiments. For each experiment, on the x-axis the
#Nodes    EBC Gossip Protocol    EBC Broadcast Protocol
1000            29646                  64580
2000            62581                 160006
3000           121220                 358254
4000           100824                 249187
5000           241339                 776880

Table 9.6: Number of messages of both protocols, varying the number of nodes
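The roughly threefold overhead of the broadcast protocol, noted earlier, can be checked directly against the values of Table 9.6:

```python
# Broadcast-to-gossip message ratios for the five network sizes of Table 9.6.
gossip    = [29646, 62581, 121220, 100824, 241339]
broadcast = [64580, 160006, 358254, 249187, 776880]
ratios = [b / g for g, b in zip(gossip, broadcast)]
print([round(r, 2) for r in ratios])        # per-size ratios
print(round(sum(ratios) / len(ratios), 2))  # about 2.7 on average
```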
Figure 9.18: Evaluation of how often the EBC computation is required in a network of 1000 nodes over 150 time instants

The y-axis shows the cumulative fraction of the nodes which have to recalculate the EBC at a time instant less than or equal to x. The results show that about 50% of the nodes recalculate the EBC after an average number of time instants ranging between 90 and 140 across the experiments. Furthermore, the majority of the nodes perform the calculation of the EBC only after at least 60 simulation cycles.

Epidemic Diffusion Algorithm

We have evaluated the WEBC for several networks randomly extracted from the real dataset. The contact frequencies of the social relations are used to compute the weights associated with each relation. The nodes are ranked on the basis of
the value of their EBC. Figure 9.19 reports the rank of the nodes on the x-axis, while the y-axis shows the value of the centrality indexes on a logarithmic scale. We can observe that the WEBC provides a better differentiation of the nodes, i.e. it allows us to evaluate nodes not only from the point of view of their structural properties, but also from a qualitative point of view. Nodes with a similar EBC are redistributed by the WEBC on the basis of the contact frequency of the paths.
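For reference, the EBC underlying both rankings can be computed locally from the ego network alone. The sketch below uses one standard formulation (due to Everett and Borgatti); the thesis may compute it differently, and the function and sample names are illustrative:

```python
def ego_betweenness(alters, edges):
    # Everett-Borgatti shortcut: with A the adjacency matrix of the ego's
    # neighbours (the ego itself is implicit, being connected to all of
    # them), the EBC is the sum, over non-adjacent pairs (i, j), of
    # 1 / #(two-hop paths between i and j); the path through the ego is
    # one of those paths.
    idx = {v: k for k, v in enumerate(alters)}
    n = len(alters)
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[idx[u]][idx[v]] = A[idx[v]][idx[u]] = 1
    ebc = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if A[i][j] == 0:
                # common neighbours among the alters, plus the ego itself
                two_hop = sum(A[i][k] * A[k][j] for k in range(n)) + 1
                ebc += 1.0 / two_hop
    return ebc

print(ego_betweenness(["a", "b", "c"], []))            # 3.0 (ego bridges every pair)
print(ego_betweenness(["a", "b", "c"], [("a", "b")]))  # 2.0
```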
[Figure 9.19 plots (EBC vs. WEBC, log scale): (a) Nodes 5639 - Edges 34065; (b) Nodes 7208 - Edges 49170; (c) Nodes 11621 - Edges 119100; (d) Nodes 18678 - Edges 166721]
Figure 9.19: Variation between EBC and WEBC for random networks extracted from the dataset

Figure 9.20 compares the epidemic diffusion algorithm exploiting the EBC, respectively the WEBC, with the baseline algorithm based on flooding, considering the two networks. The figure shows the CDF of the friends-of-friends (FoF) which have received at least a replica, for the different solutions. The results show that our algorithm obtains a lower number of replicas with respect to flooding, while the two heuristics are very close in terms of replicated updates.

[Figure 9.20 plot: CDF vs. % nodes replica in FoF, for Gossip WEBC, Gossip EBC and Flooding on the 10000-node and 4000-node networks]

Figure 9.20: Replica distribution in FoF

Finally, we have evaluated the WEBC heuristic by comparing it with a heuristic which selects the neighbours on the basis of the EBC. Figures 9.21 and 9.22 show the number of replicated updates as a function of the community size for the two heuristics, considering the 4000-node and the 10000-node networks respectively. The figures show that the percentage of replicas is similar in the two scenarios for both networks. We can conclude that the WEBC-based heuristic outperforms the EBC-based one: it returns a similar number of replicas while selecting the most important paths in terms of the tie strength of the nodes.
[Figure 9.21 plots: # replicated messages in FoF vs. max community size; (a) WEBC heuristic, (b) EBC heuristic]
Figure 9.21: Number of replicas according to the community size in a network of 4000 nodes
[Figure 9.22 plots: # replicated messages in FoF vs. max community size; (a) WEBC heuristic, (b) EBC heuristic]
Figure 9.22: Number of replicas according to the community size in a network of 10000 nodes
Part IV Conclusion
Chapter 10 Conclusion and Future Work

In this chapter we briefly summarize the research work conducted in this thesis, recall the main contributions, and discuss some possible improvements. Finally, we report the references to the papers (published and under review) in which such contributions are discussed and assessed.
10.1 Thesis Contributions
This thesis proposed an architecture for Distributed Online Social Networks which exploits Dunbar's concept to manage all the functionalities of the system and, in particular, to provide two important social services: data availability and information diffusion. We exploit the trust of users in their close friends as the basic concept for the definition of the system.

In Chapter 4, we introduced the structure of the system, which is composed of a Social Overlay and a P2P lookup service. We introduced a new concept of Social Overlay based on the definition of a Dunbar-based Ego Network, where each ego node maintains connections only with the 150 friends with whom it has the strongest relationships. We use the Pastry DHT to provide the lookup service and to implement the search function for new friends.

In Chapter 5, we refined the design of the DHT by defining trustful routes between sender and receiver. We proposed an optimization by introducing a Social DHT and a general approach which exploits friendly routing. The usage of a DHT is a critical point for the trust of the system and, by introducing a Social DHT, we are able to address this problem. Limiting the structure of the routing tables of Pastry implies that several routes may fail. Due to these drawbacks, we implemented a general protocol which is independent of the underlying overlay, overcoming the routing issue and applying friendly (social)
routing on top of any structured or unstructured P2P overlay. We described the goLLuM routing algorithm, which is guaranteed to find a trusted route between two arbitrary nodes if one exists. To do this, we use a modified, distributed depth-first search in which links with a short distance to the given target node are investigated first. The advantage of our approach is its independence from the underlying overlay: in principle it can be applied to any P2P system.

In Chapter 6, we proposed two replication-based approaches to face the data availability problem. We focused on trusted nodes, which we called Points of Storage. The first approach is a distributed storage support in which users' data are stored on trusted friends, guaranteeing the persistence and consistency of users' data. Each node dynamically elects a minimal set of Points of Storage among its friends by choosing, at first, its Dunbar friends. User data are dynamically transferred between online users in order to maximise the availability of users' profiles in the social network. The approach uses two replicas for each profile, chosen among the Dunbar friends of a node, and manages the dynamics of the storage nodes and the consistency of the replicas. The second approach follows the concept of providing full trust to the system: each online node has a backup storage node in its Dunbar-based Ego Network and, when it goes offline, this node provides network coverage for all its online friends. This means that a node can retrieve a profile by using only trusted social connections.

In Chapter 7, we proposed an approach to manage information diffusion by using a gossip dissemination protocol. The chapter is logically divided into two parts. The first is centred on the definition of the Ego Betweenness Centrality: we have evaluated a distributed computation of the EBC on undirected graphs, and we have studied the computation of the EBC on directed graphs.
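The goLLuM routing summarized above can be sketched, very roughly, as follows. This is a hypothetical, sequential stand-in for what is in reality a distributed protocol: `graph[u]` lists u's trusted neighbours and `distance(v, dst)` is any heuristic estimate of closeness to the target (e.g. numeric ID distance in the overlay); both names are illustrative:

```python
def trusted_route(graph, distance, src, dst, visited=None):
    # Depth-first search over trusted (friend) links only, trying first
    # the neighbours estimated closest to the target and backtracking
    # when a branch is exhausted; returns a path or None.
    if visited is None:
        visited = set()
    if src == dst:
        return [src]
    visited.add(src)
    for v in sorted(graph[src], key=lambda v: distance(v, dst)):
        if v not in visited:
            path = trusted_route(graph, distance, v, dst, visited)
            if path is not None:
                return [src] + path
    return None  # no trusted route through this branch

graph = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(trusted_route(graph, lambda v, dst: abs(v - dst), 0, 3))  # [0, 1, 3]
```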
Furthermore, we have provided two distributed protocols which can be used on directed and undirected graphs. The second part introduced a novel centrality index, the Weighted Ego Betweenness Centrality, which enables the definition of an efficient epidemic algorithm able to select the paths for the propagation of social updates on the basis of the weights paired with them.

In Chapter 8, we presented an analysis of the two real Facebook datasets that we logged through an application we developed by exploiting Facebook's API. One of the most important contributions of this thesis is the real, up-to-date dataset obtained through the SocialCircles! Facebook application (described in Appendix A). Our study uncovered a number of interesting findings related to the specific nature of online social networking environments. By using temporal information from real OSNs, we found that the availability patterns of single individuals have non-trivial relationships with those of their close friends. Furthermore, we showed the extent to which the availability pattern of each friend in the Dunbar's
circle affects the availability pattern of the user. Namely, social ties in the innermost circles are not only stronger in terms of volume of communications, but also show a higher similarity of the users' availability patterns. Finally, we have shown that users are more likely to be online when at least 10 of their Dunbar friends are online.

Finally, in Chapter 9, we showed the evaluation of our system. Firstly, we evaluated SocialPastry and the goLLuM algorithm: using different friendship models for social networks, we compared SocialPastry with the goLLuM algorithm. The evaluation reveals that the distributed depth-first search makes it possible to route via trusted friends only. Furthermore, we evaluated our data availability approaches: the experimental results show that our system guarantees a trade-off between high availability and a reduced number of replicas. Finally, we showed the evaluation of our information diffusion approach. We proposed an in-depth evaluation of the Ego Betweenness Centrality computation: the experimental results show a strong correlation between BC and EBC on directed and undirected graphs extracted from the Facebook regional dataset. Furthermore, using a well-known churn model, we evaluated how often a node has to recompute its ego betweenness.
10.2 Future Work
The work of this thesis is only a first step toward the definition of a distributed framework for Online Social Networks, and we plan to extend our work in several directions. The first important improvement is related to the usage of multiple devices by a user at the same time.

Furthermore, we plan to evaluate an alternative solution by considering explicit privacy policies. We plan to exploit authorizations defined by users at the service level in the underlying infrastructure level of the OSN that implements the mechanisms supporting the OSN services. Such mechanisms take advantage of the privacy policies defined by users to perform more efficient data allocation/diffusion decisions that preserve as much as possible the expected users' privacy. We plan to study a comparison between our system and this alternative approach. The data availability management service may enforce users' policies by including information that specifies where the content can be stored: exclusively on peers run by friends who are authorized to see the user's content, encrypted on untrusted devices, on the devices of the closest friends, or using heuristics which combine other architectural information to produce intelligent allocations. Moreover, another improvement to the data availability management service may consist of investigating how to manage load balancing and of studying in depth the
problem of data consistency. Regarding the information diffusion management service, we anticipate that our focus will be on optimizing the EBC recomputation in dynamic environments and on studying the EBC computation in a specific social overlay by using a weighted version of the social graph. The epidemic algorithm may be refined by pairing a history with each update, in order to further reduce the number of duplicated updates. We plan to integrate in our system a strategy to guarantee the availability of social content for offline peers. Finally, we will investigate the application of the WEBC in other contexts, for instance the link prediction problem.

Another important step of our work is to further investigate our Facebook dataset, in particular the impact of our findings on content distribution patterns. Answering these questions will let us explore opportunities for efficient content distribution and data replication, as well as advertisement and recommendation strategies. Lastly, based on our results, we plan to build a user churn model able to shape the availability patterns of user behaviour by incorporating most of our findings, including session distributions, tie strength, and temporal features.
Appendix A The Facebook application: SocialCircles!

In this appendix, we present SocialCircles!, our Facebook application used to crawl real users' data. We show how the application is structured, which functionalities we offer to attract users, and the development technologies and tools we used.
A.1 Introduction
SocialCircles! is our Facebook application used for data crawling. The first approach we considered when thinking about how to proceed with our data retrieval was building an HTTP crawler. A crawler is a tool which, starting from a given node, can explore the network and extract information: this tool could have been used to recursively explore the ego networks of the alters, thus permitting us to obtain a medium-sized dataset with profile information of at least 20,000-30,000 users. As the exploration strategy, we decided to follow the Breadth First Search (BFS) method. The BFS seemed the best choice, as opposed to the Depth First Search (DFS), which explores the graph "in depth" by increasing the hop distance, and the Random Walk strategy, which wasn't suitable for our needs. The idea was to retrieve and analyse an ego profile (the starting point), then to put all its direct alters in the exploration queue. When processing each alter, all its direct alters are added to the end of the queue. This implementation makes the queue grow, but keeps the explored nodes "close" (in terms of hop distance) to the starting node. Some research showed that crawling based on HTTP requests wasn't new: in particular, some researchers in 2011 [132] managed to build a dataset by using
this technique, so we were relieved the system could work as expected. Other interesting work in 2009-2010 showed the feasibility of the system [133]. However, Facebook has become much more careful about this activity only in the last one or two years, and took many countermeasures to discourage people crawling against it. We believe this is one of the main reasons why there are currently no big and up-to-date datasets of the Facebook network. Furthermore, by using a crawler we were facing a privacy issue: the obtained data couldn't be published (even anonymized). This problem convinced us that this path was just a dead end. We therefore changed approach: the Facebook infrastructure has a very flexible API system, which can be exploited to both read and publish users' data. We thought that by creating an application which users must approve, we could request the data we needed, even for a period after the last user usage of the application.
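The BFS exploration strategy described above can be sketched as follows; this is a toy stand-in in which the hypothetical `fetch_friends` callback replaces the HTTP request layer:

```python
from collections import deque

def bfs_crawl(fetch_friends, seed, limit):
    # BFS over the social graph: `fetch_friends(uid)` stands in for an
    # HTTP request returning the friend list of a profile; profiles are
    # visited in hop-distance order from the seed.
    visited, queue = {seed}, deque([seed])
    order = []
    while queue and len(order) < limit:
        ego = queue.popleft()
        order.append(ego)
        for alter in fetch_friends(ego):
            if alter not in visited:
                visited.add(alter)
                queue.append(alter)  # alters go to the end of the queue
    return order

# Toy network: 0 knows 1 and 2; 1 knows 3.
friends = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(bfs_crawl(friends.__getitem__, 0, 4))  # [0, 1, 2, 3]
```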
A.2 What data can we obtain?
The goal of the application is to retrieve a good number of Ego Networks, associated with a solid set of data. In order to proceed, we logically divided the information we wish to retrieve from each network into three tasks:

1. Topology and Profile Information. The first data we aim to retrieve is all topology and node profile information. By getting these data, we will be able to build a similar but richer and more up-to-date dataset than the one presented in [134], which will be used to study network structural properties and to evaluate social circles detection algorithms.

2. Interactions Information. By analysing posts, photos, flows of comments, likes, and tags between friends, we should be able to associate a different importance to people's relationships. By aggregating all this information we may be able to weight each link between two users (we could also give 2 different weights by considering the direction of the interaction). For practical reasons (the time needed to fetch all data), we restrict the information retrieval to up to 6 months prior to the user's application registration.

3. Online presence data. Finally, we noticed that we could easily obtain an approximate estimation of the time spent online by Facebook users: by requesting the online presence permission, we are able to request at any time the online status of each user and its friends. In particular, this mechanism will allow us to assess the co-availability aspect, which indicates whether users are online with respect to their friends' connections. Moreover
these data will be used to study and characterize the typical connection session behaviour. It's important to point out that this information is the only data obtainable from Facebook which gives an indication of the time spent on the OSN by each user.
A.3 The SocialCircles! functionalities
The idea of the application was to offer a tool which could attract users by letting them interactively discover some facts about their Facebook ego network. The first requirement was to build a simple and easy interface, which could allow users to register with just a few clicks by exploiting their Facebook account in a trustworthy way. For this reason we decided to create a dedicated website and embed the Facebook SDK library: this library allowed us to easily manage the Facebook login status and to subscribe to status changes. The whole session management is based on a variant of the OAuth 2.0 protocol [1], which controls both the user identity (authentication) and the data users will share with the application (authorization). More in detail, when a user authorizes our application through the provided login dialog, a short-lived user access token [2] with a validity of one or two hours is issued by the OSN: during the registration process, we negotiate with Facebook the exchange of this token for a long-lived user access token which lasts 2 months. This token is needed each time we want to fetch a user's data from Facebook. The website has been designed to be HTML5 compliant, with large use of JavaScript technology to make it interactive. When a user connects to the website, an index page (shown in Figure A.1) briefly explains the application goal and lets the user register. The website provides the following sections:

• The Graph page
• The Statistics page
• The Interactions page
• The Friends Map page
• About and Privacy statements
[1] https://developers.facebook.com/docs/facebook-login/overview/v2.0#secure
[2] https://developers.facebook.com/docs/facebook-login/access-tokens
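The short-lived/long-lived token exchange mentioned above goes through a Graph API endpoint documented by Facebook in that era; the sketch below only builds the request URL (the credentials are placeholders, and actually issuing the request would require a real app):

```python
from urllib.parse import urlencode

GRAPH = "https://graph.facebook.com"

def exchange_token_url(app_id, app_secret, short_lived_token):
    # Builds the Graph API request that trades a short-lived user access
    # token for a long-lived one (roughly 60 days), as described above.
    params = urlencode({
        "grant_type": "fb_exchange_token",
        "client_id": app_id,
        "client_secret": app_secret,
        "fb_exchange_token": short_lived_token,
    })
    return f"{GRAPH}/oauth/access_token?{params}"

url = exchange_token_url("APP_ID", "APP_SECRET", "SHORT_LIVED")
print("grant_type=fb_exchange_token" in url)  # True
```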
Figure A.1: The index page.

When the user clicks the login button, a Facebook popup appears asking the user to grant us the required permissions. If the user accepts, an AJAX call is made to send the registration details (Facebook uid and access token) to a webserver servlet. The whole registration procedure is asynchronous: in this way the registration process is immediate, and the new user doesn't need to wait for their data to be downloaded from Facebook (a process which could take some time) before exploring the website. After the login, while the AJAX registration call is made, the user is redirected to the graph page: while the user explores his graph, his data is retrieved in the background on the server.
A.3.1 The Graph page
When a user reaches this page, the application checks if the user is logged into their Facebook account. If they are not logged in, or did not accept the application, they are requested to log in. This is the default behaviour for every section of the website. If the user is logged in, an AJAX call with their uid and access token is made to build the graph representation of their ego network: an indefinite progress bar is shown and, when the data is finally ready (as a JSON response), the progress bar changes and shows the rendering status of the graph. The graph is generated as an SVG image
and rendered by the JavaScript library d3js [3] using a force-directed layout. Once the graph is fully loaded and displayed, the user can interact with it by dragging, zooming, and hovering: when the mouse passes over a node, a popup containing the name and the Facebook profile picture of that contact is shown.
Figure A.2: The graph page.
A.3.2 The Statistics page
This page uses the profile information of friends to show some aggregated statistics. Since these statistics are based on the retrieved profile information, we decided to show only some of the most relevant and most popular features, such as movies, musical preferences, and favourite books. By using the d3js library, we display a histogram which presents three datasets: the 20 most viewed movies, the 20 most listened-to musical groups, and the 20 most read books, as taken from friends' profiles. The data displayed in this section is retrieved
[3] d3js JavaScript library http://d3js.org/
dynamically from the server via an AJAX call to a dedicated servlet, processed by a JavaScript routine, and finally displayed with the d3js library. The presented histogram indicates, in the top right corner, the time of the last data update. If the profile information has not been retrieved yet, a message asking to come back in a few minutes is displayed.
Figure A.3: The statistics page.
A.3.3 The Interactions page
Using the interactions information, we collected the relationships of the user with its friends, in terms of posts, comments, and likes given and received. For the sake of clarity, this page shows only a high-level aggregated view of all the retrieved information, in the form of a histogram displaying an ordered list of the most relevant people, selected according to the type of interaction.
Figure A.4: The interactions page.
A.3.4 The Friends Map page
This page is created by using the powerful Google Maps API [4] in conjunction with a custom clustering library [5] to display a world map where the friends of a user are placed as markers at their current living location. The data is once again retrieved via AJAX as a JSON structure, and by iterating over the results the markers are added in real time to the map. To correctly handle the case of multiple markers at the same location, a small random offset is added to the positioning: many people living in the same place will be displayed as spread around the centre of the location within a small radius. Finally, by clicking on a cluster a user can expand it, and by clicking on a marker an infowindow containing the name and picture of that person is shown.
[4] Google Maps API https://developers.google.com/maps/
[5] MarkerClusterer library http://google-maps-utility-library-v3.googlecode.com/svn/trunk/markerclusterer/docs/reference.html
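The small random offset used to spread co-located markers can be sketched as follows; the radius and the metres-per-degree conversion are illustrative assumptions (the real page does this in JavaScript on the client):

```python
import math
import random

def jitter(lat, lon, max_radius_m=30.0, rng=random.Random()):
    # Spread co-located markers: add a random offset within a small
    # radius (in metres) around the true position.
    r = max_radius_m * math.sqrt(rng.random())  # uniform over the disc
    theta = rng.uniform(0, 2 * math.pi)
    # Rough conversion: one degree of latitude is about 111,111 metres;
    # longitude degrees shrink with the cosine of the latitude.
    dlat = (r * math.cos(theta)) / 111_111
    dlon = (r * math.sin(theta)) / (111_111 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon
```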
Figure A.5: The map page.
A.3.5 About and Privacy statements
As required by Facebook, we created a privacy statement explaining the nature, goals, and usage of the application and of the retrieved data.
A.4 The application structure
The SocialCircles! application is structured as a classical three-tier architecture (Figure A.6):

• Front-end: the website presented above. This module is the interface that users adopt to interact with the application.
• The Server: a Tomcat webserver to handle the traffic and generate dynamic content through servlets.
• The Database: a MySQL instance to store all users' data.

We will describe in detail the server, the database, and the deployment of the application later, in Sections A.6 and A.7.
Figure A.6: High level application design: a three-tier structure.
A.5 The development technologies and tools
Many different technologies and tools have been used to develop the whole application. The main tools used in the application development are the open source Eclipse [6] IDE for Java EE Developers and the standard Eclipse IDE. The programming languages involved are Java, HTML5, CSS, JavaScript, and SQL. Where needed, some small bash shell scripts have been created for specific purposes (e.g. a regular data backup). We will now give a more detailed explanation of the specific technologies used in the different application phases.
A.5.1 Front-End technologies
The website has been realized using standard web technologies: HTML5, CSS, and JavaScript. We put some effort into making the website highly interactive, and took advantage of many existing libraries and frameworks. We present them briefly.
[6] https://www.eclipse.org
• Facebook SDK for JavaScript [7]. This useful library allowed us to manage in a simple and transparent way the logged session status of users and to be notified of changes in real time. The library also allowed us to interact directly with Facebook by embedding some components, such as the Login button and the Like social plugin, directly in our website pages.

• AJAX: Asynchronous JavaScript and XML [8]. The AJAX technology is a mechanism to exchange data with the server and update parts of a web page without reloading the whole page. This technology is nowadays very common: most browsers support it natively and many websites exploit it. Since our website is highly dynamic and interactive, AJAX is a key component which allowed us to send and receive data to and from our server.

• The d3js JavaScript library [9]. To display ego network data we made heavy use of this powerful and versatile library: due to its flexibility we were able to handle dynamic data retrieved via AJAX from the server, manipulate the DOM of the page, and create interactive graphs and histograms. As the data exchange format we used the standard JavaScript Object Notation (JSON) [10].

• Google Maps API [11]. The powerful Google Maps API allows programmers to embed a flexible geographical map system inside their applications. We complemented this API with a library named MarkerClusterer [12], used to handle overlapping markers and cluster them into groups.

Since the standard Facebook SDK doesn't allow sharing dynamically generated content (an SVG image in our case), we took advantage of two components to implement a custom sharing functionality for the generated graph and statistics:

• Alertify JavaScript library [13]. This small library lets web developers embed an interesting popup and notification system, and lets users compile forms in an intuitive way.
[7] Facebook SDK for JavaScript https://developers.facebook.com/docs/javascript
[8] AJAX http://www.w3schools.com/ajax/default.ASP
[9] D3.js JavaScript library http://d3js.org/
[10] http://www.w3schools.com/json/
[11] Google Maps API https://developers.google.com/maps/
[12] MarkerClusterer library http://google-maps-utility-library-v3.googlecode.com/svn/trunk/markerclusterer/docs/reference.html
[13] Alertify JavaScript library http://fabien-d.github.io/alertify.js/
• Canvg JavaScript library [14]. This library is a high-performance SVG parser and renderer, with the ability to manipulate SVGs and transform them into HTML5 Canvas elements.

The custom sharing function works as follows: when the share button is clicked, we first make a call through the Facebook SDK asking for the publish action permission (needed to post a feed on the user's wall); when the user accepts, we display another popup (made with Alertify) in which the user can add a custom message. Then, when the user clicks share, we use the Canvg library to convert the SVG object into an HTML5 canvas, and then from the canvas to a PNG image. Finally, through another AJAX POST call, we upload this image to the user's profile wall via a Facebook API endpoint (we explain this API in the next subsection).
A.5.2 Server technologies
The environment used for our application is the Apache Tomcat 7.0 webserver [15]. We used Java Servlets [16] as the main technology to create the dynamic website content and provide the service. In the servlet development, we adopted other components; we cite the most important ones.

• Facebook Graph API [17]. This powerful infrastructure offered by Facebook allowed us to request and retrieve data easily in JSON format. To take advantage of it, an application has to issue HTTP requests to different endpoints with an associated valid token. Each endpoint provides a different type of information: user profile, musical interests, languages, friends, etc. For instance, it is possible to retrieve a user's friend list by issuing a request to the /friends endpoint. Furthermore, it is also possible to send data to some particular endpoints via HTTP POST, to publish content to the user's wall or photo albums.

• FQL [18]. This other API is a valid alternative to the previous one: FQL is a SQL-like language which allowed us to request data with more complex queries, thus retrieving nested and structured data with one single call, more efficiently than with the simpler Graph API. Due to its relational nature
[14] Canvg JavaScript library https://code.google.com/p/canvg/
[15] Apache Tomcat webserver http://tomcat.apache.org/
[16] http://www.oracle.com/technetwork/java/index-jsp-135475.html
[17] Facebook Graph API https://developers.facebook.com/docs/graph-api
[18] Facebook FQL API https://developers.facebook.com/docs/reference/fql
it allows data retrieval from different tables with one call, performing inner and dependent queries.

• Jackson Databind Library [19]. This library has been used to facilitate the parsing and handling of JSON objects. Since JSON is used as the standard communication exchange format by the Facebook APIs, we decided to adopt it also for our server-client communications.
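The Graph API request/response pattern described above can be sketched as follows; the `/friends` endpoint and the `"data"` wrapper match the Graph API of that era, the helper names are illustrative, and the canned JSON reply stands in for a real HTTP round trip:

```python
import json
from urllib.parse import urlencode

GRAPH = "https://graph.facebook.com"

def friends_url(uid, access_token):
    # Each Graph API endpoint is an HTTP GET with a valid token attached.
    return f"{GRAPH}/{uid}/friends?{urlencode({'access_token': access_token})}"

def parse_friends(body):
    # Graph API responses wrap result lists in a "data" field.
    return [entry["id"] for entry in json.loads(body)["data"]]

# In the servlet this would be the body of an HTTP response.
sample = '{"data": [{"id": "42", "name": "Alice"}, {"id": "7", "name": "Bob"}]}'
print(parse_friends(sample))  # ['42', '7']
```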
A.5.3 Database technologies
As the underlying database we used MySQL [20]. This relational database was well suited for our purpose, since we are dealing with many tables, and it offers a reliable Java interface (JDBC) for querying and retrieving data by issuing SQL statements programmatically. As the storage engine we decided to adopt MySQL InnoDB [21] instead of the default MyISAM. This choice was motivated mostly by the fact that a desirable property we wanted to achieve in our application was a consistent state of the database. Since we issue many heavy insertion queries involving several tables, we thought that the transactional mechanism offered by InnoDB was fundamental: by grouping all the insertions in one transaction, we can commit the changes to the DBMS only when all queries end successfully and, in case some exception happens, we can simply issue a rollback statement to undo the partially completed queries and bring the database back to the previous fully consistent state. The transaction mechanism therefore allows us to modify the database only with atomic actions.
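The commit/rollback pattern described above can be illustrated with SQLite standing in for MySQL/InnoDB (the actual application does this from Java via JDBC; the schema here is a made-up miniature):

```python
import sqlite3

# All related insertions are grouped in one transaction: committed
# together on success, or rolled back entirely on any failure.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (uid TEXT PRIMARY KEY)")
db.execute("CREATE TABLE friendships (a TEXT, b TEXT)")

def register_user(uid, friends):
    try:
        with db:  # opens a transaction; commits on success, rolls back on error
            # friendships first, so the rollback is visible in the demo below
            db.executemany("INSERT INTO friendships VALUES (?, ?)",
                           [(uid, f) for f in friends])
            db.execute("INSERT INTO users VALUES (?)", (uid,))
    except sqlite3.Error:
        pass      # the partial friendship rows were already rolled back

register_user("alice", ["bob", "carol"])
register_user("alice", ["dave"])  # duplicate uid: the whole batch is undone
print(db.execute("SELECT COUNT(*) FROM friendships").fetchone()[0])  # 2
```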
A.6 The server side
As the module providing the service to clients, we use the Apache Tomcat 7 webserver. We made this choice for a few reasons:

1. Flexibility. By exploiting the power of Java, we could easily develop servlets to provide and generate dynamic content for the website. Also, the use of filters gives more control over the incoming HTTP requests before dispatching them to servlets and pages.
[19] Jackson Databind Library https://github.com/FasterXML/jackson-databind
[20] MySQL server http://dev.mysql.com/downloads/mysql/
[21] http://dev.mysql.com/doc/refman/5.0/en/innodb-storage-engine.html
2. Multithreaded environment. Given the nature of Tomcat, we could take advantage of having a multithreaded system, especially by designing specialized threads for some tasks (such as handling the registration process). 3. Database communication. By using the JDBC driver22 we could easily communicate with a MySql instance. 4. Speed of deployment. Given a properly configured Tomcat instance, deploying the application is just a matter of preparing a .war archive, placing it in the webapps folder and restarting the server. The whole server structure is divided into several logical layers; figure A.7 shows them in detail.
A.6.1
The filter level
All incoming HTTP requests to the website are passed through a filter before the requested content is provided to the clients: this filter is a component which reads the user agent of the incoming HTTP request and, if it is recognized as a non-compatible agent, redirects the client to an error page which asks them to try a different browser or to update theirs. With this component, we are able to filter out older browsers and non-compatible agents in a transparent way.
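The check performed by the filter can be sketched as a pure function on the User-Agent header; the method name and the rejection policy below are hypothetical (the thesis does not list which agents are rejected), so this is an illustration, not the actual filter code:

```java
// Illustrative sketch of the user-agent compatibility check done by the
// filter level; the version policy is an assumption for illustration.
public class AgentFilterSketch {
    /** Returns true when the browser is considered compatible. */
    public static boolean isCompatible(String userAgent) {
        if (userAgent == null) return false;
        String ua = userAgent.toLowerCase();
        // Example policy: reject legacy Internet Explorer versions.
        if (ua.contains("msie 6") || ua.contains("msie 7") || ua.contains("msie 8")) {
            return false;
        }
        return ua.contains("firefox") || ua.contains("chrome")
            || ua.contains("safari") || ua.contains("msie");
    }

    public static void main(String[] args) {
        System.out.println(isCompatible("Mozilla/5.0 (X11; Linux) Chrome/35.0"));       // true
        System.out.println(isCompatible("Mozilla/4.0 (compatible; MSIE 6.0; Windows)")); // false -> error page
    }
}
```

In the real filter this decision drives the redirect to the error page before the request reaches any servlet.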
A.6.2
The servlet level
The core of the server includes a servlet level, directly accessible by users, and a background thread level not accessible by them. At this level, each servlet is specialized in providing a different service and receives requests through AJAX calls or normal HTTP GET requests: some servlets are responsible for generating the content displayed on the website pages, whereas others handle other events (such as refreshing tokens or handling unsubscribe requests). We will briefly describe each of them. All servlets share some common methods and classes (package it.socialcircles.utility).

22 JDBC MySql Connection Driver http://dev.mysql.com/downloads/connector/j/

Figure A.7: The server side design.

• HandleFBLogin. This servlet is responsible for reacting to a client logging into their Facebook account. When receiving a request, the servlet looks for an uid and an accessToken in the HTTP request query string. If these data are available, they are validated through the static utility function FbSignVerifier.isTokenIndentityConfirmed(uid, accessToken), which performs a call to the Facebook API to confirm the token validity and identity. After validating the token, a check is made on the database to see if the user has already been registered. In this case, the servlet terminates. Otherwise, the servlet initiates an exchange procedure of the (usually short-lived) token received with Facebook and obtains a long-lived token (with a duration of 2 months). This procedure involves some security measures, such as providing the application ID and an application secret string, known only by
us. With this fresh token the user is inserted into the registereduser table. Finally, the uid of the just registered user is put into the UpdateProfileQueue structure, a FIFO queue which is used to implement the producer-consumer pattern and thus complete the registration procedure. This queue is the bridge between the servlets and the underlying background demon threads level. • RefreshFBToken. Access to the user's data is granted only if the application owns a valid, non-expired user token. For this reason it is important to refresh the token at each access of the user to the website. The task of this servlet is to exchange the short-lived token with a freshly obtained long-lived token from Facebook. This new long-lived token is stored in the registereduser table, by updating the corresponding user record. This servlet is called on almost every page load of the website: this ensures that we obtain the most up to date long-lived token and are able to access the user information up to 2 months after his last website activity. • HandleFBUnsubscribe. This servlet is invoked as a callback function from Facebook, when a user decides to deauthorize or remove our application through his Facebook control panel. When a removal request is received, the payload is decrypted to ensure it is a valid Facebook signed request. The unregistered user is then removed only from the registereduser table, whereas his personal information is kept in the other tables (we can do this as stated by the Facebook application policies23), unless the user explicitly asks for complete data deletion. • GenerateFBEgoGraph. This servlet receives a request when the user clicks on the "The Graph" button in the website. The uid and token of the currently logged-in user are received; after validating them, the servlet begins the data retrieval procedure from the Facebook servers to build the social graph of the user and his friends.
We remark that the data required to build the graph are always downloaded in real time from the OSN instead of being retrieved from the database, because we wanted to keep this data up to date and immediately accessible. This differs from other website sections, such as statistics and interactions, which are updated at regular intervals. The graph building is structured as follows: first, the alter list of the ego is fetched via a Graph API call. This constructs the node set in one call. Next, the topology (the link set) is built by testing link presence between nodes through calls to the FQL friend table. This topology building process took some effort: to retrieve all links, we query the friend table by providing the list of nodes previously obtained. What we are doing is basically a Cartesian product of the whole node set with itself. As a response to this FQL query, Facebook returns all existing couples (which represent existing friendships). After some debugging, we realized that although working, this procedure was building just a partial topology: many links were simply missing. We studied the problem and discovered that Facebook limits the number of results of each query to 5000. To solve this issue, we decided to divide the node set into blocks of fixed size, and make the Cartesian product between all blocks. By splitting the topology query into subqueries, we were able to retrieve the full topology. To optimize even further and reduce the number of calls needed, we took advantage of the symmetry of the matrix, which led us to test only the lower triangular submatrix. However, we needed to face another practical issue, regarding the size of blocks: the exact block size of √5000 has the advantage of being "safe", because each product resultset is below 5000, but in practice it is not feasible: too many requests would have been made, making the real time graph generation too slow. A bigger block size would, on the other hand, lead to faster generation but could miss some links. The solution was a compromise: we observed that social network matrices in the Facebook context are rarely very dense.

23 https://developers.facebook.com/policy/
We could use a higher block size and, in case the resultset exceeded the maximum result limit, split a block into two equal subblocks and repeat the query. After some empirical testing, we decided to set the initialBlockSize parameter to 180. Tests were satisfactory: this number rarely led to splitting of blocks, while still being big enough to retrieve the full topology of even big graphs in a reasonable time. A last optimization was to merge the last block obtained during the partitioning of the node set with the second to last block, in case the last block size falls below a minBlockSize threshold, thus minimizing the number of products needed. We set this variable to 35. Finally, the graph representation is sent back to the client in JSON format to be handled by the Javascript library on the client side.
• GetStatistics. This servlet is invoked when clicking on the Statistics section. After validating uid and token, the servlet begins the retrieval of some statistics based on the data of the topology and profiles. If no data is available yet (a NULL value in the lastProfileRefresh field of the registereduser table), an error message is sent as the response. If data is available, some aggregation queries are executed on the database and the datasets are built. Finally, the datasets are sent back to the client as a JSON response. • GetInteractions. This servlet is almost identical to the previous one: instead of using topology and profile data, interaction information is needed to build datasets of the user's communication with his friends. After some aggregation queries, a JSON resultset is built and shipped back to the client. • GetFriendsMap. This servlet uses phase 1 data. Information about friends and their current living location is merged and sent back to the client.
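The block-splitting strategy used by GenerateFBEgoGraph to stay under the 5000-result cap can be sketched as follows. This is an illustrative reimplementation: the parameter values (180, 35, 5000) come from the text, while the FQL query itself is simulated by an expected pair count derived from an assumed link density:

```java
import java.util.*;

// Illustrative sketch of the topology-retrieval strategy; the FQL product
// query is simulated, not issued against Facebook.
public class TopologySketch {
    static final int RESULT_LIMIT = 5000;  // Facebook per-query result cap
    static final int INITIAL_BLOCK = 180;  // initialBlockSize from the text
    static final int MIN_BLOCK = 35;       // minBlockSize from the text

    /** Splits n nodes into block sizes, merging a too-small last block
     *  into the second to last one. */
    public static List<Integer> partition(int n) {
        List<Integer> blocks = new ArrayList<>();
        for (int left = n; left > 0; left -= INITIAL_BLOCK)
            blocks.add(Math.min(left, INITIAL_BLOCK));
        int last = blocks.size() - 1;
        if (blocks.size() > 1 && blocks.get(last) < MIN_BLOCK) {
            blocks.set(last - 1, blocks.get(last - 1) + blocks.get(last));
            blocks.remove(last);
        }
        return blocks;
    }

    /** Number of block products to test, exploiting the symmetry of the
     *  friendship matrix: only the lower triangle (diagonal included). */
    public static int lowerTriangularProducts(int numBlocks) {
        return numBlocks * (numBlocks + 1) / 2;
    }

    /** Simulates one product query between two blocks: when the resultset
     *  would exceed the cap, each block is split in two equal subblocks
     *  and the query repeated (density is an assumed link density). */
    public static int queriesNeeded(int sizeA, int sizeB, double density) {
        long expected = Math.round(sizeA * (double) sizeB * density);
        if (expected <= RESULT_LIMIT || sizeA == 1 || sizeB == 1) return 1;
        int a1 = sizeA / 2, a2 = sizeA - a1;
        int b1 = sizeB / 2, b2 = sizeB - b1;
        return queriesNeeded(a1, b1, density) + queriesNeeded(a1, b2, density)
             + queriesNeeded(a2, b1, density) + queriesNeeded(a2, b2, density);
    }

    public static void main(String[] args) {
        System.out.println(partition(400));             // [180, 180, 40]
        System.out.println(partition(390));             // [180, 210] -- 30 < 35, merged
        System.out.println(lowerTriangularProducts(3)); // 6 products for 3 blocks
        System.out.println(queriesNeeded(180, 180, 0.5)); // 4 -- one split needed
    }
}
```

With a sparse graph (as observed in practice) a 180x180 product almost never exceeds the cap, which is exactly why the compromise works.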
A.6.3
The background demons level
Below the servlet level, two threads called UpdateProfilesDemon and UpdateInteractionsDemon are in execution on the webserver, not directly accessible by clients. They are the other key component of the server. The UpdateProfilesDemon is a consumer for the registration procedure: it waits for an uid to be consumed from the UpdateProfileQueue, a queue fed by the producer HandleFBLogin servlet. Once an uid is consumed, the UpdateProfilesDemon behaves as a producer for the other demon, the UpdateInteractionsDemon, by feeding the interactionsQueue. The UpdateInteractionsDemon consumes the received uid and completes the registration procedure for a user. • UpdateProfilesDemon. This demon completes the first stage of the registration process. It is responsible for retrieving topology and profile information from Facebook, parsing them properly, handling special cases and finally storing them in a structured way into the database tables. The whole process requires a few API calls but may take some time, especially for bigger networks. Once this stage is done, the demon updates the registereduser table by setting the profilesRefresh field to an updated timestamp. • UpdateInteractionsDemon. In a similar fashion, the UpdateInteractionsDemon is responsible for fetching interaction and communication information of the user with his friends.
In comparison to the previous stage, this process is usually much slower and many more API calls are needed. We spent some effort in optimizing and minimizing the number of queries to Facebook, especially by requesting all nested information (such as the likers of comments of posts) at once. Despite this, retrieving the whole user feed may be very expensive, especially for very active Facebook users. For this reason we decided to limit the retrieval to up to 6 months before the current update date. On termination, the InteractionsRefresh field is updated in the registereduser table, thus signaling the successful completion of a user registration.
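The two-stage producer-consumer pipeline described above can be sketched with blocking queues. This is a minimal illustrative model in which the actual Facebook fetch steps are elided, not the thesis code:

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch of the registration pipeline: HandleFBLogin feeds
// the UpdateProfileQueue; the profiles demon consumes it and feeds the
// interactionsQueue; the interactions demon completes the registration.
public class PipelineSketch {
    public static List<String> process(List<String> uids) throws InterruptedException {
        BlockingQueue<String> updateProfileQueue = new LinkedBlockingQueue<>(uids);
        BlockingQueue<String> interactionsQueue = new LinkedBlockingQueue<>();
        List<String> completed = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch done = new CountDownLatch(uids.size());

        Thread updateProfilesDemon = new Thread(() -> {
            try {
                while (true) {
                    String uid = updateProfileQueue.take(); // consume stage 1
                    // ...fetch topology and profile data here...
                    interactionsQueue.put(uid);             // produce for stage 2
                }
            } catch (InterruptedException stop) { }
        });
        Thread updateInteractionsDemon = new Thread(() -> {
            try {
                while (true) {
                    String uid = interactionsQueue.take();  // consume stage 2
                    // ...fetch interaction data, update registereduser...
                    completed.add(uid);
                    done.countDown();
                }
            } catch (InterruptedException stop) { }
        });
        updateProfilesDemon.start();
        updateInteractionsDemon.start();
        done.await();                                       // all registrations completed
        updateProfilesDemon.interrupt();
        updateInteractionsDemon.interrupt();
        return new ArrayList<>(completed);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(process(List.of("uid-1", "uid-2"))); // [uid-1, uid-2]
    }
}
```

The FIFO queues preserve arrival order, so registrations complete in the order in which HandleFBLogin produced them.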
A.6.4
Other demons and maintenance script
Up to now, these layers live inside the Tomcat webserver. This means that they stop working once the webserver is stopped, and if some modification is needed the webserver must be stopped and restarted. To be more flexible and avoid downtime where possible, we chose to implement all the other needed utilities and maintenance scripts as "standalone" executables, which can be scheduled for execution by a time based scheduler totally independently of Tomcat. More in detail: • UpdateProfilesDemon. This demon is almost identical to the thread running inside the application: its purpose is not to register, but to keep topology and profile information up to date. We decided to schedule this program every 3 days: on launch it retrieves the registered users list, sorting the users by the last profiles update, thus giving refresh priority to the least up to date information. As a possible future optimization, we thought about making this demon multithreaded (thus speeding up the profile updates by updating different users concurrently). We chose to postpone the implementation of this step until we had a clearer picture of the load put on the database by each update. • UpdateInteractionsDemon. Equivalent to the previous demon, but for interaction data. Due to the faster evolution of this information (which includes posts, comments, likes, tags), the data are refreshed every day. Like the first demon, a multithreaded architecture could bring benefits in terms of speed. • OnlinePresenceDemon. This demon is intended for monitoring the users' online presence behaviour: it regularly asks for the chat status of registered users
and of their friends, and stores the status inside the database online status table. It therefore implements phase 3, retrieving data which can be used to further analyze user session behaviour. It is important to notice that the chat status may not reflect the real online presence of users, since many of them can decide to keep their chat offline even while they are using Facebook. Nonetheless, we think that the final result is statistically significant. Due to its nature, this demon generates a lot of Facebook requests; therefore, it is run only when needed (typically with a sampling time of 5 minutes). To avoid an excessive amount of traffic it can be configured to be launched with a given list of uids to monitor. • Database backup script. This is a maintenance script which does a full backup of the current database: it dumps the whole database into one sql file, compresses it in .gz format and sends it via scp to another Amazon instance. The script keeps the last 5 days of backups and removes older files. This script is scheduled for execution each day at midnight.
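The backup-retention rule (keep only the five most recent daily dumps) can be sketched as follows; this is an illustrative reimplementation in Java rather than the actual shell script, and the dump file naming scheme is an assumption:

```java
import java.util.*;

// Illustrative sketch of the retention rule: keep the newest `keep` dumps,
// delete the rest. Names like db-2015-06-01.sql.gz (assumed) sort by date.
public class BackupRetentionSketch {
    /** Returns the dump filenames to delete, keeping the newest `keep`. */
    public static List<String> toDelete(List<String> dumps, int keep) {
        List<String> sorted = new ArrayList<>(dumps);
        sorted.sort(Comparator.reverseOrder()); // newest first, by name
        return sorted.size() <= keep
            ? List.of()
            : new ArrayList<>(sorted.subList(keep, sorted.size()));
    }

    public static void main(String[] args) {
        List<String> dumps = List.of(
            "db-2015-06-01.sql.gz", "db-2015-06-02.sql.gz", "db-2015-06-03.sql.gz",
            "db-2015-06-04.sql.gz", "db-2015-06-05.sql.gz", "db-2015-06-06.sql.gz");
        System.out.println(toDelete(dumps, 5)); // [db-2015-06-01.sql.gz]
    }
}
```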
A.7
The database
In this application, the database is probably the key and most critical component: there are many insertions and updates when downloading and registering new users, as well as many selection statements when generating the dynamic content of the website. For this reason a properly designed database was a priority. We put some effort into building a schema which allowed us to store the large amount of data in an optimized and convenient format, giving at the same time a lot of flexibility and expressive power. Although some tables are currently not used directly by the servlets which provide dynamic content to the website, we decided to have them for future expansions. The database we designed is relational and consists of a total of 42 tables. We will give a brief explanation of its features, reflecting the different phases of usage we described above. An important requirement was data integrity: to achieve it, we made heavy use of transactions. To preserve data consistency during insertions / updates / deletions, we used the transaction mechanism offered by the InnoDB24 storage engine: insertions are not completed until an explicit signal is issued to commit the
24 http://dev.mysql.com/doc/refman/5.0/en/innodb-storage-engine.html
transaction (and hence modify the database). This mechanism allowed us to easily handle any possible exception arising during the retrieval or insertion of data: it is enough to issue a rollback statement to bring the database to a previous consistent state and abort the current failed transaction. We will now describe in detail all the tables involved in our design, grouping them according to the logical task they are designated for.
A.7.1
Service tables
In this category we only have one table, needed to handle the registration of users. • registereduser(uid, registrationDate, email, accesstoken, expires, profilesRefresh, interactionsRefresh). This table is used when registering, updating and deleting registered users of the application. It is the most important table, since it contains the accesstoken needed to retrieve all user data from Facebook.
A.7.2
Topology and profiles tables
To store the information needed for this phase, 23 tables are needed. We used Facebook ids where possible, but in some cases these ids are missing. For optimization purposes we had to use auto-generated ids: for example, when retrieving the movies liked by a user, Facebook returns them as a comma separated list of names. We tokenize this string and insert each entry into the movie table if it does not exist. If a movie exists, the corresponding already generated id is retrieved. • user(uid, first name, last name, birthday day, birthday month, birthday year, sex, hometown location, current location). One of the main tables, it stores the basic information of the registered users and their friends. This table has many constraints, thus a deletion in this table has many cascade effects on other tables. • friend(uid1, uid2, weight1 2, weight2 1, isvalidLink ). Another main table, it is responsible for storing all the friendship relationships (links) between couples of users. The weight fields represent the strength of the directed links and are initialized to 1 by default. The isvalidLink field is used during the periodical profile information refresh. Note: we decided not to have any foreign key constraint from uid to the user table: this allows us to be more flexible and add links between users that
Figure A.8: Database schema - Topology and profiles
are not stored; this is useful during the "full topology building" phase we will describe later. Note that we do not store links between alters, but only between an ego (a registered user) and his alters. We decided on this approach to minimize the storage space needed, and also because we cannot (for privacy reasons) retrieve the information needed to estimate the weights of links between alters, such as posts or comments between alters that happen on alter walls. • location(fbid, city, state, country, latitude, longitude). Information relative to locations is stored here. All these data can be incomplete, except for the Facebook identifier (fbid ). • work(uid, location, employer, position, start date, end date). Contains user work information. Note that we do not have a primary key here, because a user can work for the same employer in different work positions. • workposition(fbid, name). Information about a work position. • workemployer(fbid, name). Describes employers. • school(fbid, name). Describes a school. • user school(uid, fbid, type, year, concentration, degree). Describes a user school relationship. • schooltype(id, description). Describes the type of school. • schooldegree(fbid, description). Describes the degree of a school. • schoolconcentration(fbid, description). Describes the concentration of a school. • device(id, description). A device used to connect to Facebook. • user device(uid, id ). Relationship between user and device. • movie(id, name). A movie. • user movie(uid, id ). Relationship between user and movie, represents likes. • music(id, name). A musical group / singer. • user music(uid, id ). Relationship between user and music, represents likes. • book(id, name). A book.
• user book(uid, id ). Relationship between user and book, represents likes. • interest(id, name). An interest, such as sport or traveling. • user interest(uid, id ). Relationship between user and interest, represents likes. • language(fbid, name). A language. • user language(uid, fbid ). Relationship between user and language.
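The tokenize-and-insert logic described above for tables such as movie can be sketched as follows; in-memory maps stand in for the SQL table and its auto-generated id, so this is an illustration of the get-or-create pattern, not the thesis code:

```java
import java.util.*;

// Illustrative sketch: Facebook returns liked movies as one comma-separated
// string; each title is inserted only if absent, otherwise the previously
// generated id is reused (cf. the movie / user movie tables).
public class GetOrCreateSketch {
    private final Map<String, Integer> movieIds = new LinkedHashMap<>(); // name -> id (movie table)
    private int nextId = 1;                                              // stands in for AUTO_INCREMENT

    /** Returns the ids of all movies in the list, creating rows as needed. */
    public List<Integer> idsFor(String commaSeparated) {
        List<Integer> ids = new ArrayList<>();
        for (String token : commaSeparated.split(",")) {
            String name = token.trim();
            if (name.isEmpty()) continue;
            ids.add(movieIds.computeIfAbsent(name, k -> nextId++));
        }
        return ids;
    }

    public static void main(String[] args) {
        GetOrCreateSketch m = new GetOrCreateSketch();
        System.out.println(m.idsFor("Alien, Blade Runner"));  // [1, 2]
        System.out.println(m.idsFor("Blade Runner, Brazil")); // [2, 3] -- existing id reused
    }
}
```

In the real schema the returned ids would then populate the user movie relationship table.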
A.7.3
Interaction Information tables
The data needed for this task are divided into 17 tables and let us estimate the weights of interactions among the ego and the alters. One major challenge we faced during this stage was the retrieval of posts on the news feed. We noticed that a post written by a user containing tags to other people is returned by the API with different ids, according to the user who retrieves it. This happens because the same post is replicated on different users' walls. To be able to aggregate as much information as possible and keep the database structure consistent (and avoid duplicates), we needed a unique id. After some thinking, we decided to consider the post author id and the timestamp (with a precision of seconds) of post creation as a unique pair: this means we assumed that a user can only post one feed entry per second, so that this couple identifies exactly one post. After solving this issue, we had another problem: on the news feed, many posts are automatically generated (application stories). Since they are often generated automatically by user activities on applications (such as games), we thought they were useless and decided to filter them. Another desirable property of this choice was filtering the common "happy birthday" messages posted on user walls: these messages are often written through a small application on the top right corner of Facebook, thus are different from real posts written by a user on a friend's wall. It is important to notice that all the posts, comments, likes and photos retrieved do not represent all user activity, but only the activities that are viewable on the user's wall or photos: this means we cannot retrieve a post a user wrote on a friend's wall (because of missing access permissions), nor its comments, likes or tags. To partially overcome this limitation we are able to request, although in a limited way, the user's "outgoing" activities: we can retrieve the likes given to posts or photos owned by friends.
To avoid potential duplicates, when inserting the phase 2 information we run a "cleanup" query: if a post was inserted into the outgoingpost (or outgoingphoto) table, and at some time another user registers to the application, it may happen that the same post is fully retrieved and placed into the post table. By issuing a deletion query matching the post id, we make sure that a post is placed either in the post table or in the outgoingpost table, but not in both at the same time.
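The de-duplication key and the cleanup step can be sketched as follows; the key format (author id and timestamp joined by an underscore) is our illustration, not necessarily the exact representation used in the thesis code:

```java
import java.util.*;

// Illustrative sketch: the same post is replicated with different Facebook
// ids on different walls, so the pair (author id, creation timestamp to the
// second) is used as the unique key, under the one-post-per-second assumption.
public class PostKeySketch {
    /** Builds the synthetic post key (format is hypothetical). */
    public static String postKey(long authorId, long createdTimeSeconds) {
        return authorId + "_" + createdTimeSeconds;
    }

    /** Cleanup step: once a post is fully retrieved it must live in the
     *  post table only, never in both (cf. the deletion query above). */
    public static void promote(Set<String> outgoingpost, Set<String> post, String key) {
        outgoingpost.remove(key);
        post.add(key);
    }

    public static void main(String[] args) {
        // The same post seen on two walls collapses to one key:
        String fromAliceWall = postKey(1001L, 1370000000L);
        String fromBobWall   = postKey(1001L, 1370000000L);
        System.out.println(fromAliceWall.equals(fromBobWall)); // true
    }
}
```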
Figure A.9: Database schema - Interaction information

• post(post id, author id, created time, targeted id, place). This table stores information regarding posts. The post id is an autogenerated key obtained by considering author id and created time as unique values. The field targeted id can be an uid if the post has been written on this wall by someone else (a "directed wall message"), or if that person is the only tagged person. If the author is the wall owner, or if there are other people tagged, this field is set to null. Finally, the place field represents the geographical location where this post has been created. • post like(post id, author id ). Stores all likers associated to a post. • post tag(post id, tagged id ). Stores all possible types of user id tags present in a post, "with tags" or normal tags.
• place(fbid, name, latitude, longitude). This table stores information regarding places: it is very similar to the location table of phase 1, but here the data are typically described with a much finer granularity (for example, "Polo Fibonacci, Pisa University"). • comment(post id, comment id, author id, created time). This table stores comment information. The comment id is the same as the Facebook id. Note that we do not store comments of comments, but only first level comments. • comment like(post id, comment id, author id ). Stores all likers associated to a comment of a post. • comment tag(post id, tagged id ). Stores user id tags present in a comment of a post. • photo(photo id, owner id, created time, place). This table stores information regarding photos. Note that a feed message with exactly one photo is considered as a photo, so it is inserted here and not in the post table. photo id represents the Facebook object id, owner id is the id of the user who owns this picture. • photo like(photo id, author id ). Stores all likers associated to a photo. • photo tag(photo id, tagged id ). Stores tagged people's ids. • photo comment(photo id, comment id, author id, created time). This table stores comment information of photos. • photo comment like(photo id, comment id, author id ). Stores all likers associated to a comment of a photo. • photo comment tag(photo id, tagged id ). Stores user id tags present in a comment of a photo. • outgoingpost(post id, author id, created time). Stores all posts "seen" when analyzing outgoing activities. This table may be a "temporary" storage for not yet fully retrieved posts. • outgoingpostlike(post id, author id ). Stores likes associated to posts seen in outgoing activities. • outgoingphoto(object id, author id, created time). Stores all photos "seen" when analyzing outgoing activities. This table may be a "temporary" storage for not yet fully retrieved photos.
• outgoingphotolike(object id, author id ). Stores likes associated to photos seen in outgoing activities.
A.7.4
Online status table
Only one table is needed:
Figure A.10: Database schema - Online status table

• online status(uid, timestamp, status). This table holds all the data stored while monitoring the users' chat presence. The chat status can assume a limited set of values: 0 if the user is offline, 1 if the user is in an active state, 2 if the user is idle. Finally, if the Facebook API cannot determine the state of a user, an error is returned and a null is stored.
A.8
The hosting and the deployment
To host the whole system, we used a CentOS 64bit based virtual machine hosted in the Pisa Computer Science department. The machine is a dual core AMD Opteron equipped with 8GB of memory, which we could access via ssh as the root user. We set up the machine by installing and properly configuring the Tomcat 7.0 instance and the MySql 5.6 instance, and by tweaking the memory parameters as for a dedicated server. The firewall has been configured to allow access only on port 80 (HTTP), and to redirect traffic through port forwarding to port 8080 (the Tomcat user port). Finally, we set up the crond scheduler to execute the other scripts and maintenance programs periodically. As a last step, we bought and configured the first level domain www.socialcircles.eu to point at our server.
A.9
The application publishing
Before the public application release, we had a small testing period with some selected users. With 15-20 testers, we were able to discover and solve many small bugs and issues. Once satisfied with the stability of the system, we decided to publish it. After publishing the application, we received positive feedback and appreciation from users, reaching 337 registered users in three weeks. Most of the registrations were performed by people living in Italy, but we had many contacts from abroad as well.
Bibliography [1] D. Boyd and N. B. Ellison. Social network sites: Definition, history, and scholarship. Journal of Computer-Mediated Communication, 13(1), 2007. [2] A. Datta, S. Buchegger, L. Vu, T. Strufe, and K. Rzadca. Decentralized Online Social Networks. In Handbook of Social Network Technologies and Applications, pages 349–378. 2010. [3] H. Jones and J Soltren. Facebook: Threats to privacy. Project MAC: MIT Project on Mathematics and Computing, 2005. [4] C. Maurieni. Facebook is Deception (Volume One). WSIC EBooks Ltd, 2012. [5] Diaspora. https://joindiaspora.com. [6] J. Buford, H.r Yu, and E. K. Lua. P2P Networking and Applications. 2008. [7] L. A. Cutillo, R. Molva, and T. Strufe. Safebook: A privacy-preserving online social network leveraging on real-life trust. Comm. Mag., 47(12):94– 101, 2009. [8] K. Graffi, C. Groß, P. Mukherjee, A. Kovacevic, and R. Steinmetz. Lifesocial.kom: A p2p-based platform for secure online social networks. In Peerto-Peer Computing, pages 1–2, 2010. [9] K. Graffi, C. Groß, D. Stingl, D. Hartung, A. Kovacevic, and R. Steinmetz. Lifesocial.kom: A secure and p2p-based solution for online social networks. In Proceedings of the IEEE Consumer Communications and Networking Conference, number 2011, pages 554–558, 2011. [10] S. Buchegger, D. Schioberg, L.H. Vu, and A. Datta. Implementing a p2p social network-early experiences and insights from peerson. In Second ACM Workshop on Social Network Systems (Co-located with EuroSys 2009), number LSIR-CONF-2009-007, 2009.
226
APPENDIX A. BIBLIOGRAPHY
[11] S. Nilizadeh, S. Jahid, P. Mittal, N. Borisov, and A. Kapadia. Cachet: a decentralized architecture for privacy preserving social networking with caching. In Proceedings of the 8th international conference on Emerging networking experiments and technologies, CoNEXT ’12, pages 337–348. ACM, 2012. [12] R. Narendula, A. Papaioannou, and K. Aberer. A decentralized online social network with efficient user-driven replication. In Proceedings of the 2012 ASE/IEEE International Conference on Social Computing and 2012 ASE/IEEE International Conference on Privacy, Security, Risk and Trust, SOCIALCOM-PASSAT ’12, pages 166–175, 2012. [13] Vipul Goyal, Omkant Pandey, Amit Sahai, and Brent Waters. Attributebased encryption for fine-grained access control of encrypted data. In Proceedings of the 13th ACM Conference on Computer and Communications Security, CCS ’06, pages 89–98. ACM, 2006. [14] S. Singh and S. Bawa. A privacy, trust and policy based authorization framework for services in distributed environments. International Journal of Computer Science, 2(2):85–92, 2007. [15] N. W. Coppola, S. R. Hiltz, and N. G. Rotter. Building trust in virtual teams. Professional Communication, IEEE Transactions on, 47(2):95–104, 2004. [16] G. Piccoli and B. Ives. Trust and the unintended effects of behavior control in virtual teams. MIS quarterly, pages 365–395, 2003. [17] P. Resnick, K. Kuwabara, R. Zeckhauser, and E. Friedman. Reputation systems. Communications of the ACM, 43(12):45–48, 2000. [18] S. Ruohomaa, L. Kutvonen, and E. Koutrouli. Reputation management survey. In Availability, Reliability and Security, 2007. ARES 2007. The Second International Conference on, pages 103–111, 2007. [19] R. Dunbar. The social brain hypothesis. Evolutionary Antropology, 6:178– 190, 1998. [20] B. Pourebrahimi, K. Bertels, and S. Vassiliadis. A survey of peer-to-peer networks. In Proceedings of the 16th Annual Workshop on Circuits, Systems and Signal Processing, ProRisc, volume 2005, 2005. [21] R. 
Rodrigues and P. Druschel. 53(10):72–82, 2010.
Peer-to-peer systems. Commun. ACM,
A.9. BIBLIOGRAPHY
227
[22] E. K. Lua, J. Crowcroft, M. Pias, R. Sharma, and S. Lim. A survey and comparison of peer-to-peer overlay network schemes. IEEE Communications Surveys & Tutorials, 7(2):72–93, 2005.
[23] A. W. Loo. The future of peer-to-peer computing. Communications of the ACM, 46(9):56–61, 2003.
[24] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In Middleware, pages 329–350, 2001.
[25] K. Wehrle, S. Götz, and S. Rieche. Distributed hash tables. In Peer-to-Peer Systems and Applications, pages 79–93, 2005.
[26] R. Steinmetz and K. Wehrle, editors. Peer-to-Peer Systems and Applications, volume 3485 of Lecture Notes in Computer Science. Springer, 2005.
[27] A. Olteanu and G. Pierre. Towards robust and scalable peer-to-peer social networks. In Proceedings of the Fifth Workshop on Social Network Systems, SNS '12, pages 10:1–10:6, 2012.
[28] S. Seong, J. Seo, M. Nasielski, D. Sengupta, S. Hangal, S. Teh, R. Chu, B. Dodson, and M. S. Lam. PrPl: a decentralized social networking infrastructure. In Proceedings of the 1st ACM Workshop on Mobile Cloud Computing and Services: Social Networks and Beyond, MCS '10, pages 8:1–8:8, 2010.
[29] K. C. L. Lin, C. Wang, C. Chou, and L. Golubchik. SocioNet: A social-based multimedia access system for unstructured P2P networks. IEEE Transactions on Parallel and Distributed Systems, 21(7):1027–1041, 2010.
[30] J. A. Pouwelse, P. Garbacki, J. Wang, A. Bakker, J. Yang, and A. Iosup. Tribler: A social-based peer-to-peer system. In Proceedings of the 5th International Workshop on Peer-to-Peer Systems (IPTPS 2006), 2006.
[31] A. Shakimov, H. Lim, R. Cáceres, L. P. Cox, K. Li, D. Liu, and A. Varshavsky. Vis-à-Vis: Privacy-preserving online social networking via virtual individual servers. In COMSNETS, 2011.
[32] L. Aiello and G. Ruffo. Secure and flexible framework for decentralized social network services. In Proceedings of SESOC '10: Security and Social Networking Workshop, pages 594–599. IEEE Computer Society, 2010.
[33] G. Mega, A. Montresor, and G. P. Picco. Efficient dissemination in decentralized social networks. In Peer-to-Peer Computing, pages 338–347. IEEE, 2011.
[34] D. Schiöberg, F. Schneider, G. Trédan, S. Uhlig, and A. Feldmann. Revisiting content availability in distributed online social networks. CoRR, abs/1210.1394, 2012.
[35] S. Acharya and S. B. Zdonik. An efficient scheme for dynamic data replication. Technical report, 1993.
[36] R. Sharma, A. Datta, M. Dell'Amico, and P. Michiardi. An empirical study of availability in friend-to-friend storage systems. In P2P 2011, IEEE International Conference on Peer-to-Peer Computing, Kyoto, Japan, August 31–September 2, 2011.
[37] R. Gracia-Tinedo, M. Sánchez-Artigas, and P. García-López. Analysis of data availability in F2F storage systems: When correlations matter. In Peer-to-Peer Computing (P2P), 2012 IEEE 12th International Conference on, pages 225–236, 2012.
[38] R. Narendula, T. G. Papaioannou, and K. Aberer. Towards the realization of decentralized online social networks: An empirical study. In ICDCS Workshops, pages 155–162. IEEE Computer Society, 2012.
[39] R. Narendula, T. G. Papaioannou, and K. Aberer. In SocialCom/PASSAT, pages 166–175, 2012.
[40] F. Tegeler, D. Koll, and X. Fu. Gemstone: empowering decentralized social networking with high data availability. In Global Telecommunications Conference (GLOBECOM 2011), 2011 IEEE, pages 1–6, 2011.
[41] L. Gyarmati and T. Trinh. Measuring user behavior in online social networks. Network, IEEE, 24(5):26–31, 2010.
[42] F. Benevenuto, T. Rodrigues, M. Cha, and V. Almeida. Characterizing user behavior in online social networks. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, pages 49–62, 2009.
[43] A. Boutet, A. Kermarrec, E. Le Merrer, and A. Van Kempen. On the impact of users availability in OSNs. In Proceedings of the Fifth Workshop on Social Network Systems, page 4, 2012.
[44] B. Wong and S. Guha. Quasar: a probabilistic publish-subscribe system for social networks. In IPTPS, page 2, 2008.
[45] A. Datta and R. Sharma. GoDisco: selective gossip based dissemination of information in social community based overlays. In Proceedings of the 12th International Conference on Distributed Computing and Networking, ICDCN '11, pages 227–238, 2011.
[46] R. Sharma and A. Datta. GoDisco++: A gossip algorithm for information dissemination in multi-dimensional community networks. Pervasive and Mobile Computing, October 2012.
[47] U. Tandukar and J. Vassileva. Selective propagation of social data in decentralized online social network. In Advances in User Modeling, pages 213–224, 2012.
[48] L. Han, B. Nath, L. Iftode, and S. Muthukrishnan. Social butterfly: Social caches for distributed social networks. In SocialCom/PASSAT, pages 81–86, 2011.
[49] L. Han, M. Punceva, B. Nath, S. Muthukrishnan, and L. Iftode. SocialCDN: Caching techniques for distributed social networks. In P2P, pages 191–202, 2012.
[50] S. Forsyth and K. Daudjee. Update management in decentralized social networks. In 2013 IEEE 33rd International Conference on Distributed Computing Systems Workshops, pages 196–201, 2013.
[51] G. Greenwald and E. MacAskill. NSA Prism program taps in to user data of Apple, Google and others. The Guardian, June 2013.
[52] L. Cutillo, R. Molva, and T. Strufe. Safebook: A privacy-preserving online social network leveraging on real-life trust. Communications Magazine, IEEE, 47(12):94–101, 2009.
[53] O. Bodriagov and S. Buchegger. Encryption for peer-to-peer social networks. 2011.
[54] O. Bodriagov, G. Kreitz, and S. Buchegger. Access control in decentralized online social networks: Applying a policy-hiding cryptographic scheme and evaluating its performance. In Pervasive Computing and Communications Workshops (PERCOM Workshops), 2014 IEEE International Conference on, pages 622–628, 2014.
[55] J. Katz, A. Sahai, and B. Waters. Predicate encryption supporting disjunctions, polynomial equations, and inner products. In Advances in Cryptology, EUROCRYPT 2008, pages 146–162, 2008.
[56] I. Jain, M. Gorantla, and A. Saxena. An anonymous peer-to-peer based online social network. In India Conference (INDICON), 2011 Annual IEEE, pages 1–5. IEEE, 2011.
[57] D. Goldschlag, M. Reed, and P. Syverson. Onion routing. Communications of the ACM, 42(2):39–41, 1999.
[58] L. Vu, K. Aberer, S. Buchegger, and A. Datta. Enabling secure secret sharing in distributed online social networks. In Computer Security Applications Conference, ACSAC '09, pages 419–428, 2009.
[59] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In Data Engineering, ICDE 2008, IEEE 24th International Conference on, pages 506–515, 2008.
[60] Y. Fu and Y. Wang. BCE: A privacy-preserving common-friend estimation method for distributed online social networks without cryptography. In Communications and Networking in China (CHINACOM), 2012 7th International ICST Conference on, pages 212–217, 2012.
[61] B. Greschbach, G. Kreitz, and S. Buchegger. The devil is in the metadata: new privacy challenges in decentralised online social networks. In Pervasive Computing and Communications Workshops (PERCOM Workshops), 2012 IEEE International Conference on, pages 333–339, 2012.
[62] P. Maymounkov and D. Mazières. Kademlia: A peer-to-peer information system based on the XOR metric. In Revised Papers from the First International Workshop on Peer-to-Peer Systems, IPTPS '01, pages 53–65, London, UK, 2002. Springer-Verlag.
[63] R. Sharma and A. Datta. SuperNova: Super-peers based architecture for decentralized online social networks. CoRR, abs/1105.0074, 2011.
[64] FreePastry. http://www.freepastry.org/freepastry.
[65] P. Druschel. PAST: A large-scale, persistent peer-to-peer storage utility. In HotOS VIII, pages 75–80, 2001.
[66] S. Jahid, S. Nilizadeh, P. Mittal, N. Borisov, and A. Kapadia. DECENT: A decentralized architecture for enforcing privacy in online social networks. In Pervasive Computing and Communications Workshops (PERCOM Workshops), 2012 IEEE International Conference on, pages 326–332, 2012.
[67] S. Jahid, P. Mittal, and N. Borisov. EASiER: Encryption-based access control in social networks with efficient revocation. In Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security, pages 411–415, 2011.
[68] J. Bethencourt, A. Sahai, and B. Waters. Ciphertext-policy attribute-based encryption. In Security and Privacy, SP '07, IEEE Symposium on, pages 321–334, 2007.
[69] L. Aiello and G. Ruffo. LotusNet: tunable privacy for distributed online social network services. Computer Communications, 35(1):75–88, 2012.
[70] L. M. Aiello, M. Milanesio, G. Ruffo, and R. Schifanella. Tempering Kademlia with a robust identity based system. In Peer-to-Peer Computing, P2P '08, Eighth International Conference on, pages 30–39, 2008.
[71] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Rev. Mod. Phys., 74:47–97, 2002.
[72] J. Travers and S. Milgram. An experimental study of the small world problem. Sociometry, 32(4):425–443, 1969.
[73] L. A. Adamic, O. Buyukkokten, and E. Adar. A social network caught in the Web. First Monday, 8(6), 2003.
[74] L. Backstrom. Anatomy of Facebook, 2011.
[75] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45:167–256, 2003.
[76] D. J. Watts and S. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440–442, 1998.
[77] A. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[78] P. Erdős and A. Rényi. On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci., 5:17–61, 1960.
[79] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, IMC '07, pages 29–42, 2007.
[80] M. Newman. Assortative mixing in networks. Phys. Rev. Lett., 89(20):208701, 2002.
[81] M. van Steen. Graph Theory and Complex Networks: An Introduction. 2010.
[82] M. Newman. Mixing patterns in networks. Phys. Rev. E, 67:026126, 2003.
[83] R. Albert, H. Jeong, and A.-L. Barabási. Error and attack tolerance of complex networks. Nature, 406:378–382, 2000.
[84] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web. Computer Networks, 33:309–320, 2000.
[85] G. Sabidussi. The centrality index of a graph. Psychometrika, 31(4):581–603, 1966.
[86] P. V. Marsden. Egocentric and sociocentric measures of network centrality. Social Networks, 24(4):407–422, 2002.
[87] L. C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40:35–41, 1977.
[88] J. M. Anthonisse. The rush in a directed graph. Technical report, Stichting Mathematisch Centrum, 1971.
[89] M. G. Everett and S. P. Borgatti. Ego network betweenness. Social Networks, 27:31–38, 2005.
[90] A. Sutcliffe, R. Dunbar, J. Binder, and H. Arrow. Relationships and the social brain: integrating psychological and evolutionary perspectives. British Journal of Psychology, 103:149–168, 2012.
[91] R. A. Hill and R. Dunbar. Social network size in humans. Human Nature, 14:53–72, 2003.
[92] S. G. Roberts, R. I. Dunbar, T. V. Pollet, and T. Kuppens. Exploring variation in active network size: Constraints and ego characteristics. Social Networks, February 2009.
[93] M. Granovetter. The strength of weak ties. The American Journal of Sociology, 78(6):1360–1380, 1973.
[94] T. D. Wilson. Weak ties, strong ties: Network principles in Mexican migration. Human Organization, 57(4):394–403, 1998.
[95] M. S. Granovetter. Getting a Job: A Study of Contacts and Careers. University of Chicago Press, 1995.
[96] N. Lin, J. C. Vaughn, and W. M. Ensel. Social resources and occupational status attainment. Social Forces, 59(4):1163–1181, June 1981.
[97] E. Gilbert and K. Karahalios. Predicting tie strength with social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '09, pages 211–220, 2009.
[98] J. Jones, J. E. Settle, R. M. Bond, C. Fariss, C. Marlow, and J. H. Fowler. Inferring tie strength from online directed behavior. PLoS ONE, 8(1), 2013.
[99] V. Arnaboldi, M. Conti, A. Passarella, and F. Pezzoni. Analysis of ego network structure in online social networks. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pages 31–40, 2012.
[100] V. Arnaboldi, A. Guazzini, and A. Passarella. Egocentric online social networks: Analysis of key features and prediction of tie strength in Facebook. Computer Communications, 36(10):1130–1144, 2013.
[101] M. La Gala, V. Arnaboldi, M. Conti, and A. Passarella. Ego-net digger: a new way to study ego networks in online social networks. In First ACM International Workshop on Hot Topics on Interdisciplinary Social Networks Research (ACM HotSocial 2012), pages 9–16, 2012.
[102] C. Wilson, A. Sala, K. P. N. Puttaswamy, and B. Y. Zhao. Beyond social graphs: User interactions in online social networks and their implications. ACM Trans. Web, 6(4):17, 2012.
[103] S. Marti, P. Ganesan, and H. Garcia-Molina. SPROUT: P2P routing with social networks. In Current Trends in Database Technology, EDBT 2004 Workshops, pages 511–512. Springer, 2005.
[104] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong. Freenet: A distributed anonymous information storage and retrieval system. In Designing Privacy Enhancing Technologies, International Workshop on Design Issues in Anonymity and Unobservability, Berkeley, 2000, Proceedings, pages 46–66, 2000.
[105] B. C. Popescu, B. Crispo, and A. S. Tanenbaum. Safe and private data sharing with Turtle: Friends team-up and beat the system. In Security Protocols, 12th International Workshop, Cambridge, UK, April 26–28, 2004, pages 213–220, 2004.
[106] L. Han, B. Nath, L. Iftode, and S. Muthukrishnan. Social butterfly: Social caches for distributed social networks. In SocialCom/PASSAT, pages 81–86, 2011.
[107] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In Proceedings of the Sixth Annual ACM Symposium on Principles of Distributed Computing, Vancouver, British Columbia, Canada, 1987, pages 1–12. ACM, 1987.
[108] S. Y. Chan, I. X. Y. Leung, and P. Liò. Fast centrality approximation in modular networks. In Proceedings of the ACM First International Workshop on Complex Networks Meet Information & Knowledge Management, CIKM-CNIKM 2009, Hong Kong, China, November 6, 2009, pages 31–38, 2009.
[109] D. R. White and S. P. Borgatti. Betweenness centrality measures for directed graphs. Social Networks, 16(4):335–346, 1994.
[110] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, 1959.
[111] M. E. J. Newman. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Phys. Rev. E, 64:016132, 2001.
[112] I. X. Y. Leung, P. Hui, P. Liò, and J. Crowcroft. Towards real-time community detection in large networks. Physical Review E, 79(6), 2009.
[113] M. Jelasity. Gossip. In G. Di Marzo Serugendo, M.-P. Gleizes, and A. Karageorgos, editors, Self-organising Software, volume 12 of Natural Computing Series, pages 139–162. 2011.
[114] C. Wilson, B. Boe, A. Sala, K. P. N. Puttaswamy, and B. Y. Zhao. User interactions in social networks and their implications. In Proceedings of the 4th ACM European Conference on Computer Systems, EuroSys '09, pages 205–218, 2009.
[115] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008.
[116] J. McAuley and J. Leskovec. Discovering social circles in ego networks. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(1), 2014.
[117] V. Arnaboldi, A. Passarella, M. Tesconi, and D. Gazzè. Towards a characterization of egocentric networks in online social networks. In OTM Workshops, volume 7046, pages 524–533, 2011.
[118] W. X. Zhou, D. Sornette, R. A. Hill, and R. Dunbar. Discrete hierarchical organization of social group sizes. Biological Sciences, 272:439–444, 2005.
[119] C. Marlow, L. Byron, T. Lento, and I. Rosenn. Maintained relationships on Facebook, 2009.
[120] B. Gonçalves, N. Perra, and A. Vespignani. Modeling users' activity on Twitter networks: Validation of Dunbar's number. PLoS ONE, 6(8):e22656, 2011.
[121] D. J. Ketchen and C. L. Shook. The application of cluster analysis in strategic management research: an analysis and critique. Strategic Management Journal, 17(6):441–458, 1996.
[122] A. Sutcliffe, R. Dunbar, J. Binder, and H. Arrow. Relationships and the social brain: Integrating psychological and evolutionary perspectives. British Journal of Psychology, 103(2):149–168, 2012.
[123] S. A. Golder, D. M. Wilkinson, and B. A. Huberman. Rhythms of social interaction: Messaging within a massive online network. In Communities and Technologies 2007, pages 41–66. Springer, 2007.
[124] S. S. Choi, S. H. Cha, and C. Tappert. A survey of binary similarity and distance measures. Journal on Systemics, Cybernetics and Informatics, 8(1):43–48, 2010.
[125] R. Bhagwan, S. Savage, and G. M. Voelker. Understanding availability. In Peer-to-Peer Systems II, Second International Workshop, IPTPS 2003, Berkeley, CA, USA, February 21–22, 2003, Revised Papers, pages 256–267, 2003.
[126] K. Graffi. PeerfactSim.KOM: A P2P system simulator: experiences and lessons learned. In Peer-to-Peer Computing (P2P), 2011 IEEE International Conference on, pages 154–155, 2011.
[127] D. Stingl, C. Groß, J. Rückert, L. Nobach, A. Kovacevic, and R. Steinmetz. PeerfactSim.KOM: A simulation framework for peer-to-peer systems. In High Performance Computing and Simulation (HPCS), 2011 International Conference on, pages 577–584, 2011.
[128] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review, 31(4):149–160, 2001.
[129] T. S. E. Ng and H. Zhang. Predicting Internet network distance with coordinates-based approaches. In Proceedings of the Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies, volume 1, pages 170–179, 2002.
[130] W. Matthews and L. Cottrell. The PingER project: Active Internet performance monitoring for the HENP community. IEEE Communications Magazine, 38(5):130–136, 2000.
[131] Z. Yao, D. Leonard, X. Wang, and D. Loguinov. Modeling heterogeneous user churn and local resilience of unstructured P2P networks. In Proceedings of the 14th IEEE International Conference on Network Protocols, ICNP 2006, pages 32–41, 2006.
[132] S. Catanese, P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Crawling Facebook for social network analysis purposes. Technical report, 2011.
[133] A. Markopoulou, M. Gjoka, M. Kurant, and C. Butts. A walk in Facebook: Uniform sampling of users in online social networks, 2009.
[134] J. Leskovec. Ego networks Facebook dataset. http://snap.stanford.edu/data/egonets-Facebook.html.