How is that complex network complex?

Michael Small, Kevin Judd and Linjun Zhang†
School of Mathematics and Statistics, The University of Western Australia, Crawley, Western Australia, 6009. Email: [email protected]
† Also with: Department of Statistics and Finance, University of Science and Technology of China, Hefei, P.R. China, 230026

Abstract: Evidence of complex networks in real world settings abounds. Many data sets for physical and social systems display characteristics consistent with various models of complex networks, the most typical examples being scale-free and small-world networks. However, theory does not always match reality. While we see a wide range of real complex networks, simulated data most usually comes from a limited range of generative models (the Barabási-Albert model for scale-free networks, the Watts-Strogatz model for small-world networks, and the Erdős-Rényi model of a random graph are the three usual archetypes). We argue that there is much to be learnt by examining what real world data does that these algorithms do not. To do this we propose a variety of new network generation algorithms. These algorithms allow us to sample, in a statistically unbiased manner, from the family of all networks (of a given size N) consistent with a given degree distribution. Using this technique we are able to determine which distributions really are likely origins for various observed data and (equally importantly) observe when particular real world networks are atypical. Examples include the observation that many collaboration networks are not consistent with the Barabási-Albert (BA) model but are typical of the family of graphs that exhibit a power-law degree distribution, and that biological networks (protein-protein interaction and cellular metabolic processes) are scale-free (but not BA) networks with atypically large diameter.
I. INTRODUCTION
Scale-free networks abound in a wide range of settings [4], from neurological structures to the Internet and disease propagation in social networks. In many of these examples, the generative preferential attachment model of Barabási and Albert [1] has been particularly successful: new nodes are incrementally added to the network, preferentially attaching to the existing nodes with highest degree. This model naturally leads to a power-law distribution of node degree k, and the network is therefore said to be scale-free. However, preferential attachment is not the most natural mechanism for adding nodes in all settings: anatomical connections among neurones are constrained by physical proximity; many engineering networks (including the Internet, power-grids, and airline traffic [4]) are constrained by design principles (cost, utility and stability, for example). Moreover, preferential attachment scale-free networks (including BA networks) are known to exhibit peculiar biases: their assortativity is negative [3]. Indeed, in [3] an alternative algorithm (one of many) was proposed which relied on a modified form of preferential attachment (therein called altruistic attachment) to avoid some of these issues, while still generating scale-free networks. Preferential attachment provides an intuitive technique to generate scale-free networks, but it is not the only way, nor does it define what it means to be scale-free.
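The growth rule just described can be illustrated with a minimal sketch. It is not the implementation used in [1] or in this paper: the function names, the choice of a complete seed graph on m + 1 nodes, and the "stub list" trick for degree-proportional sampling are all assumptions made for illustration.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, m, seed=None):
    """Grow a simple graph by preferential attachment; return a set of edges.

    Assumption: the seed graph is complete on m + 1 nodes, so every node
    starts with degree m. Other seed graphs are equally admissible.
    """
    rng = random.Random(seed)
    edges = set()
    # Each node appears in 'stubs' once per incident link, so a uniform draw
    # from 'stubs' picks a node with probability proportional to its degree.
    stubs = []
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            edges.add((u, v))
            stubs += [u, v]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:  # m distinct existing nodes
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.add((t, new))
            stubs += [t, new]
    return edges


def degree_histogram(edges):
    """Return a Counter mapping degree k to the number of nodes of degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())
```

With m = 1 this sketch always returns a tree (n_nodes − 1 links), and for any m no node has degree below m, which illustrates two of the structural biases of the growth mechanism.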
In [2] we propose a new algorithm which allows one to randomly sample a connected graph of finite size N with a prescribed degree distribution p(k). Of course, the degree distribution of most interest, and the one we focus on in [2], is the power law

$$ p(k) = \frac{k^{-\gamma}}{\zeta(\gamma)} \tag{1} $$

where the Riemann zeta function ζ(γ) provides the necessary normalisation factor. In [2] we take (1) to define a scale-free network. That is, a scale-free network is one for which the degree histogram (that is, the particular degree distribution of the observed network) is a statistically probable realisation of (1) (see footnote 1). One of the most striking findings of [2] is that preferential attachment is actually quite unusual: most networks which are likely realisations of a power-law degree distribution are unlike preferential attachment. Our ongoing work [8] now seeks to explore this observation further. We ask: which properties commonly observed in scale-free networks are due to the power-law degree distribution alone, and which are the result of additional constraints imposed by preferential attachment?

As had already been observed in [3], preferential attachment naturally introduces specific biases. First, the nodes with highest degree will typically be the first nodes added and hence the hubs will (with probability approaching 1) be interconnected. Second, preferential attachment, as originally stated, enforces a minimum degree m > 1, something that is not inherent in (1). Third, the last nodes added will have that minimum degree and will be (with high probability) connected to the largest hubs. A natural consequence of the first of these biases is that the rich club of hubs means that the “robust yet fragile” property, widely touted as a feature of all scale-free networks (see footnote 2), is greatly accentuated. Choosing m allows one a degree of control over whether the network is a tree (m = 1) or highly cross linked.
Finally, for N < ∞ the third of these biases guarantees systematic disassortativity in the resultant networks. Again, we see that preferential attachment does not generate typical scale-free networks.

In this communication, and in our ongoing work [8], we address a related issue: for a given network, conforming to a particular degree distribution, which features of that network are typical and which are unusual? In Section II we reprise the algorithm of [8], a variant of [2] which we employ here. In Section III we start to answer this question.

Footnote 1: We note in passing that there is a technical issue with this definition which is not yet resolved: saying that the histogram conforms to p(k) is not the same thing as saying that a randomly chosen node in the network has degree k with probability p(k). That is, independence of nodes is no longer a valid assumption when one samples the entire graph. However, this is not an assumption peculiar to our work and we defer a closer examination of it for the future.

Footnote 2: To see that this is not a generic property of all scale-free networks, consider a scale-free network which is also a tree. Fig. 1 includes an example, and we have found that such things are actually fairly common [2].

II. GENERATING LIKELY SCALE-FREE NETWORKS
The algorithm proposed in [2] implements a Markov chain Monte Carlo (MCMC) algorithm to modify an initial seed network through a sequence of random moves that are chosen to yield a network which is progressively more likely to conform to the prescribed degree distribution p(k). As the algorithm progresses, the degree distribution of the network changes and becomes more typical of p(k). In [2] we demonstrate the application of this method for a variety of small networks and demonstrate, as outlined in the previous section, that preferential attachment does not tell the whole story of what one actually expects of typical scale-free networks. In [8], and here, we modify this procedure slightly. First we sample the degree distribution p(k) to generate a nominal histogram h_k and choose a network with precisely that histogram. That is, h_k is the particular histogram of our network: h_k is the number of nodes of degree k in a network of N nodes. This histogram is chosen with probability given by the multinomial distribution

$$ P(h) = N! \prod_{k=1}^{N} \frac{p(k)^{h_k}}{h_k!}. $$
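The histogram-sampling step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure of [8]: the function names are hypothetical, the power law is truncated at k = N − 1 and renormalised by the partial sum (rather than by ζ(γ)), and no check is made that the sampled histogram can be realised by a simple connected graph (for instance, that the degree sum is even).

```python
import math
import random
from collections import Counter


def power_law_pmf(gamma, k_max):
    """p(k) proportional to k**(-gamma) on k = 1..k_max.

    Assumption: truncation at k_max, so the normalising constant is the
    partial sum of the zeta series rather than zeta(gamma) itself.
    """
    weights = [k ** (-gamma) for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_histogram(n_nodes, gamma, seed=None):
    """Draw N independent degrees from p(k) and count them: h[k] nodes of degree k."""
    rng = random.Random(seed)
    k_max = n_nodes - 1  # largest degree possible in a simple graph on N nodes
    pmf = power_law_pmf(gamma, k_max)
    degrees = rng.choices(range(1, k_max + 1), weights=pmf, k=n_nodes)
    return Counter(degrees), pmf


def log_multinomial(hist, pmf, n_nodes):
    """log P(h) = log N! + sum over k of (h_k log p(k) - log h_k!)."""
    logp = math.lgamma(n_nodes + 1)
    for k, h_k in hist.items():
        logp += h_k * math.log(pmf[k - 1]) - math.lgamma(h_k + 1)
    return logp
```

Counting N independent draws from p(k) yields a histogram distributed according to the multinomial law, so no explicit multinomial sampler is needed; `log_multinomial` is included only to score how probable a given histogram is.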
Thereafter we apply an MCMC procedure similar to [2], but we carefully select from among moves that do not alter h_k. The most common and simplest such moves are link exchanges and rewiring. That is, rather than using the MCMC procedure to explore the space of all probable networks, we restrict the application of the MCMC process to networks with a particular histogram. By sacrificing the simplicity and clear theoretical foundation presented in [2], the approach we apply here is computationally faster and therefore more readily applicable to larger networks. Figure 1 depicts a typical network generated by this process which is also highly atypical of preferential attachment (in this case the network has a very large average path-length). The networks in Fig. 1 are deliberately small, as an aid to visualisation. In Fig. 2 we build a large number of moderately large networks in an effort to sample the expected distribution of various properties of these networks. We compute mean shortest path length, assortativity and local clustering (all briefly described in the figure caption) for preferential attachment (m = 2) and uniformly sampled networks. For m ≥ 2, results of this comparison indicate that preferential attachment networks have significantly smaller diameter than one would expect from random (uniform) sampling: the mean of the distribution is atypically small, but so too is the variance (upper panel of Fig. 2). This can be understood as a consequence of the hubs of the network always being interconnected in preferential attachment networks. Assortativity exhibits a slightly narrower distribution and (surprisingly) a smaller (negative) bias in preferential attachment. This may be in part due to the size N of the network playing a significant and more complex role in permissible network connectivity patterns. Finally, local clustering is larger in preferential attachment; this is due (for m ≥ 2) to the disproportionate probability of a new node gaining connections to multiple hubs, which are themselves interconnected [7], [6]. For m = 1 (histograms omitted) we do see something slightly different. First, preferential attachment always produces a tree, and hence clustering is always exactly 0 for these networks. Second, because the networks are trees, and finite, the mean shortest path length for preferential attachment networks is somewhat biased: it has a long tail, similar

Fig. 1. Preferential attachment is not the end: a representative sample of typical scale-free networks which are unlikely to arise from a preferential attachment growth method. These networks are small (only 300 nodes each) so as to aid visualisation. From top-left to bottom-right these networks have very small assortativity (a ≈ −0.08; note the clusters of low degree nodes connected to single, isolated, hubs); very large mean path length (d ≈ 15.1; the network is very “skinny”); significant local clustering (g ≈ 0.03; note the loops, both small and large); and extreme disassortativity (a ≈ −0.4; the network is an extreme example of a tree). While these networks have been generated from the algorithm outlined in the text, they are selected as exemplary peculiar examples. Nonetheless, the likelihood of obtaining networks such as these from preferential attachment is almost zero; Fig. 2 confirms this assertion.
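The histogram-preserving link exchange used in this section can be sketched as follows. This is a minimal illustration rather than the algorithm of [8]: the function names are hypothetical, every admissible exchange is accepted, and the connectivity check (motivated by the statement that [2] samples connected graphs) is an assumption about how the constraint might be enforced.

```python
import random
from collections import deque


def is_connected(adj):
    """Breadth-first search from an arbitrary node; adj maps node -> set of neighbours."""
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)


def _swap(adj, a, b, c, d):
    """Replace links a-b and c-d by a-d and c-b (every degree is unchanged)."""
    adj[a].remove(b); adj[b].remove(a)
    adj[c].remove(d); adj[d].remove(c)
    adj[a].add(d); adj[d].add(a)
    adj[c].add(b); adj[b].add(c)


def link_exchange(adj, n_moves, seed=None):
    """Attempt n_moves link exchanges in place; return the number accepted."""
    rng = random.Random(seed)
    edges = sorted((u, v) for u in adj for v in adj[u] if u < v)
    accepted = 0
    for _ in range(n_moves):
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second link
        # Reject exchanges that would create a self-loop or a duplicate link.
        if len({a, b, c, d}) < 4 or d in adj[a] or b in adj[c]:
            continue
        _swap(adj, a, b, c, d)
        if is_connected(adj):
            edges[i], edges[j] = (a, d), (c, b)
            accepted += 1
        else:
            _swap(adj, a, d, c, b)  # undo: restores links a-b and c-d
    return accepted
```

Because each exchange removes one link and adds one link at every node involved, the degree of every node, and hence the histogram h_k, is left unchanged. Checking connectivity after every move is simple but costly; it is chosen here for clarity rather than speed.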
[Figure 2: three histogram panels, (a) mean shortest path, (b) assortativity, (c) (local) clustering; preferential attachment (red, dot-dashed) vs. uniformly sampled (blue, solid).]
Fig. 2. Observed distribution of typical properties of scale-free networks. In the three panels we report: (a) mean path-length (the average distance between two randomly chosen nodes in a network); (b) assortativity (degree-degree correlation among connected nodes); and (c) local clustering (the average fraction of triangles among the neighbours of a random node). Histograms represent the distribution estimated from 10^4 networks, each with N = 10^3 nodes. The red dashed line is for preferential attachment (m = 2; results for other values are similar but omitted) and the solid blue line is for the uniform sampling algorithm described herein. In each panel the two histograms are binned independently and the bin populations are normalised by the bin width to obtain a numerical estimate of probability density. Note that when m = 1 preferential attachment will always generate trees, and hence local clustering is exactly 0.
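The three quantities named in the caption of Fig. 2 can be computed directly from an adjacency structure. The following is an illustrative sketch, not the code used to produce the figure. The function names are hypothetical, the graph is assumed to be connected and simple, assortativity is taken to be the Pearson correlation of the degrees at the two ends of each link, and nodes of degree below 2 are assumed to contribute zero to the clustering average.

```python
from collections import deque
from itertools import combinations


def mean_shortest_path(adj):
    """Average hop distance over all ordered pairs of distinct nodes (BFS from each node)."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def assortativity(adj):
    """Pearson correlation between the degrees at either end of a link."""
    xs, ys = [], []
    for u in adj:
        for v in adj[u]:  # each link is visited in both directions
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    n = len(xs)
    mean = sum(xs) / n  # xs and ys contain the same values, so they share a mean
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var > 0 else float("nan")  # undefined for regular graphs


def local_clustering(adj):
    """Mean over nodes of the fraction of neighbour pairs that are themselves linked."""
    total = 0.0
    for u in adj:
        k = len(adj[u])
        if k < 2:
            continue  # assumption: such nodes contribute zero
        links = sum(1 for v, w in combinations(adj[u], 2) if w in adj[v])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)
```

As a sanity check, a star graph is perfectly disassortative (the correlation is −1) and any tree has local clustering exactly 0, in line with the remark on m = 1 in the caption.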
to the tail observed for uniformly sampled networks, despite a significantly lower mean. This is due to the fact that with finite networks it is possible (i.e. with non-vanishing probability) that the network initially forms chains of low degree nodes. Hence, what we find confirms (with a few extra surprises) our earlier motivation: preferential attachment networks and uniformly sampled networks are quite different from one another. In the next section we take these algorithms and apply them to real world networks, in an attempt to answer the question in the title of this paper.

III. THE COMPLEXITY OF COMPLEX NETWORKS
In the previous section we showed that uniformly sampling the space of all scale-free networks with a given degree distribution gave a different distribution of networks from what would be expected of preferential attachment. In this
Fig. 3. In each panel we compute one of the three quantities from Fig. 2: (a) mean shortest path length; (b) assortativity; and (c) local clustering. The box plot on the left (red/yellow) is for preferential attachment, while the one on the right of each pair (blue) is for uniform sampling. The darker boxes depict the mean and the 25th and 75th percentiles. The lighter box is the total range (i.e. minimum to maximum). The solid dashed line is the same quantity computed for the original real world data. The six real networks we study here are: (1) the CS PhD collaboration network; (2) the Erdős collaboration network; (3) a symmetrized snapshot of the structure of the Internet at the level of autonomous systems; (4) the US airport connection network; (5) the S. cerevisiae protein-protein interaction network; and (6) a metabolic network. In all cases the range of results obtained with uniform sampling is far greater and, with one exception (path length of the Erdős collaboration network), provides better agreement with the experimental data.
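The comparison summarised in Fig. 3 amounts to locating one observed value within an ensemble of simulated values. The sketch below is illustrative only: the function names are hypothetical, the choice of quartile estimator is left to the Python standard library default, and any numbers passed to these functions in an example would be placeholders rather than results from this paper.

```python
import statistics


def summarise(values):
    """Box-plot summary of an ensemble: range, quartiles and mean."""
    q25, _, q75 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q25": q25,
        "mean": statistics.fmean(values),
        "q75": q75,
        "max": max(values),
    }


def locate(observed, values):
    """Return (fraction of ensemble values <= observed, observed within ensemble range?)."""
    below = sum(v <= observed for v in values)
    in_range = min(values) <= observed <= max(values)
    return below / len(values), in_range
```

An observed statistic that falls outside the ensemble range (second return value False), or in an extreme tail of it, marks the real network as atypical of that generative model with respect to that statistic.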
section we consider real world networks and attempt to determine which approach gives network properties more typical of the observed data. In other words, which networks are consistent with a growth process and which are not? We consider an assortment of typical complex networks which all exhibit a nominally scale-free distribution of node degree. For conciseness, details of these networks are omitted from the current paper. Full details may be found in [8]. The networks cover a range of social systems (CS PhD, Erdős), engineered technological networks (Internet and US Air) and biological systems (metabolic and protein). Figure 3 summarises our results. The most obvious conclusion is that uniform sampling does a superior job of covering a range of structural values including those consistent with the original data. The Erdős collaboration network is an exception. However, in this case the network is specifically defined to have one super-hub (see footnote 3): Erdős himself. This is also the only network which is not disassortative, suggesting that, while scale-free, this network is rather atypical. For the remainder of the data it is notable that biological networks typically have larger clustering and also larger average path-length than the scale-free models. Typically, (dis-)assortativity is overestimated by the models. For preferential attachment we have already discussed the reasons behind this; for the uniform sampling scheme the reasons are more subtle, but already noted from Fig. 2. Conversely, clustering in real social networks is smaller than expected according to either preferential attachment or uniform sampling.

IV. CONCLUSION
We have discussed generalisations [8] to an algorithm introduced elsewhere [2] for randomly sampling scale-free networks in an unbiased manner. The algorithm requires one to specify the desired degree distribution (we focus here on scale-free networks) and a network size N. This leads naturally to a comparison with preferential attachment, which we have carried out in greater depth in our ongoing work [8] and summarise here. The main conclusion of this comparison is that preferential attachment is not typical. Scale-free networks exhibit a much wider range of properties and, in particular, a range of robustness and variable network diameter (mean shortest path length). Each of those properties can be understood from first principles when one examines how preferential attachment works: growing a network with an attachment mechanism leads to a hub of interconnected hubs, and a series of low-degree leaf nodes connected into these hubs. Defining scale-free networks only by the degree distribution, rather than this growth mechanism, leads to a far wider range of behaviours.

Our comparison of uniformly sampled scale-free networks to numerical simulations of the preferential attachment process did uncover one surprise. For small (finite) networks, uniform sampling increases disassortativity. This is the opposite of what we achieved in [3], suggesting that the altruistic attachment algorithm of [3] generates rather atypical networks.

This brings us to the central question of this communication: where does the complexity of complex networks come from? For preferential attachment, it is the inherent strong interconnection of hubs and peripheral leaves. We see this from the results of Fig. 2 and have confirmed this with the application of extensive modification procedures in [8]. Essentially, we examine the effect on these probability distributions of adding or removing interconnections among hubs and peripheral leaves. We find (in [8]) that this process allows us to modify a preferential attachment network to something like a uniformly sampled one, or vice versa.

We also ask the same question of real data networks. The Erdős network we find is atypical because of its single primary hub (Erdős), something which is absent in the computationally generated networks. For biological networks we find a particularly large average path length. While these networks are scale-free, they have additional features which are atypical of scale-free networks in general. These long average inter-node paths are surely signatures of biological function: in the metabolic network this is clear; it is also reasonable for the protein-protein interaction network, as it is a signature of a highly clustered network. Finally, the engineering networks: the Internet and the US Air network. In these cases the mean path lengths are typical of scale-free networks, or perhaps even smaller than average (particularly for the air transportation network). The deliberately designed nature of these networks is evident: it makes no sense to build an air transportation network which involves more flight transfers than necessary. Conversely, these networks are less clustered. This too is natural. Clustering in these networks represents redundancy. Whereas biological systems have a high level of redundancy, engineered systems are deliberately more efficient. This is true, in particular, of the Internet. Although the Internet was originally designed to possess a very specific robustness [5], its development has potentially made it less optimal. A secondary factor in the complexity of these transportation networks is that they are essentially embedded in a Euclidean space. This assumption is missing from the random networks. Of course, a modification of our algorithms to include the constraints of geographical embedding can be implemented and is worth examining in more detail.

Footnote 3: One hub to rule them all.

ACKNOWLEDGEMENT

MS is supported by an Australian Research Council Future Fellowship (FT110100896). LZ received travel support from the USTC-UWA collaborative research training programme.

REFERENCES

[1] A. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
[2] K. Judd, M. Small, and T. Stemler. What exactly are the properties of scale-free networks? Europhys Lett, 103:58004, 2013. arXiv: 1305.7296.
[3] P. Li, J. Zhang, and M. Small. Emergence of scaling and assortative mixing through altruism. Physica A, 390:2192–2197, 2011.
[4] M. Newman. Networks: An Introduction. Oxford University Press, 2010.
[5] M. Small. Information technology and the Internet: The kernel. McGraw-Hill, 2007.
[6] X.-K. Xu, J. Zhang, P. Li, and M. Small. Changing motif distributions in complex networks by manipulating rich-club connections. Physica A, 390:4621–4626, 2011.
[7] X.-K. Xu, J. Zhang, and M. Small. Rich club connectivity dominates assortativity and transitivity of complex networks. Physical Review E, 82:046117, 2010.
[8] L. Zhang, M. Small, and K. Judd. Exactly scale-free scale-free networks. arXiv, 1309.0961v2, 2013.