Large Graph Models: A Review Georgios Drakopoulosa,∗, Stavros Kontopoulosb, Christos Makrisb , Vasileios Megalooikonomoua
arXiv:1601.06444v1 [cs.SI] 24 Jan 2016
a Multidimensional
Data Analysis and Knowledge Management Lab Computer Engineering and Informatics Department University of Patras, Hellas {drakop, vasilis}@ceid.upatras.gr b Graphics, Multimedia, and GIS Lab Computer Engineering and Informatics Department University of Patras, Hellas {kontopou, makri}@ceid.upatras.gr
Abstract Large graphs can be found in a wide array of scientific fields ranging from sociology and biology to scientometrics and computer science. Their analysis is by no means a trivial task due to their sheer size and complex structure. Such structure encompasses features so diverse as diameter shrinking, power law degree distribution and self similarity, edge interdependence, and communities. When the adjacency matrix of a graph is considered, then new, spectral properties arise such as primary eigenvalue component decay function, eigenvalue decay function, eigenvalue sign alternation around zero, and spectral gap. Graph mining is the scientific field which attempts to extract information and knowledge from graphs through their structural and spectral properties. Graph modeling is the associated field of generating synthetic graphs with properties similar to those of real graphs in order to simulate the latter. Such simulations may be desirable because of privacy concerns, cost, or lack of access to real data. Pivotal to simulation are low- and high-level software packages offering graph analysis and visualization capabilities. This survey outlines the most important structural and spectral graph properties, a considerable number of graph models, as well the most common graph mining and graph learning tools. ∗ Corresponding
author: University of Patras, Patras 26500 Hellas,
[email protected]
Preprint submitted to Journal of Discrete Algorithms
January 26, 2016
Keywords: Graph mining, Graph modeling, Large graphs, Graph spectrum, Static graphs, Dynamic graphs, Self similarity, Scale free graphs, Degree distribution, Aiello models, Kronecker models, Albert-Barab´ asi model, Adjacency matrix
1. Introduction Graph theory has recently captured the interest of a considerable portion of computer and social scientists alike, in a research wave that appears to be aiming at the development of a mathematical theory of social interaction, at least in the way the latter is manifested in the social networks. This can be attributed to a number of reasons. First and foremost, research on ad hoc and sensor networks during the last decade has generated strong interest in graphs as a way to formulate time dependent network topology. Moreover, social media expansion and their adaptation as the primary means of opinion exchange in many societies since the beginning of current decade is the focus of modern sociological research. Also, idea propagation through either aforementioned social media or regular social interaction has been also recently studied1 . Additionally, structure and pattern discovery in graphs, especially in evolving ones, is an open and widely investigated problem in machine learning community. Finally, opinion mining relies heavily on both real and synthetic graphs. Real social graph data are now beginning to reach the necessary size for reliable conclusions to be drawn. The field of graph modeling plays a central part in the above research direction. The primary motive for this report has been to compile a number of graph models, namely those in [30][29][1][28]. Additionally, the most prominent of these models will be outlined. Finally, a taxonomy along with criteria for 1 One
famous example of massive social media collaboration took place during the events of
Arab Spring in Egypt which eventually led to the deposition of President Mubarak. Despite Egyptian government efforts to disable a large segment of local Internet, Egyptian opposition leaders successfully managed to arrange a series of massive demonstrations through a number of modern social media including Facebook and Twitter.
2
selecting graph models will be provided. This work is organized as follows. Sections 2, 2.2, and 2.3 introduce the scale-free graphs and their spectral and structural properties respectively. In section 2.4 the driving reasons behind graph modeling as well as its principles are outlined. Sections 3 to 7 present a number of graph models. Where appropriate, important variants are also mentioned. Table 5 contains the graph models included in this survey. Section 8 outlines software for graph generation and manipulation. Finally, section 9 summarizes this report and enumerates possible future research directions. Throughout this survey the following symbols are used. Their meaning is listed in table 1.
2. Scale-free graphs 2.1. Definition Mathematically a graph G = (V, E) is an ordered pair of vertices2 which may or may not be pairwise connected by edges. V is the set of vertices of a given graph while E is the set of edges expressing the connectivity pattern. Between any given pair of vertices there is at most one edge, therefore it is totally legitimate for a vertex not to be connected with any other vertex. Often one is interested in patterns and structural regularities within E, which can be the mere edge existence between a given vertex pair, or the discovery of connected components and complete subgraphs, or the check whether a partition of V satisfies certain conditions or constraints. Historically Euler was the first to establish in 1735 a graph pattern seeking algorithm in his seminal paper regarding the K¨ onigsberger Br¨ uckenproblem. A graph is well suited for representing relations through patterns of vertices or edges. Graph mining is the associated pattern discovery field. When either E 2 Literally
peaks in Latin.
3
or V or both vary with time, then the graph is termed time evolving or dynamic. Otherwise, the graph is called static. The important class of scale-free graphs can be defined in two equivalent ways [23][15]. The first is based on the vertex degrees and the second relies on the notion of graph growth. The vertex degrees are distributed according to a power function. Specifically, the number g (k) of vertices whose degree equals k is g (k) = α0 k −γ0
α0 > 0, γ0 ≥ 1
(1)
where α0 and γ0 are fixed constants. For most real large graphs γ0 lies in [2, 3]. α0 is a constant depending on the given graph. Alternatively, if f (n) is the graph growth function in terms of vertices indexed by the discrete time variable n, then f (β0 n) = f˜ (β0 ) f (n)
(2)
where β0 is an arbitrary constant and f˜ (s) is a fixed function of the continuous variable s. Note that [1] and [17] examine an alternative degree distribution, called the power-cutoff law, which is described by the equation g (k) = α0 e−β0 k k −γ0
α0 > 0, β0 > 0, γ0 ≥ 1
(3)
This alternative form will not be further examined in this survey. 2.2. Spectral properties This section illustrates certain important aspects regarding the properties of the graph adjacency matrix spectrum. Recall that any graph can be represented in at least five different matrices, namely the adjacency matrix [30], the incidence matrix [2], and the normalized Laplace matrix [8]. Of these possibilities, the adjacency matrix is by far the most common choice because of its simplicity and the rich properties of its spectrum as defined in equation (7). For completeness, the definitions of the three matrices will be given. However, only the properties
4
of the adjacency matrix will be outlined. Table 2 lists the scale free graph spectral properties. The adjacency matrix A definitions of an undirected graph is defined as 1 (vi , vj ) ∈ E △ |V |×|V | ∈ {0, 1} (4) A[i, j] = 0 (vi , vj ) ∈ /E
The definition for a directed graph is analogous. The main difference is that the
adjacency matrix of an undirected graph is symmetric and, hence, its spectrum is real. When the underlying graph is weighted, namely there is a function h : E → R mapping edges to real values, then the values of h appear at the appropriate matrix entries. The incidence matrices Gd and Gu in the directed and undirected cases respectively are defined as 1 vi is the head of edge j △ |V |×|E| Gd [i, j] = −1 vi is the tail of edge j ∈ {0, ±1} 0 otherwise 1 vi is incident to edge j △ |V |×|E| ∈ {0, 1} Gu [i, j] = 0 otherwise
(5)
Note now that index j runs over E instead of V as in the adjacency matrix. Gd and Gu codify the relations between vertices and edges, while the adjacency matrix captures the relations between vertices. The normalized Laplace matrix L is derived from the adjacency matrix A as follows △
L = I − diag (A)
−1
A
(6)
When examining the graph adjacency matrix, of primary interest are its eigenpairs n
o λk , gk | Agk = λk gk ,
1 ≤ k ≤ |V |
(7)
and particularly its spectrum △
λ = {λk },
λ1 > |λ2 | ≥ . . . ≥ λ|V | 5
(8)
The convention that λ is ranked in descending absolute value order will be assumed throughout this survey. Recently, considerable research focuses on λ, as it can reveal significant graph information without resorting to extensive graph structure examination. The primary eigenpair of a graph is defined as △ π1 = λ1 , g1
(9)
and it contains information regarding information flow along the graph. The primary eigenvector components are indicative, subject to weighing by the primary eigenvalue, of the each vertex centrality or value within the graph structure. According to the Perron-Frobenius theorem [44][29], λ1 is always positive. λ2 is a measure of graph capacity, namely how quickly or easily can an idea or a meme can spread through a social network. A meme was defined by Dawkins as the cultural equivalent of the gene and it is considered by some authors the basic unit of cultural information, although it is in practice difficult to isolate [11]. Similar definitions exist for messages and packets in computer networks and for viruses in biological ones. λ|V | always equals zero. If there are µ0 zero eigenvalues other than λ|V | , then the corresponding graph consists of exactly µ0 connected components. Consequently, if only λ|V | equals zero, then the corresponding graph is connected. Undirected graphs have real spectra as their adjacency matrices are symmetric. Therefore, computations are simplified and their intuitive interpretation is easier. Permutations of rows or columns do not have any effect on λ, implying that isomorphic graphs have the same spectrum. The inverse is not true, as certain graphs which are not isomorphic have been shown to have the same spectrum [32]. Such graphs are called isospectral. Two graphs are defined to be isomorphic when there is a bijection between their vertex sets. Graph eigenvalues indicate also whether there is a set of vertices S ⊂ V , |V | ≥ 2|S| acting as a connection hub for the rest of the graph. More formally,
6
the Cheeger number C defined as |(vi , vj ) ∈ E : (vi ∈ S) ∧ (vj ∈ V \ S)| △ C = min |S|
(10)
is an indicator of how easy it is to find a relatively large subgraph that has many connections to the remaining graph. The connection with the graph spectrum comes through the Cheeger inequality [35] which states that p λ1 − λ2 ≤ C ≤ 2λ1 (λ1 − λ2 ) 2
(11)
△
The quantity ∆ = λ1 − λ2 is termed the spectral gap and it can be interpreted as follows: A large ∆ indicates the existence of a relatively small subset S which is well connected with the rest of the graph while a small ∆ indicates that no matter how large S becomes, it remains poorly connected with the rest of the graph through a few edges. In other words, when ∆ is large, the “central” part of the graph can be located and analyzed with a fairly good chance that any local results holding in S will also be valid for V . Moreover, a large ∆ indicates the existence of short or quick paths, as most of vertex pairs can efficiently connect through S. The smaller in absolute value eigenvalues tend to alternate in sign around zero [35]. This has been determined experimentally and as yet it is not known how many are actually the sign alternating eigenvalue pairs. At any rate, it is an indication that leading eigenvalues are suitable for low rank adjacency matrix representation. The quantities △
E[S] =
X
eλk
(12)
sinh (λk )
(13)
vk ∈S △
E ′ [S] =
X
vk ∈S △
N [S] =
X
vk ∈S
1 1 − λk
,
max {|λk |} < 1 k
(14)
which depend heavily on the adjacency matrix spectrum are frequently employed to determine the importance of a vertex subset S within the graph. They are called the Estrada index (equation 12), the odd length Estrada index (13), 7
and the Neuman index (equation 14) respectively. All of them can be also employed as centrality measures for single vertices when S is a singleton. The above definitions can be naturally extended to global graph metrics in graph comparison or approximation scenaria as △
E=
|V | X
eλk ≈
p X
eλk
k=1
k=1 |V |
△
E′ =
sinh (λk ) ≈
△
|V | X
p X 1 1 ≈ , 1 − λk 1 − λk
k=1
sinh (λk )
k=1
k=1
N=
p X
X
k=1
max {|λk |} < 1 k
where the approximation which is based on the p largest absolute value eigenvalues is justified in light of the eigenvalue sign alternation and by the eigenvalue absolute value descrease according to a power function [35]. The selection of p is of no concern in this report, although graph centrality measure approximation is a very significant research area. Finally, centrality measures have been recently used in visualizing social networks [10]. 2.3. Structural properties Despite the linear-algebraic light shed upon graphs, they remain mostly combinatorial objects. As such, their structure is significant. Real large scale graphs across a variety of scientific fields ranging from biology and botanology to social sciences and computer science appear to share a common structure rich with properties [30][29][32]. The most important structural properties are listed in the following paragraphs and are also summarized in table 3. The primary observation is that vertex degree ranking decays according to a Zipf function. The degree of a vertex vk , denoted by deg (vk ), is the number of vertices whose one end lies in vk . As a Zipf function decays much slower than an exponential one, there are relatively many vertices with a large number of neighbors, which intuitively serve as hubs within the graph. The number of
8
vertices whose degree equals an integer k is |{v : deg (v) = k}| = α0 k −γ0 ,
γ0 ≥ 1, α0 > 0
(15)
It is conjectured that a given Zipf exponent γ0 characterizes an entire subclass within the class of Kronecker graphs [30], but as yet it is unknown whether additional information is required in order to specify a graph subclass. α0 is a normalization constant which can be used to count, depending on its value, either the actual number or the relative appearance frequency of vertices of a given degree k. The adjacency matrix eigenvalues in absolute value ranking is also a Zipf function and the same is true for the g1 components, which is termed the primary eigenvector. The number of triangles a given vertex participates to is also a Zipf function. A triangle is defined as any triplet of vertices where each one is directly connected with an edge to the other two. In graph theoretic terms it is a complete graph of order three. Its significance is more evident in social graphs, as a friend of any given person is especially likely to also know that person’s other friends [45]. Community formation is another main trait of real graphs. Using the number of triangles each vertex participates to, clustering coefficients can be constructed to indicate how well connected a vertex is to its neighbors and how compact the graph is locally. Moreover, similar communities tend to further form larger communities, where each original community remains distinguished. This is especially apparent to social graphs, where an automobile and a motorcycle community can be grouped in a larger community, as they both are quite distinct from a hiking community. Nonetheless, both original communities remain discernible from each other. This hierarchical structure allows multilevel graph mining which is in some ways reminiscent of the multiresolution wavelet analysis in signal processing and of the hierarchical clustering in data mining. Self similarity in scale free graphs derive almost immediately from the scaling property. In fact, [30] suggests that self similarity is a major tool in achieving 9
scaling as the graph progressively connects to some transformed copies of itself. Furthermore, self similarity is known to be related to fractal dimensions [38] and, therefore, certain graphs contain fractals or are fractals themselves. Real large graphs become denser over time [34]. Specifically the graph density measures [29][30] △
ρ0 =
|E| |V |
and
ρ′0 = △
ln |E| ln |V |
(16)
increase in each time step. In fact, ρ0 increases according to a power function and, consequently, ρ′0 increases linearly, implying that the graph becomes denser with time. The above is related to the fact that real world graphs diameters and effective diameters shorten over time, leading thus to a more compact and robust structure. In social sciences this is termed the small world phenomenon, which is a more general case of the six degrees of separation phenomenon observed in social sciences [47]. The graph diameter is defined as the maximum shortest path length whereas the effective diameter refers to the maximum shortest path length required for any vertex to reach a fraction of the total graph vertices. The rationale behind the effective diameter definition is that the diameter is prone to outliers, resulting frequently in unnaturally high results for compact graphs. Finally, it should be noted that in a scale free graph edges the do not exist independently of each other. Although the way edges interact, mainly through incidence to the same vertex or through common participation to shortest paths and circles, may be difficult to capture, it plays a crucial role with the self similarity mechanism to the emergence of power laws. If complete edge independence were to apply, exponential laws would instead hold as the Erd¨ os-R´enyi model demonstrates. 2.4. Graph models Graph models are, as their name suggests, mathematical rules of various kinds which generate either a single graph, termed a graph instance or a graph 10
snapshot, or an entire graph sequence. In contrast to the real graphs, those created by a model are called synthetic graphs. Synthetic graph possess, depending on the underlying model, a number of the real graph characteristics outlined earlier. The reasons for employing synthetic graphs within an engineering context are many. First and foremost, real graphs may grow too slow for decisions based on their structure to be taken - this is for instance the case with some actual computer networks. Additionally, real graphs cannot be easily accessed or even duplicated in computer memory, for instance biological or power networks. Moreover, real graphs can actually be proprietary data. Telephone network connections by service providers in certain countries fall into this category. Recently, legitimate privacy concerns in social networks add momentum to the need for synthetic data. Finally, when insight of how a network has eventually evolved into a given state is required but network history reconstruction is prohibitively expensive or simply impossible, then synthetic data are to be employed. Synthetic graphs have a number of benefits and applications [28][30]. They are mathematically parsimonious, as a rather large combinatorial structure can be interpreted in terms of a relatively few parameters. Moreover, they can serve as a null model in hypothesis testing, allowing non trivial results regarding real world data to be easily assessed in terms of statistical significance and generality. Synthetic graphs can be used in time evolving scenaria, allowing thus extrapolation in time of very large graphs and meaningful predictions. At the same time they offer storage efficiency, as only a small number of parameters needs to be stored instead of the final graph. This greatly facilitates graph evolution study, as intermediate graph snapshots can be stored and retrieved when desired. On the other hand, graph model users and designers have to be careful, as with any other simulation-generated object, as to interpret their results within a statistical context. This is especially true as large, real graph properties have not been fully cataloged. Moreover, not every model is suitable for every scenario. Also, heuristic analysis should be carefully employed as certain graph 11
segments may not adhere to specific laws, even though the graph as a whole does. Additionally, even if a specific model is appropriate, parameter tuning may still be required in order for the full expressive model power to be applied to a given situation. Finally, simulation may be time consuming, as a considerable number of graph model runs may be required in order for meaningful results to be obtained. random graph generator can be classified according to certain basic criteria, which are summarized in table 4 and outlined in the remainder of this section. A static model run results in a single graph instance, whereas a dynamic model initializes a graph and control its evolution over time, often in discrete time steps as the graphs are themselves discrete in nature. Graph instances from an evolving sequence can be studied as if they were isolated instances. Thus, in this sense, static models form a proper subset of the dynamic ones. In undirected graphs, the existence of edge (u, v) always implies the existence of (v, u) or, alternatively, the graph semantic rules make no distinction between the edge endpoints. In a physical sense, an edge represents a connection which can be crossed both ways. As a result, the adjacency matrix of an undirected graph is symmetric by definition and, consequently, its spectrum is real. On the contrary, in directed graphs existence of (u, v) and (v, u) must be tested separately. The adjacency matrix spectrum of a directed graph is in general complex with the eigenvalues forming conjugate pairs. Directed graphs are a more general case than the undirected ones but are also more difficult to generate. Most often models create either directed or undirected graphs and conversion of a synthetic directed graph to undirected or vice versa may actually yield erroneous results. A deterministic model generates the same graph instance or the same graph sequence each and every time it is invoked. In contrast, a probabilistic model is based on random input in order to create more realistic graphs or to decrease computation time and, therefore, the graph structure, at least partially, is a random variable. Perhaps the most well known probabilistic graph model is the Gn,p , where there are n vertices and each of the possible n2 edges exist 12
independently with probability p. Notice that a probabilistic graph need not be a static one; on the contrary, probabilistic techniques are mostly needed in dynamic models. Naturally a different kind of analysis is required in the probabilistic case and almost always the model runs in some equivalent of batch mode generating many graph instances, depending on the underlying ergodic assumptions. Top-down models at least one real graph property chosen in advance and then a graph satisfying this set of properties is constructed. Bottom-up models have the complementary guiding principle that a graph is carefully constructed in a way that most or all of the known real graph properties are satisfied to a varying extent. Bottom-up models are usually more flexible and easier to design and analyze, however it should be tested whether a property of interest is actually satisfied in a graph constructed by such a model. Top-down models generate graphs which are guaranteed to possess certain properties, although usually this is done at the expense of other real graph properties resulting thus in a less realistically looking graph. A typical top-down model for instance is the deterministic Kronecker model variant where the vertices are guaranteed to be distributed according to a Zipf function at the expense of an artificially looking adjacency matrix spectrum, whereas its probabilistic variant is a true bottom-up model as the vertex degrees and the adjacency matrix spectrum are both close to a Zipf distribution with some outliers. Structural models work by manipulating the combinatorial graph structure, mostly by adding or deleting vertices or edges. On the contrary, spectral models operate indirectly on the graph by adjusting its adjacency matrix or, less frequently, its spectrum. The Gn,p model is the prime example of a structural graph model while both the deterministic and the probabilistic Kronecker model variants follow the spectral design paradigm. Probabilistic graph evolution models represent the next logical step in large scale-free graph modeling. Typically they are considered more flexible when compared to the deterministic evolution model class, since under a probabilistic model it is in principle easier for a graph with certain desired properties, either 13
structural or spectral, to be generated. In practice though, it may be hard for the necessary model parameters to be determined. Also probabilistic evolution models are capable of generating more graphs given the same parameter size with deterministic models, as the former produce either a graph distribution or a graph whose structure is at least partially stochastic in nature. Thus, a number of graphs can be generated by obtaining realizations of either the distribution or the graph stochastic structure. In light of the above, a graph or adjacency matrix sequence G[n] generated by a probabilistic graph evolution model can have three interpretations. G[n] can describe the stochastic characteristics of a single time evolving graph. In this case, only one instance of each G[n] is generated and yields an instance of the time evolving graph at time step n. This is the probabilistic equivalent of the single time evolving growth case. G[n] can also describe the stochastic characteristics of a single evolving graph sequence. In this case, only one instance of each G[n] is generated. This is the probabilistic counterpart of the deterministic graph sequence generation case. Finally, G[n] can describe the probability matrix of multiple evolving graph sequences, each generating an arbitrary number of graphs. In this case, many instances of G[n] can be generated. There is no deterministic graph evolution model interpretation for this case, the reason being that each deterministic G[n] is a fixed matrix, which most often cannot be subject to further manipulation without compromising some or all of its desired characteristics. Table 5 summarizes the models presented in this survey.
3. Erd¨ os-R´ enyi model Historically the first working random graph model [13] producing static graphs is the Erd¨ os-R´enyi model. There are two variants of this model, which are essentially equivalent and can be used interchangeably.
14
3.1. Gn,p variant The Gn,p model consists of selecting independently with equal probability p edges out of the n2 possible ones. Thus, the expected number of edges e under the Gn,p model is
n e = p 2
(17)
An interesting feature is that the number of triangles in the Gn,p model can be estimated given the model parameters. Using Janson inequality, the probability that under the Gn,p a given vertex v belongs to a triangle, in other words the probability that property Tv is valid, can be shown to satisfy the relationship [6] lim
|V |→+∞
where
prob {Tv } = e−γ
|V | − 1 3 ln γ = ln |V | − p 2
(18)
(19)
The Gn,p model generates static graphs, which may be undesired in certain cases. However, its worst drawback is that the expected degree follows an exponential law instead of a power law, rendering this Gn,p unsuitable for modeling real, large scale graphs such as the Web graph. Algorithm 1 Gn,p model. Require: Number of vertices n and edge probability p. Ensure: G is a random graph. 1:
Read n and p.
2:
G ← K¯n
3:
for all v ∈ G do
4: 5: 6:
for all u ∈ G \ {v} do Select (u, v) independently with probability p. end for
7:
end for
8:
return G
15
3.2. Gn,m variant The Gn,m selects among all possible graphs with n vertices and m edges one with uniform probability. Gn,m is described in algorithm 2. Algorithm 2 Gn,m model. Require: Number of vertices n and edges m. Ensure: G is a random graph. 1:
Read n and m.
2:
Construct all graphs with n vertices and m edges.
3:
G ← K¯n
4:
Select one with equal probability.
5:
return G
4. Watts-Strogatz model Another important graph model is Watts-Strogatz model [48]. Like Erd¨ osR´enyi model, it produces a static graph. Moreover, Watts-Strogatz model generates graphs whose degree distribution diverges from Zipf law. However, it is an excellent vehicle for demonstrating the small world phenomenon. Algorithm 3 outlines the Watts-Strogatz model. It starts with a regular ¯ assumed to be an even integer, lattice graph of n vertices, the mean degree d, and a special parameter β0 with the property that 0≤β≤1 and 1 ≪ d¯ ≪ ln n ≪ n Then, the model constructs an undirected graph with n nodes and
nd¯ 2
edges
in the following way: From the definition of the regular ring lattice, each vertex is connected to d¯ neighbors,
d¯ 2
on each side. That is, if the vertices are labeled
from v0 to vn−1 , there is an edge ((vi , vj ) if and only if d¯ d¯ ≤ 0 < |i − j| mod n − 2 2 16
(20)
For every node vi every edge (vi , vj ) with i < j is examined, and a new endpoint is selected with probability β0 . Rewiring is done by replacing the edge (vi , vj ) with the edge (vi , vk ) where k is chosen with uniform probability from all possible values that avoid self loops and link duplication. Algorithm 3 Watts-Strogatz model ¯ and β0 as described. Require: Parameters n, d, Ensure: G is a small world graph. 1:
¯ and β0 . Read n, d,
2:
Construct regular ring lattice G.
3:
for i ← 0 to n − 1 do
4:
for j ← i to n − 1 do
5:
if (vi , vj ) ∈ E then
6:
Select a random k uniformly.
7:
Replace (vi , vj ) with (vi , vk ) with probability β0 .
8: 9:
end if end for
10:
end for
11:
return G
5. Albert-Barab´ asi model The network begins with an initial connected graph of m0 vertices [3]. New vertices are added to the graph one at a time. Each new node is connected to m ≤ m0 existing vertices with a probability that is proportional to the number of edges that the existing vertices already have. Formally, the probability pk that the new vertex is connected to existing vertex k is deg (vk ) pk = P i deg (vi )
(21)
Higher degree vertices tend to quickly accumulate more neighbors, while vertices with only a few edges are unlikely to be chosen as the destination for a new
17
edge. The new vertices tend to link to the already heavily connected vertices. This is termed the preferential attachment or the rich-get-richer phenomenon. The degree distribution of the Albert-Barab´ asi model follows a power law, in particular it is a power law of the form f (k) = α0 k −3
(22)
where k is the vertex degree. The average path ℓ¯ length of the Albert-Barab´ asi model increases approximately logarithmically with the graph size as ln n ¯ ℓ = Θ ln ln n
(23)
which is systematically shorter than that of a random graph. Algorithm 4 Albert-Barab´ asi model. Require: An initial graph G0 and m. Ensure: G[n] is a scale-free graph. 1:
Read G0 and m.
2:
G[0] ← G0
3:
for k ← 1 to n do
4:
Create a new vertex v ′ .
5:
for s ← 1 to m do
6:
Select a vertex v ∈ G[k − 1] as described in (21).
7:
Create (v ′ , v)
8: 9: 10:
end for end for return G[n]
6. Aiello models Aiello models have been introduced in [1] in an effort to model primarily telephone networks. Model C is derived from model B, while model B is based
18
on model A. Model D is a version of model C generating undirected graphs. All four models generate power law graphs but are unable to capture edge deletion. 6.1. Model A Model A is a variant of the Albert-Barab´ asi model where at each time step there are now two possible evolution options. Either with probability α0 the directed edge (u, v) is added to the existing graph, or with probability 1 − α0 a new vertex with in- and out-degree of 1 will be added to the graph as in the Albert-Barab´ asi model. The only exception to this evolution rule is at the first time step where a new vertex with in- and out-degree of 1 is added to the graph in order to be used as the basic building block for subsequent graph evolution. Notice that for α0 = 1 the model becomes the Albert-Barab´ asi one with m = 1. It should be highlighted that whenever the edge (u, v) is to be added to the current graph, the tail u is chosen with probability proportional to the weight wout which is the vertex out-degree plus one. In a similar fashion, the head v is chosen with with probability proportional to the weight win which is the vertex in-degree plus one. Thus, two parallel preferential attachment mechanisms are running, one for the out-degree and one for the in-degree. At time k both the sum of the in-weights and the sum of the out-weights are equal to k. So, the probability pu→v that the new edge (u, v) will be added is pu→v = α0
(1 + degi (v; k)) (1 + dego (u; k)) k2
(24)
Notice that in- and out-degrees are time-variant quantities and need to be indexed by time k. 6.2. Model B Aiello model B improves over model A in the sense that different power laws can be generated for the in- and out-degrees and that edge density can be also parameterized. These two additional functions are controlled through the combined effect of a parameter α0 as in model A and two new parameters, γ in and γ out . 19
Algorithm 5 Aiello model A. Require: Parameter α0 , 0 < α0 ≤ 1 Ensure: G[n] is a power law graph. 1:
Create a single vertex.
2:
for k ← 2 to n do
3:
Sample the uniform distribution to obtain δ ∈ [0, 1]
4:
if δ ≥ α0 then
5:
Add a new vertex v ′
6:
Select a v i according to win and add v ′ , v i
7:
Select a v o according to wout and add (v o , v ′ )
8:
else
9:
Select a v i ∈ G[k] according to win
10:
Select a v o ∈ g[k] according to wout Add v o , v i to G[k]
11: 12:
end if
13:
end for
14:
return G[n]
20
As in model A, model B starts with an empty graph. At the first step a single vertex with in-weight γ in and out-weight γ out is added. At time k, k > 1 with probability 1 − α0 a new vertex with in-weight γ in and out-weight γ out is added to the graph, whereas with probability α0 a new directed edge is added to the graph. Both its head and tail are selected with probability proportional to the total in- and the total out-weight of each vertex respectively. The total in-weight of a vertex equals the sum of the original in-weight γ in and of the in-degree of that vertex. The out-degree of a vertex is similarly defined. The procedure is described in algorithm 6. Algorithm 6 Aiello model B. Require: α0 , γ in , γ out , and n Ensure: G[n] is a scale free graph. 1:
Create a single vertex.
2:
for k ← 2 to n do
3:
Sample the uniform distribution to obtain δ ∈ [0, 1]
4:
if δ ≤ α0 then
5:
Compute the in-weight of each degree.
6:
Assign probabilities proportional to total in-weights.
7:
Select head v
8:
Repeat analogous steps to 5-7 to select tail u
9:
Add edge (u, v)
10: 11: 12:
else Add a new vertex. end if
13:
end for
14:
return G[n]
6.3. Model C Aiello model C requires four parameters me,e , mn,e , me,n , and mn,n and works differently from models A and B in the sense that, like the Albert-Barab´ asi 21
model, a new vertex is added to the graph at each step. The model is probabilistic as all four parameters are randomly drawn from the same bounded distribution, which may even be time dependent as long as a limit distribution exists. Unlike the previous models, model C starts with an initial directed component. At each step a vertex v ′ is added to the graph and four random integers are drawn: me,e is a number of edges to be added somewhere in the graph where neither their heads nor their tails include v ′ . Their endpoints are chosen with probability proportional to their degree as to ensure preferential attachment conditions. mn,e is the number of edges from v ′ to other vertices in the graph. Their heads are selected also according to a preferential attachment mechanism. In contrast to other models, multiple edges between vertices may exist and are neither fused nor deleted. Similarly, there are me,n edges from the rest of the graph to v ′ with the tails selected in a manner analogous to the previous case. Finally, mn,n self loops are added to v ′ . Notice that, the longer the model runs, the less the original component affects the final graph properties. Algorithm 7 Aiello model C. Require: Number of steps n and distribution f . Ensure: G[n] is a directed scale free graph. 1:
Read original graph component.
2:
for k ← 1 to n do
3:
Add a new vertex v ′
4:
Draw me,e , mn,e , me,n , and me,e from f .
5:
Ensuring preferential attachment conditions:
6:
Add me,e edges to G[k − 1]
7:
Add me,n edges from G[k − 1] to v ′
8:
Add mn,e edges from v ′ to G[k − 1]
9:
Add mn,n loops to v ′
10:
end for
11:
return G[n]
22
6.4. Model D Model D is a variant of model C for undirected graphs. It requires only three random parameters from any bounded and potentially time variant distribution, namely me,e , me,n , and mn,n . Of these, the first and the third are exactly the equivalent parameters of model C. The second one combines two parameters of model C and quantifies the interaction between the newly added vertex v ′ and the remainder of the network. Algorithm 8 Aiello model D. Require: Number of steps n and distribution f . Ensure: G[n] is an undirected scale free graph. 1:
Read original graph component.
2:
for k ← 1 to n do
3:
Add a new vertex v ′
4:
Draw me,e , mn,e , me,n , and me,e from f .
5:
Ensuring preferential attachment conditions:
6:
Add me,e edges to G[k − 1]
7:
Add me,n edges from G[k − 1] to v ′
8:
Add mn,n loops to v ′
9: 10:
end for return G[n]
7. Kronecker models 7.1. Deterministic variant The deterministic Kronecker graph evolution model supersedes the earlier R-MAT model [7] and it is based on direct adjacency matrix handling through the Kronecker matrix product. The latter derives from descriptive group theory. Although straightforward and rather simplistic looking, both the deterministic Kronecker graph evolution model and its probabilistic counterpart have been proven [29] to be important tools in synthetic graph creation, the main reason 23
being that most important real large graph properties, including those outlined earlier, are present in graphs generated by this model, termed Kronecker graphs for short. The model is outlined in algorithm 9. The Kronecker product is a matrix operation different than both classical matrix multiplication and Hadamard or entrywise multiplication. For matrices A ∈ Rm1 ×n1 and B ∈ Rm2 ×n2 the Kronecker product denoted by A⊗B ∈ Rm1 m2 ×n1 n2 is defined as A[1, 1] B A[2, 1] B △ A⊗B= .. .
A[m1 , 1] B
A[1, 2] B
...
A[1, n1 ] B
A[2, 2] B .. .
... .. .
A[2, n1 ] B .. .
A[m1 , 2] B
. . . A[m1 , n1 ] B
(25)
where A[i, j] denotes the entry of A at row i and column j. Notice that the Kronecker product can be computed in a straightforward manner in MATLAB or in NetworkX Python package. Kronecker product has a number of interesting properties that are useful in theoretical or computational problems. First and foremost, like ordinary matrix multiplication, the Kronecker product is non commutative. A1 ⊗ A2 6= A2 ⊗ A1
(26)
However, there exist permutation matrices Pr and Pc such that A1 ⊗ A2 = Pr (A2 ⊗ A1 ) Pc
(27)
Unlike matrix multiplication, the fact that A1 ⊗ A2 = O
(28)
does imply that either A1 or A2 equals zero. For symmetric matrices A1 ∈ Rn1 ×n1 and A2 ∈ Rn2 ×n2 T
A1 ⊗ A2 = (A1 ⊗ A2 )
= AT1 ⊗ AT2
(29)
Note that, contrary to the standard transposition rule, the order of A1 and A2 is preserved under the Kronecker product. 24
For invertible matrices A1 ∈ Rn1 ×n1 and A2 ∈ Rn2 ×n2 −1 (A1 ⊗ A2 )−1 = A−1 1 ⊗ A2
(30)
Again the order of A1 and A2 is preserved under the Kronecker product. For any matrices A1 , A2 , B1 , and B2 of compatible dimensions (A1 B1 ) ⊗ (A2 B2 ) = (A1 ⊗ A2 ) (B1 ⊗ B2 )
(31)
Therefore, if A1 and A2 are diagonalizable as A1 = G1 Λ1 GT1 and A2 = G2 Λ2 GT2 , then A1 ⊗ A2 = (G1 ⊗ G2 ) (Λ1 ⊗ Λ2 ) (G1 ⊗ G2 )
T
(32)
Thus, when the spectra of A1 and A2 are known, then the spectrum of A1 ⊗A2 is also known automatically. Note that the factor G1 ⊗G2 needs to be constructed only when it is explicitly required. Moreover, if only a part of G1 ⊗ G2 is required, then it can be constructed on demand easily. For rectangular matrices A1 ∈ Rn1 ×n1 and A2 ∈ Rn2 ×n2 n2
det (A1 ⊗ A2 ) = det (A1 )
det (A2 )
n1
(33)
Notice that det (A1 ) is raised to the n2 -th power while det (A2 ) is raised to the n1 -th power. n
For any given matrix A ∈ Rm×n whose columns are the n vectors {ak }k=1 ∈ Rm×1 , the vec (·) operator is defined as: a1 a2 . △ vec (A) = .. ∈ Rmn×1 an−1 an
(34)
That is, vec (A) is a vector with the elements of A stacked in column-major order, which is the way matrices are stored internally in MATLAB. Combining the vec (·) with the Kronecker product yields the following identity: vec (A1 A2 ) = (I ⊗ A1 ) vec (A2 ) 25
(35)
where I ⊗ A1 when I ∈ RmI ×mI and A1 ∈ RmA ×nA equals the block diagonal matrix
A1
OmA ,nA I ⊗ A1 = .. . OmA ,nA
OmA ,nA A1 .. . OmA ,nA
. . . OmA ,nA
. . . OmA ,nA .. .. . .
...
(36)
A1
where OmA ,nA denotes a mA × nA block of zeros. The primary advantage of Kronecker graphs is that they capture many of the real world graphs, especially when their scale exceeds the bound of a few hundred thousand vertices. Because of the Kronecker product nature itself and because of the way it is used in subsequent sections, the Kronecker graphs are quite adept in imitating or capturing most self-similarity patterns emerging in a considerable number of large scale graphs. These special patterns are outlined among others in [30][29][32] and are summarized in the following paragraphs. The node degree ranking decays according to a Zipf function. A degree of a vertex vk , denoted by deg (vk ), is the number of vertices whose one end lies in vk . As a Zipf function decays much slower than an exponential one, there are relatively many vertices with a large number of neighbors, which intuitively serve as hubs within the graph. The number of vertices whose degree equals an integer k is |{v : deg (v) = k}| = α0 k −γ0 ,
γ0 ≥ 1, α0 > 0
(37)
It is conjectured that a given Zip exponent γ0 characterizes an entire subclass within the class of Kronecker graphs [30], but as yet it is unknown whether additional information is required in order to specify a graph subclass. α0 is a normalization constant which can be used to count, depending on its value, either the actual number or the relative appearance frequency of vertices of a given degree k. The adjacency matrix eigenvalues in absolute value ranking is also a Zipf function and the same is true for the g1 components, which is termed the primary eigenvector. 26
The number of triangles a given vertex participates to is also a Zipf function. A triangle is defined as any triplet of vertices where each one is directly connected with an edge to the other two. In graph theoretic terms it is a complete graph of order three. Its significance is more evident in social graphs, as a friend of any given person is especially likely to also know that person’s other friends. Kronecker graphs become denser over time, with ρ0 increasing in each time step according to a power law. The above implies that Kronecker as well as real world graphs have diameters and effective diameters which shorten over time, leading thus to a more compact and robust structure. In social sciences this is termed the small world phenomenon, which is a more general case of the six degrees of separation. Given a graph generator G0 = G[0] = (V0 , E0 ), the Kronecker graph construction is straightforward. At each discrete time step n apply the Kronecker product to the adjacency matrices of G[n] and G0 , A[n+1] and A0 respectively, to obtain A[n+1] , the adjacency matrix of G[n] : A[n+1] = A[n] ⊗ A0
(38)
The above relationship states that G0 is employed as the building block, initially directly and subsequently indirectly, upon the entire Kronecker graph is constructed at the left-hand side of Kronecker operator and simultaneously as the rules, at each time step, according to which the graph evolves. The twofold role of G0 is in fact the key to self-similarity of a Kronecker graph. Depending on the application, this adjacency matrix sequence can represent either the evolution of the same graph, where each adjacency matrix is a snapshot when the graph has n, n2 . . . or nm vertices, or a sequence of related graphs. Assuming that A0 = Q0 Λ0 QT0 is the spectral factorization of A0 , which is easily obtained as |V0 | is small, typically less than a few hundred numbers long in each size, employing the
27
Kronecker product properties yields A[1] = Q0 Λ0 QT0 ⊗ Q0 Λ0 QT0
T
= (Q0 ⊗ Q0 ) (Λ0 ⊗ Λ0 ) (Q0 ⊗ Q0 )
T
A[n+1] = (Q0 ⊗ . . . ⊗Q0 ) (Λ0 ⊗ . . . ⊗Λ0 ) (Q0 ⊗ . . . ⊗Q0 ) | {z }| {z }| {z } n
[n]
[n]
= Q 0 Λ0
n
n
T [n] Q0
A direct implication is that the spectrum of G[n] is known once the spectrum of G0 is known. Another consequence is that the spectrum of G[n] follows the multinomial distribution. As stated in [30], this is a discrete distribution which, although behaves like a Zipf function, has gaps which may hinder easy Kronecker graph parameterization. This has been addressed in the probabilistic Kronecker model. Algorithm 9 Deterministic Kronecker model. Require: G0 is a valid adjacency matrix. Ensure: G[n] will be a valid adjacency matrix. 1:
Read G0 and n.
2:
G[1] ← G0
3:
for k ← 2 to n do
4:
G[k] ← G[k−1] ⊗G0
5:
end for
6:
return G[n]
7.2. Probabilistic variant It is the probabilistic counterpart of the deterministic Kronecker graph evolution model. While the latter captures with a remarkable simple mechanism the self-similar aspects of large scale-free graphs, it does still possess certain disadvantages. The primary one is that, even under row and column permutations, the Kronecker graph adjacency matrix may look random to the casual human observer but its artificial nature is at least partially visible to a field expert or, 28
in case of very large instances, to a computer. This is mainly attributed to the high adjacency matrix eigenvalue clustering, which becomes clearer as the Kronecker graph grows. This is a direct consequence of the Kronecker product nature and, hence, an inherent Kronecker graph charachteristic. Another major drawback of the deterministic variant is that, like virtually any other deterministic graph evolution model, there can be only one Kronecker graph of a certain size for a particular generator graph, which may be limiting in certain scenaria. The probabilistic Kronecker graph evolution model addresses both problems elegantly and efficiently. Instead of generating a Kronecker graph instance at each time step, it generates edge existence probabilities. Therefore, for a given graph size there is a number of Kronecker graphs that can be generated whose edge existence probabilities follow the same distribution. The probabilistic Kronecker graph evolution model requires an initial symmetric probability matrix P0 whose entry P0 [i, j] specifies the edge existence probability between vi and vj . Then the following matrix sequence is generated: P[1] = P0 ⊗P0 P[n+1] = P[n] ⊗P0
(39)
When the desired adjacency matrix size is reached, usually the nearest power of |V0 |, then a graph is constructed where an edge between vi and vj exists with probability P[n] [i, j]. Note that, edge existence probabilities may need to be normalized such that P[n] will be a stochastic or a doubly stochastic matrix. Thus multiple adjacency matrices can be constructed as realizations of the same P[n ]. This way the spectrum of A[n] can be shown [30] to be smoother. Naturally, in this case, the number of triangles and the elements of g1 are random variables, requiring thus a number of experiments in order to approximate the stochastic quantities of interest with the corresponding ensemble averages. The probablistic Kronecker model is outlined in algorithm 10.
29
Algorithm 10 Probabilistic Kronecker model. Require: P0 is a symmetric, doubly stochastic matrix. Ensure: P[n] will be a valid edge distribution. 1:
Read P[0] and n.
2:
P[1] ← P[0]
3:
for k ← 2 to n do
4:
P[k] ← P[k−1] ⊗P0
5:
end for
6:
Normalize P[n] to make it stochastic. {Optional}
7:
return P[n]
8. Software, data structures, and benchmarks Recently, a novel data structure based on splay trees and skip lists has been proposed offering efficient persistency operations on graphs [27]. Persistent graphs are significant as they store not only information regarding their current state but also information capable of reverting the graph to previous states. In a graph database context persistency enables rollback operations, enhancing thus database durability. Currently, the most common benchmark for assessing the performance of large scale graph software and algorithms is Graph 500 [18]. It is operated by a steering committee comprised by a large number of software and hardware companies. Regarding low level graph mining software, there are many choices. GraphLab [19] is a parallel machine learning framework developed in CMU that supports a number of machine learning paradigms [33]. It contains algorithms such as Bayesian inference and Expectation-Maximization. Giraph [42] is the Apache Foundation open large graph processing system. It is currently being employed, among others, by Facebook in order to support the social graph search functions. High level software for graph handling also abounds. Graph Magics [9] is a tool for graph theory focusing mostly on graph structure. Similar packages include iGraph [43], Pajek [4], and Gephi [16]. Finally, NetworkX [39] is a 30
popular Python package for generating and handling graphs. It contains a large array of synthetic graph generators as well as of graph processing algorithms.
9. Conclusions and future work This survey started with an outline of the large-scale, real graph characteristics and properties such as the Web graph and biological graphs. These properties have been employed in turn in order to derive criteria for a number of synthetic graph generators. Developing such a generator is by no means a trivial task, as rigorous mathematical analysis is required in order to ensure that many graph aspects such as the number of triangles or the graph spectrum obey a power law or a power-cutoff law. Finally, this survey concluded with a brief introduction to graph generation and manipulation software as well as to the benchmarks used to evaluate this software. Parallel graph generation is currently an important challenge [20][21]. Although it is straightforward to locally create subgraphs possessing some or even all the properties of the real world graphs such as the shrinking diameter or the vertex degree power law distribution, doing so along many distributed processors poses certain organizational and communications challenges. Data streams and online algorithms are very promising research directions in graph mining, as incremental graph generating and pattern detection tend to become prevalent in the big data era. New algorithms utilizing probabilistic methods and external memory as well as distributed systems techniques [21] need to be develop in order to successfully address these challenges.
References References [1] W. Aiello, F. Chung, and L. Lu. Random evolution in massive graphs. Handbook of massive data sets, pages 97–122, 2002. [2] B. Andr´ asfai. Graph theory: flows, matrices. CRC Press, 1991. 31
[3] A.-L. Barab´ asi and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999. [4] V. Batagelj and A. Mrvar. Pajek, July 2014. [5] Z. Bi, C. Faloutsos, and F. Korn. The ”dgx” distribution for mining massive, skewed data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’01, pages 17–26, New York, NY, USA, 2001. ACM. [6] B. Bollob´ as. Modern graph theory. Springer-Verlag, 1998. [7] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In SDM, 2004. [8] F. R. Chung. Spectral graph theory. AMS, 1997. [9] D. Ciubatii. Graph magics, July 2014. [10] C. D. Correa, T. Crnovrsanin, K.-l. Ma, and K. Keeton. The derivatives of centrality and their applications in visualizing social networks, 2009. [11] R. Dawkins. The selfish gene. Popular Science Series. Oxford University Press, Incorporated, 1976. [12] N. Durak, A. Pinar, T. G. Kolda, and C. Seshadhri. Degree relations of triangles in real-world networks and graph models. In CIKM’12: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 1712–1716. ACM, 2012. [13] P. Erd¨ os and A. R´enyi. On the evolution of random graphs. In Publication of the mathematical institute of the hungarian academy of sciences, pages 17–61. Hungarian academy of sciences, 1960. [14] E. Estrada and D. J. Higham. Network properties revealed through matrix functions. SIAM Rev., 52(4):696–714, Nov. 2010.
32
[15] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In ACM SIGCOM Conference, pages 251–262, 1999. [16] Gephi team. Gephi, July 2014. [17] M. Girvan and M. Neuman. Community structure in social and biological networks. In PNAS, 2002. [18] graph500 Steering Committee, July 2014. [19] GraphLab Inc. Graphlab, July 2014. [20] M. Gupta. Automatic Data Partitioning on Distributed Memory Multicomputers. PhD thesis, UIUC, 1992. [21] S. Gupta. External memory based distributed generation of massive scale social networks on small clusters. CoRR, abs/1210.0187, 2012. [22] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, March 1953. [23] J. Kleinberg, S. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web as a graph: Measurements, models, and methods. In The fifth Annual International Conference on Combinatorics and Computing (COCOON), 1999. [24] C. Klymko, D. F. Gleich, and T. G. Kolda. Using triangles to improve community detection in directed networks. In The Second ASE International Conference on Big Data Science and Computing, BigDataScience, 2014. accepted 2014-04-16. [25] T. G. Kolda, A. Pinar, T. Plantenga, and C. Seshadhri. A scalable generative graph model with community structure. SIAM Journal on Scientific Computing, 36(5):C424–C452, September 2014.
33
[26] T. G. Kolda, A. Pinar, T. Plantenga, C. Seshadhri, and C. Task. Counting triangles in massive graphs with MapReduce. SIAM Journal on Scientific Computing, 2014. accepted 2013-12-04. [27] S. Kontopoulos and G. Drakopoulos. A space efficient scheme for persistent graph representation. In The IEEE International Conference on Tools with Artificial Intelligence, July 2014. [28] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, FOCS ’00, 2000. [29] J. Leskovec, D. Chakrabarti, J. Kleinberg, and C. Faloutsos. Realistic, mathematically tractable graph generation and evolution using kronecker multiplication. In PKDD05, pages 133–145. Springer, 2005. [30] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. J. Mach. Learn. Res., 11:985–1042, Mar. 2010. [31] J. Leskovec and C. Faloutsos. Scalable modeling of real graphs using kronecker multiplication. In Proceedings of the 24th international conference on Machine learning, pages 497–504. ACM, 2007. [32] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD05, pages 177–187, New York, NY, USA, 2005. ACM. [33] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Graphlab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010. [34] L. Lu. The diameter of random massive graphs. In The twelfth ACM-SIAM Symposium on Discrete Algorithms (SODA), 2001. 34
[35] F. D. Malliaros and V. Megalooikonomou. Expansion properties of large social graphs. In Proceedings of the 16th international conference on Database systems for advanced applications, pages 311–322. Springer, 2011. [36] F. D. Malliaros, V. Megalooikonomou, and C. Faloutsos. Estimating robustness in large social graphs. Knowledge and Information Systems: An International Journal. [37] F. D. Malliaros, V. Megalooikonomou, and C. Faloutsos. Fast robustness estimation in large social graphs: Communities and anomaly detection. In Proceedings of the SIAM International Conference on Data Mining, 2012. [38] B. Mandelbrot. The fractal geometry of nature. W.H. Freeman and Company, 1977. [39] NetworkX developer team. Networkx, July 2014. [40] G. Rozenberg and A. Salomaa. The mathematical theory of L systems. Academic Press, New York, NY, USA, 1980. [41] C. Seshadhri, A. Pinar, and T. G. Kolda. An in-depth analysis of stochastic Kronecker graphs. Journal of the ACM, 60(2), April 2013. [42] The Apache Foundation. Apache giraph, July 2014. [43] The iGraph core team. iGraph, July 2014. [44] H. Tong, B. A. Prakash, T. Eliassi-Rad, M. Faloutsos, and C. Faloutsos. Gelling and melting large graphs by edge manipulation. In Proceedings of the 21st ACM international conference on Information and knowledge management, CIKM12, pages 245–254, 2012. [45] C. E. Tsourakakis. Fast counting of triangles in large real networks without counting: Algorithms and laws. In ICDM, pages 608–617, 2008. [46] J. S. Vitter. External memory algorithms and data structures: Dealing with massive data. ACM Comput. Surv., 33(2):209–271, Jun 2001. 35
[47] D. Watts. Six degrees: The science of a connected age. W.W.Norton and co., 2003. [48] D. Watts and S. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393(6684):440–442, 1998.
36
Symbol
Meaning
△
=
Right hand side equals left one by definition
b
Vector (lowercase underlined boldface)
b
T
Vector transpose (vectors assumed columns by default)
A
Matrix (uppercase boldface)
In
n × n identity matrix
Om,n
m × n zero block
λ
Matrix spectrum
E E
Estrada index (equation (12)) ′
Odd length Estrada index (equation (13))
N
Neuman index (equation (14))
C
Cheeger number (equation (10))
ρ0
Graph density (equation (16))
ρ′0
Logarithmic graph density (equation (16))
Kn
Complete graph with n nodes
⊗
Kronecker tensor product (equation (36))
hxk i
Sequence of elements xk
{s1 , . . . , sn }
Set with elements s1 , . . . , sn
|S|
Cardinality of set S
(u, v)
Edge between u and v
deg (v)
Degree of vertex v
degi (v)
In-degree of vertex v
dego (v)
Out-degree of vertex v
d¯
Graph average degree
ℓ¯
Graph average length
¯ G
Complement of graph G Table 1: Summary of symbols used in this survey.
37
Spectral Properties λ1 is always positive. λ2 is an information diffusion metric. λ|V | always equals zero. Number of zero eigenvalues equals the connected components minus one. ∆ is a metric of graph growth capability. The spectrum of unidrected graphs is always real. Smaller eigenvalues tend to alternate around zero. The spectrum is unaltered under isomorphisms. Eigenvalues come from a Zipf function. g1 components come from a Zipf function. Table 2: Scale free graph sprectral properties
Structural Properties Self similarity. Communities. Diameter and effective diameter shorten over time. Graph densifies over time. Number of triangles comes from a Zipf function. Vertex degree comes from a Zipf function. Edge existence interdependence. Table 3: Scale free graph structural properties.
38
Criterion
Options
Evolution type
Static vs dynamic
Edge direction
Directed vs undirected
Repeatability
Deterministic vs probabilistic
Design
Top-down vs bottom-up
Targeted properties
Structural vs spectral
Number of triangles
Closed form vs numerical
Edge probability
Closed form vs numerical
Scalability
Scalable vs non-scalable
Table 4: Basic synthetic graph model criteria.
Model
Section
Erd¨ os-R´enyi Gn,p
Section 3.1
Erd¨ os-R´enyi Gn,m
Section 3.2
Watts-Strogatz
Section 4
Albert-Barab´ asi
Section 5
Aiello model A
Section 6.1
Aiello model B
Section 6.2
Aiello model C
Section 6.3
Aiello model D
Section 6.4
Deterministic Kronecker
Section 7.1
Probabilistic Kronecker
Section 7.2
Table 5: Graph models listed in this survey.
39