CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2016; 00:1–19 Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe
Distributed computing of distance-based graph invariants for analysis and visualization of complex networks
Wojciech Czech∗, Wojciech Mielczarek, Witold Dzwinel
AGH University of Science and Technology, Institute of Computer Science
SUMMARY
We present a new framework for analysis and visualization of complex networks based on structural information retrieved from their distance k-graphs and B-matrices. The construction of B-matrices for graphs with more than 1 million edges requires massive BFS computations and is facilitated using new software prepared for distributed environments. Our framework benefits from the data parallelism inherent to the all-pair shortest-path (APSP) problem and extends Cassovary, an open-source in-memory graph processing engine, to enable multi-node computation of distance k-graphs and related graph descriptors. We also introduce a new type of B-matrix, constructed using the clustering coefficient vertex invariant, which can be generated with a computational effort comparable to the one required for the previously known degree B-matrix, while delivering an additional set of information about graph structure. Our approach enables efficient generation of expressive, multi-dimensional descriptors useful in graph embedding and graph mining tasks. The experiments show that the new framework is scalable and, for the specific APSP task, provides better performance than existing generic graph processing frameworks. We also present how the developed tools helped in the analysis and visualization of real-world graphs from the Stanford Large Network Dataset Collection. Copyright © 2016 John Wiley & Sons, Ltd.
KEY WORDS: distance k-graph, graph invariant, APSP, graph comparison, graph visualization
1. INTRODUCTION

Large datasets, abundant in many fields of science and technology, are frequently structured, that is, they form graph patterns or networks which aggregate information about relations between objects and provide system-wise views of the mechanisms generating data. Graph processing tasks involve searching state spaces, filtering, link prediction, generation of invariants, sub-structure matching, link recommendation, community detection and classification. As the size of graphs increases to millions or even trillions of edges [1], their processing and analysis become challenging, both from the computational and the storage perspective. Additionally, specific structural properties of real-world graphs, such as the scale-free distribution of vertex degrees, do not allow for efficient parallelization. This is because the existence of densely connected vertices (hubs) makes graph partitioning highly problematic and has a negative impact on communication between multiple machines. The computing node storing the adjacency lists of hubs has to be contacted much more frequently than any other node. This causes serious delays, which grow with the number of edges crossing partitions. In this work we discuss a specific subgroup of graph analysis tools, namely graph embedding, that is, the transformation of graph patterns to feature vectors, otherwise known as graph descriptor generation.

∗ Correspondence to: [email protected]
Graph embedding is one of the most frequently used methods of graph comparison, providing a bridge between statistical and structural pattern recognition [2]. After generating descriptors invariant under graph isomorphism, the power of well-established vector-based pattern recognition methods is brought to the more complex structural world, which deals with a non-trivial combinatorial domain. Graph feature vectors enable computation of dissimilarity or similarity measures crucial in applications like image recognition, document retrieval and quantitative analysis of complex networks [3].
The rest of this work is organized as follows. In Section 2 we describe the relation between graph embedding and complex network analysis, explaining our motivation for computing distance-based graph invariants. Next, in Section 3, an overview of tools for massive graph processing is presented. Section 4 introduces graph descriptors based on inter-vertex distances and proposes new invariants well-suited for large graph analysis. In Section 5, we describe implementation details of the new distributed graph processing framework, built on top of Twitter's Cassovary and aimed at generating distance k-graphs and related descriptors. Section 6 contains performance and scalability analysis of our software, as well as a comparison with other graph processing frameworks. Later, in Section 7, we present selected results of analyzing complex networks from the Stanford Large Network Dataset Collection. Section 8 concludes the paper, offering final remarks and describing future work plans.
2. COMPLEX NETWORKS AND GRAPH COMPARISON

Graph datasets appear frequently in various fields of contemporary science as data structures aggregating information about relations between numerous objects. Today, a variety of structured data is represented by different types of networks, including cellular graphs describing regulatory mechanisms and metabolic reactions inside a cell, social networks encoding inter-personal links, spatial graphs representing proximity relations between elements embedded in metric space, web graphs gathering information about hyperlinks, and many others. The non-trivial structure of real-world graphs, studied extensively in the recent decade, gave rise to a new interdisciplinary research field of complex networks. Starting from the pioneering works by Watts [4], studying the small-world phenomenon, and Barabási [5], explaining scale-free degree distributions in real-world graphs, the analysis of complex networks revealed a set of interesting structural properties emerging from underlying dynamics [6]. The heavy-tailed distributions of vertex degrees have a big influence on spreading processes on graphs and determine network resistance to random attacks, at the same time inducing vulnerability to targeted attacks. The quantitative analysis of complex networks developed a set of graph invariants† which allow comparing different networks and drawing conclusions regarding their generating mechanisms and future evolution. Those descriptors reflect local or global topological features of networks and the time complexity of their algorithms varies from O(|V(G)|) to O(|V(G)|^3). Typically, for global descriptors such as efficiency, the computational challenge occurs at the scale of millions of edges.
In parallel to descriptors originating from the theory of complex networks, the structural pattern recognition community developed multiple alternative approaches to graph embedding [2]. The common characteristic of those descriptors is that they are typically multi-dimensional, encoding a variety of topological features. The goal is to capture the most discriminative features of a graph to enable efficient classification and clusterization of graph patterns. The most recent studies in the field proposed descriptors based on different random walk models [7, 8], spectral graph theory [9], prototype-based embedding [10, 11], substructure embedding [12], wave kernel trace [13] and distribution of graph entropy [14]. Those descriptors were designed for graphs representing documents, molecules or images, which have considerably smaller size than a typical complex network with hundreds of thousands of edges. Therefore, they are less practical for analysis of large graphs due to considerable computational overhead.
† diameter, efficiency, characteristic path length, vertex betweenness, vertex closeness, vertex eccentricity, transitivity, clustering coefficient, assortativity [6].
In [15] we introduced a new method of graph embedding which uses invariants of distance k-graphs, and specifically degree B-matrices, as a framework for graph embedding. Here, we extend that study by introducing a new B-matrix based on the clustering coefficient and presenting software dedicated to distributed computing of distance-based graph invariants. Based on B-matrices derived from shortest-path distributions, we construct multi-dimensional representations of large complex networks and provide a universal method of their visualization. Our motivation for further exploring B-matrices is twofold: (i) they form space-efficient 2D fingerprints of networks, useful in graph comparison and visualization; (ii) real-world complex networks are three orders of magnitude larger than the previously tested graphs, therefore all-pair shortest-paths computation is a technical challenge, which could be addressed by using state-of-the-art graph processing frameworks.
3. MASSIVE GRAPH PROCESSING

Generating vector representations of complex networks, and specifically computing shortest-path distance matrices for large graphs, is a computationally expensive task which can be facilitated using different types of processing frameworks, both distributed and single-node. The Hadoop technology, being the state-of-the-art implementation of the Map-Reduce parallel programming model, appears to be a suitable tool for processing big structured data. Unfortunately, the iterative nature of graph traversals such as BFS, requiring multiple consecutive executions of heavy Map-Reduce jobs, brings significant intrinsic computational cost to Hadoop-based implementations. Another negative factor stems from the heavy-tailed vertex degree distributions typical for real-world graphs. They cause highly non-uniform workloads for reducers, which badly affects overall performance.
To overcome those limitations, several frameworks optimized for iterative Map-Reduce processing were proposed. One of them is HaLoop [16], a modified version of Hadoop designed to comply with the specific requirements of iterative parallel algorithms. The framework reduces the number of unnecessary data shuffles, provides loop-aware task scheduling and caching of loop-invariant data. Twister [17] provides a distributed environment which brings several extensions improving the efficiency of iterative Map-Reduce jobs, e.g., caching static data used by tasks, increasing the granularity of tasks, introducing an additional phase of computation (combine) and a more intuitive API. The PrIter framework [18] accelerates convergence of iterative algorithms by selectively processing certain data portions and giving higher priorities to the related subtasks. This allows achieving considerable speedups compared to Hadoop implementations. GraphX [19], built on top of the distributed dataflow framework Apache Spark, provides a convenient interface for interacting with graph structures based on a vertex-centric computing model. The Bulk-Synchronous Parallel (BSP) model is implemented by Google Pregel [20] (C++) and its open-source counterparts: Apache Giraph [21] (Java) and Stanford GPS [22] (Java). Here, the computation is divided into supersteps, which perform vertex-based local processing by evaluating functions defined on vertices and exchanging messages with neighbors. Synchronization barriers occur between supersteps, imposing the ordering required for on-time delivery of inter-vertex messages passed along edges. Again, the hub vertices present in scale-free graphs cause communication peaks for certain workers in the distributed BSP model. This problem cannot be solved easily, as balanced graph partitioning is an NP-complete task. The popular GraphLab framework [23] uses a different Gather-Apply-Scatter (GAS) programming model, which enables execution without communication barriers (asynchronous mode). Moreover, it employs edge-based graph partitioning, resulting in better-balanced communication for graphs with heavy-tailed degree distributions. A comprehensive analysis of performance and scalability of different distributed graph processing frameworks was reported in several works, based on test cases such as Page Rank, Single-Source Shortest Path, Weakly-Connected Components [24], Graph Coloring, Bipartite Maximal Matching [25] or Label Propagation [1]. Currently, Apache Giraph and GraphLab appear to gain the greatest attention of the graph mining community [1].
Distributed processing as a way of dealing with large graphs brings fault tolerance and horizontal scalability, but at the cost of implementation complexity and troublesome communication bursts
caused by uneven data partitions. A different approach assumes in-memory processing of a whole graph on a single machine. Due to the increased availability of servers with considerable RAM size (256 GB+) and the existence of vSMP hypervisors enabling aggregation of memory from different machines, this design principle seems to be an interesting alternative to distributed graph processing. Assuming an adjacency list as the graph representation, an unlabeled graph with 12 billion edges can be processed on an enterprise-grade server with 64 GB RAM. SNAP (C++, Python) [26] is an example of a robust graph library which accommodates in-memory single-server processing. GraphChi [27] uses an out-of-core computing model, providing a cost-effective solution well-suited to machines with SSD disks. The open-source Cassovary [28] is a Java/Scala library created and used by Twitter for computing Who-To-Follow recommendations.
The primary goal of our work is to enable efficient generation of distance matrices and B-matrices for complex networks with millions of edges. This requires computation of All-Pair Shortest Paths (APSP) and typically results in cubic time complexity. Distributed graph processing frameworks following the vertex-state-machine programming model (Giraph, GraphX) are not perfectly suited to this specific task. Among other iterative algorithms on graphs, their APIs provide ready-to-use Single-Source Shortest Path (SSSP) implementations, but this does not help much with the APSP problem. Due to the single per-vertex state, it is difficult to run multiple parallel SSSP jobs in an optimal way and aggregate information about all the shortest paths. The second approach we considered was GPGPU computing and a parallelized version of the R-Kleene APSP algorithm [29]. Unfortunately, the R-Kleene algorithm requires the complete distance matrix to reside in memory. Despite significant speedups compared to the serial version [30], the usability of this method is limited by the memory size of the GPU used for processing, e.g., a Tesla C2070 with 6 GB memory allowed generation of the distance matrix for graphs with 56281 vertices. After considering the capabilities and limitations of different approaches to parallel APSP computing, additionally taking into account implementation complexity, we decided to select Cassovary as a production-proven, JVM-based platform with a good API. We extended Cassovary by creating a module for distance-based graph embedding and visualization. The preliminary results of our work, focusing on single-machine multi-threaded computation, were reported in [31]. Here, we present the distributed version of the software, which utilizes data parallelism and reduces B-matrix generation to multiple BFS runs with immutable graph data replicated across cluster machines.
4. GRAPH DESCRIPTORS FROM k-DISTANCE GRAPHS

Inter-vertex dissimilarity measures, such as the shortest path length or the commute time, provide comprehensive information about graph structure. In [15, 32], we presented how to use shortest paths for constructing an ordered set of distance k-graphs and generating isomorphism invariants. Here, we briefly describe the most important notions, define a generic B-matrix based on any vertex-defined function and introduce clustering coefficient B-matrices.

Definition 1 (Vertex distance k-graph)
For an undirected graph $G = (V(G), E(G))$ we define the vertex distance k-graph $G^V_k$ as a graph with vertex set $V(G^V_k) = V(G)$ and edge set $E(G^V_k)$, such that $\{u, v\} \in E(G^V_k)$ iff $d_G(u, v) = k$. Here $d_G(u, v)$ is a dissimilarity measure between vertices u and v, in particular the length of the shortest path between u and v.

It follows that $G^V_1 = G$ and, for $k > \mathrm{diameter}(G)$, $G^V_k$ is an empty graph. For a given graph G, the invariants of the G-derived vertex distance k-graphs can be aggregated to form a new descriptor of length $\mathrm{diameter}(G)$. Moreover, the constant-bin histograms of a selected vertex descriptor (e.g. degree) for the $G^V_k$ graphs form a robust 2D graph representation called the vertex B-matrix.
Definition 2 (Vertex B-matrix)
For the ordered set of functions $F = \{f_1, \ldots, f_k\}$ such that $f_i : V(G^V_i) \to X$, and an ordered set of disjoint categories $b_1, \ldots, b_n$: $\forall_{i \in \{1,\ldots,n\}}\ b_i \subseteq X \ \wedge\ \bigcup_{i=1}^{n} b_i = X$ (called bins), the vertex B-matrix of the graph G is defined as:

$$B^{V,F}_{k,l}(G) = |\{v : v \in V(G^V_k) \wedge f_k(v) \in b_l\}| \qquad (1)$$
where $1 \le l \le n$ and $1 \le k \le \mathrm{diameter}(G)$.

The k-th row of a vertex B-matrix is a histogram of the values returned by the function $f_k$, which is defined on the vertex set of the particular $G^V_k$ graph. The functions from the set F typically represent the same family of vertex invariants, e.g., vertex degree. The number of bins n is adjusted to the value set X, e.g., for $f_k \equiv \mathrm{degree}(v)$, a fine-grained vertex B-matrix can be generated using $n = |V(G)|$ same-size bins. The next type of B-matrix can be constructed using edge-vertex distances.

Definition 3 (Edge distance)
Let $G = (V(G), E(G))$ be an undirected, unweighted, simple graph. The distance from a vertex $w \in V(G)$ to an edge $e_{uv} = \{u, v\} \in E(G)$, denoted as $d^E_G(w, e_{uv})$, is the mean of the distances $d_G(w, u)$ and $d_G(w, v)$:

$$d^E_G(w, e_{uv}) = \frac{1}{2}\left(d_G(w, u) + d_G(w, v)\right) \qquad (2)$$

For unweighted graphs $d^E_G$ takes integer or half-integer values. For a selected vertex w, integer values of $k = d^E_G(w, e_{uv})$ occur for edges $e_{uv}$ whose endpoints are equidistant from w. This means that $e_{uv}$ belongs to an odd closed walk of length $2k + 1$ starting and ending at w.

Definition 4 (Edge distance k-graph)
We define the edge distance k-graph as a bipartite graph $G^E_k = (U(G^E_k), V(G^E_k), E(G^E_k)) = (V(G), E(G), E(G^E_k))$ such that for each $w \in V(G)$ and $e_{uv} \in E(G)$, $\{w, e_{uv}\} \in E(G^E_k)$ iff $d^E_G(w, e_{uv}) = k$.

The maximal value of k for which $G^E_k$ is non-empty is $2 \cdot \mathrm{diameter}(G)$. The descriptors of a graph G constructed based on edge distance k-graphs bring more discriminating information than the ones obtained for vertex distance k-graphs.

Definition 5 (Edge B-matrix)
For the ordered set of functions $F = \{f_1, \ldots, f_{2k}\}$ such that $f_i : U(G^E_{i/2}) \to X$, and an ordered set of disjoint categories $b_1, \ldots, b_n$: $\forall_{i \in \{1,\ldots,n\}}\ b_i \subseteq X \ \wedge\ \bigcup_{i=1}^{n} b_i = X$ (called bins), the edge B-matrix is defined as:

$$B^{E,F}_{k,l}(G) = |\{v : v \in U(G^E_{k/2}) \wedge f_k(v) \in b_l\}|, \qquad (3)$$

where $1 \le l \le n$ and $1 \le k \le 2 \cdot \mathrm{diameter}(G)$.

Assuming $f_k \equiv \mathrm{degree}(v)$, the number of bins n of the fine-grained edge B-matrix is typically set to $|E(G)|$. In [15], we described the discriminating capabilities of degree B-matrices in machine learning and visual graph comparison. Being isomorphism invariants, they can capture density, regularity, assortativity, disassortativity, small-worldliness, branching factor, transitivity, bipartiteness, closed walk counts and many other structural properties. Table I summarizes selected correspondences between degree B-matrix characteristics and different topological features of the underlying graph.
The space complexity of B-matrices is $O(n \cdot \mathrm{diameter}(G))$. As the diameter of a complex network is typically proportional to $\log(|V(G)|)$ or even $\log(\log(|V(G)|))$, the memory consumed by the B-matrices is much smaller than that used by raw distance matrices, which require $O(|V(G)|^2)$ space. E.g., for the LiveJournal graph from the SNAP database [33], which has 4847571 vertices and 68993773 edges, more than 10 TB is required to store the raw distance matrix (assuming an unsigned byte type). At the same time, the degree vertex B-matrix needs 84 MB (unsigned int) of storage space, which can be further optimized because of multiple empty columns. This makes B-matrices a feasible tool for embedding and indexing big real-world graphs. Pre-selected rectangular fragments of B-matrices form long pattern vectors useful in machine learning tasks. For a lower-dimensional representation, aggregated statistics of B-matrix rows (e.g. relative standard deviation) are applied to obtain feature vectors of a size proportional to the graph diameter.
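As a small illustrative example (ours, not taken from the paper's experiments), consider the cycle graph $C_4$ with vertices $\{1, 2, 3, 4\}$. Here $G^V_1 = C_4$ itself, so every vertex has degree 2, while $G^V_2$ consists only of the two diagonal edges $\{1,3\}$ and $\{2,4\}$, so every vertex has degree 1; all $G^V_k$ for $k > 2$ are empty. Assuming $f_k \equiv \mathrm{degree}(v)$ and unit bins $b_l = \{l\}$, Equation (1) gives the degree vertex B-matrix

$$B^{V,F}_{1,2}(C_4) = 4, \qquad B^{V,F}_{2,1}(C_4) = 4, \qquad B^{V,F}_{k,l}(C_4) = 0 \ \text{otherwise}.$$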
Table I. The relations between degree B-matrix elements and structural properties of the underlying graph (Y stands for the V or E symbol). For $B^V$, $n = |V(G)|$, while for $B^E$, $n = |E(G)|$; F is the set of vertex degree functions for subsequent distance k-graphs; $\mu_E(k)$ denotes the average value of the k-th row of the edge B-matrix.
Degree B-matrix feature | Graph feature
$|\hat{V}(G^V_k)| = \sum_l B^{V,F}_{k,l}$ | Number of non-isolated vertices in $G^V_k$
$|\hat{U}(G^E_k)| = \sum_l B^{E,F}_{2k,l}$ | Number of non-isolated vertices belonging to the set $U(G^E_k)$ of $G^E_k$
$|E(G^Y_k)| = \frac{1}{2\alpha}\sum_l l\, B^{Y,F}_{\alpha k,l}$, where $\alpha = 1$ for $B^{V,F}$ and $\alpha = 2$ for $B^{E,F}$ | Number of edges in $G^Y_k$
$n_\triangle(G) = \frac{1}{3}|E(G^E_1)| = \frac{1}{3}\sum_l l\, B^{E,F}_{2,l}$ | Number of triangle graphlets in G
$c_3 = -\frac{1}{3}\mu_E(2)\sum_l B^{E,F}_{2,l}$ | Relation with the third Ihara zeta coefficient
$wi(G) = \frac{1}{2}\sum_k \sum_l k\, l\, B^{V,F}_{k,l}$ | Relation with the Wiener index
$ef(G) = \frac{1}{|V(G)|(|V(G)|-1)}\sum_k \sum_l \frac{l}{k}\, B^{V,F}_{k,l}$ | Relation with graph efficiency
All even rows of $B^{E,F}$ are empty | Graph G is bipartite (does not contain odd cycles)
Second row of $B^{E,F}$ is empty | Graph G does not have any triangles
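The last two rows of formulas in Table I can be turned into code directly. The following minimal Scala sketch (our illustration with hypothetical names, not part of the released tool) assumes a fine-grained degree vertex B-matrix stored as a dense zero-indexed array with unit bins, and computes the Wiener index and graph efficiency from B-matrix entries alone:

```scala
// Illustrative helpers. bMatrix(k - 1)(l - 1) holds B^{V,F}_{k,l} of a
// fine-grained degree vertex B-matrix (bin width 1).
object BMatrixInvariants {

  // wi(G) = 1/2 * sum_k sum_l k * l * B_{k,l}
  def wienerIndex(bMatrix: Array[Array[Long]]): Double = {
    var sum = 0.0
    for (k <- bMatrix.indices; l <- bMatrix(k).indices)
      sum += (k + 1).toDouble * (l + 1) * bMatrix(k)(l)
    0.5 * sum // every unordered pair of vertices is counted twice
  }

  // ef(G) = 1/(|V|(|V|-1)) * sum_k sum_l (l / k) * B_{k,l}
  def efficiency(bMatrix: Array[Array[Long]], numVertices: Long): Double = {
    var sum = 0.0
    for (k <- bMatrix.indices; l <- bMatrix(k).indices)
      sum += (l + 1).toDouble / (k + 1) * bMatrix(k)(l)
    sum / (numVertices.toDouble * (numVertices - 1))
  }
}
```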
The function f used in the B-matrix definitions can represent any scalar vertex invariant, e.g., the Page Rank probability or eigenvector centrality. Nevertheless, for practical reasons we limit our study to local vertex descriptors, which can be computed in time $O(|V(G)|)$ or $O(|E(G)|)$ for every distance k-graph. Generating distance k-graphs is a task of high computational cost, $O(|V(G)|^3)$, and we should avoid increasing it further by employing overly expensive vertex invariants. Apart from the vertex degree, one of the most widely used local vertex descriptors is the clustering coefficient [6], therefore we decided to test B-matrices constructed from it.

Definition 6 (Clustering coefficient)
The clustering coefficient of a vertex v is defined as:

$$cc(v) = \frac{2\,|\{u : e_{uv} \in E(G)\}|}{k_v (k_v - 1)}, \qquad (4)$$

where $e_{uv} = \{u, v\}$ and $k_v = \mathrm{degree}(v)$; it is the ratio of the number of connections between the neighbors of the vertex v, denoted by $|\{u : e_{uv} \in E(G)\}|$, to the number of links that could possibly exist between them, i.e., $k_v(k_v - 1)/2$.

The clustering coefficient describes the local topology of a graph, reflecting how close the neighborhood of a given vertex is to forming a complete graph. Dense neighborhood connectivity, resulting in higher than average values of this invariant, is a key feature of small-world networks [4]. The clustering coefficient of a graph G with $|V(G)|$ vertices is the average of the clustering coefficients of all vertices:

$$cc(G) = \frac{1}{|V(G)|} \sum_{v \in V(G)} cc(v). \qquad (5)$$
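As a quick illustrative check (our example): if a vertex v has $k_v = 3$ neighbors and exactly one edge connects two of them, Equation (4) gives $cc(v) = 2 \cdot 1 / (3 \cdot 2) = 1/3$; if all three neighbors are mutually connected, $cc(v) = 1$.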
As $cc : V(G) \to [0, 1]$, the number of bins n can be fixed to some user-defined value. This makes clustering coefficient B-matrices of different-size graphs easier to compare (the same number of columns).
In Figure 5 and Figure 7, we present sample B-matrices for graphs from the SNAP database [33]. The clustering coefficient vertex B-matrix reflects the small-worldliness of subsequent vertex distance k-graphs, that is, the probability of forming cliques via same-length distant links.
The B-matrix of a graph G can also be represented as a sum of binary B-matrices generated for each vertex $v \in V(G)$ (Y stands for the V or E symbol):

$$B^{Y,F}_{k,l}(v) = \begin{cases} 1 & f_k(v) \in b_l \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

$$B^{Y,F}(G) = \sum_{v \in V(G)} B^{Y,F}(v) \qquad (7)$$

$B^{Y,F}_{k,l}(v)$ can be treated as a binary feature vector representing the vertex v. Based on Equation 7, one can generate partial B-matrices constructed using subsets of vertices selected according to some external criteria. If, for a given k and $v \in V(G)$, the information about the bin membership of v is preserved, the enumeration of vertices accounting for a given fragment of a B-matrix reveals the correspondence between the structure of the B-matrix and the structure of the underlying graph.
4.1. B-matrix generation

In this section, we briefly discuss the time and space complexity of the B-matrix computation. Let $G = (V(G), E(G))$ be a graph for which a B-matrix of a given type should be generated. Enumeration of all distance k-graphs requires solving the APSP problem, which has $O(|V(G)|^3)$ time complexity (Floyd-Warshall, R-Kleene, repeated BFS) or $O(|V(G)|^2 \log|V(G)| + |V(G)||E(G)|)$ for Johnson's algorithm, which is more efficient for sparse graphs. Storing full distance matrices is impractical due to the considerable data volumes. Even using a binary format and adjusting the code length based on the estimated graph diameter does not prevent consumption of terabytes for 1M+ graphs. Nevertheless, for real-world networks the diameters scale logarithmically with the number of vertices, which means that distance matrices have a small number of unique values and can be compressed efficiently using, e.g., Huffman encoding.
Generating degree B-matrices does not require access to the full distance matrix or explicit computation of distance k-graphs. Rather, it can be performed gradually, based on partial information about single-source shortest paths (SSSP). Frequency information about the vertex degrees of the distance k-graphs can be collected online from the breadth-first trees traversed in parallel for different source vertices. This enables utilizing data parallelism and distributing separate disjoint groups of SSSP problems across computational nodes. Provided that each node has full information about the graph structure (e.g. in the form of compressed adjacency lists), no messages have to be exchanged between members of a cluster. The same paradigm can be applied at the thread level (single node). Threads have shared read-only access to a graph stored in RAM and update individual partial degree B-matrices during the computation phase. Partial B-matrices are later merged at the level of a single node and at the level of the cluster (a code sketch of this scheme is given at the end of this section). This approach assumes replication of adjacency lists across computational nodes but limits operational memory usage to $O(n \cdot \log|V(G)|)$, where n denotes the number of categories.
In the case of a clustering coefficient B-matrix, the computing flow is more complex. The BFS tree started from a vertex v and constructed by a single thread delivers information about the k-neighborhood of the root vertex v (therefore the degree of v in $G^Y_k$ can be obtained easily), but the information about possible links between neighbors is not present and has to be acquired from different threads. Moreover, the presence of densely connected vertices (hubs) requires scanning long neighborhood lists for any vertex adjacent to a hub. Therefore, we decided to use two approaches for clustering coefficient B-matrix computation. In the first one, the B-matrix is generated in two phases: constructing distance k-graphs and storing them on a disk (distributed phase) and computing histograms of the clustering coefficient invariants for each persisted distance k-graph (distributed phase). This method provides scalability but requires considerable storage space, as for some values of k the distance k-graphs can be dense. In the second approach, we used single-node in-memory
computation and data structures shared between threads for fast k-neighborhood scanning. This method provides better time efficiency but consumes greater amounts of RAM, and therefore can be applied only to smaller graphs (up to 50000 vertices).

Figure 1. The conceptual diagram of distributed degree B-matrix computation.
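The following minimal Scala sketch (our illustration, not the actual plugin code; names are hypothetical) shows the idea behind the distributed degree vertex B-matrix generation described above: a single BFS from a source s yields the sizes of its distance layers, and the layer size at distance k is exactly the degree of s in $G^V_k$, so each BFS contributes one count to row k of a partial B-matrix; partial matrices from different workers are then merged by addition.

```scala
import scala.collection.mutable

object PartialBMatrix {
  // Partial B-matrix as a sparse map: (row k, bin index) -> count.
  type BMatrix = mutable.Map[(Int, Int), Long]

  // One BFS from `source` over an adjacency-array graph; returns the size of
  // each distance layer, i.e. the degree of `source` in every G^V_k.
  def bfsLayerSizes(graph: Array[Array[Int]], source: Int): Map[Int, Int] = {
    val dist = Array.fill(graph.length)(-1)
    dist(source) = 0
    val queue = mutable.Queue(source)
    val layerSizes = mutable.Map.empty[Int, Int]
    while (queue.nonEmpty) {
      val u = queue.dequeue()
      for (w <- graph(u) if dist(w) < 0) {
        dist(w) = dist(u) + 1
        layerSizes(dist(w)) = layerSizes.getOrElse(dist(w), 0) + 1
        queue.enqueue(w)
      }
    }
    layerSizes.toMap
  }

  // Accumulate a partial degree vertex B-matrix for the given source vertices.
  def accumulate(graph: Array[Array[Int]], sources: Seq[Int], binWidth: Int): BMatrix = {
    val partial: BMatrix = mutable.Map.empty
    for (s <- sources; (k, degInGk) <- bfsLayerSizes(graph, s)) {
      val key = (k, degInGk / binWidth)
      partial(key) = partial.getOrElse(key, 0L) + 1L
    }
    partial
  }

  // Partial B-matrices computed by different threads or nodes are summed.
  def merge(a: BMatrix, b: BMatrix): BMatrix = {
    for ((key, count) <- b) a(key) = a.getOrElse(key, 0L) + count
    a
  }
}
```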
5. SOFTWARE

Motivated by the need for visualization and in-depth quantitative analysis of big real-world graphs, we have developed an extension to Cassovary [28] which enables distributed computing of all-pair shortest paths, B-matrices and the generation of distance-based descriptors. The prepared module can be used for batch processing of graphs in multi-node environments and for producing data for the Graph Investigator [34] application, which performs clusterization or classification on sets of graphs.
The core of the B-matrix computation lies in calculating shortest paths for all pairs of vertices. By their nature, B-matrices are additive: a full vertex/edge B-matrix can be constructed as a sum of partial B-matrices calculated for subsets of nodes. At the same time, this task does not seem to fit well into the distributed computation model, as for each partial B-matrix the information about distances to all vertices has to be accumulated. These two observations guided our design choices regarding the computational model and the selection of Cassovary, which provides good memory optimization by using efficient FastUtil and native Java/Scala collections.

5.1. Overview of Cassovary

Graph data may be provided to Cassovary either as an adjacency list or a list of edges. The latter format is dominant in the SNAP graph repository, but it requires more computational effort to be transformed into Cassovary's internal graph representation, which resembles a compressed adjacency list. Cassovary provides a handful of graph data types, suitable for a range of purposes and possessing different features such as mutability/immutability or bipartiteness. For our purpose, we decided to use the array-based directed graph as an immutable and memory-efficient representation. The key memory optimization of this type of structure lies in using Java arrays and instantiating only vertices directly required for a particular graph (existing in the input graph file and having a non-empty neighborhood). This comes at the cost of calculating the array sizes required for each vertex to store its neighbors.
Additionally, the neighbors of each node can be sorted to enable faster neighbor checking using binary search. Graphs created this way are immutable and thread-safe, which makes them a perfect fit for our use case. Custom BFS traversing capabilities were added on top of the immutable graph by the mutable BreadthFirstTraverser class, holding the state of the traversal and the distances from the source node to every other visited node in a FastUtil Int2IntOpenHashMap. These structures are not shared between threads, which eliminates synchronization and allows utilizing data parallelism.

5.2. B-matrix computation flow

As a preliminary step of the computation, the whole graph is loaded into memory. The master application specifies which partition should be calculated by a given process. Each process is provided with a vertex set defining its partition and is supposed to calculate the partial B-matrix for the designated range of nodes. The current implementation provides random partitioning and partitioning based on splitting ordered vertex identifiers. After loading a graph, BFS worker threads are created and assigned to a fixed thread pool in the execution context, which holds thread-specific data structures (such as partial B-matrices). Each BFS worker is fed with a small batch of nodes (default value 25, set up experimentally), until the whole partition is calculated. This approach ensures that no worker receives a much more expensive part of the computation, which would lead to considerably varying finishing times. Each thread runs BFS and aggregates results into its own partial B-matrices, which are not shared between the workers, therefore synchronization is not required (a sketch of this scheme is given below). Apart from the BFS itself, the edge B-matrix aggregation is a computationally expensive part of the process, since it requires an iteration over all edges in the graph. After processing all vertices, the partial B-matrices are merged (by addition) and the output file is written. We use FastUtil Int2IntObjectHashMap structures to represent B-matrices because they are typically sparse.
In the case of clustering coefficient B-matrices, a different approach was implemented. We divided the computation into two phases, each one being executed using multiple threads. The first phase generates all distance k-graphs, while the second one calculates clustering coefficients and generates a partial clustering coefficient B-matrix for subsequent distance k-graphs. In this case one thread is responsible for calculating a partial clustering coefficient B-matrix for a given distance k-graph. This initial step allowed us to investigate performance constraints related to the computation of clustering coefficients. As an additional optimization, the neighbor arrays were sorted to allow fast adjacency checking using binary search. The downside of this approach is the necessity of keeping all the distance k-graphs in memory (or on a hard disk) and the uneven partitioning of the workload between threads. The biggest challenge in this case lies in the fact that for many real-world low-diameter graphs and k = 2, 3, 4, the distance k-graphs are dense.
Apart from the B-matrices, the application can be configured to generate compressed distance matrices and distance histograms, and to compute descriptors such as efficiency or vertex eccentricity (12 distance-based descriptors are available). It also provides an implementation of Betweenness Centrality for unweighted graphs based on the methods described in [35]. This algorithm can be used for community detection on large graphs.
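A minimal sketch of the batching scheme just described (our illustration with hypothetical names, reusing the PartialBMatrix helper sketched in Section 4.1; the real plugin operates on Cassovary graphs rather than plain arrays): the partition assigned to a process is split into small batches, each batch is processed on a fixed thread pool into its own partial B-matrix, and the partials are summed at the end.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object PartitionRunner {
  // Compute the partial degree vertex B-matrix for the vertex partition
  // assigned to this process, using `threads` workers and small batches.
  def run(graph: Array[Array[Int]],
          partition: Seq[Int],
          binWidth: Int,
          threads: Int,
          batchSize: Int = 25): PartialBMatrix.BMatrix = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    val futures = partition.grouped(batchSize).toSeq.map { batch =>
      Future(PartialBMatrix.accumulate(graph, batch, binWidth))
    }
    val partials = Await.result(Future.sequence(futures), Duration.Inf)
    pool.shutdown()
    // Partial B-matrices from all batches (and, one level up, from all
    // cluster nodes) are merged by simple addition.
    partials.reduce(PartialBMatrix.merge)
  }
}
```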
The program accepts graph input either as an adjacency matrix or as a gzipped list of edges and assumes that the graph is immutable after loading into memory. The whole project relies on Scala 2.11.5 and Cassovary 5.0.0 and is built using SBT, which may result in a single-jar deployment or an SBT package published to any environment with a JVM version 1.6 or newer installed. The software can be accessed via the Web page http://home.agh.edu.pl/czech/cassovary-plugin/.
6. PERFORMANCE

The performance of the distance-based descriptor generation was tested on a set of complex networks from the Stanford Large Network Dataset Collection [33]. We have selected 16 graphs of different sizes representing several types of structured data, including social networks, networks with
ground-truth communities, communication networks, citation networks, web graphs, co-purchasing networks, road networks, peer-to-peer networks and location-based networks. All graphs were treated as undirected. We used the resources of the Academic Computer Centre Cyfronet AGH (www.cyfronet.krakow.pl/en, Zeus cluster) to perform the computations required for B-matrix generation.
Figure 2 presents the dependency of the degree B-matrices (vertex and edge together) computation time on the number of cluster nodes for three selected graphs: epinions (75879 vertices, 508837 edges, social network from epinions.com), slashdot0902 (82168 vertices, 948464 edges, social network from slashdot.org) and webnotredame (325729 vertices, 1497134 edges, Web graph of Notre Dame University). The reported computation time does not include the time of loading the edge list into memory. The cluster nodes used in the experiment had the following specification: 2x Intel Xeon X5650 2.67 GHz CPU, Hyper-Threading (24 threads), 24 GB RAM. We used JVM version 1.8.0_40 (64-bit) and 20 GB maximum heap space for a Java process. For each particular configuration (number of nodes, partitioning), the test run was repeated 5 times. BFS root vertices were distributed between nodes based on random partitioning. The process running on a single node used 24 threads.
In Figure 2a, we report the average value of the maximal computation time over all partitions (for p nodes, each one performs |V(G)|/p BFS traversals for disjoint sets of root vertices). For a graph with close to 1M edges (slashdot0902), the two types of degree B-matrix were generated in 5 minutes using 12 nodes (288 threads). Figure 2b presents the same type of dependency for the webnotredame graph, which has a 4 times greater number of vertices than slashdot0902. The time of computation on a single node varies to some extent depending on the centrality of the root vertex. Therefore, in Figure 2b, error bars reflecting the maximal and minimal per-node computation times are also shown. B-matrix generation for the Notre Dame Web graph requires about 25 minutes using 20 nodes. In Figure 2c and Figure 2d we present the speedup computed relative to a single-node run with 24 threads. As expected, the dependency is linear, because the inter-node communication overhead is minimal, reduced to the final B-matrix merging at the end of the computation.
The key factor which influences the usability of the proposed distributed computing scheme is the time of loading and compressing edge lists in RAM (determined by the Cassovary engine). The whole process is repeated on each node, which provides fault tolerance but also limits horizontal scalability, as the whole graph has to fit into memory. Moreover, the loading time should not dominate the single-node computation time. In Figure 3 the loading time vs. graph size (measured as |V(G)| + |E(G)|) for 13 sample SNAP graphs is reported. Loading times were tested in the same environment as the previous experiment with B-matrix computation. For the epinions graph, the loading time was 1.5% of the computing time on a single node and 16.2% of the computing time on 12 nodes. In the case of the bigger slashdot0902 graph, this was 2.4% and 27.7%, respectively. Nevertheless, for the larger webnotredame graph we obtained 0.6% for a single node and 6.6% for 12 nodes, which means that, depending on graph density, the loading overhead can increase more slowly than the partial-APSP computation time on a single cluster node.
After extrapolating the dependency presented in Figure 3 and cross-matching it with the timing calculated for a particular graph size, one can determine the maximum number of nodes to which the B-matrix computation problem can be scaled up.
Additionally, we compared the performance of single-thread APSP computation (BFS-based) implemented using our software, the SNAP library and the GraphChi framework. Our aim was to estimate the JVM overhead introduced by reference types and compare it with native C++ executions based on arrays and primitive types. The measured dependency of the computation time on the size of the graph (|V(G)| + |E(G)|) is presented in Figure 4. SNAP is on average 20%-35% faster than our implementation and therefore can be considered a more promising core library for distributed B-matrix computation, especially for larger graphs. The performance of GraphChi, relying on frequent I/O operations, is an order of magnitude worse than that achieved by our framework.
Figure 2. Results of degree B-matrices computation performance tests for three graphs from SNAP database: Epinions (75879 vertices, 508837 edges), Slashdot0902 (82168 vertices, 948464 edges) and Web Notre Dame (325729 vertices, 1497134 edges). Panels a and b show the computation time [minutes] vs. the number of nodes; panels c and d show the corresponding speedup vs. the number of nodes.
Figure 3. Time of loading and indexing the list of edges by Cassovary vs. graph size (number of vertices + number of edges) for 13 SNAP graphs, including epinions1, slashdot0902, twitter, gowalla, dblp, notredame, stanford, amazon0601, roadNet-PA, roadNet-TX, roadNet-CA, youtube and google.
Figure 4. Comparison of single-thread BFS-based APSP computation times (computation time [minutes] vs. |V| + |E|) using 3 single-node graph processing frameworks: Cassovary (our optimized implementation of BFS, Java), SNAP (C++), GraphChi (C++, SSD disk).
7. GRAPH ANALYSIS

In this section we compare multi-level structural properties of networks from the Stanford Dataset and provide their visualization in the form of B-matrices. The computed distance matrices and B-matrices enable in-depth analysis of the graphs and provide supplementary information to the scalar descriptors generated using the SNAP library and published on the SNAP website [33].

7.1. Visualization

The degree edge B-matrices generated for selected SNAP networks are presented in Figure 5. The number of bins was set to 1500. For a particular network the bins have the same width, but the bin width differs among graphs, being adjusted to their density. To provide better visibility, the B-matrices were truncated to cover only non-zero areas (note that the relative volume and proportions of the filled area also bring important information about graph structure). A logarithmic color scale was used to reduce dynamic range effects. The patterns visible on the coarse-grained B-matrices encode underlying structural properties and enable visual comparison of the networks. The road network of California [36] (see Figure 5a) is a sparse graph with a big number of odd cycles of different sizes, reflected by the odd rows of the edge B-matrix. The structure of the road network B-matrices is much different from the ones exhibited by the rest of the analyzed networks. The degree edge B-matrix of the Web graph representing links among pages of Notre Dame University (Figure 5b) follows the community structure of this network, represented by disjoint areas and increased-density islands (see also [37]). The clustered structure of this network can be confirmed by graph drawing using the nr-MDS method [38] (see Figure 6). The vertical lines present in Figure 5c reflect separated, path-like chains of links present in the Berkeley and Stanford Web graph (see [39]). The Google Web graph shown in Figure 5d is a dense, centralized graph with high local clustering represented by a number of odd closed walks such as triangles or pentagons (see also [40]). Its structure is similar to the one exhibited by Web Notre Dame (increased-density islands).
Figure 5. Degree edge B-matrices generated for selected networks from SNAP database: a. road network of California (1965206 vertices, 2766607 edges), b. Web graph from University of Notre Dame (325729 vertices, 1090108 edges), c. Web graph from Berkeley and Stanford Universities (685230 vertices, 6649470 edges), d. Web graph from Google programming contest (875713 vertices, 4322051 edges).
Figure 6. Visualization of Notre Dame Web graph (325729 vertices, 1497134 edges) using nr-MDS [38].
In Figure 7 we present two clustering coefficient vertex B-matrices computed based on 500 categories of the same size. The number of rows was adjusted to the diameter of the bigger graph (the Arxiv Astro Physics collaboration network). Here, the two images can be compared directly, as the number of B-matrix columns does not depend on graph size.
Figure 7. Clustering coefficient vertex B-matrices generated for selected networks from SNAP database: a. Arxiv Astro Physics collaboration network (18772 vertices, 198110 edges), b. Oregon autonomous systems graph - snapshot from Apr 14 2001 (11019 vertices, 31761 edges).
The skewness of the clustering coefficient distributions for consecutive distance k-graphs changes with growing k. For k close to the graph diameter, the majority of vertices have zero clustering coefficient (they become isolated or have one neighbor). The portrait of a network formed by the clustering coefficient B-matrix describes whether the local clustering property is preserved when communicating via distant links.

7.2. Graph invariants

As described in Section 4, B-matrices can be treated as space-efficient fingerprints of networks and form a substrate for further computation of low-dimensional graph invariants. In Table I we presented how selected well-known scalar graph descriptors can be computed based on data contained in B-matrices. Those equations were used to obtain three scalar graph descriptors for 14 SNAP networks with more than 1M edges. In Table II the values of the triangle ratio, Wiener index and efficiency are reported. The triangle ratio is the number of triangles in a graph divided by the maximal number of triangles $\binom{|V(G)|}{3}$. It reflects the small-worldliness of a network, similarly to the clustering coefficient. The highest values of the triangle ratio are observed for the Twitter social circles network and the Web graphs (Notre Dame, Stanford, Berkeley-Stanford). The Wiener index is the sum of the lengths of the shortest paths between every pair of vertices in a given graph. It measures the branching factor, achieving its maximal value for a path graph and its minimal value for a complete graph. For sparse road networks the values of the Wiener index reach an order of $10^{14}$. Efficiency is a normalized harmonic mean of geodesic lengths over all pairs of nodes. It reflects the transport capacity of a network and achieves the maximal value 1 for complete graphs. For the networks reported in Table II, efficiency is a good discriminating factor, which has similar values for related generating mechanisms. The Web graphs (Notre Dame, Stanford, Berkeley-Stanford, Google) have efficiency close to that obtained for the Amazon co-purchasing networks.

7.3. Embedding in 2D

In this section we examine the discriminating capabilities of B-matrices in an unsupervised learning task. To this end, we constructed feature vectors representing graphs from different groups by coarse-graining and row-packing the B-matrices. Next, dimensionality reduction techniques such as Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding [41] (t-SNE) were used to obtain a 2D embedding of the graphs for visual comparison of cluster proximity.
Table II. The values of selected graph descriptors computed based on B-matrices.
Graph | Vertices | Edges | Triangle ratio | Wiener index | Efficiency
DBLP | 317080 | 1049866 | 4.19 · 10^-10 | 3.41 · 10^11 | 0.07689
Notre Dame | 325729 | 1497134 | 1.55 · 10^-9 | 3.29 · 10^11 | 0.07674
Stanford | 281903 | 2312497 | 3.03 · 10^-9 | 1.73 · 10^11 | 0.07463
Berk Stan | 685230 | 7600595 | 1.21 · 10^-9 | 8.97 · 10^11 | 0.07827
Twitter | 81306 | 1768149 | 1.46 · 10^-7 | 1.28 · 10^10 | 0.13437
Google | 875713 | 5105039 | 1.20 · 10^-10 | 2.31 · 10^12 | 0.08067
Youtube | 1134890 | 2987624 | 1.25 · 10^-11 | 3.32 · 10^12 | 0.10015
Amazon0302 | 262111 | 1234877 | 2.39 · 10^-10 | 3.03 · 10^11 | 0.05944
Amazon0312 | 400727 | 3200440 | 3.44 · 10^-10 | 5.18 · 10^11 | 0.08063
Amazon0601 | 403394 | 3387388 | 3.64 · 10^-10 | 5.23 · 10^11 | 0.08094
Amazon0505 | 410236 | 3356824 | 3.43 · 10^-10 | 5.43 · 10^11 | 0.08068
RoadnetPA | 1088092 | 1541898 | 3.12 · 10^-13 | 1.81 · 10^14 | 0.00237
RoadnetTX | 1379917 | 1921660 | 1.89 · 10^-13 | 3.73 · 10^14 | 0.00170
RoadnetCA | 1965206 | 2766607 | 9.54 · 10^-14 | 5.85 · 10^14 | 0.00216
Let Y stand for the V or E symbol, so that $B^{Y,F}$ denotes the $B^{V,F}$ or $B^{E,F}$ matrix of a graph G. The set F contains degree or clustering coefficient vertex-domain functions defined for consecutive distance k-graphs. The number of categories n was chosen so that $n < |V(G)|$. The size of a category, $s_b$, is the same for all categories (bins) in a test setup. We perform high-dimensional graph embedding by packing rows or columns of B-matrices to form a long pattern vector:

$$D^{Y,F}_{long}(k_{min}, k_{max}, l_{min}, l_{max}) = [B^{Y,F}_{k,l}], \quad 1 \le k_{min} \le k \le k_{max},\ 1 \le l_{min} \le l \le l_{max} \le n \qquad (8)$$

By fixing the embedding parameters n, $s_b$, $k_{min}$, $k_{max}$, $l_{min}$ and $l_{max}$ we select a well-defined rectangular fragment of a given B-matrix, which encodes a part of the information about graph structure. Adjusting those parameters enables going from local features (low values of k) to global features (values of k close to the graph diameter).
In the first experiment, we selected five groups of graphs from the SNAP database. The first cluster, labeled as, contains 122 CAIDA autonomous systems graphs retrieved between 01.2004 and 11.2007 from BGP table snapshots. The sizes of the as graphs vary from 8020 to 26475 (vertices) and 36406 to 106762 (edges). The next group, marked ca, contains 5 Arxiv collaboration networks: Astro Physics, Condense Matter Physics, General Relativity and Quantum Cosmology, High Energy Physics - Phenomenology and High Energy Physics - Theory. The sizes of the ca networks are as follows: 5242 to 23133 (vertices), 14496 to 198110 (edges). The set p2p consists of 9 peer-to-peer graphs from the Gnutella sharing network. This dataset has the following ranges of vertices/edges: 6301 to 62586 and 20777 to 147892. The last two groups, oregon1 and oregon2 (each one has 9 members), contain autonomous systems peering information retrieved from different data sources: route-views (oregon1); route-views, glass data and routing registry (oregon2). The sizes of the oregon graphs vary from 10670 to 11461 (vertices) and 22002 to 32730 (edges).
Initially, the fine-grained ($s_b = 1$) degree B-matrices were generated for the described dataset. Next, based on pre-defined embedding parameters, we constructed coarse-grained versions of the B-matrices, which were subsequently packed into long feature vectors (a code sketch of this construction is given at the end of this section). The number of categories n and the bin size $s_b$ were fixed for each test run, so that we obtained the same dimensionality of the input feature space. In addition, logarithmic scaling of the non-zero entries of the B-matrices allowed us to make the heavy-tailed degree distributions of distance k-graphs more tractable and decrease the impact of
graph size on the comparison results. In order to extract the most relevant data and reduce noise, we decided to apply Principal Component Analysis (PCA) and reduce the dimensionality of the feature space to the first 30 components of the highest variance. In the final step, based on the 30D features, t-SNE was applied to obtain a 2D embedding of the tested SNAP networks.
Figure 8 presents sample results of this experiment. Specifically, in Figure 8a the 2D embedding of a 1000D input feature space, covering more than 90% of the non-empty degree vertex B-matrix areas, is shown. The 1D curved shape of the dominating as cluster reflects the local proximity between weekly snapshots of the autonomous systems graphs, but the two oregon groups were not separated. Also, the p2p and ca clusters remained divided according to graph size. In the case of different-size graphs, the categories of higher l's are more sensitive to perturbations. In Figure 8b, we present the 2D embedding for the same-size input space constructed from the degree edge B-matrix. Even though the feature vectors contain only half of the information present in the edge B-matrix ($k_{max} = 10$), the cluster separation is much better than for the vertex B-matrix: the oregon groups are no longer merged and the p2p group is compact. Three collaboration networks: Astro Physics, Condense Matter Physics and High Energy Physics - Phenomenology are also located close to each other. Figure 8c presents results obtained for a 400D input space covering the upper-left quarter of a vertex B-matrix (lower values of k, lower values of l). This feature space captures local information about graph structure and ignores the categories sensitive to graph size. Indeed, the results based on the reduced features are better than the ones presented in Figure 8a. The cluster p2p becomes compact, but the oregon groups are still not well-separated. The information needed to distinguish oregon1 graphs from oregon2 graphs is present in the edge B-matrix (see Figure 8d). In some cases, the local information collected for lower values of k seems sufficient for discriminating network types. This makes it possible to further decrease the computational cost of B-matrix generation, as much cheaper, depth-limited BFS traversals can be applied.
In the second experiment, we verified the discriminating capabilities of the clustering coefficient B-matrices generated for the two oregon groups, which were not clearly separated using invariants derived from the degree vertex B-matrices (see Figure 8a). To this end, we computed different coarse-grained variants of the clustering coefficient B-matrix and performed linear dimensionality reduction to 2D using Principal Component Analysis (PCA). For $k_{max} > 3$, very good separation of the two groups was obtained, similarly to the degree edge B-matrix covering all k's. The sample result for a 1000D input space is presented in Figure 9. The labels presented on the chart correspond to the dates at which the snapshots of the given autonomous systems graphs were taken. The snapshots generated in consecutive weeks are close to each other - the local structure of the input feature space was preserved.
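The following minimal Scala sketch (our illustration with hypothetical names, not the published tool) shows the feature-vector construction used above: a rectangular fragment of a fine-grained B-matrix selected by $k_{min}$, $k_{max}$, $l_{min}$, $l_{max}$ is coarse-grained into bins of width $s_b$, log-scaled and packed row by row into a single vector, following Equation (8).

```scala
object LongFeatureVector {
  // bMatrix(k - 1)(l - 1) holds the fine-grained entry B_{k,l} (bin width 1).
  // Returns the coarse-grained, log-scaled, row-packed vector D_long.
  def dLong(bMatrix: Array[Array[Long]],
            kMin: Int, kMax: Int,
            lMin: Int, lMax: Int,
            sb: Int): Array[Double] = {
    val packed = for {
      k <- kMin to kMax
      binStart <- lMin to lMax by sb
    } yield {
      val binEnd = math.min(binStart + sb - 1, lMax)
      // Sum the fine-grained counts falling into this coarse bin; rows or
      // columns outside the stored matrix contribute zero.
      val count = (binStart to binEnd).map { l =>
        if (k - 1 < bMatrix.length && l - 1 < bMatrix(k - 1).length) bMatrix(k - 1)(l - 1)
        else 0L
      }.sum
      math.log1p(count.toDouble) // logarithmic scaling of non-zero entries
    }
    packed.toArray
  }
}
```

The resulting vectors, one per graph, can then be fed to PCA and t-SNE exactly as described above.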
8. CONCLUSIONS AND FUTURE WORK

This paper describes a framework for graph analysis based on information about all-pair shortest paths. The main contribution is the software, which enables generation of distance matrices and B-matrices for real-world graphs with millions of edges. The distributed implementation of BFS allowed us to compute B-matrices and multiple distance-based graph invariants for the networks from the SNAP database. In addition, we proposed a new type of B-matrix (based on the clustering coefficient), which facilitates comparison of different-size graphs, providing a user-defined resolution of sampling the small-world properties of distance k-graphs. Experiments showed that the developed software and the clustering coefficient B-matrix can be used for complex network comparison based on feature vectors aggregating different structural properties of graphs. The B-matrices computed once for the SNAP networks can serve as a permutation-invariant multi-dimensional representation of a graph, helpful in further analysis and visualization (after feature selection). Based on the experiments with 2D graph embedding, we showed that a full B-matrix is not always needed to provide good separation between clusters of graphs. Local features can provide enough discriminating information. Therefore, distributed depth-limited BFS traversal should be regarded as a cheaper alternative for the comparison of large graphs.
Utilizing data parallelism and distributing degree B-matrix computation across multiple cluster nodes was a cost- and time-efficient strategy for obtaining multiple distance-based invariants for complex networks of considerable size.
12
17
15 as ca oregon1 oregon2 p2p
10 8
as ca oregon1 oregon2 p2p
10
6
5
4 2
0
0 -5
-2 -4
-10
-6 -8
-15 -10
-5
0
5
10
15
20
-6
-4
-2
a
0
2
4
6
8
b
15
15 as ca oregon1 oregon2 p2p
10
10
5 5 0 0 -5 as ca oregon1 oregon2 p2p
-5
-10 -15
-10
-5
0
5
10
-10
-15 -15
15
-10
-5
c
0
5
10
15
d
Figure 8. 2D embedding of long feature vectors obtained from degree B-matrices using t-SNE. It presents similarity between different groups of graphs from the SNAP database (as - CAIDA autonomous systems graphs, ca - scientific collaboration networks, p2p - Internet peer-to-peer networks (Gnutella), oregon1 - Oregon route-view network, oregon2 - Oregon route-view network extended): a. degree vertex B-matrix (n = 100, sb = 150, kmin = 1, kmax = 10), b. degree edge B-matrix (n = 100, sb = 150, kmin = 1, kmax = 10), c. degree vertex B-matrix (n = 100, sb = 30, kmin = 1, kmax = 4), d. degree edge B-matrix (n = 100, sb = 30, kmin = 1, kmax = 4).
[Figure 9: scatter plot with horizontal axis "1st Principal Component" and vertical axis "2nd Principal Component"; points are labeled with o1-/o2- prefixes and the snapshot dates of the corresponding oregon1/oregon2 graphs.]
Figure 9. 2D PCA embedding of 1000D feature vectors (n = 100, sb = 5, kmin = 1, kmax = 10) obtained from clustering coefficient vertex B-matrices generated for oregon1 (prefix o1) and oregon2 (prefix o2) graphs.
drawbacks of our approach. The comparison with the SNAP framework showed that the overall time of distributed B-matrix computation could be further decreased by eliminating the JVM overhead (reference types and dereference operations) and migrating to native C++ code execution. Still, this was an expected cost of a more convenient implementation. Storing the whole graph within node memory was a key assumption, which allowed us to avoid expensive communication. On the other hand, the graph loading times, increasing with the number of edges, together with the limited RAM size of a single node, should also be considered as factors worsening scalability.

In our future work, we plan to further optimize the clustering coefficient B-matrix computation and perform an in-depth analysis of its structure (including relations with well-known scalar invariants). We would also like to generate and publish B-matrices for the largest graphs from the SNAP database (more than 10 million vertices). Those graphs are unlabeled and can still fit into the memory of a typical enterprise-grade server. Last but not least, we plan to employ binary per-vertex B-matrices for detecting communities in clustered social networks.

Acknowledgments. This research was supported by the Polish National Center of Science (NCN) DEC-2013/09/B/ST6/01549 and supported in part by the PL-Grid Infrastructure.
REFERENCES
1. Ching A, Edunov S, Kabiljo M, Logothetis D, Muthukrishnan S. One trillion edges: graph processing at Facebook-scale. Proceedings of the VLDB Endowment 2015; 8(12):1804–1815.
2. Foggia P, Percannella G, Vento M. Graph matching and learning in pattern recognition in the last 10 years. International Journal of Pattern Recognition and Artificial Intelligence 2014; 28(01).
3. Czech W. Clustering of real-world data using multiple-graph representation and centrality measures. Computational Intelligence: Methods and Applications, Rutkowski L, Tadeusiewicz R, Zadeh LA, Zurada J (eds.), Proceedings of the 9th Conference on Artificial Intelligence and Soft Computing, 2008.
4. Watts D, Strogatz S. Collective dynamics of 'small-world' networks. Nature 1998; 393(6684):440–442.
5. Barabási A, Albert R. Emergence of scaling in random networks. Science 1999; 286(5439):509.
6. Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang D. Complex networks: structure and dynamics. Physics Reports 2006; 424(4-5):175–308.
7. Qiu H, Hancock E. Clustering and embedding using commute times. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007; 29(11):1873–1890.
8. Emms D, Wilson RC, Hancock ER. Graph matching using the interference of continuous-time quantum walks. Pattern Recognition 2009; 42(5):985–1002.
9. Xiao B, Hancock E, Wilson R. A generative model for graph matching and embedding. Computer Vision and Image Understanding 2009; 113(7):777–789.
10. Lee WJ, Duin RP. A labelled graph based multiple classifier system. Multiple Classifier Systems. Springer, 2009; 201–210.
11. Borzeshi EZ, Piccardi M, Riesen K, Bunke H. Discriminative prototype selection methods for graph embedding. Pattern Recognition 2013; 46(6):1648–1657.
12. Gibert J, Valveny E, Bunke H. Dimensionality reduction for graph of words embedding. Graph-Based Representations in Pattern Recognition. Springer, 2011; 22–31.
13. Aziz F, Wilson RC, Hancock ER. Graph characterization using wave kernel trace. 2014 22nd International Conference on Pattern Recognition (ICPR), IEEE, 2014; 3822–3827.
14. Ye C, Wilson RC, Hancock ER. Graph characterization from entropy component analysis. 2014 22nd International Conference on Pattern Recognition (ICPR), IEEE, 2014; 3845–3850.
15. Czech W. Invariants of distance k-graphs for graph embedding. Pattern Recognition Letters 2012; 33(15):1968–1979.
16. Bu Y, Howe B, Balazinska M, Ernst MD. HaLoop: efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment 2010; 3(1-2):285–296.
17. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae SH, Qiu J, Fox G. Twister: a runtime for iterative MapReduce. Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ACM, 2010; 810–818.
18. Zhang Y, Gao Q, Gao L, Wang C. PrIter: a distributed framework for prioritizing iterative computations. IEEE Transactions on Parallel and Distributed Systems 2013; 24(9):1884–1893.
19. Xin RS, Gonzalez JE, Franklin MJ, Stoica I. GraphX: a resilient distributed graph system on Spark. First International Workshop on Graph Data Management Experiences and Systems, ACM, 2013; 2.
20. Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: a system for large-scale graph processing. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, 2010; 135–146.
21. Avery C. Giraph: large-scale graph processing infrastructure on Hadoop. Proceedings of the Hadoop Summit, Santa Clara, 2011.
22. Salihoglu S, Widom J. GPS: a graph processing system. Proceedings of the 25th International Conference on Scientific and Statistical Database Management, ACM, 2013; 22.
23. Low Y, Gonzalez JE, Kyrola A, Bickson D, Guestrin CE, Hellerstein J. GraphLab: a new framework for parallel machine learning. arXiv preprint arXiv:1408.2041, 2014.
24. Han M, Daudjee K, Ammar K, Özsu MT, Wang X, Jin T. An experimental comparison of Pregel-like graph processing systems. Proceedings of the VLDB Endowment 2014; 7(12):1047–1058.
25. Lu Y, Cheng J, Yan D, Wu H. Large-scale distributed graph computing systems: an experimental evaluation. Proceedings of the VLDB Endowment 2014; 8(3):281–292.
26. Leskovec J, Sosič R. SNAP: a general purpose network analysis and graph mining library in C++. http://snap.stanford.edu/snap, Jun 2014.
27. Kyrola A, Blelloch G, Guestrin C. GraphChi: large-scale graph computation on just a PC. 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), 2012; 31–46.
28. Gupta P, Goel A, Lin J, Sharma A, Wang D, Zadeh R. WTF: the who to follow service at Twitter. Proceedings of the 22nd International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2013; 505–514.
29. D'Alberto P, Nicolau A. R-Kleene: a high-performance divide-and-conquer algorithm for the all-pair shortest path for densely connected networks. Algorithmica 2007; 47(2):203–213.
30. Czech W, Yuen DA. Efficient graph comparison and visualization using GPU. Proceedings of the 14th IEEE International Conference on Computational Science and Engineering (CSE 2011), 2011; 561–566, doi:10.1109/CSE.2011.223.
31. Czech W, Mielczarek W, Dzwinel W. Comparison of large graphs using distance information. PPAM 2015, 2016.
32. Czech W. Graph descriptors from B-matrix representation. Graph-Based Representations in Pattern Recognition, Proceedings of GbRPR 2011, LNCS, vol. 6658, Springer, 2011; 12–21.
33. Leskovec J, Krevl A. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, Jun 2014.
34. Czech W, Goryczka S, Arodz T, Dzwinel W, Dudek A. Exploring complex networks with Graph Investigator research application. Computing and Informatics 2011; 30(2).
35. Brandes U, Pfeffer J, Mergel I. Studying social networks: a guide to empirical research. Campus Verlag, 2012.
36. Road network of California. http://www.cise.ufl.edu/research/sparse/matrices/SNAP/roadNet-CA.html.
37. Web graph of Notre Dame. http://www.cise.ufl.edu/research/sparse/matrices/SNAP/web-NotreDame.html.
38. Dzwinel W, Wcisło R. Very fast interactive visualization of large sets of high-dimensional data. Proceedings of ICCS 2015, Reykjavik, 1-3 June 2015, Iceland, Procedia Computer Science, in print.
39. Web graph of Berkeley and Stanford. http://www.cise.ufl.edu/research/sparse/matrices/SNAP/web-BerkStan.html.
40. Web graph from Google. http://www.cise.ufl.edu/research/sparse/matrices/SNAP/web-Google.html.
41. Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research 2008; 9:2579–2605.