NetworKit: An Interactive Tool Suite for High-Performance Network Analysis

Christian L. Staudt∗   Aleksejs Sazonovs†   Henning Meyerhenke‡

∗ Karlsruhe Institute of Technology (KIT), Germany, [email protected]
† University of St Andrews, United Kingdom, [email protected]
‡ Karlsruhe Institute of Technology (KIT), Germany, [email protected]

Abstract

We introduce NetworKit, an open-source software package for high-performance analysis of large complex networks. Complex networks are equally attractive and challenging targets for data mining, and novel algorithmic solutions, including parallelization, are required to handle data sets containing billions of connections. Our goal for NetworKit is to package results of our algorithm engineering efforts and put them into the hands of domain experts. NetworKit is a hybrid combining the performance of kernels written in C++ with a convenient Python frontend. The package targets shared-memory platforms with OpenMP support. The current feature set includes various analytics kernels such as connected components, diameter, clustering coefficients, community detection, k-core decomposition, degree assortativity and multiple centrality indices, as well as a collection of graph generators. Scaling to massive networks is enabled by techniques such as parallel and sampling-based approximation algorithms. In comparison with related software, we propose NetworKit as a package geared towards large networks and satisfying three important criteria: high performance, interactive workflows and integration into an ecosystem of tested tools for data analysis and scientific computation.

Keywords: complex networks, network analysis, network science, parallel graph algorithms, data analysis software

1 Motivation

Complex networks are heterogeneous relational datasets appearing in very different domains but sharing certain structural characteristics. Social (e.g. a friendship graph), technical (e.g. the internet or electric power grids), biological (e.g. protein interaction networks) or informational (e.g. the world wide web) networks are only a few examples of the large variety of phenomena modeled this way [14, 10]. Accordingly, network analysis methods are quickly becoming pervasive in science, technology and society: Node ranking by centrality is the basis for modern web search [36], community detection methods find applications on social media sites [21] or in cancer research [27], and tracking social influence through networks is equally interesting to sociologists [20] and advertisers. It seems plausible that network analysis is going to yield groundbreaking insights in the future as our theoretical understanding and our computing capabilities increase: Consider for example that the human brain at the neuronal scale is a complex network on the order of 10^10 nodes and 10^14 edges, whose mapping and analysis is still beyond current technology. Already on the more modest scale of 10^6 to 10^10 edges, complex networks challenge current algorithms and available implementations. It is evident that the need to process such networks within a data analysis workflow requires parallel processing whenever possible. With NetworKit, we intend to push the boundaries of what can be done interactively on a shared-memory parallel computer, also by users without in-depth programming skills. The tools we provide make it easy to characterize large networks and are geared towards and informed by network science research. In this work we give an introduction to the toolkit and describe it in terms of algorithm and software engineering aspects. We also discuss important algorithmic features (including considerations for efficient implementations) such as network analytics kernels and generators, give example use cases and evaluate the performance of its kernels experimentally. Our experiments show that NetworKit is capable of quickly processing large-scale networks for a variety of analytics kernels. It does so with a lower memory footprint and in a more reliable manner compared to closely related tools. We recommend NetworKit for the comprehensive structural analysis of large complex networks in the range of 10^5 to 10^10 edges.

2 Methods

2.1 Design Goals. There is a variety of software packages which provide graph algorithms in general and network analysis capabilities in particular (see Section 5 for a comparison to related packages). However, NetworKit aims to balance a specific combination of strengths. Our software is designed to stand out with respect to three areas:

Performance. Algorithms and data structures are selected and implemented with high performance and parallelism in mind. Some implementations are among the fastest in published research. For example, community detection in a 3.3 billion edge web graph can be performed on a 16-core server with hyperthreading in less than five minutes [43].

Interface. Networks are as diverse as the questions we might ask of them – e.g., what is the largest connected component, what are the most central nodes in it, and how do they connect to each other? A practical tool for network analysis should therefore provide modular functions which do not restrict the user to predefined workflows. In this respect we take inspiration from software like R, MATLAB and Mathematica, as well as a variety of Python packages. An interactive shell, which the Python language provides, meets these requirements. While NetworKit works with the standard Python 3 interpreter, combining it with the IPython Notebook [37] allows us to integrate it into a fully fledged computing environment for scientific workflows. It is also easy to set up and control a remote compute server.

Integration. As a Python module, NetworKit enables seamless integration with Python libraries for scientific computing and data analysis, e.g. pandas for data frame processing and analytics, matplotlib for plotting, or numpy and scipy for numerical and scientific computing. For certain tasks, we provide interfaces to specialized external tools, e.g. Gephi for graph visualization.

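To make the modular-interface idea concrete: the exploratory question chain above – find the largest connected component, then rank its most central nodes – can be sketched in a few lines of plain Python over an adjacency list. The code below is an illustrative stand-in, not NetworKit's API.

```python
from collections import deque

def connected_components(adj):
    """Return the connected components of an undirected graph via BFS."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, queue = [s], deque([s])
        seen.add(s)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.append(v)
                    queue.append(v)
        comps.append(comp)
    return comps

# Toy graph: a triangle (0-1-2) plus a separate edge (3-4).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
giant = max(connected_components(adj), key=len)
# Rank the component's nodes by degree, the simplest centrality index.
ranking = sorted(giant, key=lambda u: len(adj[u]), reverse=True)
```

Each step is an ordinary function over ordinary data, so questions can be recombined freely – the workflow style the Python frontend is designed to support.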
2.2 Architecture. In order to achieve the design goals described above, we implement NetworKit as a two-layer hybrid of high-performance code written in C++ with an interface and additional functionality written in Python. NetworKit is distributed as a Python package, ready to be used interactively from a Python shell, which is the main usage scenario we envision. The code can be used as a library for application programming as well, either at the Python or C++ level. Throughout the project we use object-oriented and functional concepts. Shared-memory parallelism is realized with OpenMP, providing loop parallelization and synchronization constructs while abstracting away the details of thread creation and handling. The roughly 35 000 lines of C++ code include core implementations and unit tests. As illustrated in Figure 1, connecting these native implementations to the Python world is enabled by the Cython toolchain [7]. Currently we use Cython to integrate native code by compiling it into a Python extension module. The Python layer comprises about 4000 lines of code.

Figure 1: NetworKit architecture overview

2.3 General Framework Features. As the central data structure, the NetworKit.Graph class implements a directed or undirected, optionally weighted graph using an adjacency array data structure with O(n+m) memory requirement for a graph with n nodes and m edges. A node is represented by a 64-bit integer index, and an edge is identified by a pair of nodes. Optionally, edges can be indexed as well. This approach enables a lean graph data structure, while also allowing arbitrary node and edge attributes to be stored in any container addressable by indices. Our API aims for an intuitive and concise formulation of graph algorithms on both the C++ and Python layer. In general, a combinatorial view on graphs – representing edges as tuples of nodes – is used. However, a recent addition to NetworKit is an algebraic interface that enables the implementation of graph algorithms in terms of the adjacency matrix, while transparently using the same graph data structure. While some algorithms may benefit from different data layouts, this lean, general-purpose representation has proven competitive for writing performant implementations. Large complex networks are difficult to inspect visually. Our package has basic visualization/graph drawing capabilities, but delegates graph visualization to Gephi, a network analysis application specializing in visualization features, via a convenient interface.

3 Analytics

The following describes the core set of network analysis algorithms implemented in NetworKit. In addition, NetworKit also includes a collection of basic graph algorithms, such as breadth-first and depth-first search, Dijkstra's algorithm for shortest paths, or code for computing approximate maximum weight matchings.

3.1 Degree Distribution. The distribution of degrees, i.e. the number of connections per node, plays an important role in characterizing a network: Empirically observed complex networks tend to show a heavy-tailed degree distribution which follows a power law with a characteristic exponent: p(k) ∼ k^(−γ). The powerlaw Python module, developed by Alstott et al. [3], employs statistical methods from [13, 28] to determine if a probability distribution fits a power law. The module is distributed with NetworKit.

3.2 Degree Assortativity. Generally, a network shows assortative mixing with respect to a certain property P if nodes with similar values for P tend to be connected to each other. Degree assortativity measures this correlation with respect to node degree, which can point to important aspects such as a hierarchical network composition. Its strength is often expressed as the degree assortativity coefficient r, which lies in the range −1 ≤ r ≤ 1. We implement Newman's formulation [33] with O(m) time and constant memory requirements.

3.3 Diameter. The diameter of a graph is the maximum length of a shortest path between any two nodes. A surprising observation about the diameter of complex networks is often referred to as the small-world phenomenon [32]: The diameter tends to be very small and is often constant or even shrinks with network growth. We use the iFUB algorithm [15] both for the exact computation as well as an estimation of a lower and upper bound on the diameter. iFUB has a worst-case complexity of O(nm), but has shown excellent scalability on complex networks, where it often converges on the exact value in short time.

3.4 Clustering Coefficients. Clustering coefficients are key figures for the amount of transitivity in networks, i.e. the tendency of edges to form between indirect neighbor nodes [32]. The global clustering coefficient is the ratio of the number of triangles in a network versus the number of triads (paths of length 2). This ratio is typically high in social networks, whose generative processes tend to close triangles. In contrast, the clustering coefficient is close to 0 for random graphs. The average local clustering coefficient computes the fraction above locally for each node and then averages over all nodes. A straightforward calculation of the clustering coefficient with a node iterator requires O(n · d_max^2) time, where d_max is the maximum degree. As this might be prohibitive for large graphs, NetworKit also implements a sampling approximation algorithm [41], whose constant time complexity is independent of graph size.

3.5 Components and Cores. Components and cores are related concepts for subdividing a network: All nodes in a connected component are reachable from each other. A typical pattern in real-world complex networks is the emergence of a giant connected component, accompanied by a large number of very small components. We compute connected components in linear time using breadth-first search. Core decomposition allows a more fine-grained subdivision of the node set according to connectedness. k-cores result from successively peeling away nodes of degree less than k. Core decomposition also categorizes nodes according to the highest-order core in which they are contained, assigning a core number to each node. The kernel runs in O(m) time, matching other implementations [6].

3.6 Centrality. Centrality refers to the relative importance of a node within a network. We distribute efficient implementations for betweenness, eigenvector centrality and PageRank. Betweenness centrality expresses the concept that a node is important if it lies on many shortest paths between nodes in the network. We implement the well-known algorithm by Brandes [11], which requires O(nm + n^2 log n) time and O(n + m) space for weighted graphs. For unweighted graphs, the running time is reduced to O(nm). Since this is still practically infeasible for the large data sets we target, NetworKit also includes two parallelized implementations of recent approximation algorithms: One [40] gives a probabilistic guarantee that the error is at most an additive constant, while the other [23] uses an approach without such guarantees. Our current research extends the former approach to dynamic graph processing [8]. Eigenvector centrality and its variant PageRank [36] assign relative importance to nodes according to their connections, incorporating the idea that edges to high-scoring nodes contribute more. Both variants are implemented in NetworKit based on parallel power iteration.

3.7 Community Detection. Community detection is the task of identifying groups of nodes in the network which are significantly more densely connected among each other than to the rest of the nodes. It is a data mining problem where various definitions of the structure to be discovered – the community – exist. This fuzzy task can be turned into a well-defined though NP-hard optimization problem by using community quality measures, first and foremost modularity [24]. We approach community detection from the perspective of modularity maximization and engineer parallel heuristics which deliver a good tradeoff between modularity and running time [44, 43]. The PLP algorithm implements community detection by label propagation [39], which extracts communities from a labelling of the node set. A node adopts the most frequent label in its neighborhood until densely connected groups agree on a common label. Each iteration takes O(m) time, and the algorithm has been empirically shown to reach a stable solution in only a few iterations. The Louvain method for community detection [9] can be classified as a locally greedy, bottom-up multilevel algorithm. It combines local node moves with multilevel coarsening to optimize modularity. In our version PLM, node moves and coarsening are performed in parallel, and optional refinement phases (PLMR) can be added. In extensive experiments, we compared our community detection methods to other current efforts in the field [43]. We recommend the PLM and PLMR algorithms as the default choice for modularity-driven community detection in large networks. For very large networks in the range of billions of edges, PLP delivers a better time to solution, albeit with a quality loss. In recent work we adapted these algorithms to perform dynamic community detection, i.e. update a partition according to changes in the graph. Our algorithms have been thoroughly evaluated and optimized, making community detection one of the areas in which NetworKit distributes novel, state-of-the-art algorithmic tools.
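As an illustration of the power iteration scheme mentioned for PageRank (Section 3.6), here is a minimal sequential sketch on an adjacency dict; NetworKit's implementation is parallel and handles the general directed, dangling-node case.

```python
def pagerank(adj, damping=0.85, tol=1e-9):
    """Power iteration for PageRank; assumes every node has neighbors."""
    n = len(adj)
    rank = {u: 1.0 / n for u in adj}
    while True:
        new = {u: (1.0 - damping) / n for u in adj}
        for u, neighbors in adj.items():
            share = damping * rank[u] / len(neighbors)
            for v in neighbors:
                new[v] += share  # edges from high-scoring nodes contribute more
        if sum(abs(new[u] - rank[u]) for u in adj) < tol:
            return new
        rank = new

# Triangle 0-1-2 with a pendant node 3 attached to node 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
pr = pagerank(adj)
```

The scores sum to 1, and the best-connected node receives the highest score.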

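The label propagation scheme behind PLP (Section 3.7) can likewise be sketched sequentially. This toy version uses a deterministic tie-break (prefer the larger label) for reproducibility; it is not NetworKit's parallel implementation.

```python
from collections import Counter

def label_propagation(adj, max_iters=100):
    """Each node repeatedly adopts the most frequent label among its neighbors."""
    labels = {u: u for u in adj}  # start from singleton labels
    for _ in range(max_iters):
        changed = False
        for u in adj:
            if not adj[u]:
                continue
            counts = Counter(labels[v] for v in adj[u])
            best = max(counts, key=lambda l: (counts[l], l))  # ties -> larger label
            if labels[u] != best:
                labels[u] = best
                changed = True
        if not changed:  # densely connected groups agree on a common label
            break
    return labels

# Two 4-cliques joined by a single bridge edge (3-4).
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
       4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6]}
labels = label_propagation(adj)
```

On this graph each clique settles on its own label, separated by the sparse bridge.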
4 Network Generators

Generative network models aim to explain how networks form and evolve specific structural features. Such models and their implementations as generators have at least two important uses: On the one hand, algorithm and software engineers want generators for synthetic datasets which can be arbitrarily scaled and parametrized and which produce graphs that resemble the real application data. On the other hand, network scientists employ models to increase their understanding of network phenomena. NetworKit provides a versatile collection of graph generators for this purpose. In the simple probabilistic Erdős–Rényi model [35], edges are created with a uniform probability. We include an efficient implementation as described in [5]. Variations of Erdős–Rényi, the planted partition model and the stochastic blockmodel, are useful for generating graphs which have distinctive dense areas with sparse connections between them (i.e. communities). The Barabási–Albert model [2] implements a preferential attachment process which results in a power-law degree distribution. The Recursive Matrix (R-MAT) model [12] was proposed to recreate properties including a power-law degree distribution, the small-world property and self-similarity; its design goals also include few parameters and high generation speed. The Chung-Lu model [1] is a random graph model which aims to replicate a given degree distribution: Given a degree sequence, edges are created with probability p(u, v) = deg(u) · deg(v) / Σ_k deg(k), which recreates the degree sequence in expectation. The model has been shown to have similar capabilities as the R-MAT model [38]. For a given realizable degree sequence, the algorithm of Havel and Hakimi [26] generates a graph with exactly this degree sequence. Unlike the Chung-Lu model, the generative process promotes the formation of closed triangles, leading to a higher (and possibly more realistic) clustering coefficient. The PubWeb generator implements a geometry-driven model to create community structure [22]. Recent work yielded a parallel geometric graph generator that combines the unit disk graph model with hyperbolic geometry, producing large networks with power-law degree distribution and high clustering [45]; it will be distributed with an upcoming release.

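The Chung-Lu edge probability can be turned into a naive generator directly – an O(n²) sketch for illustration only; NetworKit's generators are far more efficient.

```python
import random

def chung_lu(degrees, seed=42):
    """Naive Chung-Lu generator: edge (u, v) appears with probability
    min(1, deg(u) * deg(v) / total_degree), so the expected degree of
    each node matches the input sequence (approximately, due to capping)."""
    rng = random.Random(seed)
    total = sum(degrees)
    edges = []
    for u in range(len(degrees)):
        for v in range(u + 1, len(degrees)):
            if rng.random() < min(1.0, degrees[u] * degrees[v] / total):
                edges.append((u, v))
    return edges

edges = chung_lu([5, 3, 3, 2, 2, 1])
```

With a fixed seed the output is reproducible, which is convenient for benchmarking.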
5 Comparison to Related Software

Recent years have seen a proliferation of graph processing and network analysis tools which vary widely in terms of target platform, user interface, scalability and feature set. We therefore locate NetworKit relative to these efforts. Although the boundaries are not sharp, we would like to separate network analysis toolkits from general-purpose graph frameworks (e.g. the Boost Graph Library and JUNG [34]), which are less focused on studying empirical networks.

As closest in terms of architecture, functionality and target use cases, we see NetworkX [25], igraph [16] and graph-tool. All of them are distributed as Python modules, provide a broad feature set for network analysis workflows, and have active user communities. NetworkX is a mature toolkit and the de-facto standard for the analysis of small to medium networks in a Python environment, but it is not suitable for massive networks due to its pure Python implementations. Due to the similar interface, users of NetworkX are likely to move easily to NetworKit for larger networks. igraph and graph-tool also address this scalability issue by implementing core data structures and algorithms in C or C++. graph-tool builds on the Boost Graph Library and parallelizes some kernels using OpenMP. These similarities make those packages ideal candidates for an experimental comparison with NetworKit (see Section 6.2).

Related efforts from the algorithm engineering community are KDT [31] (built on an algebraic, distributed parallel backend), GraphCT [17] (focused on massive multithreading architectures such as the Cray XMT), STINGER [18] (a dynamic graph data structure with some analysis capabilities) and Ligra [42] (a recent shared-memory parallel library). They offer high performance through native, parallel implementations of certain kernels. However, to characterize a complex network in practice we need a substantial set of analytics which those frameworks currently do not provide. Furthermore, the SNAP [29] network analysis package has also recently adopted the hybrid approach of a C++ core and Python interface. GUI applications like Gephi [4] and Pajek have a stronger focus on visual network exploration and less on performance on massive datasets. Distributed computing solutions like GraphLab [30] become relevant for massive graph applications, though, as we argue, shared-memory multicore machines go a long way. A more detailed discussion of related software is provided in the Supplementary Material. It is our hope that all of these diverse but overlapping lines of development will cross-pollinate and lead to network analysis tools with robust performance and methodological foundations. Through further development, NetworKit will continue to set itself apart with a feature set geared towards large-scale network analysis and by including novel algorithmic approaches.

6 Experimental Evaluation

6.1 Performance Benchmark. Fig. 2 shows results of a benchmark of the most important analytics kernels in NetworKit. Our platform is a shared-memory server with 256 GB RAM and 2x8 Intel(R) Xeon(R) E5-2680 cores (32 threads due to hyperthreading) at 2.7 GHz. The algorithms were applied to a diverse set of 14 real-world networks in the size range from 7k to 260M edges, including web graphs, social networks, connectome data and internet topology networks (see Table 1 for a description). Kernels with quadratic running time (like Betweenness) were restricted to the subset of the 4 smallest networks. Speed is measured in edges per second to relate the size of the instance to the time required for a solution, and the box plots illustrate the range of processing rates achieved (blue dots are outliers). CommunityDetectionLP and CommunityDetectionLM call the PLP and PLM algorithms, respectively. Clustering coefficient here refers to the average local coefficient, and BetweennessApprox to the method with bounded error [40].

Figure 2: Processing rates of NetworKit analytics kernels

name               type        n          m
power              powergrid   4941       6594
fb-Caltech36       social      769        16656
PGPgiantcompo      social      10680      24316
as-22july06        internet    22963      48436
fb-Smith60         social      2970       97133
fb-MIT8            social      6440       251252
caidaRouterLevel   internet    192244     609066
coAuthorsDBLP      social      299067     977676
fb-Texas84         social      36371      1590655
in-2004            web         1382908    13591473
cit-Patents        citations   6009555    16518948
soc-LiveJournal    social      4847571    43369619
con-fiber_big      brain       591428     46374120
uk-2002            web         18520486   261787258

Table 1: Networks used for performance benchmark

The benchmark illustrates that a set of efficient linear-time kernels, including ConnectedComponents, the community detectors, PageRank, CoreDecomposition and ClusteringCoefficient, scales robustly to networks in the order of 10^9 edges. The iFUB algorithm demonstrates its surprisingly good performance on complex networks, moving diameter calculation effectively into the class of linear-time kernels. Algorithms like BFS and ConnectedComponents actually scan every edge at a rate of 10^7 to 10^8 edges per second. Higher throughput is so far only achieved by algorithms that scan node attributes (DegreeAssortativity), and sampling-based approaches like clustering coefficient approximation. For sublinear algorithms such as the latter, scalability is in practice only constrained by the available memory. Betweenness calculation remains very time-consuming in spite of parallelization, but approximate results with an error bound of 0.1 can be obtained two orders of magnitude faster. Fig. 3 breaks processing rates of the Diameter kernel down to the particular instances, in decreasing order of size, illustrating that performance is often strongly dependent on the specific structure of complex networks.

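The processing rates reported here (edges per second) relate instance size to running time and are easy to reproduce for any kernel – a sketch with a BFS stand-in kernel and Python's high-resolution timer:

```python
import time
from collections import deque

def bfs(adj, source=0):
    """Toy kernel: breadth-first search over an adjacency list."""
    seen = {source}
    order, queue = [], deque([source])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

# Ring graph: n nodes, n edges.
n = 10_000
adj = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
m = n

start = time.perf_counter()
order = bfs(adj)
elapsed = time.perf_counter() - start
rate = m / elapsed  # edges per second, the unit used in Fig. 2
```

The same rate metric applies to any kernel that touches each edge a bounded number of times.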
Figure 3: Processing rate of diameter calculation on different networks

Figure 5: Processing rate of several NetworkX kernels

6.2 Comparative Benchmark. NetworKit, igraph and graph-tool rely on the same hybrid architecture of C/C++ implementations with a Python interface. igraph uses non-parallel C code, while graph-tool also features parallelism. We benchmarked all packages in comparison on our 16-core parallel platform and present the measured performance in Fig. 4. For comparison, the performance of some of NetworkX's Python-based kernels is shown in Fig. 5. Various issues complicated the benchmarking: In our benchmark, NetworKit was the only framework that could robustly run the set of fast kernels (excluding the expensive betweenness, its approximation, and exact clustering coefficients) on the full set of networks in less than an hour. Not all packages implement the same set of kernels in a comparable way – we compared where we could find an equivalent. igraph's PageRank crashed on our instances and is therefore not listed, and its community detection code, while very fast on smaller instances, did not terminate in a nightly run on large networks. graph-tool's BFS implementation was significantly slower than its other kernels and could therefore not be run on the larger instances. We therefore restricted the benchmark set for the comparison to the 8 smaller networks (see Table 1). This allows us to assess relative performance, but the results are somewhat distorted since some acceleration techniques such as parallelism pay off primarily on large instances. igraph and graph-tool are capable of similarly high processing rates, which can surpass NetworKit's on individual kernels, but performance varies strongly among the implementations of different kernels. Another scalability factor is the memory footprint of the graph data structure. NetworKit provides a lean implementation in which the 260M edges of the uk-2002 web graph occupy only 9 GB, compared with igraph (93 GB) and graph-tool (14 GB). After indexing the edges, NetworKit requires 11 GB for the graph. Our code also enables fast I/O: loading the web graph from a GML file took 4 minutes, 25 minutes and 29 minutes, respectively.

Figure 4: Processing rates of NetworKit (nk) in comparison with igraph (ig) and graph-tool (gt)

7 Example Use Cases

In the following, we present possible workflows and use cases, highlighting the capabilities of NetworKit as an interactive tool and a library.

Networks at a Glance. NetworKit uses many of the analytics kernels described above to provide an overview of important structural features of a network. The properties.overview function prints a profile of the data set in tabular form. For typical networks with several thousands of edges, this output is produced almost instantly. The following example tests the scalability of the function by giving an overview of the uk-2007-05 web graph, a network of more than 100 million pages and 3.3 billion hyperlinks from the .uk domain. We see that the network is sparse, highly clustered, has one giant connected component among many small ones and a diameter that is low relative to its size. Its degree distribution is highly skewed and follows a power law with exponent γ ≈ 2.4, and degrees of neighbors are correlated negatively. The network has a modular structure in which several hundred thousand communities can be identified. The full analysis required less than 100 GB of memory and took 7 hours and 11 minutes on our test platform, time which is unequally distributed among the various kernels. For example, diameter estimation takes 16 minutes, and the parallel community detection algorithm PLP processes the graph in 1 minute and 20 seconds (at 40M edges/s).

networks are retrieved from a database, and NetworKit is called via the Python frontend. The C++ core has been extended to enable more efficient analysis of tissuespecific PPI networks, by implementing in-place filtering of the network to the subgraphs of proteins that occur in given cell types. Finally, statistical analysis and visualization is applied to the network analysis data. The system is close to how we envision NetworKit as a high-performance algorithmic component in a real-world data analysis scenario, and we therefore place emphasis on the toolkit being easily scriptable and extensible.

Network Properties: uk-2007-05 ================== Basic Properties ------------------------- --------------------nodes, edges 105896555, 3301876564 directed? False weighted? False isolated nodes 742603 self-loops 0 density 0.000001 clustering coefficient 0.738643 max. core number 5704 connected components 756936 size of largest component 104288749 estimated diameter range (112, 122) ------------------------- --------------------Node Degree Properties ----------------------------- ---------------------min./max. degree (0, 975419) avg. degree 62.360415 power law?, likelihood, gamma True, 597.0328, 2.3996 degree assortativity -0.1045 ----------------------------- ---------------------Community Structure ------------------------- ----------- -------community detection (PLM) communities 762775 modularity 0.996244 ------------------------- ----------- --------

Figure 6: PPI network analysis pipeline with NetworKit as central component

8

Conclusion

Among solutions for large-scale graph analytics, distributed computing frameworks are often prominently named. However, graphs arising in many data analysis scenarios are not bigger than the billions of edges that fit into a conventional main memory and can therefore be processed far more efficiently in a shared-memory parallel model (Shun and Blelloch [42] make a similar case for shared-memory graph processing). With NetworKit, we provide a package of network analytics kernels, network generators and utility software to explore and characterize large relational datasets on typical parallel workstations. Techniques that allow NetworKit to scale to large networks include both efficient C++ implementations and appropriate algorithmic methods (parallelism, graph sampling and others). The interface provided by our Python module allows domain experts to focus on data analysis workflows instead of the intricacies of programming. The open-source package is under continuous

Example Pipeline. A thesis by Flick [19] provides an early example of NetworKit as a component in an application-specific data mining pipeline (Fig. 6), which performs analysis of protein-interaction (PPI) networks. The pipeline implements a preprocessing stage in Python, in which networks are compiled from heterogeneous datasets containing interaction data as well as expression data about the occurrence of proteins in different cell types. During the network analysis stage, preprocessed 7

development and will be amended by optimizations and new features informed by algorithm engineering research as well as developments in network science. Packages which also follow the hybrid approach of C/C++ kernels and a Python interface achieve similarly high performance, but specific figures vary among frameworks and kernels. Our experimental study showed that NetworKit is capable of quickly processing large-scale networks for a variety of analytics kernels in a reliable manner, whereas the other two frameworks in the study could not always solve all test cases. To achieve high performance overall, NetworKit uses a combination of algorithmic acceleration techniques, efficient implementations and a low memory footprint. We recommend NetworKit for the comprehensive structural analysis of large complex networks in the range of 10^5 to 10^10 edges. Specialized programming skills are not required, though users familiar with the Python ecosystem of data analysis tools will appreciate the possibility to seamlessly integrate our toolkit.

Installation and Support. NetworKit is free software licensed under the permissive MIT License for the least amount of friction with respect to sharing and reuse of code in a scientific setting. We would like to encourage usage and contributions by a diverse community, including data mining users and algorithm engineers. The package source and additional resources can be obtained from http://networkit.iti.kit.edu. The included documentation provides support for setting up and getting started with NetworKit. It also includes a tutorial on basic network analytic tasks. The module networkit is also installable via the Python package manager pip. Optionally, the Python layer can be omitted by building a subset of the functionality as a native library. General discussion (updates, support, feature requests etc.) takes place on the open e-mail list [email protected].

History and Roadmap. NetworKit started as a collection of community detection algorithms written in C++, first released in March 2013. The current release is version 3.3 of August 2014, and we will release an update with additional features and improvements later this year. Features of upcoming releases are likely to focus on multiple areas: sparsification and filtering of networks to reduce computational work or enhance properties; algorithms for layouting and visualizing large complex networks; tools for the analysis of dynamic graphs; as well as general usability improvements in response to applications.

Acknowledgements. This work was partially supported by the project Parallel Analysis of Dynamic Networks – Algorithm Engineering of Efficient Combinatorial and Numerical Methods, which is funded by the Ministry of Science, Research and the Arts Baden-Württemberg. A. S. acknowledges support by the RISE program of the German Academic Exchange Service (DAAD). We thank Maximilian Vogel for co-maintaining the software. We thank Lukas Barth, Miriam Beddig, Elisabetta Bergamini, Stefan Bertsch, Pratistha Bhattarai, Andreas Bilke, Simon Bischof, Guido Brückner, Patrick Flick, Michael Hamann, Lukas Hartmann, Daniel Hoske, Gerd Lindner, Moritz von Looz, Yassine Marrakchi, Marcel Radermacher, Klara Reichard, Marvin Ritter, Florian Weber, Michael Wegner and Jörg Weisbarth for contributing code.

References

[1] W. Aiello, F. Chung, and L. Lu. A random graph model for massive graphs. In Proceedings of the thirty-second annual ACM symposium on Theory of computing, pages 171–180. ACM, 2000.
[2] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. June 2001.
[3] J. Alstott, E. Bullmore, and D. Plenz. powerlaw: a Python package for analysis of heavy-tailed distributions. PLoS ONE, 9(1):e85777, 2014.
[4] M. Bastian, S. Heymann, and M. Jacomy. Gephi: an open source software for exploring and manipulating networks. In International Conference on Weblogs and Social Media, pages 361–362, 2009.
[5] V. Batagelj and U. Brandes. Efficient generation of large random networks. Physical Review E, 71(3):036113, 2005.
[6] V. Batagelj and M. Zaversnik. An O(m) algorithm for cores decomposition of networks. arXiv preprint cs/0310049, 2003.
[7] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. S. Seljebotn, and K. Smith. Cython: The best of both worlds. Computing in Science & Engineering, 13(2):31–39, 2011.
[8] E. Bergamini, H. Meyerhenke, and C. L. Staudt. Approximating betweenness centrality in large evolving networks. arXiv preprint arXiv:1409.6241, 2014.
[9] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[10] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang. Complex networks: Structure and dynamics. Physics Reports, 424(4):175–308, 2006.
[11] U. Brandes. A faster algorithm for betweenness centrality. J. Mathematical Sociology, 25(2):163–177, 2001.
[12] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. Computer Science Department, page 541, 2004.
[13] A. Clauset, C. R. Shalizi, and M. E. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
[14] L. d. F. Costa, O. N. Oliveira Jr, G. Travieso, F. A. Rodrigues, P. R. Villas Boas, L. Antiqueira, M. P. Viana, and L. E. Correa Rocha. Analyzing and modeling real-world phenomena with complex networks: a survey of applications. Advances in Physics, 60(3):329–412, 2011.
[15] P. Crescenzi, R. Grossi, M. Habib, L. Lanzi, and A. Marino. On computing the diameter of real-world undirected graphs. Theoretical Computer Science, 514:84–95, 2013.
[16] G. Csardi and T. Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5), 2006.
[17] D. Ediger, K. Jiang, E. J. Riedy, and D. A. Bader. GraphCT: Multithreaded algorithms for massive graph analysis. Parallel and Distributed Systems, IEEE Transactions on, 24(11):2220–2229, 2013.
[18] D. Ediger, R. McColl, J. Riedy, and D. Bader. STINGER: High performance data structure for streaming graphs. In High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, pages 1–5, Sept 2012.
[19] P. Flick. Analysis of human tissue-specific protein-protein interaction networks. Master's thesis, Karlsruhe Institute of Technology, 2014.
[20] J. H. Fowler, N. A. Christakis, Steptoe, and D. Roux. Dynamic spread of happiness in a large social network: longitudinal analysis of the Framingham Heart Study social network. BMJ: British Medical Journal, pages 23–27, 2009.
[21] U. Gargi, W. Lu, V. S. Mirrokni, and S. Yoon. Large-scale community detection on YouTube for topic discovery and exploration. In International Conference on Weblogs and Social Media, 2011.
[22] J. Gehweiler and H. Meyerhenke. A distributed diffusive heuristic for clustering a virtual P2P supercomputer. In Proc. 7th High-Performance Grid Computing Workshop (HGCW'10) in conjunction with 24th Intl. Parallel and Distributed Processing Symposium (IPDPS'10). IEEE Computer Society, 2010.
[23] R. Geisberger, P. Sanders, and D. Schultes. Better approximation of betweenness centrality. In ALENEX, pages 90–100. SIAM, 2008.
[24] M. Girvan and M. Newman. Community structure in social and biological networks. Proc. of the National Academy of Sciences, 99(12):7821, 2002.
[25] A. Hagberg, P. Swart, and D. S. Chult. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Laboratory (LANL), 2008.
[26] S. L. Hakimi. On realizability of a set of integers as degrees of the vertices of a linear graph I. Journal of the Society for Industrial & Applied Mathematics, 10(3):496–506, 1962.
[27] P. F. Jonsson, T. Cavanna, D. Zicha, and P. A. Bates. Cluster analysis of networks generated through homology: automatic identification of important protein communities involved in cancer metastasis. BMC Bioinformatics, 7:2, 2006.
[28] A. Klaus, S. Yu, and D. Plenz. Statistical analyses support power law distributions found in neuronal avalanches. PLoS ONE, 6(5):e19779, 2011.
[29] J. Leskovec and R. Sosič. SNAP: A general purpose network analysis and graph mining library in C++. http://snap.stanford.edu/snap, June 2014.
[30] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727, 2012.
[31] A. Lugowski, D. Alber, A. Buluç, J. R. Gilbert, S. Reinhardt, Y. Teng, and A. Waranis. A flexible open-source toolbox for scalable complex graph analysis. In Proceedings of the Twelfth SIAM International Conference on Data Mining (SDM12), pages 930–941, April 2012.
[32] M. Newman. Networks: an introduction. Oxford University Press, 2010.
[33] M. E. J. Newman. Assortative mixing in networks. Phys. Rev. Lett., 89:208701, Oct 2002.
[34] J. O'Madadhain, D. Fisher, S. White, and Y. Boey. The JUNG (Java Universal Network/Graph) framework. University of California, Irvine, California, 2003.
[35] P. Erdős and A. Rényi. On the Evolution of Random Graphs. Publication of the Mathematical Institute of the Hungarian Academy of Sciences, 1960.
[36] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1999.
[37] F. Perez, B. E. Granger, and C. Obispo. An open source framework for interactive, collaborative and reproducible scientific computing and education, 2013.
[38] A. Pinar, C. Seshadhri, and T. G. Kolda. The similarity between stochastic Kronecker and Chung-Lu graph models. Oct. 2011.
[39] U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[40] M. Riondato and E. M. Kornaropoulos. Fast approximation of betweenness centrality through sampling. In Proceedings of the 7th ACM Conference on Web Search and Data Mining, WSDM, 2014.
[41] T. Schank and D. Wagner. Approximating clustering coefficient and transitivity. Journal of Graph Algorithms and Applications, 9(2):265–275, 2005.
[42] J. Shun and G. E. Blelloch. Ligra: a lightweight graph processing framework for shared memory. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, Shenzhen, China, February 23-27, 2013, pages 135–146, 2013.
[43] C. L. Staudt and H. Meyerhenke. Engineering high-performance community detection heuristics for massive graphs. arXiv preprint arXiv:1304.4453, 2013.
[44] C. L. Staudt and H. Meyerhenke. Engineering high-performance community detection heuristics for massive graphs. In Proceedings of the 2013 International Conference on Parallel Processing. Conference Publishing Services (CPS), 2013.
[45] M. von Looz, C. L. Staudt, H. Meyerhenke, and R. Prutkin. Fast generation of dynamic complex networks with underlying hyperbolic geometry. Manuscript, October 2014.

A Supplementary Material

A.1 Detailed Comparison to Related Software. In the following, we compare NetworKit to a wider set of software packages with a possibly similar scope of application. Network analytics packages offer a collection of analysis kernels through a specific user interface. We selected a number of software packages which we see as close to NetworKit in one aspect or the other, and compare them with respect to design goals, usability, feature set and other criteria. Although the boundaries to graph frameworks (e.g. the Boost Graph Library) are often blurred, we do not consider them sufficiently similar. Table 2 shows a comparison of the feature sets.

NetworkX [25] is a feature-rich Python package for network analysis whose development started in 2002. It is considered the de-facto standard for the analysis of small to medium networks in a Python environment. For these reasons, NetworKit aims for compatibility with NetworkX through functions for the conversion of graph objects. NetworkX has a large feature set and provides a highly flexible graph data structure, but its applicability to large graphs is limited.

igraph [16] is a C library aimed at creating and manipulating networks with millions of nodes and edges. In addition to the C library, igraph provides interfaces for Python, R, and Ruby.

graph-tool is a Python library for the manipulation and analysis of networks. In order to achieve high performance, the core of graph-tool is written in C++, utilizing the Boost Graph Library, and uses OpenMP for parallelization. The module has a rich collection of features for the analysis and visualization of networks. We consider graph-tool, igraph and NetworkX closest in scope and design to NetworKit.

Another related effort is the SNAP [29] network analysis package, which focuses on scalability through efficient sequential implementations of some kernels and generators. It has recently adopted the hybrid approach of a C++ core and Python interface.
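The scalability gap between NetworkX-style and high-performance representations is largely a data-structure issue: per-node Python dictionaries are flexible but memory-hungry, while compact libraries store adjacencies in a handful of flat arrays. The following toy sketch (illustrative only; the function names are our own and this is not NetworKit's actual internal representation) converts a dict-of-lists graph into a CSR-like offsets/targets layout:

```python
def to_csr(adj):
    """Convert a dict-of-lists adjacency (NetworkX-flavored) into a
    compact CSR-like pair of arrays. Toy illustration: real libraries
    use typed arrays rather than Python lists."""
    nodes = sorted(adj)
    index = {u: i for i, u in enumerate(nodes)}
    offsets = [0]              # offsets[i]..offsets[i+1] delimit node i's edges
    targets = []               # concatenated neighbor lists
    for u in nodes:
        targets.extend(index[v] for v in adj[u])
        offsets.append(len(targets))
    return offsets, targets

def neighbors_csr(offsets, targets, i):
    """Neighbors of node i as a slice of the flat targets array."""
    return targets[offsets[i]:offsets[i + 1]]
```

With this layout, iterating over a node's neighborhood touches one contiguous slice of memory, which is what makes cache-efficient and parallel traversal feasible on billion-edge graphs.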
Gephi [4] is a visualization and analysis package. It has a rich graphical user interface that allows using the package without coding, advanced visualization capabilities, analytics and plotting functionality, as well as a "data laboratory" view. This makes Gephi a popular choice for data exploration. The Gephi Toolkit makes the functionality accessible as a Java library.

Pajek is a program for analyzing and visualizing large networks. As a closed-source stand-alone tool, it is available free of charge for noncommercial use.

JUNG [34] (Java Universal Network/Graph Framework) is a Java-based library for modeling, analyzing and visualizing data that can be represented as a network. In addition to static networks, JUNG has support for dynamic (evolving) networks, which are unsupported by the majority of

popular libraries.

The Knowledge Discovery Toolbox (KDT) [31] is a project with a focus similar to NetworKit: It provides high-performance kernels through an interactive Python interface. The underlying graph representation is algebraic. KDT is layered on top of Combinatorial BLAS, a linear algebra library which provides sparse matrix classes and operations. It is written in C++ and uses MPI for distributed-memory parallelism. KDT currently has a limited feature set.

Ligra [42] is a graph processing framework for shared-memory machines. Algorithms are written in C++ and expressed in terms of edge maps, vertex maps and vertex subsets. Ligra provides several fundamental graph algorithms and analytic kernels, but its feature set is still limited from a network analysis point of view, and it does not support interactive network analysis.

GraphCT (Graph Characterization Toolkit) [17] is written in C and targets both general multicore systems and the specialized parallel platform Cray XMT. Similar to NetworKit, the graph resides in main memory and is accessed by multiple threads. The defining feature of the Cray XMT architecture is that it uses massive hardware multithreading and fast thread switching to hide memory latency. GraphCT includes a basic collection of network analysis kernels.

STINGER (Spatio-Temporal Interaction Networks and Graphs Extensible Representation) [18] is mainly a dynamic graph data structure and has been extended over time with several graph and network analysis algorithms. These algorithms can be used within a standalone command line tool or as a library. Targeted at shared-memory platforms including the Cray XMT, STINGER focuses on fast execution. Recently, experimental interfaces to Python and Java have been added.

The GraphLab project [30] follows a parallel, distributed and asynchronous model for the implementation of graph algorithms. While GraphLab ships with a number of network analytics kernels, it fills a different niche as a framework for building large-scale distributed applications such as web-based recommender systems.
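The algebraic representation underlying KDT can be illustrated with a toy example: one BFS step corresponds to a matrix-vector product of the adjacency matrix with the current frontier vector, masked to unvisited vertices. A minimal dense sketch in pure Python (illustrative only; KDT itself operates on distributed sparse matrices via Combinatorial BLAS):

```python
def bfs_levels(A, source):
    """BFS via repeated boolean matrix-vector products.

    A: dense 0/1 adjacency matrix as a list of lists (toy stand-in for
    a sparse matrix). Returns the BFS level of each node, -1 if
    unreachable from source."""
    n = len(A)
    level = [-1] * n
    level[source] = 0
    frontier = [1 if v == source else 0 for v in range(n)]
    depth = 0
    while any(frontier):
        depth += 1
        # next frontier: y = A^T x, restricted to unvisited vertices
        nxt = [0] * n
        for u in range(n):
            if frontier[u]:
                for v in range(n):
                    if A[u][v] and level[v] == -1:
                        nxt[v] = 1
                        level[v] = depth
        frontier = nxt
    return level
```

Formulating traversal this way lets a framework inherit parallelism and distribution from its sparse linear algebra backend instead of hand-coding each graph kernel.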

Table 2: Feature matrix of network/graph analysis software (Legend: filled circle = present, half circle = upcoming, empty circle = missing). The compared packages are NetworKit, NetworkX, igraph, graph-tool, Gephi, JUNG, Pajek, KDT, STINGER, GraphCT, SNAP, GraphLab and Ligra. Feature groups and features: General (API, interactive use); Data Structures (undirected/directed/weighted graph, hypergraph); Algorithms (search (BFS/DFS), weighted shortest path, connected components, core decomposition, degree power law estimation, degree assortativity, betweenness centrality (exact/approx.), closeness (exact/approx.), eigenvector centrality/PageRank, clustering coefficients (exact/approx.), community detection, modularity, matching, diameter (exact/approx.), flows, independent set); Generators (Erdős-Rényi, Barabási-Albert, R-MAT, Watts-Strogatz, Chung-Lu, Havel-Hakimi, Hyperbolic Unit-Disk).

Figure 7: NetworKit benchmark results broken down by network instance (throughput in edges/s). Panels: (a) Betweenness, (b) Betweenness approximation, (c) Breadth-first search, (d) Average local clustering coefficient, (e) Approximation of average local clustering coefficient, (f) Community detection with PLM, (g) Community detection with PLP, (h) Connected components, (i) Core decomposition, (j) Degree assortativity. Network instances include uk-2002, con-fiber_big, soc-LiveJournal, cit-Patents, in-2004, fb-Texas84, coAuthorsDBLP, caidaRouterLevel, fb-MIT8, fb-Smith60, as-22july06, PGPgiantcompo, fb-Caltech36 and power.

Figure 8: NetworKit benchmark results broken down by network instance (throughput in edges/s). Panels: (a) PageRank, (b) Degree power law fitting.

A.2 Code Example. The following block of C++ code illustrates NetworKit's programming model for parallel graph algorithms. It is a parallelized implementation of Brandes' algorithm for betweenness centrality, used for the Betweenness kernel. The core function runs a search from a source node to compute the so-called dependencies of other nodes on the source node; the centrality score of a node is the sum over all dependencies. This is formulated as a lambda expression which is then passed to the node iterator Graph::balancedParallelForNodes, which executes the function in parallel on the set of nodes while also performing load balancing. To enable efficient parallelization and avoid data races, the algorithm first computes thread-local centrality scores, which are then added up for the final result. We strive for an API on the C++ layer that facilitates the concise formulation of graph algorithms, and thereby the extensibility of the library.

void Betweenness::run() {
    count z = G.upperNodeIdBound();
    scoreData.clear();
    scoreData.resize(z);
    // thread-local scores for efficient parallelism
    count maxThreads = omp_get_max_threads();
    std::vector<std::vector<double>> scorePerThread(maxThreads,
        std::vector<double>(G.upperNodeIdBound()));

    auto computeDependencies = [&](node s) {
        std::vector<double> dependency(z, 0.0);
        // run SSSP algorithm and keep track of everything
        std::unique_ptr<SSSP> sssp;
        if (G.isWeighted()) {
            sssp.reset(new Dijkstra(G, s, true, true));
        } else {
            sssp.reset(new BFS(G, s, true, true));
        }
        sssp->run();
        // compute dependencies for nodes in order of decreasing distance from s
        std::stack<node> stack = sssp->getStack();
        while (!stack.empty()) {
            node t = stack.top();
            stack.pop();
            for (node p : sssp->getPredecessors(t)) {
                // workaround for integer overflow in large graphs
                bigfloat tmp = sssp->numberOfPaths(p) / sssp->numberOfPaths(t);
                double weight;
                tmp.ToDouble(weight);
                dependency[p] += weight * (1 + dependency[t]);
            }
            if (t != s) {
                scorePerThread[omp_get_thread_num()][t] += dependency[t];
            }
        }
    };
    G.balancedParallelForNodes(computeDependencies);

    INFO("adding thread-local scores");
    // add up all thread-local values
    for (auto local : scorePerThread) {
        G.parallelForNodes([&](node v) {
            scoreData[v] += local[v];
        });
    }
    if (normalized) {
        // divide by the number of possible pairs
        count n = G.numberOfNodes();
        count pairs = (n - 2) * (n - 1);
        G.forNodes([&](node u) {
            scoreData[u] = scoreData[u] / pairs;
        });
    }
}
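For readers who prefer to prototype on the Python layer first, the same dependency-accumulation scheme can be sketched sequentially in pure Python. This is an illustrative reimplementation for unweighted graphs, not NetworKit's API; the function and variable names are our own:

```python
from collections import deque

def betweenness(adj):
    """Sequential Brandes betweenness for an unweighted graph.

    adj: dict mapping each node to a list of neighbors.
    Returns a dict of unnormalized betweenness scores (each unordered
    pair of an undirected graph is counted twice, as in the C++ code)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, tracking shortest-path counts and predecessors
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []                      # nodes in non-decreasing distance
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # accumulate dependencies in order of decreasing distance from s
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for p in preds[w]:
                delta[p] += sigma[p] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score
```

The reversed traversal of `order` plays the role of the stack returned by the SSSP object in the C++ listing, and `delta` corresponds to the per-source `dependency` vector.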