Preprint submitted to the ACM Web Search and Data Mining Conference 2015
M-Flash: Fast Billion-scale Graph Computation Using Block Partition Model
arXiv:1506.01406v1 [cs.DB] 3 Jun 2015
Hugo Gualdron, Robson Cordeiro, Jose F. Rodrigues Jr.
University of Sao Paulo, Sao Carlos, SP, Brazil
{gualdron,robson,junio}@icmc.usp.br

Duen Horng (Polo) Chau, Minsuk Kahng
Georgia Institute of Technology, Atlanta, USA
{polo,kahng}@gatech.edu

U Kang
KAIST, Daejeon, Republic of Korea
[email protected]
Abstract Recent graph computation approaches such as GraphChi, X-Stream, TurboGraph and MMap demonstrated that a single PC can perform efficient computation on billion-scale graphs. While they use different techniques to achieve scalability through optimizing I/O operations, such optimization often does not fully exploit the capabilities of modern hard drives. We contribute: (1) a novel and scalable graph computation framework called M-Flash that uses a block partition model to boost computation speed and reduce disk accesses, by logically dividing a graph and its node data into blocks that can fully fit in RAM for reuse; (2) a flexible and deliberately simple programming model, as part of M-Flash, that enables us to implement popular and essential graph algorithms, including the first single-machine billion-scale eigensolver; and (3) extensive experiments on real graphs with up to 6.6 billion edges, demonstrating M-Flash's consistent and significant speed-up over state-of-the-art approaches.

Keywords Graph Algorithms, Single Machine Scalable Graph Computation, Block Model
1 Introduction

Large graphs with billions of nodes and edges are increasingly common in many domains and applications, such as the study of social networks, transportation route networks, citation networks, and more.
Distributed frameworks have become popular choices for analyzing these large graphs (e.g., GraphLab [7], PEGASUS [11] and Pregel [19]). However, distributed approaches may not always be the best option, because they can be expensive to build [16], hard to maintain and optimize, and may simply be "overkill" for smaller graphs. These potential challenges prompted researchers to create scalable graph computation frameworks that run on a single machine, such as GraphChi [16] and TurboGraph [8]. They define sophisticated processing schemes to overcome the challenges induced by limited main memory and the poor locality of memory access observed in many graph algorithms [20]. For example, most frameworks use an iterative vertex-centric programming model to implement algorithms: in each iteration, a scatter step first propagates the data associated with vertices (e.g., node degrees) to their neighbors, followed by a gather step, where a vertex accumulates incoming updates from its neighbors to recalculate its own vertex data. Recently, X-Stream [22] introduced a related edge-centric scatter-gather processing scheme that achieved better performance than vertex-centric approaches by favoring sequential disk access over unordered data, instead of random access over ordered and indexed data (as in most other approaches). Having studied these approaches, we noticed that, despite their sophisticated schemes and novel programming models, they often do not fully exploit the capabilities of modern hard drives. For example, reading from or writing to disk is often performed at a lower speed than the disk supports;
or, graph data may be re-read from disk multiple times, which could be avoided. We present M-Flash, a novel scalable graph computation framework that overcomes many of the critical issues of existing approaches and outperforms the state of the art in large graph computation. Our major contributions include:

1. M-Flash Framework & Methodology: we propose the novel M-Flash framework that achieves fast and scalable graph computation via a block partition method that significantly boosts computation speed and reduces disk accesses by dividing a graph and its node data into blocks that can fully fit in RAM for reuse.
2. Programming Model: M-Flash provides a flexible and deliberately simple programming model, made possible by our block partition model. We demonstrate how popular, essential graph algorithms can be easily implemented (e.g., PageRank, connected components, the first single-machine eigensolver over billion-node graphs), and how many others can be supported.
3. Extensive Experiments: we compare M-Flash with state-of-the-art frameworks using large real graphs with up to 6.6 billion edges (YahooWeb) [26]. M-Flash is consistently and significantly faster than GraphChi [16], X-Stream [22], TurboGraph [8] and MMap [18] across all graph sizes, and it sustains high speed even when memory is severely constrained (e.g., 6.4X as fast as X-Stream when using 4GB of RAM).
2 M-Flash

The most common strategy to represent and store a graph's edges in memory is an adjacency list. For efficient processing, the adjacency list is often sorted (e.g., by source vertex ID) to speed up processing both for out-of-memory methods [16, 9] and for in-memory methods [9]. With a sorted and indexed list of edges, one can randomly access any given edge, rather than sequentially reading an entire list and searching for the desired edge in memory. However, random access is much slower than sequential access, and it is desirable only when the number of random accesses is much smaller than the number of sequential accesses it replaces. X-Stream [22] avoids sorting edges by exploiting the efficiency of sequential operations, a desirable feature because no preprocessing is then needed to sort the data. In a similar direction, M-Flash uses a block partition scheme, as we explain next, to maximize the number of edges processed each
time data is transferred from disk to RAM, minimizing the number of sequential I/O operations. Our experiments (see Section 3) demonstrate that this approach vastly outperforms the alternatives. Below, we first describe how a graph is stored in M-Flash (Subsection 2.1). We then detail our block-based processing model, which enables fast computation while using little RAM (Subsection 2.2). Finally, we explain how algorithms can be implemented using M-Flash's generic programming model (Subsection 2.3).
2.1 Representing and Storing Graphs in M-Flash

A graph in M-Flash is a directed graph G = (V, E) with vertices v ∈ V labeled from 1 to |V|. Each vertex has a set of attributes γ = {γ1, γ2, ..., γK} stored in separate dense vectors on disk (see Figure 1); γi(j) is the i-th attribute associated with vertex j. Edges e = (source, destination), e ∈ E, and their data (e.g., edge weights) are stored in blocks.

Blocks in M-Flash: Given a graph G, we divide its vertices V into β intervals denoted by I(p), where 1 ≤ p ≤ β. Note that I(p) ∩ I(p') = ∅ for p ≠ p', while ∪p I(p) = V. Thus, as shown in Figure 1, the graph is divided into β² blocks. Each block G(p,q) has a source node interval p and a destination node interval q, where 1 ≤ p, q ≤ β. In Figure 1, G(2,1) is the block that contains edges with source vertices in interval I(2) and destination vertices in interval I(1). We call this logical (not physical) creation of the β² blocks partitioning. The value of β can be easily and automatically determined from the available RAM, the total number of vertices in the graph, and the data we need to store for each vertex. For example, with 4 GB of RAM, a graph with 2 billion nodes, and 8 bytes of data per node, β = ⌈(2,000,000,000 × 8) / (4 × 1024³)⌉ = ⌈3.72⌉ = 4, thus requiring 4² = 16 blocks.
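To make the partitioning step concrete, here is a minimal sketch of how β and the resulting block count could be derived from the available RAM, the vertex count, and the per-vertex data size. The function and parameter names are hypothetical and not part of M-Flash's actual API; the formula simply follows the worked example above.

#include <cstdint>
#include <cmath>
#include <iostream>

// Hypothetical helper: choose the number of vertex intervals (beta) so that one
// interval of per-vertex data fits in the RAM budget, i.e. beta = ceil(|V| * phi / RAM).
std::uint64_t choose_beta(std::uint64_t num_vertices,
                          std::uint64_t bytes_per_vertex,
                          std::uint64_t ram_bytes) {
    double ratio = static_cast<double>(num_vertices) * bytes_per_vertex /
                   static_cast<double>(ram_bytes);
    return static_cast<std::uint64_t>(std::ceil(ratio));
}

int main() {
    std::uint64_t V   = 2'000'000'000ULL;            // 2 billion vertices
    std::uint64_t phi = 8;                           // 8 bytes of data per vertex
    std::uint64_t ram = 4ULL * 1024 * 1024 * 1024;   // 4 GB of RAM

    std::uint64_t beta = choose_beta(V, phi, ram);   // 4 in this example
    std::cout << "beta = " << beta
              << ", blocks = " << beta * beta << "\n"; // beta^2 = 16 logical blocks
    return 0;
}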
2.2 The M-Flash Processing Model

Naïve Block Processing (NBP): After logically partitioning the graph into blocks, we process them in a vertical zigzag order, as illustrated in Fig. 2, for three reasons: (1) computation results are stored by the destination vertices, so we can "pin" a destination interval (e.g., I(1)) while we process all the vertices that are sources to this destination interval (see the red vertical arrow in Figure 2); (2) this order leads to fewer reads, because the destination vertices' attributes need to be read only once, regardless of the number of source intervals; (3) after reading all the blocks in a column, we take a "U turn" (see the orange arrow in Fig. 2) to benefit from the fact that the data associated with the previously read source interval is already in memory and can be reused.
Fig. 1: Left: an example graph's adjacency matrix (in light blue) organized in M-Flash using 3 logical intervals (i.e., β = 3). G(p,q) is an edge block with source vertices in interval I(p) and destination vertices in interval I(q). Right: graph computation stores vertex data as dense vectors (γ1 ... γK), logically divided into the same number of intervals, as in the case of PageRank, which stores node degrees and intermediate PageRank scores.
Fig. 2: M-Flash's computation schedule for an example graph with 3 intervals. Blocks are processed in a vertical zigzag manner, indicated by the sequence of red, orange and yellow arrows. This enables reuse of input (e.g., when going from G(3,1) to G(3,2), M-Flash reuses source node interval I(3)), which reduces memory consumption and boosts speed. Figure best viewed in color.
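The schedule in Fig. 2 can be expressed as a short loop. The sketch below (hypothetical names, not M-Flash's actual code) enumerates blocks column by column, reversing the row direction on every other column so that the last source interval of one column is the first of the next, which is exactly the reuse the caption describes.

#include <cstdint>
#include <iostream>

// Enumerate blocks G(p,q) in the vertical zigzag order of Fig. 2:
// process all source intervals p for destination interval q = 1 top-down,
// then q = 2 bottom-up, and so on, so the source interval loaded last
// in one column is reused first in the next column.
void zigzag_schedule(std::uint32_t beta) {
    for (std::uint32_t q = 1; q <= beta; ++q) {                 // destination interval (column)
        bool top_down = (q % 2 == 1);                           // alternate direction per column
        for (std::uint32_t i = 0; i < beta; ++i) {
            std::uint32_t p = top_down ? (i + 1) : (beta - i);  // source interval (row)
            std::cout << "process block G(" << p << "," << q << ")\n";
        }
        // After finishing column q, write the updated attributes of
        // destination interval I(q) back to disk (not shown here).
    }
}

int main() { zigzag_schedule(3); }  // prints the 9-block order shown in Fig. 2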
Within a block (see Fig. 3), besides loading the attributes of the source and destination intervals into RAM, edges are read sequentially and processed as tuples e = ⟨source, destination, edge properties⟩. After all blocks in a column have been processed, the destination vertices' updated attributes are written to disk.

Extended Block Processing (EBP): The NBP model works well for graphs with dense blocks, but if the blocks are sparse, performance decreases, because for a given block we may have to read more data for the source vertex intervals than for the edges of the block itself. We could mitigate this with vertex ordering/clustering algorithms such as Cross-Association [5], Metis [12] or SlashBurn [17]. Conceptually, these algorithms reorder the rows and columns of a graph's adjacency matrix to create very dense blocks, very sparse blocks, and many empty blocks (since real graphs show power-law degree distributions [6]); empty blocks are not processed, which boosts overall speed. However, we want to avoid such expensive preprocessing steps, so we opt for a different solution. Instead of a preprocessing algorithm, we designed the Extended Block Processing model (EBP), which stores the data of the source vertices together with the edge data (in the same block), as illustrated in Figure 4. Therefore, instead of reordering the whole graph before processing, we reorder only single blocks on demand. This allows all the data (vertices and edges) to be read sequentially using fewer operations, avoiding the inefficient data transfer observed with sparse blocks under NBP.

Deciding when to use NBP or EBP: NBP is good for dense blocks, and EBP is good for sparse blocks. How can we quickly decide which to use when we are given a graph to process and have no control over its edge ordering? We compare the I/O cost of NBP (Equation 1) and EBP (Equation 2), based on the theoretical I/O cost model proposed by Aggarwal and Vitter [1]. We parametrize the algorithms' complexity with the size B of a block transfer from disk to RAM. Once the NBP and EBP costs are calculated, we compute the ratio in Equation 3 and select the best option according to Equation 4. In these equations, ξ is the number of edges in G(p,q), ϑ is the number of vertices in an interval, and φ and ψ are, respectively, the number of bytes needed to represent a vertex and an edge.
Θ_NBP(G(p,q)) = Θ((2ϑφ + ξψ) / B)    (1)

Θ_EBP(G(p,q)) = Θ((ξψ + ϑφ/β + 2ξφ + ϑφ) / B)    (2)

Θ_EBP / Θ_NBP = Θ((2ξφ + (1/β)ϑφ) / (ϑφ))    (3)

BlockType(G(p,q)) = sparse if Θ_EBP / Θ_NBP < 1, and dense otherwise    (4)

When we logically partition the graph (Subsection 2.1), we also decide how to physically create each block on the hard disk. We classify each block into dense or sparse using Equation 4. For each dense block we create one file on disk; for the sparse blocks in a row, we create a single file on disk that contains all the edges found in those blocks.

Algorithm 1 MAlgorithm: algorithm interface for coding in M-Flash
  initialize(Vertex v)
  gather(Vertex u, Vertex v, EdgeData data)
  sum(Accum v_1, Accum v_2, Accum v_out)
  apply(Vertex v)
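To illustrate the selection rule of Equations 3 and 4, here is a minimal sketch that classifies a block as dense or sparse from its statistics, under the cost model as reconstructed above. The struct and function names are hypothetical and not part of M-Flash's actual code.

#include <cstdint>

// Per-block statistics, using the paper's symbols:
//   xi    = number of edges in the block G(p,q)
//   theta = number of vertices in an interval
//   phi   = bytes per vertex value, psi = bytes per edge
struct BlockStats {
    std::uint64_t xi;
    std::uint64_t theta;
    std::uint64_t phi;
    std::uint64_t psi;
};

enum class BlockType { Dense, Sparse };

// Equations (3)/(4): EBP is preferred (block treated as sparse) when its extra
// I/O (storing source values with the edges) is cheaper than NBP's extra I/O
// (re-reading the source interval), i.e. when the cost ratio is below 1.
BlockType classify(const BlockStats& s, std::uint64_t beta) {
    double ebp_extra = 2.0 * s.xi * s.phi + static_cast<double>(s.theta) * s.phi / beta;
    double nbp_extra = static_cast<double>(s.theta) * s.phi;
    return (ebp_extra / nbp_extra < 1.0) ? BlockType::Sparse : BlockType::Dense;
}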
2.3 Programming Model in M-Flash

M-Flash supports directed weighted graphs, where each edge has a source and a destination vertex. M-Flash's computational model, called MAlgorithm (short for Matrix Algorithm Interface), is shown in Algorithm 1. MAlgorithm is a vertex-centric model that stores computation results with the destination vertices. Inspired by previous programming models such as GIM-V [11] of PEGASUS and GAS of PowerGraph [7], MAlgorithm can flexibly be used to implement many common and essential graph algorithms, especially iterative ones such as PageRank, Random Walk with Restart (RWR), Weakly Connected Components (WCC) and diameter estimation. The MAlgorithm interface has four main operations: initialize, gather, sum, and apply. The initialize operation loads the initial value of each destination vertex; the gather operation collects data from neighboring vertices; the sum operation combines the data gathered from the neighbors; finally, the apply operation writes the newly computed values of the destination vertices to disk, making them available for the next iteration.
Fig. 3: Example I/O operations to process the dense block G(2,1).
Fig. 4: Example I/O operations to process the sparse block G(2,1).

Algorithm 2 PageRank in M-Flash (degree(v) denotes the out-degree of vertex v)
  initialize(Vertex v):
    v.value = 0
  gather(Vertex u, Vertex v, EdgeData data):
    v.value += u.value / degree(u)
  sum(Accum v_1, Accum v_2, Accum v_out):
    v_out = v_1 + v_2
  apply(Vertex v):
    v.value = 0.15 + 0.85 * v.value

To demonstrate the flexibility of MAlgorithm, Algorithm 2 shows pseudo code for the PageRank algorithm (using power iteration). The input to PageRank consists of two vectors, one storing node degrees and another storing intermediate PageRank values, initialized to 1/|V|. The output is a third node vector that stores the final PageRank values. In each iteration, M-Flash executes the MAlgorithm operations on the output vector as follows:

– initialize: vertex values are set to 0
– gather: accumulates the intermediate PageRank values of all in-neighbors u of vertex v
– sum: sums up intermediate PageRank values (M-Flash supports multiple threads, so many sum operations may run concurrently)
– apply: calculates each vertex's new PageRank value (damping factor = 0.85, as recommended by Brin and Page [4])
The input for the next iteration is the output of the current one. The algorithm runs until the PageRank values converge or until a user-defined number of iterations is reached. Many other graph algorithms can be decomposed into the same four operations and implemented in similar ways, such as connected components, diameter estimation, and random walk with restart [11]. We show a few other algorithms in Appendix A.
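For readers who prefer a concrete, compilable illustration, here is a minimal in-memory C++ sketch of the four MAlgorithm operations for PageRank, applied to a small edge list. The interface and names are our own simplification for illustration, not M-Flash's actual API, and it ignores blocks, disk I/O, and threading.

#include <cstdint>
#include <iostream>
#include <vector>

// Simplified, in-memory illustration of the MAlgorithm operations for PageRank.
// Real M-Flash streams edges block by block from disk; here everything fits in RAM.
struct Edge { std::uint32_t src, dst; };

std::vector<double> pagerank(std::uint32_t num_vertices,
                             const std::vector<Edge>& edges,
                             int iterations) {
    std::vector<std::uint32_t> out_degree(num_vertices, 0);
    for (const Edge& e : edges) ++out_degree[e.src];

    std::vector<double> value(num_vertices, 1.0 / num_vertices);  // input vector
    for (int it = 0; it < iterations; ++it) {
        // initialize: destination accumulators start at 0
        std::vector<double> next(num_vertices, 0.0);

        // gather + sum: each edge contributes the source's rank / out-degree
        for (const Edge& e : edges)
            if (out_degree[e.src] > 0)
                next[e.dst] += value[e.src] / out_degree[e.src];

        // apply: damping, as in Algorithm 2
        for (double& v : next) v = 0.15 + 0.85 * v;

        value = next;  // the output becomes the input of the next iteration
    }
    return value;
}

int main() {
    std::vector<Edge> edges = {{0, 1}, {1, 2}, {2, 0}, {2, 1}};
    std::vector<double> pr = pagerank(3, edges, 10);
    for (std::size_t i = 0; i < pr.size(); ++i)
        std::cout << "vertex " << i << ": " << pr[i] << "\n";
}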
3 Evaluation

Overview: We compare M-Flash with multiple state-of-the-art approaches: GraphChi, TurboGraph, X-Stream, and MMap. For a fair comparison, we use the experimental setups recommended by the authors of the other approaches as closely as we can. We first describe the datasets used in our evaluation
(Subsection 3.1) and our experimental setup (Subsection 3.2). Then we compare all approaches' runtimes for two exemplary graph algorithms, PageRank and Weakly Connected Components (WCC), which are the only algorithms implemented by all packages (Subsections 3.3 and 3.4); M-Flash outperforms the others. To demonstrate how M-Flash extends to more algorithms, we showcase our implementation of the Lanczos algorithm (with selective orthogonalization), one of the most computationally efficient approaches for computing eigenvalues and eigenvectors [21, 11] (Subsection 3.5); to the best of our knowledge, M-Flash provides the first design and implementation that can handle graphs with more than one billion nodes when the vertex data cannot fully fit in RAM (e.g., the YahooWeb graph). Next, we show that M-Flash continues to run at high speed even when the machine has little RAM (e.g., 4GB), when the other methods slow down (Subsection 3.6). Finally, through an analysis of I/O operations, we show that M-Flash performs far fewer read and write operations than other approaches, which empirically validates the efficiency of our block partitioning model (Subsection 3.7).
M-Flash performs minimal preprocessing (to construct the disk-resident blocks), which essentially consists of two high-speed passes through the edge file. As with other approaches, such as GraphChi and X-Stream, this preprocessing incurs a low one-time cost. Therefore, our experiments focus on comparing algorithm runtimes. The libraries are configured as follows:

– GraphChi: C++ version, downloaded from its GitHub repository in February 2015. Buffer sizes configured to those recommended by its authors (when evaluating GraphChi with 16GB of RAM in Section 3.6, we doubled the buffer size recommended for 8GB).
– X-Stream: C++ v0.9. Buffer size configured as close to the available RAM as possible.
– TurboGraph: v0.1 Enterprise Edition. Buffer size configured to 75% of the available RAM; using more crashes the system (also observed by [18]).
– MMap: Java version (64-bit).
– M-Flash: C++ version, single-threaded.
3.1 Graph Datasets

We use three real graphs of different scales: a LiveJournal graph [2] with 69 million edges (small), a Twitter graph [15] with 1.47 billion edges (medium), and the YahooWeb graph [26] with 6.6 billion edges (large). Table 1 lists their numbers of nodes and edges.
3.2 Experimental Setup

All experiments were run on a desktop computer with an Intel i7-4770K quad-core CPU (3.50 GHz), 16 GB of RAM and a 1 TB Samsung 850 Evo SSD. Note that M-Flash does not require an SSD to run (neither do GraphChi and X-Stream), while TurboGraph does; thus, we use an SSD to make sure all methods can perform at their best. GraphChi, X-Stream and M-Flash were run on Linux Ubuntu 14.04 (x64). TurboGraph was run on Windows (x64), since it only supports Windows [8]. MMap was written in Java and could thus run on both Linux and Windows; we ran it on Windows, following the setup of MMap's authors [18]. All reported runtimes are averages over three cold runs (i.e., all caches and buffers purged between runs), to avoid any potential advantage due to caching or buffering effects.

3.3 PageRank

Figure 5 shows how all approaches' PageRank runtimes compare.

LiveJournal (small graph; Fig. 5a): Since the whole graph and all node vectors fit in RAM, all approaches finish in seconds. Still, M-Flash was the fastest, up to 3.3X as fast as GraphChi and 2.4X as fast as X-Stream and TurboGraph.

Twitter (medium graph; Fig. 5b): This graph's edges do not fit in RAM (they require 11.3GB), but its node vectors do. M-Flash is a few seconds slower than TurboGraph and MMap, possibly because their implementations are highly optimized and do not provide a generic programming model for easy implementation of general graph algorithms, as M-Flash does (see Subsection 2.3). Compared with GraphChi and X-Stream, which both offer generic programming models, M-Flash is 1.7X to 3.6X as fast.

YahooWeb (large graph; Fig. 5c): For this billion-node graph, neither its edges nor its node vectors fit in RAM; this is where M-Flash significantly outperforms the others, running 2.2X to 3.3X as fast as all other approaches.
Table 1: Real graph datasets used in our experiments.

  Graph         Nodes           Edges           Size
  LiveJournal   4,847,571       68,993,773      Small
  Twitter       41,652,230      1,468,365,182   Medium
  YahooWeb      1,413,511,391   6,636,600,779   Large
Fig. 5: Runtime of PageRank for the LiveJournal, Twitter and YahooWeb graphs with 8GB of RAM. (a & b): for the smaller graphs, M-Flash is as fast as some existing approaches (e.g., MMap) and significantly faster than others (e.g., 4X as fast as X-Stream on Twitter). (c): M-Flash is significantly faster than all state-of-the-art approaches on YahooWeb: 3X as fast as GraphChi and TurboGraph, 2.5X as fast as X-Stream, and 2.2X as fast as MMap.

3.4 Weakly Connected Components

To find all weakly connected components, the Union-Find algorithm [23] is the best option when there is enough RAM to store all vertex data (the component IDs); otherwise, an iterative algorithm (see Algorithm 3 in Appendix A) produces identical solutions [18]. Figures 6(a) and 6(b) show the runtimes for the LiveJournal and Twitter graphs with 8GB of RAM; all approaches use Union-Find, except X-Stream, which only offers an iterative algorithm that requires multiple iterations to converge, hence its significantly higher runtime on the Twitter graph. M-Flash is again among the fastest for these two smaller graphs: on the Twitter graph, M-Flash is 2.6X as fast as MMap, 5X as fast as TurboGraph, and only a few seconds behind GraphChi. The iterative algorithm from X-Stream is much slower (74X); we include it mainly to provide a complete comparison.
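As background for the in-memory case mentioned above, the sketch below shows a standard union-find structure with path compression and union by size, which keeps one component ID per vertex in RAM while edges are streamed. It is a generic illustration, not the specific implementation used by the systems compared here.

#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Standard disjoint-set (union-find) over vertex IDs: stream the edge list once,
// merging the endpoints of every edge, provided the per-vertex arrays
// (component IDs) fit in RAM.
class UnionFind {
public:
    explicit UnionFind(std::uint32_t n) : parent_(n), size_(n, 1) {
        std::iota(parent_.begin(), parent_.end(), 0u);  // each vertex starts alone
    }
    std::uint32_t find(std::uint32_t x) {
        while (parent_[x] != x) {
            parent_[x] = parent_[parent_[x]];  // path compression (halving)
            x = parent_[x];
        }
        return x;
    }
    void unite(std::uint32_t a, std::uint32_t b) {
        a = find(a); b = find(b);
        if (a == b) return;
        if (size_[a] < size_[b]) std::swap(a, b);  // union by size
        parent_[b] = a;
        size_[a] += size_[b];
    }
private:
    std::vector<std::uint32_t> parent_, size_;
};

Streaming all edges through unite and then calling find on every vertex yields the component IDs.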
For the YahooWeb graph, the node vectors do not fit in RAM, so the iterative approach is required. Figure 6(c) compares the runtimes of GraphChi, X-Stream and M-Flash (TurboGraph and MMap do not offer the iterative algorithm). Similarly to the PageRank results, M-Flash is significantly faster: 9.2X as fast as GraphChi and 5.8X as fast as X-Stream.
Fig. 6: Runtime of Weakly Connected Components for the LiveJournal, Twitter and YahooWeb graphs with 8GB of RAM. (a & b): for the smaller graphs, M-Flash is as fast as some approaches (e.g., GraphChi) and significantly faster than others (e.g., 5X as fast as TurboGraph). (c): M-Flash is significantly faster than all state-of-the-art approaches on YahooWeb: 9.2X as fast as GraphChi and 5.8X as fast as X-Stream.

3.5 Spectral Analysis using the Lanczos Algorithm

Eigenvalues and eigenvectors are at the heart of numerous algorithms, such as singular value decomposition (SVD) [3], spectral clustering [25], triangle counting [24] and tensor decomposition [14]. We implemented the Lanczos algorithm with selective orthogonalization (LSO) to demonstrate how M-Flash's design can easily be extended to support spectral analysis of billion-scale graphs. To the best of our knowledge, M-Flash provides the first design and implementation that can handle graphs with more than one billion nodes when the vertex data cannot fully fit in RAM (e.g., the YahooWeb graph). To compute the top 20 eigenvectors and eigenvalues of the YahooWeb graph, one iteration of LSO takes 737s when using 8 GB of RAM. The closest comparable result we were able to find is from the HEigen system [10], at 150s per iteration; however, that was for a much smaller graph with 282 million edges (23X fewer edges) and used a 70-machine Hadoop cluster, whereas M-Flash, using a single machine on a much larger graph, takes only roughly 4X longer. GraphChi offers an SVD implementation based on LSO, but it requires all node vectors to fit in RAM; thus, it cannot handle billion-scale graphs like YahooWeb.
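At its core, each Lanczos iteration needs only a sparse matrix-vector product plus a few dense vector operations over the vertex vectors, which is exactly what the SpMV operator in Appendix A (Algorithm 4) provides. The sketch below shows the plain Lanczos tridiagonalization, without the selective orthogonalization step the paper uses, with an in-memory matvec callback standing in for M-Flash's disk-based SpMV; all names are ours, for illustration only.

#include <cmath>
#include <functional>
#include <vector>

// Plain Lanczos tridiagonalization for a symmetric operator A, given only a
// matvec callback (in M-Flash this role would be played by the SpMV operator
// of Algorithm 4, streaming edge blocks from disk). Produces the tridiagonal
// coefficients alpha (diagonal) and beta (off-diagonal).
void lanczos(const std::function<std::vector<double>(const std::vector<double>&)>& matvec,
             std::vector<double> v,              // starting vector, normalized below
             int steps,
             std::vector<double>& alpha,
             std::vector<double>& beta) {
    auto dot = [](const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    };
    double nrm = std::sqrt(dot(v, v));
    for (double& x : v) x /= nrm;
    std::vector<double> v_prev(v.size(), 0.0);
    double beta_prev = 0.0;

    for (int j = 0; j < steps; ++j) {
        std::vector<double> w = matvec(v);                        // w = A * v_j (the SpMV step)
        for (std::size_t i = 0; i < w.size(); ++i) w[i] -= beta_prev * v_prev[i];
        double a = dot(w, v);                                     // alpha_j
        for (std::size_t i = 0; i < w.size(); ++i) w[i] -= a * v[i];
        double b = std::sqrt(dot(w, w));                          // beta_j
        alpha.push_back(a);
        beta.push_back(b);
        if (b == 0.0) break;                                      // invariant subspace found
        v_prev = v;
        for (std::size_t i = 0; i < w.size(); ++i) v[i] = w[i] / b;
        beta_prev = b;
    }
}
// The eigenvalues of the small tridiagonal matrix (alpha, beta) approximate the
// extremal eigenvalues of A; selective orthogonalization, omitted here, keeps the
// Lanczos vectors numerically orthogonal on large problems.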
3.6 Effect of Memory Size

As the amount of available RAM affects both computation speed and data storage, we study the effect of memory size in detail. Figure 7 summarizes how all approaches perform under 4GB, 8GB and 16GB of RAM when running one iteration of PageRank on the YahooWeb graph.

Fig. 7: Runtime comparison for PageRank (1 iteration) on the YahooWeb graph. M-Flash is significantly faster than all state-of-the-art approaches across memory sizes.

M-Flash continues to run at high speed even when the machine has little RAM (e.g., 4GB), since most edges are processed in dense blocks, which considerably reduces I/O operations. The other methods tend to slow down. MMap does not perform well when memory is severely constrained (e.g., 4GB) due to thrashing: the machine spends a lot of time mapping disk-resident data into RAM and unmapping it again, slowing down the overall computation. GraphChi and TurboGraph also slow down significantly when RAM is reduced; for example, TurboGraph becomes more than two times slower when the memory size is halved.
3.7 Input/Output (I/O) Operations Analysis

Input/output (I/O) operations are a commonly used objective measure for evaluating frameworks based on secondary memory [16]. Figure 8 shows how M-Flash compares with GraphChi and X-Stream in terms of read and write operations and the amount of data read from or written to disk (we did not measure TurboGraph because it runs only on Windows, which does not provide readily available tools for measuring I/O speed). When running one iteration of PageRank on the 6.6-billion-edge YahooWeb graph, M-Flash performs significantly fewer reads (77GB) and fewer writes (17GB) than the other approaches. M-Flash also achieves and sustains high-speed reading from disk (the "plateau" at the top right of the figure), while the other methods do not: GraphChi generally writes data slowly across the whole computation iteration, and X-Stream shows periodic and spiky reads and writes.
4 Related work

A typical approach to scalable graph processing is to develop a distributed framework. This is the case of PEGASUS [11], Apache Giraph (http://giraph.apache.org),
PowerGraph [7], and Pregel [19]. In this work, we aim to scale up by maximizing what a single machine can do, which is considerably cheaper and easier to manage; single-node solutions have recently reached performance comparable to distributed systems for similar tasks [13]. Among existing single-node approaches, some work on both conventional magnetic disks and solid-state disks (SSDs), while others are restricted to SSDs, relying on the remarkably low latency and improved I/O of SSDs compared with magnetic disks. This is the case of TurboGraph [8] and RASP [27], which rely on random accesses to the graph edges, something that is not well supported by magnetic disks. M-Flash avoids this drawback and demonstrates better performance than TurboGraph.

One of the first approaches to avoid random disk accesses to edges is GraphChi [16], which partitions the graph on disk into units called shards and requires a preprocessing step to sort the data by source vertex. GraphChi uses a vertex-centric approach that requires a shard to fit entirely in RAM, including both the vertices of the shard and all their edges (in and out). As we demonstrate, this makes GraphChi less efficient than our work; M-Flash requires only a subset of the vertex data to be in memory at any time.

MMap [18] introduced an interesting approach based on OS-supported mapping of disk data into memory (virtual memory). It allows graph data to be accessed as if it were stored in unlimited memory, avoiding the need to manage data buffering, and enables high performance with minimal code. Inspired by MMap, M-Flash uses memory mapping when processing edge blocks.
Fig. 8: I/O operations for 1 iteration of PageRank on the YahooWeb graph. M-Flash needs significantly fewer reads and writes than the other approaches (from the figure: GraphChi reads 132GB and writes 31GB in 676s; X-Stream reads 235GB and writes 135GB in 507s; M-Flash reads 77GB and writes 17GB in 201s). M-Flash achieves and sustains high-speed reading from disk (the "plateau" at the top right), while the other methods do not.

As we demonstrate, M-Flash outperforms MMap. M-Flash also draws inspiration from the edge streaming approach introduced by X-Stream's processing model [22], improving on it with fewer disk writes for dense regions of the graph. Edge streaming is a form of stream processing in which unrestricted data flows over a bounded amount of buffering. As we demonstrate, this leads to optimized data transfer by means of less I/O and more processing per data transfer.

5 Conclusions

We proposed M-Flash, a fast and scalable graph computation framework that uses a block partition model to maximize disk access speed. M-Flash uses a deliberately simple design, which allows us to easily implement a wide range of popular graph algorithms. We conducted extensive experiments using large real graphs. M-Flash consistently and significantly outperforms all state-of-the-art approaches, including GraphChi, X-Stream, TurboGraph and MMap. M-Flash runs at high speed for graphs of all sizes, including the 6.6-billion-edge YahooWeb graph, even when the size of memory is reduced (e.g., 6.4X as fast as X-Stream when using 4GB of RAM).

References

1. Aggarwal, A., Vitter, J.S.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)
2. Backstrom, L., Huttenlocher, D., Kleinberg, J., Lan, X.: Group formation in large social networks: Membership, growth, and evolution. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 44–54. KDD '06, ACM, New York, NY, USA (2006)
3. Berry, M.W.: Large-scale sparse singular value computations. International Journal of Supercomputer Applications 6(1), 13–49 (1992)
4. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30(1), 107–117 (1998)
5. Chakrabarti, D., Papadimitriou, S., Modha, D.S., Faloutsos, C.: Fully automatic cross-associations. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 79–88. ACM (2004)
6. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the internet topology. In: ACM SIGCOMM Computer Communication Review, vol. 29, pp. 251–262. ACM (1999)
7. Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: PowerGraph: Distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, pp. 17–30. OSDI'12, USENIX Association, Berkeley, CA, USA (2012)
8. Han, W.S., Lee, S., Park, K., Lee, J.H., Kim, M.S., Kim, J., Yu, H.: TurboGraph: A fast parallel graph engine handling billion-scale graphs in a single PC. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 77–85. KDD '13, ACM, New York, NY, USA (2013)
9. Hong, S., Oguntebi, T., Olukotun, K.: Efficient parallel graph exploration on multi-core CPU and GPU. In: Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pp. 78–88 (2011)
10. Kang, U., Meeder, B., Papalexakis, E.E., Faloutsos, C.: HEigen: Spectral analysis for billion-scale graphs. IEEE Transactions on Knowledge and Data Engineering 26(2), 350–362 (2014)
11. Kang, U., Tsourakakis, C.E., Faloutsos, C.: PEGASUS: A peta-scale graph mining system implementation and observations. In: Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, pp. 229–238. ICDM '09, IEEE Computer Society, Washington, DC, USA (2009)
12. Karypis, G., Kumar, V.: METIS: Unstructured graph partitioning and sparse matrix ordering system, version 2.0. Tech. rep., University of Minnesota (1995)
13. Khan, A., Elnikety, S.: Systems for big-graphs. Proc. VLDB Endow. 7(13), 1709–1710 (2014)
14. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Review 51(3), 455–500 (2009)
15. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news media? In: Proceedings of the 19th International Conference on World Wide Web, pp. 591–600. WWW '10, ACM, New York, NY, USA (2010)
16. Kyrola, A., Blelloch, G., Guestrin, C.: GraphChi: Large-scale graph computation on just a PC. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, pp. 31–46. OSDI'12, USENIX Association, Berkeley, CA, USA (2012)
17. Lim, Y., Kang, U., Faloutsos, C.: SlashBurn: Graph compression and mining beyond caveman communities. IEEE Transactions on Knowledge and Data Engineering 26(12), 3077–3089 (2014)
18. Lin, Z., Kahng, M., Sabrin, K.M., Chau, D.H., Lee, H., Kang, U.: MMap: Fast billion-scale graph computation on a PC via memory mapping. In: IEEE BigData (2014)
19. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135–146. SIGMOD '10, ACM, New York, NY, USA (2010)
20. Munagala, K., Ranade, A.: I/O-complexity of graph algorithms. In: Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 687–694. SODA '99, SIAM, Philadelphia, PA, USA (1999)
21. Parlett, B.N., Scott, D.S.: The Lanczos algorithm with selective orthogonalization. Mathematics of Computation 33(145), 217–238 (1979)
22. Roy, A., Mihailovic, I., Zwaenepoel, W.: X-Stream: Edge-centric graph processing using streaming partitions. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 472–488. SOSP '13, ACM, New York, NY, USA (2013)
23. Tarjan, R.E., van Leeuwen, J.: Worst-case analysis of set union algorithms. J. ACM 31(2), 245–281 (1984)
24. Tsourakakis, C.E.: Fast counting of triangles in large real networks without counting: Algorithms and laws. In: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pp. 608–617. ICDM '08, IEEE Computer Society, Washington, DC, USA (2008)
25. Von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416 (2007)
26. Yahoo! Labs: Yahoo AltaVista web page hyperlink connectivity graph, 2002. (2002), accessed 2014-12-01
27. Yoneki, E., Roy, A.: Scale-up graph processing: a storage-centric view. In: Boncz, P.A., Neumann, T. (eds.) GRADES, p. 8. CWI/ACM (2013)

Appendix A. Algorithms Implemented Using the MAlgorithm Interface

Algorithm 3 Weakly Connected Components in M-Flash
  initialize(Vertex v):
    v.value = v.id
  gather(Vertex u, Vertex v, EdgeData data):
    v.value = min(v.value, u.value)
  sum(Accum v_1, Accum v_2, Accum v_out):
    v_out = min(v_1, v_2)
Algorithm 4 SpMV for weighted graphs in M-Flash
  initialize(Vertex v):
    v.value = 0
  gather(Vertex u, Vertex v, EdgeData data):
    v.value += u.value * data
  sum(Accum v_1, Accum v_2, Accum v_out):
    v_out = v_1 + v_2