Energy Efficiency of Large Scale Graph Processing Platforms

Kashif Nizam Khan, Helsinki Institute of Physics and Department of Computer Science, Aalto University, Finland, [email protected]

Zhonghong Ou, Beijing University of Posts and Telecommunications, China, [email protected]

Mohammad A. Hoque, University of Helsinki, Finland, [email protected]

Jukka K. Nurminen, Helsinki Institute of Physics, Finland, [email protected]

Tapio Niemi, Helsinki Institute of Physics, CERN, Switzerland, [email protected]

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s). UbiComp/ISWC '16 Adjunct, September 12–16, 2016, Heidelberg, Germany. ACM 978-1-4503-4462-3/16/09. http://dx.doi.org/10.1145/2968219.2968296

Abstract
A number of graph processing platforms have emerged recently as a result of the growing demand for analytics on complex and large-scale graph-structured datasets. These platforms are tailored for iterative graph computations and can offer an order-of-magnitude performance gain over generic data-flow frameworks such as Apache Hadoop and Spark. Nevertheless, the increasing availability of such platforms and their overlapping functionality call for a comparative study of their applications, performance, and energy efficiency. In this work, we focus on the energy efficiency of large-scale graph processing platforms. Specifically, we select two representatives, Apache Giraph and Spark GraphX, for the comparative study. We compare and analyze the energy consumption of these two platforms with the PageRank, Strongly Connected Component, and Single Source Shortest Path algorithms over five different realistic graphs. Our experimental results demonstrate that GraphX outperforms Giraph in terms of energy consumption: Giraph consumes 1.71 times more energy than GraphX on average for the mentioned algorithms.

Author Keywords
Big Data, Energy efficiency, Distributed Computing, Graph Processing, GraphX, Giraph, RAPL, Hadoop, Spark

ACM Classification Keywords
C.4 [Performance of Systems]: Measurement techniques; H.3.4 [Systems and Software]: Performance evaluation (efficiency and effectiveness); G.2.2 [Graph Theory]: Graph algorithms

Contributions
In summary, we make the following major contributions in this work. Firstly, we compare the energy efficiency of Giraph and GraphX against their performance on the PageRank, SCC, and SSSP algorithms through a set of experiments performed on an Intel Haswell-based platform. Secondly, our experimental results demonstrate that the PageRank, SSSP, and SCC algorithms with GraphX are, on average, 2.35, 1.45, and 1.33 times more energy efficient than with Giraph. Thirdly, we observe a few cases where GraphX is unable to process workloads that Giraph can, particularly for the SCC algorithm. Finally, we pinpoint several factors that potentially cause extra energy consumption for Giraph and GraphX.

Introduction

We are currently experiencing a paradigm shift in computing, where more and more processing and computation is moving into the cloud. This shift has resulted in a massive amount of data being generated and stored, and has created excitement about Big Data and processing platforms such as MapReduce [3], Hadoop [6], and Spark [21]. Numerous data-intensive applications have been developed on top of these platforms, and data centers continue to grow in number and size to cope with the demand. For example, Yahoo utilizes more than 35 K servers distributed over 19 data centers for its Big Data applications [4, 13]. The energy cost at this scale is already significant and is projected to increase [1]. Understanding and optimizing the energy consumption of Big Data applications and platforms is therefore an important issue. One such application of Big Data analytics is large-scale graph processing. Graph computations are used, for example, in web-page ranking (at Google) and in online advertising and social network analysis (at Twitter and Facebook) [5, 7]. To handle computationally intensive and complex graph processing applications, the Bulk Synchronous Parallel (BSP) model [20] was introduced. Google's Pregel [19] incorporates BSP and provides simple application programming interfaces (APIs) for graph processing algorithms. After Pregel, numerous other graph processing platforms and frameworks have evolved to address different optimization criteria, including Apache Giraph [5], GraphLab [18], and Apache Spark's GraphX [9]. In this work, we benchmark the energy consumption of a number of large-scale graph processing algorithms on such platforms.

Several existing works have compared these frameworks and platforms from the performance, memory consumption, and message modeling points of view [10, 11, 7]. A few works have also attempted to benchmark graph processing platforms [2]. In this work, we take a different view and compare the platforms from the energy efficiency perspective. Specifically, we select two representatives, i.e., Apache Giraph and Spark's GraphX, for the study. We implement three well-known graph algorithms, namely PageRank [19], Strongly Connected Component (SCC), and Single Source Shortest Path (SSSP), and compare their energy consumption over five different realistic graph datasets. We host these platforms and the algorithms on an Intel Haswell-based machine and measure the energy consumption of the processor package, DRAM, and the PP0 and PP1 power planes. This Intel architecture comes with the Running Average Power Limit (RAPL) interface [12], which provides the readings mentioned above.

Background
In this article, we analyze the energy efficiency of two popular large-scale graph analytics platforms, Giraph and GraphX, using a number of graph algorithms. Before delving into the measurement details, we present in this section a brief overview of these platforms, the algorithms, and the measurement tools.

Giraph & GraphX
Apache Giraph [5] is an open-source implementation of Pregel. It runs on Hadoop and uses the Hadoop Distributed File System (HDFS) for storing input and output. It makes use of checkpoints and failure recovery mechanisms if needed. Giraph is designed for highly scalable graph processing and is used at Facebook for analyzing social graphs. We use Giraph-1.1.0 with Hadoop-0.20.203.0 in our experiments. GraphX, on the other hand, is a distributed graph processing framework built on top of Apache Spark [9]. It achieves faster graph processing than the underlying data-flow framework, Spark, alone. GraphX applies a range of optimizations, such as flexible vertex-cut partitioning and immutability, to achieve low-cost fault tolerance, reduced memory overhead, and improved performance [9]. We use Spark-1.4.1, which includes GraphX, in our experiments.

Graph Algorithms
The PageRank algorithm [19] is widely used by Google to rank web pages based on their importance and popularity. PageRank determines the rank of a web page from the number of incoming links it receives from other web pages. We use PageRank with a damping factor of 0.85, which is the value commonly used in the literature [11]. SSSP calculates the shortest path from a fixed source vertex to all other vertices; we compute the distance as the number of hops from the source vertex to each destination. We choose SSSP because of its dynamic network usage, which helps us determine the system's behavior when the communication volume is variable [11]. An SCC of a directed graph is a subgraph in which each vertex is reachable from every other vertex. We choose SCC because it gives us a communication pattern distinct from PageRank and SSSP: the initial supersteps involve heavier network communication, which decreases gradually as the number of supersteps increases.
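To make the GraphX side concrete, the following is a minimal sketch of running PageRank with a fixed number of iterations and the 0.85 damping factor (resetProb = 0.15). The HDFS path is illustrative, and the sketch uses GraphX's built-in staticPageRank operator rather than our exact benchmark code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

// Minimal PageRank sketch on GraphX; the HDFS path is a placeholder, not our setup.
object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PageRankSketch"))
    // Load a SNAP-style edge list (e.g., web-Google) into a graph.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/web-Google.txt")
    // Fixed number of iterations; resetProb = 0.15 corresponds to a damping factor of 0.85.
    val ranks = graph.staticPageRank(numIter = 25, resetProb = 0.15).vertices
    // Print the ten highest-ranked vertices.
    ranks.sortBy(_._2, ascending = false).take(10).foreach(println)
    sc.stop()
  }
}
```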

Table 1: System Specifications

Processor      Core i7-4770
Architecture   Haswell
Cores          8 (4 + 4)
L3 cache       8 MB
Hyperthreads   Enabled
Memory         16 GB
Frequency      3.4 GHz
Turbo Boost    Enabled

Table 2: Dataset Specifications

Name  File Size  Nodes      Edges
wv    1 MB       7,115      103,689
sd    11 MB      82,168     948,464
wg    72 MB      875,713    5,105,039
cp    268 MB     3,774,768  16,518,948
sl    1 GB       4,847,571  68,993,773

RAPL
Intel has introduced the RAPL interface [12] to limit and monitor energy usage on its recent processor architectures (from Sandy Bridge onwards). RAPL is implemented as Model-Specific Registers (MSRs), which are updated roughly once per millisecond. RAPL provides energy measurements for the processor package, power plane 0 (PP0), power plane 1 (PP1), and DRAM. The processor package includes the processor die containing all the cores, on-chip devices, and other uncore components [12]. PP0 covers the CPU cores, whereas PP1 covers a 'specific device in the uncore', e.g., the on-chip graphics processing unit (GPU). The DRAM plane reports the energy consumption of the dual in-line memory modules (DIMMs) installed in the system. In our experiments, the RAPL counters are read using the MSR driver interface.
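For reference, the sketch below illustrates how a package-energy sample can be obtained through the Linux msr driver. The register addresses (MSR_RAPL_POWER_UNIT at 0x606, MSR_PKG_ENERGY_STATUS at 0x611) follow the Intel manual [12]; reading /dev/cpu/*/msr requires the msr module and root privileges. This is a simplified illustration, not the measurement harness used in our experiments.

```scala
import java.io.RandomAccessFile

// Simplified RAPL reader via the Linux msr driver (assumed register addresses from [12]).
object RaplSketch {
  val PowerUnitMsr = 0x606L   // MSR_RAPL_POWER_UNIT
  val PkgEnergyMsr = 0x611L   // MSR_PKG_ENERGY_STATUS (32-bit wrapping counter)

  // Read one 64-bit MSR of a given CPU; the msr device is indexed by register address.
  def readMsr(cpu: Int, reg: Long): Long = {
    val f = new RandomAccessFile(s"/dev/cpu/$cpu/msr", "r")
    try {
      f.seek(reg)
      java.lang.Long.reverseBytes(f.readLong()) // device bytes are little-endian
    } finally f.close()
  }

  def main(args: Array[String]): Unit = {
    // Energy status unit: bits 12:8; energy is reported in multiples of 1/2^ESU Joules.
    val esu = (readMsr(0, PowerUnitMsr) >> 8) & 0x1f
    val joulesPerTick = 1.0 / (1L << esu)

    val before = readMsr(0, PkgEnergyMsr) & 0xffffffffL
    Thread.sleep(500)                                   // 500 ms sampling interval, as in our setup
    val after = readMsr(0, PkgEnergyMsr) & 0xffffffffL
    val deltaTicks = (after - before) & 0xffffffffL     // handle counter wrap-around
    println(f"Package energy over 500 ms: ${deltaTicks * joulesPerTick}%.3f J")
  }
}
```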

System and Dataset Specifications
Table 1 presents the hardware used in our experiments. We use an Intel Haswell-based workstation, which provides 8 logical cores when hyperthreading is enabled. We make sure that all 8 cores are used while running the workloads, and the Turbo Boost feature is also enabled. We measure the disk power consumption using an external clamp meter (Mastech MS2102 AC/DC).

For our experiments, we choose the five real graph datasets presented in Table 2. These datasets contain directed graphs in edge-list format obtained from the Stanford Network Analysis Project (SNAP) [14]. wiki-Vote [15] is a graph obtained from Wikipedia voting on administrator promotions. The Slashdot0902 [17] dataset was obtained from the Slashdot social network in February 2009. web-Google [17] is a web graph from Google, cit-Patents [16] is a citation network of U.S. patents, and soc-LiveJournal1 [17] is a dataset from the LiveJournal online community, which has more than 10 million users [14]. For brevity, we use wv, sd, wg, cp, and sl to denote wiki-Vote, Slashdot0902, web-Google, cit-Patents, and soc-LiveJournal1, respectively. We choose these datasets because of their significantly different application fields, sizes, and graph properties; they are also popular in the research community. To obtain the same PageRank results, we set the MAX_SUPERSTEP parameter to 26 for Giraph and the number of iterations to 25 for GraphX. For SSSP and SCC we set MAX_SUPERSTEP and the number of iterations to 21 and 20, respectively. Giraph requires one extra superstep compared to GraphX because its implementation may include an extra master compute phase. We collect the RAPL samples at 500 ms intervals.
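To illustrate how the iteration bound enters the computation, the following hedged sketch expresses SSSP (hop counts) with GraphX's Pregel operator and the 20-iteration limit mentioned above. The dataset path and source vertex are placeholders, and the code follows the standard GraphX Pregel pattern rather than our exact implementation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, VertexId}

// SSSP (hop count) via GraphX's Pregel operator; path and source vertex are illustrative.
object SsspSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SsspSketch"))
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/wiki-Vote.txt")
    val sourceId: VertexId = 0L

    // 0 hops at the source, "infinity" everywhere else.
    val init = graph.mapVertices((id, attr) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val sssp = init.pregel(Double.PositiveInfinity, maxIterations = 20)(
      (id, dist, newDist) => math.min(dist, newDist),      // vertex program: keep the shorter distance
      triplet =>                                           // send messages: relax each edge (weight 1 = one hop)
        if (triplet.srcAttr + 1.0 < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + 1.0))
        else Iterator.empty,
      (a, b) => math.min(a, b)                             // merge messages: shortest incoming distance wins
    )
    sssp.vertices.take(10).foreach(println)
    sc.stop()
  }
}
```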

Experimental Results
In this section, we present the experimental results obtained when running the PageRank, SSSP, and SCC algorithms on Giraph and GraphX. We mainly focus on the energy spent in the processor package, DRAM, and disks, as these components consume the bulk of the energy spent in graph processing.


Figure 1: CPU package and DRAM energy consumption (in Joules) of the system over time while computing PageRank in Giraph and GraphX: (a) with the wiki-Vote and Slashdot0902 datasets; (b) with the web-Google and cit-Patents datasets. Note the different scales of the upper and lower subgraphs.

Figure 2: CPU package and DRAM energy consumption (in Joules) of the system over time while computing SSSP in Giraph and GraphX: (a) with the wiki-Vote and Slashdot0902 datasets; (b) with the soc-LiveJournal1 and cit-Patents datasets. Note the different scales of the upper and lower subgraphs.

Energy Consumption for PageRank
Figure 1(a) presents the processor package and DRAM energy consumption over time when we compute PageRank on the wiki-Vote and Slashdot0902 datasets with Giraph and GraphX. From this figure, we observe that the package energy consumption of Giraph fluctuates more than that of GraphX for the same dataset. These fluctuations are caused by delays while the CPU waits for intermediate results to be read from or written to disk in the case of Giraph (since it runs on Hadoop). GraphX, on the other hand, fluctuates less, as it runs on Spark, which uses resilient distributed datasets (RDDs) to cache data and thus avoids storing and fetching intermediate data on disk.

Figure 1(b) presents the package and DRAM energy consumption plots when we run PageRank on the bigger graphs, i.e., web-Google and cit-Patents. For these two bigger graph datasets, we see much larger fluctuations in package energy consumption due to the increased number of disk seeks. As mentioned previously, we measure the disk energy consumption with an external clamp meter while the PageRank job runs. In the case of GraphX, the hard disk remains almost idle with a few peaks in between, whereas for Giraph the disk energy consumption stays close to its peak, with frequent fluctuations. Overall, the energy traces for the different algorithms, datasets, and platforms are quite distinct. For example, PageRank produces two energy peaks with the wiki-Vote dataset, whereas with web-Google the energy consumption remains almost constant throughout the execution. Similar behavior is also observed with Slashdot0902 and cit-Patents.


Energy Consumption for SSSP
In Figure 2, we present the energy consumption of Giraph and GraphX for the wiki-Vote, Slashdot0902, soc-LiveJournal1, and cit-Patents datasets. The observations are similar to those in Figure 1. However, for the largest dataset, soc-LiveJournal1, the energy consumption of GraphX is higher than that of Giraph, in contrast to the other observations. Table 3 presents the energy spent in both cases; GraphX consumes substantially more energy than Giraph (1538 J more). Overall, we observe that the SSSP algorithm behaves differently on Giraph and GraphX for different datasets. This shows that the energy consumption of graph processing platforms depends both on the graph structure and on the algorithm itself.


Figure 3: CPU package and DRAM energy consumption (in Joules) of the system over time while computing SCC in Giraph and GraphX with the wiki-Vote and Slashdot0902 datasets. Note the different scales of the upper and lower subgraphs.

Energy Consumption for SCC
Figure 3 presents the energy plots for the SCC algorithm running on Giraph and GraphX with the wiki-Vote and Slashdot0902 datasets. For both of these graphs, Giraph consumes 1.33 times more energy than GraphX. The difference is smaller than in the previous observations, but Giraph still consumes more energy than GraphX. Interestingly, though, GraphX fails to run for the larger datasets web-Google and cit-Patents, whereas Giraph succeeds on our workstation.

Discussion
We present the processing time and energy spent for the different algorithms in Table 3. From the table, we can see that Giraph is considerably slower than GraphX for the PageRank computation and, consequently, considerably less energy efficient. On average, GraphX is 2.06 times faster and consumes 1.71 times less CPU package energy than Giraph. The performance of the platforms depends on the data size and the algorithm. Giraph, on the contrary, performs close to GraphX with SSSP, and as the data size increases, SSSP requires more energy with GraphX. In the case of SCC, GraphX crashes even with the moderately sized datasets. We also observe, interestingly, that for SSSP both GraphX and Giraph consume less energy on the cit-Patents dataset than on the web-Google dataset, although cit-Patents is almost 4 times bigger than web-Google. Analyzing the graph statistics further, we find that although web-Google is smaller in size, it contains more triangles than cit-Patents. This indicates that graph properties play a crucial role in processing time and energy consumption.


Table 3: Running time (in seconds) and energy consumption (in Joules) of Giraph (GrH) and GraphX (GrX).

PageRank
Name  GrH(s)  GrX(s)  GrH(J)     GrX(J)
wv    24      11      773.11     312.48
sd    33      16      1,060.21   501.40
wg    124     48      3,511.60   1,639.38
cp    490     168     15,549.62  5,796.60

SSSP
Name  GrH(s)  GrX(s)  GrH(J)     GrX(J)
wv    22      6       557.75     222.35
sd    23      8       619.71     292.99
wg    28      27      862.82     867.50
cp    27      22      858.20     845.57
sl    84      78      2,277.07   3,815.33

SCC
Name  GrH(s)  GrX(s)  GrH(J)     GrX(J)
wv    23      14      622.19     432.98
sd    26      19      763.05     624.60
wg    42      -       1,428.47   -
cp    66      -       2,592.63   -

As mentioned earlier, GraphX crashed when we performed the PageRank experiments on the bigger graphs with the default Spark memory parameters. GraphX, or rather Spark, not only crashed but also took considerably longer than Giraph to report the failure. We had to increase the available memory to ensure that GraphX finished the computations. As discussed in [10], Spark may crash on larger data files because its memory usage can grow quickly; we did not experience any such problem with Giraph on Hadoop. Our finding with Spark is that if the iteration count is high and little memory is available, Spark crashes. It should be possible, however, to reduce Spark's memory usage by using checkpoints: saving checkpoints allows Spark to reduce memory usage while remaining fault tolerant, as the sketch below illustrates.
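A minimal sketch of this checkpointing idea follows; the checkpoint directory, dataset, and the simplified iterative update are illustrative only, not the configuration used in our experiments. Checkpointing writes an RDD to stable storage and truncates its lineage, which bounds the memory otherwise spent tracking long lineage chains in iterative jobs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

// Sketch of periodic checkpointing in an iterative Spark/GraphX job; paths are placeholders.
object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CheckpointSketch"))
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // checkpointed RDDs are written here

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/cit-Patents.txt")
    // Start with each vertex's in-degree as its score (placeholder for a real algorithm).
    var scores = graph.aggregateMessages[Double](ctx => ctx.sendToDst(1.0), _ + _)

    for (i <- 1 to 25) {
      scores = scores.mapValues(s => 0.15 + 0.85 * s)    // placeholder iterative update
      if (i % 5 == 0) {
        scores.checkpoint()                              // truncate lineage every few iterations
        scores.count()                                   // force materialization so the checkpoint is saved
      }
    }
    println(scores.count())
    sc.stop()
  }
}
```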

Figure 4: Performance (running time in seconds) vs. package and DRAM energy consumption (in Joules) of Giraph and GraphX for the PageRank computation.

Figure 4 presents an overview of performance vs. energy consumption of Giraph and GraphX for the PageRank computation. It illustrates that Giraph takes more time and consumes more energy than GraphX. GraphX, however, consumes a bit more package energy than Giraph, although it finishes the PageRank computation earlier. One reason for this can be the more frequent use of caches in the case of GraphX, which may result in higher energy consumption. The DRAM energy consumption remains roughly constant for both Giraph and GraphX. We also notice that the energy consumption of GraphX increases as the size of the graph dataset increases. GraphX depends on Spark, a memory-based system developed in Scala, which relies on automatic memory management via garbage collection (GC). As the data volume increases, the large heap size strains the garbage collector, and GC consequently spends a significant amount of processing time and energy [8]. We suspect that GC is responsible for the large energy consumption with the soc-LiveJournal1 data.

Conclusion
In this work, we have shown that GraphX dominates Giraph in energy efficiency, the obvious reason being that GraphX takes advantage of Spark's memory-based RDDs. We have also demonstrated that the energy consumption characteristics of the studied algorithms differ. Unlike other Big Data applications, we find that the performance of graph-based algorithms may vary with the properties of the graph, which is an interesting observation. Our work is the earliest to analyze and compare the energy efficiency of disk-based and memory-based Big Data platforms with graph-based applications. In the future, we plan to investigate further how graph size and graph properties affect the performance and energy consumption of these algorithms.

Acknowledgements
The authors thank Juha Eskonen, a student at Aalto University, for his valuable efforts in setting up the experiments.

REFERENCES
1. Tom Bostoen, Sape Mullender, and Yolande Berbers. 2013. Power-reduction Techniques for Data-center Storage Systems. ACM Comput. Surv. 45, 3, Article 33 (July 2013), 38 pages.
2. Mihai Capotă, Tim Hegeman, Alexandru Iosup, Arnau Prat-Pérez, Orri Erling, and Peter Boncz. 2015. Graphalytics: A Big Data Benchmark for Graph-Processing Platforms. In Proceedings of GRADES'15. ACM, New York, NY, USA, Article 7, 6 pages.
3. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), 107–113.
4. Andy Feng. 2015. Rise of Scalable Machine Learning at Yahoo. http://www.slideshare.net/Hadoop_Summit/surgerise-of-scalable-machine-learning-at-yahoo
5. The Apache Software Foundation. 2016a. Apache Giraph. Retrieved 14th June, 2016 from http://giraph.apache.org/
6. The Apache Software Foundation. 2016b. Apache Hadoop. Retrieved 14th June, 2016 from http://hadoop.apache.org/
7. Yun Gao, Wei Zhou, Jizhong Han, Dan Meng, Zhang Zhang, and Zhiyong Xu. 2015. An Evaluation and Analysis of Graph Processing Frameworks on Five Key Issues. In Proceedings of the 12th ACM CF (CF '15). ACM, New York, NY, USA, Article 11, 8 pages.
8. Ionel Gog, Jana Giceva, Malte Schwarzkopf, Kapil Vaswani, Dimitrios Vytiniotis, Ganesan Ramalingam, Manuel Costa, Derek G. Murray, Steven Hand, and Michael Isard. 2015. Broom: Sweeping Out Garbage Collection from Big Data Systems. In 15th Workshop on HotOS. USENIX Association, Kartause Ittingen, Switzerland.
9. Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX: Graph Processing in a Distributed Dataflow Framework. In Proceedings of the 11th USENIX OSDI (OSDI '14). USENIX Association, Berkeley, CA, USA, 599–613.
10. Lei Gu and Huan Li. 2013. Memory or Time: Performance Evaluation for Iterative Operation on Hadoop and Spark. In IEEE 10th International Conference on HPCC. 721–727.
11. Minyang Han, Khuzaima Daudjee, Khaled Ammar, M. Tamer Özsu, Xingfang Wang, and Tianqi Jin. 2014. An Experimental Comparison of Pregel-like Graph Processing Systems. Proc. VLDB Endow. 7, 12 (Aug. 2014), 1047–1058.
12. Intel. 2014. Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (3A, 3B & 3C): System Programming Guide. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf. Accessed 2016-06-14.
13. Rini T. Kaushik and Milind Bhandarkar. 2010. GreenHDFS: Towards an Energy-conserving, Storage-efficient, Hybrid Hadoop Compute Cluster. In Proceedings of the 2010 HotPower (HotPower '10). USENIX Association, Berkeley, CA, USA, 1–9.
14. Jure Leskovec. 2016. Stanford Network Analysis Project. Retrieved 14th June, 2016 from https://snap.stanford.edu/
15. Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. 2010. Signed Networks in Social Media. In Proceedings of the SIGCHI Conference (CHI '10). ACM, New York, NY, USA, 1361–1370.
16. Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. 2005. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. In Proceedings of the Eleventh ACM SIGKDD (KDD '05). ACM, New York, NY, USA, 177–187.
17. Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. 2008. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. CoRR abs/0810.1355 (2008).
18. Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proc. VLDB Endow. 5, 8 (April 2012), 716–727.
19. Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A System for Large-scale Graph Processing. In Proceedings of the 2010 ACM SIGMOD (SIGMOD '10). ACM, New York, NY, USA, 135–146.
20. Leslie G. Valiant. 1990. A Bridging Model for Parallel Computation. Commun. ACM 33, 8 (Aug. 1990), 103–111.
21. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In Proceedings of the 9th USENIX NSDI (NSDI '12). USENIX Association, Berkeley, CA, USA, 2–2.